---
dg-publish: true
---
25 October 2023 - #CP
---
# Worksheet 6
https://learn.microsoft.com/en-us/cpp/parallel/openmp/reference/openmp-clauses?view=msvc-170
## Ex. 1
Original version, output:
```
T1:i50 w=11
T1:i51 w=12
T1:i52 w=13
T1:i53 w=14
T1:i54 w=15
T1:i55 w=16
T0:i0 w=10
T0:i1 w=18
T0:i2 w=19
T0:i3 w=20
T0:i4 w=21
T0:i5 w=22
T1:i56 w=17
T0:i6 w=23
T1:i57 w=24
T0:i7 w=25
T1:i58 w=26
T0:i8 w=27
T1:i59 w=28
T1:i60 w=30
T1:i61 w=31
T1:i62 w=32
T0:i9 w=29
T1:i63 w=33
T0:i10 w=34
T1:i64 w=35
T0:i11 w=36
T1:i65 w=37
T0:i12 w=38
T0:i13 w=40
T0:i14 w=41
T0:i15 w=42
T1:i66 w=39
T0:i16 w=43
T1:i67 w=44
T1:i68 w=45
T0:i17 w=46
T1:i69 w=47
T1:i70 w=49
T1:i71 w=50
T1:i72 w=51
T0:i18 w=48
T1:i73 w=52
T0:i19 w=53
T1:i74 w=54
T0:i20 w=55
T1:i75 w=56
T0:i21 w=57
T1:i76 w=58
T0:i22 w=59
T1:i77 w=60
T0:i23 w=61
T1:i78 w=62
T0:i24 w=63
T1:i79 w=64
T0:i25 w=65
T1:i80 w=66
T0:i26 w=67
T1:i81 w=68
T0:i27 w=69
T0:i28 w=71
T0:i29 w=72
T0:i30 w=73
T1:i82 w=70
T1:i83 w=75
T1:i84 w=76
T1:i85 w=77
T0:i31 w=74
T1:i86 w=78
T0:i32 w=79
T1:i87 w=80
T0:i33 w=81
T1:i88 w=82
T0:i34 w=83
T1:i89 w=84
T0:i35 w=85
T1:i90 w=86
T0:i36 w=87
T1:i91 w=88
T0:i37 w=89
T1:i92 w=90
T0:i38 w=91
T1:i93 w=92
T0:i39 w=93
T1:i94 w=94
T0:i40 w=95
T0:i41 w=96
T0:i42 w=97
T0:i43 w=98
T0:i44 w=99
T0:i45 w=100
T0:i46 w=101
T0:i47 w=102
T0:i48 w=103
T0:i49 w=104
T1:i95 w=105
T1:i96 w=106
T1:i97 w=107
T1:i98 w=108
T1:i99 w=109
w=110
```
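As the `for` loop runs, its iterations are split between threads 0 and 1 (in this run T0 handles i0 to i49 and T1 handles i50 to i99), and the order in which their prints interleave is not deterministic. `w` is shared by both threads, so every iteration increments the same variable, which is why the printed values run from 10 to 109 across the two threads. At the end w=110. Note that an unsynchronized `w++` on a shared variable is a data race: this run happened not to lose any increment, but that is not guaranteed.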
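The exercise source is not reproduced in these notes, so the following is only a sketch. It assumes the loop body is a plain `w++` on a shared `w` that starts at 10, and it shows one possible way to make that increment race-free with `atomic capture`. The function name `count_shared` and the serial fallback for `omp_get_thread_num` are illustrative assumptions, not part of the worksheet.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch also builds without OpenMP support. */
static int omp_get_thread_num(void) { return 0; }
#endif

/* Hypothetical reconstruction of the exercise loop: 100 increments of a shared w. */
int count_shared(void) {
    int w = 10; /* declared outside the construct, so shared by all threads */

    #pragma omp parallel for
    for (int i = 0; i < 100; i++) {
        int seen; /* declared inside the loop body, so private to each thread */

        /* Read the old value and increment w as one indivisible step. */
        #pragma omp atomic capture
        seen = w++;

        printf("T%d:i%d w=%d\n", omp_get_thread_num(), i, seen);
    }
    return w; /* 10 + 100 increments */
}
```

With the atomic update the function returns 110 on every run; only the order of the printed lines changes between runs.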
## 1.1 Version with private(w)
```
T1:i50 w=0
T1:i51 w=1
T1:i52 w=2
T1:i53 w=3
T1:i54 w=4
T1:i55 w=5
T0:i0 w=0
T0:i1 w=1
T0:i2 w=2
T0:i3 w=3
T0:i4 w=4
T1:i56 w=6
T1:i57 w=7
T1:i58 w=8
T1:i59 w=9
T1:i60 w=10
T0:i5 w=5
T0:i6 w=6
T0:i7 w=7
T0:i8 w=8
T1:i61 w=11
T0:i9 w=9
T1:i62 w=12
T0:i10 w=10
T1:i63 w=13
T0:i11 w=11
T1:i64 w=14
T0:i12 w=12
T1:i65 w=15
T0:i13 w=13
T1:i66 w=16
T0:i14 w=14
T1:i67 w=17
T0:i15 w=15
T1:i68 w=18
T0:i16 w=16
T1:i69 w=19
T0:i17 w=17
T1:i70 w=20
T0:i18 w=18
T1:i71 w=21
T0:i19 w=19
T1:i72 w=22
T0:i20 w=20
T1:i73 w=23
T1:i74 w=24
T1:i75 w=25
T1:i76 w=26
T1:i77 w=27
T0:i21 w=21
T1:i78 w=28
T0:i22 w=22
T1:i79 w=29
T0:i23 w=23
T1:i80 w=30
T0:i24 w=24
T1:i81 w=31
T0:i25 w=25
T0:i26 w=26
T0:i27 w=27
T0:i28 w=28
T1:i82 w=32
T0:i29 w=29
T1:i83 w=33
T1:i84 w=34
T0:i30 w=30
T1:i85 w=35
T0:i31 w=31
T1:i86 w=36
T0:i32 w=32
T1:i87 w=37
T0:i33 w=33
T1:i88 w=38
T1:i89 w=39
T1:i90 w=40
T1:i91 w=41
T0:i34 w=34
T1:i92 w=42
T0:i35 w=35
T1:i93 w=43
T0:i36 w=36
T1:i94 w=44
T0:i37 w=37
T1:i95 w=45
T0:i38 w=38
T1:i96 w=46
T0:i39 w=39
T1:i97 w=47
T0:i40 w=40
T1:i98 w=48
T0:i41 w=41
T1:i99 w=49
T0:i42 w=42
T0:i43 w=43
T0:i44 w=44
T0:i45 w=45
T0:i46 w=46
T0:i47 w=47
T0:i48 w=48
T0:i49 w=49
w=10
```
As the `for` loop runs, threads 0 and 1 print in a nondeterministic order, but each thread now works on its own private copy of w. A private copy is not initialized from the original w; in this run both copies happened to start at 0, but that value is not guaranteed. After the parallel region the original w still holds the value it had before (10), because the threads only modified their own copies.
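Because a `private` copy starts uninitialized, the usual pattern is to assign it before reading it. A minimal sketch of that pattern, using `w` as a per-thread scratch variable (the function `fill_with_private` and the formula it computes are hypothetical, chosen only for illustration):

```c
/* Fills out[i] = 2*i + 1, using w as a per-thread scratch variable. */
void fill_with_private(int *out, int n) {
    int w = 10; /* original variable; the private copies are separate objects */

    #pragma omp parallel for private(w)
    for (int i = 0; i < n; i++) {
        w = 2 * i;      /* assign first: a private copy has no defined initial value */
        out[i] = w + 1; /* only then read it */
    }
    /* When built with OpenMP, the original w is still 10 at this point. */
}
```

Every iteration writes `w` before reading it, so the result does not depend on what the private copies happened to contain at the start.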
## 1.2 Version with firstprivate(w)
```
T1:i50 w=10
T1:i51 w=11
T1:i52 w=12
T1:i53 w=13
T1:i54 w=14
T1:i55 w=15
T0:i0 w=10
T0:i1 w=11
T0:i2 w=12
T0:i3 w=13
T1:i56 w=16
T0:i4 w=14
T1:i57 w=17
T0:i5 w=15
T1:i58 w=18
T0:i6 w=16
T1:i59 w=19
T0:i7 w=17
T1:i60 w=20
T0:i8 w=18
T1:i61 w=21
T0:i9 w=19
T1:i62 w=22
T0:i10 w=20
T1:i63 w=23
T0:i11 w=21
T1:i64 w=24
T0:i12 w=22
T1:i65 w=25
T0:i13 w=23
T1:i66 w=26
T0:i14 w=24
T1:i67 w=27
T0:i15 w=25
T1:i68 w=28
T0:i16 w=26
T1:i69 w=29
T0:i17 w=27
T1:i70 w=30
T0:i18 w=28
T1:i71 w=31
T0:i19 w=29
T1:i72 w=32
T0:i20 w=30
T1:i73 w=33
T0:i21 w=31
T1:i74 w=34
T0:i22 w=32
T1:i75 w=35
T0:i23 w=33
T1:i76 w=36
T0:i24 w=34
T1:i77 w=37
T0:i25 w=35
T1:i78 w=38
T0:i26 w=36
T1:i79 w=39
T0:i27 w=37
T1:i80 w=40
T0:i28 w=38
T1:i81 w=41
T0:i29 w=39
T1:i82 w=42
T0:i30 w=40
T1:i83 w=43
T0:i31 w=41
T1:i84 w=44
T0:i32 w=42
T1:i85 w=45
T0:i33 w=43
T1:i86 w=46
T0:i34 w=44
T1:i87 w=47
T0:i35 w=45
T1:i88 w=48
T0:i36 w=46
T1:i89 w=49
T0:i37 w=47
T1:i90 w=50
T0:i38 w=48
T1:i91 w=51
T0:i39 w=49
T1:i92 w=52
T0:i40 w=50
T1:i93 w=53
T0:i41 w=51
T1:i94 w=54
T0:i42 w=52
T1:i95 w=55
T0:i43 w=53
T1:i96 w=56
T0:i44 w=54
T1:i97 w=57
T0:i45 w=55
T1:i98 w=58
T0:i46 w=56
T1:i99 w=59
T0:i47 w=57
T0:i48 w=58
T0:i49 w=59
w=10
```
As the `for` loop runs, threads 0 and 1 print in a nondeterministic order, but each thread's copy of w is initialized with the value the original w had before the parallel region (10). From then on each thread works on its own copy. After the parallel region the original w still holds 10, since only the private copies were modified.
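A small sketch of `firstprivate`, assuming we only need each thread to start from the value `w` had before the region. The function `offset_with_firstprivate` is a hypothetical example, not the worksheet code.

```c
/* Fills out[i] = w + i, where every thread's copy of w starts at 10. */
int offset_with_firstprivate(int *out, int n) {
    int w = 10;

    #pragma omp parallel for firstprivate(w)
    for (int i = 0; i < n; i++) {
        out[i] = w + i; /* reads the thread's copy, initialized from the original 10 */
    }
    return w; /* the original w is untouched: still 10 */
}
```

Unlike the worksheet version, this sketch never writes to `w` inside the loop, so its output does not depend on how the iterations are distributed between threads.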
## 1.3 Version with lastprivate(w)
```
T1:i50 w=0
T1:i51 w=1
T1:i52 w=2
T1:i53 w=3
T1:i54 w=4
T1:i55 w=5
T1:i56 w=6
T0:i0 w=0
T0:i1 w=1
T0:i2 w=2
T0:i3 w=3
T1:i57 w=7
T0:i4 w=4
T1:i58 w=8
T1:i59 w=9
T1:i60 w=10
T1:i61 w=11
T0:i5 w=5
T1:i62 w=12
T0:i6 w=6
T1:i63 w=13
T0:i7 w=7
T1:i64 w=14
T0:i8 w=8
T1:i65 w=15
T0:i9 w=9
T1:i66 w=16
T0:i10 w=10
T1:i67 w=17
T0:i11 w=11
T1:i68 w=18
T0:i12 w=12
T1:i69 w=19
T0:i13 w=13
T1:i70 w=20
T0:i14 w=14
T1:i71 w=21
T0:i15 w=15
T1:i72 w=22
T0:i16 w=16
T1:i73 w=23
T0:i17 w=17
T1:i74 w=24
T0:i18 w=18
T1:i75 w=25
T0:i19 w=19
T1:i76 w=26
T0:i20 w=20
T1:i77 w=27
T0:i21 w=21
T1:i78 w=28
T0:i22 w=22
T1:i79 w=29
T0:i23 w=23
T1:i80 w=30
T0:i24 w=24
T0:i25 w=25
T0:i26 w=26
T0:i27 w=27
T1:i81 w=31
T0:i28 w=28
T1:i82 w=32
T0:i29 w=29
T1:i83 w=33
T1:i84 w=34
T1:i85 w=35
T1:i86 w=36
T0:i30 w=30
T1:i87 w=37
T0:i31 w=31
T1:i88 w=38
T0:i32 w=32
T1:i89 w=39
T0:i33 w=33
T1:i90 w=40
T0:i34 w=34
T1:i91 w=41
T0:i35 w=35
T1:i92 w=42
T0:i36 w=36
T1:i93 w=43
T0:i37 w=37
T1:i94 w=44
T0:i38 w=38
T1:i95 w=45
T0:i39 w=39
T1:i96 w=46
T0:i40 w=40
T1:i97 w=47
T0:i41 w=41
T1:i98 w=48
T0:i42 w=42
T1:i99 w=49
T0:i43 w=43
T0:i44 w=44
T0:i45 w=45
T0:i46 w=46
T0:i47 w=47
T0:i48 w=48
T0:i49 w=49
w=50
```
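As the `for` loop runs, threads 0 and 1 print in a nondeterministic order. As with `private`, each thread's copy of w is not initialized from the original w (the zeros seen here are not guaranteed). Each thread increments its own copy, and after the loop the original w receives the value from the thread that executed the sequentially last iteration (i99), not from whichever thread happened to finish last. Here that is T1, whose copy went from 49 to 50, hence w=50. In a `sections` construct, the value comes from the lexically last `section`.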
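The "sequentially last iteration" rule can be illustrated with a sketch in which every iteration assigns `w`. The function `last_value_with_lastprivate` is hypothetical and assumes `n >= 1`.

```c
/* Returns the value assigned to w by the sequentially last iteration (i = n - 1). */
int last_value_with_lastprivate(int n) {
    int w = 10;

    #pragma omp parallel for lastprivate(w)
    for (int i = 0; i < n; i++) {
        w = i * i; /* each iteration assigns the thread's private copy */
    }
    /* The copy from iteration n - 1 is written back to the original w. */
    return w;
}
```

Whichever thread finishes last, the value copied back is the one produced by iteration `n - 1`, so the result matches what a sequential run of the loop would leave in `w`.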
## 1.4 Version with reduction(+:w)
```
T1:i50 w=0
T1:i51 w=1
T1:i52 w=2
T1:i53 w=3
T1:i54 w=4
T1:i55 w=5
T1:i56 w=6
T1:i57 w=7
T0:i0 w=0
T0:i1 w=1
T0:i2 w=2
T1:i58 w=8
T0:i3 w=3
T1:i59 w=9
T0:i4 w=4
T1:i60 w=10
T1:i61 w=11
T1:i62 w=12
T1:i63 w=13
T0:i5 w=5
T1:i64 w=14
T0:i6 w=6
T1:i65 w=15
T0:i7 w=7
T1:i66 w=16
T0:i8 w=8
T0:i9 w=9
T0:i10 w=10
T0:i11 w=11
T1:i67 w=17
T0:i12 w=12
T1:i68 w=18
T0:i13 w=13
T1:i69 w=19
T0:i14 w=14
T1:i70 w=20
T1:i71 w=21
T1:i72 w=22
T1:i73 w=23
T0:i15 w=15
T1:i74 w=24
T0:i16 w=16
T1:i75 w=25
T0:i17 w=17
T1:i76 w=26
T0:i18 w=18
T0:i19 w=19
T0:i20 w=20
T0:i21 w=21
T1:i77 w=27
T0:i22 w=22
T1:i78 w=28
T0:i23 w=23
T1:i79 w=29
T0:i24 w=24
T1:i80 w=30
T1:i81 w=31
T1:i82 w=32
T1:i83 w=33
T1:i84 w=34
T0:i25 w=25
T1:i85 w=35
T1:i86 w=36
T1:i87 w=37
T1:i88 w=38
T0:i26 w=26
T1:i89 w=39
T0:i27 w=27
T1:i90 w=40
T0:i28 w=28
T0:i29 w=29
T0:i30 w=30
T0:i31 w=31
T1:i91 w=41
T0:i32 w=32
T1:i92 w=42
T0:i33 w=33
T1:i93 w=43
T0:i34 w=34
T1:i94 w=44
T1:i95 w=45
T1:i96 w=46
T1:i97 w=47
T0:i35 w=35
T1:i98 w=48
T0:i36 w=36
T1:i99 w=49
T0:i37 w=37
T0:i38 w=38
T0:i39 w=39
T0:i40 w=40
T0:i41 w=41
T0:i42 w=42
T0:i43 w=43
T0:i44 w=44
T0:i45 w=45
T0:i46 w=46
T0:i47 w=47
T0:i48 w=48
T0:i49 w=49
w=110
```
As the `for` loop runs, threads 0 and 1 print in a nondeterministic order, and each thread's private copy of w is initialized to 0, the identity value of `+`, regardless of the original w. The private copies are incremented independently (50 times each here), and at the end of the loop they are added to the original value of w: 10 + 50 + 50 = 110. Matching the result of the original version is therefore not a coincidence: both count 100 increments on top of 10, but the reduction does so without a data race.
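A minimal sketch of the same counting loop with a reduction, assuming the loop body is just `w++` (the function name `count_with_reduction` is hypothetical):

```c
/* Counts n increments on top of an initial w of 10 using a sum reduction. */
int count_with_reduction(int n) {
    int w = 10;

    #pragma omp parallel for reduction(+:w)
    for (int i = 0; i < n; i++) {
        w++; /* increments the thread's private copy, which starts at 0 */
    }
    /* The private copies have been added to the original value of w. */
    return w; /* 10 + n */
}
```

For `n = 100` this returns 110 however the iterations are split between threads, since the per-thread counts always add up to `n`.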
___
## Ex. 2
Original version:
```
Dot is 1.2020569031095949
```
### a+b)
Run \#1 (--cpus-per-task=2):
```
Dot is 1.2020569029595982
```
Run \#n (--cpus-per-task=2):
```
Dot is 0.0000000001499965
```
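With the same number of CPUs per task (cores), the result can still change from run to run: `dot` appears to be shared and updated without synchronization, so the outcome depends on how the threads' updates interleave. Here the two results add up to approximately the sequential value (1.2020569029595982 + 0.0000000001499965 ≈ 1.2020569031095949), which suggests that each run kept only one thread's partial sum.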
Run \#2 (--cpus-per-task=4):
```
Dot is 0.0000000005999720
```
Run \#3 (--cpus-per-task=6):
```
Dot is 0.0000000013500135
```
Run \#4 (--cpus-per-task=8):
```
Dot is 0.0000000000719980
```
The result also varies with the number of CPUs per task (cores): the thread count determines how the iterations, and therefore the `dot += a[i]*b[i];` updates, are divided between threads, so a different set of partial sums survives the unsynchronized updates to the shared `dot`.
### c)
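We can change the code so that the update to `dot` is synchronized, for example by placing it in a `critical` section.
The result is then stable from run to run, apart from possible floating-point rounding differences caused by the order of the additions.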
### d)
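We can use a _reduction_ clause to do that. It is more efficient than `critical` because each thread accumulates into its own private copy of `dot` and the copies are combined only once at the end, instead of serializing every update.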
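The two approaches can be sketched side by side. The worksheet code is not reproduced in these notes, so the function names `dot_critical` and `dot_reduction` and their signatures are assumptions made for illustration. Only the statement `dot += a[i]*b[i];` comes from the exercise.

```c
/* Dot product with every update to the shared dot protected by a critical section. */
double dot_critical(const double *a, const double *b, int n) {
    double dot = 0.0;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        /* Only one thread at a time may execute the update. */
        #pragma omp critical
        dot += a[i] * b[i];
    }
    return dot;
}

/* Dot product where each thread sums privately and the partial sums are combined once. */
double dot_reduction(const double *a, const double *b, int n) {
    double dot = 0.0;

    #pragma omp parallel for reduction(+:dot)
    for (int i = 0; i < n; i++) {
        dot += a[i] * b[i]; /* updates the thread's private copy of dot */
    }
    return dot;
}
```

Both versions avoid the data race. The `critical` version takes a lock on every iteration, while the `reduction` version synchronizes only when the per-thread partial sums are combined. With real floating-point data the two may differ in the last digits, because the additions happen in a different order.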