---
dg-publish: true
---

4 October 2023 - #CP

## Ex. 2

#### a) Vector limitations

A -> consecutive elements in a row -> consecutive accesses in the vector

C -> same element

B -> consecutive elements in a column -> accesses that are not consecutive in memory

The loop will not be vectorizable as written: the accesses to B are not consecutive in memory.
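
The access pattern described above can be sketched in C. This is a hypothetical baseline, not the `mmult.c` from the exercise: the name `mmult_ijk`, the flat row-major layout, and the i, j, k loop order are assumptions inferred from these notes.

```c
#include <stddef.h>

/* Hypothetical baseline kernel: C += A * B for n x n row-major matrices.
   The i, j, k loop order is an assumption, not taken from mmult.c. */
void mmult_ijk(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                /* a[i*n + k]: stride 1 (walks along a row)
                   b[k*n + j]: stride n (walks down a column)
                   c[i*n + j]: same element while k varies */
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

The stride-n read of `b` in the innermost loop is the access that prevents consecutive vector loads.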

#### b) Enable vectorization

Result of changing the loop order to i, k, j:

A -> same element

C -> consecutive elements in a row -> consecutive accesses in the vector

B -> consecutive elements in a row -> consecutive accesses in the vector

| i | k | j |
| --- | --- | --- |
| 0 | 0 | 1 |

The loop will be vectorizable.
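
A minimal sketch of the reordered loop nest, assuming flat row-major `double` matrices. The function name `mmult_ikj` and the local `aik` are hypothetical and not taken from the exercise's `mmult.c`.

```c
#include <stddef.h>

/* Hypothetical reordered kernel: C += A * B with loop order i, k, j. */
void mmult_ikj(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            const double aik = a[i * n + k];  /* same element for every j */
            for (size_t j = 0; j < n; j++)
                /* b[k*n + j] and c[i*n + j] are both stride 1 in j */
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

Whether the compiler actually vectorizes the inner loop depends on the flags and on its aliasing analysis; with GCC, `-fopt-info-vec` can be used to check which loops were vectorized.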

Vector register: 128 bits

Element size: 8 B -> 64 bits

128 / 64 = 2 elements per vector
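
The same arithmetic as a small helper, in case other register widths need to be checked. The name `lanes_per_vector` is hypothetical.

```c
#include <limits.h>
#include <stddef.h>

/* Number of elements of elem_bytes bytes that fit in a vector register
   of vector_bits bits. */
size_t lanes_per_vector(size_t vector_bits, size_t elem_bytes)
{
    return vector_bits / (elem_bytes * CHAR_BIT);
}
```

For example, `lanes_per_vector(128, sizeof(double))` gives 2 on a platform where `double` is 8 bytes.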

Without vectorization:

![[Pasted image 20231004115725.png]]

With vectorization:

![[Pasted image 20231004115135.png]]

Estimated: (n^3 / 2) * 8
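
The estimate can be evaluated for a concrete N. This sketch assumes the factor 8 is the number of instructions per vectorized inner-loop iteration and that the result is an instruction count; neither is stated explicitly in these notes. The name `estimate_instructions` is hypothetical.

```c
#include <stdint.h>

/* (n^3 / lanes) inner-loop iterations, each costing instr_per_iter
   instructions. Reading the factor 8 this way is an assumption. */
uint64_t estimate_instructions(uint64_t n, uint64_t lanes, uint64_t instr_per_iter)
{
    return (n * n * n / lanes) * instr_per_iter;
}
```

For N = 512, `estimate_instructions(512, 2, 8)` gives 536870912, about 7% below the 578275097 instructions measured for `vect()` in c).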

#### c) Measure and analyze results

| N | Version | Time (s) | IPC | \#I |
| --- | -------- | ---- | --- | --- |
| 512 | base_v() | 0.492484818 | 0.91 | 1113554887 |
| 512 | vect() | 0.081604350 | 2.88 | 578275097 |
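
The two rows can be compared with a couple of ratios. A minimal sketch; `struct run`, `speedup`, and `instruction_ratio` are hypothetical names, and the numbers are meant to be copied from the table.

```c
#include <stdint.h>

/* One measured run: elapsed time and instruction count. */
struct run {
    double   seconds;
    uint64_t instructions;
};

/* How many times faster the vectorized run is than the baseline. */
double speedup(struct run base, struct run vect)
{
    return base.seconds / vect.seconds;
}

/* How many times more instructions the baseline executes. */
double instruction_ratio(struct run base, struct run vect)
{
    return (double)base.instructions / (double)vect.instructions;
}
```

With the N = 512 rows, the time ratio is about 6.0 and the instruction ratio about 1.9, so the instruction count roughly halves (2 elements per vector) while the run time improves by more than that.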

>[!note]- Commands run
>module load gcc/9.3.0
>gcc -O2 -ftree-vectorize -msse4 mmult.c
>srun --partition=cpar perf stat -e cycles,instructions ./a.out

#### d) Vectorization fine-tuning

Gains of 4x.

## Ex. 3

#### a) Peak Performance

2 FP operations

4 elements at a time in each cycle

2.5 GHz -> 2.5 billion cycles per second

Conclusion: 2 x 4 x 2.5 GHz = 20 GFlop/s
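
The peak is the product of the three figures above. A minimal sketch; the name `peak_gflops` and its parameter names are hypothetical.

```c
/* Peak = FP operations per element * elements per cycle * clock in GHz.
   The result is in GFlop/s. */
double peak_gflops(double fp_ops, double elems_per_cycle, double clock_ghz)
{
    return fp_ops * elems_per_cycle * clock_ghz;
}
```

`peak_gflops(2, 4, 2.5)` gives 20, and the scalar case `peak_gflops(2, 1, 2.5)` gives 5.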

#### b)

peak with vectorization: constant 20 GFlop/s

peak without vectorization: constant 5 GFlop/s

memory bandwidth limitation: ***see item d)***

real achievable performance: ***see item c)***

measured performance:

#### d)

memory bandwidth limitation

| Flop/Byte | GFlop/s |
| --------- | ------- |
| 0.125 | 2.5 |
| 0.25 | 5 |
| 0.5 | 10 |
| 1 | 20 |
| 2 | 40 |
| 4 | 80 |
| 8 | 160 |
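
The pairs in the table are consistent with a memory bandwidth of 20 GB/s, since each GFlop/s value is 20 times the corresponding Flop/Byte value. That bandwidth is inferred from the table, not stated in these notes. Under that assumption, the memory-bound line and the full roofline can be sketched as follows (function names are hypothetical):

```c
/* Memory-bound ceiling: bandwidth (GB/s) * operational intensity (Flop/Byte). */
double memory_bound_gflops(double bandwidth_gbs, double flop_per_byte)
{
    return bandwidth_gbs * flop_per_byte;
}

/* Roofline: the lower of the compute peak and the memory-bound ceiling. */
double roofline_gflops(double peak, double bandwidth_gbs, double flop_per_byte)
{
    double mem = memory_bound_gflops(bandwidth_gbs, flop_per_byte);
    return mem < peak ? mem : peak;
}
```

The table lists only the memory-bound line; with the 20 GFlop/s vector peak from a), `roofline_gflops` stops growing once the intensity reaches 1 Flop/Byte.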

#### c)

2 FLOP (floating-point operations) -> 2 doubles (16 B)

1 operation / 8 B -> 0.125 Flop/Byte

| Flop/Byte | GFlop/s |
| --------- | ------- |
| 0.125 | 2.5 |
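
The intensity is the number of floating-point operations divided by the bytes moved. A minimal sketch, assuming the 2 operations touch 2 `double` values as noted above; the name `operational_intensity` is hypothetical.

```c
#include <stddef.h>

/* Operational intensity in Flop/Byte. */
double operational_intensity(double flops, size_t bytes)
{
    return flops / (double)bytes;
}
```

`operational_intensity(2, 2 * sizeof(double))` gives 0.125 when `double` is 8 bytes, which is the first row of the table.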

#### d)

AVX -> 256 bits -> 4 doubles