---
dg-publish: true
---
4 October 2023 - #CP
## Ex. 2
#### a) Vectorization limitations
With k as the innermost loop:
A -> consecutive elements in a row -> consecutive access in the vector
C -> same element
B -> consecutive elements in a column -> strided access in memory
The loop will not be vectorizable.
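
A minimal sketch of the loop nest this access pattern corresponds to. The real `mmult.c` is not reproduced in these notes, so everything here is an assumption: the i, j, k loop order, row-major `double` matrices stored in flat arrays, and the signature of `base_v` (only the name appears in the results table).

```c
#include <stddef.h>

/* Assumed baseline kernel (i, j, k order) on row-major n x n matrices.
 * Innermost loop over k:
 *   a[i*n + k] -> consecutive elements of a row (stride 1)
 *   c[i*n + j] -> same element on every iteration
 *   b[k*n + j] -> consecutive elements of a column (stride n)
 * The strided access to b is the obstacle identified in the note. */
void base_v(size_t n, const double *a, const double *b, double *c) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

Calling `base_v(2, a, b, c)` with `a = {1, 2, 3, 4}`, `b = {5, 6, 7, 8}` and `c` zeroed leaves `c = {19, 22, 43, 50}`.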
#### b) Enable vectorization
After reordering the loops to i, k, j, the innermost loop (j) accesses:
A -> same element
C -> consecutive elements in a row -> consecutive access in the vector
B -> consecutive elements in a row -> consecutive access in the vector
i k j
0 0 1
The loop will be vectorizable.
Vector register: 128 bits
Element size: 8 B -> 64 bits
2 elements per vector
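
A sketch of what the reordered loop nest might look like. As before, the real source is not in these notes: the flat row-major layout, the signature of `vect`, the hoisted `aik` temporary and the `restrict` qualifiers (added so the compiler can assume the arrays do not overlap) are all assumptions.

```c
#include <stddef.h>

/* Assumed reordered kernel (i, k, j order) on row-major n x n matrices.
 * Innermost loop over j:
 *   a[i*n + k] -> same element (loop-invariant, kept in aik)
 *   c[i*n + j] -> consecutive elements of a row (stride 1)
 *   b[k*n + j] -> consecutive elements of a row (stride 1)
 * With 128-bit registers and 8-byte elements, one vector
 * instruction can work on 2 elements of b and c at a time. */
void vect(size_t n, const double *restrict a, const double *restrict b,
          double *restrict c) {
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            const double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

Built with the flags listed in the commands of part c) (`gcc -O2 -ftree-vectorize -msse4`), adding GCC's `-fopt-info-vec` should make the compiler report which loops it vectorized.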
Without vectorization:
![[Pasted image 20231004115725.png]]
With vectorization:
![[Pasted image 20231004115135.png]]
Estimated #I: (n^3 / 2) * 8
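
The estimate can be evaluated for the measured problem size. In this sketch the factor 8 is read as the number of instructions per vectorized inner-loop iteration and n^3 / 2 as the number of such iterations (2 elements per vector). The note does not label either term, so this reading and the function name are assumptions.

```c
#include <stdint.h>

/* Estimate from the note: (n^3 / 2) * 8.
 * n^3 / 2 : vectorized inner-loop iterations (2 elements per vector)
 * 8       : assumed instructions per iteration */
uint64_t estimated_instructions(uint64_t n) {
    return (n * n * n / 2) * 8;
}
```

For n = 512 this gives 536870912, about 7% below the 578275097 instructions measured for `vect()`.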
#### c) Measure and analyze results
| N | Version | Time (s) | CPI | \#I |
| --- | -------- | ---- | --- | --- |
| 512 | base_v()| 0.492484818 | 0.91 | 1113554887 |
| 512 | vect() | 0.081604350 | 2.88 | 578275097 |
>[!note]- Commands run
>module load gcc/9.3.0
>gcc -O2 -ftree-vectorize -msse4 mmult.c
>srun --partition=cpar perf stat -e cycles,instructions ./a.out
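
A small sketch for comparing the two rows of the table. The values are copied from the measurements; the struct layout, the names, and the reading of the Time column as seconds are assumptions made for illustration.

```c
#include <stdint.h>

/* One row of the results table. */
struct run {
    const char *version;
    double seconds;        /* Time column, assumed to be in seconds */
    double cpi;
    uint64_t instructions; /* #I column */
};

const struct run base_run = {"base_v()", 0.492484818, 0.91, 1113554887ULL};
const struct run vect_run = {"vect()",   0.081604350, 2.88,  578275097ULL};

/* How many times faster `fast` ran than `slow`. */
double speedup(const struct run *slow, const struct run *fast) {
    return slow->seconds / fast->seconds;
}

/* Factor by which the executed instruction count shrinks. */
double instruction_ratio(const struct run *slow, const struct run *fast) {
    return (double)slow->instructions / (double)fast->instructions;
}
```

With these numbers, `speedup(&base_run, &vect_run)` is about 6.0 and `instruction_ratio(&base_run, &vect_run)` is about 1.93, close to the factor of 2 expected from processing 2 elements per vector instruction.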
#### d) Vectorization fine-tuning
Gains of 4x.
## Ex. 3
#### a) Peak Performance
2 FP operations
4 elements at a time in each cycle
2.5 GHz -> 2.5 billion cycles per second
conclusion: 20 GFlop/s
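
The arithmetic behind this figure, written out as the product of the three factors listed above (the symbol $P_{\text{peak}}$ is introduced here only for illustration):

$$
P_{\text{peak}} = 2\ \text{FP ops} \times 4\ \text{elements} \times 2.5 \times 10^{9}\ \tfrac{\text{cycles}}{\text{s}} = 20 \times 10^{9}\ \text{Flop/s} = 20\ \text{GFlop/s}
$$

With 1 element per operation instead of 4, the same product gives $2 \times 1 \times 2.5 = 5$ GFlop/s.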
#### b)
peak with vectorization: 20 GFlop/s
peak without vectorization: 5 GFlop/s
memory bandwidth limitation: 20 GFlop/s
real achievable performance:
measured performance:
#### c)
memory bandwidth limitation
| Flop/Byte | GFlop/s |
| --------- | ------- |
| 0.125 | |
| 0.25 | |
| 0.5 | 10 |
| 1 | 20 |
| 2 | |
| 4 | |
| 8 | |
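
The two filled rows fit the usual roofline model, where attainable performance is the smaller of the compute peak and memory bandwidth times arithmetic intensity. The sketch below reads the column of powers of two as arithmetic intensity (Flop/Byte) and the other column as GFlop/s. The 20 GB/s bandwidth is inferred from those two rows (0.5 -> 10, 1 -> 20) and is not stated in the notes, and the function and parameter names are hypothetical.

```c
/* Roofline model: performance is capped either by the compute peak
 * or by memory bandwidth times arithmetic intensity.
 *   peak_gflops   : compute ceiling in GFlop/s
 *   bw_gbytes     : memory bandwidth in GB/s (assumed 20 here)
 *   flop_per_byte : arithmetic intensity of the kernel */
double roofline_gflops(double peak_gflops, double bw_gbytes,
                       double flop_per_byte) {
    double memory_bound = bw_gbytes * flop_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

Under these assumptions, `roofline_gflops(20.0, 20.0, 0.5)` returns 10 and `roofline_gflops(20.0, 20.0, 1.0)` returns 20, matching the filled rows. Passing 5.0 as the peak gives the ceiling without vectorization.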