---
dg-publish: true
---
4 October 2023 - #CP

## Ex. 2

#### a) Vectorization limitations

A -> consecutive elements in a row -> consecutive access in the vector
C -> same element
B -> consecutive elements in a column

Not vectorizable.

#### b) Enable vectorization

Result of changing the loop order to i, k, j:

A -> same element
C -> consecutive elements in a row -> consecutive access in the vector
B -> consecutive elements in a row -> consecutive access in the vector

| i | k | j |
| --- | --- | --- |
| 0 | 0 | 1 |

Vectorizable.

128 b vector; 8 B -> 64 b per element; 2 elements per vector

Without vectorization:
![[Pasted image 20231004115725.png]]

With vectorization:
![[Pasted image 20231004115135.png]]

Estimated: (n^3 / 2) * 8

#### c) Measure and analyze results

| N | Version | Time (s) | CPI | \#I |
| --- | -------- | ---- | --- | --- |
| 512 | base_v() | 0.492484818 | 0.91 | 1113554887 |
| 512 | vect() | 0.081604350 | 2.88 | 578275097 |

>[!note]- Commands run
>module load gcc/9.3.0
>gcc -O2 -ftree-vectorize -msse4 mmult.c
>srun --partition=cpar perf stat -e cycles,instructions ./a.out

#### d) Vectorization fine-tuning

Gains of 4x.

----

## Ex. 3

#### a) Peak Performance

2 FP operations
4 elements at a time in each cycle
2.5 GHz -> 2.5 billion cycles per second

Conclusion: 20 GFlop/s

>[!note]- Redoing the math
>AVX -> 256 b -> 4 doubles
>machine is superscalar with 2 FOP units
>4 x 2 = 8 double operations per cycle
>
>freq = 2.5 GHz
>8 x 2.5 = 20 GFlop/s
>   ^ CPU limitation

#### b)

peak with vectorization: continuous 20 GFlop/s
peak without vectorization: continuous 5 GFlop/s
memory bandwidth limitation: ***see part d)***
real achievable performance: ***see part c)***
measured performance:

#### d) Memory bandwidth limitation

1 FOP -> 2 B

| Flop/Byte | GFlop/s |
| --------- | ------- |
| 0.125 | 2.5 |
| 0.25 | 5 |
| 0.5 | 10 |
| 1 | 20 |
| 2 | 40 |
| 4 | 80 |
| 8 | 160 |

#### c)

2 FOP (floating-point operations) -> 2 doubles (16 B)
1 operation / 8 B -> 0.125 Flop/Byte

| GFlop/s | Flop/Byte |
| ------- | --------- |
| 2.5 | 0.125 |
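
The arithmetic in Ex. 3 can be sketched as two small C helpers: one for the compute peak from a) (FP units x vector lanes x clock) and one for the roofline bound, which takes the smaller of the compute peak and bandwidth x operational intensity. The function names and the `ASSUMED_BANDWIDTH_GBS` constant are hypothetical. The notes never state the memory bandwidth, so 20 GB/s is only an assumption, inferred from the pair 2.5 GFlop/s at 0.125 Flop/Byte in part c).

```c
#include <assert.h>

/* ASSUMPTION: not stated in the notes; inferred from 2.5 GFlop/s at
 * 0.125 Flop/Byte (2.5 / 0.125 = 20 GB/s). */
#define ASSUMED_BANDWIDTH_GBS 20.0

/* Compute peak in GFlop/s: FP units x vector lanes x clock in GHz. */
double peak_gflops(int fp_units, int lanes, double freq_ghz)
{
    assert(fp_units > 0 && lanes > 0 && freq_ghz > 0.0);
    return fp_units * lanes * freq_ghz;
}

/* Roofline bound in GFlop/s: the lower of the compute peak and the
 * memory-bound rate (bandwidth in GB/s x intensity in Flop/Byte). */
double attainable_gflops(double flop_per_byte, double peak,
                         double bandwidth_gbs)
{
    assert(flop_per_byte >= 0.0 && bandwidth_gbs >= 0.0);
    double memory_bound = bandwidth_gbs * flop_per_byte;
    return memory_bound < peak ? memory_bound : peak;
}
```

With the figures from the notes, `peak_gflops(2, 4, 2.5)` gives the 20 GFlop/s vectorized peak and `peak_gflops(2, 1, 2.5)` the 5 GFlop/s scalar peak. Under the assumed bandwidth, `attainable_gflops(0.125, 20.0, ASSUMED_BANDWIDTH_GBS)` gives 2.5 GFlop/s, and any intensity above 1 Flop/Byte is capped at the 20 GFlop/s compute peak.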