PL - Class 4



4 October 2023 - #CP

Ex. 2

a) Vectorization limitations

In the original loop order (innermost loop over k):

- A -> consecutive elements in a row -> consecutive access in the vector
- C -> same element
- B -> consecutive elements in a column

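This version will not be vectorizable: B is read down a column, so consecutive inner-loop iterations do not touch consecutive memory.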
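The access pattern described above can be sketched in C. This is only a minimal sketch, assuming row-major n x n `double` matrices stored in flat arrays; the name `mmult_ijk` is hypothetical and is not taken from the course's `mmult.c`.

```c
#include <stddef.h>

/* Hypothetical baseline kernel: C += A * B with the innermost loop over k.
 * Assumes row-major n x n matrices stored as flat arrays. */
void mmult_ijk(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            for (size_t k = 0; k < n; k++) {
                /* As k advances:
                 *   A[i*n + k] moves by 1 element (along a row)
                 *   C[i*n + j] stays on the same element
                 *   B[k*n + j] moves by n elements (down a column) */
                C[i * n + j] += A[i * n + k] * B[k * n + j];
            }
        }
    }
}
```

The stride-n access to `B` is the part that prevents the inner loop from loading consecutive elements into one vector register.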

b) Enable vectorization

Result of changing the loop order to i, k, j:

- A -> same element
- C -> consecutive elements in a row -> consecutive access in the vector
- B -> consecutive elements in a row -> consecutive access in the vector

| i | k | j |
| --- | --- | --- |
| 0 | 0 | 1 |

This version will be vectorizable.
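The reordered loop nest can be sketched the same way. Again this is an illustration under assumptions (row-major flat arrays, hypothetical name `mmult_ikj`), not the actual code from `mmult.c`.

```c
#include <stddef.h>

/* Hypothetical reordered kernel: C += A * B with loop order i, k, j.
 * Assumes row-major n x n matrices stored as flat arrays. */
void mmult_ikj(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            /* A[i*n + k] is the same element for the whole inner loop. */
            const double a = A[i * n + k];
            for (size_t j = 0; j < n; j++) {
                /* As j advances, both C[i*n + j] and B[k*n + j]
                 * move by 1 element along a row. */
                C[i * n + j] += a * B[k * n + j];
            }
        }
    }
}
```

Whether the compiler actually vectorizes the inner loop also depends on it being able to handle possible overlap between `C` and `B`; qualifying the pointers with `restrict` may help with that.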

Vector registers are 128 bits wide and each element is 8 B (64 bits), so each vector holds 2 elements.

Without vectorization: !Pasted image 20231004115725.png

With vectorization: !Pasted image 20231004115135.png

Estimated #I: (n^3 / 2) * 8
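The estimate can be evaluated for a concrete size. The sketch below assumes that the division by 2 reflects the two doubles per 128-bit vector and that the factor 8 is the number of instructions per vectorized inner-loop iteration; both readings are assumptions, and the function name is hypothetical.

```c
#include <stdint.h>

/* Estimated instruction count for a vectorized triple loop.
 * Assumptions (interpretation of the note, not compiler output):
 *   elems_per_vec  = elements processed per vector iteration
 *   instr_per_iter = instructions executed per vector iteration */
uint64_t estimate_instructions(uint64_t n, uint64_t elems_per_vec,
                               uint64_t instr_per_iter)
{
    uint64_t iterations = (n * n * n) / elems_per_vec;
    return iterations * instr_per_iter;
}
```

For n = 512, `estimate_instructions(512, 2, 8)` gives 536870912, which is in the same range as the instruction count measured for `vect()` in part c).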

c) Measure and analyze results

| N | Version | Time (s) | CPI | #I |
| --- | --- | --- | --- | --- |
| 512 | base_v() | 0.492484818 | 0.91 | 1113554887 |
| 512 | vect() | 0.081604350 | 2.88 | 578275097 |
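The two rows can be compared through two simple ratios: the speedup in time and the reduction in instruction count. The helper below is a small sketch for that arithmetic; the `run_t` type and function names are hypothetical.

```c
#include <stdint.h>

/* One measured run: elapsed time and retired instructions. */
typedef struct {
    double seconds;
    uint64_t instructions;
} run_t;

/* How many times faster `opt` is than `base`. */
double speedup(run_t base, run_t opt)
{
    return base.seconds / opt.seconds;
}

/* How many times fewer instructions `opt` executes than `base`. */
double instruction_ratio(run_t base, run_t opt)
{
    return (double)base.instructions / (double)opt.instructions;
}
```

With the values from the table, the speedup comes out at roughly 6.0 and the instruction ratio at roughly 1.93, the latter being close to the factor of 2 expected from processing 2 elements per vector.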

> [!note]- Commands run
> - `module load gcc/9.3.0`
> - `gcc -O2 -ftree-vectorize -msse4 mmult.c`
> - `srun --partition=cpar perf stat -e cycles,instructions ./a.out`

d) Vectorization fine-tuning

Gains of 4x.

Ex. 3

a) Peak Performance

- 2 FP operations per cycle
- 4 elements at a time in each cycle
- 2.5 GHz -> 2.5 billion cycles per second

Conclusion: 2 × 4 × 2.5 = 20 GFlop/s
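The peak figure is the product of the three quantities listed above. A minimal sketch of that arithmetic, with a hypothetical function name:

```c
/* Peak floating-point rate in GFlop/s:
 *   FP operations issued per cycle
 *   x elements processed per operation (1 for scalar code)
 *   x clock frequency in GHz */
double peak_gflops(double fp_ops_per_cycle, double elems_per_op, double clock_ghz)
{
    return fp_ops_per_cycle * elems_per_op * clock_ghz;
}
```

`peak_gflops(2, 4, 2.5)` gives 20, and the scalar case `peak_gflops(2, 1, 2.5)` gives 5.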

b)

- peak with vectorization: 20 GFlop/s
- peak without vectorization: 5 GFlop/s
- memory bandwidth limitation: 20 GFlop/s
- real achievable performance:
- measured performance:

c)

Memory bandwidth limitation

| Flop/Byte | GFlop/s |
| --- | --- |
| 0.125 | |
| 0.25 | |
| 0.5 | 10 |
| 1 | 20 |
| 2 | |
| 4 | |
| 8 | |
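The table appears to follow a roofline-style bound, where attainable performance is the smaller of the compute peak and memory bandwidth times operational intensity. The sketch below assumes that model; the 20 GB/s bandwidth used in the example is inferred from the two filled rows and is not stated in the notes, and the function name is hypothetical.

```c
/* Attainable performance in GFlop/s under a roofline-style model:
 * the minimum of the compute peak and the memory-bound rate
 * (bandwidth in GB/s times operational intensity in Flop/Byte). */
double roofline_gflops(double peak_gflops, double bandwidth_gbs, double flop_per_byte)
{
    double memory_bound = bandwidth_gbs * flop_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

With a 20 GFlop/s peak and the assumed 20 GB/s bandwidth, `roofline_gflops(20, 20, 0.5)` gives 10 and `roofline_gflops(20, 20, 1)` gives 20, matching the two filled rows.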