my_digital_garden/4a1s/CP/PL - Aula 4.md at main - Alice/my_digital_garden

Alice 96d4075be9 vault backup: 2023-10-04 18:52:59



dg-publish: true

4 October 2023 - #CP

Ex. 2

a) Vectorization limitations

- A -> consecutive elements in a row -> consecutive access in the vector
- C -> same element
- B -> consecutive elements in a column

It will not be vectorizable: B is read down a column, so its accesses are not consecutive in memory.
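A minimal sketch of the access pattern described above, assuming the baseline loop order is i, j, k over row-major n x n matrices stored as flat arrays. The name `mmult_ijk` and its signature are hypothetical; the notes do not show the contents of `mmult.c`.

```c
#include <stddef.h>

/* Hypothetical baseline kernel (loop order i, j, k), row-major storage. */
void mmult_ijk(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            for (size_t k = 0; k < n; k++) {
                /* Inner loop over k:
                 *   a[i*n + k] -> consecutive elements in a row
                 *   c[i*n + j] -> same element on every iteration
                 *   b[k*n + j] -> walks down a column (stride of n doubles) */
                c[i * n + j] += a[i * n + k] * b[k * n + j];
            }
        }
    }
}
```

The strided read of `b` in the innermost loop is what keeps the compiler from packing consecutive elements into one vector load.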

b) Enable vectorization

Result of changing the loop order to i, k, j:

- A -> same element
- C -> consecutive elements in a row -> consecutive access in the vector
- B -> consecutive elements in a row -> consecutive access in the vector

i	k	j
0	0	1

It will be vectorizable.
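A sketch of the i, k, j loop order, under the same assumptions as the notes (row-major n x n matrices of doubles). The name `mmult_ikj` and the flat-array signature are hypothetical, not taken from `mmult.c`.

```c
#include <stddef.h>

/* Hypothetical kernel with loop order i, k, j, row-major storage. */
void mmult_ikj(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            /* a[i*n + k] is the same element for the whole inner loop. */
            const double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++) {
                /* Inner loop over j:
                 *   c[i*n + j] -> consecutive elements in a row
                 *   b[k*n + j] -> consecutive elements in a row */
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}
```

Both arrays touched in the innermost loop are now accessed with unit stride, which is the pattern an auto-vectorizer can map to vector loads and stores. Whether the compiler actually vectorizes it can also depend on aliasing between `b` and `c`.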

128b vector register, 8B (64b) per double -> 2 elements per vector

Without vectorization: [image]

With vectorization: [image]

Estimated: (n^3 / 2) * 8
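The estimate can be evaluated directly. This sketch assumes the factor 8 is the number of instructions per inner-loop iteration and the division by 2 is the number of doubles per vector. Both readings are interpretations of the formula, not stated in the notes, and the function name is hypothetical.

```c
#include <stdint.h>

/* Estimated instruction count: (n^3 / lanes) * instr_per_iter.
 * lanes = 2 and instr_per_iter = 8 are assumed readings of the formula. */
uint64_t estimate_instructions(uint64_t n, uint64_t lanes,
                               uint64_t instr_per_iter)
{
    return (n * n * n / lanes) * instr_per_iter;
}
```

For N = 512, `estimate_instructions(512, 2, 8)` gives 536870912, the same order of magnitude as the 578275097 instructions measured for `vect()` in part c).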

c) Measure and analyze results

N	Version	Time (s)	CPI	#I
512	base_v()	0.492484818	0.91	1113554887
512	vect()	0.081604350	2.88	578275097

> [!note]- Commands run
> `module load gcc/9.3.0`
> `gcc -O2 -ftree-vectorize -msse4 mmult.c`
> `srun --partition=cpar perf stat -e cycles,instructions ./a.out`
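The two rows of the results table can be compared with a simple ratio. This is only a sketch: the helper name `speedup` is hypothetical, and it assumes the Time column is in seconds.

```c
/* Ratio of a baseline measurement to an improved one (time, instructions). */
double speedup(double baseline, double improved)
{
    return baseline / improved;
}
```

With the measured values, `speedup(0.492484818, 0.081604350)` is about 6.0 for execution time and `speedup(1113554887.0, 578275097.0)` is about 1.93 for instruction count.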

d) Vectorization fine-tuning

Gains of 4x.

Ex. 3

a) Peak Performance

- 2 FP operations per cycle
- 4 elements at a time in each cycle
- 2.5 GHz -> 2.5 billion cycles per second

Conclusion: 2 x 4 x 2.5 = 20 GFlop/s

> [!note]- Redoing the math
> AVX -> 256b -> 4 doubles
> machine is superscalar with 2 FP units
> 4 x 2 = 8 double operations per cycle
>
> freq = 2.5 GHz
> 8 x 2.5 = 20 GFlop/s (CPU limitation)
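The peak calculation above can be written as a one-line function. The function and parameter names are hypothetical; the values (2 FP units, 4 doubles per AVX vector, 2.5 GHz) come from the notes.

```c
/* Peak rate in GFlop/s = FP units * SIMD lanes * clock frequency in GHz. */
double peak_gflops(int fp_units, int simd_lanes, double freq_ghz)
{
    return fp_units * simd_lanes * freq_ghz;
}
```

`peak_gflops(2, 4, 2.5)` gives 20 GFlop/s with vectorization. With a single lane, `peak_gflops(2, 1, 2.5)` gives 5 GFlop/s, which matches the non-vectorized peak listed in b).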

b)

- peak with vectorization: continuous 20 GFlop/s
- peak without vectorization: continuous 5 GFlop/s
- memory bandwidth limitation: see part d)
- real achievable performance: see part c)
- measured performance:

d) Memory bandwidth limitation

1 FOP -> 2B

Flop/Byte	GFlop/s
0.125	2.5
0.25	5
0.5	10
1	20
2	40
4	80
8	160
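Every pair in the table is consistent with a memory bandwidth of 20 GB/s, reading the powers of two (0.125 to 8) as the intensity in Flop/Byte and the other value as the rate in GFlop/s, as in part c). That bandwidth is inferred from the pairs, not stated in the notes, and the function name is hypothetical.

```c
/* Bandwidth-bound rate in GFlop/s = bandwidth (GB/s) * intensity (Flop/Byte).
 * The 20 GB/s used in the examples is inferred from the table. */
double bandwidth_bound_gflops(double bandwidth_gbs, double flop_per_byte)
{
    return bandwidth_gbs * flop_per_byte;
}
```

For example, `bandwidth_bound_gflops(20.0, 0.125)` gives 2.5 and `bandwidth_bound_gflops(20.0, 8.0)` gives 160. Values above 20 GFlop/s lie beyond the CPU peak from part a), so they only describe the memory line itself.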

c)

2 FOP (floating-point operations) -> 2 doubles (16B)
1 operation / 8B -> 0.125 Flop/Byte

GFlop/s	Flop/Byte
2.5	0.125
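The achievable rate is the smaller of the CPU peak and the bandwidth-bound rate. This is a sketch of that roofline rule, assuming the 20 GFlop/s peak from part a) and a 20 GB/s bandwidth inferred from 2.5 / 0.125. The function name is hypothetical.

```c
/* Roofline: attainable GFlop/s = min(peak, bandwidth * intensity).
 * The 20 GB/s bandwidth used in the examples is inferred, not measured. */
double attainable_gflops(double peak_gflops_cpu, double bandwidth_gbs,
                         double flop_per_byte)
{
    double memory_bound = bandwidth_gbs * flop_per_byte;
    return memory_bound < peak_gflops_cpu ? memory_bound : peak_gflops_cpu;
}
```

`attainable_gflops(20.0, 20.0, 0.125)` gives 2.5 GFlop/s, so at 0.125 Flop/Byte the kernel is limited by memory rather than by the CPU.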
