---
dg-publish: true
---
4 October 2023 - #CP
## Ex. 2
#### a) Vectorization limitations
A -> consecutive elements in a row -> consecutive accesses in the vector
C -> same element
B -> consecutive elements in a column -> strided accesses in memory
Not vectorizable as written: the innermost loop walks B down a column.
#### b) Enable vectorization
Result of changing the loop order to i, k, j:
A -> same element
C -> consecutive elements in a row -> consecutive accesses in the vector
B -> consecutive elements in a row -> consecutive accesses in the vector

Index step in the innermost loop:

| i | k | j |
| --- | --- | --- |
| 0 | 0 | 1 |

Vectorizable.
Vector register: 128 bits
double: 8 B -> 64 bits
128 / 64 = 2 elements per vector
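
The loop interchange described above can be sketched as follows. This is a minimal illustration, not the course's `mmult.c`: the function names `mmult_ijk` and `mmult_ikj`, the flat row-major layout, and the `restrict` qualifier are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline order i, j, k (assumed original order): the innermost loop
   reads b[k][j] down a column, i.e. with a stride of n doubles. */
void mmult_ijk(size_t n, const double *a, const double *b, double *restrict c)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Interchanged order i, k, j: a[i][k] is invariant in the innermost loop,
   while c[i][j] and b[k][j] are both accessed along a row (stride 1). */
void mmult_ikj(size_t n, const double *a, const double *b, double *restrict c)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            const double aik = a[i * n + k]; /* same element for all j */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

Both functions accumulate into `c`, so the caller is expected to zero it first. With 128-bit vectors, each vector iteration of the inner `j` loop in `mmult_ikj` can process 2 doubles.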
Without vectorization:
![[Pasted image 20231004115725.png]]
With vectorization:
![[Pasted image 20231004115135.png]]
Estimated #I: (n^3 / 2) * 8 (n^3 / 2 vector iterations, 8 instructions each)
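
Reading the estimate as an instruction count for the vectorized loop (an assumption, since the note does not name the quantity), it can be evaluated for a given N as sketched below. The function name `estimated_instructions` is hypothetical, and the factor 8 is taken directly from the estimate.

```c
#include <assert.h>
#include <stdint.h>

/* Estimate: n^3 scalar iterations, 2 doubles per 128-bit vector,
   and 8 instructions per vector iteration (factor from the note). */
uint64_t estimated_instructions(uint64_t n)
{
    const uint64_t lanes = 2;          /* 128 bits / 64-bit double */
    const uint64_t instr_per_iter = 8; /* assumed loop-body size */
    assert(n % lanes == 0);
    return (n * n * n / lanes) * instr_per_iter;
}
```

For N = 512 this gives 536870912, roughly 8% below the 578275097 instructions measured for `vect()` in part c).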
#### c) Measure and analyze results
| N | Version | Time (s) | CPI | \#I |
| --- | -------- | ---- | --- | --- |
| 512 | base_v() | 0.492484818 | 0.91 | 1113554887 |
| 512 | vect() | 0.081604350 | 2.88 | 578275097 |
>[!note]- Commands run
>module load gcc/9.3.0
>gcc -O2 -ftree-vectorize -msse4 mmult.c
>srun --partition=cpar perf stat -e cycles,instructions ./a.out
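
As a rough aid for reading the measurements, the ratios between the two rows can be computed as sketched below. The helper name `improvement_ratio` is hypothetical; the numbers fed to it come from the table.

```c
#include <assert.h>

/* Ratio of a baseline measurement to an optimized one.
   Works for elapsed time as well as for instruction counts. */
double improvement_ratio(double base, double optimized)
{
    assert(optimized > 0.0);
    return base / optimized;
}
```

With the values above, `improvement_ratio(0.492484818, 0.081604350)` is about 6.0 for time, and `improvement_ratio(1113554887.0, 578275097.0)` is about 1.9 for the instruction count.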
#### d) Vectorization fine-tuning
Gains of 4x.
----
## Ex. 3
#### a) Peak Performance
2 FP operations
4 elements at a time per cycle
2.5 GHz -> 2.5 billion cycles per second
Conclusion: 2 x 4 x 2.5 = 20 GFlop/s
>[!note]- Redoing the math
>AVX -> 256 bits -> 4 doubles
>machine is superscalar with 2 FOP units
>4 x 2 = 8 double operations per cycle
>
>freq = 2.5 GHz
>8 x 2.5 = 20 GFlop/s
> ^ CPU limitation
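
The peak calculation above can be written as a small helper. This is a sketch under the note's own assumptions (one floating-point operation per lane per FP unit per cycle); the function name `peak_gflops` and its parameters are hypothetical.

```c
#include <assert.h>

/* Peak FP throughput in GFlop/s: vector lanes x FP units x clock (GHz). */
double peak_gflops(int vector_bits, int element_bits, int fp_units, double ghz)
{
    assert(element_bits > 0 && vector_bits % element_bits == 0);
    int lanes = vector_bits / element_bits; /* 256 / 64 = 4 doubles for AVX */
    return (double)lanes * fp_units * ghz;
}
```

`peak_gflops(256, 64, 2, 2.5)` gives 20 GFlop/s with AVX. Treating the scalar case as a single 64-bit lane, `peak_gflops(64, 64, 2, 2.5)` gives 5 GFlop/s.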
#### b)
peak with vectorization: constant 20 GFlop/s
peak without vectorization: constant 5 GFlop/s
memory bandwidth limitation: ***see part d)***
real achievable performance: ***see part c)***
measured performance:
#### d) Memory bandwidth limitation
1 FOP -> 2B
| Flop/Byte | GFlop/s |
| --------- | ------- |
| 0.125 | 2.5 |
| 0.25 | 5 |
| 0.5 | 10 |
| 1 | 20 |
| 2 | 40 |
| 4 | 80 |
| 8 | 160 |
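
Each row of the table pairs an operational intensity with a rate 20 times larger, which suggests a memory bandwidth of 20 GB/s. That figure is an inference, since the note does not state the bandwidth. Combined with the 20 GFlop/s compute peak from a), the usual roofline bound can be sketched as below; the function name `attainable_gflops` is hypothetical.

```c
#include <assert.h>

/* Roofline model: attainable GFlop/s is the smaller of the compute peak
   and the memory-bound rate (bandwidth x operational intensity). */
double attainable_gflops(double peak, double bandwidth_gbs, double flop_per_byte)
{
    assert(peak > 0.0 && bandwidth_gbs > 0.0 && flop_per_byte >= 0.0);
    double memory_bound = bandwidth_gbs * flop_per_byte;
    return memory_bound < peak ? memory_bound : peak;
}
```

Under these assumptions the memory line meets the compute peak at 1 Flop/Byte. Rows above that point (40, 80, 160) describe the memory bound only and would be capped at 20 GFlop/s.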
#### c)
2 FOP (floating-point operations) -> 2 doubles (16 B)
1 operation / 8 B -> 0.125 Flop/Byte
| GFlop/s | Flop/Byte |
| ------- | --------- |
| 2.5 | 0.125 |
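
The operational intensity used here is just operations divided by bytes moved. A minimal sketch, with the hypothetical helper name `operational_intensity`:

```c
#include <assert.h>

/* Operational intensity in Flop/Byte: FP operations per byte transferred. */
double operational_intensity(double flops, double bytes)
{
    assert(bytes > 0.0);
    return flops / bytes;
}
```

`operational_intensity(2.0, 16.0)` gives 0.125 Flop/Byte. Multiplying by a 20 GB/s bandwidth (a value inferred from the d) table, not stated in the note) gives the 2.5 GFlop/s in the row above.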