---
dg-publish: true
---
4 de Outubro 2023 - #CP 

## Ex. 2
#### a) Limitações vetoriais
A -> consecutive elements in a row -> consecutive access in the vector
C -> same element
B ->  consecutive elements in a collumn 

Não vai ser vetorizável.


#### b) Enable vectorization
result of change cycles to i, k , j  :
A -> same element
C -> consecutive elements in a row -> consecutive access in the vector
B -> consecutive elements in a row -> consecutive access in the vector

i  k  j
0 0 1

Vai ser vetorizável.

128b
8B -> 64b
2 elements

Without vectorization:
![[Pasted image 20231004115725.png]]

With vectorization:
![[Pasted image 20231004115135.png]]
Estimated: ( n^3 / 2 )* 8
#### c)Measure and analyze results

| N   | Version  | Time | CPI | \#I  |
| --- | -------- | ---- | --- | --- | 
| 512 | base_v()|  0.492484818    |  0.91   |  1113554887   |   
| 512 | vect() |  0.081604350    |   2.88  |  578275097   |       

>[!note]- Commands run
>module load gcc/9.3.0
>gcc -O2 -ftree-vectorize -msse4 mmult.c
>srun --partition=cpar perf stat -e cycles,instructions ./a.out


#### d) Vectorization fine-tuning
Ganhos de 4 vezes mais.


----
## Ex. 3
#### a) Peak Performance
2 operações em FP
4 elementos de cada vez em cada ciclo
2.5 GHz -> 2.5 billion cycles per second

conclusion: 20 GFlop/s

>[!note]- Redoing the math
>AVX -> 256b -> 4 doubles
>machine is superscalar with 2 FOP units
>4x2= 8 double-perations 
>
>freq = 2.5 GHz
>8x2.5= 20 GFlop/s
>            ^ cpu limitiation

#### b)
peak with vectorization: continuous 20 GFlop/s
peak without vectorization: continuous 5 GFlop/s
memory bandwith limitation: ***see alinea d)***
real achievable performance:***see alinea c)***
measured performance:

#### d) Memory bandwidth limitation
1 FOP -> 2B

| GFlop/s | Flop/Byte |
| ------- | --------- |
| 0.125   | 2.5       |
| 0.25    | 5         |
| 0.5     | 10        |
| 1       | 20        |
| 2       | 40        |
| 4       | 80        |
| 8       | 160       |

#### c)
2 FOP (operações vírgula flutuante) -> 2 doubles (16B)
1 operation/8B -> 0.125

| GFlop/s | Flop/Byte |
| ------- | --------- |
|  2.5 |    0.125     |