Analyzing the STREAM Benchmark

The STREAM benchmark is designed to measure sustainable memory bandwidth and the corresponding computation rate for simple vector kernels.

What the benchmark does

The STREAM benchmark runs four kernels over three arrays of 10 million FP64 elements each. Each kernel is timed from loading its first element until storing its last element, and the bandwidth is computed as the number of bytes moved divided by that time. The best bandwidth achieved across all repetitions is reported for each kernel.

The kernels are called Copy, Scale, Sum (reported as Add in the benchmark output) and Triad, and perform the following operations:

Copy Kernel

The Copy kernel (Line 315) performs the operation c[j] = a[j]. It performs no arithmetic but moves 16 bytes of data per iteration: an 8-byte load and an 8-byte store.

Scale Kernel

The Scale kernel (Line 325) performs the operation b[j] = scalar * c[j]. It performs one floating-point operation per 16 bytes of moved data: 8 bytes are loaded and 8 bytes are stored per iteration. The scalar is a loop-invariant constant and is not counted towards the data volume by STREAM.

Sum Kernel

The Sum kernel (Line 335) performs the operation c[j] = a[j] + b[j]. It performs one arithmetic operation per 24 bytes of moved data: 16 bytes are loaded and 8 bytes are stored per iteration.

Triad Kernel

The Triad kernel (Line 345) performs the operation a[j] = b[j] + scalar * c[j]. It performs two arithmetic operations per 24 bytes of moved data: 16 bytes are loaded and 8 bytes are stored per iteration.

These kernels share two characteristics: low arithmetic intensity and high spatial locality. The purely sequential access pattern allows the hardware prefetchers to maximize the data rate when loading data from memory into the caches.
The data set is chosen large enough not to fit inside the last-level cache. Each array is 8 bytes/element * 10^7 elements = 80 MB, which exceeds typical L3 cache sizes such as the 32 MB of Ryzen 7000 desktop CPUs.
Each kernel is run 10 times to reduce noise, so each kernel operation is executed a total of 100 million (10^8) times.
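
For reference, the following is a simplified sketch of the four kernel loops as they appear in stream.c; the timing calls, the validation, and the 10-times repetition loop around them are omitted, and the exact declarations differ slightly from the original source.

/* simplified sketch of the STREAM kernels, not the original stream.c */
#define STREAM_ARRAY_SIZE 10000000   /* 10^7 FP64 elements per array, 80 MB each */

static double a[STREAM_ARRAY_SIZE], b[STREAM_ARRAY_SIZE], c[STREAM_ARRAY_SIZE];

static void run_kernels(void)
{
    double scalar = 3.0;                             /* same constant as in stream.c */
    for (long j = 0; j < STREAM_ARRAY_SIZE; j++)     /* Copy  (Line 315) */
        c[j] = a[j];
    for (long j = 0; j < STREAM_ARRAY_SIZE; j++)     /* Scale (Line 325) */
        b[j] = scalar * c[j];
    for (long j = 0; j < STREAM_ARRAY_SIZE; j++)     /* Sum   (Line 335), reported as Add */
        c[j] = a[j] + b[j];
    for (long j = 0; j < STREAM_ARRAY_SIZE; j++)     /* Triad (Line 345) */
        a[j] = b[j] + scalar * c[j];
}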

What the CPU executes

I compiled the STREAM source with gcc at the recommended first optimization level, using the following command:
gcc -O -o stream stream.c

To analyze the executed instructions, I ran the binary with Intel's Software Development Emulator (SDE) version 9.33, using the following command:
sde -rpl -mix -- stream

STREAM uses 127 different instructions. The top ten instructions account for almost 99 % of all 2.7 billion instructions executed. They are the following:

[Figure: STREAM benchmark top ten instructions]

The top 7 instructions belong to the kernels used to measure the memory bandwidth. They account for 96 % of all instructions. The most frequent instruction is MOVSD_XMM, which is used both to load data from memory into a register and to store data from a register back to memory. Even though STREAM is described as measuring bandwidth by executing vector kernels, with the recommended optimization level -O gcc does not emit any vector instructions. MOVSD, ADDSD, MULSD and MOVAPD do use the XMM vector registers, but they operate only on scalar values.
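
As an aside that I have not verified for this particular build: at higher optimization levels gcc auto-vectorizes loops like these, so compiling for example with

gcc -O3 -march=native -o stream stream.c

should replace the scalar SSE instructions with packed vector instructions. The remainder of this analysis sticks to the -O build.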

To better understand how these instructions are used, I created a listing file and analyzed the assembly of the kernels.
The listing file was created with the same optimization level -O and with debug symbols enabled. The full command is the following:
gcc -g -O -Wa,-adhln -masm=intel stream.c > stream.lst

Assembly of Copy Kernel (Line 315): c[j] = a[j]

Below is the assembly of the Copy kernel generated from the source code c[j] = a[j].

movsd xmm0, QWORD PTR [r12+rax*8]     ; load a[j] into xmm0 (r12 holds the base address of a, rax*8 scales the index j by the 8-byte element size)
movsd QWORD PTR [rbx+rax*8], xmm0     ; store xmm0 to c[j] (rbx holds the base address of c)
add rax, 1                            ; j++
cmp rax, 10000000                     ; compare j with the array size of 10,000,000
jne .L85                              ; if j != 10,000,000, jump back to the beginning of the copy kernel

Assembly of Scale Kernel (Line 325): b[j] = scalar * c[j]

Below is the assembly of the Scale kernel generated from the source code b[j] = scalar * c[j].

movsd xmm0, QWORD PTR .LC4[rip]      ; load the scalar constant into xmm0
mulsd xmm0, QWORD PTR [rbx+rax*8]    ; xmm0 = scalar * c[j]
movsd QWORD PTR 0[rbp+rax*8], xmm0   ; store the result into b[j]
add   rax, 1                         ; j++
cmp   rax, 10000000                  ; compare j with the array size of 10,000,000
jne   .L86                           ; if j != 10,000,000, jump back to the beginning of the scale kernel

Note how the scalar is loaded from memory in every iteration, even though it is a constant value. This happens because mulsd has only two operands: the destination register is also the first source, so the scalar in xmm0 is overwritten by the multiplication result and must be reloaded. It is unclear to me why gcc did not allocate a different register such as xmm1 (as it does for the Triad kernel below), or use vmulsd, whose three-operand form allows a separate destination register. ARM and RISC-V arithmetic instructions have three operands and therefore allow the destination register to differ from the source registers.

Assembly of Sum Kernel (Line 335): c[j] = a[j] + b[j]

Below is the assembly of the Sum Kernel generated from the source code c[j] = a[j] + b[j].

movsd xmm0, QWORD PTR [r12+rax*8]   ; load a[j] into xmm0
addsd xmm0, QWORD PTR 0[rbp+rax*8]  ; xmm0 = a[j] + b[j] (xmm0 holds both the first operand and the result)
movsd QWORD PTR [rbx+rax*8], xmm0   ; store the result into c[j]
add   rax, 1                        ; j++
cmp   rax, 10000000                 ; compare j with the array size of 10,000,000
jne   .L87                          ; if j != 10,000,000, jump back to the beginning of the sum kernel

Assembly of Triad Kernel (Line 345): a[j] = b[j] + scalar * c[j]

Below is the assembly of the Triad kernel generated from the source code a[j] = b[j] + scalar * c[j].

movapd xmm0, xmm1                   ; copy the scalar into xmm0 (the scalar was loaded into xmm1 once, before the loop)
mulsd  xmm0, QWORD PTR [rbx+rax*8]  ; xmm0 = scalar * c[j]
addsd  xmm0, QWORD PTR 0[rbp+rax*8] ; xmm0 = b[j] + scalar * c[j]
movsd  QWORD PTR [r12+rax*8], xmm0  ; store the result into a[j]
add    rax, 1                       ; j++
cmp    rax, 10000000                ; compare j with the array size of 10,000,000
jne    .L88                         ; if j != 10,000,000, jump back to the beginning of the triad kernel

Notice that the addition and multiplication are performed as two separate instructions rather than a single fused multiply-add (FMA) instruction. FMA is not part of the x86-64 baseline instruction set, so gcc will not emit it unless the target is specified accordingly, for example with -mfma or -march=native.
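
As an illustration (this is not part of stream.c): the Triad operation can also be written explicitly with fma() from <math.h>, which computes x * y + z with a single rounding. Whether this compiles to an actual FMA instruction still depends on the target flags; without FMA support in the target it falls back to a library call.

#include <math.h>

/* Triad written with an explicit fused multiply-add (illustrative sketch, not from stream.c) */
void triad_fma(double *a, const double *b, const double *c, double scalar, long n)
{
    for (long j = 0; j < n; j++)
        a[j] = fma(scalar, c[j], b[j]);   /* scalar * c[j] + b[j], rounded once */
}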

The most frequently used instructions in places two to four (CMP, ADD and JNZ) appear in every kernel: they increment the loop variable j and determine when to exit the loop.

Memory Bandwidth Measurement

My system has an AMD Ryzen 7840U CPU connected to dual-channel DDR5 memory running at 6400 MT/s.
The theoretical maximum bandwidth is the following:
BW = 2 channels * 64 bit/transfer * 6.4 GT/s / (8 bit/Byte) = 102.4 GByte/s

Running the STREAM benchmark outputs the following measurements:

jscha@thinkpad:~/dvp/intel-sde$ stream
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           30165.4     0.005837     0.005304     0.009728
Scale:          26246.2     0.006210     0.006096     0.006791
Add:            30146.8     0.008057     0.007961     0.008529
Triad:          29947.7     0.008326     0.008014     0.009746
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

The first observation is that the best-case measured bandwidth reaches only about a third of the theoretical maximum. The instruction fetches themselves are not counted in the bandwidth measurement; however, because the kernel loops are small enough to fit inside the instruction cache, they are fetched from memory only once and should have almost no impact on the measurement.
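
As a sanity check on how these numbers come about, the reported rate can be recomputed from the byte count of a kernel and its minimum time. The small sketch below does this for the Copy kernel; the byte count of 2 * 8 bytes per element follows STREAM's accounting, and the factor 1.0e-6 reflects that STREAM reports MB/s as 10^6 bytes per second.

#include <stdio.h>

int main(void)
{
    const double n        = 10000000.0;   /* elements per array */
    const double bytes    = 2 * 8 * n;    /* Copy: one 8-byte load and one 8-byte store per element */
    const double min_time = 0.005304;     /* best (minimum) Copy time from the run above, in seconds */

    /* STREAM reports the best rate as bytes / min_time, scaled to 10^6 bytes per second */
    printf("Copy best rate: %.1f MB/s\n", 1.0e-6 * bytes / min_time);
    return 0;
}

This prints roughly 30166 MB/s, which matches the 30165.4 MB/s reported above; the small difference comes from the rounded minimum time in the output.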