How compute is accelerated on CPUs and GPUs
In simulation, finance, or graphics rendering, it is often the case that the same operation is performed on every data element of a data set. For example, the operation could be the addition of two vectors. Here each vector is a data set and each dimension a data element. To compute the result, an addition has to be performed for each element of the vectors.
CPU
Traditional instructions
With a traditional C program, one could instruct the CPU in the following way:
int main() {
    float a[100], b[100], c[100];
    // a and b are assumed to be filled with input data
    for (int i = 0; i < 100; i++) {
        c[i] = a[i] + b[i];
    }
    return 0;
}
Compiled without any vector extensions, the CPU would execute the program by doing the following steps 100 times:
loop 100:
    load a;
    load b;
    add a, b;
    store c;
For each data element there is a separate instruction specifying which operation should be performed (single instruction, single data; SISD).
Vector instructions
Modern CPUs have special instructions that allow them to execute the same operation on multiple data elements at once (single instruction, multiple data; SIMD). On x86 the corresponding ISA extension is called Advanced Vector Extensions (AVX). For example, AVX2 was introduced in 2013 with Intel's Haswell architecture. Its instructions operate on 256-bit registers, so in the case of 32-bit floating-point numbers each instruction operates on 8 data elements at the same time.
Compiled with AVX instructions, the CPU would execute the same program by doing the following only 13 times (the last iteration covers the remaining 4 elements):
loop 13:
    vload a;
    vload b;
    vadd a, b;
    vstore c;
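In practice a compiler will usually auto-vectorize the plain C loop on its own (for example with -O3 -mavx2). To make the vload/vadd/vstore structure explicit, a minimal sketch using 256-bit AVX float intrinsics could look as follows; the function name and the scalar tail loop for the last 4 elements are illustrative choices, not part of the original program:

#include <immintrin.h>  // AVX intrinsics; compile with e.g. -mavx

#define N 100

void vec_add_avx(const float *a, const float *b, float *c)
{
    int i = 0;
    // Process 8 floats per iteration: vload a, vload b, vadd, vstore c
    for (; i + 8 <= N; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        _mm256_storeu_ps(&c[i], _mm256_add_ps(va, vb));
    }
    // The remaining N % 8 = 4 elements are handled one by one
    for (; i < N; i++)
        c[i] = a[i] + b[i];
}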
GPU
Using a GPU, the computation would be performed differently. The CPU copies the vectors a and b to GPU memory, instructs the GPU to perform a vector addition, and copies the result vector c back to CPU memory.
Any function, like a vector addition, that is executed on a GPU is called a kernel.
The GPU creates one thread for each addition. In our case there will be 100 threads that independently execute the same kernel in parallel.
By grouping threads into a block, the programmer can force the execution of all of these threads on the same streaming multiprocessor (SM).
Pseudocode that runs on the CPU and instructs the GPU to perform the vector addition:
// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(float *a, float *b, float *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x + threadIdx.x;
    // Make sure we do not read past the end of the vectors
    if (id < n)
        c[id] = a[id] + b[id];
}

int main() {
    float a[100], b[100], c[100];
    // copy the input vectors to GPU memory (pseudocode)
    memcpyCPUtoGPU(a, b);
    // vecAdd<<<blockCount, threadsPerBlock>>>(a, b, c, threadCount);
    vecAdd<<<1, 100>>>(a, b, c, 100);
    // copy the result back to CPU memory (pseudocode)
    memcpyGPUtoCPU(c);
    // do stuff with vector c
    return 0;
}
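The calls memcpyCPUtoGPU and memcpyGPUtoCPU above are placeholders. As a rough sketch of what a compilable version might look like with the CUDA runtime API, with explicit device allocations and copies (error checking omitted, input values purely illustrative):

#include <cuda_runtime.h>

#define N 100

// Same kernel as above: each thread handles one element of c
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n)
        c[id] = a[id] + b[id];
}

int main() {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }  // example input

    // Allocate GPU memory and copy the input vectors to it
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, N * sizeof(float));
    cudaMalloc((void **)&d_b, N * sizeof(float));
    cudaMalloc((void **)&d_c, N * sizeof(float));
    cudaMemcpy(d_a, a, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N * sizeof(float), cudaMemcpyHostToDevice);

    // Launch one block of 100 threads on the GPU
    vecAdd<<<1, 100>>>(d_a, d_b, d_c, N);

    // Copy the result vector back to CPU memory
    cudaMemcpy(c, d_c, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}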
Inside the GPU the vector addition happens the following way:
The streaming multiprocessor takes the assigned threads and creates groups of 32 threads, called warps. Warps are then executed independently from each other; the scheduler issues whichever warps are ready. Each warp is executed in SIMD fashion, just like a vector instruction on a CPU. To run all 100 threads, 4 warps will be created (the last warp has only 4 active threads).
Note that since Volta, NVIDIA's streaming multiprocessors contain four processing blocks that each process a different warp, so an SM could work on all 100 threads at the same time. Further note that throughput inside a processing block depends on the number of CUDA cores and load/store units. With Ada there are only four load/store units but 32 floating-point-capable CUDA cores per processing block. In terms of throughput, memory access is therefore limited to four threads per cycle, while compute is limited to 32 threads per cycle. Dispatching all 32 threads of a warp to execute a load/store takes 8 cycles, while dispatching all 32 threads for the addition takes only one cycle.
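Restating that arithmetic as a small sketch (the per-block unit counts are the Ada figures quoted above):

// Warp and dispatch-cycle arithmetic for the 100-element example
int threads        = 100;
int warp_size      = 32;
int warps          = (threads + warp_size - 1) / warp_size;  // ceil(100/32) = 4 warps
int lsu_per_block  = 4;   // load/store units per processing block (Ada)
int fp32_per_block = 32;  // FP32-capable CUDA cores per processing block (Ada)
int load_cycles    = warp_size / lsu_per_block;   // 32 / 4  = 8 cycles to dispatch a warp's loads/stores
int add_cycles     = warp_size / fp32_per_block;  // 32 / 32 = 1 cycle to dispatch a warp's additions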