Analysis of Current CPU and GPU Design

CPUs and GPUs are both processors used to process data. The concept behind a central processing unit (CPU) is that of a general purpose application processor: it should run a wide variety of software fast, from operating systems to web browsers to image manipulation software. The concept behind a GPU is that of a special purpose accelerator. It is used in addition to a CPU to accelerate certain tasks, like rendering games and movies, running scientific simulations for medicine, biology, or engineering, or running neural networks.

This article compares the design of current high end CPUs and GPUs. It shows how the different processor concepts, general purpose vs. special purpose, impact design decisions.
On the CPU side it takes a look at the AMD Ryzen 9 7950X and the Intel Core i9-13900K. On the GPU side it takes a look at the AMD Radeon RX 7900 XTX and the NVIDIA RTX 4090.

| Aspect | Ryzen 7950X | i9-13900K | RTX 4090 | RX 7900 XTX |
| --- | --- | --- | --- | --- |
| Basic Processing Unit | Core | Core | Streaming Multiprocessor (SM) | Workgroup Processor (WGP) |
| No. of Processing Units | 16 | 24 | 128 | 48 |
| Clock Speed (all core) | 5.1 GHz | 5.5 GHz | 2.5 GHz | 2.5 GHz |
| TFLOPS (FP32 FMA) | 2.6 | 2.9 | 82.5 | 61.4 |

Architecture

CPUs and GPUs are built around a basic processing unit. On CPUs it is called a core; on GPUs it is called a Streaming Multiprocessor (NVIDIA) or Workgroup Processor (AMD). Both follow the von Neumann architecture of computing: data is taken from memory, processed, and the results are written back to memory. Performance is then scaled by replicating this processing unit, as can be seen in the following table:

| Aspect | Ryzen 7950X | i9-13900K | RTX 4090 | RX 7900 XTX |
| --- | --- | --- | --- | --- |
| Basic Processing Unit | Core | Core | Streaming Multiprocessor (SM) | Workgroup Processor (WGP) |
| No. of Processing Units | 16 | 24 | 128 | 48 |
| TFLOPS (FP32 FMA) | 2.6 | 2.9 | 82.5 | 61.4 |
| FP32 Operations / Cycle | 16 × 32 = 512 | 8 × 24 + 16 × 12 = 384 | 128 × 128 = 16,384 | 48 × 256 = 12,288 |
| FP32 Operations / (Proc Unit × Cycle) | 4 × 8 = 32 | P: 24, E: 12 | 4 × 32 = 128 | 2 × 4 × 32 = 256 |
| INT Operations / (Proc Unit × Cycle) | 4 × 1 = 4 | 5 × 1 = 5 | 4 × 16 = 64 | 4 × 32 = 128 |
| LD/ST Operations / (Proc Unit × Cycle) | 3 | 6 | 4 × 4 = 16 | 2 |

The floating point capabilities of the performance (P) and efficiency (E) cores of the 13900K are the same as those of the P and E cores inside the 12900K. A 13900K P-core has three 256-bit wide floating point vector execution units, while an E-core has three 128-bit wide ones, resulting in 24 and 12 operations per clock respectively.

How well performance scales with the number of processing units depends on how many independent threads the workload consists of. In typical application software, many parts are single threaded or only lightly threaded; think of a word processor, e-mail client, web browser, or music player. To run fast, they rely on fast individual processing units. Current generation high end desktop CPUs from AMD and Intel feature 16 cores (7950X) or 24 cores (13900K).

However, there are workloads that consist of thousands of individual threads. These benefit from having lots of processing units, and GPUs are designed for such workloads. Current generation high end GPUs feature 48 processing units (7900 XTX) or 128 (RTX 4090).

Speed Up Computation Through Parallelism

CPUs and GPUs speed up code execution by, among other things, exploiting parallelism. Generally speaking, CPUs are optimized for execution of multiple independent instructions in parallel (exploiting instruction level parallelism) while GPUs are optimized for execution on multiple independent data elements in parallel (exploiting data level parallelism).

To understand how these concepts translate into design, we take a look at the basic processing units of a 7950X and an RTX 4090: the Zen4 core and the Ada SM. A block diagram of their design is given below:

Block diagram of Zen4 core and Ada Streaming Multiprocessor.

The basic processing units can be coarsely divided into a frontend and a backend.

Frontend

In the frontend instructions get fetched from memory and decoded.

A Zen4 core uses a single 32 KiB instruction cache. A branch predictor tries to predict the program flow and fetches instructions from it in advance. The complex x86 instructions are broken down by the decoder into smaller micro operations (micro-ops). Up to six micro-ops can be dispatched each cycle.

An Ada streaming multiprocessor (SM) is structured into four partitions that act independently of each other. Each partition dispatches one instruction per cycle from its own 32 KiB instruction cache.

Backend

In the backend instructions are scheduled for execution, executed and their results are written back to memory.

Zen4 provides separate execution paths for integer and floating point instructions. Instructions first visit the rename stage, where a number of optimizations are performed: moves and no-ops are eliminated, and where no dependencies exist, instructions are allocated to different schedulers to be executed in parallel.

The operands needed for execution are stored in the register file. Zen4 has two separate register files for integer and floating point operands. The integer register file consists of 224 64-bit registers, while the floating point register file consists of 192 512-bit registers, for a total of about 14 KiB. The floating point registers are bigger because that part of the core supports single instruction multiple data (SIMD) execution via vector extensions, similar to what a GPU does. A 512-bit register can store up to 8 FP64 or 16 FP32 operands. The integer side computes according to the single instruction single data (SISD) principle: each instruction generally operates on 64-bit operands.

The SM is capable of scheduling up to four instructions per cycle, one fourth of Zen4's theoretical maximum. However, the GPU computes with SIMD: with each instruction, an Ada SM performs up to 32 32-bit operations.

In the SM, each partition has its own register file storing 16,384 32-bit integer and floating point operands, for a total of 4 × 16,384 × 4 B = 256 KiB. Ada's register file is over 18 times larger than Zen4's, providing space for the up to 48 concurrent warps that can be assigned to an SM. Compare that to a maximum of two concurrent threads that can be assigned to a Zen4 core.

At peak, Zen4 is capable of scheduling up to 16 instructions per cycle for execution: ten on the integer part and six on the floating point part of the core. Zen4 can perform 4 INT64/INT32 operations per cycle and, using vector instructions, 32 FP32 operations per cycle.

Meanwhile, an Ada SM is capable of performing up to 64 INT32 and 64 FP32 operations, or 128 FP32 operations, per cycle: 16 times the integer and 4 times the floating point throughput of Zen4.

Zen4 is capable of performing 3 memory operations (max 3 loads, 2 stores) per cycle, while an Ada SM is capable of 16 memory operations per cycle.

For AI workloads, Ada SMs are equipped with 4th generation Tensor Cores that perform matrix multiply-accumulate operations with FP16, BF16, FP8, and INT8 operands.

Memory

CPUs and GPUs are designed according to the von Neumann architecture: they take information from memory, process it, and write the results back. Both use synchronous dynamic random access memory (SDRAM), but of different types, optimized to achieve different goals: CPUs use DDR memory, while GPUs use GDDR.

With SDRAM, the main aspects of interest from a performance perspective are latency and bandwidth. Latency is how long it takes for the first data to arrive; bandwidth is how much data arrives per unit of time. Ideally there would be zero latency and infinite bandwidth. In reality, one has to be traded off against the other, and CPUs and GPUs have different requirements for their memory.

| Aspect | Ryzen 7950X | i9-13900K | RTX 4090 | RX 7900 XTX |
| --- | --- | --- | --- | --- |
| Memory Type | DDR5-5600 | DDR5-7200 / DDR4 | GDDR6X | GDDR6 |
| Memory Bandwidth | 83.2 GB/s | 89.6 GB/s | 1008 GB/s | 960 GB/s |
| Memory Latency | ~70 ns | ~70 ns | ~250 ns | ~250 ns |
| Memory Latency (cycles) | ~360 | ~380 | ~625 | ~625 |
| L1 I-Cache / Proc Unit | 32 KB | 32 KB | 32 KB | — |
| L1 D-Cache / Proc Unit | 32 KB | 48 KB | 128 KB | — |
| L2 Cache / Proc Unit | 1 MB | 2 MB / 4 MB | — | 6 MB |
| Last Level Cache | 64 MB | 36 MB | 72 MB | 96 MB |

Modern CPUs have memory latencies of around 70 ns and memory bandwidths of around 80 GB/s. A CPU is expected to react quickly to unpredictable user input. Software is usually composed of blocks of arithmetic, logic, or IO-related statements that are separated by conditional statements. Depending on the results of these statements or on user input, the program executes linearly or branches to a different part. With fast DDR5 RAM, the latency to copy data from memory into a CPU core's cache is around 70 ns. That may sound small, but it means a CPU core running at 5 GHz (0.2 ns per cycle) waits for 350 cycles. To hide the latency and avoid waiting, the CPU tries to predict the program flow and fetch the right data from memory in advance. It uses, for example, prefetchers and branch predictors to copy data into its L1 cache and have it available within a few cycles. Despite all of this, low latency is more valuable to a CPU than high bandwidth.

GPUs on the other hand use GDDR memory. With it, the 7900 XTX and RTX 4090 achieve more than 10 times the bandwidth (around 1000 GB/s vs 90 GB/s), while their latency is over 3 times worse (250 ns vs 70 ns) compared to the 7950X and 13900K.
For a GPU, however, that is a worthwhile trade-off. Workloads that run on GPUs consist of few branches and mostly of arithmetic; what the GPU should compute has already been determined by the CPU. Furthermore, the arithmetic operations are mostly independent of each other. The GPU computes with whatever data is available; it does not have to wait for data from one specific memory address. This is why bandwidth is more valuable to a GPU than latency.

Physical Design

Being optimized for different workloads also translates into physical design differences between CPUs and GPUs. Physical design is the process of geometrically placing transistors for manufacturing.

One way CPUs achieve high single threaded performance is by running at high clock speeds. That, however, requires high voltages and results in high power usage and heat that has to be dissipated. These negative effects can be compensated for by reducing transistor density.

Since GPU performance scales very well with the number of processing units available, the design is optimized to fit as many as possible. This is achieved by designing dies with a bigger area and by prioritizing transistor density over frequency.

Some key metrics are given in the table below:

| Aspect | Ryzen 7950X | i9-13900K | RTX 4090 | RX 7900 XTX |
| --- | --- | --- | --- | --- |
| Transistor Count | 13.1 billion | 14.2 billion | 76.3 billion | 58 billion |
| Die Area | 264.7 mm² | 257 mm² | 608 mm² | 520 mm² |
| Node | IO: TSMC N6, Cores: TSMC N5 | Intel 7 | TSMC N4X | TSMC N6 and N5 |
| Transistor Density | 49.6 MTran/mm² | 55.3 MTran/mm² | 125.5 MTran/mm² | 111.5 MTran/mm² |
| Power | 230 W | 253 W | 450 W | 355 W |
| Power Density | 869 mW/mm² | 984 mW/mm² | 740 mW/mm² | 683 mW/mm² |

It can be observed that the 4090 and 7900 XTX are built on dies of around double the size with around double the transistor density, using over four times as many transistors as a 7950X or 13900K.