How GPUs Took over Artificial Intelligence

The chips that power modern AI were not designed for artificial intelligence… They were designed to render video game graphics. The fact that the same hardware designed to go into an Xbox ended up driving the neural network revolution is one of the most consequential accidents in the history of computing.

Today, virtually all AI development runs on GPUs. The models behind ChatGPT, image generators, and self-driving car systems were all trained on clusters of graphics cards. NVIDIA, a company that spent most of its existence selling to gamers, is now worth over $3 trillion and sits at the center of a global competition for AI compute – and not because of its ability to render graphics. It is a fascinating story, and understanding how it happened means understanding what GPUs actually do… and why that matters for machine learning.

What GPUs Were Built For

A graphics processing unit is a specialized chip designed to handle the complex math required for rendering images. When a video game draws a frame, it needs to calculate the color of millions of pixels. Each pixel’s color depends on lighting, textures, geometry, and various effects like shadows or reflections. These calculations are relatively simple individually, but there are an enormous number of them, and they all need to happen fast enough to produce 60 or more frames per second.
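
To make this concrete, here is a toy sketch in Python (the resolution, the random “normals,” and the simple one-light shading model are all illustrative assumptions, not how any real game engine works). The point is that every pixel’s brightness is computed from that pixel’s own data, with no reference to its neighbors:

```python
import numpy as np

# Toy shading pass over a 1080p frame: brightness = dot(surface normal, light direction).
h, w = 1080, 1920
normals = np.random.rand(h, w, 3) - 0.5               # fake per-pixel surface normals
normals /= np.linalg.norm(normals, axis=2, keepdims=True)
light = np.array([0.0, 0.0, 1.0])                     # a single light facing the screen

# One vectorized operation shades all ~2 million pixels; nothing about
# pixel (i, j) depends on any other pixel, so a GPU can assign each one
# to its own thread and compute them all simultaneously.
brightness = np.clip(normals @ light, 0.0, 1.0)       # shape (1080, 1920)
```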

The key insight behind GPU design is that most of these calculations are independent of each other. (For those experienced in neural network design – that should be your aha moment!) The color of one pixel doesn’t depend on the color of the pixel next to it. They can all be computed at the same time if you have enough processing units working in parallel.

This is fundamentally different from how CPUs work. A central processing unit is designed to handle complex, varied tasks sequentially. It has a small number of powerful cores (typically 4 to 16 in consumer chips) that can each execute sophisticated instructions. CPUs are good at tasks where each step depends on the result of the previous one, where the work involves branching logic and varied operations. For example, recalculating a spreadsheet means walking a dependency chain: a cell that references other cells cannot be computed until those cells are finished, so much of the work is inherently sequential.

GPUs take the opposite approach. Instead of a few powerful cores, they have thousands of smaller, simpler cores. These cores are not individually impressive – they certainly cannot handle the complex branching logic your Excel workbook needs. But what they can do is execute the same instruction at the same time on many different pieces of data. This is called SIMD (single instruction, multiple data) or, more broadly, parallel processing.
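
Here is the contrast in miniature, sketched in Python (the running-balance function is a made-up stand-in for spreadsheet-style work). The first task is stuck running one step at a time because each step needs the previous result; the second is the SIMD pattern – one instruction applied across a million independent values:

```python
import numpy as np

# CPU-style work: each step depends on the previous one, like a cell
# that references the cell computed before it. This cannot be parallelized.
def running_balance(transactions, start=0.0):
    balance, history = start, []
    for t in transactions:        # inherently sequential: step i needs step i-1
        balance += t
        history.append(balance)
    return history

# GPU/SIMD-style work: the same instruction over many values at once,
# because no element's result depends on any other element's.
values = np.arange(1_000_000, dtype=np.float32)
doubled = values * 2.0            # one vectorized multiply across a million elements
```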

The Parallel Processing Connection

Neural networks, as it turns out, have a computational structure that looks a lot like graphics rendering. The overlap is completely accidental, but it is exactly what made GPUs the natural hardware for machine learning.

Training a neural network involves repeatedly passing data through layers of artificial neurons, calculating errors, and adjusting the connections between neurons to reduce those errors. The core mathematical operation is matrix multiplication. You take a matrix of input values, multiply it by a matrix of weights, apply some function to the result, and pass it to the next layer. Then you do this again, across millions or billions of parameters, for millions of training examples.
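
In code, a single layer of that loop might look like the following NumPy sketch (the batch size, layer widths, and ReLU activation are assumed purely for illustration):

```python
import numpy as np

# Hypothetical sizes: a batch of 64 examples through one 1024-wide layer.
batch, n_in, n_out = 64, 1024, 1024
X = np.random.randn(batch, n_in).astype(np.float32)   # input activations
W = np.random.randn(n_in, n_out).astype(np.float32)   # learned weight matrix
b = np.zeros(n_out, dtype=np.float32)                  # learned biases

# The core of a layer: multiply by the weights, add biases, apply a
# nonlinearity, and hand the result to the next layer.
H = np.maximum(X @ W + b, 0.0)                         # ReLU(XW + b), shape (64, 1024)
```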

These operations are what computer scientists call “embarrassingly parallel.” If you’re multiplying a 1000×1000 matrix by another 1000×1000 matrix, each of the million entries in the result is its own dot product of one row and one column. None of those entries depends on any other, so all million can be computed at the same time.
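
You can verify that independence directly: any single entry of the product is just the dot product of one row with one column, computable without looking at any other entry (a minimal NumPy demonstration, with arbitrarily chosen indices):

```python
import numpy as np

A = np.random.randn(1000, 1000)
B = np.random.randn(1000, 1000)

# Entry (i, j) of the product depends only on row i of A and column j of B.
i, j = 42, 7
c_ij = A[i, :] @ B[:, j]

# The full product computes all million entries; each agrees with the
# standalone calculation, and none of them reads any other entry.
C = A @ B
assert np.isclose(C[i, j], c_ij)
```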

This is exactly what GPUs are built for. The same architecture that lets a GPU calculate the color of a million pixels simultaneously lets it perform a million matrix operations simultaneously. A calculation that might take a CPU hours to complete sequentially can finish in minutes on a GPU running in parallel.
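
You can get a rough feel for the gap yourself, assuming PyTorch and a CUDA-capable GPU are available (this is a sketch, not a careful benchmark – exact numbers depend entirely on the hardware):

```python
import time
import torch

n = 4096
a, b = torch.randn(n, n), torch.randn(n, n)

t0 = time.perf_counter()
_ = a @ b                          # matrix multiply on a handful of CPU cores
cpu_s = time.perf_counter() - t0

a_gpu, b_gpu = a.cuda(), b.cuda()  # move the data to GPU memory
_ = a_gpu @ b_gpu                  # warm-up: the first call pays one-time setup costs
torch.cuda.synchronize()           # GPU work is asynchronous; wait before timing

t0 = time.perf_counter()
_ = a_gpu @ b_gpu                  # same multiply spread across thousands of GPU cores
torch.cuda.synchronize()
gpu_s = time.perf_counter() - t0

print(f"CPU: {cpu_s:.3f}s   GPU: {gpu_s:.3f}s")
```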

Researchers recognized this connection in the late 2000s. Neural networks had existed for decades, and the basic algorithms for training them (like backpropagation) had been known since the 1980s. But running these algorithms at any meaningful scale was computationally prohibitive. The math was straightforward. The problem was doing enough of it fast enough.

The Breakthrough Moment

The moment the field shifted came in 2012, when a team of researchers from the University of Toronto entered the ImageNet Large Scale Visual Recognition Challenge, an annual competition to classify images into 1,000 categories. The team consisted of Alex Krizhevsky, Ilya Sutskever, and their supervisor Geoffrey Hinton.

Their entry, a deep convolutional neural network that came to be called AlexNet, was trained on two NVIDIA GTX 580 graphics cards. These were consumer gaming GPUs, not specialized research hardware. The network had about 60 million parameters and was trained on 1.2 million images.

AlexNet won the competition with an error rate of 15.3%, compared to the runner-up’s 26.2%. This was not a marginal improvement. The result demonstrated something that had been theorized but not proven at scale: deep neural networks actually worked if you could throw enough computation at them. Deeper networks with more parameters, trained on more data, produced dramatically better results. And GPUs made that computation accessible to a research lab rather than requiring a supercomputer.

The field moved quickly after that. Within a few years, GPU-trained neural networks had surpassed previous approaches in speech recognition, machine translation, and game playing. The 2012 ImageNet result is now commonly cited as the beginning of the modern AI era. The hardware that made it possible was a pair of graphics cards designed to run video games.