How Neural Networks Learn: Backpropagation Explained
Imagine you've just watched a toddler take their first wobbly steps, fall, then adjust their balance to try again. Now imagine doing that millions of times simultaneously across hundreds of artificial neurons, but instead of scraped knees, we're talking about partial derivatives. That's backpropagation: the algorithm that turned neural networks from curious mathematical toys into the image-generating, language-understanding powerhouses we use today. I still remember the first time I truly got backprop; it felt like suddenly understanding how a magician's trick works, except the magician was calculus and the trick was intelligence itself.
Prerequisites
While this guide builds naturally from our previous exploration of what makes a neural network tick (weights, biases, and that delightful forward pass), I've designed it to stand on its own two feet, just like that toddler. If you know that neural networks make predictions by passing inputs through layers of neurons with weights and activation functions, you're golden. If terms like "loss function" or "gradient" sound like gym equipment, don't worry; we'll unpack them as we go.
The Learning Dilemma: Who Gets the Blame?
Here's the puzzle that kept early AI researchers up at night: when your network makes a wrong prediction (say, calling a cat a toaster), which specific weight caused the mistake?
In a simple linear regression with one variable, this is easy. But in a deep neural network with thousands of weights scattered across multiple layers, it's like trying to figure out which musician in a symphony orchestra hit the wrong note... while the concert is happening in another city... and you're just reading the sheet music.
This is called the credit assignment problem, and solving it is exactly what makes backpropagation so elegant.
🎯 Key Insight: Backpropagation isn't "learning" itself; it's the messenger system. It tells each weight exactly how much it contributed to the final error, so each weight knows how to change.
The Chain Rule: Calculus's Greatest Party Trick
To understand backprop, we need to borrow one idea from calculus: the chain rule. Don't panic; I promise this is friendlier than it sounds!
Think of a neural network as a massive pipeline of functions. Your input goes through Layer 1 (function 1), then Layer 2 (function 2), and so on until you get a prediction (the output).
The chain rule basically says: if you want to know how changing something early in the pipeline affects something at the end, just multiply the changes step-by-step.
I like to think of it like a game of telephone. If I whisper something to you, you whisper to Sarah, and Sarah shouts it to the room, the chain rule tells us how my original whisper affected what Sarah shouted. We just multiply how much my whisper changed what you heard, by how much what you heard changed what Sarah heard, by how much what Sarah heard changed her shout.
Mathematically, if we want to know how changing weight $w$ affects the loss $L$, and $w$ feeds into $z$ which feeds into $a$ which feeds into $L$, we calculate:
\[\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial w}\]

💡 Pro Tip: The "partial derivative" symbol ($\partial$) just means "how much does this output change when I nudge this specific input, holding everything else constant?" Think of it as sensitivity analysis.
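To see the chain rule do real work, here is a tiny numeric check in Python. The functional forms (z = w·x, a = sigmoid(z), squared-error loss) are my own illustrative choices; the point is that the product of the three local derivatives matches a brute-force finite-difference estimate:

```python
import math

# Toy chain w -> z -> a -> L, matching dL/dw = dL/da * da/dz * dz/dw.
# Assumed forms (illustrative only): z = w*x, a = sigmoid(z), L = (a - y)**2.
x, y, w = 0.5, 1.0, 0.8

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = w * x
a = sigmoid(z)
L = (a - y) ** 2

# Each local derivative, then the chain-rule product:
dL_da = 2 * (a - y)
da_dz = a * (1 - a)      # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
dz_dw = x
grad = dL_da * da_dz * dz_dw

# Sanity check against a finite-difference estimate of dL/dw:
eps = 1e-6
L_plus = (sigmoid((w + eps) * x) - y) ** 2
numeric = (L_plus - L) / eps
print(grad, numeric)  # the two values agree to several decimal places
```

If you nudge `w` and recompute the loss by hand, you get (almost exactly) the number the chain rule predicted, which is all backprop ever does, just at scale.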
Backpropagation in Action: A Journey Backward
Now for the magic. Let's walk through what actually happens when we train a network on a single example: teaching it to recognize handwritten digits.
Step 1: Forward Pass (The Setup)
We already covered this in Part 1, but briefly: the image pixels flow forward, get multiplied by weights, have biases added, get squashed through activation functions, and voilà, we get a prediction. Maybe our network thinks this "7" is a "3" with 80% confidence. Oops.
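For concreteness, the forward pass can be sketched in a few lines of Python. The architecture (2 inputs, 1 sigmoid hidden neuron, 1 sigmoid output) and every weight value here are made-up illustrative numbers, not anything from a real digit classifier:

```python
import math

# A minimal forward pass: 2 inputs -> 1 hidden sigmoid neuron -> 1 output.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [0.5, 0.3]              # input features
w_hidden = [0.4, -0.6]      # weights into the hidden neuron
b_hidden = 0.1
w_out, b_out = 0.7, -0.2

z_hidden = w_hidden[0] * x[0] + w_hidden[1] * x[1] + b_hidden
a_hidden = sigmoid(z_hidden)                 # squash through the activation
prediction = sigmoid(w_out * a_hidden + b_out)
print(round(prediction, 3))
```

Every intermediate value computed here (`z_hidden`, `a_hidden`) gets cached, because the backward pass is about to reuse all of them.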
Step 2: Calculate the Loss
We compare our prediction to the truth using a loss function (like cross-entropy or MSE). Let's say our loss is 2.5, a number representing "how wrong we are."
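The two losses named above are one-liners on a single example; here is a minimal sketch (the 0.8 prediction and target of 1.0 are arbitrary example numbers):

```python
import math

# Two common loss functions, evaluated on a single (prediction, target) pair.
def mse(pred, target):
    return (pred - target) ** 2

def binary_cross_entropy(pred, target):
    # pred must lie strictly in (0, 1); target is 0.0 or 1.0
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))

print(round(mse(0.8, 1.0), 4))                    # 0.04
print(round(binary_cross_entropy(0.8, 1.0), 4))   # ~0.2231
```

Both are differentiable, which is the whole point: backprop needs a loss it can take derivatives of.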
Step 3: The Backward Pass (The Main Event)
Here's where backprop earns its name. We work backward from the loss:
- Output Layer: We first ask: "How sensitive is our loss to changes in the final output?" This gives us the error signal for the last layer.
- Hidden Layers: We pass this error backward through each layer. At each stop, we calculate two things:
  - How much the loss changes if we tweak the weights coming into this neuron
  - How much the loss changes if we tweak the inputs to this neuron (so we can pass the error further back)
- Weight Gradients: For every single weight in the network, we now have a number telling us: "If you increase this weight slightly, will the loss go up or down, and by how much?"
⚠️ Watch Out: People often think backprop "fixes" the weights. It doesn't! It only calculates the gradients: the direction and magnitude of change needed. The actual fixing happens in the next step via gradient descent (or your optimizer of choice).
Updating Weights: The Learning Moment
Once backpropagation has delivered its report card to every weight, we perform the actual update. This is typically done via gradient descent:
\[\text{new weight} = \text{old weight} - \text{learning rate} \times \text{gradient}\]

The learning rate is our "how big of a step should we take" parameter. Too large, and we overshoot the solution; too small, and we train until the heat death of the universe.
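Here is that update rule applied repeatedly to a toy loss of my own choosing, $L(w) = (w - 3)^2$, whose gradient is $2(w - 3)$. Each iteration is exactly the formula above:

```python
# Gradient descent on L(w) = (w - 3)**2, gradient = 2*(w - 3).
# Each step applies: new weight = old weight - learning_rate * gradient.
w = 0.0
lr = 0.1
for step in range(50):
    grad = 2 * (w - 3)
    w = w - lr * grad
print(round(w, 4))  # converges toward the minimum at w = 3
```

Try `lr = 1.1` instead and the iterates oscillate away from 3 with growing amplitude, which is the "too large, and we overshoot" failure mode in miniature.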
I find it beautifully poetic that learning in machines mirrors learning in brains, at least conceptually. Biological neurons also strengthen or weaken their synaptic connections, though the best-known biological rule, Hebbian learning, is driven by correlated activity ("neurons that fire together wire together") rather than by an explicit error signal. In our artificial networks, we're doing gradient descent on a loss landscape. Different mechanisms, same goal: reduce error.
💡 Pro Tip: Modern frameworks like PyTorch and TensorFlow handle backpropagation automatically with loss.backward() or tape.gradient(). But understanding why it works makes you infinitely better at debugging when your network refuses to learn!
Real-World Examples: Where the Magic Happens
Computer Vision Training When you train a neural network to detect tumors in medical images, backpropagation is what allows the network to discover that certain pixel patterns (edges, textures) in early layers combine into complex features (shapes, structures) in deeper layers. I've always been fascinated by how the gradients naturally flow to strengthen connections that catch circular shapes when learning to identify tumors, almost as if the math itself "wants" to find meaningful patterns.
Language Models and Next-Token Prediction GPT models predict the next word in a sentence. When they predict "banana" instead of "apple" in the context of "The doctor ate the _____," backpropagation sends error signals all the way back through the transformer layers. This adjusts attention weights so the model learns that "doctor" contexts usually prefer edible objects over electronic ones. It's remarkable that this simple error-correction mechanism, repeated billions of times, gives rise to understanding context and nuance.
Recommendation Systems When Netflix suggests a movie you hate and you thumbs-down it, that feedback triggers backpropagation through their neural nets (simplified, but conceptually similar). The gradients flow back and adjust embeddings, those mysterious vectors representing movies and users, so that next time, movies with embeddings similar to the one you hated are pushed further away from your user vector.
Try It Yourself
Ready to make this concrete? Here are three ways to get your hands dirty:
1. The Spreadsheet Method Create a tiny neural network with just 2 inputs, 1 hidden neuron (with sigmoid), and 1 output. Pick random weights, calculate the forward pass manually for input [0.5, 0.3], then calculate the loss if the target is 1. Now, use the chain rule to calculate $\frac{\partial Loss}{\partial w_1}$ step by step. I did this once with Google Sheets, and watching the gradients update when I tweaked inputs gave me that "aha!" moment.
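If spreadsheets aren't your thing, the same exercise works in a few lines of Python. The "random" weights and the choice of a linear output neuron here are my own; swap in your own numbers and re-run:

```python
import math

# The spreadsheet exercise: 2 inputs -> 1 hidden sigmoid neuron -> 1 output.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, x2, target = 0.5, 0.3, 1.0
w1, w2 = 0.2, -0.5      # input -> hidden weights (arbitrary "random" picks)
w3 = 0.9                # hidden -> output weight

# Forward pass
z = w1 * x1 + w2 * x2
h = sigmoid(z)
y = w3 * h              # linear output neuron, for simplicity
loss = (y - target) ** 2

# Chain rule for dLoss/dw1, one factor per link in the chain:
dloss_dy = 2 * (y - target)
dy_dh = w3
dh_dz = h * (1 - h)
dz_dw1 = x1
dloss_dw1 = dloss_dy * dy_dh * dh_dz * dz_dw1
print(dloss_dw1)
```

The gradient comes out negative here, meaning "increase w1 to reduce the loss", which matches intuition: the prediction undershoots the target of 1.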
2. Andrej Karpathy's Micrograd
Clone Micrograd, a tiny automatic differentiation engine written in pure Python. It's only about 100 lines of code! Trace through how the backward() method recursively applies the chain rule. When you see that a scalar value knows its own gradient and how it contributed to the final loss, you'll understand exactly what PyTorch is doing under the hood.
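To preview the idea before reading the real thing, here is a heavily simplified sketch in the same spirit. The class and method names are mine, not micrograd's exact API, and this toy only handles tree-shaped expressions (micrograd does a topological sort so values used in two places aren't double-counted):

```python
# Each Value remembers which Values it was computed from, and the local
# derivative with respect to each; backward() replays the chain rule in reverse.
class Value:
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def __add__(self, other):
        # d(a+b)/da = d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def backward(self, upstream=1.0):
        # Chain rule: accumulate upstream * local gradient into each parent.
        self.grad += upstream
        for parent, local in zip(self._parents, self._local_grads):
            parent.backward(upstream * local)

w, x, b = Value(0.8), Value(0.5), Value(0.1)
out = w * x + b
out.backward()
print(w.grad, x.grad, b.grad)  # 0.5 0.8 1.0
```

Each scalar literally "knows its own gradient" after `backward()` runs, which is exactly the property the exercise above asks you to notice.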
3. Visualize the Gradients If you're using Python, train a simple network on the MNIST dataset, but add print statements to watch the gradients of specific weights in the first layer over time. You'll notice something beautiful: early in training, gradients are large and chaotic (the network is confused). As training progresses, they become smaller and more stable (the network is converging). It's like watching a student go from panicking to confidently solving problems.
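You can see the same shrinking-gradient effect without MNIST or any framework: train a single sigmoid neuron by gradient descent and log the gradient magnitude each epoch. The four-point dataset and hyperparameters below are entirely made up for illustration:

```python
import math

# Tiny synthetic dataset: label is 1 when x > 0.5 (made up for illustration)
data = [(x, 1.0 if x > 0.5 else 0.0) for x in [0.1, 0.3, 0.6, 0.9]]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, lr = 0.0, 0.0, 1.0
grad_history = []
for epoch in range(200):
    gw = gb = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        d = 2 * (p - y) * p * (1 - p)   # squared error through the sigmoid
        gw += d * x
        gb += d
    grad_history.append(abs(gw))
    w -= lr * gw
    b -= lr * gb

# Gradients start large and shrink as the neuron converges
print(grad_history[0], grad_history[-1])
```

Plot `grad_history` and you get the trajectory described above in miniature: big, jumpy gradients early on, then a steady fade toward zero.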
Key Takeaways
- Backpropagation is the credit assignment algorithm: it determines exactly how much each weight contributed to the final error
- It relies entirely on the chain rule from calculus, working backward from the loss to calculate gradients for every parameter
- It doesn't update weights; it only calculates how to update them. Optimizers like gradient descent perform the actual updates
- The backward pass is computationally efficient because it reuses calculations from the forward pass, making training deep networks feasible
- Understanding backprop helps you debug issues like vanishing gradients (when early layers stop learning) or exploding gradients (when weights update too aggressively)
Further Reading
- 3Blue1Brown Neural Networks Series (Chapters 3 & 4) - Grant Sanderson's visual explanation of backpropagation is the gold standard for building intuition through animation
In Part 3, we'll explore the activation functions that make all this gradient flow possible in the first place. Spoiler: without non-linear activations, backpropagation would be trying to teach a very tall, very confused linear regression model! But for now, go forth and appreciate that every time your phone recognizes your face or autocomplete suggests the right word, somewhere in the silicon, the chain rule is working its magic, backward.