Teaching AI Models: Gradient Descent
Post 1b/N
In the last post, we established the big idea: machine learning is about finding patterns from data instead of writing rules by hand. But we skipped a critical question — how does the machine actually find the patterns?
When someone says “we trained a model,” what physically happened? When Word2Vec “nudges” word vectors closer together, what’s doing the nudging? When a neural network “learns” to recognize cats in photos, what does “learns” actually mean?
The answer is the same in every case. Whether it’s a simple linear regression from the 1800s or GPT-4 with a reported 1.8 trillion parameters — they all learn using the same core engine: gradient descent. This is the single most important algorithm in all of machine learning. Once you understand it, every model you encounter — from XGBoost to transformers to diffusion models — becomes a variation on the same theme.
The Setup: What Does a Model Need to Learn?
Before we get to gradient descent, let’s be precise about what “learning” means for a machine.
Every ML model has parameters — numbers that control its behavior. A simple model might have 2 parameters. GPT-4 reportedly has around 1.8 trillion. These parameters start at random values. Learning is the process of adjusting these parameters until the model produces good outputs.
But how does the model know whether its current parameters are good or bad? It needs a scoring system — a way to measure how wrong it is. This is called the loss function (also called the cost function or error function). The loss function takes the model’s prediction, compares it to the correct answer, and outputs a single number: the loss. Higher loss means worse predictions. Lower loss means better predictions.
Mental Model: The Golf Score
Think of the loss function like a golf score. In golf, lower is better. A score of 72 means you’re playing well. A score of 110 means you’re struggling. You don’t need to know the exact rules of golf to understand the goal: make the number go down. That’s all the model is trying to do — make the loss go down.
```
Model's prediction: $350,000     Actual price: $320,000
Loss: (350,000 − 320,000)² = 900,000,000   (big number = big mistake)

Model's prediction: $322,000     Actual price: $320,000
Loss: (322,000 − 320,000)² = 4,000,000     (smaller number = getting closer)
```
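The arithmetic above takes two lines of code. A minimal sketch of a squared-error loss (the function name is illustrative):

```python
# Squared-error loss: square the gap between prediction and truth.
# Squaring keeps the loss positive and punishes big misses far more
# than small ones.
def squared_error(prediction, actual):
    return (prediction - actual) ** 2

print(squared_error(350_000, 320_000))  # 900000000 — big mistake
print(squared_error(322_000, 320_000))  # 4000000 — getting closer
```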
The entire goal of training is: find the parameter values that minimize the loss.
Gradient Descent: Finding the Bottom of the Valley
Now we arrive at the core idea. You have a model with parameters that are currently wrong. You have a loss function that tells you how wrong. You need to adjust the parameters to make the loss smaller. How?
Mental Model: The Blindfolded Hiker
Imagine you’re dropped onto a mountain landscape, blindfolded. Your goal is to reach the lowest point in the valley. You can’t see the terrain, but you can feel the slope under your feet. At each step, you check: “Which direction is downhill?” Then you take a step in that direction. You repeat this — feel the slope, step downhill, feel the slope, step downhill — until you reach a point where every direction feels flat or uphill. You’ve found a valley floor.
That’s gradient descent. The “mountain landscape” is the loss function plotted across all possible parameter values. The “slope” is the gradient — a mathematical measure of which direction increases the loss fastest. You step in the opposite direction (downhill) to decrease the loss. Repeat until you reach the bottom.
The Loss Landscape (simplified to 1 parameter):

```
Loss
 ↑
 | *
 |  *
 |   *
 |    *          *
 |     *        *
 |      *      *
 |        *  ← You want to end up here (minimum loss)
 +------------------------→ Parameter Value
```
1. Start somewhere random.
2. Feel the slope (compute the gradient).
3. Step downhill (adjust the parameter).
4. Repeat.
The word “gradient” just means slope — specifically, the direction and steepness of the slope at your current position. In one dimension, the gradient is just “uphill to the left” or “uphill to the right.” In a million dimensions (a model with a million parameters), the gradient points in the steepest uphill direction across all million axes simultaneously. You step the opposite way.
That’s it. That’s the entire algorithm. Compute the gradient, step opposite to it, repeat. Every AI model you’ve ever heard of learns this way.
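The entire algorithm is short enough to write out. A toy sketch with a single parameter and the loss (w − 3)², whose slope is 2(w − 3) — all values chosen purely for illustration:

```python
# loss(w) = (w - 3)**2 — a single valley with its floor at w = 3
def gradient(w):
    return 2 * (w - 3)   # the slope of the loss at w

w = 10.0                 # start somewhere random
learning_rate = 0.1
for _ in range(100):
    slope = gradient(w)           # feel the slope
    w -= learning_rate * slope    # step downhill (opposite the slope)

print(round(w, 3))  # 3.0 — the hiker reached the valley floor
```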
The Learning Rate: How Big Are Your Steps?
There’s one critical knob that controls the whole process: the learning rate — how big a step you take each time.
Mental Model: The Blindfolded Hiker’s Stride
If the hiker takes enormous strides, they’ll move fast but might overshoot the valley — stepping right past the bottom and ending up on the other side. If they take tiny baby steps, they’ll be very precise but might take forever to reach the bottom — or get stuck in a small dip that isn’t the real valley floor.
Learning Rate Too High:

```
Loss
 | *       *       *    ← Bouncing back and forth
 |  *     * *     *     ← Never converges
 |   *   *   *   *
 |    * *     * *       ← Overshooting the minimum
```

Learning Rate Too Low:

```
Loss
 | *
 |  *
 |   *
 |    *
 |     *    ← Crawling toward the minimum
 |      *   ← Will get there... eventually
 |       *  ← (or get stuck in a local dip)
```

Learning Rate Just Right:

```
Loss
 | *
 |   *
 |     *
 |       *
 |         *
 |           *  ← Steady descent to the minimum
```
In practice, finding the right learning rate is one of the most important (and frustrating) parts of training a model. Too high and the model never converges — the loss bounces around chaotically. Too low and training takes forever or gets stuck. Most modern training systems use adaptive learning rates that start larger and get smaller as training progresses — like a hiker who takes big strides on the open mountainside and switches to careful steps near the valley floor.
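The three pictures can be reproduced on the same toy loss, (w − 3)², by changing nothing but the step size (the specific rates are illustrative):

```python
# Run gradient descent on loss(w) = (w - 3)**2 with three step sizes.
def descend(lr, steps=50, start=10.0):
    w = start
    for _ in range(steps):
        w -= lr * 2 * (w - 3)    # gradient of (w - 3)**2 is 2 * (w - 3)
    return w

too_high   = descend(lr=1.1)     # overshoots harder each step: diverges
too_low    = descend(lr=0.001)   # after 50 steps, still crawling
just_right = descend(lr=0.1)     # settles essentially at the minimum

print(abs(too_high - 3) > 1_000)    # True — bounced out of the valley
print(abs(too_low - 3) > 5)         # True — barely moved from the start
print(abs(just_right - 3) < 0.01)   # True — converged
```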
The PM Takeaway: When your ML team says “training is unstable” or “the model isn’t converging,” the learning rate is one of the first suspects. When they say “training is taking too long,” it might be set too low. You don’t need to tune it yourself, but knowing what it is helps you follow the conversation.
The Problem with Vanilla Gradient Descent
The version of gradient descent we just described has a practical problem: it’s too slow for real-world datasets.
Here’s why. To compute the gradient accurately, you need to run the model on every single example in your training set, add up all the errors, and only then take one step. If your training set has 10 million examples, that’s 10 million forward passes before a single parameter update.
Mental Model: The Restaurant Review
Imagine you’re a chef trying to improve your pasta recipe. Vanilla gradient descent says: cook the pasta for every single customer in the restaurant (10,000 people), collect all their feedback, average it, and then make one adjustment to the recipe. Then cook for all 10,000 again. Then make one adjustment. This is incredibly thorough — you’re basing each change on complete information — but it’s absurdly slow. You’d make maybe one recipe change per week.
This is called Batch Gradient Descent, and for modern datasets with millions or billions of examples, it’s computationally impractical.
Stochastic Gradient Descent: Learning from One Example at a Time
The fix is elegant: instead of computing the gradient over the entire dataset, compute it on one randomly selected example and take a step immediately. Then pick another random example and take another step. This is Stochastic Gradient Descent (SGD).
Mental Model: The Food Truck Chef
Instead of cooking for 10,000 people and waiting for all their reviews, you run a food truck. You serve one customer, hear their feedback (“too salty”), and immediately adjust the recipe. Serve the next customer, hear feedback (“needs more garlic”), adjust again. Each individual piece of feedback might be noisy — one person’s “too salty” is another person’s “perfect” — but over hundreds of customers, the adjustments average out and you converge on a great recipe. And you’re improving continuously instead of waiting for a massive batch of feedback.
Batch Gradient Descent (Vanilla):

```
See ALL 10,000,000 examples → Compute average gradient → Take 1 step
See ALL 10,000,000 examples → Compute average gradient → Take 1 step
(Accurate direction, but painfully slow)
```

Stochastic Gradient Descent (SGD):

```
See example #4,721   → Compute gradient → Take 1 step
See example #891,003 → Compute gradient → Take 1 step
See example #52      → Compute gradient → Take 1 step
(Noisy direction, but incredibly fast)
```
The path of SGD is noisier than batch gradient descent — each individual example might push the parameters in a slightly wrong direction. But this noise is actually helpful: it prevents the model from getting stuck in shallow local valleys (small dips that aren’t the true minimum). The randomness gives the hiker enough jitter to bounce out of minor dips and keep searching for the real valley floor.
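A food-truck-sized sketch of SGD: fit a single weight w in y ≈ w·x to noisy data, one randomly chosen example per step. The data and learning rate are made up for illustration:

```python
import random
random.seed(0)

# Noisy data from a true relationship y = 2x (x between 0 and 1).
data = [(x, 2 * x + random.gauss(0, 0.1))
        for x in (random.uniform(0, 1) for _ in range(200))]

w = 0.0       # start wrong
lr = 0.05
for _ in range(10_000):
    x, y = random.choice(data)    # one random example...
    grad = 2 * (w * x - y) * x    # ...its (noisy) gradient of (w*x - y)**2
    w -= lr * grad                # ...and an immediate update

print(abs(w - 2) < 0.3)  # True — the noisy steps average out near w = 2
```

Each individual update may point slightly the wrong way, but ten thousand of them land close to the true weight.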
Mini-Batch SGD: The Best of Both Worlds
In practice, nobody uses pure SGD (one example at a time) or pure batch (all examples). The standard approach is mini-batch SGD — compute the gradient on a small random batch of examples (typically 32, 64, or 256) and take a step.
Mental Model: The Focus Group
Instead of surveying one person (too noisy) or all 10,000,000 people (too slow), you survey a focus group of 64 random people. Their averaged feedback is reasonably representative and you can collect it quickly. You run hundreds of focus groups per day, each with a different random group of 64, and adjust your recipe after each one.
| Approach | Examples per Step | Speed | Accuracy per Step | Used In Practice? |
|---|---|---|---|---|
| Batch GD | All (millions) | Very slow | Very accurate | Rarely |
| SGD | 1 | Very fast | Very noisy | Rarely alone |
| Mini-batch SGD | 32-256 | Fast | Good enough | Almost always |
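The focus-group version differs from pure SGD by a single line: average the gradient over a small random batch before stepping. A sketch with an illustrative batch size of 32:

```python
import random
random.seed(0)

# The same kind of noisy y = 2x data, but updated once per batch.
data = [(x, 2 * x + random.gauss(0, 0.1))
        for x in (random.uniform(0, 1) for _ in range(1_000))]

w = 0.0
lr = 0.1
for _ in range(2_000):
    batch = random.sample(data, 32)   # one random "focus group" of 32
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    w -= lr * grad                    # one step per averaged batch

print(abs(w - 2) < 0.1)  # True — steadier than one example at a time
```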
The PM Takeaway: When your team talks about “batch size,” this is what they mean — how many examples the model sees before each parameter update. Larger batches are more stable but need more memory (GPU RAM). Smaller batches train faster per step but are noisier. It’s a practical tradeoff, not a theoretical one — often constrained by how much GPU memory you can afford.
Local Minima and Saddle Points: Why the Landscape Is Treacherous
The blindfolded hiker analogy works well for simple landscapes with one valley. Real loss landscapes — especially for deep neural networks with millions of parameters — are far more complex.
Mental Model: The Mountain Range
Instead of one valley, imagine an entire mountain range with thousands of valleys, ridges, and plateaus. Some valleys are deep (good solutions). Some are shallow (mediocre solutions that the model can get stuck in). And some are saddle points — spots where the ground is momentarily flat underfoot, like the middle of a horse saddle: a low point along one axis but a high point along another. The hiker feels “stuck” even though there’s a better valley nearby.
Loss Landscape (2D slice):

```
Loss
 |  *
 |   *       *
 |    *     * *   ← Local minimum (shallow valley — "pretty good" but not best)
 |     *   *   *
 |      * *     *         *
 |       *       *       *
 |                *     *
 |                 * * *
 |                   *    ← Global minimum (deepest valley — best solution)
 +-----------------------------→ Parameter Value
```
This is where the “stochastic” in SGD earns its keep. The noise from random sampling acts like occasional earthquakes that shake the hiker out of shallow valleys. A hiker doing smooth, precise batch gradient descent might settle into the nearest shallow valley and stay there forever. A hiker doing SGD stumbles around enough to escape shallow traps and find deeper, better valleys.
Modern optimizers like Adam (used in nearly all LLM training) combine the core idea of gradient descent with clever tricks: they keep a running memory of past gradients to build momentum (like a ball rolling downhill that can coast over small bumps) and they adapt the learning rate separately for each parameter. But under the hood, it’s still gradient descent — compute the slope, step downhill.
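The momentum trick fits in a few lines on the same toy loss, (w − 3)². The β and learning rate here are illustrative; real Adam adds per-parameter adaptive rates on top of this idea, but the gradient-descent core is visible:

```python
# Gradient descent with momentum on loss(w) = (w - 3)**2.
# `velocity` is the running memory of past gradients — the rolling ball.
def gradient(w):
    return 2 * (w - 3)

w, velocity = 10.0, 0.0
lr, beta = 0.05, 0.9     # beta: how much of the past velocity to keep
for _ in range(200):
    velocity = beta * velocity + gradient(w)  # accumulate momentum
    w -= lr * velocity                        # step along the velocity

print(abs(w - 3) < 0.01)  # True — coasted down to the minimum
```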
Gradient Descent in the Wild: What’s Actually Happening When Models Train
Let’s connect this back to the models we’ve discussed:
Linear Regression
Parameters: a slope and an intercept (2 parameters). Loss: mean squared error between predicted and actual house prices. Gradient descent adjusts the slope and intercept until the line fits the data. This typically converges in seconds.
Word2Vec
Parameters: the embedding vector for every word in the vocabulary (300 dimensions × 100,000 words = 30 million parameters). Loss: negative log likelihood of predicting context words. Gradient descent nudges word vectors closer when words co-occur and farther apart when they don’t. This is the “nudging” we described in Post 2a.
Sentence-BERT
Parameters: transformer weights that produce sentence embeddings (~110 million parameters). Loss: contrastive loss — penalize when matching pairs are far apart or non-matching pairs are close. Gradient descent adjusts the transformer weights to produce an embedding space where similarity reflects meaning. This is the “seating penalty” from Post 2c.
GPT / LLMs
Parameters: billions of transformer weights (a reported 1.8 trillion for GPT-4). Loss: cross-entropy loss on next-token prediction. Gradient descent adjusts all weights so the model gets better at predicting the next word. This runs for weeks on thousands of GPUs.
The Universal Pattern:
1. Initialize parameters randomly
2. Feed in training data
3. Compute loss (how wrong is the model?)
4. Compute gradient (which direction makes it more wrong?)
5. Step in the opposite direction (make it less wrong)
6. Repeat steps 2-5 millions or billions of times
7. Stop when the loss is low enough
This is the same for ALL of the above models.
The only differences: the architecture, the loss function, and the scale.
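The seven steps, end to end, on the smallest model in the list — linear regression with two parameters. The data is synthetic, made up for illustration:

```python
import random
random.seed(0)

# Synthetic data from a true line y = 3x + 1, plus noise.
data = [(x, 3 * x + 1 + random.gauss(0, 0.1))
        for x in (random.uniform(0, 2) for _ in range(500))]

slope, intercept = random.random(), random.random()  # 1. random init
lr = 0.05
for _ in range(5_000):
    x, y = random.choice(data)             # 2. feed in training data
    error = (slope * x + intercept) - y    # 3. how wrong? (loss = error**2)
    grad_slope = 2 * error * x             # 4. gradient per parameter
    grad_intercept = 2 * error
    slope -= lr * grad_slope               # 5. step opposite the gradient
    intercept -= lr * grad_intercept       # 6. repeat (this loop)

# 7. stop — the parameters have settled near the true line
print(abs(slope - 3) < 0.3, abs(intercept - 1) < 0.3)  # True True
```

Swap in transformer weights for the slope and intercept, and cross-entropy for squared error, and this is structurally the same loop that trains an LLM.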
Backpropagation: How the Gradient Gets Computed
One more piece to complete the picture. In a deep neural network with many layers, how does the model know which parameters to adjust and by how much?
Mental Model: The Assembly Line Blame Game
Imagine a car factory with 50 stations on the assembly line. A defective car rolls off the end. The quality inspector (loss function) says: “This car has a crooked bumper.” Now someone has to figure out which station caused the problem. Was it station 47 where the bumper was attached? Or station 12 where the frame was bent slightly, causing everything downstream to be off?
Backpropagation is the process of tracing the error backward through every station, assigning each one a share of the blame proportional to how much it contributed to the final defect. Station 47 might get 60% of the blame (direct cause). Station 12 might get 25% (indirect cause). Station 3 might get 1% (barely related). Each station then adjusts its process proportionally.
This is why it’s called backpropagation — the error signal propagates backward from the output through every layer of the network, computing the gradient for each parameter along the way. Without backpropagation, gradient descent would be impractical for deep networks — you’d have no efficient way to compute the gradient for millions of parameters simultaneously.
The PM Takeaway: Backpropagation is why deep neural networks can train at all. It’s also why training is computationally expensive — every single training step requires a forward pass (make a prediction), a loss computation (measure the error), and a backward pass (trace the blame through every layer). For a model with billions of parameters, this is an enormous amount of computation per step, multiplied by millions of steps.
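The blame game in miniature: a two-layer “network” with one weight per layer, where the chain rule routes the error backward through both stations. A toy sketch with made-up numbers:

```python
# prediction = w2 * (w1 * x): two layers, one weight each.
x, target = 2.0, 10.0
w1, w2 = 1.0, 1.0
lr = 0.01

for _ in range(500):
    hidden = w1 * x           # forward pass, layer 1
    pred = w2 * hidden        # forward pass, layer 2
    error = pred - target     # loss = error**2
    # backward pass: the chain rule assigns each weight its blame
    grad_w2 = 2 * error * hidden    # w2's direct effect on the output
    grad_w1 = 2 * error * w2 * x    # w1's blame flows back through w2
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2

print(abs(w2 * w1 * x - target) < 1e-6)  # True — the defect is gone
```

Notice that w1’s gradient contains w2: how much blame reaches an early layer depends on everything downstream of it, exactly like the bent frame at station 12.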
Common Misconceptions
“The model finds the best possible solution.” Almost never. Gradient descent finds a good enough solution — a local minimum that performs well. With billions of parameters, the loss landscape is so complex that nobody knows where the true global minimum is, and it probably doesn’t matter. In practice, there are many good solutions, and SGD is excellent at finding one of them.
“More training is always better.” No — this leads to overfitting (remember the portrait artist from Post 1). At some point, the model starts memorizing the training data instead of learning general patterns. The loss on training data keeps going down, but performance on new data starts getting worse. This is why teams monitor validation loss — loss on data the model hasn’t trained on — and stop training when it starts rising.
“Gradient descent is slow.” The algorithm itself is simple and fast. What’s slow is the scale — computing gradients across billions of parameters on billions of training examples. The algorithm is efficient. The hardware bill is the problem.
“AI learning is like human learning.” Superficially, yes — both involve making mistakes and adjusting. But gradient descent is purely mechanical: compute a number, adjust a number, repeat. There’s no understanding, no insight, no “aha” moment. It’s pure optimization — finding parameter values that minimize a score. Powerful, but fundamentally different from how humans learn.
The Mental Models — Your Cheat Sheet
| Concept | Mental Model | One-Liner |
|---|---|---|
| Loss Function | The Golf Score | One number that measures how wrong you are — lower is better |
| Gradient Descent | The Blindfolded Hiker | Feel the slope, step downhill, repeat |
| Learning Rate | The Hiker’s Stride | Too big = overshoot, too small = stuck |
| Batch Gradient Descent | The Restaurant Review | Survey everyone, then make one change (slow but thorough) |
| Stochastic Gradient Descent | The Food Truck Chef | One customer’s feedback, immediate adjustment (fast but noisy) |
| Mini-Batch SGD | The Focus Group | Survey 64 random people, adjust, repeat (the practical sweet spot) |
| Local Minima | Shallow Valleys in a Mountain Range | Good-enough solutions the model can get trapped in |
| Backpropagation | The Assembly Line Blame Game | Trace the error backward to assign each layer its share of blame |
Final Thought
Gradient descent is the unifying thread that ties all of AI together. Every model you’ll encounter — from a two-parameter linear regression to a trillion-parameter LLM — learns by the same core loop: make a prediction, measure the error, compute the slope, step downhill, repeat.
Three things to carry with you:
- Learning = optimization. When someone says a model “learned” something, what they mean is: gradient descent adjusted the parameters until the loss function reached a low value. There’s no magic, no understanding — just a relentless, mechanical search for parameter values that minimize error.
- SGD’s noise is a feature, not a bug. The randomness of stochastic gradient descent prevents models from getting stuck in shallow valleys and helps them find better solutions. This is why mini-batch SGD is the universal standard — it’s fast enough to be practical and noisy enough to be effective.
- Everything else is details. The architecture (transformer, CNN, tree ensemble), the loss function (MSE, cross-entropy, contrastive loss), and the optimizer (SGD, Adam, AdaGrad) all vary. But the core loop is always the same. Once you see this, the entire field of ML becomes variations on a single theme.
In the next post, we’ll use this foundation to understand embeddings — how machines translate words, images, and concepts into points in space. You’ll see gradient descent in action: Word2Vec uses it to nudge word vectors. Sentence-BERT uses it to shape semantic spaces. And the loss functions we just covered are the exact “scoring systems” that drive the learning. The engine is the same — only the vehicle changes.