The Curious Case of R-Squared: Keep Guessing
Most explanations of R-squared start with a formula:
R² = 1 − (SS_res / SS_tot)
Then they say something like “the proportion of variance explained by the model” and move on. And you nod, and you write it down, and somewhere in the back of your head a small voice says: what does “explained” actually mean?
I went down this rabbit hole recently, and the answer turned out to be much simpler — and much more useful — than the textbook framing. So here it is, the way I wish someone had told me on day one.
The Question Nobody Asks
Mean Squared Error is straightforward. For every house in your dataset, you take the model’s predicted price, subtract the actual price, square it, and average. The smaller the number, the closer your predictions are to reality. Done.
We square the errors in MSE so that negative errors don’t artificially cancel out positive ones. And to bring the result back into real-world units, we can take a square root at the end, giving RMSE – root mean squared error.
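As a quick sketch with made-up numbers (the prices here are invented purely for illustration):

```python
import numpy as np

# Hypothetical actual and predicted house prices, in thousands of dollars
actual = np.array([300.0, 450.0, 250.0, 500.0])
predicted = np.array([320.0, 430.0, 260.0, 480.0])

# MSE: square each error so over- and under-predictions can't cancel out
mse = np.mean((predicted - actual) ** 2)

# RMSE: the square root brings the error back to the original units
rmse = np.sqrt(mse)
```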
But here’s the thing: an MSE of 1,800 means nothing on its own. Is that good? Bad? It depends entirely on what you’re predicting. An MSE of 1,800 on house prices in dollars is excellent. An MSE of 1,800 on shoe sizes is a disaster.
MSE has no built-in sense of scale. It tells you how wrong you are in absolute terms, but it can’t tell you whether your model is doing anything clever — or whether you’d have done about the same by flipping a coin.
That’s the gap R-squared fills.
The Naive Baseline
Imagine you have ten houses and you have to predict their prices, but you’re not allowed to look at any features — no square footage, no bedrooms, no zip code. Nothing.
What’s the best you can do?
Guess the average. For every single house, you predict the mean price of all ten. It’s a lazy, dumb prediction — but it’s the best you can do without information. Some houses you’ll overshoot, some you’ll undershoot, and the mean squared error you rack up is exactly the variance of your dataset.
That number — the squared error you get from blindly predicting the mean — is your floor. It’s what life looks like with zero intelligence.
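You can check that equivalence directly: predicting the mean for every house produces an MSE equal to the variance of the prices. A minimal sketch with invented prices:

```python
import numpy as np

# Hypothetical prices for ten houses, in thousands of dollars
prices = np.array([210, 340, 295, 410, 180, 520, 260, 330, 400, 275], dtype=float)

# The featureless baseline: predict the mean for every single house
baseline_prediction = np.full_like(prices, prices.mean())

# The baseline's MSE is exactly the variance of the data
baseline_mse = np.mean((baseline_prediction - prices) ** 2)
variance = prices.var()  # numpy's default (ddof=0) matches this definition
```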
Now you build a real model. It looks at features. It learns weights. It produces a new MSE. And the only question worth asking is:
How much better is my model than the lazy version that just guesses the average?
That’s R-squared. That’s the whole thing.
R² = 1 − (model's MSE / variance of the data)
         \___________________________________/
           "what fraction of the dumb baseline error
            did the model manage to get rid of?"
If the model cuts the lazy baseline error by 70%, R² = 0.7. If it cuts it by 36%, R² = 0.36. If it does no better than the mean, R² = 0. If it’s worse than the mean (yes, this happens), R² goes negative.
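Written out as code, the definition is a handful of lines. This is a hand-rolled sketch (`r_squared` is a throwaway helper, and the prices are invented; in practice you’d reach for something like scikit-learn’s `r2_score`, which computes the same quantity):

```python
import numpy as np

def r_squared(actual, predicted):
    """Fraction of the mean-baseline error the model eliminated."""
    model_mse = np.mean((actual - predicted) ** 2)
    baseline_mse = np.mean((actual - actual.mean()) ** 2)  # MSE of always guessing the mean
    return 1 - model_mse / baseline_mse

# Hypothetical prices and two sets of predictions
actual = np.array([300.0, 450.0, 250.0, 500.0])
decent = np.array([320.0, 430.0, 260.0, 480.0])   # close to the truth
lazy = np.full_like(actual, actual.mean())        # the mean baseline itself

score_decent = r_squared(actual, decent)  # well above 0: beats the baseline
score_lazy = r_squared(actual, lazy)      # exactly 0: no better than the baseline
```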
The Circle Analogy
Here’s the picture I find easiest to hold in my head.
          N
          *
          |
          |
W *-------+-------* E   ← four houses on a circle of radius r
          |               around the true center (the mean)
          |
          *
          S
Imagine four houses sitting at the four cardinal points of a circle, radius r, centered on the average price. If your “model” is just predict the mean for everyone, then every house is r away from your prediction. Each squared error is r², and the total error across the four houses is 4r². That is your total sum of squares (divide by four and you have the variance). That is the floor.
Now a real model shows up. For each house, it predicts a point on a smaller, concentric circle of radius r/2 — closer to each true house than the center was. Each squared error is now (r/2)² = r²/4. Across four houses, the total is r².
Plug it in:
R² = 1 − (r² / 4r²) = 1 − 0.25 = 0.75
Your model explains 75% of the spread. Meaning: of the original 4r² of “wrongness” you’d have eaten by guessing the mean, the model has eliminated three-quarters of it. The remaining quarter is what the model still can’t account for.
Suddenly “explained variance” stops sounding like jargon and starts sounding like what it actually is: how much of the original mess did we manage to clean up? Or, put another way: how much closer are our estimates to the real values than just guessing the mean for every single one?
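The arithmetic above can be sanity-checked in a few lines. This is just the circle picture turned into code (`circle_r2` is a throwaway name for this sketch, nothing standard):

```python
def circle_r2(r):
    """R² for the circle picture: baseline error 4r², model error 4*(r/2)²."""
    baseline_total = 4 * r ** 2         # predicting the mean: each house is r away
    model_total = 4 * (r / 2) ** 2      # predictions on the concentric r/2 circle
    return 1 - model_total / baseline_total
```

Notice that r cancels out entirely: the answer is 0.75 no matter what radius you pick, because R² is a ratio, not an absolute error.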
A Detail Worth Noticing
Here’s a fun follow-up. What if the model’s predictions sit on a circle of radius 1.5r — bigger than the original spread, on the opposite side of each true house?
Each squared error is still (0.5r)² = r²/4. Total is still r². R² is still 0.75.
R-squared doesn’t care which side of the truth your prediction sits on, or whether your prediction is bigger or smaller than the actual value. It only cares about the size of the gap. Two very different-looking models can have identical R² scores. Worth remembering before you celebrate one.
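To see this concretely, here is a sketch with invented prices where one model undershoots every house by exactly the amount another model overshoots. Their R² scores come out identical:

```python
import numpy as np

def r_squared(actual, predicted):
    """Fraction of the mean-baseline error the model eliminated."""
    model_mse = np.mean((actual - predicted) ** 2)
    baseline_mse = np.mean((actual - actual.mean()) ** 2)
    return 1 - model_mse / baseline_mse

# Hypothetical prices; one model guesses 25 low on every house,
# another guesses 25 high. The gaps are the same size, just on opposite sides.
actual = np.array([300.0, 450.0, 250.0, 500.0])
undershoot_score = r_squared(actual, actual - 25)
overshoot_score = r_squared(actual, actual + 25)
```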
So How Do MSE and R² Actually Relate?
They’re not competing metrics. They’re answering two different questions about the same numbers.
| Metric | Question it answers |
|---|---|
| MSE | On average, how far off are my predictions, in the units of the thing I’m predicting? |
| R² | Compared to the dumbest possible prediction (the mean), what fraction of the error did I get rid of? |
MSE is the raw measurement. R² is MSE graded on a curve — the curve being the worst you’d do without any model at all. One is absolute, the other is relative. You usually want both: MSE tells you what your error looks like in dollars or degrees or grams; R² tells you whether the model is actually pulling its weight.
The One-Liner
R² is just asking: how much better is your model than guessing the average?
That’s it. Not “proportion of variance explained.” Not a ratio of sum-of-squares. Just: did the model beat the laziest possible baseline, and by how much?
Once that clicks, the formula is no longer something you memorize. It’s something you can re-derive on the back of a napkin.
P.S. – We bring MSE back to the real world as RMSE by taking a square root. The same trick connects R² back to a familiar quantity, with one caveat: for simple linear regression (a single feature, fit with an intercept), the square root of R² is the absolute value of the Pearson correlation coefficient “r” between the feature and the target. With multiple features that neat identity no longer holds.
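A quick check of that identity on synthetic data (the coefficients, noise level, and seed are all arbitrary):

```python
import numpy as np

# Synthetic data: a single feature with a linear relationship plus noise
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 2.0 * x + 5.0 + rng.normal(scale=0.5, size=100)

# Simple linear regression: one feature, with an intercept
slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept

# R² computed as "1 minus model MSE over variance"
r2 = 1 - np.mean((y - predicted) ** 2) / np.var(y)

# Pearson correlation between the feature and the target
pearson_r = np.corrcoef(x, y)[0, 1]
```

For this one-feature case, `r2` matches `pearson_r ** 2` to floating-point precision.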