ML Foundations – Linear Combinations to Logistic Regression
Post 1a/N
Every machine learning model — from simple house price predictors to neural networks with billions of parameters — starts with the same fundamental building block: the linear combination. Take some inputs, multiply each by a weight, and add them up. That’s it. Everything else is variations on this theme.
In the previous post, we established the paradigm shift: machine learning finds patterns from data instead of following human-written rules. But we kept it abstract — “the model finds patterns.” This post makes it concrete. You’ll see the actual mechanics of the simplest ML models: how they combine inputs, how they make predictions, and how they measure their own mistakes. No heavy math — just mental models and small numbers you can verify in your head.
Part 1: Linear Combination — The Building Block
A linear combination is the atomic unit of machine learning: take some inputs, multiply each by a weight, and add them up.
Mental Model: The Recipe
Think of baking a cake. Your recipe says: 2 cups flour, 1 cup sugar, 0.5 cups butter. The final batter is a “combination” of ingredients, each multiplied by an amount (the weight). Change the weights, change the outcome. More sugar makes it sweeter. More butter makes it richer. A linear combination works exactly the same way — each input contributes to the output proportionally to its weight.
For predicting house prices:
Price = (w1 × square_feet) + (w2 × bedrooms) + (w3 × location_score) + bias
The model’s job is to find the right weights that make this formula predict accurately. The bias is a baseline value — the price a house would have even with zero square footage and zero bedrooms (obviously unrealistic, but mathematically necessary as a starting point for the line).
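The formula above translates directly into code. A minimal sketch — the weights and bias here are made-up numbers for illustration, not values learned from data:

```python
# The house-price formula as a linear combination.
# All weights and the bias are hypothetical, purely for illustration.
def predict_price(square_feet, bedrooms, location_score):
    w1, w2, w3 = 200.0, 15_000.0, 10_000.0  # hypothetical weights
    bias = 50_000.0                          # the baseline value
    return w1 * square_feet + w2 * bedrooms + w3 * location_score + bias

# 1,500 sq ft, 3 bedrooms, location score 8:
# 200*1500 + 15000*3 + 10000*8 + 50000 = 475,000
print(predict_price(1500, 3, 8))  # 475000.0
```

Changing any weight changes how much that ingredient matters — exactly the recipe analogy.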
Why “Linear”?
It’s called “linear” because there’s no squaring, cubing, or other transformations of the inputs. Each input contributes proportionally to the output. Double the square footage, double its contribution to price (assuming the weight stays constant).
This simplicity is both a strength and a limitation. Strength: easy to understand, fast to compute, and surprisingly effective for many real-world problems. Limitation: it can’t capture curved or complex relationships without help — polynomial terms and neural networks exist to fill that gap, and we’ll touch on both later in this post. Logistic regression, covered next, addresses a different limitation: producing valid probabilities.
Part 2: Linear Regression — Finding the Best Line
Linear regression takes the linear combination idea and asks: given a bunch of real data points, what weights produce the best predictions?
Mental Model: The Line Through the Scatter
Imagine you have data on 1,000 houses — their square footage and sale prices. Plot them on a graph. The points form a rough upward trend, but they’re scattered — no two houses with the same square footage sold for the exact same price. Linear regression finds the single straight line that best captures this trend. Not a line that hits every point (that’s impossible with real data), but the line that minimizes the overall distance between itself and all the points.
Price ($)
| *
| * *
| * * ← Each * is a real house sale
| * * *
| * *
| * *
+----------------------→ Square Feet
Linear regression finds the best straight line through these points.
But Which Line?
With just two points, you can draw exactly one line connecting them. But with 1,000 scattered points, there are infinitely many candidate lines. Which one is “best”?
We measure “best” by how wrong the predictions are. For each house, the line predicts a price based on square footage. The actual sale price is known. The difference is the error. The best line minimizes the total error across all houses. How we measure that total error is the job of the loss function — covered in Part 4.
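To make “minimizes the total error” concrete, here is a toy comparison of two candidate lines on three made-up houses. The numbers are invented for illustration:

```python
# Three toy houses: (square_feet, actual_price). Made-up numbers.
houses = [(1000, 250_000), (1500, 350_000), (2000, 460_000)]

def total_squared_error(slope, intercept):
    # Sum the squared gap between the line's prediction and each actual price.
    return sum((slope * sqft + intercept - price) ** 2
               for sqft, price in houses)

# Two candidate lines: the second tracks the data more closely.
line_a = total_squared_error(150, 100_000)   # predicts 250k, 325k, 400k
line_b = total_squared_error(210, 35_000)    # predicts 245k, 350k, 455k
print(line_a > line_b)  # True — line B has lower total error, so it's "better"
```

Linear regression is, in effect, a search over all possible slopes and intercepts for the pair with the lowest total error.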
What Linear Regression Produces
After training, linear regression gives you concrete weights:
Price = $200 × square_feet + $15,000 × bedrooms + $50,000
Weights learned:
- Each square foot adds $200
- Each bedroom adds $15,000
- Base price (bias) is $50,000
Now you can predict prices for houses you’ve never seen by plugging in their features. That’s the “pattern” the model found — the mathematical relationship between features and price.
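The learned weights above become a prediction function you can run on unseen houses — a minimal sketch using exactly the numbers from the formula:

```python
# The trained model from above: Price = $200 × sqft + $15,000 × bedrooms + $50,000
def price(square_feet, bedrooms):
    return 200 * square_feet + 15_000 * bedrooms + 50_000

print(price(1500, 3))  # 395000: 200*1500 + 15000*3 + 50000
print(price(2000, 4))  # 510000
```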
Where It’s Used: Demand forecasting at Walmart, pricing models at Zillow, revenue prediction at SaaS companies, ad bid optimization at Google. Any time you need to predict a continuous number from structured inputs, linear regression is the first model to try.
Trade-offs:
- Only captures straight-line relationships (can’t model curves without adding polynomial terms)
- Sensitive to outliers — one mansion in a dataset of starter homes will pull the line off course
- Assumes features contribute independently (can’t capture interactions like “bedrooms matter more in suburban zip codes”)
Part 3: Logistic Regression — When You Need Yes or No
Linear regression predicts continuous values: price, temperature, revenue. But what if you need to predict yes/no outcomes?
- Will this customer churn? (yes/no)
- Is this email spam? (yes/no)
- Will this patient develop diabetes? (yes/no)
You need a probability — a number between 0 and 1 representing the chance of “yes.”
The Problem with Linear Combinations for Probability
A linear combination can output any number: -500, 0, 42, 10,000. But probability must be between 0 and 1. If your model predicts “120% chance of spam,” that’s nonsense. You need something that takes any input and squeezes it into the 0-1 range.
The Sigmoid Function: The Compressor
Logistic regression solves this by adding a compression step after the linear combination.
Mental Model: The Toothpaste Tube
The linear combination is the pressure you apply to a toothpaste tube — it can be any amount, from a gentle squeeze to stomping on it. The sigmoid function is the tiny opening at the top. No matter how hard you squeeze, only a controlled amount comes out — always between 0 and 1. Gentle pressure (input near zero) gives about half the maximum (probability near 0.5). Heavy pressure (large positive input) gives close to the maximum (probability near 1.0). Pulling back (large negative input) gives almost nothing (probability near 0.0).
The sigmoid is an S-shaped curve that maps any number to the 0-1 range:
Sigmoid Function:
Output
1.0 | _______________
| /
0.5 | / ← Zero input → exactly 0.5
| /
0.0 |_____________/
+-----|---------|---------|→ Input
-5 0 5
Large negative → near 0.0 (confident "no")
Zero → exactly 0.5 (uncertain)
Large positive → near 1.0 (confident "yes")
The two-step process:
Step 1: Compute linear combination
z = (w1 × age) + (w2 × blood_pressure) + (w3 × BMI) + bias
z could be any number: -3.2, 0, 7.5, etc.
Step 2: Apply sigmoid to squeeze into probability
probability = sigmoid(z)
Always between 0 and 1.
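The two steps fit in a few lines of code. The sigmoid formula is standard; the risk-score weights below are made up purely for illustration:

```python
import math

def sigmoid(z):
    # Squeeze any real number into the 0-1 range.
    return 1 / (1 + math.exp(-z))

# The S-curve in action:
print(round(sigmoid(-5), 3))  # 0.007 — confident "no"
print(sigmoid(0))             # 0.5   — maximally uncertain
print(round(sigmoid(5), 3))   # 0.993 — confident "yes"

# Step 1: linear combination (hypothetical weights, for illustration only)
def risk_score(age, blood_pressure, bmi):
    return 0.04 * age + 0.02 * blood_pressure + 0.05 * bmi - 6.0

# Step 2: sigmoid turns the raw score into a probability
z = risk_score(55, 140, 31)   # 2.2 + 2.8 + 1.55 - 6.0 = 0.55
print(round(sigmoid(z), 2))   # 0.63 — a 63% chance of "yes"
```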
From Probability to Decision
Logistic regression outputs a probability. To make a yes/no decision, you apply a threshold:
- If probability > 0.5 → Predict “Yes”
- If probability ≤ 0.5 → Predict “No”
The threshold can be adjusted based on your use case. For cancer detection, you might lower it to 0.3 — catch more real cases even if it means more false alarms. For spam filtering, you might raise it to 0.7 — only block emails you’re highly confident about.
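The threshold is just a dial in code. A minimal sketch of the same probability producing different decisions at different thresholds:

```python
def decide(probability, threshold=0.5):
    # Turn a probability into a yes/no decision at a chosen threshold.
    return "Yes" if probability > threshold else "No"

p = 0.4  # the model says 40% chance
print(decide(p))                 # "No" at the default 0.5 threshold
print(decide(p, threshold=0.3))  # "Yes" — a lower threshold catches more cases
```

Note that the model itself is unchanged — only the decision rule moved.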
The PM Takeaway: Choosing the right threshold is a product decision, not a data science decision. It’s about the cost of false positives versus false negatives in your specific use case. A PM who understands this can have a much more productive conversation with their ML team than one who just asks “is the model accurate?”
Where It’s Used: Credit scoring at banks, churn prediction at subscription companies, spam filtering at Gmail, click-through rate prediction in ads systems, medical diagnosis triage.
Part 4: Loss Functions — Measuring How Wrong You Are
Both linear and logistic regression need a way to measure “how wrong” the model is with its current weights. This measurement is called the loss function (also called cost function or error function).
The loss function answers one question: given my current weights, how far off are my predictions? During training, the model adjusts weights to minimize this loss. The loss function is the scorecard, and the model is trying to get the best score.
Mental Model: The Golf Score
Think of the loss function like a golf score — lower is better. A score of 72 means you’re playing well. A score of 110 means you’re struggling. You don’t need to know the exact rules of golf to understand the goal: make the number go down. That’s all the model is doing during training — making the loss go down.
Mean Squared Error (MSE) — For Continuous Predictions
Used with linear regression when predicting continuous values like prices or temperatures.
For each prediction: measure the gap between predicted and actual value, square it (to make all errors positive and punish big errors more), then average across all data points.
House 1: Predicted $300K, Actual $320K → Error = $20K → Squared = 400
House 2: Predicted $450K, Actual $440K → Error = $10K → Squared = 100
House 3: Predicted $280K, Actual $350K → Error = $70K → Squared = 4,900
MSE = (400 + 100 + 4,900) / 3 = 1,800 (working in $K, so the squared errors and the MSE are in $K²)
Why square the error? Two reasons. First, direction doesn’t matter — predicting $20K too high is as bad as $20K too low. Second, big errors are punished disproportionately: a $70K error contributes 49× more to the loss than a $10K error (4,900 vs. 100). This forces the model to focus on fixing its worst predictions first.
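The three-house calculation above, verified in code:

```python
# The three houses from above, with prices in thousands of dollars ($K).
predicted = [300, 450, 280]
actual    = [320, 440, 350]

# Square each gap (direction doesn't matter, big misses hurt more), then average.
squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
print(squared_errors)  # [400, 100, 4900]

mse = sum(squared_errors) / len(squared_errors)
print(mse)  # 1800.0
```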
Cross-Entropy Loss — For Probability Predictions
Used with logistic regression. MSE doesn’t work well for probabilities because it doesn’t punish “confidently wrong” answers enough. Cross-entropy loss fixes this.
Mental Model: The Confidence Penalty
Cross-entropy asks: how surprised should I be by this prediction, given the true answer? If you’re a weather forecaster who says “99% chance of rain” and it rains — no surprise, tiny penalty. If you say “51% chance of rain” and it rains — some surprise, moderate penalty. If you say “1% chance of rain” and it rains — you were confidently wrong, and the penalty is enormous.
| Prediction (True = Spam) | MSE Loss | Cross-Entropy Loss |
|---|---|---|
| 0.99 (confident, right) | 0.0001 | 0.01 |
| 0.51 (barely right) | 0.24 | 0.67 |
| 0.01 (confident, WRONG) | 0.98 | 4.6 (huge!) |
Cross-entropy punishes confident wrong answers far more harshly than MSE. This is why it’s the standard loss function for classification problems — it forces the model to be confident only when it’s actually right.
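The table's numbers can be reproduced with the standard formulas — squared error for MSE, negative log of the probability assigned to the true class for cross-entropy:

```python
import math

def mse_loss(p, true=1):
    # Squared gap between predicted probability and the true label (0 or 1).
    return (true - p) ** 2

def cross_entropy_loss(p, true=1):
    # Negative log of the probability the model gave to the true class.
    return -math.log(p) if true == 1 else -math.log(1 - p)

for p in (0.99, 0.51, 0.01):  # the true label is "spam" (1) in every case
    print(p, round(mse_loss(p), 4), round(cross_entropy_loss(p), 2))
# 0.99 -> MSE 0.0001, cross-entropy 0.01
# 0.51 -> MSE 0.2401, cross-entropy 0.67
# 0.01 -> MSE 0.9801, cross-entropy 4.61
```

Notice how MSE caps out near 1 while cross-entropy keeps growing as confidence in the wrong answer increases — that unbounded penalty is the whole point.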
The PM Takeaway: When your data science team says “we’re using cross-entropy loss,” they mean the model is being trained on a classification task (yes/no, spam/not-spam, buy/don’t-buy) with a penalty system that heavily punishes confident mistakes. If they switch to MSE, ask why — it might mean they’re treating a classification problem as a regression problem, which is often a mistake.
Part 5: How It All Connects
Here’s the complete flow from raw data to predictions:
TRAINING PHASE:
1. Start with random weights
2. For each data point:
a. Compute linear combination (weighted sum of inputs)
b. For logistic regression: apply sigmoid to get probability
c. Compare prediction to actual value
d. Compute loss (MSE for continuous, cross-entropy for probability)
3. Adjust weights to reduce loss (this is gradient descent — covered in the next post)
4. Repeat until loss stops improving
PREDICTION PHASE:
1. Take new input (new house, new customer)
2. Apply learned weights via linear combination
3. For logistic regression: apply sigmoid
4. Output prediction (price, probability, class)
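The whole loop fits on toy data. Since gradient descent is the next post's topic, a crude random search stands in here for step 3, "adjust weights to reduce loss" — same loop shape, cruder adjustment rule:

```python
import random

# Toy data where price really is 200 dollars per square foot: (sqft, price).
data = [(1000, 200_000), (1500, 300_000), (2000, 400_000)]

def loss(w):
    # Mean squared error of a one-weight model: price = w * sqft
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

random.seed(42)
w = 0.0                                     # 1. start with an arbitrary weight
for _ in range(5_000):                      # 4. repeat many times
    candidate = w + random.uniform(-1, 1)   # 3. propose a small adjustment
    if loss(candidate) < loss(w):           # 2. keep it only if the loss drops
        w = candidate

print(round(w))  # converges near 200 — the true dollars-per-square-foot
```

Random search works on a toy problem with one weight; gradient descent replaces the blind "propose and check" step with a directed one, which is what makes training millions of weights feasible.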
The Regression Family
Linear and logistic regression are the workhorses, but several variations exist for different situations:
| Type | When to Use | How It Differs |
|---|---|---|
| Linear Regression | Predict continuous values (price, revenue) | Output is an unbounded number |
| Logistic Regression | Binary classification (yes/no) | Sigmoid squeezes output to 0-1 |
| Polynomial Regression | Curved relationships | Adds squared/cubed terms to capture non-linearity |
| Ridge / Lasso | Preventing overfitting | Adds a penalty for large weights to keep the model simple |
| Softmax Regression | Multi-class classification (3+ categories) | Extends sigmoid to multiple classes, probabilities sum to 1 |
Note the last one — softmax. It’s the generalization of sigmoid to multiple classes, and it shows up again in a major way when we cover transformers and the attention mechanism. It’s also the function behind the “attention budget” we’ll discuss in the context window and needle-in-a-haystack posts. Keep it in mind.
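Since softmax will keep reappearing, here is the standard formula in a few lines — exponentiate each raw score, then divide by the total so the results sum to 1. The scores below are invented for illustration:

```python
import math

def softmax(scores):
    # Turn any list of raw scores into probabilities: positive, summing to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for cat / dog / bird:
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 2) for p in probs])  # [0.66, 0.24, 0.1]
print(sum(probs))                    # 1.0, up to floating-point rounding
```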
Common Misconceptions
“Linear regression is too simple to be useful.” It’s the workhorse of industry. At most companies, linear regression and logistic regression handle the majority of prediction tasks — churn, pricing, demand forecasting, click-through rates. The bias toward complex models (especially LLMs) is real, but a well-tuned logistic regression often outperforms a poorly implemented neural network. Start simple.
“Logistic regression is only for two categories.” Binary logistic regression handles yes/no. Softmax regression (also called multinomial logistic regression) handles any number of categories — and it’s essentially the same idea extended. “Is this image a cat, dog, or bird?” uses the same linear combination → activation → loss function pipeline.
“The model knows the weights are right.” The model has no concept of “right.” It only knows the loss is lower than before. A model with a loss of 0.001 might still be making systematically wrong predictions if the loss function doesn’t capture what you actually care about. Choosing the right loss function is a design decision, not an automatic one.
“More features always help.” Adding irrelevant features (like the color of the house for price prediction) can actually hurt performance by introducing noise. The model might assign a non-zero weight to house color by coincidence in the training data, then make worse predictions on new data. This is overfitting — the model memorized noise instead of learning signal.
The Mental Models — Your Cheat Sheet
| Concept | Mental Model | One-Liner |
|---|---|---|
| Linear Combination | The Recipe | Ingredients × amounts = output |
| Linear Regression | The Best Line Through the Scatter | Find weights that minimize error on continuous data |
| Logistic Regression | Linear combination + toothpaste tube | Squeeze any number into a 0-1 probability |
| Sigmoid Function | The Toothpaste Tube | Any pressure in, controlled amount out (0 to 1) |
| MSE Loss | Squared distance from target | Punishes big errors disproportionately |
| Cross-Entropy Loss | The Confidence Penalty | Punishes confident wrong answers exponentially |
| Decision Threshold | The dial you set | Where you cut probability into yes/no — a product decision |
Final Thought
Every ML model — from the simplest regression to the deepest neural network — builds on the four concepts in this post:
- Linear combination. Multiply inputs by weights, add them up. This is the atomic operation. Neural networks are just many linear combinations stacked together with non-linear activations between layers.
- Activation functions. Sigmoid compresses outputs into probabilities. Softmax extends this to multiple classes. Other activations (ReLU, tanh) serve similar purposes in neural networks. They add the non-linearity that linear combinations alone can’t provide.
- Loss functions. MSE for continuous predictions, cross-entropy for classification. The loss function defines what “wrong” means. Choose the wrong loss function and the model optimizes for the wrong thing — a subtle but devastating mistake.
- Training. Start with random weights, measure error, adjust, repeat. The specific algorithm for “adjust” is gradient descent — and that’s exactly what the next post covers. We’ll see how the model uses the loss function as a compass to navigate toward better weights, one step at a time.
When someone says “we’re using cross-entropy loss with a softmax output,” you now know they’re doing multi-class classification with a penalty that punishes confident wrong answers. When they say “the model’s MSE is 0.003,” you know they’re measuring how far off continuous predictions are from reality. That vocabulary alone puts you ahead of most people in the room.