Building AI Intuition

Connecting the dots...

  • Home
  • ML Basics
  • Model Intuition
  • Encryption
  • Privacy Tech
  • Musings
  • About
  • Home
  • ML Basics
  • Model Intuition
  • Encryption
  • Privacy Tech
  • Musings
  • About
Close

Search

Subscribe
icon icon Building AI Intuition

Connecting the dots...

icon icon Building AI Intuition

Connecting the dots...

  • Home
  • ML Basics
  • Model Intuition
  • Encryption
  • Privacy Tech
  • Musings
  • About
  • Home
  • ML Basics
  • Model Intuition
  • Encryption
  • Privacy Tech
  • Musings
  • About
Close

Search

Subscribe
Recent Posts
March 1, 2026
Teaching AI Models: Gradient Descent
March 1, 2026
Needle in the Haystack: Embedding Training and Context Rot
March 1, 2026
Measuring Meaning: Cosine Similarity
February 28, 2026
AI Paradigm Shift: From Rules to Patterns
February 16, 2026
Seq2Seq Models: Basics behind LLMs
February 16, 2026
Word2Vec: Start of Dense Embeddings
February 13, 2026
Advertising in the Age of AI
February 8, 2026
Breaking the “Unbreakable” Encryption – Part 2
February 8, 2026
Breaking the “Unbreakable” Encryption – Part 1
February 8, 2026
ML Foundations – Linear Combinations to Logistic Regression
February 2, 2026
Privacy Enhancing Technologies – Introduction
February 2, 2026
Privacy Enhancing Technologies (PETs) — Part 3
February 2, 2026
Privacy Enhancing Technologies (PETs) — Part 2
February 2, 2026
Privacy Enhancing Technologies (PETs) — Part 1
February 2, 2026
An Intuitive Guide to CNNs and RNNs
February 2, 2026
Making Sense Of Embeddings
November 9, 2025
How CNNs Actually Work
August 17, 2025
How Smart Vector Search Works
Machine Learning Basics

Word2Vec: Start of Dense Embeddings

Post 2a/N When you type a search query into Google or ask Spotify to find “chill acoustic covers,” the…

Privacy Tech

Privacy Enhancing Technologies – Introduction

Every time you browse a website, click an ad, make a purchase, or train an ML model, data flows through systems.…

Privacy Tech

Privacy Enhancing Technologies (PETs) — Part 2

Secure Collaboration Without Sharing Raw Data In Part 1, we covered how individual organizations protect data…

Machine Learning Basics Model Intuition

Teaching AI Models: Gradient Descent

Post 1b/N In the last post, we established the big idea: machine learning is about finding patterns from data instead…

Machine Learning Basics

ML Foundations – Linear Combinations to Logistic Regression

Post 1a/N Every machine learning model — from simple house price predictors to neural networks with billions of…

Model Intuition

An Intuitive Guide to CNNs and RNNs

When your phone recognizes “Hey Siri,” a CNN is probably listening. When Google Translate converts your sentence into…

Home/Machine Learning Basics/ML Foundations – Linear Combinations to Logistic Regression
Machine Learning Basics

ML Foundations – Linear Combinations to Logistic Regression

By Archit Sharma
11 Min Read
Updated on March 1, 2026

Post 1a/N

Every machine learning model — from simple house price predictors to neural networks with billions of parameters — starts with the same fundamental building block: the linear combination. Take some inputs, multiply each by a weight, and add them up. That’s it. Everything else is variations on this theme.

In the previous post, we established the paradigm shift: machine learning finds patterns from data instead of following human-written rules. But we kept it abstract — “the model finds patterns.” This post makes it concrete. You’ll see the actual mechanics of the simplest ML models: how they combine inputs, how they make predictions, and how they measure their own mistakes. No heavy math — just mental models and small numbers you can verify in your head.


Part 1: Linear Combination — The Building Block

A linear combination is the atomic unit of machine learning: take some inputs, multiply each by a weight, and add them up.

Mental Model: The Recipe

Think of baking a cake. Your recipe says: 2 cups flour, 1 cup sugar, 0.5 cups butter. The final batter is a “combination” of ingredients, each multiplied by an amount (the weight). Change the weights, change the outcome. More sugar makes it sweeter. More butter makes it richer. A linear combination works exactly the same way — each input contributes to the output proportionally to its weight.

For predicting house prices:

Price = (w1 × square_feet) + (w2 × bedrooms) + (w3 × location_score) + bias

The model’s job is to find the right weights that make this formula predict accurately. The bias is a baseline value — the price a house would have even with zero square footage and zero bedrooms (obviously unrealistic, but mathematically necessary as a starting point for the line).
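In code, the linear combination is a one-liner. Here is a minimal Python sketch; the weights, bias, and house values are made-up illustrative numbers, not learned ones:

```python
# Linear combination: multiply each input by its weight, add them up, add the bias.
# All numbers below are invented for illustration.
def linear_combination(features, weights, bias):
    return sum(f * w for f, w in zip(features, weights)) + bias

# Hypothetical house: 1,500 sq ft, 3 bedrooms, location score 8
price = linear_combination([1500, 3, 8], weights=[200, 15000, 2000], bias=50000)
print(price)  # 200*1500 + 15000*3 + 2000*8 + 50000 = 411000
```

Every model in this post, and every neural network layer later in the series, is built from this one operation.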

Why “Linear”?

It’s called “linear” because there’s no squaring, cubing, or other transformations of the inputs. Each input contributes proportionally to the output. Double the square footage, double its contribution to price (assuming the weight stays constant).

This simplicity is both a strength and a limitation. Strength: easy to understand, fast to compute, and surprisingly effective for many real-world problems. Limitation: can’t capture curved or complex relationships without help. We’ll see how logistic regression and neural networks address this limitation later in this post.


Part 2: Linear Regression — Finding the Best Line

Linear regression takes the linear combination idea and asks: given a bunch of real data points, what weights produce the best predictions?

Mental Model: The Line Through the Scatter

Imagine you have data on 1,000 houses — their square footage and sale prices. Plot them on a graph. The points form a rough upward trend, but they’re scattered — no two houses with the same square footage sold for the exact same price. Linear regression finds the single straight line that best captures this trend. Not a line that hits every point (that’s impossible with real data), but the line that minimizes the overall distance between itself and all the points.

Price ($)
    |                    *
    |               * *
    |            * *       ← Each * is a real house sale
    |        * * *
    |      * *
    |   * *
    +----------------------→ Square Feet

Linear regression finds the best straight line through these points.

But Which Line?

With just two points, you can draw exactly one line connecting them. But with 1,000 points, you could draw infinitely many different lines. Which one is “best”?

We measure “best” by how wrong the predictions are. For each house, the line predicts a price based on square footage. The actual sale price is known. The difference is the error. The best line minimizes the total error across all houses. How we measure that total error is the job of the loss function — covered in Part 4.

What Linear Regression Produces

After training, linear regression gives you concrete weights:

Price = $200 × square_feet + $15,000 × bedrooms + $50,000

Weights learned:
  - Each square foot adds $200
  - Each bedroom adds $15,000
  - Base price (bias) is $50,000

Now you can predict prices for houses you’ve never seen by plugging in their features. That’s the “pattern” the model found — the mathematical relationship between features and price.
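For a single feature, the “best line” even has a closed form: the slope is the covariance of x and y divided by the variance of x. A sketch on noiseless toy data generated from the formula above (price = $200 × square_feet + $50,000):

```python
# Closed-form least-squares fit for one-feature linear regression.
# Toy data drawn exactly from price = 200 * sqft + 50000.
sqft  = [1000, 1500, 2000, 2500]
price = [250000, 350000, 450000, 550000]

mean_x = sum(sqft) / len(sqft)
mean_y = sum(price) / len(price)

# slope = covariance(x, y) / variance(x)
num = sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, price))
den = sum((x - mean_x) ** 2 for x in sqft)
slope = num / den
bias = mean_y - slope * mean_x

print(slope, bias)  # 200.0 50000.0, the line the data was drawn from
```

With noisy real-world data the recovered weights won’t match any “true” formula exactly; they are simply the ones that minimize the total error.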

Where It’s Used: Demand forecasting at Walmart, pricing models at Zillow, revenue prediction at SaaS companies, ad bid optimization at Google. Any time you need to predict a continuous number from structured inputs, linear regression is the first model to try.

Trade-offs:

  • Only captures straight-line relationships (can’t model curves without adding polynomial terms)
  • Sensitive to outliers — one mansion in a dataset of starter homes will pull the line off course
  • Assumes features contribute independently (can’t capture interactions like “bedrooms matter more in suburban zip codes”)

Part 3: Logistic Regression — When You Need Yes or No

Linear regression predicts continuous values: price, temperature, revenue. But what if you need to predict yes/no outcomes?

  • Will this customer churn? (yes/no)
  • Is this email spam? (yes/no)
  • Will this patient develop diabetes? (yes/no)

You need a probability — a number between 0 and 1 representing the chance of “yes.”

The Problem with Linear Combinations for Probability

A linear combination can output any number: -500, 0, 42, 10,000. But probability must be between 0 and 1. If your model predicts “120% chance of spam,” that’s nonsense. You need something that takes any input and squeezes it into the 0-1 range.

The Sigmoid Function: The Compressor

Logistic regression solves this by adding a compression step after the linear combination.

Mental Model: The Toothpaste Tube

The linear combination is the pressure you apply to a toothpaste tube — it can be any amount, from a gentle squeeze to stomping on it. The sigmoid function is the tiny opening at the top. No matter how hard you squeeze, only a controlled amount comes out — always between 0 and 1. Light pressure (small positive input) gives a small amount (probability near 0.5). Heavy pressure (large positive input) gives the maximum (probability near 1.0). Pulling back (negative input) gives almost nothing (probability near 0.0).

The sigmoid is an S-shaped curve that maps any number to the 0-1 range:

Sigmoid Function:

  Output
  1.0  |                    _______________
       |                  /
  0.5  |                /       ← Zero input → exactly 0.5
       |              /
  0.0  |_____________/
       +-----|---------|---------|→ Input
            -5        0         5

  Large negative → near 0.0 (confident "no")
  Zero           → exactly 0.5 (uncertain)
  Large positive → near 1.0 (confident "yes")

The two-step process:

Step 1: Compute linear combination
        z = (w1 × age) + (w2 × blood_pressure) + (w3 × BMI) + bias
        z could be any number: -3.2, 0, 7.5, etc.

Step 2: Apply sigmoid to squeeze into probability
        probability = sigmoid(z)
        Always between 0 and 1.

From Probability to Decision

Logistic regression outputs a probability. To make a yes/no decision, you apply a threshold:

  • If probability > 0.5 → Predict “Yes”
  • If probability ≤ 0.5 → Predict “No”

The threshold can be adjusted based on your use case. For cancer detection, you might lower it to 0.3 — catch more real cases even if it means more false alarms. For spam filtering, you might raise it to 0.7 — only block emails you’re highly confident about.
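The threshold really is just a dial, as a few lines of code make clear. The probability and threshold values below are made up:

```python
def decide(probability, threshold=0.5):
    # Cut a probability into a yes/no decision at the chosen threshold.
    return "yes" if probability > threshold else "no"

p = 0.42  # hypothetical model output for one patient

print(decide(p))                 # "no" at the default 0.5 threshold
print(decide(p, threshold=0.3))  # "yes": a lower bar catches more real cases
```

Note that the model itself is unchanged; only the cut point moves, which is exactly why this is a product decision rather than a modeling one.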

The PM Takeaway: Choosing the right threshold is a product decision, not a data science decision. It’s about the cost of false positives versus false negatives in your specific use case. A PM who understands this can have a much more productive conversation with their ML team than one who just asks “is the model accurate?”

Where It’s Used: Credit scoring at banks, churn prediction at subscription companies, spam filtering at Gmail, click-through rate prediction in ads systems, medical diagnosis triage.


Part 4: Loss Functions — Measuring How Wrong You Are

Both linear and logistic regression need a way to measure “how wrong” the model is with its current weights. This measurement is called the loss function (also called cost function or error function).

The loss function answers one question: given my current weights, how far off are my predictions? During training, the model adjusts weights to minimize this loss. The loss function is the scorecard, and the model is trying to get the best score.

Mental Model: The Golf Score

Think of the loss function like a golf score — lower is better. A score of 72 means you’re playing well. A score of 110 means you’re struggling. You don’t need to know the exact rules of golf to understand the goal: make the number go down. That’s all the model is doing during training — making the loss go down.

Mean Squared Error (MSE) — For Continuous Predictions

Used with linear regression when predicting continuous values like prices or temperatures.

For each prediction: measure the gap between predicted and actual value, square it (to make all errors positive and punish big errors more), then average across all data points.

House 1: Predicted $300K, Actual $320K → Error = $20K  → Squared = 400
House 2: Predicted $450K, Actual $440K → Error = $10K  → Squared = 100
House 3: Predicted $280K, Actual $350K → Error = $70K  → Squared = 4,900

MSE = (400 + 100 + 4,900) / 3 = 1,800

Why square the error? Two reasons. First, direction doesn’t matter — predicting $20K too high is as bad as $20K too low. Second, big errors are punished disproportionately: a $70K error contributes 49× more to the loss than a $10K error (4,900 vs. 100). This forces the model to focus on fixing its worst predictions first.
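The arithmetic above can be checked in a few lines, with errors expressed in thousands of dollars to match the worked example:

```python
# Mean Squared Error over the three houses, errors in $K.
predicted = [300, 450, 280]
actual    = [320, 440, 350]

squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
mse = sum(squared_errors) / len(squared_errors)

print(squared_errors)  # [400, 100, 4900]
print(mse)             # 1800.0
```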

Cross-Entropy Loss — For Probability Predictions

Used with logistic regression. MSE doesn’t work well for probabilities because it doesn’t punish “confidently wrong” answers enough. Cross-entropy loss fixes this.

Mental Model: The Confidence Penalty

Cross-entropy asks: how surprised should I be by this prediction, given the true answer? If you’re a weather forecaster who says “99% chance of rain” and it rains — no surprise, tiny penalty. If you say “51% chance of rain” and it rains — some surprise, moderate penalty. If you say “1% chance of rain” and it rains — you were confidently wrong, and the penalty is enormous.

Prediction (True = Spam)  | MSE Loss | Cross-Entropy Loss
0.99 (confident, right)   | 0.0001   | 0.01
0.51 (barely right)       | 0.24     | 0.67
0.01 (confident, WRONG)   | 0.98     | 4.6 (huge!)

Cross-entropy punishes confident wrong answers far more harshly than MSE. This is why it’s the standard loss function for classification problems — it forces the model to be confident only when it’s actually right.
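The table’s numbers fall straight out of the two formulas. For an example whose true label is “spam” (y = 1), MSE is (1 - p)^2 and cross-entropy reduces to -ln(p); a quick check:

```python
import math

def mse_loss(p, y=1):
    return (y - p) ** 2

def cross_entropy_loss(p, y=1):
    # General binary cross-entropy; for y = 1 this reduces to -ln(p).
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

for p in (0.99, 0.51, 0.01):
    print(p, round(mse_loss(p), 4), round(cross_entropy_loss(p), 2))
```

Notice how the two losses are close for good predictions but diverge sharply at p = 0.01, where cross-entropy is nearly five times MSE.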

The PM Takeaway: When your data science team says “we’re using cross-entropy loss,” they mean the model is being trained on a classification task (yes/no, spam/not-spam, buy/don’t-buy) with a penalty system that heavily punishes confident mistakes. If they switch to MSE, ask why — it might mean they’re treating a classification problem as a regression problem, which is often a mistake.


Part 5: How It All Connects

Here’s the complete flow from raw data to predictions:

TRAINING PHASE:
  1. Start with random weights
  2. For each data point:
     a. Compute linear combination (weighted sum of inputs)
     b. For logistic regression: apply sigmoid to get probability
     c. Compare prediction to actual value
     d. Compute loss (MSE for continuous, cross-entropy for probability)
  3. Adjust weights to reduce loss (this is gradient descent — covered in the next post)
  4. Repeat until loss stops improving

PREDICTION PHASE:
  1. Take new input (new house, new customer)
  2. Apply learned weights via linear combination
  3. For logistic regression: apply sigmoid
  4. Output prediction (price, probability, class)
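The training loop above can be sketched end-to-end for one-feature linear regression on toy data. The weight-adjustment step (steps 2 and 3) is plain gradient descent, which the next post covers in detail; treat it here as “nudge each weight in the direction that lowers the loss”:

```python
# Toy data drawn from y = 2x + 1; the loop should recover w = 2, b = 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0           # 1. start with (effectively arbitrary) weights
learning_rate = 0.01

for _ in range(5000):     # 4. repeat until the loss stops improving
    # 2a. predictions via the linear combination
    preds = [w * x + b for x in xs]
    # 2c/2d. gradients of the MSE loss with respect to w and b
    grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / len(xs)
    # 3. adjust weights to reduce the loss
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))  # 2.0 1.0, the pattern in the data
```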

The Regression Family

Linear and logistic regression are the workhorses, but several variations exist for different situations:

Type                  | When to Use                                 | How It Differs
Linear Regression     | Predict continuous values (price, revenue)  | Output is an unbounded number
Logistic Regression   | Binary classification (yes/no)              | Sigmoid squeezes output to 0-1
Polynomial Regression | Curved relationships                        | Adds squared/cubed terms to capture non-linearity
Ridge / Lasso         | Preventing overfitting                      | Adds a penalty for large weights to keep the model simple
Softmax Regression    | Multi-class classification (3+ categories)  | Extends sigmoid to multiple classes, probabilities sum to 1

Note the last one — softmax. It’s the generalization of sigmoid to multiple classes, and it shows up again in a major way when we cover transformers and the attention mechanism. It’s also the function behind the “attention budget” we’ll discuss in the context window and needle-in-a-haystack posts. Keep it in mind.
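Softmax itself is a short function: exponentiate each score, then normalize so the outputs sum to 1. A sketch with made-up class scores:

```python
import math

def softmax(scores):
    # Subtract the max score for numerical stability, exponentiate, normalize.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # e.g. raw scores for cat / dog / bird
print([round(p, 2) for p in probs])  # [0.66, 0.24, 0.1]
print(sum(probs))                    # sums to 1 (up to floating-point rounding)
```

With two classes, softmax collapses to the sigmoid from Part 3, which is why the table calls it an extension of the same idea.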


Common Misconceptions

“Linear regression is too simple to be useful.” It’s the workhorse of industry. At most companies, linear regression and logistic regression handle the majority of prediction tasks — churn, pricing, demand forecasting, click-through rates. The bias toward complex models (especially LLMs) is real, but a well-tuned logistic regression often outperforms a poorly implemented neural network. Start simple.

“Logistic regression is only for two categories.” Binary logistic regression handles yes/no. Softmax regression (also called multinomial logistic regression) handles any number of categories — and it’s essentially the same idea extended. “Is this image a cat, dog, or bird?” uses the same linear combination → activation → loss function pipeline.

“The model knows the weights are right.” The model has no concept of “right.” It only knows the loss is lower than before. A model with a loss of 0.001 might still be making systematically wrong predictions if the loss function doesn’t capture what you actually care about. Choosing the right loss function is a design decision, not an automatic one.

“More features always help.” Adding irrelevant features (like the color of the house for price prediction) can actually hurt performance by introducing noise. The model might assign a non-zero weight to house color by coincidence in the training data, then make worse predictions on new data. This is overfitting — the model memorized noise instead of learning signal.


The Mental Models — Your Cheat Sheet

Concept             | Mental Model                         | One-Liner
Linear Combination  | The Recipe                           | Ingredients × amounts = output
Linear Regression   | The Best Line Through the Scatter    | Find weights that minimize error on continuous data
Logistic Regression | Linear combination + toothpaste tube | Squeeze any number into a 0-1 probability
Sigmoid Function    | The Toothpaste Tube                  | Any pressure in, controlled amount out (0 to 1)
MSE Loss            | Squared distance from target         | Punishes big errors disproportionately
Cross-Entropy Loss  | The Confidence Penalty               | Punishes confident wrong answers exponentially
Decision Threshold  | The dial you set                     | Where you cut probability into yes/no — a product decision

Final Thought

Every ML model — from the simplest regression to the deepest neural network — builds on the four concepts in this post:

  1. Linear combination. Multiply inputs by weights, add them up. This is the atomic operation. Neural networks are just many linear combinations stacked together with non-linear activations between layers.
  2. Activation functions. Sigmoid compresses outputs into probabilities. Softmax extends this to multiple classes. Other activations (ReLU, tanh) serve similar purposes in neural networks. They add the non-linearity that linear combinations alone can’t provide.
  3. Loss functions. MSE for continuous predictions, cross-entropy for classification. The loss function defines what “wrong” means. Choose the wrong loss function and the model optimizes for the wrong thing — a subtle but devastating mistake.
  4. Training. Start with random weights, measure error, adjust, repeat. The specific algorithm for “adjust” is gradient descent — and that’s exactly what the next post covers. We’ll see how the model uses the loss function as a compass to navigate toward better weights, one step at a time.

When someone says “we’re using cross-entropy loss with a softmax output,” you now know they’re doing multi-class classification with a penalty that punishes confident wrong answers. When they say “the model’s MSE is 0.003,” you know they’re measuring how far off continuous predictions are from reality. That vocabulary alone puts you ahead of most people in the room.

Related Posts:

  • Teaching AI Models: Gradient Descent
  • AI Paradigm Shift: From Rules to Patterns
  • Needle in the Haystack: Embedding Training and Context Rot
  • How CNNs Actually Work
  • Measuring Meaning: Cosine Similarity
  • Making Sense Of Embeddings

Tags:

ai, artificial-intelligence, data-science, linear-combination, logistic-regression, machine-learning, regression, technology

Copyright 2026 — Building AI Intuition. All rights reserved.