Building AI Intuition

Connecting the dots...

Machine Learning Basics

Making Sense Of Embeddings

By Archit Sharma
9 Min Read
Updated on March 1, 2026

Post 2/N

When you search on Amazon for “running shoes,” the system doesn’t just look for those exact words – it also shows you “jogging sneakers,” “athletic footwear,” and “marathon trainers.” When Spotify recommends a song you’ve never heard, it’s not because you searched for it – it’s because your listening history is close to people who love that song.

This magic happens because of embeddings. Embeddings are how AI systems understand that things are similar – even when they look completely different on the surface. This post will help you visualize what embeddings are, how they’re created, and why they power almost every modern AI application you use. No heavy math – just mental models you can carry with you.


What Is an Embedding, Really?

An embedding is a way to represent something – a word, a sentence, a product, a song, a user – as a point in space. Not physical space, but a mathematical space where closeness means similarity.

Mental Model: The Music Festival Seating Chart

Imagine you’re organizing a massive music festival and you need to seat 10,000 attendees. You want people who would enjoy talking to each other to sit nearby. You ask each person two questions: “How much do you like rock?” and “How much do you like electronic?” (Scale 1-10). Now each person can be placed on a 2D grid. Rock fans cluster in one corner. EDM fans cluster in another. Pop fans who like both end up in the middle.

You’ve just created an embedding space. Each person is now a point – a coordinate – based on their preferences. And proximity in this space means similarity.

        ^ Electronic
    10  |     * EDM fans
        |   *   * 
     5  |          * Pop fans
        |  
     1  | * *  Rock fans
        +------------------→ Rock
          1    5    10
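The seating-chart idea can be sketched in a few lines of code. This is a toy illustration with made-up attendees and scores, not how a production system represents users:

```python
import math

# Hypothetical attendees: (rock preference, electronic preference) on a 1-10 scale.
attendees = {
    "Ana":   (9, 2),   # rock fan
    "Ben":   (8, 1),   # rock fan
    "Chloe": (2, 9),   # EDM fan
    "Dev":   (5, 6),   # pop fan who likes both
}

def distance(p, q):
    """Euclidean distance between two points: smaller = more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Ana and Ben (both rock fans) sit close; Ana and Chloe are far apart.
print(distance(attendees["Ana"], attendees["Ben"]))    # small
print(distance(attendees["Ana"], attendees["Chloe"]))  # large
```

The same `distance` function works unchanged whether points have 2 coordinates or 1,536 – that is why the mental model scales to real embeddings.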

Where It’s Used: Every search engine, recommendation system, and LLM-powered product you interact with runs on embeddings under the hood. Google Search, Spotify Discover Weekly, Netflix recommendations, Amazon product suggestions – all of them convert things into points in space and find what’s nearby.


Why Two Dimensions Aren’t Enough

The music festival example used two dimensions (rock preference, electronic preference). But two questions can’t capture the full complexity of someone’s taste.

Now imagine you add more questions: How much do you like jazz? Country? Hip-hop? With 5 questions, each person becomes a point in 5-dimensional space. You can’t visualize 5 dimensions, but the math works the same way – people with similar answers across all 5 questions will be “close” in this 5D space.

Real embedding systems use hundreds or thousands of dimensions. OpenAI’s embeddings use 1,536 dimensions. Each dimension captures some aspect of meaning – though unlike our music festival example, these dimensions aren’t hand-picked by humans. The AI learns them automatically.

Mental Model: The Questionnaire

Think of each dimension as one question on a massive personality quiz. Two dimensions is like judging someone’s music taste with just two questions – you’ll get the broad strokes but miss the nuance. 1,536 dimensions is like a 1,536-question quiz. The more questions, the more precisely you can place someone in “personality space.” Two people who answer all 1,536 questions similarly are almost certainly alike.


How AI Learns Embedding Dimensions

In our music festival, we chose the dimensions ourselves: rock preference, electronic preference. But AI systems don’t have humans hand-picking dimensions. Instead, they learn dimensions from context.

Mental Model: The New Employee

Imagine you’re a new employee at a company and you don’t speak the language. You can’t understand what anyone is saying, but you can observe who hangs out with whom. Over months, you notice: Sarah, Mike, and Priya always eat lunch together and carry laptops. John, Carlos, and Wei always eat together and carry hard hats. The first group walks toward the office building. The second group walks toward the construction site.

Without understanding a single word, you’ve learned that Sarah, Mike, and Priya are probably office workers, and John, Carlos, and Wei are construction workers. You figured out the structure from context, not from labels.

This is exactly how embedding models learn. They observe which words appear near each other in millions of sentences. Words that appear in similar contexts get placed close together in embedding space.

Consider these sentences from training data:

"I ate an apple for breakfast"
"I ate an orange for breakfast"
"I ate a banana for breakfast"
"I drove my car to work"
"I parked my car in the garage"

The model notices that “apple,” “orange,” and “banana” all appear after “ate” and before “for breakfast.” It notices “car” appears after “drove” and “parked” and near “work” and “garage.” From this, it learns: apple, orange, and banana belong together. Car is different.

The model doesn’t know that apples are fruits or that cars have wheels. It just knows that words used in similar ways should be close together. This is the key insight: embeddings don’t capture dictionary definitions. They capture usage patterns.
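You can see this "learning from context" idea with a tiny sketch. Here each word's vector is just a count of which neighbors it appears next to in the five example sentences above – a crude stand-in for what real models learn, but it shows why apple and orange end up similar and car doesn't:

```python
import math
from collections import Counter, defaultdict

# The toy training sentences from above.
sentences = [
    "i ate an apple for breakfast",
    "i ate an orange for breakfast",
    "i ate a banana for breakfast",
    "i drove my car to work",
    "i parked my car in the garage",
]

# For each word, count the words that appear within 2 positions of it.
context = defaultdict(Counter)
for s in sentences:
    words = s.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - 2), min(len(words), i + 3)):
            if j != i:
                context[w][words[j]] += 1

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# "apple" and "orange" share contexts (ate, breakfast); "car" shares none.
print(cosine(context["apple"], context["orange"]))  # high
print(cosine(context["apple"], context["car"]))     # low
```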


The Apple-Orange-Ball Problem: Why Embeddings Are Subtle

Here’s something that reveals the real power of embeddings. Consider three items: apple, orange, and ball.

Visually, an orange and a ball look more similar – both are round, roughly the same size. But in embedding space, apple and orange are much closer together. Why?

"She threw the ball across the yard"     — ball appears with "threw"
"She ate the apple at lunch"             — apple appears with "ate"
"She ate the orange at lunch"            — orange appears with "ate"

Apple and orange share contexts (eating, breakfast, fruit salad, grocery store). Ball shares contexts with throw, catch, play, sports. The model places apple and orange close together, and ball farther away – even though ball and orange look similar physically.

This is the power of embeddings: they capture semantic similarity, not visual similarity.

        ^ "edible/food context"
        |
    10  |  * apple    * orange
        |     * banana
     5  |                    * ball
        |  
     1  |                           * car  * truck
        +-----------------------------------→ "transportation context"
          1         5         10

What Dimensions Actually Represent

In learned embeddings, the dimensions are abstract – they don’t have clean labels like “rock preference.” But researchers have found that certain directions in embedding space capture meaningful concepts.

The famous example:

king - man + woman = queen

This shows that there’s a “gender direction” in the embedding space. If you take the point for “king,” subtract the direction for “man,” and add the direction for “woman,” you end up near “queen.”

Similarly:

Paris - France + Italy = Rome

There’s a “capital city” direction. The relationship between Paris and France is similar to the relationship between Rome and Italy. These directions weren’t programmed – they emerged from the patterns in language.
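The arithmetic itself is just vector addition and a nearest-neighbor lookup. Here is a sketch with hand-made 3-D vectors chosen so the "gender" and "royalty" directions are obvious; real learned embeddings have no such clean labels, but the mechanics are identical:

```python
import math

# Hypothetical 3-D embeddings: dimension 0 ~ "gender", dimension 2 ~ "royalty".
vectors = {
    "king":  [ 1.0, 0.2, 1.0],
    "queen": [-1.0, 0.2, 1.0],
    "man":   [ 1.0, 0.1, 0.0],
    "woman": [-1.0, 0.1, 0.0],
}

def nearest(target, exclude):
    """Return the vocabulary word closest (Euclidean) to the target point."""
    def dist(w):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(vectors[w], target)))
    return min((w for w in vectors if w not in exclude), key=dist)

# king - man + woman lands at the point for queen.
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```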

The PM Takeaway: You’ll never need to inspect individual dimensions. But knowing that directions in embedding space encode real relationships explains why vector arithmetic works for analogies, why bias shows up in embeddings (if the training data is biased, the geometry will be too), and why embeddings are the foundation of every semantic feature your team builds.


How Embeddings Power Real Applications

Once everything lives in embedding space, powerful operations become trivially easy.

Semantic Search (Google, Amazon, Notion)

Old keyword search: “running shoes” only matches documents containing those exact words. Embedding search: “running shoes” becomes a point in space. The system finds documents whose embeddings are close to that point – including documents about “jogging sneakers” and “marathon footwear.”

Query: "running shoes"  →  Point Q in embedding space
Documents: Each doc     →  Point D in embedding space
Results: Return docs where distance(Q, D) is smallest
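The steps above can be sketched directly. The vectors here are made up for illustration – a real system would get them from an embedding model – but the ranking logic is the whole algorithm:

```python
import math

# Toy 3-D document embeddings (a real system would call an embedding API).
docs = {
    "jogging sneakers review":  [0.9, 0.8, 0.1],
    "marathon footwear guide":  [0.8, 0.9, 0.2],
    "garage door installation": [0.1, 0.0, 0.9],
}
query_embedding = [0.9, 0.9, 0.1]  # pretend this is embed("running shoes")

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Rank every document by distance to the query; closest first.
results = sorted(docs, key=lambda d: euclidean(docs[d], query_embedding))
print(results[0])  # top hit shares no keywords with the query
```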

Recommendation Systems (Spotify, Netflix, TikTok)

Your listening history becomes an embedding. Songs become embeddings. Recommend songs that are close to your user embedding.

Your taste: [rock=8, electronic=3, jazz=5, ...]  →  Point U
Song A:     [rock=7, electronic=4, jazz=4, ...]  →  Point A  →  distance = 2.4  →  Recommend!
Song B:     [rock=2, electronic=9, jazz=1, ...]  →  Point B  →  distance = 8.7  →  Skip

Clustering and Segmentation (Marketing, Product Analytics)

Embed all your users. Users who cluster together probably have similar behaviors. Name the clusters: “power users,” “casual browsers,” “deal hunters.”

Duplicate Detection (Customer Support, Data Cleaning)

Two support tickets might use completely different words but mean the same thing: “My order hasn’t arrived” and “Package delivery is late.” Embed both. If they’re close in embedding space, they’re probably duplicates or should be routed to the same team.

RAG for LLMs (ChatGPT, Claude)

When an LLM answers questions about your documents, it embeds your question, finds document chunks whose embeddings are close, and feeds those chunks to the LLM as context. This is why the AI can “know” about your specific documents without being trained on them. We’ll cover RAG architecture in detail in Posts 5 and 6.


The Training Loop: How Embeddings Are Created

The model starts with random embeddings – every word is assigned a random point in space. Then it trains on billions of sentences with a simple game.

Mental Model: The Prediction Game

Take a sentence: “The cat sat on the ___.” Mask one word. Ask the model to predict it. If it predicts wrong, adjust the embeddings so words that should predict each other are closer. After billions of these predictions, “cat” and “dog” end up close (both appear in “The ___ sat on the mat”) and “cat” and “car” end up far apart (they never appear in similar contexts).

The embeddings organize themselves so that prediction becomes easier. Similarity emerges as a side effect of prediction.
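Here is a deliberately oversimplified caricature of that loop: start with random points and nudge words toward each other whenever they co-occur. Real models (Word2Vec, BERT) do this with gradient descent on a prediction loss, but the emergent effect – frequent neighbors drift together – is the same:

```python
import random

random.seed(0)

# Random starting embeddings in 2-D for three words.
words = ["cat", "dog", "car"]
emb = {w: [random.uniform(-1, 1), random.uniform(-1, 1)] for w in words}

# Toy "training data": cat and dog co-occur often, cat and car rarely.
pairs = [("cat", "dog")] * 50 + [("cat", "car")] * 2

for a, b in pairs:
    for i in range(2):  # nudge the pair slightly closer on each dimension
        delta = 0.1 * (emb[b][i] - emb[a][i])
        emb[a][i] += delta
        emb[b][i] -= delta

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(emb[a], emb[b])) ** 0.5

# Frequent co-occurrence pulled cat and dog together; car stayed distant.
print(dist("cat", "dog") < dist("cat", "car"))  # True
```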


Embeddings vs. One-Hot Encoding

Before embeddings, the standard approach was one-hot encoding — giving each word a unique binary vector:

Vocabulary: [apple, orange, banana, car, truck]

apple  = [1, 0, 0, 0, 0]
orange = [0, 1, 0, 0, 0]
banana = [0, 0, 1, 0, 0]
car    = [0, 0, 0, 1, 0]
truck  = [0, 0, 0, 0, 1]

The problem: every word is equally distant from every other word. Apple is as different from orange as it is from car. No notion of similarity at all.

Embeddings fix this by learning a dense representation where similar things are close:

apple  = [0.8, 0.2, 0.9, ...]   ← close to orange
orange = [0.7, 0.3, 0.8, ...]   ← close to apple  
car    = [0.1, 0.9, 0.2, ...]   ← far from fruits
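The "every word equidistant" problem is easy to verify numerically. Using a 3-word vocabulary (the dense values are toy numbers, as above):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# One-hot: every pair of distinct words is exactly the same distance apart.
one_hot = {"apple": [1, 0, 0], "orange": [0, 1, 0], "car": [0, 0, 1]}
print(dist(one_hot["apple"], one_hot["orange"]))  # 1.414...
print(dist(one_hot["apple"], one_hot["car"]))     # 1.414... -- identical

# Dense: similar words can actually be close.
dense = {"apple": [0.8, 0.2, 0.9], "orange": [0.7, 0.3, 0.8], "car": [0.1, 0.9, 0.2]}
print(dist(dense["apple"], dense["orange"]))  # small
print(dist(dense["apple"], dense["car"]))     # large
```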

Factor      | One-Hot Encoding        | Dense Embeddings
------------|-------------------------|--------------------------
Vector Size | Vocabulary size (100K+) | Fixed (300-1,536)
Similarity  | Every word equidistant  | Similar words are close
Storage     | Sparse, wasteful        | Dense, compact
Meaning     | No semantic info        | Captures relationships
Learned     | No training needed      | Requires training on data

Common Embedding Models You’ll Encounter

Model         | Dimensions | Best For
--------------|------------|-----------------------------------------------
Word2Vec      | 300        | Classic word embeddings, fast and lightweight
Sentence-BERT | 768        | Sentence-level similarity and search
OpenAI Ada    | 1,536      | General-purpose text embeddings
Cohere Embed  | 1,024      | Multilingual, search-optimized
CLIP          | 512        | Images and text in the same space

More dimensions generally mean more nuance, but also more storage and computation. Choosing the right model is a cost-accuracy tradeoff your team will navigate for every feature.


Common Misconceptions

“Embeddings understand meaning like humans do.” They don’t. They understand usage patterns. If the training data consistently uses a word in a biased way, the embedding will encode that bias. Embeddings are a mirror of the data, not a source of truth.

“Higher dimensions are always better.” Not necessarily. More dimensions capture more nuance but increase storage, latency, and cost. For many product use cases, 384 or 768 dimensions are more than enough. Don’t default to the biggest model.

“You need to build your own embeddings.” For most product teams, pre-trained embedding models (OpenAI, Cohere, Sentence-BERT) work out of the box. Fine-tuning or training from scratch only makes sense when your domain is highly specialized (medical, legal, internal jargon).


The Mental Models – Your Cheat Sheet

Concept                        | Mental Model                                 | One-Liner
-------------------------------|----------------------------------------------|-----------------------------------------------
Embedding                      | Music Festival Seating                       | Similar preferences sit together
High Dimensions                | The 1,536-Question Quiz                      | More questions = more precise placement
Learning Dimensions            | The New Employee                             | Figure out structure from context, not labels
Semantic vs. Visual Similarity | Apple-Orange-Ball                            | Usage patterns trump appearances
Vector Directions              | King – Man + Woman = Queen                   | Relationships emerge as directions in space
Training Process               | The Prediction Game                          | Similarity emerges from predicting neighbors
One-Hot vs. Dense              | Every seat equidistant vs. grouped by taste  | Dense captures similarity, one-hot can't

Final Thought

Embeddings are how AI systems understand that things are related – even when they look completely different on the surface. They work by observing patterns: words that appear in similar contexts get placed close together in a high-dimensional space.

The mental model to carry with you:

  1. Everything becomes a point in space. Words, sentences, images, users, products – anything can be embedded.
  2. Closeness means similarity. The entire point of embedding space is that distance equals meaning. Nearby points are semantically related.
  3. Dimensions are learned from patterns, not programmed. The AI discovers the structure of meaning by observing billions of examples, not by humans labeling axes.
  4. Once things are embedded, finding similar things is just finding nearby points. Search, recommendations, deduplication, RAG – they’re all the same operation: find what’s close.

The next time someone says “we’re using vector embeddings for search,” picture a vast coordinate system where every document has a GPS location, and search is just finding the closest locations to your query. That’s all embeddings are – and that mental model will serve you well.

In the next post, we’ll zoom into Word2Vec – the model that started the dense embeddings revolution – and see exactly how the “prediction game” works under the hood.

Related Posts:

  • Word2Vec: Start of Dense Embeddings
  • Measuring Meaning: Cosine Similarity
  • How CNNs Actually Work
  • AI Paradigm Shift: From Rules to Patterns

Tags: ai, artificial-intelligence, llm, rag, technology
Copyright 2026 — Building AI Intuition. All rights reserved.