Measuring Meaning: Cosine Similarity
Post 2b/N
In the previous posts, we established that embeddings turn everything into points in space and that Word2Vec showed how to learn those points from context. But we glossed over something critical: how do you actually measure “closeness”?
When your search engine says “jogging sneakers” is similar to “running shoes,” what math is happening? When your RAG pipeline retrieves the “most relevant” document chunks, how does it decide what’s relevant? And when someone says CLIP puts images and text “in the same space” — what does that actually mean?
This post fills those gaps. We’ll cover how similarity is measured, how embeddings evolved from single words to entire sentences, and the powerful unifying concept of latent spaces — the idea that connects text search, image generation, recommendation engines, and everything in between.
How Similarity Is Actually Measured: Cosine Similarity
You have two points in embedding space. You need a number that says “how similar are these?” There are several ways to measure distance, but the one that dominates modern AI is cosine similarity.
Mental Model: The Compass Bearing
Forget about how far apart two ships are on the ocean. Instead, ask: are they sailing in the same direction? Two ships that are miles apart but heading due north are more “similar” than two ships that are close together but sailing in opposite directions. Cosine similarity measures the angle between two vectors, not the distance. If they point in the same direction, the similarity is 1. If they’re perpendicular (unrelated), it’s 0. If they point in opposite directions, it’s -1.
Why angle instead of distance? Because embedding vectors can have very different magnitudes depending on word frequency and training dynamics. A common word like “the” might have a long vector, and a rare word like “serendipity” might have a short one — but that doesn’t mean “the” is more meaningful. Cosine similarity ignores length and focuses purely on direction, which captures semantic relationships more reliably.
Cosine Similarity:
Same direction          Perpendicular           Opposite direction
(similarity = 1)        (similarity = 0)        (similarity = -1)

     ↗ ↗                     ↗ ↘                     ↗ ↙

"running shoes"         "running shoes"         "running shoes"
vs.                     vs.                     vs.
"jogging sneakers"      "quantum physics"       "sitting still"
A simple example:
"king" = [0.9, 0.1, 0.8]
"queen" = [0.85, 0.15, 0.75]
"car" = [0.1, 0.9, 0.2]
cosine("king", "queen") ≈ 0.99 ← Very similar (nearly parallel)
cosine("king", "car") ≈ 0.30 ← Not similar (very different angle)
You don’t need to compute this by hand — every vector database and embedding library does it automatically. But knowing what’s happening under the hood helps you debug when search results feel “off.”
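Still, the computation is small enough to see in full. Here is a minimal pure-Python sketch using the toy vectors above (the vectors are illustrative, not real model outputs):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors:
    1 = same direction, 0 = perpendicular, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

king  = [0.9, 0.1, 0.8]
queen = [0.85, 0.15, 0.75]
car   = [0.1, 0.9, 0.2]

print(round(cosine_similarity(king, queen), 2))  # high: nearly parallel
print(round(cosine_similarity(king, car), 2))    # low: very different angle
```

In production this exact formula runs inside the vector database, heavily optimized and batched — but the math is no deeper than this.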
Where It’s Used: Every semantic search system, every RAG retrieval step, every recommendation engine that uses embeddings. When Pinecone, Weaviate, or ChromaDB returns “top-k similar results,” they’re ranking by cosine similarity (or a close variant like dot product).
Trade-offs:
- Cosine similarity treats all dimensions equally — it can’t tell you why two things are similar, just that they are
- It works well for normalized vectors but can behave erratically with very sparse or near-zero vectors, where tiny amounts of noise can swing the angle dramatically
- For some use cases (like retrieval with relevance scoring), dot product works better because it preserves magnitude as a signal of “importance”
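The last trade-off is easy to see in code: scaling a vector changes its dot product with a query but leaves the cosine untouched. A toy sketch with made-up 2-D vectors:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query      = [0.5, 0.5]
doc        = [0.6, 0.4]
doc_scaled = [x * 10 for x in doc]  # same direction, 10x the magnitude

# Cosine ignores magnitude: both versions of the document score the same.
print(round(cosine(query, doc), 4), round(cosine(query, doc_scaled), 4))

# Dot product keeps magnitude as a signal: the longer vector scores 10x higher.
print(round(dot(query, doc_scaled) / dot(query, doc), 2))
```

This is why some retrieval systems deliberately skip normalization: if a document's vector magnitude encodes "importance," dot product lets that signal influence the ranking.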
From Words to Sentences: The Evolution of Embeddings
Word2Vec gave us a coordinate for every word. But products don’t operate on single words — they operate on sentences, paragraphs, and documents. The evolution from word embeddings to sentence embeddings is one of the most important shifts in applied AI.
The Problem with Averaging
The naive approach: take the Word2Vec vectors for every word in a sentence and average them. “The dog bit the man” and “The man bit the dog” would get the same embedding — because the same words are present, just in different order. That’s clearly wrong. Order matters.
Mental Model: The Smoothie Problem
If you throw strawberries, bananas, and yogurt into a blender, you get a smoothie. If you throw the same ingredients in a different order, you get the exact same smoothie. Averaging word vectors is like making a smoothie — you lose all structure. You can’t taste which ingredient went in first. A good sentence embedding needs to be more like a layered cake, where the order and structure of ingredients matters.
Sentence-BERT: Context-Aware Sentence Embeddings
Sentence-BERT (2019) solved this by using a transformer-based architecture that processes the entire sentence at once, capturing word order, grammar, and context. Instead of averaging word vectors, it produces a single vector that represents the meaning of the whole sentence.
Word2Vec averaging:
"The dog bit the man" → average of [the, dog, bit, the, man] → [0.4, 0.5, 0.3]
"The man bit the dog" → average of [the, man, bit, the, dog] → [0.4, 0.5, 0.3]
Same embedding! Wrong.
Sentence-BERT:
"The dog bit the man" → [0.7, 0.2, 0.8, ...] → captures "dog is the biter"
"The man bit the dog" → [0.3, 0.7, 0.4, ...] → captures "man is the biter"
Different embeddings. Correct.
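The smoothie problem is easy to demonstrate with a bag-of-words average (the 2-D word vectors below are made up purely for illustration):

```python
# Hypothetical 2-D word vectors, chosen only for illustration
word_vecs = {
    "the": [0.25, 0.25],
    "dog": [1.0, 0.25],
    "bit": [0.5, 0.75],
    "man": [0.75, 0.5],
}

def average_embedding(sentence):
    """Bag-of-words embedding: average each dimension across all words."""
    vecs = [word_vecs[w] for w in sentence.lower().split()]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

a = average_embedding("The dog bit the man")
b = average_embedding("The man bit the dog")
print(a == b)  # True — same words, same average, word order lost
```

No matter how good the individual word vectors are, averaging maps every permutation of the same words to the same point.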
Why This Matters for Products: If you’re building semantic search, a support ticket classifier, or a RAG pipeline, you almost certainly want sentence-level embeddings, not word-level. The jump from Word2Vec to Sentence-BERT is the jump from “toy demo” to “production-ready.”
| Model | Embeds | Captures Order? | Best For |
|---|---|---|---|
| Word2Vec | Single words | No | Word analogies, simple similarity |
| Sentence-BERT | Full sentences | Yes | Search, classification, RAG |
| OpenAI Ada-002 | Sentences/paragraphs | Yes | General-purpose, API-first teams |
| Cohere Embed v3 | Sentences/paragraphs | Yes | Multilingual, search-optimized |
Latent Spaces: The Unifying Concept
So far we’ve talked about embedding spaces for text. But the same idea works for anything — images, audio, user behavior, products, molecules. The general term for this is a latent space.
Mental Model: The Universal Filing System
Imagine a massive warehouse with an infinite number of shelves, organized not by category labels but by similarity. If you walk in any direction, things gradually change. Walk north and images get brighter. Walk east and music gets faster. Walk up and text gets more formal. Every item in the warehouse — a photo, a song, a sentence, a product — has a specific shelf location based on its properties.
A latent space is this warehouse. “Latent” means hidden: the organizing dimensions aren’t visible, labeled, or hand-designed, but they’re real and consistent — the model discovered them from data. A latent space for faces might have dimensions that correspond to things like “smiling,” “age,” or “lighting angle,” yet nobody programmed those. They emerged from the training data.
Why Latent Spaces Matter for Products
When you understand that everything can live in a latent space, several powerful capabilities unlock:
Cross-modal search. CLIP (by OpenAI) puts images and text into the same latent space. You can search for an image by typing “sunset over mountains” — the text query and the matching images are nearby in the shared space, even though one is words and the other is pixels.
Shared Latent Space (CLIP):
"sunset over mountains" (text)          ●
                                         ↖ close!
[photo of sunset over mountains]        ●

"corporate spreadsheet" (text)                            ● far away
Generative AI. Diffusion models (Stable Diffusion, DALL-E) and VAEs work by learning a latent space of images. Generating a new image means picking a point in latent space and decoding it into pixels. Nearby points produce similar images. Moving smoothly between two points produces a smooth visual transition.
Recommendation across types. If users, products, and content all live in the same latent space, you can recommend products to users, content to products, or users to users — all with the same “find nearest neighbors” operation.
Dimensionality Reduction: Seeing the Invisible
Real embedding spaces have hundreds or thousands of dimensions. Humans can see in 2D (a screen) or 3D (with depth). So how do you visualize or inspect a 1,536-dimensional space?
Dimensionality reduction techniques compress high-dimensional data into 2D or 3D for visualization while trying to preserve the neighborhood relationships — things that are close in 1,536D should still be close in 2D.
Mental Model: The Shadow on the Wall
Imagine a complex 3D sculpture hanging from the ceiling. You shine a flashlight on it and look at the shadow on the wall. The shadow is 2D — you’ve lost information — but the shadow still captures the rough shape of the sculpture. Some details are lost, some relationships get distorted, but the overall structure is preserved. That’s dimensionality reduction: projecting a high-dimensional space onto a flat surface you can actually look at.
The Two Tools You’ll Encounter
PCA (Principal Component Analysis): Finds the directions of maximum variance and projects onto those. Fast, deterministic, good for a quick overview. Think of it as finding the angle of the flashlight that casts the most informative shadow.
t-SNE (t-distributed Stochastic Neighbor Embedding): Specifically optimized to preserve local neighborhoods — things that are close in high-D stay close in 2D. Slower, non-deterministic (different runs give different layouts), but produces much better cluster visualizations.
1,536-D embedding space t-SNE projection to 2D
(invisible to humans) (visible on your screen)
[0.8, 0.2, ..., 0.5] →→→ ● apple
[0.7, 0.3, ..., 0.4] →→→ ● orange (cluster: fruits)
[0.75, 0.25, ..., 0.45] →→→ ● banana
[0.1, 0.9, ..., 0.8] →→→ ● car
[0.15, 0.85, ..., 0.75] →→→ ● truck (cluster: vehicles)
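PCA itself is only a few lines. This sketch projects two hypothetical 5-D clusters down to 2-D via an SVD (the data is synthetic; in practice you'd use a library implementation rather than rolling your own):

```python
import numpy as np

# Toy 5-D "embeddings": two hypothetical clusters (fruits vs. vehicles)
rng = np.random.default_rng(0)
fruits = rng.normal(loc=[1, 0, 1, 0, 1], scale=0.1, size=(10, 5))
vehicles = rng.normal(loc=[0, 1, 0, 1, 0], scale=0.1, size=(10, 5))
X = np.vstack([fruits, vehicles])

# PCA: center the data, then project onto the two directions
# of maximum variance (the top two right singular vectors)
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_2d = X_centered @ Vt[:2].T  # each row is now a 2-D point you can plot

print(X_2d.shape)  # (20, 2)
```

Because the cluster centers differ far more than the within-cluster spread, the first principal component lines up with the fruit-vs-vehicle axis, and the two groups stay cleanly separated in the 2-D shadow.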
The PM Takeaway: You’ll never use t-SNE in production — it’s a visualization tool, not a search algorithm. But it’s incredibly valuable for debugging: “Are our product embeddings actually clustering by category?” “Are these two topics that should be separate actually overlapping?” If your data science team shows you a t-SNE plot and says “see, the clusters are clean” — you now know what you’re looking at.
Trade-offs:
- Dimensionality reduction always loses information — distances in 2D don’t perfectly reflect distances in 1,536D
- t-SNE is non-deterministic: the same data produces different-looking plots on different runs
- PCA preserves global structure better but can smash local clusters together
- Neither technique should be used for actual similarity computation — they’re for human eyes only
How It All Connects
Here’s the full embeddings picture, from raw input to product feature:
1. EMBED: Convert inputs (text, images, users) into vectors
→ Word2Vec, Sentence-BERT, CLIP, OpenAI Ada
↓
2. STORE: Put vectors in a vector database
→ Pinecone, Weaviate, ChromaDB, pgvector
↓
3. QUERY: Convert the user's query into a vector
→ Same embedding model as Step 1
↓
4. MEASURE: Find nearest neighbors using cosine similarity
→ Rank by similarity score
↓
5. RETURN: Deliver results to the user
→ Semantic search, recommendations, RAG retrieval
↓
6. DEBUG: Visualize with t-SNE/PCA if results feel wrong
→ Are the clusters clean? Are unrelated items overlapping?
This pipeline is the backbone of every embedding-powered feature. Whether you’re building semantic search, a recommendation engine, or a RAG system — you’re running some version of this flow.
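The whole flow fits in a toy sketch — a plain dict standing in for the vector database, and hand-made vectors standing in for a real embedding model (all values are hypothetical):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Steps 1-2, EMBED + STORE: a real system would run an embedding model
# and write to Pinecone/Weaviate/pgvector; here, hand-made vectors in a dict.
store = {
    "refund policy":         [0.9, 0.1, 0.2],
    "returns and exchanges": [0.7, 0.3, 0.3],
    "shipping times":        [0.2, 0.9, 0.1],
}

# Step 3, QUERY: the query must pass through the SAME embedding step
query_vec = [0.85, 0.15, 0.25]  # hypothetical embedding of "how do I get my money back"

# Steps 4-5, MEASURE + RETURN: rank stored vectors by cosine similarity
ranked = sorted(store, key=lambda doc: cosine(query_vec, store[doc]), reverse=True)
print(ranked[0])  # the most relevant chunk
```

Everything a vector database adds on top of this — approximate nearest-neighbor indexes, sharding, metadata filters — is optimization of steps 4 and 5, not a change to the underlying idea.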
Common Misconceptions
“Cosine similarity of 0.9 means 90% similar.” Not exactly. Cosine similarity isn’t a percentage — it’s a directional measurement. What counts as “high” depends on the embedding model, the domain, and the task. For some models, 0.7 is a strong match. For others, 0.85 is barely relevant. You need to calibrate thresholds empirically for your specific use case.
“t-SNE shows the true structure of the data.” t-SNE is a lossy projection. The distances between clusters in a t-SNE plot are not meaningful — only the existence of clusters matters. Two clusters that look far apart in 2D might actually be close in the original space. Use it for pattern discovery, not for measurement.
“CLIP understands images and text the same way.” CLIP learns to align text and image embeddings so that matching pairs are close. But it doesn’t “understand” either modality in a deep sense — it understands correspondence. It knows that the text “a golden retriever playing fetch” should be near a photo of that scene, because it was trained on millions of such pairs.
“We need a different embedding model for each data type.” Not anymore. Multi-modal models like CLIP put different data types into the same latent space. But single-modality models (Sentence-BERT for text, ResNet for images) often perform better within their domain. The choice is a trade-off between versatility and precision.
The Mental Models — Your Cheat Sheet
| Concept | Mental Model | One-Liner |
|---|---|---|
| Cosine Similarity | The Compass Bearing | Same direction = similar, regardless of distance |
| Averaging Word Vectors | The Smoothie Problem | You lose all structure when you blend |
| Sentence Embeddings | Layered Cake vs. Smoothie | Order and context preserved, not just ingredients |
| Latent Space | The Universal Filing System | Everything has a shelf location based on learned similarity |
| Cross-Modal Search | Shared warehouse for text and images | CLIP puts words and pictures on the same shelves |
| Dimensionality Reduction | The Shadow on the Wall | 2D projection that preserves rough shape |
| t-SNE | The best-angle shadow | Optimized to keep neighbors close |
Final Thought
With this post, you now have the complete embeddings picture — from what they are, to how they’re learned, to how similarity is measured, to the unifying concept of latent spaces.
The key ideas to carry forward:
- Cosine similarity measures direction, not distance. Two vectors pointing the same way are similar regardless of their length. This is why it’s the default similarity metric in almost every production system.
- Sentence embeddings replaced word embeddings for real products. The jump from Word2Vec to Sentence-BERT was the jump from “interesting research” to “production-grade search and retrieval.” If you’re building anything with embeddings today, you’re almost certainly using sentence-level models.
- Latent spaces are the universal language of AI. Text, images, audio, user behavior — all of it can live in the same kind of space. Understanding this concept connects seemingly different AI capabilities (search, recommendations, generation, RAG) into a single coherent framework.
- Dimensionality reduction is for debugging, not for production. Use t-SNE and PCA to inspect and validate your embeddings. Use cosine similarity and vector databases for actual retrieval.
In the next post, we’ll take everything we’ve learned about embeddings and enter the architecture that changed AI forever: the Transformer. We’ll see how attention mechanisms use embedding spaces to let models process language with unprecedented power — and why the constraints of context windows, KV caches, and scaling laws matter for every AI product you’ll build.