Measuring Meaning: Cosine Similarity
Post 2b/N
In the previous posts, we established that embeddings turn everything into points in space and that Word2Vec showed how to learn those points from context. But we glossed over something critical: how do you actually measure “closeness”?
When your search engine says “jogging sneakers” is similar to “running shoes,” what math is happening? When your RAG pipeline retrieves the “most relevant” document chunks, how does it decide what’s relevant? And when someone says CLIP puts images and text “in the same space” — what does that actually mean?
This post fills those gaps. We’ll cover how similarity is measured, how embeddings evolved from single words to entire sentences, and the powerful unifying concept of latent spaces — the idea that connects text search, image generation, recommendation engines, and everything in between.
How Similarity Is Actually Measured: Cosine Similarity
You have two points in embedding space. You need a number that says “how similar are these?” There are several ways to measure distance, but the one that dominates modern AI is cosine similarity.
Mental Model: The Compass Bearing
Forget about how far apart two ships are on the ocean. Instead, ask: are they sailing in the same direction? Two ships that are miles apart but heading due north are more “similar” than two ships that are close together but sailing in opposite directions. Cosine similarity measures the angle between two vectors, not the distance. If they point in the same direction, the similarity is 1. If they’re perpendicular (unrelated), it’s 0. If they point in opposite directions, it’s -1.
Why angle instead of distance? Because embedding vectors can have very different magnitudes depending on word frequency and training dynamics. A common word like “the” might have a long vector, and a rare word like “serendipity” might have a short one — but that doesn’t mean “the” is more meaningful. Cosine similarity ignores length and focuses purely on direction, which captures semantic relationships more reliably.
Cosine Similarity:
Same direction          Perpendicular           Opposite direction
(similarity = 1)        (similarity = 0)        (similarity = -1)

     ↗ ↗                     ↗ ↘                     ↗ ↙

"running shoes"         "running shoes"         "running shoes"
vs.                     vs.                     vs.
"jogging sneakers"      "quantum physics"       "sitting still"
A simple example:
"king" = [0.9, 0.1, 0.8]
"queen" = [0.85, 0.15, 0.75]
"car" = [0.1, 0.9, 0.2]
cosine("king", "queen") ≈ 0.99 ← Very similar (nearly parallel)
cosine("king", "car") ≈ 0.30 ← Not similar (very different angle)
You don’t need to compute this by hand — every vector database and embedding library does it automatically. But knowing what’s happening under the hood helps you debug when search results feel “off.”
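Still, the computation is small enough to see in full. Here is a minimal pure-Python sketch using the toy vectors above (the vectors are illustrative, not real model outputs):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors:
    1 = same direction, 0 = perpendicular, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

king  = [0.9, 0.1, 0.8]
queen = [0.85, 0.15, 0.75]
car   = [0.1, 0.9, 0.2]

print(round(cosine_similarity(king, queen), 2))  # high: nearly parallel
print(round(cosine_similarity(king, car), 2))    # low: very different angle
```

In production this exact formula runs inside the vector database, heavily optimized and batched — but the math is no deeper than this.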
Where It’s Used: Every semantic search system, every RAG retrieval step, every recommendation engine that uses embeddings. When Pinecone, Weaviate, or ChromaDB returns “top-k similar results,” they’re ranking by cosine similarity (or a close variant like dot product).
Trade-offs:
- Cosine similarity treats all dimensions equally — it can’t tell you why two things are similar, just that they are
- It works well for normalized vectors but can behave erratically with very sparse or near-zero vectors, where tiny amounts of noise can swing the angle dramatically
- For some use cases (like retrieval with relevance scoring), dot product works better because it preserves magnitude as a signal of “importance”
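The last trade-off is easy to see in code: scaling a vector changes its dot product with a query but leaves the cosine untouched. A toy sketch with made-up 2-D vectors:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query      = [0.5, 0.5]
doc        = [0.6, 0.4]
doc_scaled = [x * 10 for x in doc]  # same direction, 10x the magnitude

# Cosine ignores magnitude: both versions of the document score the same.
print(round(cosine(query, doc), 4), round(cosine(query, doc_scaled), 4))

# Dot product keeps magnitude as a signal: the longer vector scores 10x higher.
print(round(dot(query, doc_scaled) / dot(query, doc), 2))
```

This is why some retrieval systems deliberately skip normalization: if a document's vector magnitude encodes "importance," dot product lets that signal influence the ranking.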
From Words to Sentences: The Evolution of Embeddings
Word2Vec gave us a coordinate for every word. But products don’t operate on single words — they operate on sentences, paragraphs, and documents. The evolution from word embeddings to sentence embeddings is one of the most important shifts in applied AI.
The Problem with Averaging
The naive approach: take the Word2Vec vectors for every word in a sentence and average them. “The dog bit the man” and “The man bit the dog” would get the same embedding — because the same words are present, just in different order. That’s clearly wrong. Order matters.
Mental Model: The Smoothie Problem
If you throw strawberries, bananas, and yogurt into a blender, you get a smoothie. If you throw the same ingredients in a different order, you get the exact same smoothie. Averaging word vectors is like making a smoothie — you lose all structure. You can’t taste which ingredient went in first. A good sentence embedding needs to be more like a layered cake, where the order and structure of ingredients matters.
Sentence-BERT: Context-Aware Sentence Embeddings
Sentence-BERT (2019) solved this by using a transformer-based architecture that processes the entire sentence at once, capturing word order, grammar, and context. Instead of averaging word vectors, it produces a single vector that represents the meaning of the whole sentence.
Word2Vec averaging:
"The dog bit the man" → average of [the, dog, bit, the, man] → [0.4, 0.5, 0.3]
"The man bit the dog" → average of [the, man, bit, the, dog] → [0.4, 0.5, 0.3]
Same embedding! Wrong.
Sentence-BERT:
"The dog bit the man" → [0.7, 0.2, 0.8, ...] → captures "dog is the biter"
"The man bit the dog" → [0.3, 0.7, 0.4, ...] → captures "man is the biter"
Different embeddings. Correct.
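The smoothie problem is easy to demonstrate with a bag-of-words average (the 2-D word vectors below are made up purely for illustration):

```python
# Hypothetical 2-D word vectors, chosen only for illustration
word_vecs = {
    "the": [0.25, 0.25],
    "dog": [1.0, 0.25],
    "bit": [0.5, 0.75],
    "man": [0.75, 0.5],
}

def average_embedding(sentence):
    """Bag-of-words embedding: average each dimension across all words."""
    vecs = [word_vecs[w] for w in sentence.lower().split()]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

a = average_embedding("The dog bit the man")
b = average_embedding("The man bit the dog")
print(a == b)  # True — same words, same average, word order lost
```

No matter how good the individual word vectors are, averaging maps every permutation of the same words to the same point.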
Why This Matters for Products: If you’re building semantic search, a support ticket classifier, or a RAG pipeline, you almost certainly want sentence-level embeddings, not word-level. The jump from Word2Vec to Sentence-BERT is the jump from “toy demo” to “production-ready.”
| Model | Embeds | Captures Order? | Best For |
|---|---|---|---|
| Word2Vec | Single words | No | Word analogies, simple similarity |
| Sentence-BERT | Full sentences | Yes | Search, classification, RAG |
| OpenAI Ada-002 | Sentences/paragraphs | Yes | General-purpose, API-first teams |
| Cohere Embed v3 | Sentences/paragraphs | Yes | Multilingual, search-optimized |
Latent Spaces: The Unifying Concept
So far we’ve talked about embedding spaces for text. But the same idea works for anything — images, audio, user behavior, products, molecules. The general term for this is a latent space.
Mental Model: The Universal Filing System
Imagine a massive warehouse with an infinite number of shelves, organized not by category labels but by similarity. If you walk in any direction, things gradually change. Walk north and images get brighter. Walk east and music gets faster. Walk up and text gets more formal. Every item in the warehouse — a photo, a song, a sentence, a product — has a specific shelf location based on its properties.
A latent space is this warehouse. “Latent” means hidden: the organizing dimensions aren’t visible, labeled, or hand-designed, but they’re real and consistent — the model discovered them from data. A latent space for faces might have dimensions that correspond to things like “smiling,” “age,” or “lighting angle,” yet nobody programmed those. They emerged from the training data.
Why Latent Spaces Matter for Products
When you understand that everything can live in a latent space, several powerful capabilities unlock:
Cross-modal search. CLIP (by OpenAI) puts images and text into the same latent space. You can search for an image by typing “sunset over mountains” — the text query and the matching images are nearby in the shared space, even though one is words and the other is pixels.
Shared Latent Space (CLIP):
"sunset over mountains" (text)          ●
                                         ↖ close!
[photo of sunset over mountains]        ●

"corporate spreadsheet" (text)                            ● far away
Generative AI. Diffusion models (Stable Diffusion, DALL-E) and VAEs work by learning a latent space of images. Generating a new image means picking a point in latent space and decoding it into pixels. Nearby points produce similar images. Moving smoothly between two points produces a smooth visual transition.
Recommendation across types. If users, products, and content all live in the same latent space, you can recommend products to users, content to products, or users to users — all with the same “find nearest neighbors” operation.
Dimensionality Reduction: Seeing the Invisible
Real embedding spaces have hundreds or thousands of dimensions. Humans can see in 2D (a screen) or 3D (with depth). So how do you visualize or inspect a 1,536-dimensional space?
Dimensionality reduction techniques compress high-dimensional data into 2D or 3D for visualization while trying to preserve the neighborhood relationships — things that are close in 1,536D should still be close in 2D.
Mental Model: The Shadow on the Wall
Imagine a complex 3D sculpture hanging from the ceiling. You shine a flashlight on it and look at the shadow on the wall. The shadow is 2D — you’ve lost information — but the shadow still captures the rough shape of the sculpture. Some details are lost, some relationships get distorted, but the overall structure is preserved. That’s dimensionality reduction: projecting a high-dimensional space onto a flat surface you can actually look at.
The Two Tools You’ll Encounter
PCA (Principal Component Analysis): Finds the directions of maximum variance and projects onto those. Fast, deterministic, good for a quick overview. Think of it as finding the angle of the flashlight that casts the most informative shadow.
t-SNE (t-distributed Stochastic Neighbor Embedding): Specifically optimized to preserve local neighborhoods — things that are close in high-D stay close in 2D. Slower, non-deterministic (different runs give different layouts), but produces much better cluster visualizations.
1,536-D embedding space t-SNE projection to 2D
(invisible to humans) (visible on your screen)
[0.8, 0.2, ..., 0.5] →→→ ● apple
[0.7, 0.3, ..., 0.4] →→→ ● orange (cluster: fruits)
[0.75, 0.25, ..., 0.45] →→→ ● banana
[0.1, 0.9, ..., 0.8] →→→ ● car
[0.15, 0.85, ..., 0.75] →→→ ● truck (cluster: vehicles)
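PCA itself is only a few lines. This sketch projects two hypothetical 5-D clusters down to 2-D via an SVD (the data is synthetic; in practice you'd use a library implementation rather than rolling your own):

```python
import numpy as np

# Toy 5-D "embeddings": two hypothetical clusters (fruits vs. vehicles)
rng = np.random.default_rng(0)
fruits = rng.normal(loc=[1, 0, 1, 0, 1], scale=0.1, size=(10, 5))
vehicles = rng.normal(loc=[0, 1, 0, 1, 0], scale=0.1, size=(10, 5))
X = np.vstack([fruits, vehicles])

# PCA: center the data, then project onto the two directions
# of maximum variance (the top two right singular vectors)
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_2d = X_centered @ Vt[:2].T  # each row is now a 2-D point you can plot

print(X_2d.shape)  # (20, 2)
```

Because the cluster centers differ far more than the within-cluster spread, the first principal component lines up with the fruit-vs-vehicle axis, and the two groups stay cleanly separated in the 2-D shadow.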
The PM Takeaway: You’ll never use t-SNE in production — it’s a visualization tool, not a search algorithm. But it’s incredibly valuable for debugging: “Are our product embeddings actually clustering by category?” “Are these two topics that should be separate actually overlapping?” If your data science team shows you a t-SNE plot and says “see, the clusters are clean” — you now know what you’re looking at.
Trade-offs:
- Dimensionality reduction always loses information — distances in 2D don’t perfectly reflect distances in 1,536D
- t-SNE is non-deterministic: the same data produces different-looking plots on different runs
- PCA preserves global structure better but can smash local clusters together
- Neither technique should be used for actual similarity computation — they’re for human eyes only
How It All Connects
Here’s the full embeddings picture, from raw input to product feature:
1. EMBED: Convert inputs (text, images, users) into vectors
→ Word2Vec, Sentence-BERT, CLIP, OpenAI Ada
↓
2. STORE: Put vectors in a vector database
→ Pinecone, Weaviate, ChromaDB, pgvector
↓
3. QUERY: Convert the user's query into a vector
→ Same embedding model as Step 1
↓
4. MEASURE: Find nearest neighbors using cosine similarity
→ Rank by similarity score
↓
5. RETURN: Deliver results to the user
→ Semantic search, recommendations, RAG retrieval
↓
6. DEBUG: Visualize with t-SNE/PCA if results feel wrong
→ Are the clusters clean? Are unrelated items overlapping?
This pipeline is the backbone of every embedding-powered feature. Whether you’re building semantic search, a recommendation engine, or a RAG system — you’re running some version of this flow.
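The whole flow fits in a toy sketch — a plain dict standing in for the vector database, and hand-made vectors standing in for a real embedding model (all values are hypothetical):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Steps 1-2, EMBED + STORE: a real system would run an embedding model
# and write to Pinecone/Weaviate/pgvector; here, hand-made vectors in a dict.
store = {
    "refund policy":         [0.9, 0.1, 0.2],
    "returns and exchanges": [0.7, 0.3, 0.3],
    "shipping times":        [0.2, 0.9, 0.1],
}

# Step 3, QUERY: the query must pass through the SAME embedding step
query_vec = [0.85, 0.15, 0.25]  # hypothetical embedding of "how do I get my money back"

# Steps 4-5, MEASURE + RETURN: rank stored vectors by cosine similarity
ranked = sorted(store, key=lambda doc: cosine(query_vec, store[doc]), reverse=True)
print(ranked[0])  # the most relevant chunk
```

Everything a vector database adds on top of this — approximate nearest-neighbor indexes, sharding, metadata filters — is optimization of steps 4 and 5, not a change to the underlying idea.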
Common Misconceptions
“Cosine similarity of 0.9 means 90% similar.” Not exactly. Cosine similarity isn’t a percentage — it’s a directional measurement. What counts as “high” depends on the embedding model, the domain, and the task. For some models, 0.7 is a strong match. For others, 0.85 is barely relevant. You need to calibrate thresholds empirically for your specific use case.
“t-SNE shows the true structure of the data.” t-SNE is a lossy projection. The distances between clusters in a t-SNE plot are not meaningful — only the existence of clusters matters. Two clusters that look far apart in 2D might actually be close in the original space. Use it for pattern discovery, not for measurement.
“CLIP understands images and text the same way.” CLIP learns to align text and image embeddings so that matching pairs are close. But it doesn’t “understand” either modality in a deep sense — it understands correspondence. It knows that the text “a golden retriever playing fetch” should be near a photo of that scene, because it was trained on millions of such pairs.
“We need a different embedding model for each data type.” Not anymore. Multi-modal models like CLIP put different data types into the same latent space. But single-modality models (Sentence-BERT for text, ResNet for images) often perform better within their domain. The choice is a trade-off between versatility and precision.
The Mental Models — Your Cheat Sheet
| Concept | Mental Model | One-Liner |
|---|---|---|
| Cosine Similarity | The Compass Bearing | Same direction = similar, regardless of distance |
| Averaging Word Vectors | The Smoothie Problem | You lose all structure when you blend |
| Sentence Embeddings | Layered Cake vs. Smoothie | Order and context preserved, not just ingredients |
| Latent Space | The Universal Filing System | Everything has a shelf location based on learned similarity |
| Cross-Modal Search | Shared warehouse for text and images | CLIP puts words and pictures on the same shelves |
| Dimensionality Reduction | The Shadow on the Wall | 2D projection that preserves rough shape |
| t-SNE | The best-angle shadow | Optimized to keep neighbors close |
Final Thought
With this post, you now have the complete embeddings picture — from what they are, to how they’re learned, to how similarity is measured, to the unifying concept of latent spaces.
The key ideas to carry forward:
- Cosine similarity measures direction, not distance. Two vectors pointing the same way are similar regardless of their length. This is why it’s the default similarity metric in almost every production system.
- Sentence embeddings replaced word embeddings for real products. The jump from Word2Vec to Sentence-BERT was the jump from “interesting research” to “production-grade search and retrieval.” If you’re building anything with embeddings today, you’re almost certainly using sentence-level models.
- Latent spaces are the universal language of AI. Text, images, audio, user behavior — all of it can live in the same kind of space. Understanding this concept connects seemingly different AI capabilities (search, recommendations, generation, RAG) into a single coherent framework.
- Dimensionality reduction is for debugging, not for production. Use t-SNE and PCA to inspect and validate your embeddings. Use cosine similarity and vector databases for actual retrieval.
In the next post, we’ll take everything we’ve learned about embeddings and enter the architecture that changed AI forever: the Transformer. We’ll see how attention mechanisms use embedding spaces to let models process language with unprecedented power — and why the constraints of context windows, KV caches, and scaling laws matter for every AI product you’ll build.