Building AI Intuition

Connecting the dots...

Machine Learning Basics

Measuring Meaning: Cosine Similarity

By Archit Sharma
10 Min Read

Post 2b/N

In the previous posts, we established that embeddings turn everything into points in space and that Word2Vec showed how to learn those points from context. But we glossed over something critical: how do you actually measure “closeness”?

When your search engine says “jogging sneakers” is similar to “running shoes,” what math is happening? When your RAG pipeline retrieves the “most relevant” document chunks, how does it decide what’s relevant? And when someone says CLIP puts images and text “in the same space” — what does that actually mean?

This post fills those gaps. We’ll cover how similarity is measured, how embeddings evolved from single words to entire sentences, and the powerful unifying concept of latent spaces — the idea that connects text search, image generation, recommendation engines, and everything in between.


How Similarity Is Actually Measured: Cosine Similarity

You have two points in embedding space. You need a number that says “how similar are these?” There are several ways to measure distance, but the one that dominates modern AI is cosine similarity.

Mental Model: The Compass Bearing

Forget about how far apart two ships are on the ocean. Instead, ask: are they sailing in the same direction? Two ships that are miles apart but heading due north are more “similar” than two ships that are close together but sailing in opposite directions. Cosine similarity measures the angle between two vectors, not the distance. If they point in the same direction, the similarity is 1. If they’re perpendicular (unrelated), it’s 0. If they point in opposite directions, it’s -1.

Why angle instead of distance? Because embedding vectors can have very different magnitudes depending on word frequency and training dynamics. A common word like “the” might have a long vector, and a rare word like “serendipity” might have a short one — but that doesn’t mean “the” is more meaningful. Cosine similarity ignores length and focuses purely on direction, which captures semantic relationships more reliably.

Cosine Similarity:

  Same direction     Perpendicular      Opposite direction
  (similarity = 1)   (similarity = 0)   (similarity = -1)
       ↗ ↗               ↑ →                  ↗ ↙

  "running shoes"    "running shoes"     "running shoes"
  vs.                vs.                 vs.
  "jogging sneakers" "quantum physics"   "sitting still"

A simple example:

"king"  = [0.9, 0.1, 0.8]
"queen" = [0.85, 0.15, 0.75]
"car"   = [0.1, 0.9, 0.2]

cosine("king", "queen") ≈ 0.999  ← Very similar (nearly parallel)
cosine("king", "car")   ≈ 0.30   ← Not similar (very different angle)

You don’t need to compute this by hand — every vector database and embedding library does it automatically. But knowing what’s happening under the hood helps you debug when search results feel “off.”
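Still, the math fits in a few lines. Here is a minimal pure-Python sketch using the toy vectors above (in practice NumPy or your vector database does this for you):

```python
# Cosine similarity from scratch: dot product divided by the two lengths.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))        # how much a points along b
    norm_a = math.sqrt(sum(x * x for x in a))     # length of a
    norm_b = math.sqrt(sum(x * x for x in b))     # length of b
    return dot / (norm_a * norm_b)                # lengths cancel: angle only

king  = [0.9, 0.1, 0.8]
queen = [0.85, 0.15, 0.75]
car   = [0.1, 0.9, 0.2]

print(cosine(king, queen))  # close to 1: nearly parallel
print(cosine(king, car))    # much lower: very different angle
```

Because the two norms divide out, scaling either vector by any positive constant leaves the result unchanged, which is exactly the "direction, not distance" property.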

Where It’s Used: Every semantic search system, every RAG retrieval step, every recommendation engine that uses embeddings. When Pinecone, Weaviate, or ChromaDB returns “top-k similar results,” they’re ranking by cosine similarity (or a close variant like dot product).

Trade-offs:

  • Cosine similarity treats all dimensions equally — it can’t tell you why two things are similar, just that they are
  • It works well for normalized vectors but can behave oddly with very sparse or very short vectors
  • For some use cases (like retrieval with relevance scoring), dot product works better because it preserves magnitude as a signal of “importance”
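The last trade-off is easy to see with a toy example (these vectors are invented, not from any model): two vectors pointing the same way but differing in length get identical cosine scores, while dot product rewards the longer one.

```python
# Cosine ignores magnitude; dot product keeps it as a ranking signal.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

v_short = [0.3, 0.4]   # length 0.5
v_long  = [3.0, 4.0]   # same direction, length 5.0
query   = [1.0, 1.0]

# Same angle, so cosine cannot tell them apart...
print(abs(cosine(query, v_short) - cosine(query, v_long)) < 1e-9)
# ...but dot product ranks the longer vector 10x higher.
print(dot(query, v_short), dot(query, v_long))
```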

From Words to Sentences: The Evolution of Embeddings

Word2Vec gave us a coordinate for every word. But products don’t operate on single words — they operate on sentences, paragraphs, and documents. The evolution from word embeddings to sentence embeddings is one of the most important shifts in applied AI.

The Problem with Averaging

The naive approach: take the Word2Vec vectors for every word in a sentence and average them. “The dog bit the man” and “The man bit the dog” would get the same embedding — because the same words are present, just in different order. That’s clearly wrong. Order matters.

Mental Model: The Smoothie Problem

If you throw strawberries, bananas, and yogurt into a blender, you get a smoothie. If you throw the same ingredients in a different order, you get the exact same smoothie. Averaging word vectors is like making a smoothie — you lose all structure. You can’t taste which ingredient went in first. A good sentence embedding needs to be more like a layered cake, where the order and structure of ingredients matters.

Sentence-BERT: Context-Aware Sentence Embeddings

Sentence-BERT (2019) solved this by building on BERT, a transformer that processes the entire sentence at once and produces context-aware token representations — the vector for “dog” depends on its role in the sentence. Pooling those contextual vectors yields a single embedding that captures word order, grammar, and the meaning of the whole sentence, not just its ingredients.

Word2Vec averaging:
  "The dog bit the man"  →  average of [the, dog, bit, the, man]  →  [0.4, 0.5, 0.3]
  "The man bit the dog"  →  average of [the, man, bit, the, dog]  →  [0.4, 0.5, 0.3]
  Same embedding! Wrong.

Sentence-BERT:
  "The dog bit the man"  →  [0.7, 0.2, 0.8, ...]  →  captures "dog is the biter"
  "The man bit the dog"  →  [0.3, 0.7, 0.4, ...]  →  captures "man is the biter"
  Different embeddings. Correct.
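The averaging failure is easy to demonstrate. In this sketch the word vectors are made up for illustration; because addition is commutative, any reordering of the same words averages to the same point:

```python
# Averaging word vectors: word order vanishes because addition is commutative.
# Toy 3-D "word vectors" — purely illustrative, not from a trained model.
vocab = {
    "the": [0.1, 0.1, 0.1],
    "dog": [0.9, 0.2, 0.1],
    "bit": [0.2, 0.8, 0.3],
    "man": [0.7, 0.3, 0.6],
}

def avg_embedding(sentence):
    words = sentence.lower().split()
    dims = len(next(iter(vocab.values())))
    return [sum(vocab[w][j] for w in words) / len(words) for j in range(dims)]

a = avg_embedding("the dog bit the man")
b = avg_embedding("the man bit the dog")
same = all(abs(x - y) < 1e-12 for x, y in zip(a, b))
print(same)  # identical embeddings despite opposite meanings
```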

Why This Matters for Products: If you’re building semantic search, a support ticket classifier, or a RAG pipeline, you almost certainly want sentence-level embeddings, not word-level. The jump from Word2Vec to Sentence-BERT is the jump from “toy demo” to “production-ready.”

Model             Embeds                 Captures Order?   Best For
Word2Vec          Single words           No                Word analogies, simple similarity
Sentence-BERT     Full sentences         Yes               Search, classification, RAG
OpenAI Ada-002    Sentences/paragraphs   Yes               General-purpose, API-first teams
Cohere Embed v3   Sentences/paragraphs   Yes               Multilingual, search-optimized

Latent Spaces: The Unifying Concept

So far we’ve talked about embedding spaces for text. But the same idea works for anything — images, audio, user behavior, products, molecules. The general term for this is a latent space.

Mental Model: The Universal Filing System

Imagine a massive warehouse with an infinite number of shelves, organized not by category labels but by similarity. If you walk in any direction, things gradually change. Walk north and images get brighter. Walk east and music gets faster. Walk up and text gets more formal. Every item in the warehouse — a photo, a song, a sentence, a product — has a specific shelf location based on its properties.

A latent space is this warehouse. “Latent” means hidden — the organizing dimensions aren’t visible or labeled, but they’re real and consistent. The model discovered them from data.

The word “latent” is important: these dimensions are hidden and learned, not designed. A latent space for faces might have dimensions that correspond to things like “smiling,” “age,” or “lighting angle” — but nobody programmed those. They emerged from the training data.

Why Latent Spaces Matter for Products

When you understand that everything can live in a latent space, several powerful capabilities unlock:

Cross-modal search. CLIP (by OpenAI) puts images and text into the same latent space. You can search for an image by typing “sunset over mountains” — the text query and the matching images are nearby in the shared space, even though one is words and the other is pixels.

Shared Latent Space (CLIP):

  "sunset over mountains" (text)     ●
                                      ↖ close!
  [photo of sunset over mountains]   ●

  "corporate spreadsheet" (text)              ● far away

Generative AI. Diffusion models (Stable Diffusion, DALL-E) and VAEs work by learning a latent space of images. Generating a new image means picking a point in latent space and decoding it into pixels. Nearby points produce similar images. Moving smoothly between two points produces a smooth visual transition.

Recommendation across types. If users, products, and content all live in the same latent space, you can recommend products to users, content to products, or users to users — all with the same “find nearest neighbors” operation.
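The “one operation for everything” idea can be sketched with made-up coordinates. The users, products, and articles below are hypothetical points in a single toy space, and one nearest-neighbor routine serves all of them:

```python
# One nearest-neighbor routine over a shared toy latent space.
# All coordinates are invented for illustration.
import math

space = {
    "user:alice":              [0.9, 0.1, 0.2],
    "product:trail-shoes":     [0.85, 0.15, 0.25],
    "article:marathon-tips":   [0.8, 0.2, 0.3],
    "product:spreadsheet-app": [0.1, 0.9, 0.7],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest(key, k=2):
    q = space[key]
    others = [(name, cosine(q, v)) for name, v in space.items() if name != key]
    return [name for name, _ in sorted(others, key=lambda t: -t[1])[:k]]

print(nearest("user:alice"))  # running-related items rank highest
```

The same `nearest` call would recommend products to users, articles to products, or users to users, because type is just a label — position in the space is all that matters.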


Dimensionality Reduction: Seeing the Invisible

Real embedding spaces have hundreds or thousands of dimensions. Humans can see in 2D (a screen) or 3D (with depth). So how do you visualize or inspect a 1,536-dimensional space?

Dimensionality reduction techniques compress high-dimensional data into 2D or 3D for visualization while trying to preserve the neighborhood relationships — things that are close in 1,536D should still be close in 2D.

Mental Model: The Shadow on the Wall

Imagine a complex 3D sculpture hanging from the ceiling. You shine a flashlight on it and look at the shadow on the wall. The shadow is 2D — you’ve lost information — but the shadow still captures the rough shape of the sculpture. Some details are lost, some relationships get distorted, but the overall structure is preserved. That’s dimensionality reduction: projecting a high-dimensional space onto a flat surface you can actually look at.

The Two Tools You’ll Encounter

PCA (Principal Component Analysis): Finds the directions of maximum variance and projects onto those. Fast, deterministic, good for a quick overview. Think of it as finding the angle of the flashlight that casts the most informative shadow.

t-SNE (t-distributed Stochastic Neighbor Embedding): Specifically optimized to preserve local neighborhoods — things that are close in high-D stay close in 2D. Slower, non-deterministic (different runs give different layouts), but produces much better cluster visualizations.

1,536-D embedding space          t-SNE projection to 2D
(invisible to humans)            (visible on your screen)

[0.8, 0.2, ..., 0.5]    →→→     ●  apple
[0.7, 0.3, ..., 0.4]    →→→     ●  orange     (cluster: fruits)
[0.75, 0.25, ..., 0.45] →→→     ●  banana

[0.1, 0.9, ..., 0.8]    →→→              ●  car
[0.15, 0.85, ..., 0.75] →→→              ●  truck   (cluster: vehicles)

The PM Takeaway: You’ll never use t-SNE in production — it’s a visualization tool, not a search algorithm. But it’s incredibly valuable for debugging: “Are our product embeddings actually clustering by category?” “Are these two topics that should be separate actually overlapping?” If your data science team shows you a t-SNE plot and says “see, the clusters are clean” — you now know what you’re looking at.

Trade-offs:

  • Dimensionality reduction always loses information — distances in 2D don’t perfectly reflect distances in 1,536D
  • t-SNE is non-deterministic: the same data produces different-looking plots on different runs
  • PCA preserves global structure better but can smash local clusters together
  • Neither technique should be used for actual similarity computation — they’re for human eyes only

How It All Connects

Here’s the full embeddings picture, from raw input to product feature:

1. EMBED: Convert inputs (text, images, users) into vectors
   → Word2Vec, Sentence-BERT, CLIP, OpenAI Ada
          ↓
2. STORE: Put vectors in a vector database
   → Pinecone, Weaviate, ChromaDB, pgvector
          ↓
3. QUERY: Convert the user's query into a vector
   → Same embedding model as Step 1
          ↓
4. MEASURE: Find nearest neighbors using cosine similarity
   → Rank by similarity score
          ↓
5. RETURN: Deliver results to the user
   → Semantic search, recommendations, RAG retrieval
          ↓
6. DEBUG: Visualize with t-SNE/PCA if results feel wrong
   → Are the clusters clean? Are unrelated items overlapping?

This pipeline is the backbone of every embedding-powered feature. Whether you’re building semantic search, a recommendation engine, or a RAG system — you’re running some version of this flow.
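The six steps can be compressed into a toy end-to-end sketch. The “embedding model” here is a crude word-vector average over an invented vocabulary, standing in for Sentence-BERT/Ada and a real vector database:

```python
# Toy pipeline: embed → store → query → measure → return.
import math

word_vecs = {                          # invented 3-D word vectors
    "refund":   [0.9, 0.1, 0.1],
    "money":    [0.8, 0.2, 0.1],
    "back":     [0.7, 0.2, 0.2],
    "shipping": [0.1, 0.9, 0.2],
    "delivery": [0.2, 0.8, 0.3],
    "late":     [0.1, 0.7, 0.4],
}

def embed(text):                       # Steps 1 and 3: text → vector
    ws = [w for w in text.lower().split() if w in word_vecs]
    return [sum(word_vecs[w][j] for w in ws) / len(ws) for j in range(3)]

def cosine(a, b):                      # Step 4: measure similarity
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

docs = ["refund my money back", "shipping was late"]
index = [(d, embed(d)) for d in docs]  # Step 2: "vector database"

def search(query, k=1):                # Step 5: rank and return top-k
    qv = embed(query)
    ranked = sorted(index, key=lambda t: -cosine(qv, t[1]))
    return [d for d, _ in ranked[:k]]

print(search("money back"))
```

Real systems swap in a learned embedding model and an approximate nearest-neighbor index, but the shape of the flow is exactly this.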


Common Misconceptions

“Cosine similarity of 0.9 means 90% similar.” Not exactly. Cosine similarity isn’t a percentage — it’s a directional measurement. What counts as “high” depends on the embedding model, the domain, and the task. For some models, 0.7 is a strong match. For others, 0.85 is barely relevant. You need to calibrate thresholds empirically for your specific use case.

“t-SNE shows the true structure of the data.” t-SNE is a lossy projection. The distances between clusters in a t-SNE plot are not meaningful — only the existence of clusters matters. Two clusters that look far apart in 2D might actually be close in the original space. Use it for pattern discovery, not for measurement.

“CLIP understands images and text the same way.” CLIP learns to align text and image embeddings so that matching pairs are close. But it doesn’t “understand” either modality in a deep sense — it understands correspondence. It knows that the text “a golden retriever playing fetch” should be near a photo of that scene, because it was trained on millions of such pairs.

“We need a different embedding model for each data type.” Not anymore. Multi-modal models like CLIP put different data types into the same latent space. But single-modality models (Sentence-BERT for text, ResNet for images) often perform better within their domain. The choice is a tradeoff between versatility and precision.


The Mental Models — Your Cheat Sheet

Concept                    Mental Model                          One-Liner
Cosine Similarity          The Compass Bearing                   Same direction = similar, regardless of distance
Averaging Word Vectors     The Smoothie Problem                  You lose all structure when you blend
Sentence Embeddings        Layered Cake vs. Smoothie             Order and context preserved, not just ingredients
Latent Space               The Universal Filing System           Everything has a shelf location based on learned similarity
Cross-Modal Search         Shared warehouse for text and images  CLIP puts words and pictures on the same shelves
Dimensionality Reduction   The Shadow on the Wall                2D projection that preserves rough shape
t-SNE                      The best-angle shadow                 Optimized to keep neighbors close

Final Thought

With this post, you now have the complete embeddings picture — from what they are, to how they’re learned, to how similarity is measured, to the unifying concept of latent spaces.

The key ideas to carry forward:

  1. Cosine similarity measures direction, not distance. Two vectors pointing the same way are similar regardless of their length. This is why it’s the default similarity metric in almost every production system.
  2. Sentence embeddings replaced word embeddings for real products. The jump from Word2Vec to Sentence-BERT was the jump from “interesting research” to “production-grade search and retrieval.” If you’re building anything with embeddings today, you’re almost certainly using sentence-level models.
  3. Latent spaces are the universal language of AI. Text, images, audio, user behavior — all of it can live in the same kind of space. Understanding this concept connects seemingly different AI capabilities (search, recommendations, generation, RAG) into a single coherent framework.
  4. Dimensionality reduction is for debugging, not for production. Use t-SNE and PCA to inspect and validate your embeddings. Use cosine similarity and vector databases for actual retrieval.

In the next post, we’ll take everything we’ve learned about embeddings and enter the architecture that changed AI forever: the Transformer. We’ll see how attention mechanisms use embedding spaces to let models process language with unprecedented power — and why the constraints of context windows, KV caches, and scaling laws matter for every AI product you’ll build.

Tags: artificial-intelligence, Embeddings, machine-learning

Copyright 2026 — Building AI Intuition. All rights reserved.