Building AI Intuition

Connecting the dots...

Machine Learning Basics

Word2Vec: Start of Dense Embeddings

By Archit Sharma
8 Min Read
Updated on March 1, 2026

Post 2a/N

When you type a search query into Google or ask Spotify to find “chill acoustic covers,” the system doesn’t just look for those exact letters. It understands that “chill” is related to “relaxing” and “acoustic” is related to “guitar.” It treats words like coordinates on a map, knowing exactly how close one idea sits to another.

In the previous post, we built the intuition for what embeddings are – points in space where closeness means similarity. This post zooms into the engine that made it all possible: Word2Vec. Published by Google in 2013, Word2Vec was the breakthrough that proved you could learn rich word meanings from raw text alone – no dictionaries, no human labels, no linguistic rules. It was the beginning of dense embeddings that form the bedrock of modern LLMs.


The Conceptual Framework: From Words to Maps

To understand Word2Vec, stop thinking of words as strings of letters. Start thinking of them as points in space. The model’s entire job is to move these points around until words that share context are huddled together.

| Phase | Action | Goal |
| --- | --- | --- |
| Input | A massive text corpus (billions of words) | Provide raw examples of word usage |
| Process | The training loop (prediction + adjustment) | Move word vectors based on co-occurrence |
| Output | The word map (embeddings) | A coordinate system where distance = meaning |

Part 1: The Sliding Window — Defining the Neighborhood

Word2Vec doesn’t read a whole book at once. It uses a small sliding window to look at a few words at a time, creating mini-puzzles for itself to solve.

Mental Model: The Spotlight in a Dark Library

Imagine walking through a dark library with a small flashlight. You can only see one word clearly (the center word) and a few words immediately to its left and right (the context). Everything else is in the dark. As you move the light word by word across every sentence, you learn that “coffee” is often seen near “cream” but rarely near “astronomy.”

If our sentence is “The quick brown fox jumps,” and our spotlight is on “brown,” the model sees:

[The  quick]  BROWN  [fox  jumps]
  ← context →  center  ← context →

The model’s mission is simple: given the word brown, can I predict that quick and fox are nearby? Every time the window slides one word forward, the model gets a new puzzle.
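The sliding-window puzzle generation fits in a few lines of plain Python. This is an illustrative toy, not the original implementation:

```python
# Toy sketch: turn a sentence into (center, context) training pairs
# using a sliding window of radius 2.
def window_pairs(tokens, radius=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - radius), min(len(tokens), i + radius + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
pairs = window_pairs(sentence)
# With the spotlight on "brown", the pairs include ("brown", "the"),
# ("brown", "quick"), ("brown", "fox"), and ("brown", "jumps").
```

Each pair is one mini-puzzle: given the first word, the model should assign high probability to the second.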

Where It’s Used: This windowing technique is the foundation for almost all NLP systems. Meta uses it for content understanding. Microsoft uses it in Bing’s semantic search. Google used it as the basis for their search improvements long before transformers arrived.

Trade-offs:

  • Window size matters: too small and you miss broader context, too large and you dilute the signal
  • Linear context only – the window can’t jump across paragraphs or connect distant references
  • Order within the window is ignored in the basic Skip-gram variant

Part 2: The Scoring System — Log Likelihood

You might be used to models that calculate a “wrongness” score by subtracting a predicted number from a real number (like mean squared error). Word2Vec is different – it uses log likelihood to measure how “surprised” it is by the actual text.

Mental Model: The Multiple-Choice Quiz

Imagine you’re taking a quiz. The question is: “What word goes with Peanut Butter?” The model gives its probability for every word in the vocabulary: Jelly (2%), Concrete (0.01%), Space (0.01%). When the answer key (the actual text) reveals the answer is Jelly, the model sees it only gave Jelly a 2% chance. That low probability creates a penalty. The model needs to study harder – meaning it needs to move “Peanut Butter” and “Jelly” closer together in embedding space.

How it works step by step:

  1. The model starts with random coordinates for every word
  2. It calculates the probability of a context word appearing using the dot product – measuring how much two vectors point in the same direction
  3. It takes the logarithm of that probability – this is the “score”
  4. If the probability was low (like 0.001), the log likelihood is a huge negative number – a loud signal to adjust
  5. The model nudges the vectors closer together so the probability will be higher next time
Round 1:  "peanut butter" and "jelly" are far apart  →  P(jelly) = 0.02  →  Big penalty
          Nudge them closer.

Round 2:  Now slightly closer  →  P(jelly) = 0.08  →  Smaller penalty
          Nudge again.

Round 1000: Close together  →  P(jelly) = 0.61  →  Tiny penalty
            Almost done learning this pair.
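The scoring step can be made concrete with a toy softmax over dot products. The 2-d vectors below are made up for illustration; real models use hundreds of dimensions:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def context_prob(center, context, vocab_vecs):
    # Softmax over dot products: how likely is this context word,
    # relative to every word in the vocabulary?
    total = sum(math.exp(dot(center, v)) for v in vocab_vecs)
    return math.exp(dot(center, context)) / total

peanut_butter = [1.0, 0.0]
jelly = [0.2, 0.9]        # not yet aligned with peanut_butter
concrete = [-1.0, 0.1]
vocab = [jelly, concrete]

p = context_prob(peanut_butter, jelly, vocab)
penalty = -math.log(p)    # low probability -> large positive penalty
```

The training loop nudges the vectors so that `p` rises and `penalty` shrinks on the next pass.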

Trade-offs:

  • Computationally expensive: calculating probabilities across a vocabulary of 100,000+ words for every training example is slow
  • This led to the invention of negative sampling – a shortcut where the model only updates a small random sample of “wrong” words instead of the entire vocabulary
  • Rare words don’t get enough updates to move to the right spot – they end up with noisy, unreliable embeddings
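The negative-sampling shortcut mentioned above can be sketched as a loss function. The vectors are toy values, and real Word2Vec samples negatives from a smoothed frequency distribution rather than uniformly:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def neg_sampling_loss(center, true_context, negatives):
    # Reward a high dot product with the true context word...
    loss = -math.log(sigmoid(dot(center, true_context)))
    # ...and a low dot product with each sampled "wrong" word.
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(center, neg)))
    return loss

coffee = [1.0, 0.5]
cream = [0.9, 0.6]                       # co-occurs with coffee
negatives = [[-0.8, 0.1], [0.0, -1.0]]   # a few sampled non-neighbors
loss = neg_sampling_loss(coffee, cream, negatives)
```

Instead of normalizing over 100,000+ words, each update touches only the true pair and a handful of negatives.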

Part 3: Why Frequency Dominates the Map

Since Word2Vec learns by nudging vectors closer together every time they co-occur, the words you see most often end up having the strongest pull.

Mental Model: The Dance Floor

Think of word vectors as people on a dance floor. Every time two people are seen together, they take a step toward each other. If “bread” and “butter” are seen together 1,000 times, they’ll be standing toe-to-toe. If “bread” and “honey” are seen together twice, they’ll only take two small steps toward each other. The dance floor is a popularity contest – frequent pairs dominate the map.

A word that ranks #2 in frequency for a specific context will almost never leapfrog #1. The model is a mirror of the corpus. If “quick” appears next to “brown” 10x more than “dark” does, “quick” will always have a higher probability in the model’s output.
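The frequency effect is easy to see by counting co-occurrences directly on a toy corpus (adjacent-pair window for simplicity):

```python
from collections import Counter

# Count how often adjacent word pairs appear in a toy corpus.
corpus = ("bread butter " * 5 + "bread honey").split()
pair_counts = Counter(zip(corpus, corpus[1:]))

# ("bread", "butter") co-occurs five times, ("bread", "honey") once,
# so training would pull "butter" much closer to "bread".
```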

Why This Matters for PMs: Your embeddings are only as good as your training data. If your corpus is biased – overrepresenting certain topics, demographics, or styles – the embedding space will encode those biases as geometry. “Doctor” might end up closer to “he” and “nurse” closer to “she” simply because the training text reflects historical patterns. This isn’t the model being sexist – it’s being a faithful mirror of biased data.


Skip-gram vs. CBOW: The Two Flavors

Word2Vec actually comes in two variants that approach the prediction game from opposite directions:

Mental Model: Two Ways to Study Flashcards

Skip-gram is like seeing the answer and guessing the questions. Given the center word “coffee,” predict the context words: “morning,” “cup,” “black.” It asks: “If I know this word, what words are likely nearby?”

CBOW (Continuous Bag of Words) is the reverse – seeing the questions and guessing the answer. Given the context words “morning,” “cup,” “black,” predict the center word: “coffee.” It asks: “If I know the neighborhood, what word lives here?”

| Factor | Skip-gram | CBOW |
| --- | --- | --- |
| Direction | Center word → predict context | Context words → predict center |
| Strength | Better for rare words | Better for frequent words |
| Speed | Slower (one prediction per context word) | Faster (one prediction per window) |
| Best For | Smaller datasets, diverse vocabulary | Large datasets, common patterns |

In practice, Skip-gram is used more often because it handles rare words better and produces slightly richer embeddings. But both produce the same kind of output: a coordinate for every word in the vocabulary.
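The difference is easiest to see in how each variant frames the same window as training examples. These are hypothetical helper functions sketched in plain Python, not library code:

```python
def skipgram_examples(tokens, radius=1):
    # Skip-gram: one (center -> neighbor) example per context word.
    out = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - radius), min(len(tokens), i + radius + 1)
        for j in range(lo, hi):
            if j != i:
                out.append((center, tokens[j]))
    return out

def cbow_examples(tokens, radius=1):
    # CBOW: one (all neighbors -> center) example per window position.
    out = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - radius), min(len(tokens), i + radius + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        out.append((context, center))
    return out

window = ["morning", "cup", "coffee"]
# Skip-gram yields 4 examples here; CBOW yields 3 (one per position).
```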


How It All Connects

Word2Vec creates meaning through a simple, repetitive cycle:

1. SPOTLIGHT: Look at a center word and its neighbors
       ↓
2. GUESS: Use current vector coordinates to predict the neighbor probabilities
       ↓
3. SCORE: Use log likelihood to measure how surprised the model was
       ↓
4. NUDGE: Move vectors closer (if they co-occurred) or apart (if they didn't)
       ↓
5. REPEAT: Slide the window forward one word. Do this billions of times.
       ↓
   RESULT: Random vectors → Semantic map where distance = meaning
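The whole cycle fits in a short, highly simplified end-to-end sketch: toy corpus, radius-1 window, one uniform negative sample per pair. Nothing here is the original implementation, which uses frequency-weighted sampling, subsampling, and larger windows:

```python
import math
import random

random.seed(0)
corpus = ("bread butter " * 50 + "bread astronomy").split()
vocab = sorted(set(corpus))
vecs = {w: [random.uniform(-0.5, 0.5) for _ in range(8)] for w in vocab}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

lr = 0.1
for _ in range(20):  # passes over the corpus
    for center, ctx in zip(corpus, corpus[1:]):  # SPOTLIGHT
        negative = random.choice(vocab)          # one sampled "wrong" word
        for other, label in ((ctx, 1.0), (negative, 0.0)):
            if other == center:
                continue
            u, v = vecs[center], vecs[other]
            g = sigmoid(dot(u, v)) - label       # GUESS + SCORE
            for i in range(len(u)):              # NUDGE both vectors
                u[i], v[i] = u[i] - lr * g * v[i], v[i] - lr * g * u[i]

# Frequently co-occurring pairs end up pointing the same way: the dot
# product of "bread" and "butter" beats that of "bread" and "astronomy".
```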

After billions of these cycles, the random starting points have organized themselves into a meaningful geometry. Words that share contexts are close. Words that don’t are far apart. And the famous analogies emerge: King − Man + Woman ≈ Queen.
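With hand-picked 3-d toy vectors (real embeddings are learned from data and much higher-dimensional), the analogy arithmetic looks like this:

```python
import math

def cosine(u, v):
    d = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return d / (nu * nv)

# Purely illustrative vectors: one dimension loosely encodes "royalty",
# another "gender". These are not trained values.
king  = [0.9, 0.8, 0.1]
man   = [0.5, 0.9, 0.0]
woman = [0.5, 0.1, 0.0]
queen = [0.9, 0.0, 0.1]

result = [k - m + w for k, m, w in zip(king, man, woman)]
# Among the four words, result lands closest to queen.
```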


Common Misconceptions

“Word2Vec understands language.” It doesn’t. It understands co-occurrence statistics. It has no concept of grammar, logic, or truth. It just knows which words tend to appear near which other words. That’s both its power and its limitation.

“The model is being creative when it generates probabilities.” It isn’t. Apart from the random initialization of the vectors, training is mechanical optimization: the model isn’t trying to be interesting – it’s trying to be a faithful mathematical reflection of the word frequencies in your corpus. There’s no temperature, no creative sampling behind the output embeddings.

“Word2Vec is outdated and irrelevant.” The specific model is rarely used in production today – transformer-based embeddings (BERT, Sentence-BERT) have surpassed it. But the ideas Word2Vec introduced – learning from context, dense vectors, vector arithmetic – are the foundation of every modern embedding model. Understanding Word2Vec means understanding the DNA of GPT, BERT, and Claude.

“You need massive compute to train embeddings.” Word2Vec was trained on a single machine in hours, not weeks on GPU clusters. It was revolutionary precisely because it was simple and fast. The complexity came later with transformers.


The Mental Models — Your Cheat Sheet

| Concept | Mental Model | One-Liner |
| --- | --- | --- |
| Sliding Window | Spotlight in a Dark Library | Only see the center word and its immediate neighbors |
| Log Likelihood | The Multiple-Choice Quiz | Low confidence in the right answer = big penalty |
| Frequency Effects | The Dance Floor | Frequent pairs take more steps toward each other |
| Skip-gram | See the answer, guess the questions | Center word predicts context |
| CBOW | See the questions, guess the answer | Context predicts center word |
| Negative Sampling | Only grade a few wrong answers | Shortcut to avoid scoring the entire vocabulary |

Final Thought

Word2Vec proved something profound: you don’t need to explain the definition of a word to a computer. You just need to show it who that word hangs out with. By maximizing the likelihood of patterns found in real-world text, the model builds a map of human language – entirely from context.

Three things to carry with you:

  1. Context is everything. A word’s meaning is defined entirely by its neighbors. “Bank” near “river” means something completely different than “bank” near “money.” Word2Vec captures this by looking at windows of surrounding text.
  2. Log likelihood is the teacher. It provides the signal that tells the model how much to move the vectors. Big surprise = big adjustment. Small surprise = tiny nudge. Over billions of examples, this simple feedback loop produces rich semantic structure.
  3. The embeddings are the prize. Word2Vec’s prediction task is just a means to an end. Nobody cares about the predictions themselves. The valuable output is the coordinate system – the embedding space – where every word has a position that encodes its meaning.

While modern LLMs have added layers of attention and more sophisticated training objectives on top of this foundation, the core intuition remains: the best way to understand a word is to look at the company it keeps.

In the next post, we’ll cover what’s still missing from the embeddings picture: how to measure similarity, how embeddings evolved from words to sentences, and the concept of latent spaces – the unifying idea that connects text embeddings, image embeddings, and the entire RAG architecture.

Tags: ai, artificial-intelligence, Embeddings, machine-learning, Word2Vec