Word2Vec: The Start of Dense Embeddings
Post 2a/N
When you type a search query into Google or ask Spotify to find “chill acoustic covers,” the system doesn’t just look for those exact letters. It understands that “chill” is related to “relaxing” and “acoustic” is related to “guitar.” It treats words like coordinates on a map, knowing exactly how close one idea sits to another.
In the previous post, we built the intuition for what embeddings are – points in space where closeness means similarity. This post zooms into the engine that made it all possible: Word2Vec. Published by Google in 2013, Word2Vec was the breakthrough that proved you could learn rich word meanings from raw text alone – no dictionaries, no human labels, no linguistic rules. It was the beginning of dense embeddings that form the bedrock of modern LLMs.
The Conceptual Framework: From Words to Maps
To understand Word2Vec, stop thinking of words as strings of letters. Start thinking of them as points in space. The model’s entire job is to move these points around until words that share context are huddled together.
| Phase | Action | Goal |
|---|---|---|
| Input | A massive text corpus (billions of words) | Provide raw examples of word usage |
| Process | The training loop (prediction + adjustment) | Move word vectors based on co-occurrence |
| Output | The word map (embeddings) | A coordinate system where distance = meaning |
Part 1: The Sliding Window — Defining the Neighborhood
Word2Vec doesn’t read a whole book at once. It uses a small sliding window to look at a few words at a time, creating mini-puzzles for itself to solve.
Mental Model: The Spotlight in a Dark Library
Imagine walking through a dark library with a small flashlight. You can only see one word clearly (the center word) and a few words immediately to its left and right (the context). Everything else is in the dark. As you move the light word by word across every sentence, you learn that “coffee” is often seen near “cream” but rarely near “astronomy.”
If our sentence is “The quick brown fox jumps,” and our spotlight is on “brown,” the model sees:
[The quick] BROWN [fox jumps]
← context → center ← context →
The model’s mission is simple: given the word brown, can I predict that quick and fox are nearby? Every time the window slides one word forward, the model gets a new puzzle.
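The windowing mechanics can be sketched in a few lines of Python. This is a toy illustration, not the original C implementation; the window size of 2 and the function name are assumptions for the example:

```python
# Sketch of Word2Vec's sliding window (window size 2 assumed).
# Yields (center, context) training pairs - the "mini-puzzles" described above.
def sliding_window_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        # Context = up to `window` words on each side of the center word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
for center, context in sliding_window_pairs(sentence):
    print(center, "->", context)
```

With the spotlight on "brown", this produces the pairs (brown, the), (brown, quick), (brown, fox), and (brown, jumps) – exactly the prediction puzzles the model trains on.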
Where It’s Used: This windowing technique is the foundation for almost all NLP systems. Meta uses it for content understanding. Microsoft uses it in Bing’s semantic search. Google used it as the basis for their search improvements long before transformers arrived.
Trade-offs:
- Window size matters: too small and you miss broader context, too large and you dilute the signal
- Linear context only – the window can’t jump across paragraphs or connect distant references
- Order within the window is ignored in the basic Skip-gram variant
Part 2: The Scoring System — Log Likelihood
You might be used to models that calculate a “wrongness” score from the gap between a predicted number and the real one (like mean squared error). Word2Vec is different – it uses log likelihood to measure how “surprised” it is by the actual text.
Mental Model: The Multiple-Choice Quiz
Imagine you’re taking a quiz. The question is: “What word goes with Peanut Butter?” The model gives its probability for every word in the vocabulary: Jelly (2%), Concrete (0.01%), Space (0.01%). When the answer key (the actual text) reveals the answer is Jelly, the model sees it only gave Jelly a 2% chance. That low probability creates a penalty. The model needs to study harder – meaning it needs to move “Peanut Butter” and “Jelly” closer together in embedding space.
How it works step by step:
- The model starts with random coordinates for every word
- It scores a candidate context word with the dot product – measuring how much the two vectors point in the same direction – and normalizes those scores into probabilities with a softmax over the vocabulary
- It takes the logarithm of that probability – this is the “score”
- If the probability was low (like 0.001), the log likelihood is a huge negative number – a loud signal to adjust
- The model nudges the vectors closer together so the probability will be higher next time
Round 1: "peanut butter" and "jelly" are far apart → P(jelly) = 0.02 → Big penalty
Nudge them closer.
Round 2: Now slightly closer → P(jelly) = 0.08 → Smaller penalty
Nudge again.
Round 1000: Close together → P(jelly) = 0.61 → Tiny penalty
Almost done learning this pair.
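That scoring step can be made concrete with a toy example – three made-up words with hand-picked 2-D coordinates (real vocabularies have 100,000+ words and 100+ dimensions; all numbers here are illustrative). The probability comes from a softmax over dot products, and the penalty is the negative log of the true word's probability:

```python
import math

# Toy scoring step: softmax over dot products, then log-likelihood penalty.
def softmax_probs(center, candidates):
    # Dot product of the center vector with each candidate vector
    scores = [sum(c * x for c, x in zip(center, vec)) for vec in candidates.values()]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return {w: e / total for w, e in zip(candidates, exps)}

center = [0.9, 0.1]                  # "peanut butter" (hypothetical coordinates)
vocab = {
    "jelly":    [0.8, 0.2],          # points roughly the same way -> high dot product
    "concrete": [-0.7, 0.5],         # points away -> low dot product
    "space":    [-0.6, -0.6],
}
probs = softmax_probs(center, vocab)
penalty = -math.log(probs["jelly"])  # penalty for the true context word
```

The closer "jelly" moves toward "peanut butter", the larger their dot product, the higher the softmax probability, and the smaller the penalty – which is exactly the shrinking-penalty progression in the rounds above.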
Trade-offs:
- Computationally expensive: calculating probabilities across a vocabulary of 100,000+ words for every training example is slow
- This led to the invention of negative sampling – a shortcut where the model only updates a small random sample of “wrong” words instead of the entire vocabulary
- Rare words don’t get enough updates to move to the right spot – they end up with noisy, unreliable embeddings
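The negative-sampling shortcut can be sketched as follows. This is a minimal illustration, not a production implementation: the words, vectors, and learning rate are all made up, and real systems batch these updates over large arrays. Instead of scoring the whole vocabulary, each update pulls the center word toward its true context word and pushes it away from a few randomly sampled "negatives":

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_update(vecs, center, context, negatives, lr=0.1):
    """One skip-gram negative-sampling step: nudge `center` toward `context`
    and away from each negative sample."""
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        c_vec, w_vec = vecs[center][:], vecs[word][:]   # snapshot before updating
        # Positive pairs (label=1) pull vectors together; negatives push apart
        grad = lr * (label - sigmoid(dot(c_vec, w_vec)))
        vecs[center] = [c + grad * w for c, w in zip(c_vec, w_vec)]
        vecs[word]   = [w + grad * c for c, w in zip(c_vec, w_vec)]

random.seed(0)
vecs = {w: [random.uniform(-0.5, 0.5) for _ in range(4)]
        for w in ["bread", "butter", "astronomy", "concrete"]}
before = dot(vecs["bread"], vecs["butter"])
for _ in range(50):
    sgns_update(vecs, "bread", "butter", negatives=["astronomy", "concrete"])
after = dot(vecs["bread"], vecs["butter"])
```

Each update touches only 3 vectors (1 true pair + 2 negatives) instead of all 100,000+ – that's the entire trick.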
Part 3: Why Frequency Dominates the Map
Since Word2Vec learns by nudging vectors closer together every time they co-occur, the words you see most often end up having the strongest pull.
Mental Model: The Dance Floor
Think of word vectors as people on a dance floor. Every time two people are seen together, they take a step toward each other. If “bread” and “butter” are seen together 1,000 times, they’ll be standing toe-to-toe. If “bread” and “honey” are seen together twice, they’ll only take two small steps toward each other. The dance floor is a popularity contest – frequent pairs dominate the map.
A word that ranks #2 in frequency for a specific context will almost never leapfrog #1. The model is a mirror of the corpus. If “quick” appears next to “brown” 10x more than “dark” does, “quick” will always have a higher probability in the model’s output.
Why This Matters for PMs: Your embeddings are only as good as your training data. If your corpus is biased – overrepresenting certain topics, demographics, or styles – the embedding space will encode those biases as geometry. “Doctor” might end up closer to “he” and “nurse” closer to “she” simply because the training text reflects historical patterns. This isn’t the model being sexist – it’s being a faithful mirror of biased data.
Skip-gram vs. CBOW: The Two Flavors
Word2Vec actually comes in two variants that approach the prediction game from opposite directions:
Mental Model: Two Ways to Study Flashcards
Skip-gram is like seeing the answer and guessing the questions. Given the center word “coffee,” predict the context words: “morning,” “cup,” “black.” It asks: “If I know this word, what words are likely nearby?”
CBOW (Continuous Bag of Words) is the reverse – seeing the questions and guessing the answer. Given the context words “morning,” “cup,” “black,” predict the center word: “coffee.” It asks: “If I know the neighborhood, what word lives here?”
| Factor | Skip-gram | CBOW |
|---|---|---|
| Direction | Center word → predict context | Context words → predict center |
| Strength | Better for rare words | Better for frequent words |
| Speed | Slower (one prediction per context word) | Faster (one prediction per window) |
| Best For | Smaller datasets, diverse vocabulary | Large datasets, common patterns |
In practice, Skip-gram is used more often because it handles rare words better and produces slightly richer embeddings. But both produce the same kind of output: a coordinate for every word in the vocabulary.
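The two variants can be contrasted by the training examples they generate from the same sentence. A sketch with an assumed window size of 2 (the function names are illustrative):

```python
def skipgram_examples(tokens, window=2):
    # One (input, target) pair per context word: the center predicts each neighbor
    out = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                out.append((center, tokens[j]))
    return out

def cbow_examples(tokens, window=2):
    # One (inputs, target) example per window: all neighbors jointly predict the center
    out = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        out.append((context, center))
    return out
```

On "the quick brown fox jumps", Skip-gram emits 14 separate predictions while CBOW emits just 5 – one per window – which is why the table lists CBOW as the faster variant.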
How It All Connects
Word2Vec creates meaning through a simple, repetitive cycle:
1. SPOTLIGHT: Look at a center word and its neighbors
↓
2. GUESS: Use current vector coordinates to predict the neighbor probabilities
↓
3. SCORE: Use log likelihood to measure how surprised the model was
↓
4. NUDGE: Move vectors closer (if they co-occurred) or apart (if they didn't)
↓
5. REPEAT: Slide the window forward one word. Do this billions of times.
↓
RESULT: Random vectors → Semantic map where distance = meaning
After billions of these cycles, the random starting points have organized themselves into a meaningful geometry. Words that share contexts are close. Words that don’t are far apart. And the famous analogies emerge: King − Man + Woman ≈ Queen (the result isn’t exact – “queen” is simply the nearest word to the computed point).
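The analogy arithmetic can be illustrated with hand-picked 2-D coordinates. This is purely pedagogical – real learned embeddings are high-dimensional and the analogy holds only approximately, via a nearest-neighbor search that conventionally excludes the query words:

```python
import math

# Hypothetical 2-D coordinates: axis 1 ~ "royalty", axis 2 ~ "maleness"
vecs = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
}

def nearest(target, vecs, exclude=()):
    # Return the word whose vector has the highest cosine similarity to `target`
    def cos(u, v):
        return sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
    return max((w for w in vecs if w not in exclude), key=lambda w: cos(target, vecs[w]))

# king - man + woman: remove "maleness" from king, add "femaleness" back
query = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
answer = nearest(query, vecs, exclude={"king", "man", "woman"})
```

In this toy setup the arithmetic lands (up to float rounding) on queen's coordinates, mirroring how the directions "royalty" and "gender" become consistent offsets in a trained embedding space.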
Common Misconceptions
“Word2Vec understands language.” It doesn’t. It understands co-occurrence statistics. It has no concept of grammar, logic, or truth. It just knows which words tend to appear near which other words. That’s both its power and its limitation.
“The model is being creative when it generates probabilities.” It isn’t. Word2Vec is trying to be a faithful mathematical reflection of the word frequencies in your corpus. Training does use randomness – random initialization and negative sampling – but there’s no temperature and no creative generation: the final embeddings simply encode co-occurrence statistics.
“Word2Vec is outdated and irrelevant.” The specific model is rarely used in production today – transformer-based embeddings (BERT, Sentence-BERT) have surpassed it. But the ideas Word2Vec introduced – learning from context, dense vectors, vector arithmetic – are the foundation of every modern embedding model. Understanding Word2Vec means understanding the DNA of GPT, BERT, and Claude.
“You need massive compute to train embeddings.” Word2Vec was trained on a single machine in hours, not weeks on GPU clusters. It was revolutionary precisely because it was simple and fast. The complexity came later with transformers.
The Mental Models — Your Cheat Sheet
| Concept | Mental Model | One-Liner |
|---|---|---|
| Sliding Window | Spotlight in a Dark Library | Only see the center word and its immediate neighbors |
| Log Likelihood | The Multiple-Choice Quiz | Low confidence in the right answer = big penalty |
| Frequency Effects | The Dance Floor | Frequent pairs take more steps toward each other |
| Skip-gram | See the answer, guess the questions | Center word predicts context |
| CBOW | See the questions, guess the answer | Context predicts center word |
| Negative Sampling | Only grade a few wrong answers | Shortcut to avoid scoring the entire vocabulary |
Final Thought
Word2Vec proved something profound: you don’t need to explain the definition of a word to a computer. You just need to show it who that word hangs out with. By maximizing the likelihood of patterns found in real-world text, the model builds a map of human language – entirely from context.
Three things to carry with you:
- Context is everything. A word’s meaning is defined entirely by its neighbors. “Bank” near “river” means something completely different than “bank” near “money.” Word2Vec captures this by looking at windows of surrounding text.
- Log likelihood is the teacher. It provides the signal that tells the model how much to move the vectors. Big surprise = big adjustment. Small surprise = tiny nudge. Over billions of examples, this simple feedback loop produces rich semantic structure.
- The embeddings are the prize. Word2Vec’s prediction task is just a means to an end. Nobody cares about the predictions themselves. The valuable output is the coordinate system – the embedding space – where every word has a position that encodes its meaning.
While modern LLMs have added layers of attention and more sophisticated training objectives on top of this foundation, the core intuition remains: the best way to understand a word is to look at the company it keeps.
In the next post, we’ll cover what’s still missing from the embeddings picture: how to measure similarity, how embeddings evolved from words to sentences, and the concept of latent spaces – the unifying idea that connects text embeddings, image embeddings, and the entire RAG architecture.