Needle in the Haystack: Embedding Training and Context Rot
Post 2c/N
You’ve probably experienced this: you paste a 50-page document into ChatGPT or Claude, ask a specific question about something buried on page 37, and the model either ignores it, gives a vague answer, or confidently cites something from page 2 instead. You know the answer is in there. The model “read” the whole document. So why can’t it find it?
This isn’t a bug — it’s a fundamental mechanical limitation of how these models process information. And understanding why it happens will change how you design AI-powered features, structure your RAG pipelines, and set expectations with stakeholders. This post covers two connected topics: how embedding models are trained to represent meaning, and why that representation breaks down as content gets longer.
A Quick Recap: Where We Are
In the previous posts, we established that embeddings turn words, sentences, and documents into points in space where closeness means similarity. We covered how Word2Vec learns from context, how cosine similarity measures direction, and how latent spaces provide a universal filing system for any kind of data.
Now we need to address two questions the previous posts left open:
- How are modern embedding models actually trained? (Beyond Word2Vec’s simple prediction game)
- Why does similarity get murkier as content gets longer? (The needle in a haystack problem)
These two questions are deeply connected — and the connection is the key insight of this post.
Part 1: How Modern Embedding Models Are Trained
Word2Vec learned word embeddings by predicting neighboring words. Modern embedding models (Sentence-BERT, OpenAI’s embedding models, Cohere Embed) use a more sophisticated training approach called contrastive learning.
Mental Model: The Matchmaking Game
Imagine you’re running a speed-dating event. You have 1,000 people in a room. Before the event, you already know which pairs are compatible (they filled out a survey). Your job is to arrange the seating so that compatible people end up near each other and incompatible people end up far apart.
You start with everyone seated randomly. After each round, you nudge compatible pairs closer and push incompatible pairs farther apart. After hundreds of rounds, the seating arrangement reflects real compatibility — people who belong together are sitting together.
That’s contrastive learning. The model starts with random embeddings, then adjusts them so that matching pairs (similar sentences) get closer and non-matching pairs (unrelated sentences) get farther apart.
The Training Recipe
Modern embedding models are typically trained on massive datasets of paired examples — sentences or passages that are known to be related:
- Question + its correct answer (from Q&A datasets)
- A search query + the document it should retrieve
- Two paraphrases of the same idea
- A premise + its logical entailment
For each pair, the model also needs negatives — examples that should not be close. These can be randomly sampled from the dataset (easy negatives) or deliberately chosen to be confusingly similar (hard negatives).
Training Example:
Anchor: "What causes high blood pressure?"
Positive: "Hypertension is primarily driven by genetics, diet, and stress."
Negative: "Blood pressure monitors are available at most pharmacies."
Goal: Push Anchor closer to Positive. Push Anchor away from Negative.
The negative example is tricky on purpose — it’s about blood pressure but doesn’t answer the question. Training on these “hard negatives” is what teaches the model to distinguish between “topically related” and “actually answers the question.”
The Loss Function: Contrastive Loss
The training signal comes from a contrastive loss function — a scoring system that penalizes the model when similar things are far apart or when dissimilar things are too close.
Mental Model: The Seating Penalty
Back to the speed-dating event. After each round, a judge walks around with a clipboard. If a compatible pair is seated far apart, the judge gives a big penalty. If an incompatible pair is seated too close, another big penalty. If everyone is roughly in the right place, the penalty is small. The model adjusts the seating to minimize the total penalty across all pairs.
Step 1: Embed the anchor, positive, and negative
Anchor → Point A
Positive → Point P
Negative → Point N
Step 2: Measure distances
distance(A, P) = 0.8 ← too far! These should be close.
distance(A, N) = 0.3 ← too close! These should be far.
Step 3: Calculate penalty
Penalty = max(0, distance(A, P) - distance(A, N) + margin)
(The max means there's no penalty once the positive is closer than the negative by at least the margin.)
Model adjusts to reduce this penalty.
Step 4: After millions of examples
distance(A, P) → small
distance(A, N) → large
The embedding space now reflects real semantic relationships.
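The penalty above is the standard triplet loss. Here's a minimal numpy sketch — the 2-D points are toy stand-ins for real embeddings, and the distances mirror the worked example above:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Penalize when the positive sits farther from the anchor
    than the negative does, by more than the margin."""
    d_pos = np.linalg.norm(anchor - positive)   # distance(A, P)
    d_neg = np.linalg.norm(anchor - negative)   # distance(A, N)
    return max(0.0, d_pos - d_neg + margin)

# Before training: positive too far (0.8), negative too close (0.3).
anchor   = np.array([0.0, 0.0])
positive = np.array([0.8, 0.0])
negative = np.array([0.3, 0.0])
print(triplet_loss(anchor, positive, negative))  # 0.8 - 0.3 + 0.5 = 1.0

# After training nudges the points into place, the penalty vanishes.
positive_after = np.array([0.1, 0.0])  # now close
negative_after = np.array([2.0, 0.0])  # now far
print(triplet_loss(anchor, positive_after, negative_after))  # 0.0
```

Real training frameworks compute this over whole batches and backpropagate through the embedding network, but the scoring rule is exactly this.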
Why Training Data Quality Is Everything
This is where product teams make or break their embedding quality. The model learns exactly what you teach it:
- Train on Q&A pairs → the model gets good at matching questions to answers
- Train on paraphrase pairs → the model gets good at detecting duplicate content
- Train on search queries + clicked results → the model gets good at information retrieval
If your training pairs are noisy (wrong answers labeled as correct, irrelevant documents labeled as relevant), the embedding space will be noisy. The geometry of the space is a direct reflection of the training data quality.
The PM Takeaway: When evaluating embedding models for your product, the first question isn’t “how many dimensions?” — it’s “what was it trained on?” A model trained on scientific papers will produce a very different embedding space than one trained on customer support tickets. Domain match between training data and your use case matters more than model size.
Part 2: Why Everything Gets Murkier with Length
Now for the question that trips up every team building on LLMs: why does the model lose its ability to find specific information as the input gets longer?
This isn’t about the model “forgetting.” It’s a mechanical failure across multiple systems, all of which degrade as content length increases. Let’s walk through each one.
The Attention Budget: Why the Model Can’t Pay Attention to Everything
At the heart of every transformer-based model is the attention mechanism — the system that decides which parts of the input to focus on when generating each word of the output. The critical constraint: the total attention at any step is a fixed budget.
Mental Model: The Flashlight in a Stadium
Imagine you’re standing in a dark stadium holding a flashlight. The flashlight has a fixed amount of light — you can’t make it brighter. In a small room, the flashlight illuminates everything clearly. You can see every detail. In a massive stadium, that same flashlight barely makes a dent. Everything becomes dim. You can see the general shape of things, but specific details vanish into the darkness.
That’s what happens to the attention mechanism as the context window grows. The total “light” (attention) is fixed at 1.0. The more tokens compete for that light, the dimmer each one gets.
This is mathematically enforced by a function called softmax. Here’s what it does in plain English: softmax takes a list of raw scores (how relevant is each token?) and converts them into probabilities that sum to exactly 1.0. It’s the mechanism that creates the fixed budget.
Short context (100 tokens):
Token 42 (the needle): raw score = 5.0
After softmax: attention = 0.10 (10% of the budget)
→ The needle stands out clearly.
Long context (100,000 tokens):
Token 42 (the needle): raw score = 5.0 (same relevance!)
After softmax: attention = 0.0001 (0.01% of the budget)
→ The needle is mathematically indistinguishable from noise.
The raw relevance score of the needle hasn’t changed — it’s still a 5.0. But softmax distributes the fixed budget across ALL tokens. With 100,000 tokens competing, even a highly relevant token gets buried. The signal-to-noise ratio collapses.
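You can watch this dilution happen directly. The sketch below assumes every background token scores 0.0 while the needle scores 5.0 — with different assumed background scores you get different absolute numbers than in the illustration above, but the collapse is the same:

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp / exp.sum()

def needle_attention(n_tokens, needle_score=5.0, noise_score=0.0):
    """Attention the needle receives when n_tokens compete for the
    fixed budget. All background tokens share one illustrative score."""
    scores = np.full(n_tokens, noise_score)
    scores[0] = needle_score  # same relevance regardless of context length
    return softmax(scores)[0]

print(needle_attention(100))      # ~0.60: the needle dominates
print(needle_attention(100_000))  # ~0.0015: buried by sheer volume
```

The needle's raw score never changes; only the number of competitors does.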
Mental Model: The Classroom Raise-Your-Hand
A teacher asks a question. In a class of 10 students, the one kid who knows the answer raises their hand high and gets noticed immediately — they command 10% of the teacher’s visual attention. Now put that same kid in an auditorium of 10,000 students, many of whom also raise their hands (because they think they know the answer). The teacher’s eyes can only take in so much. The right answer is still in the room, but it’s lost in a sea of raised hands.
Semantic Compression: Packing a Novel into a Paragraph
Inside a transformer, every hidden representation has a fixed width — typically 4,096 or 8,192 numbers per vector. Any summary the model forms of “what this document says” has to fit into that fixed capacity, whether the input is 100 words or 100,000 words.
Mental Model: The Fixed-Size Suitcase
You have one suitcase for a trip. If you’re packing for a weekend, everything fits perfectly — each item gets its own space, neatly folded. If you’re packing for a year-long trip using the same suitcase, you have to compress ruthlessly. Shirts get rolled. Some items get left behind entirely. And when you need to find that one specific pair of socks, good luck — everything is jammed together in an undifferentiated mass.
This is what happens when a model tries to compress a 500-page document into a fixed-dimensional vector. The sharpness of specific facts gets sacrificed. Details blur together. The model retains the broad themes (what the suitcase looks like from the outside) but loses the individual items (specific facts buried in the middle).
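One deliberately simple way to build a fixed-size summary is mean-pooling chunk vectors — real models compress far more cleverly, but the capacity math is the same. In this sketch the “needle” is one random vector averaged together with filler vectors; its visibility in the pooled summary fades as the suitcase fills:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 256  # fixed "suitcase" size, no matter how much we pack in

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def needle_visibility(n_filler, trials=20):
    """Average cosine similarity between the needle and a mean-pooled
    summary of (needle + n_filler random chunks)."""
    sims = []
    for _ in range(trials):
        needle = rng.standard_normal(DIM)
        filler = rng.standard_normal((n_filler, DIM))
        pooled = np.vstack([needle[None, :], filler]).mean(axis=0)
        sims.append(cosine(pooled, needle))
    return float(np.mean(sims))

print(needle_visibility(10))      # roughly 0.3: the needle is still visible
print(needle_visibility(10_000))  # near 0: the needle has blurred away
```

Ten thousand chunks crammed into the same 256 numbers leave almost no trace of any individual chunk.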
This directly causes the well-documented “Lost in the Middle” phenomenon: models tend to recall information from the beginning and end of long documents but struggle with content in the middle. The beginning gets encoded first (primacy effect). The end is freshest (recency effect). The middle gets compressed the hardest.
Document Position vs. Recall Accuracy:
High |  **                              **
     |    **                         **
     |      **                    **
     |        ***             ***
Low  |           *** *** ***
     +------------------------------------→
      Beginning       Middle          End
Beginning: Encoded first, gets strong representation
Middle: Compressed hardest, details blur together
End: Most recent, freshest in the representation
Position Bias: When the Model Loses Its Sense of Space
Transformers use positional encodings — signals that tell the model where each token sits in the sequence. Without these, the model would treat “the dog bit the man” and “the man bit the dog” identically (same words, no order).
But positional encodings have limits.
Mental Model: The Address System
Imagine a city where every house has an address. In a small town of 100 houses, address #1 and address #100 feel meaningfully different — you can picture the distance between them. Now imagine a city with 1,000,000 houses. The difference between address #437,291 and address #437,295 is nearly meaningless — they’re practically the same location. And address #1 feels impossibly far from address #999,999 — the “relationship” between them has stretched so thin that the model can’t meaningfully connect them.
Two specific problems emerge with long sequences:
Distance Decay: The model’s ability to connect a query at position 99,000 with a relevant fact at position 2,000 degrades because the positional “distance” between them is enormous. The model knows the fact is somewhere back there, but it can’t pin down the exact coordinates.
Out-of-Distribution Positions: Most models are trained on sequences of a certain length. When you push them beyond that length — say, feeding 128K tokens into a model trained primarily on 4K-8K sequences — the positional values enter ranges the model hasn’t seen much during training. It’s like navigating a city where the address numbers go higher than any map you’ve studied.
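The distance-decay half of this is visible in the classic sinusoidal positional encoding from the original Transformer paper (many modern models use rotary or learned encodings instead, but the geometric intuition carries over). Nearby addresses produce nearly identical encodings; distant ones drift apart. The out-of-distribution half can't be shown by a formula alone — it's about positions the model never saw during training:

```python
import numpy as np

def sinusoidal_encoding(pos, dim=64):
    """Sinusoidal positional encoding: each pair of dimensions is a
    sine/cosine at a different frequency of the position."""
    i = np.arange(dim // 2)
    angles = pos / (10000 ** (2 * i / dim))
    enc = np.empty(dim)
    enc[0::2] = np.sin(angles)
    enc[1::2] = np.cos(angles)
    return enc

def similarity(p, q):
    a, b = sinusoidal_encoding(p), sinusoidal_encoding(q)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Houses on the same block vs. opposite ends of the city.
print(similarity(1000, 1004))   # high: nearly the same address
print(similarity(1000, 90000))  # much lower: the relationship has stretched thin
```

The similarity isn't perfectly monotone in distance (the frequencies oscillate), but the trend is what the model's attention has to work with.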
Distractor Interference: The Fog of Similar Content
In a short document, there might be one sentence that answers your question. In a 100-page document, there might be dozens of sentences that look like they could be the answer but aren’t quite right.
Mental Model: The Crowded Parking Lot
You’re looking for your silver Toyota Camry in a parking lot. In a lot with 20 cars, you find it instantly — even if there’s one other silver car, you can quickly check both. In a lot with 10,000 cars, there might be 50 silver Camrys. Each one is a “near-miss” that you have to inspect and reject. The sheer volume of similar-looking options creates a fog that makes finding your specific car dramatically harder.
In transformer terms, these near-misses are called distractors. They score high on the attention mechanism because they’re semantically similar to the query. The model’s internal scoring starts to spread its attention across these distractors rather than concentrating on the single correct answer. Worse, distractors that appear more recently or more frequently in the document get an extra boost from position bias and repetition, potentially outscoring the actual needle.
Short document (1 page):
Query: "What was the Q3 revenue?"
Candidate 1: "Q3 revenue was $4.2M" ← score: 0.92 (correct!)
Candidate 2: "Q3 goals were set in April" ← score: 0.41
→ Easy. Clear winner.
Long document (100 pages):
Query: "What was the Q3 revenue?"
Candidate 1: "Q3 revenue was $4.2M" ← score: 0.31 (correct, but diluted)
Candidate 2: "Q3 revenue targets were $5M" ← score: 0.29 (distractor!)
Candidate 3: "Revenue grew 12% in Q3" ← score: 0.28 (distractor!)
Candidate 4: "Q3 budget allocation..." ← score: 0.27 (distractor!)
... 15 more similar candidates ...
→ Foggy. The correct answer barely edges out the noise.
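The scores above are illustrative, but the dynamic is easy to simulate: give the needle a slightly higher raw score than its near-misses, then watch both its attention share and its edge over the best distractor shrink as distractors accumulate (all scores here are made up, not from any real model):

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

def needle_edge(n_distractors, correct=5.0, distractor=4.7, n_noise=1000):
    """Return (needle's attention, its margin over the top distractor)
    as near-miss candidates accumulate in the context."""
    scores = np.concatenate([
        [correct],                           # the actual needle
        np.full(n_distractors, distractor),  # near-misses: similar but wrong
        np.zeros(n_noise),                   # everything else
    ])
    attn = softmax(scores)
    return attn[0], attn[0] - attn[1]

for n in (1, 20):
    share, margin = needle_edge(n)
    print(n, round(share, 3), round(margin, 3))  # both shrink as n grows
```

Each distractor takes a slice of the budget that would otherwise sharpen the contrast between right and almost-right.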
How These Four Forces Compound
The critical thing to understand is that these aren’t four independent problems — they multiply each other:
Attention Dilution → The needle gets less "light"
×
Semantic Compression → The needle's features are blurred
×
Position Bias → The needle's location is uncertain
×
Distractor Interference → The needle looks like the hay
=
The model can't reliably find specific information in long contexts.
Double the context length and each of these factors gets worse — and because they multiply, the combined effect compounds rather than adds. This is why the needle-in-a-haystack problem isn’t linear: going from 4K to 8K tokens might barely hurt performance, but going from 32K to 128K can cause a dramatic collapse.
What This Means for Products
Understanding these mechanics changes how you build AI-powered features:
For RAG Pipelines
Don’t shove entire documents into the context window. Chunk, embed, and retrieve only the most relevant pieces. This is the entire rationale behind RAG — instead of making the model search a 100-page document (haystack), you use embeddings to find the 3-5 most relevant paragraphs (pre-filtered needles) and only feed those to the model.
Bad: Stuff 100 pages into context → Ask question → Hope the model finds the answer
Good: Embed all chunks → Retrieve top 5 by cosine similarity → Feed only those to the model
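The “Good” pipeline fits in a few lines. The embedder below is a deliberately crude stand-in (a hashed bag-of-words) so the sketch runs without any model or API — in production you'd swap in a real embedding model:

```python
import hashlib
import numpy as np

def toy_embed(text, dim=256):
    """Hashed bag-of-words: a crude stand-in for a real embedding model,
    good enough to demonstrate the retrieval plumbing."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        word = word.strip(".,?!$")
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec  # unit length, so dot = cosine

def retrieve(query, chunks, k=2):
    """Rank chunks by cosine similarity to the query; keep only the top k
    to feed the model -- pre-filtered needles instead of the haystack."""
    q = toy_embed(query)
    return sorted(chunks, key=lambda c: float(toy_embed(c) @ q), reverse=True)[:k]

chunks = [
    "Q3 revenue was $4.2M, up from $3.8M in Q2.",
    "The offsite is scheduled for November.",
    "Hiring plans were revised after the budget review.",
    "Q3 revenue targets were originally set at $5M.",
]
print(retrieve("What was the Q3 revenue?", chunks))
```

A production version adds real embeddings, a vector index, and chunking logic — but the shape of the pipeline is exactly this.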
For Context Window Strategy
Treat the context window as expensive real estate. Every token you add dilutes the attention available for every other token. Front-load the most important information. Put instructions and key facts at the beginning and end (where recall is strongest). Minimize filler content in the middle.
For Evaluation
Test specifically for needle-in-a-haystack failures. Place known facts at different positions in long documents and measure retrieval accuracy. You’ll likely find a U-shaped curve: good recall at the start, good recall at the end, poor recall in the middle. Design your product around this reality instead of pretending it doesn’t exist.
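A minimal harness for this kind of test looks like the sketch below. The `ask_model` parameter is a placeholder for whatever calls your LLM; it's stubbed here with naive substring search (which, unlike a real model, never fails) so the scaffold runs standalone — all names and filler text are made up for illustration:

```python
def build_haystack(needle, n_filler, needle_position):
    """Insert the needle sentence at a relative position (0.0-1.0)
    among filler sentences."""
    filler = [f"Filler sentence number {i} about nothing in particular."
              for i in range(n_filler)]
    idx = int(needle_position * n_filler)
    return " ".join(filler[:idx] + [needle] + filler[idx:])

def evaluate(ask_model, needle, answer,
             positions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Measure recall at each needle position. With a real model you'd
    expect the U-shaped curve: strong at the ends, weak in the middle."""
    results = {}
    for pos in positions:
        context = build_haystack(needle, n_filler=200, needle_position=pos)
        reply = ask_model(context, "What is the vault code?")
        results[pos] = answer in reply
    return results

# Stub model: swap in a real API call to chart your model's actual curve.
stub = lambda context, question: "7491" if "7491" in context else "unknown"
print(evaluate(stub, needle="The vault code is 7491.", answer="7491"))
```

Run this with your actual model at several context lengths and positions, and plot accuracy by position — that plot is the evaluation artifact worth sharing with stakeholders.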
For User Expectations
Don’t tell users the model “reads” their entire document the way a human would. A more honest framing: “The model can reference your document, but it works best when you point it to the right section or ask specific questions.” Setting the right expectation prevents frustration and builds trust.
Common Misconceptions
“Models with bigger context windows don’t have this problem.” Bigger context windows help, but they don’t eliminate the problem — they just push the failure point further out. A 128K-token model will still struggle with needles buried in 100K tokens of hay. The physics of softmax dilution and fixed-dimensional compression don’t change just because the window is larger.
“The model forgot the information.” It didn’t forget — the information is still in the context. The model’s retrieval mechanism (attention) failed to surface it. It’s not a memory problem. It’s a search problem.
“Fine-tuning fixes this.” Fine-tuning can help the model learn to pay attention to certain types of content (like specific question-answer patterns), but it doesn’t change the fundamental mechanics of softmax dilution or positional encoding limits. Architecture changes (like sliding window attention or retrieval-augmented approaches) address the root cause more directly.
“Just use a longer context window.” Longer context windows are better than shorter ones, all else equal. But they come with quadratically increasing compute costs and don’t solve the underlying signal-to-noise problem. RAG (retrieving relevant chunks instead of stuffing everything into context) remains more reliable and more cost-effective for most use cases.
The Mental Models — Your Cheat Sheet
| Concept | Mental Model | One-Liner |
|---|---|---|
| Contrastive Learning | The Matchmaking Game | Push matching pairs close, non-matching pairs far |
| Hard Negatives | The tricky speed-dating match | Looks compatible on paper but isn’t |
| Contrastive Loss | The Seating Penalty | Big penalty when pairs are in the wrong position |
| Softmax / Attention Budget | Flashlight in a Stadium | Fixed light, bigger room = dimmer everywhere |
| Attention Dilution | Classroom Raise-Your-Hand | More students = harder to spot the right answer |
| Semantic Compression | The Fixed-Size Suitcase | Same suitcase, more stuff = details get crushed |
| Lost in the Middle | Primacy + Recency | Beginnings and endings stick, middles blur |
| Position Bias | The Address System | Large address ranges lose spatial meaning |
| Distractor Interference | Crowded Parking Lot | Too many silver Camrys to find yours |
Final Thought
The needle-in-a-haystack problem isn’t a temporary limitation that will be solved by the next model release. It’s a structural consequence of how attention, compression, and positioning work in transformer architectures. Bigger context windows and better training help at the margins, but the fundamental tension remains: a fixed attention budget divided across an expanding number of tokens will always dilute the signal.
Three principles to build on:
- Don’t make the model search — search for it. RAG exists because retrieval with embeddings is more reliable than attention over long contexts. Pre-filter the haystack before the model ever sees it.
- Respect the U-shaped recall curve. Put critical information at the beginning and end of your context. Structure prompts with the most important content first. Don’t bury instructions in the middle of a massive prompt.
- Test for position, not just accuracy. When evaluating your AI product, don’t just ask “did it get the right answer?” Ask “did it get the right answer when the fact was at position 500? At position 5,000? At position 50,000?” The answer will be very different, and your product design should account for that.
In the next post, we’ll dive into the Transformer architecture itself — the engine that powers everything from GPT to Claude to Gemini. We’ll see how attention mechanisms actually work, why context windows have the limits they do, and what scaling laws mean for the future of AI products.