[MI 4] The BERT Family: A Family of Different Experts

You’ve been there. You type “my package never showed up” into a company’s support chatbot, and it confidently responds with a step-by-step guide on how to update your shipping address. Same general topic — shipping. Completely wrong answer. You rephrase: “I never received my order.” This time, it gives you the return policy. Still wrong. The system clearly “knows” about shipping, but it can’t tell the difference between a missing package, a wrong address, and a return request.

This isn’t a language problem. The chatbot understood every word you typed. It’s a matching problem — the system retrieved the wrong article from its knowledge base, because the model powering the search doesn’t organize its internal map the way you’d expect. It doesn’t place “my package never arrived” close to “missing delivery” in its vector space. It might place them on opposite ends of the room — and neither you nor the chatbot would ever know.

This post unpacks the three embedding models you’ll run into when building any retrieval or classification system — BERT, SBERT, and SetFit. Same family tree, radically different behavior in how they organize meaning. No equations — just the mental models that will help you pick the right one.

The Core Question: What Happens After Projection?

In Making Sense of Embeddings, we established that embeddings turn things — words, sentences, products — into points in a high-dimensional space where closeness means similarity. All three models in this post do that. They all convert text into numerical coordinates.
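To make "closeness" concrete, here is a minimal sketch of cosine similarity, the measure this post keeps coming back to. The three-dimensional vectors are invented for illustration (real models produce hundreds of dimensions), and the variable names are assumptions, not output from any actual model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, hand-written for this example.
missing_package = [0.9, 0.1, 0.2]
lost_delivery = [0.8, 0.2, 0.1]
return_policy = [0.1, 0.9, 0.3]

close = cosine_similarity(missing_package, lost_delivery)
far = cosine_similarity(missing_package, return_policy)
print(f"missing package vs lost delivery: {close:.2f}")
print(f"missing package vs return policy: {far:.2f}")
```

In a well-organized space, the first score is high and the second is low. Whether a given model actually produces that layout is the subject of the rest of this post.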

The question isn’t whether they create embeddings. The question is: what does the geography of the resulting vector space look like?

Most people assume that if a model “understands” two sentences mean the same thing, it must place their embeddings close together. That assumption is wrong — and it’s the single most important gotcha in this entire post.

Part 1: BERT — The Brilliant Professor Who Can’t Organize a Bookshelf

BERT (Bidirectional Encoder Representations from Transformers) is the foundational model. It reads text bidirectionally: every word is interpreted using the words on both its left and its right at once, which gives it a deep understanding of context. It knows that “bank” in “river bank” and “bank” in “bank account” mean completely different things. That level of nuance was a breakthrough when Google published it in 2018.

But here’s the catch.

Mental Model: The Brilliant, Messy Professor

Imagine a university professor who has read every book in every discipline. You hand her a sentence — “The food was horrible” — and she understands perfectly: negative sentiment, informal register, food domain. You hand her another — “The meal was really bad” — and she understands that too: same meaning, same negativity, same topic.

Now you ask her to pin both sentences on a giant wall map. She pins “The food was horrible” near the cafeteria corner. She pins “The meal was really bad” next to the parking lot. Miles apart.

Why? Because she was never asked to organize the map. She was asked to understand sentences. She’s brilliant at that. But understanding and organizing are two different jobs.

This is BERT’s fundamental limitation for retrieval tasks. It generates embeddings that contain semantic meaning, but it does not guarantee that semantically similar sentences will land near each other in the vector space.

Why Does This Happen? The Objective Function.

This is worth pausing on, because it’s the root of all the confusion.

When BERT is pre-trained, it has two jobs: predict masked (hidden) words in a sentence, and determine whether two sentences follow each other naturally. That’s it. The loss function — the thing that tells the model “you’re getting better” or “you’re getting worse” — is optimized for these two tasks. At no point during training does anyone tell BERT: “Hey, make sure similar sentences end up with similar embeddings.”
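As a rough sketch of the first job, here is how a masked-word training example can be built from a plain sentence. This is a simplification under stated assumptions: the 15% mask rate is the figure from the original BERT paper, but real BERT works on subword tokens and sometimes swaps in a random word or leaves the word unchanged instead of always writing `[MASK]`. The helper name `mask_tokens` is made up for this example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens; return the masked input and the hidden targets."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # the word the model must predict
        masked[pos] = MASK          # what the model actually sees
    return masked, targets

tokens = "my package never showed up at the front door".split()
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

Notice what the training signal is: "guess the hidden word." Nothing in it rewards placing two paraphrases near each other.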

So when you fine-tune BERT for a classification task like sentiment analysis, here’s what actually happens under the hood:

1. BERT generates an embedding for the input sentence.
2. That embedding gets fed into a linear layer, a simple mathematical function.
3. The output of the linear layer gets passed through a sigmoid or softmax function, which converts it into a probability (say, 95% negative sentiment).

The training process adjusts both BERT’s weights and the linear layer’s parameters to minimize classification error. But here’s the critical insight: the linear layer does most of the heavy lifting. It learns to map scattered embeddings to the same label, so nothing forces the embeddings themselves to move closer together.

Mental Model: The Universal Translator

Imagine two people speaking in completely different languages — one in Japanese, one in Portuguese. Both are saying “this restaurant is terrible.” A skilled translator sitting between them can tell you they’re both saying the same thing, even though the sounds they’re producing are wildly different.

The linear layer + sigmoid is that translator. It compensates for the fact that BERT’s embeddings are scattered. It translates different embedding locations into the correct probability. The classification works — but the embeddings themselves never moved closer together.

Think about what this means concretely. “The food was horrible” might produce an embedding at coordinates [0.8, -0.3, 0.5, …]. “The meal was really bad” might produce an embedding at [-0.1, 0.7, -0.4, …]. Completely different locations. But the linear layer learns to take both of those scattered points and map them to a high probability of “negative sentiment.”

BERT's classification pipeline:

"food was horrible"   →  embedding [0.8, -0.3, 0.5]   →  linear layer + sigmoid  →  0.95 (negative)
"meal was really bad" →  embedding [-0.1, 0.7, -0.4]  →  linear layer + sigmoid  →  0.93 (negative)

Embeddings: far apart in space
Probabilities: both correct
The linear layer and sigmoid compensated. The embeddings didn't move.

The error rate drops because the linear function’s parameters adjust. But the embedding function itself does not change enough to force similar sentences together. That’s not the objective function. It is wrong to expect BERT to cluster semantically similar embeddings in the vector space.
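Here is a small sketch of that compensation using the two example embeddings from above. The weights, the bias, and the third "positive review" embedding are hand-picked assumptions chosen to make the point; they are not values from a trained model, so the probabilities will not match the 0.95 and 0.93 in the diagram exactly.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def probability_negative(embedding, weights, bias):
    """Linear layer followed by a sigmoid: embedding -> probability of 'negative'."""
    logit = sum(w * x for w, x in zip(weights, embedding)) + bias
    return sigmoid(logit)

horrible_food = [0.8, -0.3, 0.5]   # "The food was horrible" (from the example above)
bad_meal = [-0.1, 0.7, -0.4]       # "The meal was really bad" (from the example above)
great_food = [-0.6, -0.5, 0.2]     # hypothetical positive review, added for contrast

# Hypothetical classifier parameters, hand-picked rather than learned.
weights = [4.0, 4.0, 1.0]
bias = 1.0

similarity = cosine_similarity(horrible_food, bad_meal)
p_horrible = probability_negative(horrible_food, weights, bias)
p_bad = probability_negative(bad_meal, weights, bias)
p_great = probability_negative(great_food, weights, bias)

print(f"cosine similarity of the two negative reviews: {similarity:.2f}")
print(f"P(negative): horrible={p_horrible:.2f}, bad={p_bad:.2f}, great={p_great:.2f}")
```

The two negative reviews point in nearly opposite directions, yet both receive a high "negative" probability while the positive review does not. The classifier is correct without the embeddings ever becoming neighbors.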

Where It’s Used: BERT is still the go-to for tasks where you need deep understanding of a single piece of text — sentiment analysis with large labeled datasets, named entity recognition (finding names, dates, locations inside documents), and extractive question answering where the model highlights the exact answer span within a provided paragraph.

Trade-offs:

Excellent at understanding internal structure and context within a single sentence
Classification accuracy can be very high because the linear layer compensates for scattered embeddings
Embeddings are not spatially meaningful — cosine similarity between BERT embeddings does not reliably reflect semantic similarity
Computationally expensive for pairwise comparison: to score sentence pairs with BERT itself, each pair has to be fed through the model together, so comparing a query against 500 articles requires 500 forward passes through the full model

Part 2: SBERT — The Librarian Who Redesigns the Entire Map

SBERT (Sentence-BERT) takes BERT’s deep understanding and adds a completely new training regime whose entire purpose is to fix the geography problem.

Mental Model: The Librarian

Remember the brilliant professor? Now imagine you hire a librarian. The librarian has the same knowledge as the professor — she’s read every book — but she has a different mandate: organize the library so that similar books are on the same shelf, same section, same floor. She doesn’t just understand each book. She physically moves them so that anyone walking through the library can find related material just by walking to the nearest shelf.

SBERT is that librarian. It takes BERT’s understanding and restructures the entire vector space around one principle: similar meaning = nearby location.

How It Works: Contrastive Learning and Siamese Networks

SBERT’s training setup looks very different from BERT’s:

Siamese Architecture. The model takes in two sentences at a time, not one. Both sentences pass through the same BERT-based network (hence “siamese” — twin networks sharing identical weights).

Contrastive Loss Function. This is the ingredient that changes everything. The loss function does two things simultaneously:

The Pull: If the two input sentences are semantically similar (“The food was horrible” and “The meal was really bad”), the loss function penalizes the model for placing their embeddings far apart. It forces the embeddings closer together.
The Push: If the two input sentences are semantically different (“The food was horrible” and “The hotel had a great pool”), the loss function penalizes the model for placing them too close. It pushes dissimilar embeddings apart.

This is contrastive learning — the same principle used in How Smart Vector Search Works. Not only does it form clusters, it actively spaces those clusters apart so the boundaries between topics are clean.

BERT's Vector Space               SBERT's Vector Space
(Scattered)                       (Clustered)

"bad meal" •                         • "bad meal"
                                     • "horrible food"
      • "great pool"                 • "terrible dinner"

  • "horrible food"

                                     • "great pool"
• "terrible dinner"                  • "amazing view"
       • "amazing view"

Over millions of training pairs, this push-and-pull reshapes the entire geometry of the embedding space. Similar sentences form tight clusters. Dissimilar sentences land in distant neighborhoods. And the gaps between clusters are meaningful — they reduce the chance of contamination during retrieval.
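The pull and the push can be written as one small function. This is a sketch of the classic margin-based contrastive loss, offered as an illustration of the idea rather than the exact loss any particular SBERT checkpoint was trained with. The 2-D embeddings and the `margin` value are assumptions. In a siamese setup, both embeddings would come from the same encoder with the same weights.

```python
import math

def contrastive_loss(emb_a, emb_b, is_similar, margin=1.0):
    """Penalize similar pairs for being far apart, dissimilar pairs for being too close."""
    distance = math.dist(emb_a, emb_b)
    if is_similar:
        return distance ** 2                   # the pull: any distance costs something
    return max(0.0, margin - distance) ** 2    # the push: only costs when inside the margin

horrible_food = [0.0, 0.0]

# Similar pair: loss is large when far apart, small when close.
pull_far = contrastive_loss(horrible_food, [3.0, 4.0], is_similar=True)
pull_near = contrastive_loss(horrible_food, [0.3, 0.4], is_similar=True)

# Dissimilar pair: loss is positive when too close, zero once past the margin.
push_near = contrastive_loss(horrible_food, [0.3, 0.4], is_similar=False)
push_far = contrastive_loss(horrible_food, [3.0, 4.0], is_similar=False)

print(pull_far, pull_near, push_near, push_far)
```

Training nudges the encoder's weights to shrink this loss across many pairs. That is what moves the points, instead of leaving the job to a classifier on top.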

Why This Matters for Search and RAG

Let’s make this concrete. You’re building a support chatbot with 500 knowledge base articles. All articles get embedded and stored in a vector database. When a user asks “my package never arrived,” the system:

1. Converts the query into an embedding.
2. Runs a cosine similarity search against all stored article embeddings.
3. Retrieves the top-k closest articles.
4. Passes those articles to an LLM to generate an answer.

With BERT embeddings, step 2 is unreliable. The query embedding might land near an article about return policies instead of missing deliveries, because BERT never organized its space for proximity-based comparison.

With SBERT embeddings, step 2 works as intended. “My package never arrived” lands near “Handling Missing Deliveries” because SBERT explicitly trained to cluster delivery-related content together and push return-related content away.
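The search-and-retrieve steps can be sketched in a few lines. The article titles other than “Handling Missing Deliveries” and all of the embedding values below are hypothetical stand-ins for what an SBERT-style model and a vector database would provide. The function `top_k` is an illustrative name, not a real library call.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_embedding, article_embeddings, k=2):
    """Rank stored articles by cosine similarity to the query; return the k best titles."""
    scored = [
        (cosine_similarity(query_embedding, embedding), title)
        for title, embedding in article_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:k]]

# Hypothetical pre-computed article embeddings (toy 3-D values).
article_embeddings = {
    "Handling Missing Deliveries": [0.9, 0.1, 0.1],
    "Updating Your Shipping Address": [0.5, 0.7, 0.1],
    "Return Policy": [0.1, 0.2, 0.9],
}

# Hypothetical embedding of the query "my package never arrived".
query_embedding = [0.8, 0.2, 0.1]

results = top_k(query_embedding, article_embeddings, k=2)
print(results)
```

The code is identical whichever model produced the vectors. It only returns the right article if the model placed the query near the right article in the first place.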

The speed difference matters too. With SBERT, you pre-compute all article embeddings once. Each incoming query requires one forward pass plus simple cosine similarity arithmetic. With raw BERT used as a pairwise scorer, where the query and each article are fed through the model together, comparing a query against 500 articles would mean 500 full forward passes through the transformer. That is the difference between milliseconds and seconds.
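To see where the cost goes, here is a toy accounting of forward passes. `CountingEncoder` is a made-up stand-in that only counts calls; it does not run a transformer, and its returned values are placeholders.

```python
class CountingEncoder:
    """Hypothetical stand-in for a transformer that counts forward passes."""

    def __init__(self):
        self.forward_passes = 0

    def encode(self, text):
        # SBERT-style: one text in, one embedding out.
        self.forward_passes += 1
        return [float(len(text))]  # placeholder embedding, not a learned value

    def score_pair(self, query, article):
        # Pairwise scoring: query and article go through the model together.
        self.forward_passes += 1
        return 0.0  # placeholder score

articles = [f"article {i}" for i in range(500)]
query = "my package never arrived"

# Pre-computed embeddings: 500 passes once, offline.
embedder = CountingEncoder()
index = [embedder.encode(article) for article in articles]
offline_passes = embedder.forward_passes

# At query time: a single pass, then cheap arithmetic against the stored index.
embedder.encode(query)
online_passes_embedding = embedder.forward_passes - offline_passes

# Pairwise scoring: every query pays for all 500 articles again.
pair_scorer = CountingEncoder()
for article in articles:
    pair_scorer.score_pair(query, article)
online_passes_pairwise = pair_scorer.forward_passes

print(offline_passes, online_passes_embedding, online_passes_pairwise)
```

The offline cost is paid once, while the per-query cost is paid for every user message. That is why pre-computed embeddings scale and pairwise scoring does not.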

Where It’s Used: Most RAG pipelines, semantic search engines, knowledge base chatbots, and duplicate-question detectors. When Pinecone, Weaviate, or ChromaDB returns “top-k similar results,” the embeddings powering that search almost always come from SBERT-style models or later embedding models trained for the same job (such as OpenAI’s Ada and Cohere Embed).

Trade-offs:

Embeddings are spatially meaningful — cosine similarity reliably reflects semantic similarity
Pre-computed embeddings make real-time retrieval extremely fast
Requires paired training data (similar and dissimilar sentence pairs) to train the contrastive objective
Designed for retrieval (finding similar things), not for classification (sorting things into buckets) — though it can be adapted for classification, that’s not its sweet spot

Part 3: SetFit — The Specialist Who Works With Almost No Examples

SetFit is the newest member of this family, built directly on top of SBERT. It inherits the same contrastive learning and clustering behavior. The difference is in what it’s optimized for.

Mental Model: The Border Patrol Agent

If SBERT is the librarian who organizes books by similarity, SetFit is the border patrol agent who draws extremely precise boundaries between territories — even when she’s only seen a handful of travelers from each country.

SBERT says: “These two books are related, put them on the same shelf.” SetFit says: “This support ticket is a billing issue, not a technical bug — and I’m confident about it even though I’ve only seen 8 billing tickets before.”

The Few-Shot Superpower

SetFit’s defining trait is strong text classification performance with very few labeled examples — as few as 5–10 per category. This is called few-shot learning.

How it works:

1. Start with SBERT. SetFit uses a pre-trained SBERT model as its foundation, inheriting the organized vector space.
2. Contrastive Fine-Tuning. It takes your small labeled dataset and generates pairs: positive pairs (same category) and negative pairs (different categories). It fine-tunes the SBERT model to tighten the clusters around your specific categories.
3. Classification Head. After fine-tuning, it adds a simple classifier on top. Because the embeddings are now hyper-organized around your categories, even a basic logistic regression layer achieves high accuracy.

Same family, same contrastive learning DNA, but with the classification layer baked in and calibrated for tiny datasets.
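The pair-generation step is where a tiny dataset gets stretched. Below is a minimal sketch that enumerates every pair from a handful of labeled tickets. The example tickets and the helper name `make_pairs` are invented for illustration, and the actual SetFit library samples pairs rather than necessarily enumerating all of them.

```python
from itertools import combinations

def make_pairs(labeled_examples):
    """Turn (text, label) examples into (text_a, text_b, is_similar) training pairs."""
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(labeled_examples, 2):
        is_similar = 1 if label_a == label_b else 0
        pairs.append((text_a, text_b, is_similar))
    return pairs

# Hypothetical few-shot dataset: two examples per category.
labeled_examples = [
    ("I was charged twice this month", "billing"),
    ("My invoice shows the wrong amount", "billing"),
    ("The app crashes when I log in", "technical"),
    ("The page will not load on my phone", "technical"),
]

pairs = make_pairs(labeled_examples)
positive_pairs = [pair for pair in pairs if pair[2] == 1]
negative_pairs = [pair for pair in pairs if pair[2] == 0]

print(len(pairs), len(positive_pairs), len(negative_pairs))
```

Four labeled tickets become six training pairs, and the number of pairs grows roughly with the square of the number of examples. That growth is a large part of why a handful of labels per class can be enough for the contrastive step.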

When to Use SetFit vs. SBERT

The distinction is about the task shape:

SBERT answers: “What are the most similar articles to this query?” — a retrieval problem, searching for neighbors in an open-ended space.
SetFit answers: “Which bucket does this input belong to?” — a classification problem, sorting inputs into predefined categories.

For a support chatbot, if you want to classify user intent — “Is this customer asking about billing, delivery, or returns?” — SetFit is your tool, especially when you only have a handful of labeled examples per intent. But for finding the specific article that answers the user’s question from a pool of 500? SBERT is the right architecture.
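Once the embeddings are tightly clustered by intent, the classifier on top can be very simple. The sketch below uses a nearest-centroid rule as a deliberately simpler stand-in for the logistic regression head described earlier. The 2-D embeddings are invented, standing in for what a fine-tuned model would output.

```python
import math

def centroid(vectors):
    """Average position of a group of embeddings."""
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]

def classify(embedding, centroids):
    """Return the label whose centroid is closest to the embedding."""
    return min(centroids, key=lambda label: math.dist(embedding, centroids[label]))

# Hypothetical few-shot embeddings after contrastive fine-tuning (toy 2-D values).
few_shot_embeddings = {
    "billing": [[0.9, 0.1], [0.8, 0.2]],
    "delivery": [[0.1, 0.9], [0.2, 0.8]],
    "returns": [[-0.8, -0.7], [-0.9, -0.8]],
}

centroids = {label: centroid(vectors) for label, vectors in few_shot_embeddings.items()}

# Hypothetical embedding of a new ticket, e.g. "Why was my card charged again?"
new_ticket = [0.7, 0.3]
predicted_intent = classify(new_ticket, centroids)
print(predicted_intent)
```

The output is a label from a fixed set, not a ranked list of neighbors. That is the practical difference between the classification shape and the retrieval shape.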

Where It’s Used: Routing support tickets to the right team with minimal training data, classifying user intent in early-stage chatbots, detecting policy violations when labeled violations are rare, sorting customer feedback into categories during an MVP phase when you don’t have thousands of labeled examples yet.

Trade-offs:

Exceptional accuracy with very few labeled examples (5–10 per class)
Inherits SBERT’s clustered vector space, so embeddings are spatially meaningful
Designed for classification, not open-ended retrieval — it sorts into predefined buckets, it doesn’t search a library
Performance improves as you add more labeled data, but for large-dataset classification, a standard fine-tuned BERT may still outperform it

Common Misconceptions

“BERT embeddings are bad.” They’re not bad — they’re just not designed for comparison. BERT embeddings are rich in semantic information. A classifier sitting on top of them can achieve excellent accuracy. The problem only shows up when you try to use them for similarity search, because the geography of the space wasn’t trained for that.

“SBERT and SetFit are completely different models.” SetFit literally uses SBERT as its foundation. Think of them as the same engine with different transmissions — SBERT is geared for retrieval, SetFit is geared for classification. Both use contrastive learning. Both cluster similar text together.

“If I use SBERT, I don’t need to worry about retrieval quality.” SBERT fixes the embedding organization problem, but it doesn’t fix data quality problems. If your knowledge base articles are poorly written, contradictory, or missing critical information, SBERT will faithfully retrieve the closest article, and the closest article will still be the wrong one. The garbage-in-garbage-out principle still applies.

“BERT is outdated.” The specific BERT model is rarely used raw in production today, but the architecture is everywhere. SBERT is BERT with a different training objective. Many modern embedding models, including OpenAI’s text-embedding-3, Cohere’s Embed v3, and Google’s Gecko, are transformer-based text encoders in the same broad lineage. Understanding BERT means understanding the DNA of the entire family.

The Mental Models — Your Cheat Sheet

Concept	Mental Model	One-Liner
BERT embeddings	The Brilliant, Messy Professor	Understands everything, organizes nothing
Why BERT scatters	The objective function	Trained to classify correctly, not to cluster spatially
Sigmoid compensation	The Universal Translator	The linear layer fixes the output even though the embeddings are scattered
SBERT’s approach	The Librarian	Same knowledge, but reorganizes the entire space by similarity
Contrastive learning	Push and Pull	Pull similar sentences together, push dissimilar ones apart
SetFit’s specialty	The Border Patrol Agent	Draws precise boundaries with almost no examples
SBERT vs. SetFit	Retrieval vs. Classification	“Find similar things” vs. “Sort into buckets”

Which Model Fits Your Task?

Factor	BERT	SBERT	SetFit
Core Strength	Deep context understanding	Fast semantic comparison	Few-shot classification
Clusters similar text?	No	Yes (primary goal)	Yes (inherited from SBERT)
Search/retrieval speed	Slow (full forward pass per pair)	Very fast (pre-computed embeddings)	Fast
Labeled data required	High	Medium	Very low (5–10 per class)
Ideal for…	Classifying a single document	Building search / RAG systems	Sorting niche categories with little data
Cosine similarity reliable?	No	Yes	Yes

Final Thought

Here’s the gotcha that ties everything together: BERT-based retrieval doesn’t crash. It retrieves the wrong articles with high confidence. There’s no error message. The cosine similarity scores look reasonable. The chatbot generates a fluent, well-structured answer — from the wrong source. This is the worst kind of failure because it’s invisible.

Three things to carry with you:

BERT understands meaning but doesn’t organize the map. Its embeddings contain semantic information, but two sentences with identical meaning can land in completely different locations. The classification accuracy you see comes from the linear layer compensating — not from the embeddings being well-organized. It is wrong to expect BERT to cluster semantically similar text.
SBERT and SetFit redraw the map entirely. Both use contrastive learning to pull similar embeddings together and push dissimilar embeddings apart. The result is an organized vector space where cosine similarity actually reflects semantic similarity. This is why they work for retrieval and search — the spatial structure is the feature.
They don’t just cluster — they space the clusters apart. Contrastive learning ensures that dissimilar clusters are separated by meaningful gaps. When your chatbot searches for “missing package,” you don’t want it accidentally pulling articles from the “return policy” cluster. The push-apart mechanism is what prevents that contamination.

Pick the model that was trained for your task — not the one that sounds the smartest.

Archit Sharma
