• Convolutional Neural Networks (CNNs) are often associated with image recognition, but they’re equally powerful for text tasks – and they’re simpler to understand when you think in terms of text. When Gmail filters spam, when TikTok flags toxic comments, when your phone recognizes “Hey Siri,” a CNN is likely working behind the scenes. This post will help you understand the intuition behind CNNs and visualize the process of training and inference.

    What is a CNN, really?

    A CNN is a machine learning model – a system that learns patterns from examples and uses those learned patterns to make predictions or decisions on new data. Think of it like training a security guard for a building that only allows adults to pass. You show the guard 100 photos of legitimate visitors (adults) and 100 photos of ineligible visitors (kids). After seeing enough examples, the guard learns the subtle differences. Next time someone walks in, the guard can quickly say: “That person looks like an adult” or “That person looks like a kid” and accordingly allow or deny entry.

    A CNN does the exact same thing with text (and with images). You show it 10,000 positive reviews and 10,000 negative reviews – each review accompanied by its label: 1 (Good) or 0 (Bad). The CNN model will learn from the examples and afterwards, when you ask it to judge a new review, it can look at the text and predict: “This is positive” or “This is negative.”

    What problem does a CNN solve?

    The core problem: How do you automatically understand or classify text without a human reading every word?

    If you’re running Gmail, you can’t hire 1 million people to read every email and decide if it’s spam. You need an automated system that learns from examples and then makes fast, consistent decisions on new incoming emails.

    A CNN solves this by learning which text patterns are predictive of the outcome you care about. If you’re detecting spam, it learns that phrases like “Click here now,” “Act immediately,” and “Limited time offer” are signals of spam. If you’re detecting positive sentiment, it learns that phrases like “I loved it,” “Highly recommend,” and “Best purchase ever” signal positivity.

    How Does a CNN Work? Learning Detector Phrases (Training)

    Like a human, a CNN needs to be taught to recognize patterns before you ask it to predict (or classify, categorize, or infer, in machine learning lingo) the label of a new review. This happens during training.

    Imagine you give the CNN a pile of 100,000 real reviews from your platform – each one labeled either POSITIVE or NEGATIVE by humans or through clear signals (like star ratings). The CNN’s job during training is simple: “Figure out which text patterns are predictive of positive vs. negative reviews.”

    It does this by:

    1. Taking windows of 2 – 4 words and sliding them over the text of each review – just like we’re about to do. Think of the sliding window as a mask with a slot of n words cut into it – as you slide it over the text, you see only n words at a time, and you can count how many times each n-word pattern appears in good reviews and in bad reviews.
    2. Checking: Do reviews with this window tend to be positive or negative? — For example: “absolutely terrible” appears mostly in negative reviews. “Highly recommended” appears mostly in positive reviews. In simple terms, you can imagine the model developing a frequency table for each 2 – 4 word window that can look like this:
    Detector phrase (inference)            # in good reviews    # in bad reviews
    Highly recommended (seems good)                     4556                  23
    Absolutely terrible (seems bad)                      157                3445
    is really (seems neutral)                           4557                3956
    3. Building detectors based on these patterns – It creates Detector A: “If you see ‘absolutely terrible,’ that’s a NEGATIVE signal.” It creates Detector B: “If you see ‘highly recommend,’ that’s a POSITIVE signal.”
    4. Strengthening detectors that work – Over thousands of examples, detectors that accurately predict the review’s label get stronger – meaning that when they trigger, the model takes notice. Detectors that don’t predict anything useful get weaker – meaning the model ignores them when they trigger, primarily because they fire for both good and bad reviews.

    After training on 100,000 reviews, the CNN has learned dozens of detectors. Each one recognizes a specific pattern and knows what signal to generate.
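
    If it helps to see that counting intuition as code, here is a toy sketch in plain Python – it is not how a real CNN is trained (real detectors are learned as convolution filters over word embeddings), but it captures the frequency-table idea above. The four-review dataset is made up for illustration:

    import re
    from collections import Counter

    def ngrams(text, n):
        """Return every window of n consecutive words in the text."""
        words = re.findall(r"[a-z']+", text.lower())
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    # Tiny labeled dataset standing in for the 100,000 reviews (1 = good, 0 = bad).
    reviews = [
        ("Highly recommended, great food", 1),
        ("Absolutely terrible experience", 0),
        ("The food is really good", 1),
        ("The service is really slow", 0),
    ]

    good_counts, bad_counts = Counter(), Counter()
    for text, label in reviews:
        for n in (2, 3, 4):                        # windows of 2-4 words
            target = good_counts if label == 1 else bad_counts
            target.update(ngrams(text, n))

    # A phrase that appears far more often in one class makes a useful detector.
    for phrase in ["highly recommended", "absolutely terrible", "is really"]:
        print(phrase, good_counts[phrase], bad_counts[phrase])
    # highly recommended 1 0
    # absolutely terrible 0 1
    # is really 1 1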

    How Does a CNN Use Those Detectors? (Inference)

    Now that training is done, a brand new review comes in: “The food was absolutely terrible and not worth the price.”

    The CNN has already learned its detectors. Now it uses them. This is called inference – applying what you learned to make a prediction on new data. The CNN slides its learned detectors across the new review, generating signals as it goes. Here’s what happens:

    Let’s walk through it with that review: “The food was absolutely terrible and not worth the price.”

    Your goal: Decide if this review is positive or negative.

    A CNN doesn’t try to understand the entire sentence at once. Instead, it does this:

    1. Break the problem into small pieces – It looks at the text 2 – 4 words at a time (a “window”)
    2. Check each piece against learned patterns – At each window position, it asks: “Does this match any pattern I’ve learned?”
    3. Collect the signals – Every time a pattern matches, it records that signal
    4. Make a final decision – Based on all the signals collected, it predicts the outcome

    Essentially, when the CNN is parsing a new input text, various detectors light up when they see a match. Instead of remembering the actual words, the CNN builds a fingerprint of which patterns it recognized across the entire text. Based on this fingerprint, it makes a final decision: classify the sentiment, predict the next word, or flag for spam. That’s it. That’s the whole idea.

    Example: How the sliding window works

    Your review: “The food was absolutely terrible and not worth the price”

    The CNN slides a 3-word window across the text, and as it does:

    • One detector specializes in “negative sentiment phrases” and lights up for “absolutely terrible”
    • Another specializes in “complaint patterns” and lights up for “not worth”
    • A third watches for intensifiers and lights up for “absolutely”

    When a detector recognizes its pattern, it fires. If it doesn’t match anything the detector has learned, it stays quiet.

    👁 [The food was]           → Signal: Neutral start
       👁 [food was absolutely]  → Signal: Intensifier coming
          👁 [was absolutely terrible] → STRONG NEGATIVE  ✓
             👁 [absolutely terrible and] → Still negative  ✓
                👁 [terrible and not] → VERY negative  ✓
                   👁 [and not worth] → NEGATIVE PATTERN ALERT  ✓
                      👁 [not worth the] → STRONG NEGATIVE  ✓
                         👁 [worth the price] → Complaint about cost  ✓
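
    A tiny sketch of how those windows are produced (plain Python, no ML library – the review string is the one from the diagram):

    def windows(text, size=3):
        """Slide a window of `size` words across the text, one step at a time."""
        words = text.split()
        return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

    review = "The food was absolutely terrible and not worth the price"
    for w in windows(review, size=3):
        print(w)
    # The food was
    # food was absolutely
    # was absolutely terrible
    # ... and so on – exactly the eight windows shown above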
    
    

    Local context matters: Word order and proximity are key signals. “Not good” and “good not” are fundamentally different, and CNNs naturally capture this through their sliding window approach.

    • Speed + Simplicity: Unlike Transformers, which consider relationships between every pair of words in a text, CNNs focus only on local neighborhoods. This makes them faster and often more interpretable.
    • Pattern is Everything: The CNN learns which n-gram (this means a sequence of n words) combinations are meaningful signals. It doesn’t understand language semantically – it just gets very good at recognizing which local patterns predict the outcome you care about.

    Part 2: Pattern Recognition Creates a Fingerprint

    Here’s the key insight: The CNN doesn’t store the words. It stores which patterns it recognized.

    Instead of remembering “The food was absolutely terrible and not worth the price,” the CNN creates a summary – a fingerprint of activations:

    Negative sentiment detector:   [QUIET] [FIRE] [FIRE] [FIRE] [FIRE]
    Intensifier detector:          [QUIET] [QUIET] [FIRE] [QUIET] [QUIET]
    Cost complaint detector:       [QUIET] [QUIET] [QUIET] [FIRE] [FIRE]
    Positive phrases detector:     [QUIET] [QUIET] [QUIET] [QUIET] [QUIET]
    

    The final decision layer sees this fingerprint and asks: “What does this activation pattern tell me?”

    With 4 negative fires, 0 positive fires, and 2 complaint fires, the answer is clear: “This is a negative review.”

    This is why CNNs work so well for sentiment analysis, spam detection, and toxicity flagging – the fingerprint of local patterns is highly predictive of the overall classification.
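
    If you want to see the fingerprint idea as code, here is a hypothetical sketch: each “detector” is just a set of phrases it responds to, and the fingerprint records whether each detector fired anywhere in the text – roughly what max-pooling over window positions does in a real CNN. The detector names and phrase lists are invented for illustration:

    def windows(text, size):
        words = text.lower().split()
        return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

    # Hypothetical detectors: phrase sets "learned" during training.
    detectors = {
        "negative_sentiment": {"absolutely terrible", "terrible and", "not worth"},
        "intensifier":        {"absolutely"},
        "cost_complaint":     {"worth the price", "too expensive"},
        "positive_phrases":   {"highly recommend", "loved it"},
    }

    review = "The food was absolutely terrible and not worth the price"

    # Fingerprint: did each detector fire anywhere in the review?
    fingerprint = {
        name: any(w in phrases
                  for size in (1, 2, 3)            # windows of 1-3 words
                  for w in windows(review, size))
        for name, phrases in detectors.items()
    }
    print(fingerprint)
    # {'negative_sentiment': True, 'intensifier': True,
    #  'cost_complaint': True, 'positive_phrases': False}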

    Part 3: Two Different Tasks, Same Architecture

    Text Prediction (Language Modeling)

    You give the CNN: “The food was absolutely”

    The CNN slides its window and asks: What usually follows an intensifier like “absolutely”?

    The detector trained on “intensifier patterns” fires up and signals: Strong adjectives follow intensifiers – likely TERRIBLE, AMAZING, or WONDERFUL.

    The CNN predicts the next word is probably “TERRIBLE.”

    If it guesses wrong, it learns. It adjusts its detectors to better recognize intensifier patterns and what words typically follow them.
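
    To make that concrete, here is a toy sketch of next-word prediction by simple counting – not how a CNN language model really works (that uses learned filters over embeddings), but the same “patterns predict the next word” intuition. The corpus is made up:

    from collections import Counter, defaultdict

    corpus = [
        "the food was absolutely terrible",
        "the view was absolutely amazing",
        "the host was absolutely wonderful",
        "the food was absolutely terrible",
    ]

    # Count which word follows each word (a 1-word "window" for simplicity).
    next_word = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            next_word[a][b] += 1

    prompt = "the food was absolutely"
    last = prompt.split()[-1]
    print(next_word[last].most_common(1))   # [('terrible', 2)] – the most likely next word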

    Text Classification (Sentiment or Spam)

    You give the CNN the full review.

    It slides its window across all words, collecting fingerprints (which patterns it recognized).

    By the end: “I saw 8 negative pattern matches, 0 positive matches, multiple intensifiers, cost complaints.”

    Final verdict: “Negative review.” Or if you’re filtering spam: “This is spam.” Or if you’re detecting toxicity: “Flag this for review.”

    Part 4: Why CNNs Win at Local Pattern Recognition

    The real magic is that CNNs are obsessed with local context in a way that serves many text tasks perfectly.

    • When you see “not” followed by “good,” the order matters. “good not” is bizarre. A CNN’s sliding window naturally captures that word order and proximity are key signals.
    • When filtering spam, patterns like “Click now” or “Get rich quick” are always suspicious. These are local, repeated patterns that don’t need global context to be recognized.
    • When detecting toxicity, harmful language often clusters in specific phrase combinations. A detector trained on “slur + context” will catch these reliably.

    Compare this to Transformers, which consider relationships between every pair of words in the entire text. Transformers are more powerful for understanding nuance and long-range meaning – they’re philosophers reading for deep understanding. But they’re slower, heavier, and often overkill for tasks where local patterns are the signal.

    For spam, sentiment, toxicity, keyword spotting, and voice wake words – CNNs are often the right tool.

    Part 5: Real-World Applications

    • Spam detection: Gmail and other email systems use CNNs (often alongside other models) to catch “Click here,” “Get rich fast,” and similar patterns that are nearly always spam.
    • Keyword spotting: When your phone hears “Hey Siri” or “Hey Google,” a lightweight CNN is often listening. It’s fast enough to run locally without draining your battery.
    • Sentiment classification: E-commerce platforms like Amazon and Airbnb use CNNs to quickly classify reviews as positive or negative based on local phrase patterns.
    • Toxicity detection: Platforms like Discord and Slack use CNNs to flag messages containing harmful language patterns.
    • Named Entity Recognition (NER): Financial systems use CNNs to spot entity patterns like “John Smith” (first name + surname) or company abbreviations.

    Part 6: The Mental Model

    Picture yourself standing in a hallway. Your job is to decide if a conversation flowing past is positive or negative.

    You have a small window in the wall – you can see only 3-4 words at any moment.

    As the conversation passes by:

    • Sometimes you see “absolutely terrible” → You think: negative
    • Sometimes you see “I loved it” → You think: positive
    • Sometimes you see “expensive” → You think: complaint

    By the end, you’ve seen hundreds of small moments. You synthesize them: “Overall, this person is unhappy.”

    That’s exactly what a CNN does.

    You don’t need to understand the entire conversation at once. You don’t need to track which word relates to which other word across the whole text. You just recognize patterns as they flow past, collect the count of what you saw, and make a decision.

    You can hold this mental model in your head without a single equation.

    Final Thought

    CNNs are pattern-matching specialists that excel at recognizing local signals. They slide a small window across text, detect which patterns match learned detectors, and build a fingerprint of these activations. Based on the fingerprint – not the original words – they make a decision.

    They’re not the most semantically sophisticated models (Transformers are). They’re not the most flexible (RNNs can process variable-length sequences better). But for the majority of practical text tasks where local patterns are predictive – spam, sentiment, toxicity, keyword spotting – CNNs are fast, interpretable, and remarkably effective.

    The next time you see a CNN mentioned in a text classification context, remember the hallway analogy and the fingerprint metaphor. You’ll understand exactly why they work.

    What are signals? How does the CNN generate them?

    Each window is checked against learned “detectors.” Think of detectors as specialized pattern-matchers.

    The CNN has learned:

    • Detector A recognizes “absolutely terrible” → generates signal: STRONG NEGATIVE
    • Detector B recognizes “not worth” → generates signal: COMPLAINT
    • Detector C recognizes “I loved” → generates signal: STRONG POSITIVE
    • Detector D recognizes “highly recommend” → generates signal: POSITIVE

    When a window matches a detector’s pattern, that detector fires and generates a signal.

    In our example:

    • “was absolutely terrible” matches Detector A → generates STRONG NEGATIVE signal
    • “not worth the” matches Detector B → generates COMPLAINT signal
    • No windows match Detector C or D → they stay silent
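
    As a minimal sketch of that matching step (the detector phrases and signal names are made up for illustration):

    # Hypothetical detectors: each maps a learned phrase to the signal it emits.
    detectors = {
        "absolutely terrible": "STRONG NEGATIVE",   # Detector A
        "not worth":           "COMPLAINT",         # Detector B
        "i loved":             "STRONG POSITIVE",   # Detector C
        "highly recommend":    "POSITIVE",          # Detector D
    }

    review = "The food was absolutely terrible and not worth the price".lower()
    words = review.split()

    signals = []
    for i in range(len(words) - 1):                 # slide a 2-word window
        window = " ".join(words[i:i + 2])
        if window in detectors:
            signals.append((window, detectors[window]))

    print(signals)
    # [('absolutely terrible', 'STRONG NEGATIVE'), ('not worth', 'COMPLAINT')]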

    What does the CNN do with all these signals?

    The CNN collects all the signals generated:

    Signal 1: STRONG NEGATIVE (from "absolutely terrible")
    Signal 2: COMPLAINT (from "not worth")
    Signal 3: NEGATIVE (from "terrible")
    Signal 4: NEGATIVE (from "absolutely")
    
    

    Now it asks a simple question: “What do all these signals together tell me?”

    With 4 negative signals and 0 positive signals, the answer is obvious: “This is a NEGATIVE review.”

    If it saw 5 positive signals and 1 negative signal, it would say: “This is a POSITIVE review.”

    This is how the CNN makes its decision – by looking at the pattern of signals, not by understanding the meaning of the words.
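
    A toy sketch of that counting rule, using the signal names from the examples above:

    def classify(signals):
        """Toy decision rule: count negative vs. positive signals and compare."""
        negative = sum(1 for s in signals if "NEGATIVE" in s or s == "COMPLAINT")
        positive = sum(1 for s in signals if "POSITIVE" in s)
        if negative > positive:
            return "NEGATIVE review"
        if positive > negative:
            return "POSITIVE review"
        return "UNCERTAIN"

    print(classify(["STRONG NEGATIVE", "COMPLAINT", "NEGATIVE", "NEGATIVE"]))
    # NEGATIVE review (4 negative signals, 0 positive)
    print(classify(["POSITIVE"] * 5 + ["NEGATIVE"]))
    # POSITIVE review (5 positive signals, 1 negative)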

    Key insight: The CNN doesn’t understand language. It recognizes patterns.

    The CNN doesn’t know what “terrible” means. It doesn’t have a dictionary. It just knows: “In my training data, when I saw the pattern ‘absolutely terrible,’ the review was almost always negative.”

    This is crucial to understand. The CNN is a pattern matcher, not a language understander.


    Part 1: What Detectors Are and How They’re Built

    A detector is essentially a learned rule: “If you see this specific pattern of words, that’s a signal for the outcome I care about.”

    In training, the CNN is shown thousands of examples:

    Example 1: “I loved this movie!” → Label: POSITIVE
    Example 2: “Absolutely terrible experience” → Label: NEGATIVE
    Example 3: “Not worth my time” → Label: NEGATIVE
    Example 4: “Highly recommend!” → Label: POSITIVE

    The CNN studies these examples and learns:

    • “I loved” appears in positive reviews → Detector fires for positive
    • “Absolutely terrible” appears in negative reviews → Detector fires for negative
    • “Not worth” appears in negative reviews → Detector fires for negative
    • “Highly recommend” appears in positive reviews → Detector fires for positive

    Over thousands of examples, the CNN builds up dozens of these detectors, each trained to recognize a specific pattern and generate a specific signal.

    Part 2: Why the Sliding Window Matters

    Why does the CNN use a small sliding window instead of looking at the whole text at once?

    Because order and proximity matter for signals.

    Consider:

    • “Not good” = NEGATIVE signal
    • “Good not” = Weird, probably not actual English, but if it were, it would be POSITIVE

    The position and proximity of words changes the meaning. A sliding window naturally captures this. It says: “Only look at words that are close together, because words far apart probably don’t form a meaningful pattern.”

    This is why CNNs are great for text tasks where local patterns are predictive.

    Part 3: From Signals to Decisions

    After the CNN generates all its signals, it makes a final decision using a simple rule:

    For sentiment classification:

    • If positive signals > negative signals → POSITIVE review
    • If negative signals > positive signals → NEGATIVE review
    • If they’re tied → uncertain, could default to NEUTRAL or request human review

    For spam detection:

    • If spam-pattern signals > legitimate-pattern signals → FLAG as SPAM
    • Otherwise → DELIVER to inbox

    For toxicity detection:

    • If toxicity-pattern signals exceed threshold → FLAG for review
    • Otherwise → ALLOW

    The specific rule depends on what you’re trying to classify, but the principle is the same: Collect signals, count them, make a decision based on the count.
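
    A hedged sketch of those rules as code – the function names, signal counts, and the toxicity threshold are invented for illustration:

    def classify_sentiment(positive, negative):
        """Toy sentiment rule: compare the two signal counts."""
        if positive > negative:
            return "POSITIVE"
        if negative > positive:
            return "NEGATIVE"
        return "NEUTRAL / send to a human"

    def route_email(spam_hits, legit_hits):
        """Toy spam rule: more spam-pattern hits than legitimate ones."""
        return "FLAG as SPAM" if spam_hits > legit_hits else "DELIVER to inbox"

    def moderate(toxicity_hits, threshold=2):
        """Toy toxicity rule: flag once the signal count exceeds a threshold."""
        return "FLAG for review" if toxicity_hits > threshold else "ALLOW"

    print(classify_sentiment(positive=0, negative=4))   # NEGATIVE
    print(route_email(spam_hits=3, legit_hits=1))       # FLAG as SPAM
    print(moderate(toxicity_hits=4))                     # FLAG for review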

    Part 4: Real-World Applications

    • Gmail spam detection: CNN learns patterns like “Click here,” “Act now,” “Limited offer” → generates SPAM signals → flags email
    • TikTok toxicity detection: CNN learns harmful language patterns → generates TOXICITY signals → removes or flags video
    • Amazon review classification: CNN learns positive patterns like “Love it,” “Highly recommend” → generates POSITIVE signals → displays prominently
    • Airbnb review sentiment: CNN learns negative patterns like “Dirty,” “Rude,” “Broken” → generates NEGATIVE signals → affects host rating
    • Phone wake word detection (“Hey Siri”, “Hey Google”): CNN learns acoustic patterns of wake words → generates MATCH signals → wakes the device
    • Keyword spotting in customer service: CNN learns command patterns like “I want to cancel” → generates INTENT signals → routes to cancellation team

    Part 5: Why This Approach Works

    The sliding window + detector approach is powerful because:

    1. Fast: You’re not analyzing the entire text at once. You’re checking small windows against learned patterns. This is quick.
    2. Interpretable: You can see which patterns triggered which signals. You can explain the decision: “This review was flagged as negative because it contained ‘not worth it’ and ‘waste of money.’”
    3. Efficient to train: You don’t need millions of examples. With thousands of examples, the CNN learns which local patterns matter.
    4. Works for many languages: The approach is language-agnostic. It learns whatever patterns exist in the training data.

    Part 6: The Mental Model

    Here’s the mental model you should carry with you:

    A CNN is like a security scanner at an airport. It doesn’t read your entire biography. It just checks a few things:

    • Do you have a ticket? → Signal: LEGITIMATE
    • Is your ID valid? → Signal: LEGITIMATE
    • Are you carrying anything prohibited? → Signal: ALARM

    Based on these signals, it decides: ALLOW or DENY.

    A CNN works exactly like this:

    • Does this text contain pattern A? → Signal generated
    • Does it contain pattern B? → Signal generated
    • Does it contain pattern C? → Signal generated

    Based on the collection of signals, it decides: POSITIVE, NEGATIVE, SPAM, LEGITIMATE, etc.

    Final Thought

    A CNN is a pattern-matching machine that learns which text patterns predict the outcome you care about. It slides a small window across text, checks each window against learned detectors, generates signals when patterns match, and makes a final decision based on the collected signals.

    You don’t need to understand how detectors are mathematically trained. You just need to understand that:

    1. CNNs learn patterns from examples
    2. They generate signals when they spot familiar patterns
    3. They make decisions by counting and comparing signals

    With this mental model, you can understand why Gmail uses CNNs for spam, why platforms use them for toxicity detection, and why they’re fast and interpretable compared to more complex models.

    You can now sit at the table and say: “A CNN learns patterns from examples, slides a window across new text, and makes decisions based on which patterns it recognizes.” That’s all you need to know to be dangerous.

  • When you search on Amazon, YouTube, or Google, the system isn’t scanning every item one by one – it’s using ANN algorithms like HNSW to leap across billions of items in milliseconds. Modern search systems and LLMs actively use hybrid search, where semantic search augments keyword search. I have shared the intuition for semantic search in a previous post.

    Semantic search deploys ANN (Approximate Nearest Neighbor) algorithms to navigate through billions of items to find the closest matching item. This post shares the general idea of ANN and presents the intuition for a popular ANN algorithm – HNSW (Hierarchical Navigable Small World) – used by vector search systems like FAISS, Pinecone, and Weaviate to retrieve similar documents, images, or products in milliseconds.

    Think of ANN as a super-smart post office for high-dimensional data, with a key caveat – it guarantees extremely fast routing of the mail but doesn’t guarantee delivery to exactly the right address. It trades off perfect accuracy against speed. In other words, ANN is happy as long as the mail is delivered quickly to, say, any home on the right street – a.k.a. an approximate nearest neighbor. In most use cases where semantic search is deployed, this trade-off is acceptable – primarily because in most cases, there may not even be a perfect match.

    • Meaning over Exact Words: Semantic search is all about matching the intended meaning or context of a user’s query, not the exact words. Real-world queries rarely correspond exactly to a single document or text fragment; instead, the user is looking for information that best answers their query – even if the words or phrasing are different.
    • Perfect Match is Rare: In most practical scenarios, especially with large and diverse datasets, a perfect (word-for-word or context-for-context) match doesn’t exist. Users may phrase their queries differently from how content is indexed, or they may not know the precise terminology.

    Part 1: What is HNSW? Think of a Mail Delivery System for Vectors

    When a mail delivery system sorts mail:
    1. It goes from state → city → ZIP → street → house
    2. Each step narrows down where the package should go.

    HNSW does exactly that with embeddings. Instead of brute-force comparing your query to all vectors, HNSW:

    1. Builds a hierarchy of increasingly dense graphs
    2. Starts the search from the highest level, which is also the sparsest (like “state”), where it finds the closest match to the incoming vector
    3. Then the search drops down to the next level from that closest match and repeats the process
    4. The lowest layer has all the nodes (documents), and the search can finally find the approximate closest match to the incoming vector
    5. Ends in dense layers (“house number”) to get a precise match
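
    A rough sketch of that top-down descent, heavily simplified – real HNSW keeps a candidate list at each layer rather than a single best node, and `graph_layers` and `distance` here are assumed inputs, not any library’s real API:

    def search(query, graph_layers, entry_point, distance):
        """Descend from the sparsest layer to layer 0, greedily moving closer.

        graph_layers: list of adjacency dicts, one per layer, sparsest (top) first.
        distance:     function giving the distance between the query and a node.
        """
        current = entry_point
        for layer in graph_layers:                  # top layer -> ... -> layer 0
            improved = True
            while improved:
                improved = False
                for neighbor in layer.get(current, []):
                    if distance(query, neighbor) < distance(query, current):
                        current = neighbor          # hop to the closer neighbor
                        improved = True
                        break                       # restart from the new node's neighbors
            # `current` becomes the entry point for the next, denser layer
        return current                              # the approximate nearest neighbor

    # Toy usage with 1-D "vectors" (plain numbers) and two layers:
    dist = lambda q, node: abs(q - node)
    layers = [
        {0: [50], 50: [0, 100], 100: [50]},                               # sparse top layer
        {0: [10], 10: [0, 20], 20: [10, 50], 50: [20, 100], 100: [50]},   # dense layer 0
    ]
    print(search(42, layers, entry_point=0, distance=dist))               # -> 50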

    Part 2: How HNSW Inserts a New Vector

    Let’s say you’re inserting product vector `V123`.

    Step 1: Assign a Random Max Level

    – V123 is randomly given a max level (say Level 3)
    – It will be inserted into Levels 3, 2, 1, and 0

      Step 2: Top-Down Insertion

      For each level from 3 → 0:

      1. Use Greedy Search to find an entry point.
      2. Perform an `efConstruction`-based search to explore candidates.
      3. Select up to M nearest neighbors using a smart heuristic.
      4. Connect bidirectionally (if possible).

      Level 0 is the most important – it’s the dense, searchable layer. This is why HNSW powers semantic search in systems like FAISS and Pinecone – it’s how your query embedding finds the right chunk of meaning fast.
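
      The random max level in Step 1 is typically drawn from an exponentially decaying distribution, so most vectors live only in Level 0 and very few reach the top layers. A minimal sketch of that sampling (the constant `m_L` is a tunable parameter – the HNSW paper suggests roughly 1/ln(M)):

      import math
      import random

      def random_level(m_L=1.0):
          """Sample a max level: most vectors get 0, exponentially fewer get higher levels."""
          return int(-math.log(random.random()) * m_L)

      # Roughly 63% of vectors land at level 0, ~23% at level 1, ~9% at level 2, ...
      levels = [random_level() for _ in range(100_000)]
      print(sum(lvl == 0 for lvl in levels) / len(levels))   # about 0.63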

      Part 3: Greedy vs `ef` Search – Why `ef` Matters

      Say you’re at node P and evaluating neighbors A (8), B (5), and C (1), where the number in parentheses is the distance of each point from the query vector – smaller means a better match. Greedy search keeps only the single best option, so it finds that B is better than P, jumps to B – and misses C, the true closest node.

      With `ef = 3`, you:
      – Evaluate A, B, and C
      – Add all to candidate list
      – Explore C later – and find the true best match

      So `ef` controls how deeply and widely you explore:
      – Greedy = fast, may miss best result
      – `ef` search = slower, better accuracy
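
      A simplified sketch of the `ef`-style search over a single layer: instead of keeping only the current best node, it keeps a small candidate list and stops only when no remaining candidate can improve the top `ef` results. The graph, coordinates, and node names below are invented to mirror the P/A/B/C example:

      import heapq

      def ef_search(query, graph, entry, distance, ef=3):
          """Best-first search over one layer, keeping up to `ef` best results."""
          visited = {entry}
          candidates = [(distance(query, entry), entry)]   # min-heap: closest first
          results = [(-distance(query, entry), entry)]     # max-heap of the ef best so far

          while candidates:
              d, node = heapq.heappop(candidates)
              if d > -results[0][0] and len(results) >= ef:
                  break                                    # nothing left can improve the results
              for nb in graph.get(node, []):
                  if nb in visited:
                      continue
                  visited.add(nb)
                  d_nb = distance(query, nb)
                  if len(results) < ef or d_nb < -results[0][0]:
                      heapq.heappush(candidates, (d_nb, nb))
                      heapq.heappush(results, (-d_nb, nb))
                      if len(results) > ef:
                          heapq.heappop(results)           # drop the current worst result
          return sorted((-d, n) for d, n in results)       # (distance, node), best first

      # C hangs off A, the "worse" neighbor, so pure greedy would stop at B and miss it.
      graph = {"P": ["A", "B"], "A": ["C"], "B": [], "C": []}
      coords = {"P": 20, "A": 18, "B": 15, "C": 11}
      dist = lambda q, node: abs(q - coords[node])
      print(ef_search(10, graph, "P", dist, ef=3))
      # [(1, 'C'), (5, 'B'), (8, 'A')] – C, the true best match, is found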

      Part 4: What if a Node Already Has Max Neighbors?

      When inserting a new node:
      – It connects to M nearest neighbors
      – Those neighbors may or may not link back

      If a neighbor is already full:
      – It evaluates whether the new node is better than any of its current connections
      – If so, it may drop a weaker link to accept the new one
      – Otherwise, the connection stays one-way only

      This ensures:
      – Graph remains navigable
      – Degree of nodes stays bounded
      – No constant link reshuffling
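
      A hedged sketch of that back-link logic (the data structures here are assumptions for illustration, not any library’s real API):

      def connect(new_node, neighbor, links, distance, M):
          """Try to link `neighbor` back to `new_node`, keeping at most M links per node."""
          links.setdefault(new_node, []).append(neighbor)     # the new node always links out
          current = links.setdefault(neighbor, [])
          if len(current) < M:
              current.append(new_node)                        # room left: link back
              return
          # Neighbor is full: keep the M closest of its current links plus the new node.
          candidates = current + [new_node]
          candidates.sort(key=lambda n: distance(neighbor, n))
          links[neighbor] = candidates[:M]                    # the weakest link may get dropped
          # If new_node didn't make the cut, the connection stays one-way only.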

      Final Thought

      HNSW builds a navigable graph that scales to millions of vectors with:
      – Fast approximate search
      – Sparse yet connected layers
      – Smart insertion and neighbor selection

      So next time someone says “vector search,” remember the mail delivery analogy and its caveat: high speed with street-level accuracy. Even if HNSW misses the absolute closest vector, retrieval pipelines typically fetch the top-k candidates, then rerank them with a more precise model. This way, speed and accuracy both stay high.