Building AI Intuition

Connecting the dots...

Privacy Tech

Privacy Enhancing Technologies (PETs) — Part 1

By Archit Sharma
6 Min Read
Updated on February 28, 2026

How Your Data Gets Protected

Every time you browse a website, click an ad, or make a purchase, data flows through dozens of systems. Companies need this data for analytics, measurement, and personalization—but they also have legal and ethical obligations to protect your privacy.

This is where Privacy Enhancing Technologies (PETs) come in. PETs are the technical toolkit that lets companies use data responsibly—extracting insights while minimizing the risk of exposing individuals.

This series builds intuition for how these technologies work, without the math. Part 1 covers the foundational layer: what happens to your data as it moves through collection, storage, and analysis.


The Four Layers of Data Protection

Think of privacy protection like an onion with multiple layers. Each layer adds protection, and the best systems use all four together.

  • Layer 1: Data Minimization — Collect less in the first place
  • Layer 2: Storage Anonymization — Disguise what you store
  • Layer 3: Query Anonymization — Protect what comes out of queries
  • Layer 4: Differential Privacy — Add mathematical guarantees

1. Data Minimization

The “Don’t Collect It” Principle: The simplest privacy protection is also the most powerful: don’t collect data you don’t need. A related rule of thumb: if you can’t clearly state how a piece of data will be used, don’t collect it in the first place.

Mental Model: The Minimalist Packer

Imagine you’re packing for a weekend trip. You could bring your entire wardrobe “just in case” — but then you’re lugging heavy bags, and if your luggage gets lost, you lose everything.

A smarter approach: pack only what you’ll actually use. Data minimization works the same way:

  • Only collect fields you truly need for the product to function.
  • Set short retention windows (delete data after 30 days, not 10 years).
  • Drop unused fields early in data pipelines before they spread.
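The practices above can be sketched as a small ingestion-time filter. This is a hypothetical illustration — the field names, the allow-list, and the 30-day retention window are invented for the example, not taken from any real system:

```python
# Hypothetical sketch of data minimization at ingestion time: keep only
# allow-listed fields and stamp each record with a deletion deadline.
from datetime import datetime, timedelta, timezone

ALLOWED_FIELDS = {"purchase_amount", "item_category", "country"}  # illustrative
RETENTION = timedelta(days=30)

def minimize(raw_event: dict) -> dict:
    """Drop unused fields early, before the event spreads downstream."""
    kept = {k: v for k, v in raw_event.items() if k in ALLOWED_FIELDS}
    kept["delete_after"] = (datetime.now(timezone.utc) + RETENTION).isoformat()
    return kept

event = {
    "purchase_amount": 42.50,
    "item_category": "books",
    "country": "US",
    "email": "john.doe@email.com",   # not needed for analytics -> dropped
    "device_fingerprint": "abc123",  # "might need it later" -> dropped
}
print(minimize(event))
```

The point of doing this at the pipeline's edge is that a field that never enters the warehouse can never leak from it.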

Where It’s Used

Everywhere — ingestion, logging, analytics, ML training. Every system that touches user data should ask: “Do we actually need this field?”

Trade-offs

The downside is reduced flexibility. Product teams often say “we might need that data later for a feature we haven’t built yet.” Data minimization requires strong upfront thinking about what you’ll actually use.

  • Important Limitation: Minimization only reduces surface area. It doesn’t protect the data you do collect — you still need the other layers.

2. Storage Anonymization

Disguising Identity at Rest: Once data is collected, you need to store it somewhere. Storage anonymization (usually pseudonymization in practice) disguises direct identifiers so that raw user IDs aren’t sitting in every database.

Mental Model: The Witness Protection Program

When someone enters witness protection, they get a new identity — new name, new address, new documents. Their real identity is stored in a secure vault that only a few authorized people can access. Everyone else interacts with the new identity.

How Pseudonymization Works:

  • Real UserID or email is replaced with a pseudo-ID (often a salted hash).
  • The mapping table (real ID ↔ pseudo ID) is stored separately with strict access controls.
  • All analytics, ML, and dashboards operate on pseudo-IDs only.
Raw Data                 Pseudonymized
--------------------     -------------
john.doe@email.com       user_a8f3c2
jane.smith@email.com     user_b7e4d1
bob.jones@email.com      user_c9f5e3

A more nuanced detail around hashing: it is common practice to use a salt and a pepper in the hashing process to strengthen privacy. A salt is a random string generated per user, often stored in cleartext alongside the record; a pepper is a system-level secret stored outside the database, for example in an environment variable. The final hash is computed over raw data + salt + pepper, which shrinks the attack surface against brute-force and rainbow-table attacks.
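Here is a minimal sketch of that salted-and-peppered hashing, assuming SHA-256 and a 6-character pseudo-ID like the table above. The pepper value and ID format are hypothetical; in practice the pepper would come from an environment variable, and the salt and pseudo-ID mapping would live in a separately access-controlled store:

```python
# Sketch of pseudonymization via a salted + peppered hash.
# PEPPER is a placeholder; normally it would be read from the environment,
# e.g. os.environ["HASH_PEPPER"], and never stored next to the data.
import hashlib
import os

PEPPER = "system-wide-secret"  # hypothetical value for illustration

def pseudonymize(email: str, salt: str) -> str:
    """Hash raw data + salt + pepper into a stable pseudo-ID."""
    digest = hashlib.sha256((email + salt + PEPPER).encode()).hexdigest()
    return "user_" + digest[:6]

salt = os.urandom(8).hex()  # per-user random salt, may be stored in cleartext
pseudo = pseudonymize("john.doe@email.com", salt)
print(pseudo)  # stable for this user as long as salt and pepper are fixed
```

Because the pepper never sits in the database, an attacker who dumps the table (including the salts) still can't recompute the hashes offline.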

Key Nuance: Pseudonymization vs. True Anonymization

If the mapping table exists, this is pseudonymization — not true anonymization. Someone with access to the mapping can re-identify users.

True anonymization means destroying the mapping entirely. But this creates a problem: if you can’t identify users, you can’t honor deletion requests (“delete my data”) or export requests (“give me my data”). GDPR and similar laws require these capabilities, so most systems use pseudonymization.

Trade-offs

  • The mapping table is still sensitive — it becomes a high-value target.
  • Re-identification is still possible for anyone with mapping access.
  • Behavior patterns can reveal identity even without direct IDs (e.g., the user who logs in at 8am from Seattle and searches for “gluten-free recipes” every day is identifiable by pattern).

3. Query Anonymization

Protecting What Comes Out: Storage anonymization protects data at rest. But what about when analysts query that data? Query anonymization ensures that query results don’t expose individual users.

Mental Model: The Crowded Room Rule

Imagine you’re a journalist who wants to report on salary data. You could publish: “John Smith earns $150,000” — but that violates John’s privacy.

Instead, you publish: “Engineers in Seattle earn an average of $145,000” — now John is protected by the crowd. Query anonymization enforces this crowd protection automatically.

Mechanisms:

  • No raw user IDs in query results.
  • Minimum cohort thresholds (e.g., results must include at least 100 users: k-anonymity).
  • Small groups are dropped entirely (if only 3 users match, return nothing).
  • Outliers are trimmed (the CEO’s $10M salary doesn’t skew “average salary” for everyone).

Example Output:

Plaintext

Query: "Show me purchase behavior by ZIP code"

ZIP 98101: 5,000 users  --> Returns data
ZIP 98102: 3,200 users  --> Returns data
ZIP 98199: 12 users     --> DROPPED (below threshold)
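The threshold rule in that example takes only a few lines. This sketch mirrors the cohort counts above; the k value of 100 is the illustrative minimum from the mechanisms list:

```python
# Minimal sketch of a k-anonymity cohort threshold: groups with fewer
# than K_MIN users are dropped from query results entirely.
K_MIN = 100  # minimum cohort size, per the example above

cohorts = {"98101": 5000, "98102": 3200, "98199": 12}

def apply_k_threshold(cohorts: dict, k: int = K_MIN) -> dict:
    """Return only cohorts large enough to hide any individual."""
    return {zip_code: n for zip_code, n in cohorts.items() if n >= k}

print(apply_k_threshold(cohorts))  # {'98101': 5000, '98102': 3200}
```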




Where It’s Used

Analytics platforms, measurement dashboards, internal SQL tools, any system where humans or automated processes query user data.

Trade-offs

  • Loss of granularity: You can’t analyze small segments.
  • Missing Data: Small but valuable cohorts disappear entirely.
  • Frustration: Can hinder debugging workflows (“why can’t I see data for this specific user?”).
  • Limitation: Only protects query outputs — the underlying storage still contains full data.

4. Differential Privacy

Mathematical Guarantees: The previous techniques are practical but don’t offer formal guarantees. A determined attacker with enough queries might still extract individual information. Differential Privacy (DP) provides a mathematical guarantee: one user’s presence or absence in the dataset doesn’t materially affect the results.

Mental Model: The Noisy Crowd

Imagine you’re trying to figure out if your neighbor voted in the last election by looking at precinct-level turnout data.

If the precinct reports “exactly 1,247 people voted,” and you know 1,246 of your neighbors, you can deduce whether that one person voted.

Differential privacy adds calibrated random noise to the result. Instead of “1,247,” the system reports something in roughly the 1,240–1,260 range. Now you can’t tell whether your neighbor’s vote is in there or not — their individual contribution is smaller than the noise. It’s important to note that in practice DP typically returns a single noisy value rather than a range, and that value can change each time the query is repeated.

How It Works

  1. A query runs against the real data.
  2. Before returning results, calibrated random noise is added (Laplace or Gaussian distribution).
  3. The amount of noise is controlled by a privacy parameter (ε, pronounced “epsilon”).
  4. Each query consumes some of the “privacy budget.”
  5. Once the budget is exhausted, no more queries are allowed.

Plaintext

True count:      1,247
Noise added:     +/- random value
Reported:        1,251 (or 1,243, or 1,249...)

Result: Individual impact < Noise magnitude
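The five steps above can be sketched with only the standard library. This is a toy illustration of the Laplace mechanism plus a simple privacy budget, not a production DP implementation; the epsilon values and budget are arbitrary:

```python
# Toy sketch of differential privacy: Laplace noise + a privacy budget.
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) via inverse-CDF from a uniform draw."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

class DpCounter:
    def __init__(self, total_budget: float):
        self.remaining = total_budget  # total epsilon available

    def noisy_count(self, true_count: int, epsilon: float) -> int:
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon  # each query consumes some budget
        # A count has sensitivity 1, so the Laplace scale is 1/epsilon.
        return round(true_count + laplace_noise(1.0 / epsilon))

counter = DpCounter(total_budget=1.0)
print(counter.noisy_count(1247, epsilon=0.5))  # e.g. 1244 — varies per query
```

Note the privacy–utility knob directly in the code: a smaller epsilon means a larger Laplace scale, hence noisier answers, and it also drains the budget more slowly.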




Where It’s Used

  • Aggregations: Counts, sums, averages.
  • Advertising: Reach and frequency measurement, Lift studies.
  • ML Training: DP-SGD adds noise during model training.
  • Real World: Apple uses differential privacy for keyboard suggestions; Google uses it for Chrome usage statistics; Meta uses it for ads measurement.

Trade-offs

  • The Privacy-Utility Tradeoff: More privacy = more noise = less accuracy.
  • Budget Exhaustion: After enough queries, you can’t ask more questions.
  • Communication: Hard to explain to non-technical stakeholders (“why is this number fuzzy?”).
  • Small Data: Doesn’t work well for small datasets where noise overwhelms signal.
  • Format: Often returns single noisy numbers, not ranges, which can be counterintuitive.

How These Layers Work Together

The strongest privacy posture combines all four:

  1. Data Minimization: Only collect purchase amount, not full credit card.
  2. Storage Anonymization: Replace email with pseudo-ID in warehouse.
  3. Query Anonymization: Enforce k=100 minimum cohort on all dashboards.
  4. Differential Privacy: Add ε-calibrated noise to reported metrics.

Each layer catches what the previous layer missed: minimization reduces what’s collected; pseudonymization protects storage; query anonymization protects outputs; DP provides mathematical guarantees on top.

Common Misconceptions

  • ❌ “We anonymized the data, so we’re safe.”
    • Pseudonymization (replacing IDs) is not anonymization. If a mapping exists, re-identification is possible.
  • ❌ “We only share aggregates, so privacy is protected.”
    • Without DP, repeated aggregate queries can leak individual information (the “differencing attack”).
  • ❌ “Differential privacy makes data useless.”
    • At scale, DP works well. The noise averages out over billions of data points. For small datasets, however, DP can overwhelm the signal.
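The differencing attack mentioned above is easy to demonstrate. In this toy sketch (all names and salaries invented), an API that only ever returns aggregates still leaks one person's exact value when two queries are combined:

```python
# Toy differencing attack: two "safe" aggregate queries reveal an individual.
salaries = {"alice": 120_000, "bob": 95_000, "carol": 150_000, "john": 150_000}

def total_salary(exclude: frozenset = frozenset()) -> int:
    """An 'aggregate-only' API that never returns individual rows."""
    return sum(v for k, v in salaries.items() if k not in exclude)

# The attacker asks two aggregate questions...
everyone = total_salary()
everyone_but_john = total_salary(exclude=frozenset({"john"}))

# ...and subtracts them to recover John's salary exactly.
print(everyone - everyone_but_john)  # 150000
```

This is precisely the leak that DP's calibrated noise is designed to mask: with noise larger than any one person's contribution, the difference between the two queries no longer pins down an individual.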

Final Thought

Privacy protection isn’t a single technology — it’s a layered defense.

  1. Data Minimization: Collect only what you need.
  2. Storage Anonymization: Disguise identifiers at rest.
  3. Query Anonymization: Protect outputs with cohort thresholds.
  4. Differential Privacy: Add mathematical noise guarantees.

The art of privacy engineering is choosing the right combination for your use case — balancing utility against protection, and being honest about what each technique does and doesn’t guarantee.

In Part 2, we’ll cover the collaboration layer: how multiple organizations can work together on data without sharing raw information — Data Clean Rooms, identity mapping, purpose limitation, and secure hardware enclaves.

Copyright 2026 — Building AI Intuition. All rights reserved.