Privacy Enhancing Technologies (PETs) — Part 1
How Your Data Gets Protected
Every time you browse a website, click an ad, or make a purchase, data flows through dozens of systems. Companies need this data for analytics, measurement, and personalization—but they also have legal and ethical obligations to protect your privacy.
This is where Privacy Enhancing Technologies (PETs) come in. PETs are the technical toolkit that lets companies use data responsibly—extracting insights while minimizing the risk of exposing individuals.
This two-part series will give you the intuition for how these technologies work, without the math. Part 1 covers the foundational layer: what happens to your data as it moves through collection, storage, and analysis.
The Four Layers of Data Protection
Think of privacy protection like an onion with multiple layers. Each layer adds protection, and the best systems use all four together.
- Layer 1: Data Minimization — Collect less in the first place
- Layer 2: Storage Anonymization — Disguise what you store
- Layer 3: Query Anonymization — Protect what comes out of queries
- Layer 4: Differential Privacy — Add mathematical guarantees
1. Data Minimization
The “Don’t Collect It” Principle: The simplest privacy protection is also the most powerful: don’t collect data you don’t need. A related principle is to ask whether you are clear on how the data you plan to collect will be used; if the answer is “no”, leave it alone.
Mental Model: The Minimalist Packer
Imagine you’re packing for a weekend trip. You could bring your entire wardrobe “just in case” — but then you’re lugging heavy bags, and if your luggage gets lost, you lose everything.
A smarter approach: pack only what you’ll actually use. Data minimization works the same way:
- Only collect fields you truly need for the product to function.
- Set short retention windows (delete data after 30 days, not 10 years).
- Drop unused fields early in data pipelines before they spread.
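As a concrete illustration, here is a minimal sketch of dropping unused fields and flagging expired records at ingestion time. The field names, the 30-day window, and the function names are illustrative assumptions, not a specific production pipeline:

```python
from datetime import datetime, timedelta, timezone

# Assumption: only these fields are actually needed downstream.
ALLOWED_FIELDS = {"user_id", "purchase_amount", "timestamp"}
RETENTION = timedelta(days=30)   # illustrative retention window

def minimize(event: dict) -> dict:
    """Keep only the fields the product actually uses."""
    return {k: v for k, v in event.items() if k in ALLOWED_FIELDS}

def is_expired(event: dict, now: datetime) -> bool:
    """Flag records older than the retention window for deletion."""
    return now - event["timestamp"] > RETENTION

event = {
    "user_id": "u123",
    "purchase_amount": 42.50,
    "timestamp": datetime.now(timezone.utc),
    "device_fingerprint": "...",   # never used downstream, so it is dropped at ingestion
}
print(minimize(event).keys())      # dict_keys(['user_id', 'purchase_amount', 'timestamp'])
```

The point is that the unused field never reaches the warehouse in the first place, so none of the later layers has to protect it.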
Where It’s Used
Everywhere — ingestion, logging, analytics, ML training. Every system that touches user data should ask: “Do we actually need this field?”
Trade-offs
The downside is reduced flexibility. Product teams often say “we might need that data later for a feature we haven’t built yet.” Data minimization requires strong upfront thinking about what you’ll actually use.
- Important Limitation: Minimization only reduces surface area. It doesn’t protect the data you do collect — you still need the other layers.
2. Storage Anonymization
Disguising Identity at Rest: Once data is collected, you need to store it somewhere. Storage anonymization (usually pseudonymization in practice) disguises direct identifiers so that raw user IDs aren’t sitting in every database.
Mental Model: The Witness Protection Program
When someone enters witness protection, they get a new identity — new name, new address, new documents. Their real identity is stored in a secure vault that only a few authorized people can access. Everyone else interacts with the new identity.
How Pseudonymization Works:
- Real UserID or email is replaced with a pseudo-ID (often a salted hash).
- The mapping table (real ID ↔ pseudo ID) is stored separately with strict access controls.
- All analytics, ML, and dashboards operate on pseudo-IDs only.
| Raw Data | Pseudonymized |
| --- | --- |
| john.doe@email.com | user_a8f3c2 |
| jane.smith@email.com | user_b7e4d1 |
| bob.jones@email.com | user_c9f5e3 |
A more nuanced detail around hashing: it is common practice to use a salt and a pepper in the hashing process for stronger privacy. The salt is a random string generated per user, often stored in cleartext alongside the record; the pepper is a system-level secret stored outside the database (for example, in an environment variable). The final hash is computed over raw data + salt + pepper, which narrows the threat vector if the database alone is compromised.
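Here is a minimal sketch of that idea in Python, assuming a SHA-256 hash and a pepper read from an environment variable. The function names, the environment variable, and the truncated pseudo-ID format are illustrative assumptions:

```python
import hashlib
import os
import secrets

# Assumption: the pepper is provisioned outside the database,
# e.g. via an environment variable or a secrets manager.
PEPPER = os.environ.get("PSEUDONYMIZATION_PEPPER", "change-me")

def pseudonymize(raw_id: str) -> tuple[str, str]:
    """Return (salt, pseudo_id) for a raw identifier such as an email."""
    salt = secrets.token_hex(8)   # per-user random salt, may be stored in cleartext
    digest = hashlib.sha256((raw_id + salt + PEPPER).encode()).hexdigest()
    return salt, "user_" + digest[:6]   # truncated for readability, like user_a8f3c2

salt, pseudo_id = pseudonymize("john.doe@email.com")
print(pseudo_id)   # e.g. user_3f9c2a (value depends on the salt and pepper)
```

The mapping of raw ID to pseudo-ID (and its salt) would live in a separate, tightly access-controlled table; everything downstream sees only the pseudo-ID.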
Key Nuance: Pseudonymization vs. True Anonymization
If the mapping table exists, this is pseudonymization — not true anonymization. Someone with access to the mapping can re-identify users.
True anonymization means destroying the mapping entirely. But this creates a problem: if you can’t identify users, you can’t honor deletion requests (“delete my data”) or export requests (“give me my data”). GDPR and similar laws require these capabilities, so most systems use pseudonymization.
Trade-offs
- The mapping table is still sensitive — it becomes a high-value target.
- Re-identification is still possible for anyone with mapping access.
- Behavior patterns can reveal identity even without direct IDs (e.g., the user who logs in at 8am from Seattle and searches for “gluten-free recipes” every day is identifiable by pattern).
3. Query Anonymization
Protecting What Comes Out: Storage anonymization protects data at rest. But what about when analysts query that data? Query anonymization ensures that query results don’t expose individual users.
Mental Model: The Crowded Room Rule
Imagine you’re a journalist who wants to report on salary data. You could publish: “John Smith earns $150,000” — but that violates John’s privacy.
Instead, you publish: “Engineers in Seattle earn an average of $145,000” — now John is protected by the crowd. Query anonymization enforces this crowd protection automatically.
Mechanisms:
- No raw user IDs in query results.
- Minimum cohort thresholds (k-anonymity; e.g., results must include at least 100 users).
- Small groups are dropped entirely (if only 3 users match, return nothing).
- Outliers are trimmed (the CEO’s $10M salary doesn’t skew “average salary” for everyone).
Example Output:
```text
Query: "Show me purchase behavior by ZIP code"

ZIP 98101: 5,000 users --> Returns data
ZIP 98102: 3,200 users --> Returns data
ZIP 98199:    12 users --> DROPPED (below threshold)
```
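A minimal sketch of how such a threshold could be enforced in a query layer, assuming a simple in-memory aggregation; the k = 100 value mirrors the example above, and the function and field names are assumptions:

```python
from collections import Counter

K_MIN = 100   # minimum cohort size; illustrative threshold

def purchases_by_zip(rows: list[dict]) -> dict[str, int]:
    """Aggregate user counts per ZIP and drop cohorts below the k threshold."""
    counts = Counter(row["zip"] for row in rows)
    return {zip_code: n for zip_code, n in counts.items() if n >= K_MIN}

# A ZIP with only 12 matching users simply disappears from the result,
# rather than returning a small, potentially identifying group.
```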
Where It’s Used
Analytics platforms, measurement dashboards, internal SQL tools, any system where humans or automated processes query user data.
Trade-offs
- Loss of granularity: You can’t analyze small segments.
- Missing Data: Small but valuable cohorts disappear entirely.
- Frustration: Can hinder debugging workflows (“why can’t I see data for this specific user?”).
- Limitation: Only protects query outputs — the underlying storage still contains full data.
4. Differential Privacy
Mathematical Guarantees: The previous techniques are practical but don’t offer formal guarantees. A determined attacker with enough queries might still extract individual information. Differential Privacy (DP) provides a mathematical guarantee: one user’s presence or absence in the dataset doesn’t materially affect the results.
Mental Model: The Noisy Crowd
Imagine you’re trying to figure out if your neighbor voted in the last election by looking at precinct-level turnout data.
If the precinct reports “exactly 1,247 people voted,” and you know 1,246 of your neighbors, you can deduce whether that one person voted.
Differential privacy adds calibrated random noise to the result. Instead of “1,247,” the system reports “approximately 1,240-1,260 people voted.” Now you can’t tell if your neighbor’s vote is in there or not — their individual contribution is smaller than the noise. It’s important to note that in practice a DP system often returns a single noisy value rather than a range, and that value can change each time the query is repeated.
How It Works
- A query runs against the real data.
- Before returning results, calibrated random noise is added (Laplace or Gaussian distribution).
- The amount of noise is controlled by a privacy parameter (ε, pronounced “epsilon”).
- Each query consumes some of the “privacy budget.”
- Once the budget is exhausted, no more queries are allowed.
```text
True count:   1,247
Noise added:  +/- random value
Reported:     1,251 (or 1,243, or 1,249...)

Result: Individual impact < Noise magnitude
```
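Here is a minimal sketch of the mechanism, assuming the Laplace mechanism for a counting query (sensitivity 1), NumPy for the noise draw, and a simple privacy-budget tracker. The class name and parameter values are illustrative, not a specific production system:

```python
import numpy as np

class PrivateCounter:
    """Toy Laplace mechanism for counting queries (sensitivity = 1)."""

    def __init__(self, total_budget: float):
        self.remaining_budget = total_budget   # total epsilon available

    def noisy_count(self, true_count: int, epsilon: float) -> float:
        if epsilon > self.remaining_budget:
            raise RuntimeError("Privacy budget exhausted; no more queries allowed.")
        self.remaining_budget -= epsilon       # each query spends part of the budget
        noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)  # scale = sensitivity / epsilon
        return true_count + noise

counter = PrivateCounter(total_budget=1.0)
print(round(counter.noisy_count(1247, epsilon=0.5)))   # e.g. 1251, different on each run
```

Smaller epsilon means more noise and stronger privacy; once the budget runs out, the counter refuses further queries.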
Where It’s Used
- Aggregations: Counts, sums, averages.
- Advertising: Reach and frequency measurement, Lift studies.
- ML Training: DP-SGD adds noise during model training.
- Real World: Apple uses differential privacy for keyboard suggestions; Google uses it for Chrome usage statistics; Meta uses it for ads measurement.
Trade-offs
- The Privacy-Utility Tradeoff: More privacy = more noise = less accuracy.
- Budget Exhaustion: After enough queries, you can’t ask more questions.
- Communication: Hard to explain to non-technical stakeholders (“why is this number fuzzy?”).
- Small Data: Doesn’t work well for small datasets where noise overwhelms signal.
- Format: Often returns single noisy numbers, not ranges, which can be counterintuitive.
How These Layers Work Together
The strongest privacy posture combines all four:
- Data Minimization: Only collect purchase amount, not full credit card.
- Storage Anonymization: Replace email with pseudo-ID in warehouse.
- Query Anonymization: Enforce a k = 100 minimum cohort on all dashboards.
- Differential Privacy: Add ε-calibrated noise to reported metrics.
Each layer catches what the previous layer missed. Minimization reduces what’s collected; Pseudonymization protects storage; Query anonymization protects outputs; DP provides mathematical guarantees on top.
Common Misconceptions
- ❌ “We anonymized the data, so we’re safe.”
- Pseudonymization (replacing IDs) is not anonymization. If a mapping exists, re-identification is possible.
- ❌ “We only share aggregates, so privacy is protected.”
- Without DP, repeated aggregate queries can leak individual information (the “differencing attack”; see the sketch after this list).
- ❌ “Differential privacy makes data useless.”
- At scale, DP works well. The noise averages out over billions of data points. For small datasets, however, DP can overwhelm the signal.
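To make the differencing attack concrete, here is a toy sketch: two aggregate queries that each look safe on their own but together reveal one person’s value. The names and numbers are made up for illustration:

```python
# Toy dataset: four salaries behind an "aggregates only" interface.
salaries = {"alice": 120_000, "bob": 135_000, "carol": 150_000, "dave": 150_000}

total_all     = sum(salaries.values())                              # query 1: everyone
total_no_dave = sum(v for n, v in salaries.items() if n != "dave")  # query 2: a filter that happens to exclude Dave

print(total_all - total_no_dave)   # 150000: Dave's exact salary, recovered from two aggregates
```

Differential privacy blunts this attack because each query returns a noisy total, so the difference between the two answers no longer pins down one individual’s contribution.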
Final Thought
Privacy protection isn’t a single technology — it’s a layered defense.
- Data Minimization: Collect only what you need.
- Storage Anonymization: Disguise identifiers at rest.
- Query Anonymization: Protect outputs with cohort thresholds.
- Differential Privacy: Add mathematical noise guarantees.
The art of privacy engineering is choosing the right combination for your use case — balancing utility against protection, and being honest about what each technique does and doesn’t guarantee.
In Part 2, we’ll cover the collaboration layer: how multiple organizations can work together on data without sharing raw information — Data Clean Rooms, identity mapping, purpose limitation, and secure hardware enclaves.