Privacy Enhancing Technologies (PETs) — Part 1
How Your Data Gets Protected
Every time you browse a website, click an ad, or make a purchase, data flows through dozens of systems. Companies need this data for analytics, measurement, and personalization—but they also have legal and ethical obligations to protect your privacy.
This is where Privacy Enhancing Technologies (PETs) come in. PETs are the technical toolkit that lets companies use data responsibly—extracting insights while minimizing the risk of exposing individuals.
This two-part series will give you the intuition for how these technologies work, without the math. Part 1 covers the foundational layer: what happens to your data as it moves through collection, storage, and analysis.
The Four Layers of Data Protection
Think of privacy protection like an onion with multiple layers. Each layer adds protection, and the best systems use all four together.
- Layer 1: Data Minimization — Collect less in the first place
- Layer 2: Storage Anonymization — Disguise what you store
- Layer 3: Query Anonymization — Protect what comes out of queries
- Layer 4: Differential Privacy — Add mathematical guarantees
1. Data Minimization
The “Don’t Collect It” Principle: The simplest privacy protection is also the most powerful: don’t collect data you don’t need. A related principle is to ask whether you are clear on how the data you plan to collect will be used; if the answer is “no”, leave it alone.
Mental Model: The Minimalist Packer
Imagine you’re packing for a weekend trip. You could bring your entire wardrobe “just in case” — but then you’re lugging heavy bags, and if your luggage gets lost, you lose everything.
A smarter approach: pack only what you’ll actually use. Data minimization works the same way:
- Only collect fields you truly need for the product to function.
- Set short retention windows (delete data after 30 days, not 10 years).
- Drop unused fields early in data pipelines before they spread.
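As a concrete illustration, here is a minimal sketch of dropping unused fields and flagging expired records at ingestion time. The field names, the 30-day window, and the function names are illustrative assumptions, not a specific production pipeline:

```python
from datetime import datetime, timedelta, timezone

# Assumption: only these fields are actually needed downstream.
ALLOWED_FIELDS = {"user_id", "purchase_amount", "timestamp"}
RETENTION = timedelta(days=30)   # illustrative retention window

def minimize(event: dict) -> dict:
    """Keep only the fields the product actually uses."""
    return {k: v for k, v in event.items() if k in ALLOWED_FIELDS}

def is_expired(event: dict, now: datetime) -> bool:
    """Flag records older than the retention window for deletion."""
    return now - event["timestamp"] > RETENTION

event = {
    "user_id": "u123",
    "purchase_amount": 42.50,
    "timestamp": datetime.now(timezone.utc),
    "device_fingerprint": "...",   # never used downstream, so it is dropped at ingestion
}
print(minimize(event).keys())      # dict_keys(['user_id', 'purchase_amount', 'timestamp'])
```

The point is that the unused field never reaches the warehouse in the first place, so none of the later layers has to protect it.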
Where It’s Used
Everywhere — ingestion, logging, analytics, ML training. Every system that touches user data should ask: “Do we actually need this field?”
Trade-offs
The downside is reduced flexibility. Product teams often say “we might need that data later for a feature we haven’t built yet.” Data minimization requires strong upfront thinking about what you’ll actually use.
- Important Limitation: Minimization only reduces surface area. It doesn’t protect the data you do collect — you still need the other layers.
2. Storage Anonymization
Disguising Identity at Rest: Once data is collected, you need to store it somewhere. Storage anonymization (usually pseudonymization in practice) disguises direct identifiers so that raw user IDs aren’t sitting in every database.
Mental Model: The Witness Protection Program
When someone enters witness protection, they get a new identity — new name, new address, new documents. Their real identity is stored in a secure vault that only a few authorized people can access. Everyone else interacts with the new identity.
How Pseudonymization Works:
- Real UserID or email is replaced with a pseudo-ID (often a salted hash).
- The mapping table (real ID ↔ pseudo ID) is stored separately with strict access controls.
- All analytics, ML, and dashboards operate on pseudo-IDs only.
| Raw Data | Pseudonymized |
| --- | --- |
| john.doe@email.com | user_a8f3c2 |
| jane.smith@email.com | user_b7e4d1 |
| bob.jones@email.com | user_c9f5e3 |
A more nuanced detail around hashing: it is common practice to use a salt and a pepper in the hashing process for stronger privacy. The salt is a random string generated per user, often stored in cleartext alongside the record; the pepper is a system-level secret stored outside the database (for example, in an environment variable). The final hash is computed over raw data + salt + pepper, which narrows the threat vector if the database alone is compromised.
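Here is a minimal sketch of that idea in Python, assuming a SHA-256 hash and a pepper read from an environment variable. The function names, the environment variable, and the truncated pseudo-ID format are illustrative assumptions:

```python
import hashlib
import os
import secrets

# Assumption: the pepper is provisioned outside the database,
# e.g. via an environment variable or a secrets manager.
PEPPER = os.environ.get("PSEUDONYMIZATION_PEPPER", "change-me")

def pseudonymize(raw_id: str) -> tuple[str, str]:
    """Return (salt, pseudo_id) for a raw identifier such as an email."""
    salt = secrets.token_hex(8)   # per-user random salt, may be stored in cleartext
    digest = hashlib.sha256((raw_id + salt + PEPPER).encode()).hexdigest()
    return salt, "user_" + digest[:6]   # truncated for readability, like user_a8f3c2

salt, pseudo_id = pseudonymize("john.doe@email.com")
print(pseudo_id)   # e.g. user_3f9c2a (value depends on the salt and pepper)
```

The mapping of raw ID to pseudo-ID (and its salt) would live in a separate, tightly access-controlled table; everything downstream sees only the pseudo-ID.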
Key Nuance: Pseudonymization vs. True Anonymization
If the mapping table exists, this is pseudonymization — not true anonymization. Someone with access to the mapping can re-identify users.
True anonymization means destroying the mapping entirely. But this creates a problem: if you can’t identify users, you can’t honor deletion requests (“delete my data”) or export requests (“give me my data”). GDPR and similar laws require these capabilities, so most systems use pseudonymization.
Trade-offs
- The mapping table is still sensitive — it becomes a high-value target.
- Re-identification is still possible for anyone with mapping access.
- Behavior patterns can reveal identity even without direct IDs (e.g., the user who logs in at 8am from Seattle and searches for “gluten-free recipes” every day is identifiable by pattern).
3. Query Anonymization
Protecting What Comes Out: Storage anonymization protects data at rest. But what about when analysts query that data? Query anonymization ensures that query results don’t expose individual users.
Mental Model: The Crowded Room Rule
Imagine you’re a journalist who wants to report on salary data. You could publish: “John Smith earns $150,000” — but that violates John’s privacy.
Instead, you publish: “Engineers in Seattle earn an average of $145,000” — now John is protected by the crowd. Query anonymization enforces this crowd protection automatically.
Mechanisms:
- No raw user IDs in query results.
- Minimum cohort thresholds (k-anonymity; e.g., results must include at least 100 users).
- Small groups are dropped entirely (if only 3 users match, return nothing).
- Outliers are trimmed (the CEO’s $10M salary doesn’t skew “average salary” for everyone).
Example Output:
```text
Query: "Show me purchase behavior by ZIP code"

ZIP 98101: 5,000 users --> Returns data
ZIP 98102: 3,200 users --> Returns data
ZIP 98199:    12 users --> DROPPED (below threshold)
```
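A minimal sketch of how such a threshold could be enforced in a query layer, assuming a simple in-memory aggregation; the k = 100 value mirrors the example above, and the function and field names are assumptions:

```python
from collections import Counter

K_MIN = 100   # minimum cohort size; illustrative threshold

def purchases_by_zip(rows: list[dict]) -> dict[str, int]:
    """Aggregate user counts per ZIP and drop cohorts below the k threshold."""
    counts = Counter(row["zip"] for row in rows)
    return {zip_code: n for zip_code, n in counts.items() if n >= K_MIN}

# A ZIP with only 12 matching users simply disappears from the result,
# rather than returning a small, potentially identifying group.
```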
Where It’s Used
Analytics platforms, measurement dashboards, internal SQL tools, any system where humans or automated processes query user data.
Trade-offs
- Loss of granularity: You can’t analyze small segments.
- Missing Data: Small but valuable cohorts disappear entirely.
- Frustration: Can hinder debugging workflows (“why can’t I see data for this specific user?”).
- Limitation: Only protects query outputs — the underlying storage still contains full data.
4. Differential Privacy
Mathematical Guarantees: The previous techniques are practical but don’t offer formal guarantees. A determined attacker with enough queries might still extract individual information. Differential Privacy (DP) provides a mathematical guarantee: one user’s presence or absence in the dataset doesn’t materially affect the results.
Mental Model: The Noisy Crowd
Imagine you’re trying to figure out if your neighbor voted in the last election by looking at precinct-level turnout data.
If the precinct reports “exactly 1,247 people voted,” and you know 1,246 of your neighbors, you can deduce whether that one person voted.
Differential privacy adds calibrated random noise to the result. Instead of “1,247,” the system reports “approximately 1,240-1,260 people voted.” Now you can’t tell if your neighbor’s vote is in there or not — their individual contribution is smaller than the noise. It’s important to note that in practice a DP system often returns a single noisy value rather than a range, and that value can change each time the query is repeated.
How It Works
- A query runs against the real data.
- Before returning results, calibrated random noise is added (Laplace or Gaussian distribution).
- The amount of noise is controlled by a privacy parameter (ε, pronounced “epsilon”).
- Each query consumes some of the “privacy budget.”
- Once the budget is exhausted, no more queries are allowed.
```text
True count:   1,247
Noise added:  +/- random value
Reported:     1,251 (or 1,243, or 1,249...)

Result: Individual impact < Noise magnitude
```
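Here is a minimal sketch of the mechanism, assuming the Laplace mechanism for a counting query (sensitivity 1), NumPy for the noise draw, and a simple privacy-budget tracker. The class name and parameter values are illustrative, not a specific production system:

```python
import numpy as np

class PrivateCounter:
    """Toy Laplace mechanism for counting queries (sensitivity = 1)."""

    def __init__(self, total_budget: float):
        self.remaining_budget = total_budget   # total epsilon available

    def noisy_count(self, true_count: int, epsilon: float) -> float:
        if epsilon > self.remaining_budget:
            raise RuntimeError("Privacy budget exhausted; no more queries allowed.")
        self.remaining_budget -= epsilon       # each query spends part of the budget
        noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)  # scale = sensitivity / epsilon
        return true_count + noise

counter = PrivateCounter(total_budget=1.0)
print(round(counter.noisy_count(1247, epsilon=0.5)))   # e.g. 1251, different on each run
```

Smaller epsilon means more noise and stronger privacy; once the budget runs out, the counter refuses further queries.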
Where It’s Used
- Aggregations: Counts, sums, averages.
- Advertising: Reach and frequency measurement, Lift studies.
- ML Training: DP-SGD adds noise during model training.
- Real World: Apple uses differential privacy for keyboard suggestions; Google uses it for Chrome usage statistics; Meta uses it for ads measurement.
Trade-offs
- The Privacy-Utility Tradeoff: More privacy = more noise = less accuracy.
- Budget Exhaustion: After enough queries, you can’t ask more questions.
- Communication: Hard to explain to non-technical stakeholders (“why is this number fuzzy?”).
- Small Data: Doesn’t work well for small datasets where noise overwhelms signal.
- Format: Often returns single noisy numbers, not ranges, which can be counterintuitive.
How These Layers Work Together
The strongest privacy posture combines all four:
- Data Minimization: Only collect purchase amount, not full credit card.
- Storage Anonymization: Replace email with pseudo-ID in warehouse.
- Query Anonymization: Enforce a k = 100 minimum cohort on all dashboards.
- Differential Privacy: Add ε-calibrated noise to reported metrics.
Each layer catches what the previous layer missed. Minimization reduces what’s collected; Pseudonymization protects storage; Query anonymization protects outputs; DP provides mathematical guarantees on top.
Common Misconceptions
- ❌ “We anonymized the data, so we’re safe.”
- Pseudonymization (replacing IDs) is not anonymization. If a mapping exists, re-identification is possible.
- ❌ “We only share aggregates, so privacy is protected.”
- Without DP, repeated aggregate queries can leak individual information (the “differencing attack”; see the sketch after this list).
- ❌ “Differential privacy makes data useless.”
- At scale, DP works well. The noise averages out over billions of data points. For small datasets, however, DP can overwhelm the signal.
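To make the differencing attack concrete, here is a toy sketch: two aggregate queries that each look safe on their own but together reveal one person’s value. The names and numbers are made up for illustration:

```python
# Toy dataset: four salaries behind an "aggregates only" interface.
salaries = {"alice": 120_000, "bob": 135_000, "carol": 150_000, "dave": 150_000}

total_all     = sum(salaries.values())                              # query 1: everyone
total_no_dave = sum(v for n, v in salaries.items() if n != "dave")  # query 2: a filter that happens to exclude Dave

print(total_all - total_no_dave)   # 150000: Dave's exact salary, recovered from two aggregates
```

Differential privacy blunts this attack because each query returns a noisy total, so the difference between the two answers no longer pins down one individual’s contribution.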
Final Thought
Privacy protection isn’t a single technology — it’s a layered defense.
- Data Minimization: Collect only what you need.
- Storage Anonymization: Disguise identifiers at rest.
- Query Anonymization: Protect outputs with cohort thresholds.
- Differential Privacy: Add mathematical noise guarantees.
The art of privacy engineering is choosing the right combination for your use case — balancing utility against protection, and being honest about what each technique does and doesn’t guarantee.
In Part 2, we’ll cover the collaboration layer: how multiple organizations can work together on data without sharing raw information — Data Clean Rooms, identity mapping, purpose limitation, and secure hardware enclaves.