Privacy Enhancing Technologies (PETs) — Part 2
Secure Collaboration Without Sharing Raw Data
In Part 1, we covered how individual organizations protect data internally — minimization, anonymization, query controls, and differential privacy.
But modern business often requires multiple parties to collaborate on data: advertisers measuring campaign effectiveness, retailers sharing purchase signals, healthcare providers conducting joint research.
The challenge: how do you extract insights from combined data without any party seeing the other’s raw information?
Part 2 covers the collaboration layer: the technologies that enable secure multi-party data work.
The Collaboration Problem
Consider this scenario: A retailer wants to know if an advertiser’s campaign drove in-store purchases.
- The Retailer has purchase data.
- The Advertiser has ad exposure data.
- Neither wants to share their raw data with the other — it’s competitively sensitive and privacy-regulated.
They need a way to answer: “Did people who saw the ad buy more products?” without either party seeing the other’s user-level data. This is where Data Clean Rooms, identity mapping, purpose controls, and secure hardware come in.
1. Identity Mapping (Crosswalks, Private Set Intersection)
Connecting Users Across Systems: Before any collaboration can happen, you need to know that “User A” in System 1 is the same person as “User B” in System 2. This is the identity mapping problem.
Mental Model: The Diplomatic Translator
Imagine that two countries speaking different languages want to negotiate a treaty. Neither wants to learn the other’s language (that would give away intelligence).
Instead, they use a trusted translator who can convert messages between languages without either side understanding the other’s native tongue.
How Identity Mapping Works:
- Each party has their own user identifiers (cookies, emails, device IDs).
- An identity provider creates hashed mappings that connect these IDs.
- Each party only sees their own mapping — not the full identity graph.
- The connection happens through cryptographic matching, not raw data sharing.
| Retailer’s View | Identity Provider (The “Crosswalk”) | Advertiser’s View |
| --- | --- | --- |
| customer_123 | Matches | cookie_abc |
| customer_456 | Matches | cookie_def |
| Sees Hash X | ↔ | Sees Hash X |
Private Set Intersection (PSI) with commutative encryption is another identity-mapping method, used when a deterministic, verified identifier such as an email address or phone number is available. Each party encrypts its identifiers with its own secret key and sends the results to the other party. Each party then encrypts the values it received with its own key as well. Because the encryption is commutative (encrypting with key A and then key B yields the same result as the reverse order), both parties arrive at identical doubly encrypted values for any identifier they share, so matches can be found without either side revealing the underlying data.
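The round trip above can be sketched with modular exponentiation serving as the commutative cipher. This is a toy illustration: the modulus, key handling, and email lists are all made up, and the parameters are far too small to be secure.

```python
import hashlib
import secrets

# Toy modulus for the sketch; real deployments use 2048-bit groups or
# elliptic curves. 2**127 - 1 is a Mersenne prime, chosen only for speed.
P = 2**127 - 1

def h(identifier: str) -> int:
    """Hash an identifier into the group."""
    return int.from_bytes(hashlib.sha256(identifier.encode()).digest(), "big") % P

def encrypt(value: int, key: int) -> int:
    """Commutative 'encryption': modular exponentiation."""
    return pow(value, key, P)

# Each party holds a private exponent (its secret key).
key_a = secrets.randbelow(P - 2) + 1
key_b = secrets.randbelow(P - 2) + 1

emails_a = {"alice@example.com", "bob@example.com"}
emails_b = {"bob@example.com", "carol@example.com"}

# Round 1: each party encrypts its own identifiers and shares the result.
enc_a = {encrypt(h(e), key_a) for e in emails_a}
enc_b = {encrypt(h(e), key_b) for e in emails_b}

# Round 2: each party encrypts the *other* side's values with its own key.
double_enc_from_a = {encrypt(v, key_b) for v in enc_a}  # E_b(E_a(x))
double_enc_from_b = {encrypt(v, key_a) for v in enc_b}  # E_a(E_b(x))

# Commutativity: E_b(E_a(x)) == E_a(E_b(x)), so shared emails collide
# in the doubly encrypted space without ever being revealed.
overlap = double_enc_from_a & double_enc_from_b
print(len(overlap))  # 1: only bob@example.com is in both sets
```

Note that each side learns the size of the intersection, which is exactly the point: overlap can be measured without either party seeing the other's list.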
Key Point: Identity mapping usually happens before data enters a clean room. The crosswalk is established first, then both parties bring their pre-mapped data to the collaboration environment.
Where It’s Used
Advertising measurement, retail media networks, cross-platform attribution, identity resolution services (LiveRamp, Experian, etc.).
Trade-offs
- High Privacy Risk: If the full graph is centralized anywhere, it becomes a target.
- Consent Management: Did the user agree to this linking?
- Decay: Identity graphs decay over time as cookies expire and devices change.
- Latency: Adds operational complexity and delay.
2. Data Clean Rooms (DCR)
The Secure Collaboration Space: A Data Clean Room (DCR) is a controlled environment where multiple parties can compute joint insights without sharing raw data.
Mental Model: The Sealed Arbitration Chamber
Imagine two companies in a legal dispute. Neither wants to show their confidential documents to the other.
Instead, they submit documents to a sealed arbitration chamber. A neutral arbitrator inside the chamber can read both sets of documents and issue a ruling — but the documents never leave the chamber, and neither party sees the other’s submissions.
How a Data Clean Room Works:
- Each party uploads their data (with hashed IDs from the crosswalk).
- Only pre-approved aggregate queries are allowed.
- The clean room enforces minimum thresholds (no results for small cohorts).
- No row-level data export — only aggregates come out.
- Often combined with differential privacy for additional protection.
+------------------Data Clean Room------------------+
| |
| Retailer uploads: Advertiser uploads: |
| [purchases by [ad exposures by |
| hashed user] hashed user] |
| |
| ↓ ↓ |
| +---------------------------+ |
| | Approved aggregate query: | |
| | "Conversion rate for | |
| | exposed vs. unexposed" | |
| +---------------------------+ |
| ↓ |
| [Aggregate result only] |
| "Lift: +12%" |
+---------------------------------------------------+
↓
Neither party sees the other's raw data
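The enforcement logic inside the chamber can be sketched as follows. The query names, cohort threshold, and noise scale are all illustrative assumptions, not any vendor's actual API.

```python
import math
import random

MIN_COHORT_SIZE = 50                  # assumed threshold; real values vary
APPROVED_QUERIES = {"conversion_rate"}

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def run_query(name: str, conversions: list, epsilon: float = 1.0) -> float:
    """Run a pre-approved aggregate query and return only a noisy result."""
    if name not in APPROVED_QUERIES:
        raise PermissionError(f"query {name!r} is not pre-approved")
    if len(conversions) < MIN_COHORT_SIZE:
        raise ValueError("cohort below minimum size; result withheld")
    rate = sum(conversions) / len(conversions)
    # Sensitivity of a mean over n binary values is 1/n, so the noise
    # scale for epsilon-DP is 1 / (n * epsilon).
    return rate + laplace_noise(1 / (len(conversions) * epsilon))

exposed = [1] * 30 + [0] * 70         # 100 exposed users, 30% converted
print(round(run_query("conversion_rate", exposed), 2))  # near 0.30, plus noise
```

The three guards mirror the bullets above: a query allowlist, a minimum-threshold check, and differential-privacy noise before anything leaves the room.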
Where It’s Used
Attribution measurement, sales lift studies, audience overlap analysis, measurement across walled gardens (Google, Meta, Amazon).
Trade-offs
- Slow Iteration: Can’t explore data freely.
- Operationally Heavy: Requires setup, governance, and approvals.
- Expensive: Clean room services aren’t cheap.
- Limited Flexibility: Only pre-approved queries work.
- Governance Trust: You still must trust the clean room operator.
3. Purpose Limitation
Binding Data to Declared Intent: Even within a single organization, data collected for one purpose shouldn’t automatically be available for all purposes.
Mental Model: The Library with Restricted Sections
Imagine a university library with different sections: general stacks (anyone can access), research archives (faculty only), and medical records (IRB-approved researchers only).
Your library card grants access based on your role and declared purpose. You can’t wander into medical records just because you have a card.
How Purpose Limitation Works:
- Data is tagged with purpose at collection time (ads, safety, research).
- Every query must declare its purpose.
- Access is granted only if the query’s purpose matches the data’s allowed purposes.
- Encryption gates enforce this at decryption time — data literally cannot be read without matching purpose.
| Data Tagged For | Query Declares | Result |
| --- | --- | --- |
| [ads, measurement] | "ads" | ✅ Access Granted |
| [ads, measurement] | "research" | ❌ Access Denied |
| [safety] | "ads" | ❌ Access Denied |
| [safety] | "safety" | ✅ Access Granted |
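A minimal sketch of the purpose check, with made-up dataset names and purpose tags (a real system would enforce this at decryption time, not just in application code):

```python
# Purpose tags attached to each dataset at collection time (illustrative).
DATA_PURPOSES = {
    "purchases_table": {"ads", "measurement"},
    "safety_reports": {"safety"},
}

def check_access(dataset: str, declared_purpose: str) -> bool:
    """Grant access only if the declared purpose is among the dataset's tags."""
    return declared_purpose in DATA_PURPOSES.get(dataset, set())

print(check_access("purchases_table", "ads"))       # True
print(check_access("purchases_table", "research"))  # False
print(check_access("safety_reports", "safety"))     # True
```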
Trade-offs
- Honesty: Relies on honest declaration (someone could lie about their purpose).
- Auditing: Needs strong auditing to catch misuse.
- Taxonomy Drift: New use cases might not fit old categories.
- Deterrence vs Prevention: Doesn’t stop malicious insiders instantly; it primarily detects and deters.
4. Trusted Execution Environments (TEE) & Confidential VMs
The Hardware Vault: All previous technologies assume you trust the system running the computation. But what if you don’t? What if even the cloud provider shouldn’t see the data?
Trusted Execution Environments (TEEs) provide hardware-level isolation — a secure enclave where even the operating system and cloud admin cannot see what’s happening inside.
Mental Model: The Sealed Black Box
Imagine a voting machine designed so that even the election officials can’t tamper with it. Votes go in, tallies come out, but no one — not the manufacturer, not the officials, not hackers — can see or modify the individual votes inside. The machine is physically sealed and cryptographically verified.
How a TEE Works:
- Data enters the enclave encrypted.
- Inside the enclave, data is decrypted and processed.
- Only the approved code can run (verified by “attestation”).
- The operating system, hypervisor, and cloud admin cannot see inside.
- Only approved aggregate outputs leave the enclave.
+------------Cloud Server------------+
| |
| Operating System (untrusted) |
| +---------TEE Enclave---------+ |
| | Encrypted memory | |
| | Only attested code runs | |
| | Data decrypted only inside | |
| | Admin cannot inspect | |
| +-----------------------------+ |
| |
+------------------------------------+
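The attestation step can be illustrated in miniature. Real attestation relies on hardware-signed quotes from the CPU; this sketch keeps only the core idea, that a data owner releases a decryption key only when the enclave's reported code measurement matches an approved hash:

```python
import hashlib

# The code the data owner has reviewed and approved (illustrative).
APPROVED_CODE = b"def aggregate(rows): return sum(rows) / len(rows)"
APPROVED_MEASUREMENT = hashlib.sha256(APPROVED_CODE).hexdigest()

def attest_and_release_key(reported_measurement: str, key: bytes) -> bytes:
    """Release the data key only to an enclave running the approved code."""
    if reported_measurement != APPROVED_MEASUREMENT:
        raise PermissionError("enclave is not running the approved code")
    return key

# An enclave that loaded the approved code reports a matching measurement
# and receives the key...
enclave_measurement = hashlib.sha256(APPROVED_CODE).hexdigest()
key = attest_and_release_key(enclave_measurement, b"secret-data-key")

# ...while tampered code produces a different measurement and is refused.
tampered = hashlib.sha256(b"def aggregate(rows): return rows").hexdigest()
```

This is why the "Code Trust" trade-off below matters: attestation proves *which* code is running, not that the code is safe.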
Confidential Virtual Machines (CVMs): A CVM applies TEE protection to an entire cloud virtual machine rather than a single process.
- TEE = the underlying hardware technology (e.g., Intel SGX for process-level enclaves, AMD SEV for VM-level memory encryption).
- CVM = a cloud deployment of that technology at the virtual-machine level.
Where It’s Used
Healthcare and genomics (HIPAA data), financial services (trading algorithms), secure ML training, private attribution.
Trade-offs
- Performance Overhead: Encryption/decryption costs time.
- Complex Debugging: You can’t inspect what’s inside to fix bugs.
- Cost: Cloud cost premium for confidential computing.
- Code Trust: If the code inside the enclave is buggy or malicious, the TEE won’t save you.
- Output Leakage: TEE protects processing, not results. You still need Differential Privacy on the output.
How These Technologies Work Together
A complete privacy-preserving collaboration might use all of these steps:
- Identity Mapping: Retailer and advertiser establish crosswalk via identity provider.
- Data Clean Room: Both parties upload data (keyed by pseudonyms) to a clean room.
- Confidential VM: The clean room runs inside a TEE/CVM for hardware protection.
- Purpose Limitation: Query declares “ads measurement”; system verifies permission.
- Query Execution: Approved query runs; Differential Privacy noise is added.
- Output: Only noisy aggregate leaves the room. Neither party sees raw data.
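A compressed, end-to-end sketch of these steps, with every function name and parameter invented for illustration (step 3, the CVM, is a hardware property and has no software analogue here):

```python
import hashlib
import math
import random

def pseudonymize(user_id: str, salt: str = "shared-crosswalk-salt") -> str:
    """Step 1: identity mapping. Both parties key rows by the same pseudonym."""
    return hashlib.sha256((salt + user_id).encode()).hexdigest()

def laplace(scale: float) -> float:
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

# Step 2: both parties upload pseudonym-keyed data to the clean room.
ad_exposures = {pseudonymize(u) for u in ["u1", "u2", "u3", "u4"]}
store_purchases = {pseudonymize(u) for u in ["u2", "u3", "u4", "u5"]}

# Step 4: the query declares a purpose that must match the data's tags (stubbed).
declared_purpose, allowed_purposes = "ads_measurement", {"ads_measurement"}
assert declared_purpose in allowed_purposes

# Step 5: the approved aggregate query runs, and Laplace noise is added.
exposed_and_bought = len(ad_exposures & store_purchases)
noisy_result = exposed_and_bought + laplace(scale=1.0)

# Step 6: only the noisy aggregate leaves the room.
print(round(noisy_result, 1))
```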
The Encryption Foundation: Diffie-Hellman and AES
You might wonder: how does data stay encrypted while moving between systems?
- Diffie-Hellman Key Exchange: Allows two parties to generate the same secret key over a public network without ever transmitting the key itself. (Like mixing paint colors to match a secret shade without showing the original colors).
- AES Symmetric Encryption: Once the shared key is established, AES uses it to encrypt the actual data. It is the standard for secure communication (HTTPS, databases, enclaves).
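A toy Diffie-Hellman exchange fits in a few lines. The prime and generator here are illustrative and far too small for real use; deployments use standardized 2048-bit groups or elliptic curves. AES itself requires a third-party library (e.g., `cryptography`), so only the key derivation is shown.

```python
import hashlib
import secrets

P = 2**127 - 1   # toy prime modulus (a Mersenne prime); insecurely small
g = 5            # toy generator

a = secrets.randbelow(P - 2) + 1   # Alice's private key, never transmitted
b = secrets.randbelow(P - 2) + 1   # Bob's private key, never transmitted

A = pow(g, a, P)   # Alice sends A over the public network
B = pow(g, b, P)   # Bob sends B over the public network

# Each side combines its private key with the other's public value:
# (g^b)^a == (g^a)^b mod P, so both arrive at the same secret.
shared_alice = pow(B, a, P)
shared_bob = pow(A, b, P)
assert shared_alice == shared_bob

# Hash the shared secret down to a 256-bit key suitable for AES.
aes_key = hashlib.sha256(str(shared_alice).encode()).digest()
print(len(aes_key))  # 32 bytes (a 256-bit key)
```

An eavesdropper sees only g, P, A, and B; recovering the shared secret from those is the discrete logarithm problem, which is what makes the exchange safe at realistic key sizes.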
Choosing the Right Tools
| Scenario | Technologies to Consider |
| --- | --- |
| Two companies measuring ad effectiveness | Identity mapping + Data Clean Room + DP |
| Healthcare joint research | Clean Room inside CVM + Strict Purpose Controls |
| Internal analytics on sensitive data | Purpose Limitation + Query Anonymization + DP |
| ML training on user data | DP-SGD + TEE for model training |
| Cross-platform identity resolution | Decentralized crosswalks + No centralized graph |
Common Misconceptions
- ❌ “Clean rooms mean no one sees the data.”
- The clean room operator (or the code running inside) does see the data during computation. TEEs reduce this trust requirement, but you still trust the code. True “no one sees” requires advanced cryptography like Fully Homomorphic Encryption (FHE), which is currently too slow for most uses.
- ❌ “TEEs solve all privacy problems.”
- TEEs protect data during processing, but outputs can still leak information. A TEE that returns exact user counts is not private. You need DP or query anonymization on top.
- ❌ “Identity mapping is anonymous.”
- Hashed IDs are pseudonymous, not anonymous. If an attacker obtains the hash function and your email, they can compute your hash and re-identify you.
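A few lines make the risk concrete, using a made-up leaked hash set:

```python
import hashlib

def hash_id(email: str) -> str:
    """The (unsalted) hashing scheme assumed for this example."""
    return hashlib.sha256(email.encode()).hexdigest()

# Suppose these pseudonymous IDs leak from a dataset.
leaked_hashes = {hash_id("bob@example.com"), hash_id("carol@example.com")}

# An attacker who knows the hash function simply hashes candidate emails
# and checks for membership. Hashing hides nothing from someone who can
# guess the input.
candidate = "bob@example.com"
print(hash_id(candidate) in leaked_hashes)  # True: Bob is re-identified
```

Salting the hash with a secret raises the bar, but anyone holding the salt can still run the same dictionary attack.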
Final Thought
Privacy-preserving collaboration is a layered system.
- Identity Mapping: Connect users across systems without sharing raw IDs.
- Data Clean Rooms: Compute joint insights without exposing raw data.
- Purpose Limitation: Bind data usage to declared intent.
- TEEs and CVMs: Protect data even from the infrastructure operator.
The art of privacy engineering is layering these appropriately — understanding what each tool protects, what it doesn’t, and where the residual risks remain. No single technology is a silver bullet. But combined thoughtfully, they enable collaboration that would otherwise be impossible — extracting value from data while genuinely protecting the individuals behind it.