
When Entity Alignment Breaks Your Privacy Budget

The Challenge

A consortium of financial institutions participating in the UK-US PETs Prize Challenges faced a significant issue: how to train a shared fraud detection model across vertically partitioned datasets without exposing customer identities or transaction patterns. Each institution held different attributes about the same customers—one had transaction histories, another had credit scores, a third had account behavior patterns. To build an effective model, they needed to align records across institutions without revealing which customers they had in common.

The technical challenge wasn't just about encryption or secure computation. It was about entity resolution—determining which records across different datasets refer to the same person—while maintaining meaningful privacy guarantees. Use too much privacy protection and your model trains on noise. Use too little and you leak exactly the information you're trying to protect.

The Environment and Constraints

The consortium operated under three hard constraints. First, regulatory requirements meant they couldn't directly share customer identifiers. Second, they needed model accuracy sufficient for production deployment—a fraud detection system that generates too many false positives becomes unusable. Third, they had to complete training within compute budgets that made sense for their infrastructure.

The datasets were vertically partitioned, meaning each institution held different features for overlapping sets of customers. Unlike horizontal partitioning where everyone has the same features for different people, vertical partitioning requires you to figure out which records match before you can even start training. This entity alignment step becomes your first privacy leak point.
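A toy illustration of the distinction (the institutions, customer IDs, and fields below are made up for illustration, not the consortium's actual schema):

```python
# Toy illustration of vertical vs. horizontal partitioning.
# Institutions, IDs, and fields are hypothetical examples.

# Vertical partitioning: overlapping customers, different features per institution.
bank_a = {"cust-12345": {"avg_txn_amount": 82.40, "txn_count_30d": 41}}
bank_b = {"cust-12345": {"credit_score": 712}}
bank_c = {"cust-12345": {"logins_30d": 9, "devices_seen": 2}}

# Horizontal partitioning: same features, different customers per institution.
region_1 = {"cust-12345": {"credit_score": 712}}
region_2 = {"cust-67890": {"credit_score": 655}}

# With vertical partitioning, training requires joining records on customer
# identity first -- and that join is exactly where set membership can leak.
```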

The Approach Taken

The team evaluated two primary techniques for privacy-preserving entity alignment: private set intersection (PSI) and Bloom filters.

PSI protocols let parties compute the intersection of their datasets without revealing non-matching records. In theory, this sounds perfect—you only learn about customers you have in common. In practice, PSI reveals information only for rows that match on a common key, but that revelation itself can be significant. If Institution A learns that customer ID 12345 exists in Institution B's fraud database, that's a data point they didn't have before, even if they don't see the associated features.
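To make the mechanics concrete, here is a minimal sketch of one classic PSI construction, Diffie-Hellman-style double blinding. This is not necessarily the protocol the consortium used, and the prime, identifiers, and parameter sizes below are toy values that would not be secure in practice:

```python
# Minimal Diffie-Hellman-style PSI sketch (double blinding of hashed IDs).
# Toy parameters for readability; real deployments use vetted libraries,
# proper hash-to-group encodings, and secure parameter choices.
import hashlib
import secrets

P = 2**127 - 1  # a prime; fine for illustration, far too small for real use

def hash_to_group(identifier: str) -> int:
    """Map an identifier to a group element mod P (sketch only)."""
    digest = hashlib.sha256(identifier.encode()).digest()
    return pow(int.from_bytes(digest, "big") % P, 2, P)  # square into the subgroup

def blind(ids, secret):
    """Raise each hashed identifier to a party's private exponent."""
    return [pow(hash_to_group(x), secret, P) for x in ids]

# Each institution keeps its own private exponent.
a_secret = secrets.randbelow(P - 2) + 1
b_secret = secrets.randbelow(P - 2) + 1

a_ids = ["cust-12345", "cust-67890", "cust-11111"]
b_ids = ["cust-12345", "cust-22222"]

# A sends H(x)^a, B raises it to b; B sends H(y)^b, A raises it to a.
# Because (H(x)^a)^b == (H(x)^b)^a, double-blinded values collide only for shared IDs.
a_double = {pow(v, b_secret, P) for v in blind(a_ids, a_secret)}
b_double = {pow(v, a_secret, P) for v in blind(b_ids, b_secret)}

print("identifiers in common:", len(a_double & b_double))  # -> 1
```

Even in this simplified form, the party computing the intersection still learns exactly which blinded values matched, which is the membership leak described above.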

The consortium implemented PSI for the initial alignment phase but quickly hit performance walls. The cryptographic overhead of secure PSI protocols meant alignment took hours for datasets with millions of records. Worse, the binary nature of PSI—a record either matches or it doesn't—provided no fuzzy matching capability for records with slight identifier variations.

They then explored Bloom filters as an alternative. A Bloom filter is a probabilistic data structure that can test set membership with a known false positive rate. You hash each identifier multiple times and set corresponding bits in a bit array. To check if an identifier exists, you hash it the same way and check if all corresponding bits are set.
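The mechanics fit in a few lines of Python. The sketch below uses a plain, unkeyed construction purely to illustrate how membership testing works; privacy-preserving record linkage schemes typically use keyed or otherwise hardened variants:

```python
# Minimal Bloom filter sketch: k salted hashes set k bits per identifier.
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int, num_hashes: int):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, identifier: str):
        # Derive k bit positions by hashing the identifier with k different salts.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{identifier}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, identifier: str):
        for pos in self._positions(identifier):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, identifier: str) -> bool:
        # True means "possibly present" (false positives happen);
        # False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(identifier))

bf = BloomFilter(size_bits=10_000, num_hashes=4)
bf.add("cust-12345")
print(bf.might_contain("cust-12345"))  # True
print(bf.might_contain("cust-99999"))  # almost certainly False
```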

The key insight: Bloom filters produce false positives by design, and that uncertainty provides a form of input privacy protection. If your Bloom filter says customer ID 12345 might be in the set, that "might" creates plausible deniability. The false positive rate becomes a tunable privacy parameter: set it higher and you get more privacy but more noise in your aligned dataset; set it lower and you get cleaner alignment but weaker privacy protection.
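The standard sizing approximations make that tuning concrete: for n identifiers and a target false positive rate p, you need roughly m = -n·ln(p)/(ln 2)² bits and k = (m/n)·ln 2 hash functions. A small helper like the one below (a hypothetical utility, not something from the source) turns a chosen rate into filter parameters:

```python
# Bloom filter sizing from a target false positive rate, using the usual
# approximations: m = -n*ln(p)/(ln 2)^2 bits, k = (m/n)*ln 2 hash functions.
import math

def bloom_parameters(n_items: int, target_fpr: float):
    """Return (size_bits, num_hashes) sized for a desired false positive rate."""
    m = math.ceil(-n_items * math.log(target_fpr) / (math.log(2) ** 2))
    k = max(1, round((m / n_items) * math.log(2)))
    return m, k

# E.g., 1 million identifiers at a 1% false positive rate:
bits, hashes = bloom_parameters(1_000_000, 0.01)
print(bits, hashes)  # roughly 9.6 million bits (~1.2 MB) and 7 hash functions
```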

Results and Metrics

The source material doesn't provide specific performance numbers or accuracy metrics from the consortium's implementation. What we know is that both approaches revealed fundamental tradeoffs rather than solutions.

PSI provided stronger privacy guarantees for non-matching records but created a binary privacy boundary. Records that matched were fully exposed for the purposes of model training. The computational overhead made it impractical for iterative model development where you might need to re-align datasets as you refine your feature sets.

Bloom filters offered better performance and tunable privacy through false positive rates, but that tunability became its own problem. How do you decide the right false positive rate? Too high and your model trains on too much noise. Too low and you've barely improved over plain PSI. The team lacked clear guidance on mapping privacy requirements to filter parameters.

What They Would Do Differently

The consortium's experience revealed that entity alignment isn't a one-time privacy decision—it's an ongoing tradeoff that needs to be re-evaluated as your model evolves. If they were starting over, they would:

Establish privacy budgets upfront. Rather than choosing techniques based on theoretical privacy properties, they would define acceptable information leakage in concrete terms. How many false customer matches can your model tolerate? What's the maximum acceptable probability that an adversary could infer a customer's presence in a partner's dataset?

Build measurement into the alignment pipeline. They needed instrumentation to track actual privacy leakage, not just theoretical bounds. That means logging what information each party learns during alignment and quantifying it against the agreed privacy budget (a sketch of one way to do this follows below).

Consider hybrid approaches. PSI for high-confidence matches, Bloom filters for fuzzy matching, and differential privacy noise for the final aligned dataset. No single technique solved their problem—they needed composition.

Test with adversarial scenarios. The team spent too much time optimizing for honest-but-curious adversaries. Real threats include malicious insiders who might manipulate their input data to extract information about other parties' datasets.
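Returning to the measurement point above: one lightweight approach is a per-party "leakage ledger" that records each disclosure event during alignment and checks the running total against an agreed budget. The sketch below is hypothetical; the class name, the idea of counting disclosures as a single numeric budget, and the example values are all assumptions for illustration, not details from the consortium's work.

```python
# Hypothetical leakage ledger: log what a party learns during alignment and
# fail loudly when an agreed disclosure budget is exceeded.
from dataclasses import dataclass, field

@dataclass
class LeakageLedger:
    party: str
    max_disclosed_memberships: int            # agreed budget for this party
    events: list = field(default_factory=list)

    def record(self, description: str, memberships_revealed: int):
        """Log a disclosure event, e.g. 'PSI revealed N shared customer IDs'."""
        self.events.append((description, memberships_revealed))
        if self.total() > self.max_disclosed_memberships:
            raise RuntimeError(f"{self.party} exceeded its alignment privacy budget")

    def total(self) -> int:
        return sum(count for _, count in self.events)

ledger = LeakageLedger(party="Institution A", max_disclosed_memberships=50_000)
ledger.record("PSI intersection revealed shared customer IDs", 10_000)
print(ledger.total())  # 10000 of a 50000-record budget
```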

Takeaways for Your Team

If you're implementing privacy-preserving federated learning with vertical partitioning, here's what matters:

Entity alignment is your first privacy leak. Before you even start training, you've revealed set membership information. Budget for this. If your privacy analysis only covers model updates, you've missed the most obvious attack vector.

False positives aren't just bugs—they're privacy features. When evaluating probabilistic techniques like Bloom filters, the error rate is a parameter you tune based on your privacy requirements, not something you minimize automatically.

Performance costs compound across parties. In a two-party protocol, doubling computation time might be acceptable. In a ten-party protocol, it becomes prohibitive. Your technique needs to scale not just with data size but with participant count.

Map privacy requirements to technical parameters. "We need strong privacy" doesn't translate to filter configuration. Work backward from your compliance requirements: What specific inferences must you prevent? What false positive rate achieves that? What computational budget does that require?

The hardest part isn't implementing PSI or configuring Bloom filters. It's deciding when the performance cost of additional privacy measures is justified. That decision requires understanding not just the cryptographic properties of your techniques, but the actual risks in your threat model and the actual value of the accuracy you're sacrificing.

If you can't articulate the specific privacy harm you're preventing and measure whether you've prevented it, you're just adding overhead without knowing if it helps.
