---
title: Stateful Exposure-Aware De-Identification for Multimodal Streaming Data
emoji: 🔐
colorFrom: blue
colorTo: indigo
sdk: static
pinned: true
tags:
  - healthcare
  - privacy
  - de-identification
  - deidentification
  - reinforcement-learning
  - multimodal
  - streaming
  - phi
  - pii
  - hipaa
  - nlp
  - python
  - clinical-text
  - healthcare-nlp
  - medical-nlp
  - benchmark
  - synthetic
  - audit-log
  - risk-scoring
  - graph
  - re-identification
  - federated-learning
  - ehr
  - clinical-nlp
license: mit
short_description: Adaptive PHI de-identification for streaming multimodal data
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/69a896595f77a6ceac8e464f/0v2MecagKvP2L-7VUzjPS.png
models:
  - vkatg/dcpg-cross-modal-phi-risk-scorer
  - vkatg/exposureguard-policynet
  - vkatg/exposureguard-synthrewrite-t5
  - vkatg/exposureguard-dcpg-encoder
  - vkatg/exposureguard-fedcrdt-distill
datasets:
  - vkatg/streaming-phi-deidentification-benchmark
  - vkatg/dag_remediation_traces
---

# Stateful Exposure-Aware De-Identification for Multimodal Streaming Data



Imagine a patient admitted to a hospital. A clinical note is written. An hour later, a voice transcription is logged. A face region is flagged in an imaging header. A waveform timestamp appears in monitoring data.

None of these records identifies the patient on their own. A standard de-identification pipeline sees each one, strips the obvious PHI, and moves on. Risk resets between events. No memory. No accumulation.

But an adversary watching the stream sees something different. The name fragment. The matching date. The voice signature that correlates with the ASR transcript. The face region that ties to the note from this morning. Individually harmless. Together, identifying.

This is the problem that static masking cannot solve. And this is what this system was built to fix.


## What happens when you run it

### Re-identification risk: down 91.67%

A logistic regression adversary trained to distinguish patients from masked records loses almost all signal by the end of the stream. Multi-run mean delta-AUROC: -0.9167 +/- 0.0000 (95% CI, n = 10), replicated with zero variance across all ten independent runs.
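The delta-AUROC metric can be sketched in a few lines of pure Python using the Mann-Whitney formulation. This is an illustrative stand-in, not the repo's `eval.py` implementation:

```python
def auroc(patient_scores, other_scores):
    # Mann-Whitney formulation of AUROC: the probability that the adversary
    # scores a true patient record above a non-patient record (ties count half).
    pairs = [
        (p > n) + 0.5 * (p == n)
        for p in patient_scores
        for n in other_scores
    ]
    return sum(pairs) / len(pairs)

# Delta-AUROC is the adversary's AUROC on the masked stream minus its AUROC
# on the raw stream; a large negative delta means masking destroyed the
# adversary's ranking signal.
```

A fully blinded adversary lands at 0.5, i.e. chance.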

### Privacy and utility, at the same time

On the bursty workload, adaptive is the only policy above the 0.85 privacy floor with utility above 0.50. Every static policy fails on at least one side.

### 17 ms decision latency

Canonical multi-run masking decision latency across the pseudo, redact, and synthetic tiers. Latency is flat with respect to risk score: the controller does not slow down as exposure accumulates.


## The numbers

Bursty workload (primary evaluation scenario, new low-risk patients entering every 6 events):

| Policy | Privacy @ High Risk | Utility @ Low Risk | Consent Violations |
|------------------|-------|-------|----|
| Always-Raw       | 0.000 | 1.000 | 0  |
| Always-Weak      | 0.004 | 0.847 | 0  |
| Always-Synthetic | 0.564 | 0.676 | 0  |
| Always-Pseudo    | 0.855 | 0.440 | 0  |
| Always-Redact    | 1.000 | 0.000 | 17 |
| Adaptive         | 0.991 | 0.847 | 10 |

Always-Redact tops the privacy column, but utility collapses to zero and it incurs 17 consent violations. Always-Pseudo gets close on privacy but drops utility to 0.440. Always-Weak preserves utility but provides almost no protection when risk is high.

Adaptive is the only policy that clears both floors simultaneously. That is the core claim, and it holds across all three workloads tested.


## How it works

De-identification is treated as a longitudinal control problem, not a labeling task.

Per-subject exposure state is maintained across the full event stream. A continuously updated risk score governs which of five masking tiers to apply:

```
raw  ->  weak  ->  synthetic  ->  pseudo  ->  redact
              0.40           0.60        0.80
```
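Mapped to code, the escalation ladder is a simple threshold lookup. The raw-to-weak boundary is not stated in the ladder, so this sketch falls back to `weak` below 0.40; treat the function as illustrative rather than the controller's actual logic:

```python
# Escalation thresholds from the tier ladder above (strongest first).
TIER_THRESHOLDS = [
    (0.80, "redact"),
    (0.60, "pseudo"),
    (0.40, "synthetic"),
]

def select_tier(risk: float, low_tier: str = "weak") -> str:
    # Return the strongest tier whose threshold the risk score has crossed.
    # The raw->weak boundary is undocumented, so everything below 0.40
    # collapses to `low_tier` in this sketch.
    for threshold, tier in TIER_THRESHOLDS:
        if risk >= threshold:
            return tier
    return low_tier
```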

The risk score comes from a Dynamic Contextual Privacy Graph (DCPG), a per-patient, per-modality graph that accumulates PHI unit counts, detects cross-modal semantic links (cosine similarity > 0.30), and applies recency weighting to the entropy calculation.

When two or more cross-modal links are detected, a +0.20 link bonus fires. Three or more: +0.30. This is what causes the sharp risk jumps when the same patient appears across multiple modalities in a session.

Risk formula:

```
R = 0.8 * (1 - exp(-k * units)) + 0.2 * recency + link_bonus
```
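The formula reads as saturating exposure plus recency plus linkage. A minimal sketch, assuming an illustrative decay constant `k` and a clamp at 1.0 (both assumptions, not documented values):

```python
import math

def link_bonus(num_links: int) -> float:
    # +0.20 when two cross-modal links are detected, +0.30 at three or more.
    if num_links >= 3:
        return 0.30
    if num_links >= 2:
        return 0.20
    return 0.0

def risk_score(units: float, recency: float, num_links: int, k: float = 0.15) -> float:
    # R = 0.8 * (1 - exp(-k * units)) + 0.2 * recency + link_bonus
    # `k` here is illustrative, not the repo's tuned value.
    r = 0.8 * (1.0 - math.exp(-k * units)) + 0.2 * recency + link_bonus(num_links)
    return min(r, 1.0)  # clamp (an assumption) so the score stays in [0, 1]
```

The exponential term saturates: each additional PHI unit raises risk less than the last, while the link bonus produces the discrete jumps described above.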

Policy decisions are made by a PPO reinforcement learning agent with an LSTM backbone (128-dim hidden, 2 layers, 14-dim state), pre-trained over 200 stratified episodes before live deployment.

When risk crosses 0.68, pseudonym tokens are versioned forward. Prior references to the patient are retokenized without global reprocessing.
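Versioned pseudonyms make retokenization cheap: bump the version and re-derive. The scheme below (salted SHA-256, with a hypothetical `salt` and token format) sketches the idea; it is not the repo's deterministic generator:

```python
import hashlib

def pseudonym(patient_id: str, version: int, salt: str = "demo-salt") -> str:
    # Hypothetical versioned pseudonym: bumping `version` when risk crosses
    # 0.68 yields a fresh token, while earlier references can be retokenized
    # by re-deriving from (patient_id, old_version) -- no global reprocessing.
    digest = hashlib.sha256(f"{salt}:{patient_id}:v{version}".encode()).hexdigest()
    return f"PT-{digest[:10]}-v{version}"
```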

The risk score is validated against a closed-form combinatorial reconstruction probability: Pearson r = 0.881 across all 34 events. It tracks actual re-identification threat, not a proxy.


## The adversarial test

Here is what an attacker actually tries.

Space PHI disclosures at risk 0.34-0.39, just below the lowest escalation threshold, staying sub-threshold on every individual event so no policy escalation ever fires. Then probe cross-modally every 5th event to accumulate linkage signal before the DCPG link bonus kicks in.

This is a formally modeled sub-threshold probing attack. It works against every static policy. Always-Weak never escalates, regardless of how many cross-modal probes it sees.

The adaptive controller catches it. Cross-modal probes are detected via cosine similarity. A +0.15 risk nudge is applied on every probe event, pushing risk over the threshold and escalating policy to pseudo. The attacker's evasion strategy fails on every phase-4 event.
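The countermeasure can be sketched with the numbers quoted above (cosine threshold 0.30, +0.15 nudge); the embedding vectors and helper names are illustrative:

```python
import math

def cosine(u, v):
    # Plain cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def probe_nudge(event_vec, prior_modality_vecs, risk,
                sim_threshold=0.30, nudge=0.15):
    # If the new event semantically matches any prior event from another
    # modality (cosine > 0.30), treat it as a cross-modal probe and apply
    # the +0.15 risk nudge -- even though the event is sub-threshold alone.
    if any(cosine(event_vec, v) > sim_threshold for v in prior_modality_vecs):
        return min(risk + nudge, 1.0)
    return risk
```

Repeated probes each draw the nudge, so a sub-threshold attacker at 0.39 is pushed past the 0.40 escalation boundary on the first detected probe.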


## Try it live

State is shared across all users and sessions. Submit events for patient_001 and the next person who opens the Space picks up where you left off.

Open the live scorer

Suggested sequence:

  1. text: Patient John Smith, DOB 1985-03-12, MRN 00123456. Presenting to Riverside Clinic with chest discomfort.
  2. asr: John Smith date of birth March twelve nineteen eighty five complaining of chest pain
  3. audio_proxy: voice profile John Smith male DOB 1985-03-12
  4. image_proxy: face region detected patient John Smith MRN 00123456 Riverside Clinic
  5. text: John Smith MRN 00123456 follow-up cardiology referral DOB 1985-03-12

Watch `cross_modal_matches` fire on step 2. Watch the link bonus appear on step 5. Keep going until `trigger: true` and `pseudonym_version` increments to 1.


## Federated deployment

The DCPG state merges across edge devices using a CRDT. Two devices with overlapping patient observations converge to the same state regardless of update ordering or message arrival order.
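The order-independence claim follows from merging by element-wise maximum (G-counter style). A minimal sketch, assuming per-(patient, modality) PHI unit counts as the merged state; the actual `dcpg_crdt.py` structure may carry more fields:

```python
def merge(state_a: dict, state_b: dict) -> dict:
    # G-counter-style CRDT merge: take the element-wise maximum of
    # per-(patient, modality) unit counts. max() is commutative,
    # associative, and idempotent, so merge order never matters.
    keys = state_a.keys() | state_b.keys()
    return {k: max(state_a.get(k, 0), state_b.get(k, 0)) for k in keys}
```

Commutativity (`merge(a, b) == merge(b, a)`) and idempotence (`merge(a, a) == a`) are exactly what guarantees convergence under arbitrary message ordering and duplication.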


## Run it yourself

```shell
pip install phi-exposure-guard
python -m amphi_rl_dpgraph.run_demo
```

Results go to `results/`. Full test suite:

```shell
pytest -vv
```

## What's in the repo

| File | What it does |
|------|--------------|
| `dcpg.py` | Dynamic Contextual Privacy Graph: per-patient, per-modality PHI accumulation, cross-modal semantic links, risk entropy |
| `context_state.py` | Per-subject exposure state persisted via SQLite: effective units, recency factor, link bonus computation |
| `controller.py` | Risk-threshold adaptive policy selection, consent cap enforcement, retokenization trigger |
| `rl_agent.py` | PPO agent with LSTM policy network (128-dim hidden, 2 layers, 14-dim state) and reward computation |
| `masking.py` | Text, ASR, image, waveform, and audio masking operations across all five policy tiers |
| `masking_ops.py` | Unified masking dispatch: routes events to the correct CMO by modality and policy |
| `cmo_registry.py` | Composable Masking Operator registry with DAG execution, contract hashing, and action logging |
| `cmo_media.py` | Media-specific masking: image blur/overlay, audio pitch shift, waveform header obfuscation, synthetic text replacement |
| `flow_controller.py` | DAG-based policy flow controller with audit provenance and fallback handling |
| `dcpg_crdt.py` | CRDT-based graph merge for federated edge deployments, convergence-guaranteed |
| `dcpg_federation.py` | Gossip-bus federation: edge device state sync, live federation demo, deterministic pseudonym generation |
| `consent.py` | Per-patient consent token management: expiry, tier resolution, policy cap enforcement |
| `audit_signing.py` | Merkle-chained audit log with ECDSA signing, verification, and FHIR export |
| `schemas.py` | Core dataclasses: PHISpan, DataEvent, DecisionRecord, AuditRecord |
| `db.py` | SQLite connection management and cross-modal remask count queries |
| `phi_detector.py` | PHI span detection, leakage scoring, and synthetic match filtering |
| `eval.py` | Delta-AUROC computation, latency aggregation, policy table generation |
| `metrics.py` | Leakage score, utility proxy, and delta-AUROC rolling window metrics |
| `downstream_feedback.py` | Rolling utility monitor for live downstream feedback to the RL agent |
| `baseline_experiment.py` | Baseline policy comparison across monotonic, bursty, and mixed workloads with Pareto plotting |
| `run_demo.py` | Full demo runner: synthetic stream generation, PPO pretraining, live loop, all result plots and reports |



## Citation

```bibtex
@software{phi_exposure_guard,
  title  = {Stateful Exposure-Aware De-Identification for Multimodal Streaming Data},
  doi    = {10.5281/zenodo.18865882},
  url    = {https://doi.org/10.5281/zenodo.18865882}
}
```

See CITATION.cff in the GitHub repository for full citation details.

Patent notice: U.S. provisional patent application filed 2025-07-05. Public release: 2026-03-02.

All experiments run on fully synthetic data. This is research code, not a production compliance system.