# ExposureGuard-FedCRDT-Distill
Distilled MLP for post-merge CRDT re-identification risk prediction. Takes merged graph state from two edge devices and predicts whether the combined PHI exposure crosses the retokenization threshold, including cases where neither device would have escalated independently.
Inference has no dependencies beyond the Python standard library. Runs anywhere. Weights are 22 KB.
## The problem this solves
In a federated clinical deployment, device A might process a patient's text notes while device B handles audio transcripts and imaging metadata. Each device runs de-identification independently and applies masking based on its local risk score. Neither device can see the other's data.
After a CRDT merge, the combined graph may reveal cross-modal linkage that neither device detected alone. A patient whose text risk was 0.31 and whose audio risk was 0.28 might have a merged risk of 0.74. Neither device would have escalated. The merged state demands it.
This model is the inference component for that scenario. It takes the merged node state and predicts merged risk and retokenization probability without any raw PHI leaving either device.
## Performance
Trained on 8,000 synthetic CRDT merge scenarios, evaluated on 2,000.
| Metric | Value |
|---|---|
| Risk MAE | 0.033 |
| Retok accuracy | 97.4% |
| Invisible-pre-merge recall | 90.5% |
| n_invisible (test set) | 315 |
Invisible-pre-merge recall is the metric that matters most here. It measures how often the model correctly triggers retokenization on cases where cross-modal linkage is only visible after the merge: both pre-merge device risks were at or below the 0.55 threshold, but the merged risk crossed it. 90.5% recall on 315 such cases in the test set.
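The criterion behind this metric can be written down directly. A minimal sketch, assuming the 0.55 threshold stated above; the helper name is hypothetical, not part of the released API:

```python
# Retokenization threshold from this card.
THRESHOLD = 0.55

def invisible_pre_merge(pre_risk_a: float, pre_risk_b: float, merged_risk: float) -> bool:
    """True when neither device alone would escalate, but the merged state does."""
    return (
        pre_risk_a <= THRESHOLD
        and pre_risk_b <= THRESHOLD
        and merged_risk > THRESHOLD
    )

# The example from this card: 0.31 and 0.28 pre-merge, 0.74 merged.
print(invisible_pre_merge(0.31, 0.28, 0.74))  # True
```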
## Architecture

Student MLP distilled from `CRDTGraph.risk_for()` and the threshold logic in `dcpg_crdt.py`.

```
CRDT merged state (53 features)
            |
    Linear 53->64 + ReLU
            |
    Linear 64->32 + ReLU
            |
    Linear 32->2 + Sigmoid
            |
  [risk_score, retok_prob]
```
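For reference, a forward pass through this 53->64->32->2 network can be written in pure standard-library Python, consistent with the no-dependency claim above. This is an illustrative sketch with random weights, not the released loader:

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def linear(x, W, b):
    # W is out x in, b is out-dimensional.
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

def forward(x, p):
    h = relu(linear(x, p["W1"], p["b1"]))
    h = relu(linear(h, p["W2"], p["b2"]))
    return sigmoid(linear(h, p["W3"], p["b3"]))

# Random weights just to exercise the shapes (53 -> 64 -> 32 -> 2).
rng = random.Random(0)
def rand_layer(n_out, n_in):
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

params = {}
params["W1"], params["b1"] = rand_layer(64, 53)
params["W2"], params["b2"] = rand_layer(32, 64)
params["W3"], params["b3"] = rand_layer(2, 32)

out = forward([0.5] * 53, params)
print(len(out))  # 2 outputs: [risk_score, retok_prob]
```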
### Input features (53 dimensions)

Per modality (5 modalities x 9 dims = 45):

- phi_units_norm
- link_signals_norm
- pseudonym_version_norm
- risk_entropy
- modality one-hot (5 dims)

Graph-level (8):

- total_units_norm
- degree_norm
- max_risk_entropy
- mean_risk_entropy
- total_link_norm
- cross_modal_flag
- pre_risk_a
- pre_risk_b
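A sketch of how a 53-dim input vector could be assembled from the fields listed above. The exact modality list and dict layout are assumptions (only `text` and `audio_proxy` appear elsewhere in this card), so treat the names here as placeholders:

```python
# Hypothetical modality order; only "text" and "audio_proxy" are confirmed by this card.
MODALITIES = ["text", "audio_proxy", "image", "asr", "metadata"]

def modality_block(stats, modality):
    # 4 normalized stats + 5-dim modality one-hot = 9 dims.
    one_hot = [1.0 if m == modality else 0.0 for m in MODALITIES]
    return [
        stats.get("phi_units_norm", 0.0),
        stats.get("link_signals_norm", 0.0),
        stats.get("pseudonym_version_norm", 0.0),
        stats.get("risk_entropy", 0.0),
    ] + one_hot

def build_features(per_modality, graph):
    feats = []
    for m in MODALITIES:                       # 5 x 9 = 45 dims
        feats += modality_block(per_modality.get(m, {}), m)
    feats += [                                 # 8 graph-level dims
        graph["total_units_norm"], graph["degree_norm"],
        graph["max_risk_entropy"], graph["mean_risk_entropy"],
        graph["total_link_norm"], graph["cross_modal_flag"],
        graph["pre_risk_a"], graph["pre_risk_b"],
    ]
    return feats

example = build_features(
    {"text": {"phi_units_norm": 0.3, "risk_entropy": 0.5}},
    {"total_units_norm": 0.2, "degree_norm": 0.1, "max_risk_entropy": 0.5,
     "mean_risk_entropy": 0.4, "total_link_norm": 0.1, "cross_modal_flag": 1.0,
     "pre_risk_a": 0.31, "pre_risk_b": 0.28},
)
print(len(example))  # 53
```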
## Training
- 10,000 synthetic CRDT merge scenarios (8K train / 2K test)
- 80 epochs, SGD, lr=0.005, batch size=64
- Loss: MSE (risk) + BCE (retok)
- Retok positive rate in training set: 17.4%
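The composite loss above (MSE on the risk head plus BCE on the retok head) can be sketched per example as follows. The equal 1:1 weighting of the two terms is an assumption; the card does not state a mixing coefficient:

```python
import math

def merge_loss(pred, target):
    # pred and target are (risk_score, retok_prob) pairs.
    risk_p, retok_p = pred
    risk_t, retok_t = target
    mse = (risk_p - risk_t) ** 2
    eps = 1e-7  # clamp to avoid log(0)
    p = min(max(retok_p, eps), 1.0 - eps)
    bce = -(retok_t * math.log(p) + (1.0 - retok_t) * math.log(1.0 - p))
    return mse + bce

# A perfect prediction gives (near-)zero loss; a bad one does not.
print(merge_loss((0.70, 1.0), (0.70, 1.0)))
print(merge_loss((0.20, 0.1), (0.70, 1.0)))
```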
## Usage

```python
from fedcrdt_distill import FedCRDTDistill, CRDTNodeState

model = FedCRDTDistill.load("model_weights.bin")

nodes = [
    CRDTNodeState(patient_key="p1", modality="text"),
    CRDTNodeState(patient_key="p1", modality="audio_proxy"),
]
nodes[0].phi_unit_counts["dev_A"] = 3
nodes[1].phi_unit_counts["dev_B"] = 2
nodes[1].link_counts["dev_B"] = 2

result = model.predict(nodes, pre_risk_a=0.31, pre_risk_b=0.28)
# {
#   "risk_score": 0.74,
#   "retok_prob": 0.91,
#   "retok_trigger": True,
#   "invisible_pre_merge": True
# }
```

From a `CRDTGraph.summary()` output:

```python
from inference import predict_merge

result = predict_merge(summary_device_a, summary_device_b)
```
## Where it fits

```
 Device A (text, image)          Device B (audio, asr)
       CRDTGraph                       CRDTGraph
           |                               |
           +-------- CRDT merge -----------+
                          |
               FedCRDTDistill.predict()
                          |
              +-----------+-----------+
         risk_score              retok_trigger
              |                       |
         DCPGEncoder          pseudonym_version++
 (exposureguard-dcpg-encoder) (break linkage continuity)
```
## Reproduce

The full training script is in `train_fedcrdt.py`. It generates synthetic CRDT merge scenarios from the same teacher logic as `dcpg_crdt.py`, trains the MLP, and saves `model_weights.bin` and `train_metrics.json`. Training requires numpy; inference does not.

```shell
pip install numpy
python train_fedcrdt.py
```
## Related
- phi-exposure-guard: full system with CRDT federation, PPO agent, and adaptive masking pipeline
- dcpg-cross-modal-phi-risk-scorer: single-device risk scorer
- exposureguard-policynet: policy classification model
- exposureguard-dcpg-encoder: DCPG graph encoder
- streaming-phi-deidentification-benchmark: benchmark dataset
- multimodal-phi-masking-benchmark: PHI masking dataset with signed and FHIR configs
## Citation

```bibtex
@software{exposureguard_fedcrdt,
  title  = {ExposureGuard-FedCRDT-Distill: Federated CRDT Risk Distillation for PHI De-identification},
  author = {Ganti, Venkata Krishna Azith Teja},
  doi    = {10.5281/zenodo.18865882},
  url    = {https://huggingface.co/vkatg/exposureguard-fedcrdt-distill},
  note   = {US Provisional Patent filed 2025-07-05}
}
```
MIT License. All training data is fully synthetic.