ExposureGuard-FedCRDT-Distill


Distilled MLP for post-merge CRDT re-identification risk prediction. Takes merged graph state from two edge devices and predicts whether the combined PHI exposure crosses the retokenization threshold, including cases where neither device would have escalated independently.

No dependencies beyond Python standard library. Runs anywhere. Weights are 22KB.


The problem this solves

In a federated clinical deployment, device A might process a patient's text notes while device B handles audio transcripts and imaging metadata. Each device runs de-identification independently and applies masking based on its local risk score. Neither device can see the other's data.

After a CRDT merge, the combined graph may reveal cross-modal linkage that neither device detected alone. A patient whose text risk was 0.31 and whose audio risk was 0.28 might have a merged risk of 0.74. Neither device would have escalated. The merged state demands it.

This model is the inference component for that scenario. It takes the merged node state and predicts merged risk and retokenization probability without any raw PHI leaving either device.
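The escalation gap described above reduces to a simple predicate. A minimal sketch, using the 0.55 retokenization threshold stated in this card (the function name is ours, not part of the shipped API):

```python
# Retokenization threshold from this model card.
THRESHOLD = 0.55

def invisible_pre_merge(pre_risk_a, pre_risk_b, merged_risk):
    """True when neither device alone would escalate,
    but the merged state crosses the threshold."""
    return (pre_risk_a <= THRESHOLD
            and pre_risk_b <= THRESHOLD
            and merged_risk > THRESHOLD)
```

The worked example from the text (0.31 and 0.28 pre-merge, 0.74 merged) satisfies this predicate, which is exactly the case the model is trained to catch.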


Performance

Trained on 8,000 synthetic CRDT merge scenarios, evaluated on 2,000.

Metric                        Value
----------------------------  -----
Risk MAE                      0.033
Retok accuracy                97.4%
Invisible-pre-merge recall    90.5%
n_invisible (test set)        315

Invisible-pre-merge recall is the metric that matters most here. It measures how often the model correctly triggers retokenization on cases where cross-modal linkage is only visible after the merge: both pre-merge device risks were at or below the 0.55 threshold, but the merged risk crossed it. 90.5% recall on 315 such cases in the test set.
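Concretely, the metric restricts the test set to merges where both pre-merge risks sat at or below 0.55 while the true merged risk exceeded it, then asks how many of those the model flags. A sketch of that computation (tuple layout is our illustration, not the evaluation script's format):

```python
THRESHOLD = 0.55

def invisible_pre_merge_recall(cases):
    """cases: (pre_risk_a, pre_risk_b, true_merged_risk, predicted_retok) tuples."""
    # Restrict to merges invisible to both devices before merging.
    invisible = [c for c in cases
                 if c[0] <= THRESHOLD and c[1] <= THRESHOLD and c[2] > THRESHOLD]
    caught = sum(1 for c in invisible if c[3])
    return caught / len(invisible) if invisible else float("nan")
```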


Architecture

Student MLP distilled from CRDTGraph.risk_for() and threshold logic in dcpg_crdt.py.

CRDT merged state (53 features)
        |
   Linear 53->64 + ReLU
        |
   Linear 64->32 + ReLU
        |
   Linear 32->2 + Sigmoid
        |
  [risk_score, retok_prob]
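Because the model is standard-library-only, the forward pass is just three matrix-vector products with elementwise activations. A pure-Python sketch of the diagram above (parameter dict layout and key names are our assumptions, not the actual FedCRDTDistill internals):

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def linear(x, W, b):
    # W: list of rows, each row the length of x; b: one bias per row.
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def forward(x, params):
    h = relu(linear(x, params["W1"], params["b1"]))       # 53 -> 64
    h = relu(linear(h, params["W2"], params["b2"]))       # 64 -> 32
    out = sigmoid(linear(h, params["W3"], params["b3"]))  # 32 -> 2
    return {"risk_score": out[0], "retok_prob": out[1]}
```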

Input features (53 dimensions)

Per modality (5 modalities x 9 dims = 45):

  • phi_units_norm, link_signals_norm, pseudonym_version_norm, risk_entropy
  • modality one-hot (5 dims)

Graph-level (8):

  • total_units_norm, degree_norm, max_risk_entropy, mean_risk_entropy
  • total_link_norm, cross_modal_flag, pre_risk_a, pre_risk_b
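The 53-dimensional layout above can be assembled as five fixed-order 9-dim modality blocks followed by the 8 graph-level features. A sketch under assumptions: the modality names and their ordering below are hypothetical (the card only names text, image, audio/audio_proxy, and asr), and absent modalities are zero-filled:

```python
# Ordering and names are hypothetical illustrations, not the trained layout.
MODALITIES = ["text", "image", "audio_proxy", "asr", "other"]

def modality_block(modality, phi_units_norm, link_signals_norm,
                   pseudonym_version_norm, risk_entropy):
    one_hot = [1.0 if m == modality else 0.0 for m in MODALITIES]
    # 4 scalar features + 5-dim one-hot = 9 dims.
    return [phi_units_norm, link_signals_norm,
            pseudonym_version_norm, risk_entropy] + one_hot

def feature_vector(per_modality, graph_level):
    """per_modality: {modality: (4 scalars)}; graph_level: 8 scalars."""
    vec = []
    for m in MODALITIES:
        vec += modality_block(m, *per_modality[m]) if m in per_modality else [0.0] * 9
    return vec + list(graph_level)  # 5 * 9 + 8 = 53
```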

Training

  • 10,000 synthetic CRDT merge scenarios (8K train / 2K test)
  • 80 epochs, SGD, lr=0.005, batch size=64
  • Loss: MSE (risk) + BCE (retok)
  • Retok positive rate in training set: 17.4%
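The two-term loss in the bullets above combines a regression term on the risk head with a classification term on the retok head. A per-example sketch (the summation weights and clipping epsilon are our assumptions; the training script may differ):

```python
import math

def merge_loss(pred_risk, true_risk, pred_retok, true_retok, eps=1e-7):
    # MSE on the continuous merged-risk prediction.
    mse = (pred_risk - true_risk) ** 2
    # BCE on the retokenization probability, clipped for numerical safety.
    p = min(max(pred_retok, eps), 1.0 - eps)
    bce = -(true_retok * math.log(p) + (1.0 - true_retok) * math.log(1.0 - p))
    return mse + bce
```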

Usage

from fedcrdt_distill import FedCRDTDistill, CRDTNodeState

model = FedCRDTDistill.load("model_weights.bin")

nodes = [
    CRDTNodeState(patient_key="p1", modality="text"),
    CRDTNodeState(patient_key="p1", modality="audio_proxy"),
]
nodes[0].phi_unit_counts["dev_A"] = 3
nodes[1].phi_unit_counts["dev_B"] = 2
nodes[1].link_counts["dev_B"] = 2

result = model.predict(nodes, pre_risk_a=0.31, pre_risk_b=0.28)
# {
#   "risk_score": 0.74,
#   "retok_prob": 0.91,
#   "retok_trigger": True,
#   "invisible_pre_merge": True
# }

From a CRDTGraph.summary() output:

from inference import predict_merge

result = predict_merge(summary_device_a, summary_device_b)

Where it fits

Device A (text, image)          Device B (audio, asr)
    CRDTGraph                       CRDTGraph
        |                               |
        +-------- CRDT merge -----------+
                       |
              FedCRDTDistill.predict()
                       |
           +-----------+-----------+
        risk_score             retok_trigger
           |                        |
      DCPGEncoder            pseudonym_version++
  (exposureguard-dcpg-encoder)   (break linkage continuity)
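The right-hand branch of the diagram can be glued together in a few lines. A hypothetical sketch (function and dict keys are ours; the card does not specify this glue code):

```python
def apply_merge_decision(result, node_state):
    """Hypothetical glue for the flow above: a retok trigger bumps the
    pseudonym version, breaking linkage continuity on the next sync."""
    if result["retok_trigger"]:
        node_state["pseudonym_version"] += 1
    return node_state
```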

Reproduce

The full training script is in train_fedcrdt.py. It generates synthetic CRDT merge scenarios from the same teacher logic as dcpg_crdt.py, trains the MLP, and saves model_weights.bin and train_metrics.json. Note that numpy is a training-time dependency only; the shipped inference code remains standard-library-only.

pip install numpy
python train_fedcrdt.py

Citation

@software{exposureguard_fedcrdt,
  title  = {ExposureGuard-FedCRDT-Distill: Federated CRDT Risk Distillation for PHI De-identification},
  author = {Ganti, Venkata Krishna Azith Teja},
  doi    = {10.5281/zenodo.18865882},
  url    = {https://huggingface.co/vkatg/exposureguard-fedcrdt-distill},
  note   = {US Provisional Patent filed 2025-07-05}
}

MIT License. All training data is fully synthetic.
