---
license: mit
language:
- en
tags:
- phi
- de-identification
- healthcare
- privacy
- hipaa
- graph-neural-network
- gat
- clinical-nlp
- multimodal
- patient-risk
- embeddings
- re-identification
- ehr
- streaming
pipeline_tag: feature-extraction
library_name: generic
datasets:
- vkatg/streaming-phi-deidentification-benchmark
- vkatg/multimodal-phi-masking-benchmark
---

# ExposureGuard-DCPG-Encoder

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18865882.svg)](https://doi.org/10.5281/zenodo.18865882)

A PHI exposure graph is not a bag of records. It has structure. A patient's name appears in a clinical note. The same name appears in an ASR transcript 20 minutes later. A matching date shows up in an imaging header. A voice profile correlates with the ASR content. Each of these connections is an edge. Each modality is a node. The risk of re-identification depends on how that graph is connected, not on any single record in isolation.

This model encodes that graph.

---

## What it produces

A 16-dimensional patient embedding capturing the full cross-modal PHI exposure topology, plus a scalar risk score. Both come from a two-layer graph attention network that runs directly over the DCPG structure. No transformers, no external ML framework, no dependencies beyond Python stdlib. The whole thing is 22KB.

The embedding feeds downstream into PolicyNet for masking policy decisions and SynthRewrite-T5 for synthetic text generation. The risk score feeds into FedCRDT-Distill when operating in a federated setting.

---

## Why graph attention specifically

Standard PHI de-identification aggregates per-record features. This model treats the exposure history as a graph and runs attention over it, so nodes with high risk entropy carry more weight during pooling. A text node carrying a name, date, and MRN exerts more influence over the final embedding than a waveform node carrying only a timestamp. That weighting is learned from the graph structure, not hand-coded.

Cross-modal edges matter here. The attention mechanism propagates information across modality boundaries before pooling, so the final embedding reflects not just what each modality contains but how they link to each other.
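
As a rough sketch of the pooling step, the idea can be illustrated with a softmax over node `risk_entropy` values. Note this is an assumption for illustration: the card states only that pooling is weighted by risk entropy, not that softmax is the exact weighting function.

```python
import math

def attention_pool(node_embeddings, risk_entropies):
    """Pool per-node embeddings into one patient vector, softmax-weighted
    by each node's risk entropy so high-risk nodes dominate the result.
    (Softmax weighting is an illustrative assumption.)"""
    exps = [math.exp(r) for r in risk_entropies]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(node_embeddings[0])
    pooled = [sum(w * emb[i] for w, emb in zip(weights, node_embeddings))
              for i in range(dim)]
    # L2-normalize, matching the documented output contract
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]
```

With this weighting, a high-entropy text node pulls the pooled vector toward its own embedding far more than a low-entropy waveform node does.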

---

## Architecture

```
Input graph (nodes + edges)
        |
  Layer 1: GAT  [19 -> 32]
        |
  Layer 2: GAT  [32 -> 16]
        |
  Attention pool (weighted by risk_entropy)
        |
  patient_embedding [16]  +  risk_score [0,1]
```

**Node features (19 dims)**

| Group | Dims | Content |
|---|---|---|
| Modality one-hot | 8 | text, asr, image_proxy, waveform_proxy, audio_proxy, image_link, audio_link, unknown |
| PHI type one-hot | 8 | NAME_DATE_MRN_FACILITY, NAME_DATE_MRN, FACE_IMAGE, WAVEFORM_HEADER, VOICE, FACE_LINK, VOICE_LINK, unknown |
| Scalars | 3 | risk_entropy, context_confidence, pseudonym_version_norm |
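
A minimal sketch of assembling the 19-dim node feature vector from a node record like the ones in the Input section below. The `unknown` fallback slots and the normalization constant for `pseudonym_version` are assumptions; the actual encoder may normalize differently.

```python
MODALITIES = ["text", "asr", "image_proxy", "waveform_proxy",
              "audio_proxy", "image_link", "audio_link", "unknown"]
PHI_TYPES = ["NAME_DATE_MRN_FACILITY", "NAME_DATE_MRN", "FACE_IMAGE",
             "WAVEFORM_HEADER", "VOICE", "FACE_LINK", "VOICE_LINK", "unknown"]

def node_features(node, max_pseudonym_version=10):
    """Build the 19-dim vector: 8 modality one-hot + 8 PHI-type one-hot
    + 3 scalars. max_pseudonym_version is an assumed normalizer."""
    mod = [1.0 if node.get("modality") == m else 0.0 for m in MODALITIES]
    if sum(mod) == 0:
        mod[-1] = 1.0  # unrecognized modality falls into the 'unknown' slot
    phi = [1.0 if node.get("phi_type") == p else 0.0 for p in PHI_TYPES]
    if sum(phi) == 0:
        phi[-1] = 1.0
    scalars = [
        node.get("risk_entropy", 0.0),
        node.get("context_confidence", 0.0),
        node.get("pseudonym_version", 0) / max_pseudonym_version,
    ]
    return mod + phi + scalars
```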

**Edge weights** from DCPGEdge:
```
w = 0.30*f_temporal + 0.30*f_semantic + 0.25*f_modality + 0.15*f_trust
```

Temporal and semantic similarity carry equal weight. Modality match matters less. Trust is a small correction term.
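
The formula above is straightforward to compute directly; a small sketch, assuming each similarity factor is already scaled to [0, 1]:

```python
def edge_weight(f_temporal, f_semantic, f_modality, f_trust):
    """Combine the four DCPGEdge similarity factors (each assumed in [0, 1])
    using the fixed weights from the formula above."""
    return (0.30 * f_temporal
            + 0.30 * f_semantic
            + 0.25 * f_modality
            + 0.15 * f_trust)

# A text/ASR pair that is close in time and content but differs in modality:
w = edge_weight(f_temporal=0.9, f_semantic=0.8, f_modality=0.0, f_trust=0.5)
# 0.27 + 0.24 + 0.0 + 0.075 ≈ 0.585
```

Because the factor weights sum to 1.0, the edge weight stays in [0, 1] whenever the inputs do.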

---

## Usage

```python
from dcpg_encoder import encode_patient

result = encode_patient(graph_summary)

result["patient_embedding"]  # List[float], dim=16, L2-normalized
result["node_embeddings"]    # Dict[node_id, List[float]]
result["risk_score"]         # float in [0, 1]
result["embed_dim"]          # 16
```

From a CRDT federated graph after a device merge:

```python
result = encode_patient(crdt_summary, source="crdt")
```

Batch encoding:

```python
from inference import predict_batch
results = predict_batch([summary_a, summary_b])
```

---

## Input

```json
{
  "nodes": [
    {
      "node_id": "patient_1::text::NAME_DATE_MRN_FACILITY",
      "modality": "text",
      "phi_type": "NAME_DATE_MRN_FACILITY",
      "risk_entropy": 0.72,
      "context_confidence": 0.9,
      "pseudonym_version": 1
    },
    {
      "node_id": "patient_1::asr::NAME_DATE_MRN",
      "modality": "asr",
      "phi_type": "NAME_DATE_MRN",
      "risk_entropy": 0.61,
      "context_confidence": 0.7,
      "pseudonym_version": 1
    }
  ],
  "edges": [
    {
      "source": "patient_1::text::NAME_DATE_MRN_FACILITY",
      "target": "patient_1::asr::NAME_DATE_MRN",
      "type": "co_occurrence",
      "weight": 0.71
    }
  ]
}
```

## Output

```json
{
  "patient_embedding": [0.0, 0.189, 0.0, 0.095, ...],
  "node_embeddings": {
    "patient_1::text::NAME_DATE_MRN_FACILITY": [0.0, 0.188, ...]
  },
  "risk_score": 0.429,
  "embed_dim": 16
}
```

---

## Where it fits in the pipeline

```
DCPGAdapter.graph_summary()
        |
DCPGEncoder.encode()
        |
    +---+----------------------+
    |                          |
patient_embedding          risk_score
    |                          |
PolicyNet              FedCRDT-Distill
(masking policy)       (federated merge)
```

The graph summary comes from `DCPGAdapter.graph_summary()` in the main system or from `CRDTGraph.summary()` when operating in a federated deployment where two edge devices have merged their graphs.

---

## Related

- [phi-exposure-guard](https://github.com/azithteja91/phi-exposure-guard): the full system this model is part of
- [dcpg-cross-modal-phi-risk-scorer](https://huggingface.co/vkatg/dcpg-cross-modal-phi-risk-scorer): single-event risk scorer, runs before graph construction
- [exposureguard-fedcrdt-distill](https://huggingface.co/vkatg/exposureguard-fedcrdt-distill): takes this model's risk score as input in federated settings
- [exposureguard-policynet](https://huggingface.co/vkatg/exposureguard-policynet): takes the patient embedding as input for policy decisions
- [multimodal-phi-masking-benchmark](https://huggingface.co/datasets/vkatg/multimodal-phi-masking-benchmark): 10,000 records across 5 modalities with PHI spans, masking decisions, and leakage scores
- [streaming-phi-deidentification-benchmark](https://huggingface.co/datasets/vkatg/streaming-phi-deidentification-benchmark): event-level adaptive masking traces

---

## Citation

```bibtex
@software{exposureguard_dcpg_encoder,
  title  = {ExposureGuard-DCPG-Encoder: Graph Attention Encoder for Cross-Modal PHI Exposure Graphs},
  author = {Ganti, Venkata Krishna Azith Teja},
  doi    = {10.5281/zenodo.18865882},
  url    = {https://huggingface.co/vkatg/exposureguard-dcpg-encoder},
  note   = {US Provisional Patent filed 2025-07-05}
}
```

MIT License. All development and testing used fully synthetic data.