---
license: mit
language:
- en
tags:
- phi
- de-identification
- healthcare
- privacy
- hipaa
- graph-neural-network
- gat
- clinical-nlp
- multimodal
- patient-risk
- embeddings
- re-identification
- ehr
- streaming
pipeline_tag: feature-extraction
library_name: generic
datasets:
- vkatg/streaming-phi-deidentification-benchmark
- vkatg/multimodal-phi-masking-benchmark
---

# ExposureGuard-DCPG-Encoder

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18865882.svg)](https://doi.org/10.5281/zenodo.18865882)

A PHI exposure graph is not a bag of records. It has structure. A patient's name appears in a clinical note. The same name appears in an ASR transcript 20 minutes later. A matching date shows up in an imaging header. A voice profile correlates with the ASR content. Each of these connections is an edge. Each modality is a node. The risk of re-identification depends on how that graph is connected, not on any single record in isolation.

This model encodes that graph.

---

## What it produces

A 16-dimensional patient embedding capturing the full cross-modal PHI exposure topology, plus a scalar risk score. Both come from a two-layer graph attention network that runs directly over the DCPG structure. No transformers, no external ML framework, no dependencies beyond Python stdlib. The whole thing is 22KB.

The embedding feeds downstream into PolicyNet for masking policy decisions and SynthRewrite-T5 for synthetic text generation. The risk score feeds into FedCRDT-Distill when operating in a federated setting.

---

## Why graph attention specifically

Standard PHI de-identification aggregates per-record features. This model treats the exposure history as a graph and runs attention over it, so nodes with high risk entropy carry more weight during pooling. A text node carrying a name, date, and MRN exerts more influence over the final embedding than a waveform node carrying only a timestamp. That weighting is learned from the graph structure, not hand-coded.

Cross-modal edges matter here. The attention mechanism propagates information across modality boundaries before pooling, so the final embedding reflects not just what each modality contains but how they link to each other.
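
As a rough sketch of the pooling step, the idea can be illustrated with a softmax over node `risk_entropy` values. Note this is an assumption for illustration: the card states only that pooling is weighted by risk entropy, not that softmax is the exact weighting function.

```python
import math

def attention_pool(node_embeddings, risk_entropies):
    """Pool per-node embeddings into one patient vector, softmax-weighted
    by each node's risk entropy so high-risk nodes dominate the result.
    (Softmax weighting is an illustrative assumption.)"""
    exps = [math.exp(r) for r in risk_entropies]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(node_embeddings[0])
    pooled = [sum(w * emb[i] for w, emb in zip(weights, node_embeddings))
              for i in range(dim)]
    # L2-normalize, matching the documented output contract
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]
```

With this weighting, a high-entropy text node pulls the pooled vector toward its own embedding far more than a low-entropy waveform node does.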

---

## Architecture

```
Input graph (nodes + edges)
        |
  Layer 1: GAT  [19 -> 32]
        |
  Layer 2: GAT  [32 -> 16]
        |
  Attention pool (weighted by risk_entropy)
        |
  patient_embedding [16]  +  risk_score [0,1]
```

**Node features (19 dims)**

| Group | Dims | Content |
|---|---|---|
| Modality one-hot | 8 | text, asr, image_proxy, waveform_proxy, audio_proxy, image_link, audio_link, unknown |
| PHI type one-hot | 8 | NAME_DATE_MRN_FACILITY, NAME_DATE_MRN, FACE_IMAGE, WAVEFORM_HEADER, VOICE, FACE_LINK, VOICE_LINK, unknown |
| Scalars | 3 | risk_entropy, context_confidence, pseudonym_version_norm |
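
A minimal sketch of assembling the 19-dim node feature vector from a node record like the ones in the Input section below. The `unknown` fallback slots and the normalization constant for `pseudonym_version` are assumptions; the actual encoder may normalize differently.

```python
MODALITIES = ["text", "asr", "image_proxy", "waveform_proxy",
              "audio_proxy", "image_link", "audio_link", "unknown"]
PHI_TYPES = ["NAME_DATE_MRN_FACILITY", "NAME_DATE_MRN", "FACE_IMAGE",
             "WAVEFORM_HEADER", "VOICE", "FACE_LINK", "VOICE_LINK", "unknown"]

def node_features(node, max_pseudonym_version=10):
    """Build the 19-dim vector: 8 modality one-hot + 8 PHI-type one-hot
    + 3 scalars. max_pseudonym_version is an assumed normalizer."""
    mod = [1.0 if node.get("modality") == m else 0.0 for m in MODALITIES]
    if sum(mod) == 0:
        mod[-1] = 1.0  # unrecognized modality falls into the 'unknown' slot
    phi = [1.0 if node.get("phi_type") == p else 0.0 for p in PHI_TYPES]
    if sum(phi) == 0:
        phi[-1] = 1.0
    scalars = [
        node.get("risk_entropy", 0.0),
        node.get("context_confidence", 0.0),
        node.get("pseudonym_version", 0) / max_pseudonym_version,
    ]
    return mod + phi + scalars
```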

**Edge weights** from DCPGEdge:
```
w = 0.30*f_temporal + 0.30*f_semantic + 0.25*f_modality + 0.15*f_trust
```

Temporal and semantic similarity carry equal weight. Modality match matters less. Trust is a small correction term.
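
The formula above is straightforward to compute directly; a small sketch, assuming each similarity factor is already scaled to [0, 1]:

```python
def edge_weight(f_temporal, f_semantic, f_modality, f_trust):
    """Combine the four DCPGEdge similarity factors (each assumed in [0, 1])
    using the fixed weights from the formula above."""
    return (0.30 * f_temporal
            + 0.30 * f_semantic
            + 0.25 * f_modality
            + 0.15 * f_trust)

# A text/ASR pair that is close in time and content but differs in modality:
w = edge_weight(f_temporal=0.9, f_semantic=0.8, f_modality=0.0, f_trust=0.5)
# 0.27 + 0.24 + 0.0 + 0.075 ≈ 0.585
```

Because the factor weights sum to 1.0, the edge weight stays in [0, 1] whenever the inputs do.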

---

## Usage

```python
from dcpg_encoder import encode_patient

result = encode_patient(graph_summary)

result["patient_embedding"]  # List[float], dim=16, L2-normalized
result["node_embeddings"]    # Dict[node_id, List[float]]
result["risk_score"]         # float in [0, 1]
result["embed_dim"]          # 16
```

From a CRDT federated graph after a device merge:

```python
result = encode_patient(crdt_summary, source="crdt")
```

Batch encoding:

```python
from inference import predict_batch
results = predict_batch([summary_a, summary_b])
```

---

## Input

```json
{
  "nodes": [
    {
      "node_id": "patient_1::text::NAME_DATE_MRN_FACILITY",
      "modality": "text",
      "phi_type": "NAME_DATE_MRN_FACILITY",
      "risk_entropy": 0.72,
      "context_confidence": 0.9,
      "pseudonym_version": 1
    },
    {
      "node_id": "patient_1::asr::NAME_DATE_MRN",
      "modality": "asr",
      "phi_type": "NAME_DATE_MRN",
      "risk_entropy": 0.61,
      "context_confidence": 0.7,
      "pseudonym_version": 1
    }
  ],
  "edges": [
    {
      "source": "patient_1::text::NAME_DATE_MRN_FACILITY",
      "target": "patient_1::asr::NAME_DATE_MRN",
      "type": "co_occurrence",
      "weight": 0.71
    }
  ]
}
```

## Output

```json
{
  "patient_embedding": [0.0, 0.189, 0.0, 0.095, ...],
  "node_embeddings": {
    "patient_1::text::NAME_DATE_MRN_FACILITY": [0.0, 0.188, ...]
  },
  "risk_score": 0.429,
  "embed_dim": 16
}
```

---

## Where it fits in the pipeline

```
DCPGAdapter.graph_summary()
        |
DCPGEncoder.encode()
        |
    +---+----------------------+
    |                          |
patient_embedding          risk_score
    |                          |
PolicyNet              FedCRDT-Distill
(masking policy)       (federated merge)
```

The graph summary comes from `DCPGAdapter.graph_summary()` in the main system or from `CRDTGraph.summary()` when operating in a federated deployment where two edge devices have merged their graphs.

---

## Related

- [phi-exposure-guard](https://github.com/azithteja91/phi-exposure-guard): the full system this model is part of
- [dcpg-cross-modal-phi-risk-scorer](https://huggingface.co/vkatg/dcpg-cross-modal-phi-risk-scorer): single-event risk scorer, runs before graph construction
- [exposureguard-fedcrdt-distill](https://huggingface.co/vkatg/exposureguard-fedcrdt-distill): takes this model's risk score as input in federated settings
- [exposureguard-policynet](https://huggingface.co/vkatg/exposureguard-policynet): takes the patient embedding as input for policy decisions
- [multimodal-phi-masking-benchmark](https://huggingface.co/datasets/vkatg/multimodal-phi-masking-benchmark): 10,000 records across 5 modalities with PHI spans, masking decisions, and leakage scores
- [streaming-phi-deidentification-benchmark](https://huggingface.co/datasets/vkatg/streaming-phi-deidentification-benchmark): event-level adaptive masking traces

---

## Citation

```bibtex
@software{exposureguard_dcpg_encoder,
  title  = {ExposureGuard-DCPG-Encoder: Graph Attention Encoder for Cross-Modal PHI Exposure Graphs},
  author = {Ganti, Venkata Krishna Azith Teja},
  doi    = {10.5281/zenodo.18865882},
  url    = {https://huggingface.co/vkatg/exposureguard-dcpg-encoder},
  note   = {US Provisional Patent filed 2025-07-05}
}
```

MIT License. All development and testing used fully synthetic data.