vkatg committed on
Commit
7e62f56
·
verified ·
1 Parent(s): f898c92

Update README.md

  1. README.md +99 -53
README.md CHANGED
@@ -14,92 +14,102 @@ tags:
14
  - multimodal
15
  - patient-risk
16
  - embeddings
 
 
 
17
  pipeline_tag: feature-extraction
18
  library_name: generic
19
  datasets:
20
  - vkatg/streaming-phi-deidentification-benchmark
 
21
  ---
22
 
23
  # ExposureGuard-DCPG-Encoder
24
 
25
- Graph attention encoder over the Dynamic Cross-modal PHI Graph (DCPG). Produces a fixed-dim patient embedding and risk score from a multi-modal PHI exposure graph.
26
 
27
- Part of the [ExposureGuard](https://huggingface.co/vkatg) ecosystem.
28
 
29
- ## Architecture
 
 
 
 
 
 
 
 
 
 
 
 
 
 
30
 
31
- Two-layer GAT with attention pooling. No external ML framework required: pure Python with no dependencies.
 
 
 
 
32
 
33
  ```
34
  Input graph (nodes + edges)
35
-
36
- Layer 1: GAT [18 → 32] (node features × edge weights)
37
-
38
- Layer 2: GAT [32 → 16]
39
-
40
  Attention pool (weighted by risk_entropy)
41
-
42
  patient_embedding [16] + risk_score [0,1]
43
  ```
44
 
45
- ### Node features (dim 18)
46
-
47
- | Group | Dim | Content |
48
- |-------|-----|---------|
49
- | modality one-hot | 8 | text, asr, image_proxy, waveform_proxy, audio_proxy, image_link, audio_link, unknown |
50
- | phi_type one-hot | 8 | NAME_DATE_MRN_FACILITY, NAME_DATE_MRN, FACE_IMAGE, WAVEFORM_HEADER, VOICE, FACE_LINK, VOICE_LINK, unknown |
51
- | scalars | 3 | risk_entropy, context_confidence, pseudonym_version_norm |
52
 
53
- ### Edge weights
54
-
55
- Inherited directly from DCPGEdge:
 
 
56
 
 
57
  ```
58
- w = 0.30·f_temporal + 0.30·f_semantic + 0.25·f_modality + 0.15·f_trust
59
  ```
60
 
 
 
 
 
61
  ## Usage
62
 
63
  ```python
64
- from dcpg_encoder import DCPGEncoder, encode_patient
65
 
66
- # graph_summary comes from DCPGAdapter.graph_summary() or CRDTGraph.summary()
67
  result = encode_patient(graph_summary)
68
 
69
- result["patient_embedding"] # List[float], dim=16, L2-normalized
70
- result["node_embeddings"] # Dict[node_id, List[float]]
71
- result["risk_score"] # float in [0, 1]
 
72
  ```
73
 
74
- ### From CRDT federated graph
75
 
76
  ```python
77
  result = encode_patient(crdt_summary, source="crdt")
78
  ```
79
 
80
- ### Batch
81
 
82
  ```python
83
  from inference import predict_batch
84
  results = predict_batch([summary_a, summary_b])
85
  ```
86
 
87
- ## Integration with ExposureGuard ecosystem
88
-
89
- ```
90
- DCPGAdapter.graph_summary()
91
-
92
- DCPGEncoder.encode() ← this model
93
-
94
- ┌───┴──────────────────┐
95
- │ │
96
- patient_embedding risk_score
97
- │ │
98
- PolicyNet SynthRewrite-T5
99
- (vkatg/exposureguard-policynet)
100
- ```
101
 
102
- ## Input format
103
 
104
  ```json
105
  {
@@ -111,6 +121,14 @@ PolicyNet SynthRewrite-T5
111
  "risk_entropy": 0.72,
112
  "context_confidence": 0.9,
113
  "pseudonym_version": 1
 
 
 
 
 
 
 
 
114
  }
115
  ],
116
  "edges": [
@@ -124,11 +142,11 @@ PolicyNet SynthRewrite-T5
124
  }
125
  ```
126
 
127
- ## Output format
128
 
129
  ```json
130
  {
131
- "patient_embedding": [0.0, 0.189, 0.0, ...],
132
  "node_embeddings": {
133
  "patient_1::text::NAME_DATE_MRN_FACILITY": [0.0, 0.188, ...]
134
  },
@@ -137,20 +155,48 @@ PolicyNet SynthRewrite-T5
137
  }
138
  ```
139
 
140
- ## Related models
141
 
142
- - [vkatg/dcpg-cross-modal-phi-risk-scorer](https://huggingface.co/vkatg/dcpg-cross-modal-phi-risk-scorer)
143
- - [vkatg/exposureguard-policynet](https://huggingface.co/vkatg/exposureguard-policynet)
144
- - [vkatg/exposureguard-synthrewrite-t5](https://huggingface.co/vkatg/exposureguard-synthrewrite-t5)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
145
 
146
  ## Citation
147
 
148
  ```bibtex
149
- @misc{exposureguard2025,
150
- title = {ExposureGuard: Cross-Modal PHI Re-identification Risk Scoring via Dynamic Graph Attention},
151
- author = {[Your Name]},
152
- year = {2025},
153
  url = {https://huggingface.co/vkatg/exposureguard-dcpg-encoder},
154
  note = {US Provisional Patent filed 2025-07-05}
155
  }
156
  ```
 
 
 
14
  - multimodal
15
  - patient-risk
16
  - embeddings
17
+ - re-identification
18
+ - ehr
19
+ - streaming
20
  pipeline_tag: feature-extraction
21
  library_name: generic
22
  datasets:
23
  - vkatg/streaming-phi-deidentification-benchmark
24
+ - vkatg/multimodal-phi-masking-benchmark
25
  ---
26
 
27
  # ExposureGuard-DCPG-Encoder
28
 
29
+ [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18865882.svg)](https://doi.org/10.5281/zenodo.18865882)
30
 
31
+ A PHI exposure graph is not a bag of records. It has structure. A patient's name appears in a clinical note. The same name appears in an ASR transcript 20 minutes later. A matching date shows up in an imaging header. A voice profile correlates with the ASR content. Each of these connections is an edge. Each modality is a node. The risk of re-identification depends on how that graph is connected, not on any single record in isolation.
32
 
33
+ This model encodes that graph.
34
+
35
+ ---
36
+
37
+ ## What it produces
38
+
39
+ A 16-dimensional, L2-normalized patient embedding that captures the cross-modal PHI exposure topology, plus a scalar risk score in [0, 1]. Both come from a two-layer graph attention network that runs directly over the DCPG structure. It uses no transformers and no external ML framework, and depends only on the Python standard library. The whole thing is 22 KB.
40
+
41
+ The embedding feeds downstream into PolicyNet for masking policy decisions and SynthRewrite-T5 for synthetic text generation. The risk score feeds into FedCRDT-Distill when operating in a federated setting.
42
+
43
+ ---
44
+
45
+ ## Why graph attention specifically
46
+
47
+ Standard PHI de-identification aggregates per-record features. This model treats the exposure history as a graph and runs attention over it, so nodes with high risk entropy carry more weight during pooling. A text node carrying a name, date, and MRN gets more influence over the final embedding than a waveform node carrying only a timestamp. That weighting is learned from the graph structure, not hand-coded.
48
 
49
+ Cross-modal edges matter here. The attention mechanism propagates information across modality boundaries before pooling, so the final embedding reflects not just what each modality contains but how they link to each other.
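To make the pooling idea concrete, here is a minimal sketch of attention pooling weighted by risk entropy. It assumes a softmax over `risk_entropy` as the scoring rule and a final L2 normalization; the function name `attention_pool` and the two-dimensional toy vectors are illustrative only, and the model's actual pooling may differ.

```python
import math


def attention_pool(node_embeddings, risk_entropy):
    """Pool per-node embeddings into one L2-normalized vector.

    node_embeddings: {node_id: [float, ...]} with equal-length vectors.
    risk_entropy:    {node_id: float}, used here as the attention logit
                     (an assumption, not the model's confirmed rule).
    """
    node_ids = list(node_embeddings)
    # Softmax over risk entropy, shifted by the maximum for stability.
    top = max(risk_entropy[n] for n in node_ids)
    exp_scores = [math.exp(risk_entropy[n] - top) for n in node_ids]
    total = sum(exp_scores)
    weights = [s / total for s in exp_scores]

    # Weighted sum of node embeddings.
    dim = len(node_embeddings[node_ids[0]])
    pooled = [0.0] * dim
    for n, w in zip(node_ids, weights):
        for k, value in enumerate(node_embeddings[n]):
            pooled[k] += w * value

    # L2-normalize so embeddings are comparable across patients.
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm > 0 else pooled


# Toy example: the high-entropy text node dominates the pooled vector.
embeddings = {"text": [1.0, 0.0], "waveform": [0.0, 1.0]}
entropy = {"text": 0.72, "waveform": 0.10}
pooled = attention_pool(embeddings, entropy)
```

In this toy case the first component of `pooled` is larger than the second, because the text node has the higher risk entropy.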
50
+
51
+ ---
52
+
53
+ ## Architecture
54
 
55
  ```
56
  Input graph (nodes + edges)
57
+ |
58
+ Layer 1: GAT [19 -> 32]
59
+ |
60
+ Layer 2: GAT [32 -> 16]
61
+ |
62
  Attention pool (weighted by risk_entropy)
63
+ |
64
  patient_embedding [16] + risk_score [0,1]
65
  ```
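The diagram can be read alongside a rough single-head GAT layer in plain Python. This is a sketch under stated assumptions, not the shipped implementation: the way edge weights enter the attention (scaling the unnormalized score), the self-loops, the LeakyReLU slope, and the random parameters are all illustrative choices, and `gat_layer`, `rand_matrix`, and `rand_vector` are hypothetical names.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gat_layer(features, edges, W, a_src, a_dst, slope=0.2):
    """One single-head graph attention layer.

    features: {node_id: [float] * in_dim}
    edges:    [(src, dst, weight)], treated as undirected
    W:        out_dim x in_dim projection matrix
    a_src, a_dst: attention vectors of length out_dim
    """
    # Project every node into the output space.
    h = {n: [dot(row, x) for row in W] for n, x in features.items()}

    # Neighbourhoods include a self-loop with weight 1.0.
    nbrs = {n: {n: 1.0} for n in features}
    for src, dst, w in edges:
        nbrs[src][dst] = w
        nbrs[dst][src] = w

    out = {}
    for i in h:
        logits = {}
        for j in nbrs[i]:
            e = dot(a_src, h[i]) + dot(a_dst, h[j])
            logits[j] = e if e > 0 else slope * e  # LeakyReLU
        top = max(logits.values())
        # Assumption: the edge weight scales the unnormalized attention.
        scores = {j: nbrs[i][j] * math.exp(logits[j] - top) for j in logits}
        total = sum(scores.values())
        agg = [0.0] * len(W)
        for j, s in scores.items():
            for k, value in enumerate(h[j]):
                agg[k] += (s / total) * value
        out[i] = [max(0.0, v) for v in agg]  # ReLU
    return out


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def rand_vector(n):
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]


# Two stacked layers with the dimensions from the diagram: 19 -> 32 -> 16.
features = {n: [rng.random() for _ in range(19)] for n in ("a", "b")}
edges = [("a", "b", 0.6)]
h1 = gat_layer(features, edges, rand_matrix(32, 19), rand_vector(32), rand_vector(32))
h2 = gat_layer(h1, edges, rand_matrix(16, 32), rand_vector(16), rand_vector(16))
```

Each node ends up with a 16-dimensional vector in `h2`, which is the shape the pooling step consumes.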
66
 
67
+ **Node features (19 dims)**
 
 
 
 
 
 
68
 
69
+ | Group | Dims | Content |
70
+ |---|---|---|
71
+ | Modality one-hot | 8 | text, asr, image_proxy, waveform_proxy, audio_proxy, image_link, audio_link, unknown |
72
+ | PHI type one-hot | 8 | NAME_DATE_MRN_FACILITY, NAME_DATE_MRN, FACE_IMAGE, WAVEFORM_HEADER, VOICE, FACE_LINK, VOICE_LINK, unknown |
73
+ | Scalars | 3 | risk_entropy, context_confidence, pseudonym_version_norm |
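As an illustration of how the 19 dimensions add up (8 + 8 + 3), the sketch below builds a feature vector from one node record. The feature ordering follows the table, which is an assumption, as are the fallback to `unknown` for unseen values and the normalization of `pseudonym_version` by a hypothetical cap `MAX_PSEUDONYM_VERSION`.

```python
MODALITIES = [
    "text", "asr", "image_proxy", "waveform_proxy",
    "audio_proxy", "image_link", "audio_link", "unknown",
]
PHI_TYPES = [
    "NAME_DATE_MRN_FACILITY", "NAME_DATE_MRN", "FACE_IMAGE", "WAVEFORM_HEADER",
    "VOICE", "FACE_LINK", "VOICE_LINK", "unknown",
]
MAX_PSEUDONYM_VERSION = 10  # hypothetical cap, not taken from the model


def one_hot(value, vocabulary):
    # Unseen values fall back to the "unknown" slot (assumed behaviour).
    index = vocabulary.index(value if value in vocabulary else "unknown")
    return [1.0 if k == index else 0.0 for k in range(len(vocabulary))]


def node_features(node):
    """Return the 19-dim vector: modality (8) + PHI type (8) + scalars (3)."""
    version = min(node["pseudonym_version"], MAX_PSEUDONYM_VERSION)
    scalars = [
        float(node["risk_entropy"]),
        float(node["context_confidence"]),
        version / MAX_PSEUDONYM_VERSION,
    ]
    return one_hot(node["modality"], MODALITIES) + one_hot(node["phi_type"], PHI_TYPES) + scalars


node = {
    "node_id": "patient_1::asr::NAME_DATE_MRN",
    "modality": "asr",
    "phi_type": "NAME_DATE_MRN",
    "risk_entropy": 0.61,
    "context_confidence": 0.7,
    "pseudonym_version": 1,
}
vector = node_features(node)
```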
74
 
75
+ **Edge weights** from DCPGEdge:
76
  ```
77
+ w = 0.30*f_temporal + 0.30*f_semantic + 0.25*f_modality + 0.15*f_trust
78
  ```
79
 
80
+ Temporal and semantic similarity carry equal weight (0.30 each). Modality match matters less (0.25). Trust is a small correction term (0.15).
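The formula translates directly into a small helper. The sketch assumes each factor is already scaled to [0, 1], which the README does not state; under that assumption the result also stays in [0, 1], since the four coefficients sum to 1. The name `edge_weight` is illustrative.

```python
def edge_weight(f_temporal, f_semantic, f_modality, f_trust):
    """Combine the four edge factors with the coefficients from the formula."""
    return (
        0.30 * f_temporal
        + 0.30 * f_semantic
        + 0.25 * f_modality
        + 0.15 * f_trust
    )


# Same time window, partial semantic match, different modality, trusted source.
w = edge_weight(f_temporal=1.0, f_semantic=0.5, f_modality=0.0, f_trust=1.0)
```

Here `w` works out to 0.30 + 0.15 + 0.00 + 0.15 = 0.60.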
81
+
82
+ ---
83
+
84
  ## Usage
85
 
86
  ```python
87
+ from dcpg_encoder import encode_patient
88
 
 
89
  result = encode_patient(graph_summary)
90
 
91
+ result["patient_embedding"] # List[float], dim=16, L2-normalized
92
+ result["node_embeddings"] # Dict[node_id, List[float]]
93
+ result["risk_score"] # float in [0, 1]
94
+ result["embed_dim"] # 16
95
  ```
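Because the patient embedding is L2-normalized, comparing two patients reduces to a dot product. The sketch below uses stand-in result dictionaries with two-dimensional vectors for brevity instead of calling `encode_patient`; real embeddings have 16 dimensions, and `cosine_similarity` is a hypothetical helper.

```python
def cosine_similarity(u, v):
    # For L2-normalized vectors the dot product equals the cosine similarity.
    return sum(a * b for a, b in zip(u, v))


# Stand-ins for two encode_patient(...) results (shortened to 2 dims).
result_a = {"patient_embedding": [0.6, 0.8], "risk_score": 0.7}
result_b = {"patient_embedding": [0.8, 0.6], "risk_score": 0.4}

similarity = cosine_similarity(
    result_a["patient_embedding"], result_b["patient_embedding"]
)
```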
96
 
97
+ From a CRDT federated graph after a device merge:
98
 
99
  ```python
100
  result = encode_patient(crdt_summary, source="crdt")
101
  ```
102
 
103
+ Batch encoding:
104
 
105
  ```python
106
  from inference import predict_batch
107
  results = predict_batch([summary_a, summary_b])
108
  ```
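One possible way to consume batch results is to rank patients by risk score. This sketch assumes `predict_batch` returns results in input order, which the README does not confirm; `rank_by_risk` and the 0.5 threshold are hypothetical.

```python
def rank_by_risk(patient_ids, results, threshold=0.5):
    """Return (patient_id, risk_score) pairs at or above threshold, highest first."""
    paired = sorted(
        zip(patient_ids, results),
        key=lambda pair: pair[1]["risk_score"],
        reverse=True,
    )
    return [(pid, r["risk_score"]) for pid, r in paired if r["risk_score"] >= threshold]


# Stand-ins for predict_batch([...]) output.
results = [{"risk_score": 0.2}, {"risk_score": 0.9}, {"risk_score": 0.6}]
flagged = rank_by_risk(["p1", "p2", "p3"], results)
```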
109
 
110
+ ---
 
 
 
 
 
 
 
 
 
 
 
 
 
111
 
112
+ ## Input
113
 
114
  ```json
115
  {
 
121
  "risk_entropy": 0.72,
122
  "context_confidence": 0.9,
123
  "pseudonym_version": 1
124
+ },
125
+ {
126
+ "node_id": "patient_1::asr::NAME_DATE_MRN",
127
+ "modality": "asr",
128
+ "phi_type": "NAME_DATE_MRN",
129
+ "risk_entropy": 0.61,
130
+ "context_confidence": 0.7,
131
+ "pseudonym_version": 1
132
  }
133
  ],
134
  "edges": [
 
142
  }
143
  ```
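A quick structural check before encoding can catch malformed node records. The sketch below validates only the node fields visible in the example above. The top-level key name `nodes` and the [0, 1] range for `risk_entropy` and `context_confidence` are assumptions, and edge fields are left unchecked because they are not shown here.

```python
NODE_FIELDS = {
    "node_id": str,
    "modality": str,
    "phi_type": str,
    "risk_entropy": (int, float),
    "context_confidence": (int, float),
    "pseudonym_version": int,
}


def validate_nodes(summary, nodes_key="nodes"):
    """Return a list of human-readable problems; empty means the nodes look fine."""
    problems = []
    for index, node in enumerate(summary.get(nodes_key, [])):
        for field, expected in NODE_FIELDS.items():
            if field not in node:
                problems.append(f"node {index}: missing {field}")
            elif not isinstance(node[field], expected):
                problems.append(f"node {index}: {field} has unexpected type")
        # Assumed range check for the two scalar scores.
        for field in ("risk_entropy", "context_confidence"):
            value = node.get(field)
            if isinstance(value, (int, float)) and not 0.0 <= value <= 1.0:
                problems.append(f"node {index}: {field} outside [0, 1]")
    return problems


summary = {
    "nodes": [
        {
            "node_id": "patient_1::asr::NAME_DATE_MRN",
            "modality": "asr",
            "phi_type": "NAME_DATE_MRN",
            "risk_entropy": 0.61,
            "context_confidence": 0.7,
            "pseudonym_version": 1,
        }
    ]
}
problems = validate_nodes(summary)
```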
144
 
145
+ ## Output
146
 
147
  ```json
148
  {
149
+ "patient_embedding": [0.0, 0.189, 0.0, 0.095, ...],
150
  "node_embeddings": {
151
  "patient_1::text::NAME_DATE_MRN_FACILITY": [0.0, 0.188, ...]
152
  },
 
155
  }
156
  ```
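The documented output properties (16 dimensions, L2-normalized embedding, risk score in [0, 1]) can be asserted with a small check. `check_result` is a hypothetical helper, and the tolerance is an arbitrary choice.

```python
import math


def check_result(result, embed_dim=16, tol=1e-6):
    """Return True if the result matches the documented shape and ranges."""
    embedding = result["patient_embedding"]
    norm = math.sqrt(sum(v * v for v in embedding))
    return (
        len(embedding) == embed_dim
        and abs(norm - 1.0) <= tol
        and 0.0 <= result["risk_score"] <= 1.0
    )


# Stand-in for an encode_patient(...) result.
example = {"patient_embedding": [1.0] + [0.0] * 15, "risk_score": 0.5}
ok = check_result(example)
```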
157
 
158
+ ---
159
 
160
+ ## Where it fits in the pipeline
161
+
162
+ ```
163
+ DCPGAdapter.graph_summary()
164
+ |
165
+ DCPGEncoder.encode()
166
+ |
167
+ +---+----------------------+
168
+ | |
169
+ patient_embedding risk_score
170
+ | |
171
+ PolicyNet FedCRDT-Distill
172
+ (masking policy) (federated merge)
173
+ ```
174
+
175
+ The graph summary comes from `DCPGAdapter.graph_summary()` in the main system or from `CRDTGraph.summary()` when operating in a federated deployment where two edge devices have merged their graphs.
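The choice between the two summary sources might be wrapped as follows. This is a sketch only: `encode_any` is a hypothetical helper, the encoder is passed in as a callable so the block stays self-contained, and the only calls taken from this README are `graph_summary()`, `summary()`, and the `source="crdt"` argument.

```python
def encode_any(encode_patient, adapter=None, crdt_graph=None):
    """Encode from a CRDT graph if one is given, otherwise from the adapter."""
    if crdt_graph is not None:
        # Federated deployment: merged graph from edge devices.
        return encode_patient(crdt_graph.summary(), source="crdt")
    if adapter is not None:
        # Main system: summary straight from the DCPG adapter.
        return encode_patient(adapter.graph_summary())
    raise ValueError("provide an adapter or a CRDT graph")
```

In a real deployment, `encode_patient` would be the function imported from `dcpg_encoder`.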
176
+
177
+ ---
178
+
179
+ ## Related
180
+
181
+ - [phi-exposure-guard](https://github.com/azithteja91/phi-exposure-guard): the full system this model is part of
182
+ - [dcpg-cross-modal-phi-risk-scorer](https://huggingface.co/vkatg/dcpg-cross-modal-phi-risk-scorer): single-event risk scorer, runs before graph construction
183
+ - [exposureguard-fedcrdt-distill](https://huggingface.co/vkatg/exposureguard-fedcrdt-distill): takes this model's risk score as input in federated settings
184
+ - [exposureguard-policynet](https://huggingface.co/vkatg/exposureguard-policynet): takes the patient embedding as input for policy decisions
185
+ - [multimodal-phi-masking-benchmark](https://huggingface.co/datasets/vkatg/multimodal-phi-masking-benchmark): 10,000 records across 5 modalities with PHI spans, masking decisions, and leakage scores
186
+ - [streaming-phi-deidentification-benchmark](https://huggingface.co/datasets/vkatg/streaming-phi-deidentification-benchmark): event-level adaptive masking traces
187
+
188
+ ---
189
 
190
  ## Citation
191
 
192
  ```bibtex
193
+ @software{exposureguard_dcpg_encoder,
194
+ title = {ExposureGuard-DCPG-Encoder: Graph Attention Encoder for Cross-Modal PHI Exposure Graphs},
195
+ author = {Ganti, Venkata Krishna Azith Teja},
196
+ doi = {10.5281/zenodo.18865882},
197
  url = {https://huggingface.co/vkatg/exposureguard-dcpg-encoder},
198
  note = {US Provisional Patent filed 2025-07-05}
199
  }
200
  ```
201
+
202
+ MIT License. All development and testing used fully synthetic data.