Update README.md

README.md +99 -53

@@ -14,92 +14,102 @@ tags:
 - multimodal
 - patient-risk
 - embeddings
 pipeline_tag: feature-extraction
 library_name: generic
 datasets:
 - vkatg/streaming-phi-deidentification-benchmark
 ---
 # ExposureGuard-DCPG-Encoder
-Graph attention encoder over the Dynamic Cross-modal PHI Graph (DCPG). Produces a fixed-dim patient embedding and risk score from a multi-modal PHI exposure graph.
-Part of the [ExposureGuard](https://huggingface.co/vkatg) ecosystem.
-## Architecture
-Two-layer GAT with attention pooling. No external ML framework required — pure Python with no dependencies.
 ```
 Input graph (nodes + edges)
-      │
-  Layer 1: GAT  [18 → 32]   (node features × edge weights)
-      │
-  Layer 2: GAT  [32 → 16]
-      │
   Attention pool (weighted by risk_entropy)
-      │
   patient_embedding [16]  +  risk_score [0,1]
 ```
-### Node features (dim 18)
-| Group | Dim | Content |
-|-------|-----|---------|
-| modality one-hot | 8 | text, asr, image_proxy, waveform_proxy, audio_proxy, image_link, audio_link, unknown |
-| phi_type one-hot | 8 | NAME_DATE_MRN_FACILITY, NAME_DATE_MRN, FACE_IMAGE, WAVEFORM_HEADER, VOICE, FACE_LINK, VOICE_LINK, unknown |
-| scalars | 3 | risk_entropy, context_confidence, pseudonym_version_norm |
-### Edge weights
-Inherited directly from DCPGEdge:
 ```
-w = 0.30·f_temporal + 0.30·f_semantic + 0.25·f_modality + 0.15·f_trust
 ```
 ## Usage
 ```python
-from dcpg_encoder import DCPGEncoder, encode_patient
-# graph_summary comes from DCPGAdapter.graph_summary() or CRDTGraph.summary()
 result = encode_patient(graph_summary)
-result["patient_embedding"]   # List[float], dim=16, L2-normalized
-result["node_embeddings"]     # Dict[node_id, List[float]]
-result["risk_score"]          # float in [0, 1]
 ```
-### From CRDT federated graph
 ```python
 result = encode_patient(crdt_summary, source="crdt")
 ```
-### Batch
 ```python
 from inference import predict_batch
 results = predict_batch([summary_a, summary_b])
 ```
-## Integration with ExposureGuard ecosystem
-```
-DCPGAdapter.graph_summary()
-        │
-DCPGEncoder.encode()         ← this model
-        │
-    ┌───┴──────────────────┐
-    │                      │
-patient_embedding     risk_score
-    │                      │
-PolicyNet           SynthRewrite-T5
-(vkatg/exposureguard-policynet)
-```
-## Input format
 ```json
 {
@@ -111,6 +121,14 @@ PolicyNet           SynthRewrite-T5
       "risk_entropy": 0.72,
       "context_confidence": 0.9,
       "pseudonym_version": 1
     }
   ],
   "edges": [
@@ -124,11 +142,11 @@ PolicyNet           SynthRewrite-T5
 }
 ```
-## Output format
 ```json
 {
-  "patient_embedding": [0.0, 0.189, 0.0, ...],
   "node_embeddings": {
     "patient_1::text::NAME_DATE_MRN_FACILITY": [0.0, 0.188, ...]
   },
@@ -137,20 +155,48 @@ PolicyNet           SynthRewrite-T5
 }
 ```
-## Related models
-- [vkatg/dcpg-cross-modal-phi-risk-scorer](https://huggingface.co/vkatg/dcpg-cross-modal-phi-risk-scorer)
-- [vkatg/exposureguard-policynet](https://huggingface.co/vkatg/exposureguard-policynet)
-- [vkatg/exposureguard-synthrewrite-t5](https://huggingface.co/vkatg/exposureguard-synthrewrite-t5)
 ## Citation
 ```bibtex
-@misc{exposureguard2025,
-  title  = {ExposureGuard: Cross-Modal PHI Re-identification Risk Scoring via Dynamic Graph Attention},
-  author = {[Your Name]},
-  year   = {2025},
   url    = {https://huggingface.co/vkatg/exposureguard-dcpg-encoder},
   note   = {US Provisional Patent filed 2025-07-05}
 }
 ```

 - multimodal
 - patient-risk
 - embeddings
+- re-identification
+- ehr
+- streaming
 pipeline_tag: feature-extraction
 library_name: generic
 datasets:
 - vkatg/streaming-phi-deidentification-benchmark
+- vkatg/multimodal-phi-masking-benchmark
 ---
 # ExposureGuard-DCPG-Encoder
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18865882.svg)](https://doi.org/10.5281/zenodo.18865882)
+A PHI exposure graph is not a bag of records. It has structure. A patient's name appears in a clinical note. The same name appears in an ASR transcript 20 minutes later. A matching date shows up in an imaging header. A voice profile correlates with the ASR content. Each of these connections is an edge. Each modality is a node. The risk of re-identification depends on how that graph is connected, not on any single record in isolation.
+This model encodes that graph.
+---
+## What it produces
+A 16-dimensional patient embedding capturing the full cross-modal PHI exposure topology, plus a scalar risk score in [0, 1]. Both come from a two-layer graph attention network that runs directly over the DCPG structure. No transformers, no external ML framework, no dependencies beyond the Python standard library. The whole thing is 22 KB.
+The embedding feeds downstream into PolicyNet for masking policy decisions and SynthRewrite-T5 for synthetic text generation. The risk score feeds into FedCRDT-Distill when operating in a federated setting.
+---
+## Why graph attention specifically
+Standard PHI de-identification aggregates per-record features. This model instead treats the exposure history as a graph and runs attention over it, so nodes with high risk entropy carry more weight during pooling. A text node carrying a name, date, and MRN has more influence on the final embedding than a waveform node carrying only a timestamp. That weighting is learned from the graph structure, not hand-coded.
+Cross-modal edges matter here. The attention mechanism propagates information across modality boundaries before pooling, so the final embedding reflects not just what each modality contains but how they link to each other.
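As a rough illustration of entropy-weighted pooling, the sketch below turns each node's `risk_entropy` into a softmax weight, takes the weighted sum of the node embeddings, and L2-normalizes the result. The helper name `attention_pool` and the softmax form are assumptions made for illustration; the encoder's actual pooling rule may differ.

```python
import math
from typing import List


def attention_pool(node_embeddings: List[List[float]],
                   risk_entropies: List[float]) -> List[float]:
    """Pool node embeddings with softmax weights derived from risk_entropy.

    Hypothetical helper: the softmax weighting is an assumption, not the
    model's confirmed pooling rule.
    """
    if not node_embeddings or len(node_embeddings) != len(risk_entropies):
        raise ValueError("need one risk_entropy per node embedding")

    # Softmax over risk_entropy, shifted by the max for numerical stability.
    peak = max(risk_entropies)
    exps = [math.exp(r - peak) for r in risk_entropies]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of node embeddings, one dimension at a time.
    dim = len(node_embeddings[0])
    pooled = [sum(w * emb[i] for w, emb in zip(weights, node_embeddings))
              for i in range(dim)]

    # L2-normalize so the pooled vector has unit length.
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled] if norm > 0 else pooled
```

With two orthogonal node embeddings and entropies of 0.9 and 0.1, the pooled vector leans toward the first node, which is the behavior described above.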
+---
+## Architecture
 ```
 Input graph (nodes + edges)
+        |
+  Layer 1: GAT  [19 -> 32]
+        |
+  Layer 2: GAT  [32 -> 16]
+        |
   Attention pool (weighted by risk_entropy)
+        |
   patient_embedding [16]  +  risk_score [0,1]
 ```
+**Node features (19 dims)**
+| Group | Dims | Content |
+|---|---|---|
+| Modality one-hot | 8 | text, asr, image_proxy, waveform_proxy, audio_proxy, image_link, audio_link, unknown |
+| PHI type one-hot | 8 | NAME_DATE_MRN_FACILITY, NAME_DATE_MRN, FACE_IMAGE, WAVEFORM_HEADER, VOICE, FACE_LINK, VOICE_LINK, unknown |
+| Scalars | 3 | risk_entropy, context_confidence, pseudonym_version_norm |
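A minimal sketch of how one node could be turned into this 19-dim vector, assuming the slots follow the table order (modality, then PHI type, then scalars). The helper names, the fallback to the `unknown` slot, and the `MAX_PSEUDONYM_VERSION` constant used for `pseudonym_version_norm` are all assumptions; the model's own normalization is not documented here.

```python
from typing import Dict, List

MODALITIES = ["text", "asr", "image_proxy", "waveform_proxy",
              "audio_proxy", "image_link", "audio_link", "unknown"]
PHI_TYPES = ["NAME_DATE_MRN_FACILITY", "NAME_DATE_MRN", "FACE_IMAGE",
             "WAVEFORM_HEADER", "VOICE", "FACE_LINK", "VOICE_LINK", "unknown"]
MAX_PSEUDONYM_VERSION = 10  # assumed normalization constant, not from the model


def one_hot(value: str, vocabulary: List[str]) -> List[float]:
    """One-hot encode value, falling back to the trailing 'unknown' slot."""
    index = vocabulary.index(value) if value in vocabulary else len(vocabulary) - 1
    return [1.0 if i == index else 0.0 for i in range(len(vocabulary))]


def node_features(node: Dict) -> List[float]:
    """Build the 19-dim vector: 8 modality + 8 PHI type + 3 scalars."""
    version_norm = min(node["pseudonym_version"] / MAX_PSEUDONYM_VERSION, 1.0)
    return (one_hot(node["modality"], MODALITIES)
            + one_hot(node["phi_type"], PHI_TYPES)
            + [node["risk_entropy"], node["context_confidence"], version_norm])


# An ASR node with the same fields as the nodes in the input format.
features = node_features({
    "modality": "asr",
    "phi_type": "NAME_DATE_MRN",
    "risk_entropy": 0.61,
    "context_confidence": 0.7,
    "pseudonym_version": 1,
})
```

The length check `len(features) == 19` is a quick way to confirm that 8 + 8 + 3 matches the first GAT layer's input width.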
+**Edge weights** from DCPGEdge:
 ```
+w = 0.30*f_temporal + 0.30*f_semantic + 0.25*f_modality + 0.15*f_trust
 ```
+Temporal and semantic similarity carry equal weight. Modality match matters less. Trust is a small correction term.
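The formula is a plain weighted sum whose coefficients add up to 1.0. The sketch below writes it as a function; the name `edge_weight` and the assumption that each factor lies in [0, 1] are illustrative and not taken from DCPGEdge itself.

```python
def edge_weight(f_temporal: float, f_semantic: float,
                f_modality: float, f_trust: float) -> float:
    """Combine the four edge factors using the coefficients given above."""
    return (0.30 * f_temporal + 0.30 * f_semantic
            + 0.25 * f_modality + 0.15 * f_trust)


# Hypothetical factors: close in time, moderately similar text,
# same modality, medium trust.
w = round(edge_weight(0.8, 0.5, 1.0, 0.6), 2)  # 0.73
```

If every factor is 1.0 the weight is 1.0 (up to floating-point rounding), so under the [0, 1] assumption the edge weight stays in [0, 1] as well.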
+---
 ## Usage
 ```python
+from dcpg_encoder import encode_patient
 result = encode_patient(graph_summary)
+result["patient_embedding"]  # List[float], dim=16, L2-normalized
+result["node_embeddings"]    # Dict[node_id, List[float]]
+result["risk_score"]         # float in [0, 1]
+result["embed_dim"]          # 16
 ```
+From a CRDT federated graph after a device merge:
 ```python
 result = encode_patient(crdt_summary, source="crdt")
 ```
+Batch encoding:
 ```python
 from inference import predict_batch
 results = predict_batch([summary_a, summary_b])
 ```
+---
+## Input
 ```json
 {
       "risk_entropy": 0.72,
       "context_confidence": 0.9,
       "pseudonym_version": 1
+    },
+    {
+      "node_id": "patient_1::asr::NAME_DATE_MRN",
+      "modality": "asr",
+      "phi_type": "NAME_DATE_MRN",
+      "risk_entropy": 0.61,
+      "context_confidence": 0.7,
+      "pseudonym_version": 1
     }
   ],
   "edges": [
 }
 ```
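Before encoding, it can help to check each node against the fields shown in this example. The sketch below is a hypothetical validator, not part of `dcpg_encoder`: the required-field list and the `<patient>::<modality>::<phi_type>` reading of `node_id` are inferred from the sample nodes, and the [0, 1] range check is an assumption.

```python
from typing import Dict, List

REQUIRED_NODE_FIELDS = ("node_id", "modality", "phi_type", "risk_entropy",
                        "context_confidence", "pseudonym_version")


def validate_node(node: Dict) -> List[str]:
    """Return a list of problems found in one node; empty means it passed."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_NODE_FIELDS if name not in node]
    if problems:
        return problems

    # node_id in the examples reads <patient>::<modality>::<phi_type>.
    parts = node["node_id"].split("::")
    if len(parts) != 3:
        problems.append("node_id should have three '::'-separated parts")
    elif parts[1:] != [node["modality"], node["phi_type"]]:
        problems.append("node_id does not match modality/phi_type")

    # Assumed range check; the example values all fall in [0, 1].
    for name in ("risk_entropy", "context_confidence"):
        if not 0.0 <= node[name] <= 1.0:
            problems.append(f"{name} outside [0, 1]")
    return problems


asr_node = {
    "node_id": "patient_1::asr::NAME_DATE_MRN",
    "modality": "asr",
    "phi_type": "NAME_DATE_MRN",
    "risk_entropy": 0.61,
    "context_confidence": 0.7,
    "pseudonym_version": 1,
}
problems = validate_node(asr_node)  # empty list for this node
```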
+## Output
 ```json
 {
+  "patient_embedding": [0.0, 0.189, 0.0, 0.095, ...],
   "node_embeddings": {
     "patient_1::text::NAME_DATE_MRN_FACILITY": [0.0, 0.188, ...]
   },
 }
 ```
+---
+## Where it fits in the pipeline
+```
+DCPGAdapter.graph_summary()
+        |
+DCPGEncoder.encode()
+        |
+    +---+----------------------+
+    |                          |
+patient_embedding          risk_score
+    |                          |
+PolicyNet              FedCRDT-Distill
+(masking policy)       (federated merge)
+```
+The graph summary comes from `DCPGAdapter.graph_summary()` in the main system or from `CRDTGraph.summary()` when operating in a federated deployment where two edge devices have merged their graphs.
+---
+## Related
+- [phi-exposure-guard](https://github.com/azithteja91/phi-exposure-guard): the full system this model is part of
+- [dcpg-cross-modal-phi-risk-scorer](https://huggingface.co/vkatg/dcpg-cross-modal-phi-risk-scorer): single-event risk scorer, runs before graph construction
+- [exposureguard-fedcrdt-distill](https://huggingface.co/vkatg/exposureguard-fedcrdt-distill): takes this model's risk score as input in federated settings
+- [exposureguard-policynet](https://huggingface.co/vkatg/exposureguard-policynet): takes the patient embedding as input for policy decisions
+- [multimodal-phi-masking-benchmark](https://huggingface.co/datasets/vkatg/multimodal-phi-masking-benchmark): 10,000 records across 5 modalities with PHI spans, masking decisions, and leakage scores
+- [streaming-phi-deidentification-benchmark](https://huggingface.co/datasets/vkatg/streaming-phi-deidentification-benchmark): event-level adaptive masking traces
+---
 ## Citation
 ```bibtex
+@software{exposureguard_dcpg_encoder,
+  title  = {ExposureGuard-DCPG-Encoder: Graph Attention Encoder for Cross-Modal PHI Exposure Graphs},
+  author = {Ganti, Venkata Krishna Azith Teja},
+  doi    = {10.5281/zenodo.18865882},
   url    = {https://huggingface.co/vkatg/exposureguard-dcpg-encoder},
   note   = {US Provisional Patent filed 2025-07-05}
 }
 ```
+MIT License. All development and testing used fully synthetic data.