riggsmed committed on
Commit
cd7adee
·
verified ·
1 Parent(s): d743963

Update README.md

Files changed (1)
  1. README.md +135 -141
README.md CHANGED
@@ -1,86 +1,97 @@
1
  # deid-LONGFORMER-NemPII
2
 
3
  **HIPAA-compliant clinical de-identification that beats commercial solutions—at zero cost.**
4
 
5
- A fine-tuned Clinical-Longformer model for Protected Health Information (PHI) detection and replacement in clinical text, achieving **97.74% F1** on held-out test data.
 
 
6
 
7
- ## Why This Exists
8
 
9
- Commercial de-identification solutions are expensive and produce unusable output:
10
 
11
  | Solution | F1 Score | Cost | Replacement Quality |
12
  |----------|----------|------|---------------------|
13
  | AWS Comprehend Medical | ~83-93% | $14.5K/1M notes | Basic placeholders |
14
  | John Snow Labs | 96-97% | Enterprise license | Basic placeholders |
15
- | **deid-LONGFORMER-NemPII** | **97.74%** | **Free/self-hosted** | **Realistic surrogates** |
16
-
17
- Most tools just redact PHI with `[REDACTED]` or `***`, leaving text that's difficult to read and impossible to use for downstream NLP tasks. This model generates **realistic surrogate data** that preserves clinical meaning while protecting patient privacy.
18
-
19
- ## Acknowledgments
20
-
21
- This project stands on the shoulders of excellent prior work:
22
 
23
- ### Inspiration: obi/deid_bert_i2b2
24
 
25
- This model was directly inspired by [**obi/deid_bert_i2b2**](https://huggingface.co/obi/deid_bert_i2b2) from the Open Biomedical Informatics team (Prajwal Kailas, Max Homilius, Shinichi Goto). Their work on ClinicalBERT-based de-identification using the I2B2 2014 dataset demonstrated the viability of transformer-based approaches for this task. The [robust-deid](https://github.com/obi-ml-public/ehr_deidentification) framework they developed provided invaluable reference for architecture decisions and evaluation methodology.
26
 
27
- ### Base Model: yikuan8/Clinical-Longformer
28
 
29
- The base model is [**yikuan8/Clinical-Longformer**](https://huggingface.co/yikuan8/Clinical-Longformer) by Li, Yikuan et al. This clinical knowledge-enriched Longformer was pre-trained on MIMIC-III clinical notes and supports sequences up to 4,096 tokens—critical for processing real-world clinical documents that often exceed BERT's 512-token limit.
30
 
31
- > Li, Yikuan, et al. "A comparative study of pretrained language models for long clinical text." *Journal of the American Medical Informatics Association* 30.2 (2023): 340-347.
32
 
33
- ### Training Data: NVIDIA Nemotron-PII
34
 
35
- Training data comes from the healthcare subset of [**NVIDIA's Nemotron-PII**](https://huggingface.co/datasets/nvidia/Nemotron-PII) dataset (3,630 records, CC BY 4.0 license). This synthetic dataset provides diverse PHI patterns without exposing real patient data.
36
 
37
- ## Key Differentiators
38
 
39
- Unlike competitors that just redact text, this system generates **clinically useful surrogates**:
40
 
41
- ### Age-Preserving DOB Replacement
42
- Dates of birth are replaced with fake DOBs that preserve the patient's age within ±2 years. A 67-year-old patient stays clinically 65-69, not "[REDACTED]".
 
 
43
 
44
- ### Context-Aware Detection
45
- The model recognizes that a DATE entity following "DOB:" should receive age-preserving treatment, not standard date shifting.
46
 
47
- ### Name Consistency
48
- Multiple references to the same person map to the same fake name:
49
- - "Dr. Sarah Elizabeth Johnson, MD" → "Dr. Maria Rodriguez, MD"
50
- - "Sarah E. Johnson" → "Maria Rodriguez"
51
- - "Dr. Johnson" → "Dr. Rodriguez"
52
 
53
- ### Temporal Consistency
54
- All dates in a document shift by the same random offset. If admission was January 15 and discharge was January 20, the 5-day relationship is preserved.
 
 
 
 
55
 
56
- ### Geographic Consistency
57
- City, state, and ZIP code replacements are coherent—you won't get "Phoenix, NY 33101".
58
 
59
- ### Format Preservation
60
- Phone numbers, SSNs, and dates maintain their original format:
61
- - `(555) 123-4567` → `(555) 987-6543` (not `5559876543`)
62
- - `01/15/2024` → `03/22/2024` (not `2024-03-22`)
63
 
64
- ### Medical Term Protection
65
- A whitelist prevents false positives on medical terms:
66
- - "Anion Gap" stays "Anion Gap" (not replaced as a name)
67
- - "BUN" stays "BUN"
68
- - "2 weeks" stays "2 weeks" (not detected as a date)
69
-
70
- ### Adjacent Entity Merging
71
- When the model fragments entities across tokens, post-processing merges them:
72
- - `["Jan", "uary", " ", "15"]` → `"January 15"` (single DATE entity)
73
-
74
- ## Model Architecture
75
-
76
- ```
77
- Base Model: yikuan8/Clinical-Longformer
78
- Parameters: 148M
79
- Max Length: 4,096 tokens
80
- Task: Token Classification (NER)
81
- Tagging: BILOU scheme
82
- Classes: 101 (25 PHI types × 4 BILOU tags + O)
83
- ```
84
 
85
  ### PHI Categories (25 types)
86
 
@@ -92,128 +103,115 @@ EMAIL, STREET_ADDRESS, CITY, STATE, POSTCODE, COUNTRY,
92
  BIOMETRIC_IDENTIFIER, UNIQUE_ID, CUSTOMER_ID, EMPLOYEE_ID
93
  ```
94
 
95
- ## Installation
96
 
97
- ```bash
98
- git clone https://github.com/Hrygt/deid-longformer-nempii.git
99
- cd deid-longformer-nempii
100
- pip install -r requirements.txt
101
- ```
 
 
102
 
103
- Download model weights from HuggingFace:
104
 
105
- ```python
106
- from huggingface_hub import snapshot_download
107
- snapshot_download('riggsmed/deid-LONGFORMER-NemPII', local_dir='model')
108
- ```
 
109
 
110
- ## Quick Start
111
 
112
- ### Python API
113
 
114
  ```python
115
- from deid import deidentify_text
 
116
 
117
- text = """
118
- PATIENT: John Smith
119
- DOB: 01/15/1957
120
- MRN: 123456789
121
 
122
- Mr. Smith presented to the ED on 12/09/2024 with chest pain.
123
- Contact: (405) 555-1234
124
- """
125
 
126
- result = deidentify_text(text)
127
- print(result["deidentified_text"])
128
- ```
129
 
130
- Output:
131
- ```
132
- PATIENT: Robert Johnson
133
- DOB: 03/22/1955
134
- MRN: 987654321
135
 
136
- Mr. Johnson presented to the ED on 02/15/2025 with chest pain.
137
- Contact: (555) 987-6543
 
138
  ```
139
 
140
- ### FastAPI Service
141
 
142
- ```bash
143
- uvicorn api:app --host 0.0.0.0 --port 8001
144
- ```
145
 
146
- ```bash
147
- curl -X POST http://localhost:8001/deidentify \
148
- -H "Content-Type: application/json" \
149
- -d '{"text": "Patient John Smith, DOB 01/15/1957"}'
 
 
 
 
 
150
  ```
151
 
152
- ## Training
153
 
154
- To train on your own data:
155
 
156
  ```bash
157
- python train.py \
158
- --model_name yikuan8/Clinical-Longformer \
159
- --train_file data/train.json \
160
- --val_file data/val.json \
161
- --output_dir checkpoints \
162
- --epochs 10 \
163
- --batch_size 4 \
164
- --learning_rate 2e-5
165
  ```
166
 
167
- Data format (JSONL):
168
- ```json
169
- {"text": "Patient John Smith...", "entities": [{"start": 8, "end": 18, "label": "NAME"}]}
170
- ```
171
 
172
- ## Evaluation Results
 
 
 
173
 
174
- Evaluated on 20% held-out split from Nemotron-PII healthcare subset:
175
 
176
- | Metric | Score |
177
- |--------|-------|
178
- | **F1** | **97.74%** |
179
- | Precision | 97.62% |
180
- | Recall | 97.86% |
181
 
182
- ### Per-Entity Performance
183
 
184
- | Entity Type | F1 | Support |
185
- |-------------|-----|---------|
186
- | NAME | 98.2% | 1,247 |
187
- | DATE | 97.9% | 2,103 |
188
- | PHONE_NUMBER | 99.1% | 312 |
189
- | SSN | 98.7% | 89 |
190
- | STREET_ADDRESS | 96.4% | 445 |
191
- | ... | ... | ... |
192
 
193
  ## Live Demo
194
 
195
  Try it at: **https://deid.riggsmedai.com**
196
 
197
- ## License
198
-
199
- - **Model weights**: Apache 2.0
200
- - **Code**: Apache 2.0
201
- - **Training data**: CC BY 4.0 (NVIDIA Nemotron-PII)
202
-
203
  ## Citation
204
 
205
- If you use this model in your research, please cite:
206
-
207
  ```bibtex
208
  @software{riggs2024deid,
209
  author = {Riggs, Gary},
210
  title = {deid-LONGFORMER-NemPII: Clinical De-identification with Realistic Surrogates},
211
  year = {2024},
212
- url = {https://github.com/Hrygt/deid-longformer-nempii}
213
  }
214
  ```
215
 
216
- And please also cite the foundational work this builds upon:
217
 
218
  ```bibtex
219
  @article{li2023comparative,
@@ -240,10 +238,6 @@ And please also cite the foundational work this builds upon:
240
  Medical Director, Metro Physician Group
241
  Master of Science in Data Science candidate, Northwestern University
242
 
243
- ## Contributing
244
-
245
- Contributions welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
246
-
247
- ---
248
 
249
- *Built with ❤️ for healthcare AI that respects patient privacy.*
 
1
+ ---
2
+ language:
3
+ - en
4
+ license: apache-2.0
5
+ library_name: transformers
6
+ tags:
7
+ - longformer
8
+ - medical
9
+ - clinical
10
+ - ner
11
+ - de-identification
12
+ - phi
13
+ - hipaa
14
+ - healthcare
15
+ - token-classification
16
+ datasets:
17
+ - nvidia/Nemotron-PII
18
+ base_model: yikuan8/Clinical-Longformer
19
+ metrics:
20
+ - f1
21
+ - precision
22
+ - recall
23
+ pipeline_tag: token-classification
24
+ widget:
25
+ - text: "Patient John Smith, DOB 01/15/1957, MRN 123456789, presented with chest pain."
26
+   example_title: "Clinical Note"
27
+ - text: "Contact Dr. Sarah Johnson at (405) 555-1234 or sarah.johnson@hospital.org"
28
+   example_title: "Contact Info"
29
+ ---
30
+
31
  # deid-LONGFORMER-NemPII
32
 
33
  **HIPAA-compliant clinical de-identification that beats commercial solutions—at zero cost.**
34
 
35
+ A fine-tuned [Clinical-Longformer](https://huggingface.co/yikuan8/Clinical-Longformer) model for Protected Health Information (PHI) detection and replacement in clinical text, achieving **97.74% F1** on held-out test data.
36
+
37
+ ## Model Description
38
 
39
+ This model identifies 25 types of Protected Health Information (PHI) in clinical text using BILOU tagging (101 classes total). Unlike commercial solutions that simply redact PHI with `[REDACTED]`, the accompanying replacement logic generates **realistic surrogate data** that preserves clinical meaning.
40
 
41
+ ### Performance Comparison
42
 
43
  | Solution | F1 Score | Cost | Replacement Quality |
44
  |----------|----------|------|---------------------|
45
  | AWS Comprehend Medical | ~83-93% | $14.5K/1M notes | Basic placeholders |
46
  | John Snow Labs | 96-97% | Enterprise license | Basic placeholders |
47
+ | **deid-LONGFORMER-NemPII** | **97.74%** | **Free** | **Realistic surrogates** |
 
 
 
 
 
 
48
 
49
+ ## Acknowledgments & Inspiration
50
 
51
+ This model builds directly on excellent prior work:
52
 
53
+ ### 🙏 obi/deid_bert_i2b2 — The Inspiration
54
 
55
+ This project was directly inspired by [**obi/deid_bert_i2b2**](https://huggingface.co/obi/deid_bert_i2b2) from the Open Biomedical Informatics team (Prajwal Kailas, Max Homilius, Shinichi Goto). Their pioneering work on ClinicalBERT-based de-identification using the I2B2 2014 dataset demonstrated the viability of transformer-based approaches for PHI detection. The [robust-deid](https://github.com/obi-ml-public/ehr_deidentification) framework they developed provided invaluable reference for architecture decisions, BILOU tagging schemes, and evaluation methodology.
56
 
57
+ ### 🏥 yikuan8/Clinical-Longformer: The Base Model
58
 
59
+ Built on [**yikuan8/Clinical-Longformer**](https://huggingface.co/yikuan8/Clinical-Longformer) by Li, Yikuan et al. This clinical knowledge-enriched Longformer was pre-trained on MIMIC-III clinical notes and supports sequences up to 4,096 tokens—critical for processing real-world clinical documents that often exceed BERT's 512-token limit.
60
 
61
+ ### 📊 NVIDIA Nemotron-PII: The Training Data
62
 
63
+ Trained on the healthcare subset of [**nvidia/Nemotron-PII**](https://huggingface.co/datasets/nvidia/Nemotron-PII) (3,630 records, CC BY 4.0). This synthetic dataset provides diverse PHI patterns without exposing real patient data.
64
 
65
+ ## Intended Uses
66
 
67
+ - **Clinical research**: De-identify notes for IRB-compliant research datasets
68
+ - **Healthcare NLP**: Prepare training data for downstream clinical NLP tasks
69
+ - **Data sharing**: Enable safe sharing of clinical text between institutions
70
+ - **Quality improvement**: Analyze clinical documentation without PHI exposure
71
 
72
+ ## Key Features
 
73
 
74
+ The replacement logic (in the [GitHub repo](https://github.com/Hrygt/deid-longformer-nempii)) provides:
 
 
 
 
75
 
76
+ - **Age-preserving DOB**: Fake DOBs keep patient age within ±2 years
77
+ - **Name consistency**: "Dr. Sarah Johnson" and "Sarah J." map to the same fake name
78
+ - **Temporal consistency**: All dates shift by the same offset (preserves intervals)
79
+ - **Geographic consistency**: City/state/ZIP combinations are coherent
80
+ - **Format preservation**: Phone numbers, SSNs, and dates keep their original format
81
+ - **Medical term protection**: A whitelist keeps terms such as "Anion Gap" from being replaced as names
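The temporal-consistency and format-preservation bullets can be illustrated with a small sketch. This is not the repository's implementation: `shift_dates`, its parameters, and the fixed `MM/DD/YYYY` format are assumptions made for the example. The idea is to draw one offset per document and apply it to every date, so intervals between dates survive and the output keeps the input format.

```python
import random
from datetime import datetime, timedelta


def shift_dates(date_strings, fmt="%m/%d/%Y", max_shift_days=365, seed=None):
    """Shift every date by the same random, non-zero offset (hypothetical helper)."""
    rng = random.Random(seed)
    # One offset for the whole document keeps intervals between dates intact.
    offset_days = rng.randint(1, max_shift_days) * rng.choice((-1, 1))
    offset = timedelta(days=offset_days)
    # Parsing and re-serializing with the same format string preserves the layout.
    return [(datetime.strptime(d, fmt) + offset).strftime(fmt) for d in date_strings]


admission, discharge = shift_dates(["01/15/2024", "01/20/2024"], seed=7)
interval = datetime.strptime(discharge, "%m/%d/%Y") - datetime.strptime(admission, "%m/%d/%Y")
print(admission, discharge, interval.days)
```

Whatever offset is drawn, the admission-to-discharge interval stays at 5 days. A real system would also need to detect each date's format rather than assume one.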
82
 
83
+ ## Training Details
 
84
 
85
+ ### Architecture
 
 
 
86
 
87
+ | Parameter | Value |
88
+ |-----------|-------|
89
+ | Base Model | yikuan8/Clinical-Longformer |
90
+ | Parameters | 148M |
91
+ | Max Length | 4,096 tokens |
92
+ | Task | Token Classification |
93
+ | Tagging | BILOU scheme |
94
+ | Classes | 101 (25 PHI types × 4 tags + O) |
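The class count in the table follows from the tagging scheme: each PHI type gets four positional tags (B, I, L, U), and a single shared `O` tag covers everything else. A minimal sketch of that arithmetic, assuming a hypothetical `build_bilou_labels` helper, a `PREFIX-TYPE` label format, and placeholder type names (the real label names and ordering are whatever the model's `config.id2label` contains):

```python
def build_bilou_labels(phi_types):
    """Return a BILOU label list for the given entity types (illustrative helper)."""
    labels = ["O"]  # one shared "outside" tag
    for phi_type in phi_types:
        for prefix in ("B", "I", "L", "U"):  # Begin, Inside, Last, Unit
            labels.append(f"{prefix}-{phi_type}")
    return labels


# Placeholder names only; stand-ins for the 25 PHI categories.
placeholder_types = [f"TYPE_{i}" for i in range(25)]
labels = build_bilou_labels(placeholder_types)
print(len(labels))  # 25 * 4 + 1
```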
95
 
96
  ### PHI Categories (25 types)
97
 
 
103
  BIOMETRIC_IDENTIFIER, UNIQUE_ID, CUSTOMER_ID, EMPLOYEE_ID
104
  ```
105
 
106
+ ### Training Procedure
107
 
108
+ - **Dataset**: NVIDIA Nemotron-PII healthcare subset (3,630 records)
109
+ - **Split**: 80% train / 20% test
110
+ - **Epochs**: 10
111
+ - **Batch size**: 4
112
+ - **Learning rate**: 2e-5
113
+ - **Optimizer**: AdamW
114
+ - **Hardware**: NVIDIA T4 GPU
115
 
116
+ ## Evaluation Results
117
 
118
+ | Metric | Score |
119
+ |--------|-------|
120
+ | **F1** | **97.74%** |
121
+ | Precision | 97.62% |
122
+ | Recall | 97.86% |
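As a quick consistency check, the reported F1 can be recomputed from the reported precision and recall. This assumes all three figures come from the same averaging (for example micro-averaged over entities), which the card does not state; under that assumption F1 is the harmonic mean of the other two.

```python
precision = 0.9762
recall = 0.9786

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.2%}")  # 97.74%
```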
123
 
124
+ ## Usage
125
 
126
+ ### With Transformers
127
 
128
  ```python
129
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
130
+ import torch
131
 
132
+ tokenizer = AutoTokenizer.from_pretrained("riggsmed/deid-LONGFORMER-NemPII")
133
+ model = AutoModelForTokenClassification.from_pretrained("riggsmed/deid-LONGFORMER-NemPII")
 
 
134
 
135
+ text = "Patient John Smith, DOB 01/15/1957, presented with chest pain."
136
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
 
137
 
138
+ with torch.no_grad():
139
+     outputs = model(**inputs)
140
+ predictions = torch.argmax(outputs.logits, dim=-1)
141
 
142
+ # Decode predictions to entity labels
143
+ tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
144
+ labels = [model.config.id2label[p.item()] for p in predictions[0]]
 
 
145
 
146
+ for token, label in zip(tokens, labels):
147
+     if label != "O":
148
+         print(f"{token}: {label}")
149
  ```
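The loop above prints one label per subword token, while replacement logic usually needs whole entity spans. The sketch below groups a BILOU label sequence into `(start, end, type)` token spans. It is illustrative only: `bilou_to_spans` is a hypothetical helper, and it assumes labels are formatted like `B-NAME`, which should be checked against `model.config.id2label` before relying on it.

```python
def bilou_to_spans(labels):
    """Group BILOU labels into (start, end_exclusive, entity_type) token spans."""
    spans = []
    start, current = None, None
    for i, label in enumerate(labels):
        if label == "O":
            start, current = None, None
            continue
        prefix, _, ent = label.partition("-")
        if prefix == "U":  # single-token entity
            spans.append((i, i + 1, ent))
            start, current = None, None
        elif prefix == "B":  # open a new multi-token entity
            start, current = i, ent
        elif prefix == "I" and current == ent:  # continue the open entity
            continue
        elif prefix == "L" and current == ent:  # close the open entity
            spans.append((start, i + 1, ent))
            start, current = None, None
        else:
            # Malformed sequence (I- or L- without a matching B-): drop it.
            start, current = None, None
    return spans


example = ["O", "B-NAME", "L-NAME", "O", "U-DATE"]
print(bilou_to_spans(example))  # [(1, 3, 'NAME'), (4, 5, 'DATE')]
```

The indices refer to token positions, so mapping back to character offsets would still require the tokenizer's offset mapping.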
150
 
151
+ ### With Pipeline
152
 
153
+ ```python
154
+ from transformers import pipeline
 
155
 
156
+ pipe = pipeline("token-classification",
157
+                 model="riggsmed/deid-LONGFORMER-NemPII",
158
+                 aggregation_strategy="simple")
159
+
160
+ text = "Contact Dr. Sarah Johnson at (405) 555-1234"
161
+ entities = pipe(text)
162
+
163
+ for ent in entities:
164
+     print(f"{ent['word']}: {ent['entity_group']} ({ent['score']:.2f})")
165
  ```
166
 
167
+ ### Full De-identification (with surrogates)
168
 
169
+ For realistic surrogate replacement, use the full system from GitHub:
170
 
171
  ```bash
172
+ git clone https://github.com/Hrygt/deid-longformer-nempii.git
173
+ cd deid-longformer-nempii
174
+ pip install -r requirements.txt
 
 
 
 
 
175
  ```
176
 
177
+ ```python
178
+ from deid import deidentify_text
 
 
179
 
180
+ result = deidentify_text("Patient John Smith, DOB 01/15/1957")
181
+ print(result["deidentified_text"])
182
+ # Example output (surrogates are generated, so values will vary): "Patient Robert Johnson, DOB 03/22/1955"
183
+ ```
184
 
185
+ ## Limitations
186
 
187
+ - **English only**: Trained exclusively on English clinical text
188
+ - **US-centric**: PHI patterns (SSN format, US addresses) are US-focused
189
+ - **Synthetic training data**: May miss edge cases in real clinical notes
190
+ - **Not a substitute for expert review**: For high-stakes applications, human review is recommended
 
191
 
192
+ ## Ethical Considerations
193
 
194
+ - This model is intended to **protect patient privacy**, not circumvent it
195
+ - De-identified data should still be handled according to institutional policies
196
+ - The model may have biases from training data that could affect certain demographic groups
197
+ - Always validate de-identification quality on your specific data before production use
 
 
 
 
198
 
199
  ## Live Demo
200
 
201
  Try it at: **https://deid.riggsmedai.com**
202
 
 
 
 
 
 
 
203
  ## Citation
204
 
 
 
205
  ```bibtex
206
  @software{riggs2024deid,
207
  author = {Riggs, Gary},
208
  title = {deid-LONGFORMER-NemPII: Clinical De-identification with Realistic Surrogates},
209
  year = {2024},
210
+ url = {https://huggingface.co/riggsmed/deid-LONGFORMER-NemPII}
211
  }
212
  ```
213
 
214
+ Please also cite the foundational work:
215
 
216
  ```bibtex
217
  @article{li2023comparative,
 
238
  Medical Director, Metro Physician Group
239
  Master of Science in Data Science candidate, Northwestern University
240
 
241
+ ## Model Card Contact
 
 
 
 
242
 
243
+ For questions or issues: [GitHub Issues](https://github.com/Hrygt/deid-longformer-nempii/issues)