---
language: en
license: apache-2.0
tags:
- token-classification
- ner
- privacy
- healthcare
- deidentification
- security
- compliance
pipeline_tag: token-classification
library_name: transformers
---

# PHI Span Detector (BIO NER) — Synthetic

This model detects **Protected Health Information (PHI)** spans in clinical-note-style and log-style text using **BIO tagging** (token classification). It is intended to power **deterministic redaction** and **zero-trust logging guardrails**.

## PHI Types

The model predicts spans for the following categories:

- **NAME**
- **DATE**
- **AGE**
- **PHONE**
- **EMAIL**
- **ADDRESS**
- **ID** (e.g., MRN/account/record IDs)
- **PROVIDER**
- **FACILITY**
- **LOCATION**

Output is BIO-formatted per token (e.g., `B-NAME`, `I-NAME`, …): `B-` marks the first token of a span, `I-` a continuation token, and `O` a token outside any PHI span.
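
To make the tagging scheme concrete, here is a minimal sketch of how per-token BIO tags can be grouped back into spans. The helper name `bio_to_spans` and the whitespace tokens are illustrative assumptions, not part of this model's code.

```python
def bio_to_spans(tags):
    """Group BIO tags into (label, start_token, end_token) tuples, end exclusive."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an "I-" tag with the same label.
        continues = tag.startswith("I-") and tag[2:] == label
        if label is not None and not continues:
            spans.append((label, start, i))
            label, start = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            # A stray "I-" is treated as a span start rather than dropped.
            label, start = tag[2:], i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


tokens = ["Patient", "John", "Smith", "visited", "on", "12/19/2025"]
tags = ["O", "B-NAME", "I-NAME", "O", "O", "B-DATE"]
print(bio_to_spans(tags))  # [('NAME', 1, 3), ('DATE', 5, 6)]
```

Treating a stray `I-` tag as a span start is a design choice that favors over-redaction over silently dropping a possible PHI token.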

---

## How it works

This is a **token-classification** model trained on **synthetic** examples to keep the project openly shareable:

1. Synthetic clinical notes and log lines are generated using templates.
2. PHI-like fields are inserted (names, IDs, phone numbers, dates, addresses, etc.).
3. Gold labels are produced automatically as character spans and converted to BIO token labels.

This produces clean supervision without using real patient data.
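
The three steps above can be sketched roughly as follows. This is not the author's generator: the template, the function names (`make_example`, `spans_to_bio`), and the whitespace tokenization are assumptions chosen to keep the example short. A real training setup would align labels to subword tokens using the tokenizer's offset mapping.

```python
import re


def make_example(name, mrn, date):
    """Fill a template and record the character span of each inserted field."""
    template = "Patient {NAME} (MRN: {ID}) was seen on {DATE}."  # hypothetical template
    values = {"NAME": name, "ID": mrn, "DATE": date}
    text, spans, pos = "", [], 0
    for m in re.finditer(r"\{(\w+)\}", template):
        text += template[pos:m.start()]          # literal text before the slot
        label = m.group(1)
        start = len(text)
        text += values[label]                    # inserted PHI-like value
        spans.append((start, len(text), label))  # gold character span
        pos = m.end()
    text += template[pos:]
    return text, spans


def spans_to_bio(text, spans):
    """Label whitespace-separated tokens with BIO tags from character spans."""
    tokens, tags = [], []
    for m in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            if m.start() < end and m.end() > start:  # token overlaps the span
                tag = ("B-" if m.start() <= start else "I-") + label
                break
        tokens.append(m.group())
        tags.append(tag)
    return tokens, tags


text, spans = make_example("John Smith", "001-23-4567", "12/19/2025")
tokens, tags = spans_to_bio(text, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Because the spans are recorded while the text is assembled, the labels are exact by construction, which is what makes the supervision clean.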

---

## Intended Use

✅ **Appropriate uses**
- PHI span detection for research prototypes
- Pre-log / post-log redaction guardrails
- De-identification pipelines when paired with deterministic redaction

❌ **Not intended for**
- Medical diagnosis or treatment advice
- Sole control for compliance (HIPAA/GDPR) decisions
- High-stakes production usage without additional safeguards and evaluation

**Recommended pipeline:** Detect spans → deterministic redaction → secondary leak-check gate.
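
One way to wire the three stages together is a small orchestration function that fails closed when the gate still finds something. The names `guarded_redact` and `LeakDetected`, and the idea of passing each stage in as a callable, are assumptions for illustration only.

```python
class LeakDetected(Exception):
    """Raised when the secondary check still finds PHI-like content."""


def guarded_redact(text, detect, redact, leak_check):
    """Run detect -> redact -> leak check; raise instead of returning leaky text."""
    spans = detect(text)            # stage 1: model-based span detection
    redacted = redact(text, spans)  # stage 2: deterministic replacement
    leaks = leak_check(redacted)    # stage 3: independent second look
    if leaks:
        # Fail closed: never emit text that the gate still flags.
        raise LeakDetected(f"{len(leaks)} suspected PHI item(s) remain")
    return redacted
```

Keeping the stages as separate callables lets the leak check use a different mechanism (regexes, a second model) than the detector, so one stage's blind spots are less likely to be shared by the other.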

---

## Limitations

- Trained on **synthetic** text: real-world clinical documentation can include unseen formats and edge cases.
- May over-redact (false positives) on numeric identifiers or location-like strings.
- May miss rare PHI patterns not represented in synthetic templates.

If using in a real system, evaluate on your organization’s internal test set and consider adding:
- regex backstops (email/phone/date patterns)
- human-in-the-loop review for flagged cases
- a secondary “PHI leak checker” model
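
As an illustration of the regex backstop idea, the sketch below scans text for a few common surface patterns. The patterns are deliberately simple assumptions (US-style phone numbers, slash-separated dates) and will need tuning for real data; `regex_backstop` is a hypothetical helper name.

```python
import re

# Illustrative patterns only; they are neither exhaustive nor locale-aware.
BACKSTOP_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}


def regex_backstop(text):
    """Return pattern hits as span dicts, sorted by start offset."""
    hits = []
    for label, pattern in BACKSTOP_PATTERNS.items():
        for m in pattern.finditer(text):
            # score=1.0 is a placeholder marking a rule-based hit, not a probability.
            hits.append({"start": m.start(), "end": m.end(), "label": label, "score": 1.0})
    return sorted(hits, key=lambda h: h["start"])


sample = "Call 617-555-0123 or mail jo@example.org by 12/19/2025."
for hit in regex_backstop(sample):
    print(hit["label"], sample[hit["start"]:hit["end"]])
```

Hits from a backstop like this can be merged with the model's spans before redaction, or run on the redacted output as a cheap leak check.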

---

## Usage

### 1) Transformers token-classification pipeline

```python
from transformers import pipeline

# "simple" aggregation merges consecutive B-/I- tokens into one entity group.
ner = pipeline(
    "token-classification",
    model="bharathja/phi-span-detector-deberta-v3",
    aggregation_strategy="simple",
)

text = "Patient John Smith (MRN: 001-23-4567) visited Boston Medical Center on 12/19/2025."
print(ner(text))  # list of dicts with entity_group, score, word, start, end
```

### 2) Deterministic redaction (recommended)

Use detected spans to redact with placeholders such as `[NAME]`, `[ID]`, `[DATE]`, etc.
(See companion project: PHI Guardrails.)
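
A minimal sketch of span-based redaction, assuming spans are dicts with `start`, `end`, and `label` keys and do not overlap. The `redact` function is illustrative and is not taken from the companion project.

```python
def redact(text, spans):
    """Replace each character span with a "[LABEL]" placeholder."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        out = out[:span["start"]] + f"[{span['label']}]" + out[span["end"]:]
    return out


text = "Patient John Smith (MRN: 001-23-4567) visited Boston Medical Center on 12/19/2025."
spans = [
    {"start": 8, "end": 18, "label": "NAME"},
    {"start": 25, "end": 36, "label": "ID"},
    {"start": 46, "end": 67, "label": "FACILITY"},
    {"start": 71, "end": 81, "label": "DATE"},
]
print(redact(text, spans))
# Patient [NAME] (MRN: [ID]) visited [FACILITY] on [DATE].
```

The replacement is a pure function of the text and the spans, so the same input always yields the same redacted output, which keeps the step easy to audit and test.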

### Output Schema (recommended)

A practical production-friendly span format, with `start` and `end` as character offsets into the input text (`end` exclusive). The offsets below correspond to the example sentence from the pipeline usage above; the scores are illustrative:

```python
[
    {"start": 8, "end": 18, "label": "NAME", "score": 0.97},
    {"start": 25, "end": 36, "label": "ID", "score": 0.94},
    {"start": 46, "end": 67, "label": "FACILITY", "score": 0.91},
    {"start": 71, "end": 81, "label": "DATE", "score": 0.89}
]
```
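
The Transformers pipeline with an aggregation strategy returns dicts keyed by `entity_group`, `score`, `word`, `start`, and `end`. A small adapter along these lines can map that output to the schema; `to_span_schema` is a hypothetical helper, and the sample `entities` list is hand-written for illustration rather than real model output.

```python
def to_span_schema(entities):
    """Convert aggregated pipeline output into start/end/label/score dicts."""
    spans = [
        {
            "start": int(e["start"]),
            "end": int(e["end"]),
            "label": e["entity_group"],
            # Cast to a plain float so the result is JSON-serializable.
            "score": round(float(e["score"]), 2),
        }
        for e in entities
    ]
    return sorted(spans, key=lambda s: s["start"])


# Hand-written stand-in for pipeline output (not produced by the model).
entities = [
    {"entity_group": "ID", "score": 0.94, "word": "001-23-4567", "start": 25, "end": 36},
    {"entity_group": "NAME", "score": 0.97, "word": "John Smith", "start": 8, "end": 18},
]
print(to_span_schema(entities))
```

The `word` field is dropped on purpose so that the span records themselves never carry the PHI text into logs.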

### Safety & Privacy

This model is trained on synthetic data and is published for research and tooling purposes.
Do not upload real PHI to public endpoints or demos. Use private infrastructure for real deployments.

### Citation

```bibtex
@misc{janumpally_phi_span_detector_2025,
  title = {PHI Span Detector (Synthetic)},
  author = {Bharath Kumar Reddy Janumpally},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {Model on Hugging Face}
}
```