---
language: en
license: apache-2.0
tags:
- token-classification
- ner
- privacy
- healthcare
- deidentification
- security
- compliance
pipeline_tag: token-classification
library_name: transformers
---
# PHI Span Detector (BIO NER) — Synthetic
This model detects **Protected Health Information (PHI)** spans in clinical-note-like text and log-like text using **BIO tagging** (token classification). It is intended to power **deterministic redaction** and **zero-trust logging guardrails**.
## PHI Types
The model predicts spans for the following categories:
- **NAME**
- **DATE**
- **AGE**
- **PHONE**
- **EMAIL**
- **ADDRESS**
- **ID** (e.g., MRN/account/record IDs)
- **PROVIDER**
- **FACILITY**
- **LOCATION**
Output is BIO-formatted per token (e.g., `B-NAME`, `I-NAME`, …).
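As a minimal sketch, the BIO label inventory implied by the list above could be built as follows. The single `O` (outside) tag and the id ordering are assumptions made for illustration; the `id2label` mapping shipped in the model's own config is authoritative.

```python
# PHI categories listed in this model card.
PHI_TYPES = [
    "NAME", "DATE", "AGE", "PHONE", "EMAIL",
    "ADDRESS", "ID", "PROVIDER", "FACILITY", "LOCATION",
]

# Assumption: a standard BIO scheme with one "O" tag plus B-/I- tags per type.
# The id order here is illustrative and may differ from the model's config.
LABELS = ["O"] + [f"{prefix}-{t}" for t in PHI_TYPES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

print(len(LABELS))  # 21 = 1 "O" tag + 10 types x 2 prefixes
```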
---
## How it works
This is a **token-classification** model trained on **synthetic** examples to keep the project openly shareable:
1. Synthetic clinical notes and log lines are generated using templates.
2. PHI-like fields are inserted (names, IDs, phone numbers, dates, addresses, etc.).
3. Gold labels are produced automatically as character spans and converted to BIO token labels.
This produces clean supervision without using real patient data.
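The three steps above can be sketched roughly as follows. This is not the project's actual generator: the template syntax, the helper names `fill_template` and `spans_to_bio`, and the whitespace tokenizer are all assumptions chosen to keep the example self-contained. A real training setup would more likely align character spans to the model tokenizer's subword offsets.

```python
import re

def fill_template(template, fields):
    """Fill {LABEL} slots and record the character span of each inserted value."""
    text, spans, pos = "", [], 0
    for m in re.finditer(r"\{([A-Z]+)\}", template):
        text += template[pos:m.start()]
        value = fields[m.group(1)]
        spans.append({"start": len(text), "end": len(text) + len(value), "label": m.group(1)})
        text += value
        pos = m.end()
    return text + template[pos:], spans

def spans_to_bio(text, spans):
    """Tag whitespace tokens with BIO labels based on character-span overlap."""
    tokens, tags = [], []
    prev = None  # span matched by the previous token, used to pick B- vs I-
    for m in re.finditer(r"\S+", text):
        hit = next((s for s in spans if m.start() < s["end"] and m.end() > s["start"]), None)
        if hit is None:
            tag = "O"
        elif hit is prev:
            tag = "I-" + hit["label"]
        else:
            tag = "B-" + hit["label"]
        prev = hit
        tokens.append(m.group())
        tags.append(tag)
    return tokens, tags

# Hypothetical template and field values, for illustration only.
text, spans = fill_template(
    "Patient {NAME} seen on {DATE}.",
    {"NAME": "Jane Doe", "DATE": "01/02/2024"},
)
tokens, tags = spans_to_bio(text, spans)
# Note: whitespace tokenization leaves the trailing "." attached to the date token.
print(list(zip(tokens, tags)))
```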
---
## Intended Use
**Appropriate uses**
- PHI span detection for research prototypes
- Pre-log / post-log redaction guardrails
- De-identification pipelines when paired with deterministic redaction
**Not intended for**
- Medical diagnosis or treatment advice
- Sole control for compliance (HIPAA/GDPR) decisions
- High-stakes production usage without additional safeguards and evaluation
**Recommended pipeline:** Detect spans → deterministic redaction → secondary leak-check gate.
---
## Limitations
- Trained on **synthetic** text: real-world clinical documentation can include unseen formats and edge cases.
- May over-redact (false positives) on numeric identifiers or location-like strings.
- May miss rare PHI patterns not represented in synthetic templates.
If using in a real system, evaluate on your organization’s internal test set and consider adding:
- regex backstops (email/phone/date patterns)
- human-in-the-loop review for flagged cases
- a secondary “PHI leak checker” model
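A regex backstop of the kind suggested above might look like the sketch below. The patterns are deliberately simple, US-centric assumptions rather than anything shipped with this model, and `regex_backstop` and its `source` field are hypothetical names. The function only reports matches that no model span already overlaps.

```python
import re

# Assumption: simple illustrative patterns; real deployments need broader coverage.
BACKSTOP_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def regex_backstop(text, model_spans):
    """Return regex matches that are not already covered by a model span."""
    extra = []
    for label, pattern in BACKSTOP_PATTERNS.items():
        for m in pattern.finditer(text):
            covered = any(m.start() < s["end"] and m.end() > s["start"] for s in model_spans)
            if not covered:
                extra.append({"start": m.start(), "end": m.end(), "label": label, "source": "regex"})
    return sorted(extra, key=lambda s: s["start"])

text = "Call 555-123-4567 or mail jane.doe@example.org before 3/4/2025."
hits = regex_backstop(text, model_spans=[])
for h in hits:
    print(h["label"], text[h["start"]:h["end"]])
```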
---
## Usage
### 1) Transformers token-classification pipeline
```python
from transformers import pipeline

# "simple" aggregation merges B-/I- tokens into whole entity spans.
ner = pipeline(
    "token-classification",
    model="bharathja/phi-span-detector-deberta-v3",
    aggregation_strategy="simple",
)

text = "Patient John Smith (MRN: 001-23-4567) visited Boston Medical Center on 12/19/2025."
print(ner(text))
```
### 2) Deterministic redaction (recommended)
Use detected spans to redact with placeholders such as `[NAME]`, `[ID]`, `[DATE]`, etc.
(See companion project: PHI Guardrails.)
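A minimal sketch of span-based redaction, assuming non-overlapping spans with character offsets and an exclusive `end`. The `redact` helper is hypothetical and is not taken from the companion project; the spans are written by hand here rather than produced by the model.

```python
def redact(text, spans):
    """Replace each span with a [LABEL] placeholder.

    Spans are applied right to left so that earlier offsets stay valid.
    Assumes spans do not overlap.
    """
    out = text
    for s in sorted(spans, key=lambda s: s["start"], reverse=True):
        out = out[:s["start"]] + f"[{s['label']}]" + out[s["end"]:]
    return out

text = "Patient John Smith (MRN: 001-23-4567) visited Boston Medical Center on 12/19/2025."
# Hand-written spans for illustration, not actual model output.
spans = [
    {"start": 8, "end": 18, "label": "NAME"},
    {"start": 25, "end": 36, "label": "ID"},
    {"start": 46, "end": 67, "label": "FACILITY"},
    {"start": 71, "end": 81, "label": "DATE"},
]
print(redact(text, spans))
# Patient [NAME] (MRN: [ID]) visited [FACILITY] on [DATE].
```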
### Output Schema (recommended)
A practical production-friendly span format. Offsets are character positions in the example sentence above, with `end` exclusive:
```json
[
  {"start": 8, "end": 18, "label": "NAME", "score": 0.97},
  {"start": 25, "end": 36, "label": "ID", "score": 0.94},
  {"start": 46, "end": 67, "label": "FACILITY", "score": 0.91},
  {"start": 71, "end": 81, "label": "DATE", "score": 0.89}
]
```
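One way to get from the pipeline output to this span format is sketched below. It assumes the entity dicts returned with `aggregation_strategy="simple"` (keys `entity_group`, `score`, `word`, `start`, `end`). The `to_span_schema` helper and its `min_score` parameter are hypothetical, and the input is a hand-written stand-in so the sketch runs without downloading the model.

```python
def to_span_schema(entities, min_score=0.0):
    """Map aggregated pipeline entities to {"start", "end", "label", "score"} dicts."""
    spans = [
        {
            "start": int(e["start"]),
            "end": int(e["end"]),
            "label": e["entity_group"],
            # float() also handles NumPy scalar scores.
            "score": round(float(e["score"]), 2),
        }
        for e in entities
        if float(e["score"]) >= min_score
    ]
    return sorted(spans, key=lambda s: s["start"])

# Hand-written stand-in for pipeline output, not real model predictions.
fake_entities = [
    {"entity_group": "ID", "score": 0.94, "word": "001-23-4567", "start": 25, "end": 36},
    {"entity_group": "NAME", "score": 0.97, "word": "John Smith", "start": 8, "end": 18},
]
print(to_span_schema(fake_entities))
```

Dropping the `word` field keeps the PHI text itself out of the span records, which matters if the spans are ever logged.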
## Safety & Privacy
This model is trained on synthetic data and is published for research and tooling purposes.
Do not upload real PHI to public endpoints or demos. Use private infrastructure for real deployments.
## Citation
```bibtex
@misc{janumpally_phi_span_detector_2025,
  title = {PHI Span Detector (Synthetic)},
  author = {Bharath Kumar Reddy Janumpally},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {Model on Hugging Face}
}
```