PHI Span Detector (BIO NER) — Synthetic
This model detects Protected Health Information (PHI) spans in clinical-note-style and log-style text using BIO tagging (token classification). It is intended to feed deterministic redaction and zero-trust logging guardrails.
PHI Types
The model predicts spans for the following categories:
- NAME
- DATE
- AGE
- PHONE
- ADDRESS
- ID (e.g., MRN/account/record IDs)
- PROVIDER
- FACILITY
- LOCATION
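Output is BIO-formatted per token: B-<TYPE> marks the first token of a span, I-<TYPE> marks a continuation token, and O marks a non-PHI token (e.g., B-NAME, I-NAME, …).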
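To make the tag scheme concrete, here is a minimal sketch of grouping per-token BIO tags back into spans. The function name `bio_to_spans` and the sample tokens are illustrative assumptions, not part of the model's code, and the handling of a stray I- tag (treated as a span start) is one common convention rather than the model's documented behavior.

```python
from typing import List, Tuple

def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group BIO tags into (start_token, end_token_exclusive, label) spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the label matches.
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.append((start, i, label))  # close the open span
                start, label = None, None
            if tag.startswith("B-") or tag.startswith("I-"):
                # Assumption: a stray I- tag opens a new span instead of being dropped.
                start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(tags), label))
    return spans

# Hypothetical tokens and tags, for illustration only.
tokens = ["Patient", "John", "Smith", "seen", "on", "12/19/2025"]
tags = ["O", "B-NAME", "I-NAME", "O", "O", "B-DATE"]
spans = bio_to_spans(tags)
print(spans)  # [(1, 3, 'NAME'), (5, 6, 'DATE')]
```

When using the Transformers pipeline with an aggregation strategy, this grouping is done for you; the sketch is mainly useful when working with raw per-token predictions.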
How it works
This is a token-classification model trained on synthetic examples to keep the project openly shareable:
- Synthetic clinical notes and log lines are generated using templates.
- PHI-like fields are inserted (names, IDs, phone numbers, dates, addresses, etc.).
- Gold labels are produced automatically as character spans and converted to BIO token labels.
This produces clean supervision without using real patient data.
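The steps above can be sketched roughly as follows. This is not the author's generator: the template, the filler values, and the function names are all invented for illustration, and whitespace tokenization stands in for the model's real subword tokenizer.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical template and filler values, invented for this sketch.
TEMPLATE = "Patient {NAME} (MRN: {ID}) was seen on {DATE}."
FILLERS: Dict[str, List[str]] = {
    "NAME": ["John Smith", "Maria Garcia"],
    "ID": ["001-23-4567", "884-10-2231"],
    "DATE": ["12/19/2025", "03/02/2024"],
}

Span = Tuple[int, int, str]  # (start_char, end_char_exclusive, label)

def fill_template(template: str, rng: random.Random) -> Tuple[str, List[Span]]:
    """Fill {LABEL} slots left to right, recording a character span per slot."""
    text, spans, pos = "", [], 0
    while True:
        open_idx = template.find("{", pos)
        if open_idx == -1:
            text += template[pos:]
            break
        close_idx = template.index("}", open_idx)
        label = template[open_idx + 1:close_idx]
        text += template[pos:open_idx]
        value = rng.choice(FILLERS[label])
        spans.append((len(text), len(text) + len(value), label))
        text += value
        pos = close_idx + 1
    return text, spans

def char_spans_to_bio(text: str, spans: List[Span]) -> Tuple[List[str], List[str]]:
    """Whitespace-tokenize and tag each token by overlap with a character span."""
    tokens, tags, cursor = [], [], 0
    for tok in text.split():
        start = text.index(tok, cursor)
        end = start + len(tok)
        cursor = end
        tag = "O"
        for s, e, label in spans:
            if start < e and end > s:  # token overlaps the span
                # The first overlapping token starts at or before the span start.
                tag = ("B-" if start <= s else "I-") + label
                break
        tokens.append(tok)
        tags.append(tag)
    return tokens, tags

rng = random.Random(0)  # fixed seed so the example is reproducible
text, spans = fill_template(TEMPLATE, rng)
tokens, tags = char_spans_to_bio(text, spans)
for tok, tag in zip(tokens, tags):
    print(f"{tok}\t{tag}")
```

Because the labels are derived from where values were inserted, no manual annotation is needed. Note that with whitespace tokens, trailing punctuation such as ")" or "." ends up inside the tagged token; a real setup would typically align labels using the tokenizer's character offsets instead.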
Intended Use
✅ Appropriate uses
- PHI span detection for research prototypes
- Pre-log / post-log redaction guardrails
- De-identification pipelines when paired with deterministic redaction
❌ Not intended for
- Medical diagnosis or treatment advice
- Sole control for compliance (HIPAA/GDPR) decisions
- High-stakes production usage without additional safeguards and evaluation
Recommended pipeline: Detect spans → deterministic redaction → secondary leak-check gate.
Limitations
- Trained on synthetic text: real-world clinical documentation can include unseen formats and edge cases.
- May over-redact (false positives) on numeric identifiers or location-like strings.
- May miss rare PHI patterns not represented in synthetic templates.
If using in a real system, evaluate on your organization’s internal test set and consider adding:
- regex backstops (email/phone/date patterns)
- human-in-the-loop review for flagged cases
- a secondary “PHI leak checker” model
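As one possible shape for a regex backstop, the sketch below scans text for a few rigid patterns and can double as a simple leak check on already-redacted output. The patterns, the `EMAIL` label, and the function names are assumptions made for illustration; they are deliberately narrow (US-style phone numbers, slash-separated dates) and would need to be adapted and tested against your own data.

```python
import re
from typing import Dict, List

# Illustrative patterns only; not exhaustive and not locale-aware.
BACKSTOP_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

def regex_backstop(text: str) -> List[dict]:
    """Return character spans for every backstop pattern match, sorted by start."""
    hits = []
    for label, pattern in BACKSTOP_PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append({"start": m.start(), "end": m.end(), "label": label})
    return sorted(hits, key=lambda h: h["start"])

def passes_leak_check(redacted_text: str) -> bool:
    """True when no backstop pattern fires on text that should already be redacted."""
    return not regex_backstop(redacted_text)

# Hypothetical redacted line where the model missed a phone number and an email.
redacted = "Patient [NAME] called from 617-555-0123 on [DATE]; reply to jsmith@example.org."
hits = regex_backstop(redacted)
for h in hits:
    print(h["label"], redacted[h["start"]:h["end"]])
print("safe to log:", passes_leak_check(redacted))
```

A gate like this is best treated as fail-closed: if the check does not pass, the text is held back or sent for review rather than logged.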
Usage
1) Transformers token-classification pipeline
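from transformers import pipeline

# "simple" aggregation merges B-/I- token predictions into whole entity spans.
ner = pipeline(
    "token-classification",
    model="bharathja/phi-span-detector-deberta-v3",
    aggregation_strategy="simple",
)

text = "Patient John Smith (MRN: 001-23-4567) visited Boston Medical Center on 12/19/2025."
# Each result is a dict with entity_group, score, word, start, and end (character offsets).
print(ner(text))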
2) Deterministic redaction (recommended)
Use detected spans to redact with placeholders such as [NAME], [ID], [DATE], etc. (See companion project: PHI Guardrails.)
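A minimal sketch of this step is shown below. It is not the PHI Guardrails implementation: the `redact` function is a hypothetical helper, it assumes spans carry character offsets with an exclusive end, and it assumes spans do not overlap.

```python
from typing import Dict, List

def redact(text: str, spans: List[Dict]) -> str:
    """Replace each span with a [LABEL] placeholder.

    Spans are applied right to left so earlier offsets stay valid after
    each replacement. Assumes spans do not overlap.
    """
    out = text
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        out = out[:span["start"]] + f"[{span['label']}]" + out[span["end"]:]
    return out

text = "Patient John Smith (MRN: 001-23-4567) visited Boston Medical Center on 12/19/2025."
# Hand-written spans for illustration (character offsets, end exclusive).
spans = [
    {"start": 8, "end": 18, "label": "NAME"},
    {"start": 25, "end": 36, "label": "ID"},
    {"start": 46, "end": 67, "label": "FACILITY"},
    {"start": 71, "end": 81, "label": "DATE"},
]
redacted = redact(text, spans)
print(redacted)
# Patient [NAME] (MRN: [ID]) visited [FACILITY] on [DATE].
```

Given the same text and spans, the output is always the same, which is what makes this stage deterministic and easy to audit separately from the model.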
Output Schema (recommended)
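A practical, production-friendly span format uses character offsets into the input text, with an exclusive end offset. For the example sentence in the Usage section (scores are illustrative):
[
  {"start": 8, "end": 18, "label": "NAME", "score": 0.97},
  {"start": 25, "end": 36, "label": "ID", "score": 0.94},
  {"start": 46, "end": 67, "label": "FACILITY", "score": 0.91},
  {"start": 71, "end": 81, "label": "DATE", "score": 0.89}
]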
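One way to get from the Transformers pipeline output to this format is sketched below. With an aggregation strategy set, the pipeline returns dicts with `entity_group`, `score`, `word`, `start`, and `end` keys; the `to_span_records` helper and its `min_score` parameter are assumptions for illustration, and the sample input is a hand-written stand-in rather than real model output.

```python
from typing import Dict, List

def to_span_records(entities: List[Dict], min_score: float = 0.0) -> List[Dict]:
    """Map aggregated pipeline entities to {"start", "end", "label", "score"} records."""
    records = []
    for ent in entities:
        score = float(ent["score"])  # pipeline scores may be NumPy floats
        if score < min_score:
            continue
        records.append({
            "start": int(ent["start"]),
            "end": int(ent["end"]),
            "label": ent["entity_group"],
            "score": round(score, 2),
        })
    return sorted(records, key=lambda r: r["start"])

# Hand-written stand-in for pipeline output; the scores are made up.
sample_entities = [
    {"entity_group": "ID", "score": 0.94, "word": "001-23-4567", "start": 25, "end": 36},
    {"entity_group": "NAME", "score": 0.97, "word": "John Smith", "start": 8, "end": 18},
]
records = to_span_records(sample_entities)
print(records)
```

The default `min_score` of 0.0 keeps every detected span. For redaction, raising the threshold trades fewer false positives for a higher risk of missed PHI, so it is worth tuning on your own evaluation set.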
Safety & Privacy
This model is trained on synthetic data and is published for research and tooling purposes. Do not upload real PHI to public endpoints or demos. Use private infrastructure for real deployments.
Citation
@misc{janumpally_phi_span_detector_2025,
  title = {PHI Span Detector (Synthetic)},
  author = {Bharath Kumar Reddy Janumpally},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {Model on Hugging Face}
}