PHI Span Detector (BIO NER) — Synthetic

This model detects Protected Health Information (PHI) spans in text that resembles clinical notes and application logs, using BIO tagging (token classification). It is intended to power deterministic redaction and zero-trust logging guardrails.

PHI Types

The model predicts spans for the following categories:

NAME
DATE
AGE
PHONE
EMAIL
ADDRESS
ID (e.g., MRN/account/record IDs)
PROVIDER
FACILITY
LOCATION

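Output is one BIO tag per token: B-<TYPE> marks the first token of a span, I-<TYPE> marks a continuation token, and O marks a token outside any span (e.g., B-NAME, I-NAME, O).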
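As a minimal sketch, the tag set implied by the ten categories can be built as follows. The ordering (O first, then B-/I- pairs in the order listed above) is an assumption; the authoritative mapping is whatever id2label the published model config contains.

```python
# Assumed label order: "O" first, then B-/I- pairs in the order the card lists the types.
PHI_TYPES = [
    "NAME", "DATE", "AGE", "PHONE", "EMAIL",
    "ADDRESS", "ID", "PROVIDER", "FACILITY", "LOCATION",
]

LABELS = ["O"] + [f"{prefix}-{t}" for t in PHI_TYPES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

print(len(LABELS))   # 21: one "O" plus B-/I- for each of the 10 types
print(LABELS[:3])    # ['O', 'B-NAME', 'I-NAME']
```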

How it works

This is a token-classification model trained on synthetic examples to keep the project openly shareable:

Synthetic clinical notes and log lines are generated using templates.
PHI-like fields are inserted (names, IDs, phone numbers, dates, addresses, etc.).
Gold labels are produced automatically as character spans and converted to BIO token labels.

This produces clean supervision without using real patient data.
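The three steps above can be sketched roughly as below. The template, the field values, and the helper names fill_template and spans_to_bio are hypothetical and only illustrate the idea; the actual generator is not described in this card. Whitespace tokenization is used for brevity, whereas a subword tokenizer would align labels through its offset mapping.

```python
import random
import re

# Hypothetical template and field values, for illustration only.
TEMPLATE = "Patient {NAME} (MRN: {ID}) seen on {DATE}."
VALUES = {
    "NAME": ["Jane Doe", "Alan Poe"],
    "ID": ["A-10293", "B-55810"],
    "DATE": ["01/02/2024", "03/15/2023"],
}

def fill_template(template, values, rng):
    """Fill {FIELD} slots and record (start, end, label) character spans."""
    text, spans, cursor = "", [], 0
    for m in re.finditer(r"\{([A-Z]+)\}", template):
        text += template[cursor:m.start()]
        value = rng.choice(values[m.group(1)])
        spans.append((len(text), len(text) + len(value), m.group(1)))
        text += value
        cursor = m.end()
    text += template[cursor:]
    return text, spans

def spans_to_bio(text, spans):
    """Tag whitespace tokens: B- for the first token of a span, I- for the rest."""
    tokens, tags = [], []
    for m in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            if m.start() < end and m.end() > start:  # token overlaps the span
                tag = ("B-" if m.start() <= start else "I-") + label
                break
        tokens.append(m.group())
        tags.append(tag)
    return tokens, tags

text, spans = fill_template(TEMPLATE, VALUES, random.Random(0))
tokens, tags = spans_to_bio(text, spans)
print(list(zip(tokens, tags)))
```

Because the labels come from the insertion offsets rather than from annotation, they are exact by construction, which is what makes the supervision clean.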

Intended Use

✅ Appropriate uses

PHI span detection for research prototypes
Pre-log / post-log redaction guardrails
De-identification pipelines when paired with deterministic redaction

❌ Not intended for

Medical diagnosis or treatment advice
Sole control for compliance (HIPAA/GDPR) decisions
High-stakes production usage without additional safeguards and evaluation

Recommended pipeline: Detect spans → deterministic redaction → secondary leak-check gate.
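One way to wire the three stages together is a fail-closed wrapper: if the leak check still sees PHI after redaction, the text is withheld rather than emitted. This is a sketch only. The function guarded_redact and the toy stand-ins are hypothetical; in practice detect would call the model and leak_check would be a separate checker.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive

def guarded_redact(
    text: str,
    detect: Callable[[str], List[Span]],
    redact: Callable[[str, List[Span]], str],
    leak_check: Callable[[str], bool],
    blocked: str = "[REDACTION BLOCKED]",
) -> str:
    """Detect -> redact -> gate. Returns `blocked` if the gate still sees PHI."""
    redacted = redact(text, detect(text))
    return blocked if leak_check(redacted) else redacted

# Toy stand-ins so the sketch runs without a model (assumptions, not the real stages).
def toy_detect(text):
    return [(m.start(), m.end(), "NAME") for m in re.finditer(r"John Smith", text)]

def toy_redact(text, spans):
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

def toy_leak_check(text):
    # Flags anything shaped like 3-2-4 digits that survived redaction.
    return re.search(r"\d{3}-\d{2}-\d{4}", text) is not None

print(guarded_redact("Patient John Smith arrived.", toy_detect, toy_redact, toy_leak_check))
print(guarded_redact("John Smith, MRN 001-23-4567", toy_detect, toy_redact, toy_leak_check))
```

The second call is blocked because the toy detector misses the ID while the gate catches it, which is the situation the secondary gate exists for.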

Limitations

Trained on synthetic text: real-world clinical documentation can include unseen formats and edge cases.
May over-redact (false positives) on numeric identifiers or location-like strings.
May miss rare PHI patterns not represented in synthetic templates.
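Over-redaction and missed spans can be quantified as span-level precision and recall on a labeled test set. The sketch below uses exact-match scoring over (start, end, label) tuples for a single document; the function name span_prf and the example spans are illustrative, not results from this model.

```python
def span_prf(gold, pred):
    """Exact-match span precision, recall, and F1 over (start, end, label) tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Made-up spans: one false positive (LOCATION) and one miss (DATE).
gold = [(8, 18, "NAME"), (25, 36, "ID"), (71, 81, "DATE")]
pred = [(8, 18, "NAME"), (25, 36, "ID"), (46, 52, "LOCATION")]

precision, recall, f1 = span_prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For redaction, recall is usually the number to watch, since a missed span is a leak while a false positive only costs readability.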

If using in a real system, evaluate on your organization’s internal test set and consider adding:

regex backstops (email/phone/date patterns)
human-in-the-loop review for flagged cases
a secondary “PHI leak checker” model
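A regex backstop can be as small as the sketch below. The patterns are illustrative assumptions (US-style phone numbers, numeric slash-separated dates) and would need tuning for real data; the score of 1.0 for rule-based hits is likewise a convention chosen here, not something the model emits.

```python
import re

# Illustrative patterns only; they assume US-style phones and numeric dates.
BACKSTOPS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def regex_spans(text):
    """Return rule-based spans sorted by start offset (end exclusive)."""
    spans = []
    for label, pattern in BACKSTOPS.items():
        for m in pattern.finditer(text):
            spans.append({"start": m.start(), "end": m.end(), "label": label, "score": 1.0})
    return sorted(spans, key=lambda s: s["start"])

sample = "Call 617-555-0100 or mail jo@example.org by 12/19/2025."
for s in regex_spans(sample):
    print(s["label"], sample[s["start"]:s["end"]])
```

Spans from the backstop can be unioned with the model's spans before redaction, so a pattern the model misses is still covered.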

Usage

1) Transformers token-classification pipeline

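from transformers import pipeline

# aggregation_strategy="simple" merges B-/I- token predictions into one entity
# per span, each with entity_group, score, word, start, and end fields.
ner = pipeline(
    "token-classification",
    model="bharathja/phi-span-detector-deberta-v3",
    aggregation_strategy="simple",
)

text = "Patient John Smith (MRN: 001-23-4567) visited Boston Medical Center on 12/19/2025."
print(ner(text))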
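With an aggregation strategy set, the pipeline returns a list of dicts keyed by entity_group, score, word, start, and end. A small adapter can map these to the span format recommended under Output Schema. The helper name to_span_records and the min_score filter are assumptions, and the sample input is hand-written to mimic the pipeline's output shape rather than taken from the model.

```python
def to_span_records(entities, min_score=0.0):
    """Map aggregated pipeline entities to {start, end, label, score} records."""
    return [
        {
            "start": int(e["start"]),
            "end": int(e["end"]),
            "label": e["entity_group"],
            "score": round(float(e["score"]), 4),  # score may arrive as a NumPy float
        }
        for e in entities
        if e["score"] >= min_score
    ]

# Hand-written sample in the pipeline's output shape (not real model output).
sample_entities = [
    {"entity_group": "NAME", "score": 0.97, "word": "John Smith", "start": 8, "end": 18},
    {"entity_group": "ID", "score": 0.41, "word": "001-23-4567", "start": 25, "end": 36},
]

print(to_span_records(sample_entities, min_score=0.5))
```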

2) Deterministic redaction (recommended)

Use detected spans to redact with placeholders such as [NAME], [ID], [DATE], etc. (See companion project: PHI Guardrails.)
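A minimal sketch of placeholder redaction follows, assuming spans carry character offsets with an exclusive end. The function name redact and the overlap policy (merge overlapping spans and keep the earlier span's label) are choices made for this example, not the companion project's implementation.

```python
def redact(text, spans):
    """Replace each span with a [LABEL] placeholder, merging overlaps first."""
    merged = []
    for s in sorted(spans, key=lambda s: (s["start"], -s["end"])):
        if merged and s["start"] < merged[-1]["end"]:
            # Overlap: keep the earlier span's label and extend its end.
            merged[-1]["end"] = max(merged[-1]["end"], s["end"])
        else:
            merged.append(dict(s))
    out, cursor = [], 0
    for s in merged:
        out.append(text[cursor:s["start"]])
        out.append(f"[{s['label']}]")
        cursor = s["end"]
    out.append(text[cursor:])
    return "".join(out)

text = "Patient John Smith (MRN: 001-23-4567) visited Boston Medical Center on 12/19/2025."
# Offsets are computed from the text here so the example does not depend on a model.
values = [("NAME", "John Smith"), ("ID", "001-23-4567"),
          ("FACILITY", "Boston Medical Center"), ("DATE", "12/19/2025")]
spans = [{"start": text.index(v), "end": text.index(v) + len(v), "label": l} for l, v in values]

print(redact(text, spans))
```

Given the same text and spans, the output is always the same string, which is what makes the redaction step deterministic and auditable.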

Output Schema (recommended)

A practical, production-friendly span format uses 0-based character offsets into the input text, with an exclusive end. For the example sentence in the Usage section:

[
  {"start": 8, "end": 18, "label": "NAME", "score": 0.97},
  {"start": 25, "end": 36, "label": "ID", "score": 0.94},
  {"start": 46, "end": 67, "label": "FACILITY", "score": 0.91},
  {"start": 71, "end": 81, "label": "DATE", "score": 0.89}
]
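Before redacting, it can help to sanity-check spans in this format. The sketch below assumes 0-based character offsets with an exclusive end, which is consistent with the NAME and ID entries for the example sentence; check_spans is a hypothetical helper.

```python
def check_spans(text, spans):
    """Validate offsets and return (label, surface text) pairs in document order.

    Raises ValueError on out-of-bounds, empty, or overlapping spans.
    """
    ordered = sorted(spans, key=lambda s: s["start"])
    prev_end = 0
    for s in ordered:
        if not (0 <= s["start"] < s["end"] <= len(text)):
            raise ValueError(f"bad offsets: {s}")
        if s["start"] < prev_end:
            raise ValueError(f"overlapping span: {s}")
        prev_end = s["end"]
    return [(s["label"], text[s["start"]:s["end"]]) for s in ordered]

text = "Patient John Smith (MRN: 001-23-4567) visited Boston Medical Center on 12/19/2025."
spans = [
    {"start": 8, "end": 18, "label": "NAME", "score": 0.97},
    {"start": 25, "end": 36, "label": "ID", "score": 0.94},
]
print(check_spans(text, spans))
```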

Safety & Privacy

This model is trained on synthetic data and is published for research and tooling purposes. Do not upload real PHI to public endpoints or demos. Use private infrastructure for real deployments.

Citation
@misc{janumpally_phi_span_detector_2025,
  title        = {PHI Span Detector (Synthetic)},
  author       = {Bharath Kumar Reddy Janumpally},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {Model on Hugging Face}
}

Model size: 0.2B params (Safetensors, tensor type F32)