PHI Span Detector (BIO NER) — Synthetic

This model detects Protected Health Information (PHI) spans in clinical-note-like and log-like text using BIO tagging (token classification). It is intended to power deterministic redaction and zero-trust logging guardrails.

PHI Types

The model predicts spans for the following categories:

  • NAME
  • DATE
  • AGE
  • PHONE
  • EMAIL
  • ADDRESS
  • ID (e.g., MRN/account/record IDs)
  • PROVIDER
  • FACILITY
  • LOCATION

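Output is BIO-formatted per token: B-<TYPE> marks the first token of a span, I-<TYPE> marks a continuation token, and O marks a token outside any PHI span (e.g., B-NAME, I-NAME, O, …).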
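
To make the tagging scheme concrete, here is a minimal sketch of how per-token BIO tags can be grouped back into spans. The function name `bio_to_spans` and the example tokens are illustrative assumptions, not part of the model's code.

```python
from typing import List, Tuple

def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group BIO tags into (start_token, end_token_exclusive, label) spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition("-")
        continues = prefix == "I" and kind == label
        if not continues:
            # Close the open span, if any, before handling the current tag.
            if label is not None:
                spans.append((start, i, label))
                start, label = None, None
            # A stray I- tag with no matching B- is treated as a span start.
            if prefix in ("B", "I"):
                start, label = i, kind
    if label is not None:
        spans.append((start, len(tags), label))
    return spans

# Hypothetical tokens and tags for illustration.
tokens = ["Patient", "John", "Smith", "seen", "on", "12/19/2025"]
tags = ["O", "B-NAME", "I-NAME", "O", "O", "B-DATE"]
print(bio_to_spans(tags))  # [(1, 3, 'NAME'), (5, 6, 'DATE')]
```

Span ends are exclusive, so `tokens[1:3]` recovers the NAME tokens directly.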


How it works

This is a token-classification model trained on synthetic examples, which keeps the project openly shareable. The training data is built in three steps:

  1. Synthetic clinical notes and log lines are generated using templates.
  2. PHI-like fields are inserted (names, IDs, phone numbers, dates, addresses, etc.).
  3. Gold labels are produced automatically as character spans and converted to BIO token labels.

This produces clean supervision without using real patient data.
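
The three steps can be sketched roughly as follows. This is not the project's actual generator: the `{LABEL}` slot syntax, the function names, and the whitespace tokenization are all assumptions made for illustration. A real training setup would align labels to the model tokenizer's subword offsets instead.

```python
import re
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (char_start, char_end_exclusive, label)

def fill_template(template: str, fields: Dict[str, str]) -> Tuple[str, List[Span]]:
    """Fill {LABEL} slots and record the character span of each inserted value."""
    text, spans, cursor = "", [], 0
    for m in re.finditer(r"\{([A-Z]+)\}", template):
        text += template[cursor:m.start()]
        value = fields[m.group(1)]
        spans.append((len(text), len(text) + len(value), m.group(1)))
        text += value
        cursor = m.end()
    text += template[cursor:]
    return text, spans

def spans_to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Tag whitespace tokens with BIO labels based on character overlap."""
    labeled = []
    open_span = None  # span matched by the previous token, if any
    for m in re.finditer(r"\S+", text):
        tag, hit = "O", None
        for span in spans:
            s, e, label = span
            if m.start() < e and m.end() > s:  # token overlaps this span
                hit = span
                tag = ("I-" if span == open_span else "B-") + label
                break
        open_span = hit
        labeled.append((m.group(), tag))
    return labeled

# Hypothetical template and field values.
template = "Patient {NAME} (MRN: {ID}) seen on {DATE}."
fields = {"NAME": "Jane Doe", "ID": "A12345", "DATE": "01/02/2024"}
text, spans = fill_template(template, fields)
print(spans)  # [(8, 16, 'NAME'), (23, 29, 'ID'), (39, 49, 'DATE')]
print(spans_to_bio(text, spans))
```

Because the generator inserts the values itself, the gold spans are exact by construction. With whitespace tokens, punctuation attached to a value (such as the closing parenthesis after the ID) is tagged along with it, which is one reason subword offsets are preferable in practice.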


Intended Use

Appropriate uses

  • PHI span detection for research prototypes
  • Pre-log / post-log redaction guardrails
  • De-identification pipelines when paired with deterministic redaction

Not intended for

  • Medical diagnosis or treatment advice
  • Sole control for compliance (HIPAA/GDPR) decisions
  • High-stakes production usage without additional safeguards and evaluation

Recommended pipeline: Detect spans → deterministic redaction → secondary leak-check gate.
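
One way to wire the three stages together is sketched below. The names `guarded_redact` and `PHILeakError` are hypothetical, and the detector, redactor, and leak check are passed in as plain callables so that any implementation can be substituted. The sketch fails closed: if the leak check flags the redacted text, nothing is returned.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (char_start, char_end_exclusive, label)

class PHILeakError(Exception):
    """Raised when the leak check still finds PHI after redaction."""

def guarded_redact(
    text: str,
    detect: Callable[[str], List[Span]],
    redact: Callable[[str, List[Span]], str],
    leak_check: Callable[[str], bool],
) -> str:
    """Detect spans, redact them, then refuse to return text that fails the gate."""
    spans = detect(text)
    redacted = redact(text, spans)
    if leak_check(redacted):
        raise PHILeakError("possible PHI remains after redaction")
    return redacted

# Toy stand-ins for the three stages, for illustration only.
result = guarded_redact(
    "Seen: John Smith",
    detect=lambda t: [(6, 16, "NAME")],
    redact=lambda t, spans: t[:6] + "[NAME]" + t[16:],
    leak_check=lambda t: "John" in t,
)
print(result)  # Seen: [NAME]
```

Raising an exception instead of returning partially redacted text means a caller such as a logging wrapper has to handle the failure explicitly, for example by dropping the log line.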


Limitations

  • Trained on synthetic text: real-world clinical documentation can include unseen formats and edge cases.
  • May over-redact (false positives) on numeric identifiers or location-like strings.
  • May miss rare PHI patterns not represented in synthetic templates.

If you use this model in a real system, evaluate it on your organization’s internal test set and consider adding:

  • regex backstops (email/phone/date patterns)
  • human-in-the-loop review for flagged cases
  • a secondary “PHI leak checker” model
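
As an illustration of the regex backstop idea, the sketch below adds pattern-based spans wherever the model reported nothing. The patterns are deliberately simple assumptions (US-style phone numbers, `M/D/YYYY` dates) and would need tuning for real data; `merge_with_backstops` is a hypothetical helper name.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (char_start, char_end_exclusive, label)

# Illustrative patterns only; real deployments need broader coverage.
BACKSTOPS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def regex_spans(text: str) -> List[Span]:
    """Return every backstop pattern match as a span."""
    return [
        (m.start(), m.end(), label)
        for label, pattern in BACKSTOPS.items()
        for m in pattern.finditer(text)
    ]

def merge_with_backstops(text: str, model_spans: List[Span]) -> List[Span]:
    """Keep all model spans; add regex spans that no model span overlaps."""
    merged = list(model_spans)
    for s, e, label in regex_spans(text):
        if not any(s < me and e > ms for ms, me, _ in model_spans):
            merged.append((s, e, label))
    return sorted(merged)

text = "Call 555-123-4567 or mail jdoe@example.org by 01/02/2024."
model_spans = [(5, 17, "PHONE")]  # pretend the model caught only the phone number
print(merge_with_backstops(text, model_spans))
# [(5, 17, 'PHONE'), (26, 42, 'EMAIL'), (46, 56, 'DATE')]
```

Model spans take precedence here, so the backstop can only add coverage and never removes or relabels a model prediction.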

Usage

1) Transformers token-classification pipeline

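from transformers import pipeline

# aggregation_strategy="simple" merges token-level BIO predictions into
# entity-level spans, each with entity_group, score, word, start, and end.
ner = pipeline(
    "token-classification",
    model="bharathja/phi-span-detector-deberta-v3",
    aggregation_strategy="simple",
)

# Synthetic example text; do not send real PHI to public endpoints.
text = "Patient John Smith (MRN: 001-23-4567) visited Boston Medical Center on 12/19/2025."
print(ner(text))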

2) Deterministic redaction (recommended)

Replace each detected span with a placeholder for its type, such as [NAME], [ID], or [DATE]. (See companion project: PHI Guardrails.)
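
A minimal sketch of placeholder redaction follows. It is not the companion project's implementation. It assumes spans are dictionaries with `start`, `end` (exclusive), and `label` keys, and `find_span` is a hypothetical helper that stands in for model output by locating known substrings.

```python
from typing import Dict, List

def redact(text: str, spans: List[Dict]) -> str:
    """Replace each span with [LABEL], scanning left to right.

    Overlapping or nested spans are handled by never moving the cursor
    backwards, so every character inside any span is removed.
    """
    out, cursor = [], 0
    for span in sorted(spans, key=lambda s: (s["start"], -s["end"])):
        if span["end"] <= cursor:
            continue  # already covered by an earlier placeholder
        start = max(span["start"], cursor)
        out.append(text[cursor:start])
        out.append(f"[{span['label']}]")
        cursor = span["end"]
    out.append(text[cursor:])
    return "".join(out)

def find_span(text: str, value: str, label: str) -> Dict:
    """Hypothetical stand-in for a model prediction: locate a known substring."""
    start = text.index(value)
    return {"start": start, "end": start + len(value), "label": label}

text = "Patient John Smith (MRN: 001-23-4567) visited Boston Medical Center on 12/19/2025."
spans = [
    find_span(text, "John Smith", "NAME"),
    find_span(text, "001-23-4567", "ID"),
    find_span(text, "Boston Medical Center", "FACILITY"),
    find_span(text, "12/19/2025", "DATE"),
]
print(redact(text, spans))
# Patient [NAME] (MRN: [ID]) visited [FACILITY] on [DATE].
```

The same input and spans always produce the same output, which is what makes this step deterministic and easy to audit.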

Output Schema (recommended)

A practical, production-friendly span format, shown here for the example sentence in the Usage section. Offsets are character positions in the input text, with an exclusive end:

[
  {"start": 8, "end": 18, "label": "NAME", "score": 0.97},
  {"start": 25, "end": 36, "label": "ID", "score": 0.94},
  {"start": 46, "end": 67, "label": "FACILITY", "score": 0.91},
  {"start": 71, "end": 81, "label": "DATE", "score": 0.89}
]
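
When an aggregation strategy is set, the Transformers pipeline returns dictionaries with `entity_group`, `score`, `word`, `start`, and `end` keys, and the scores are typically NumPy floats that `json.dumps` cannot serialize. The sketch below shows one way to map that output onto the schema above. `to_span_schema` and its `min_score` threshold are hypothetical, and the sample entity is hand-written, not real model output.

```python
import json
from typing import Dict, List

def to_span_schema(entities: List[Dict], min_score: float = 0.0) -> List[Dict]:
    """Convert aggregated pipeline entities to JSON-friendly span dicts."""
    return [
        {
            "start": int(e["start"]),
            "end": int(e["end"]),
            "label": e["entity_group"],
            "score": round(float(e["score"]), 4),  # plain float for JSON
        }
        for e in entities
        if float(e["score"]) >= min_score
    ]

# Hand-written stand-in for one entry of the pipeline's output.
entities = [
    {"entity_group": "NAME", "score": 0.97, "word": "John Smith", "start": 8, "end": 18},
]
print(json.dumps(to_span_schema(entities)))
```

The `word` field is dropped on purpose so that the serialized spans do not carry the PHI text itself.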

Safety & Privacy

This model is trained on synthetic data and is published for research and tooling purposes. Do not upload real PHI to public endpoints or demos. Use private infrastructure for real deployments.

Citation
@misc{janumpally_phi_span_detector_2025,
  title        = {PHI Span Detector (Synthetic)},
  author       = {Bharath Kumar Reddy Janumpally},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {Model on Hugging Face}
}