PHI Span Detector (BIO NER) - Synthetic

phi-span-detector-deberta-v3 is a DeBERTa v3 token-classification model for detecting Protected Health Information (PHI) spans in clinical-note-like text and log-like text using BIO tagging.

It is designed for privacy tooling workflows such as:

  • deterministic redaction pipelines
  • pre-log and post-log PHI guardrails
  • research prototypes for de-identification

Recommended pipeline:

  1. detect PHI spans
  2. apply deterministic redaction
  3. run a secondary leak-check gate before downstream use
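The three stages above can be sketched as one function. This is a minimal sketch, not the card's reference implementation: `detect_spans` stands in for the span model's pipeline call, and `leak_check` is a hypothetical callable wrapping the companion leak-checker model.

```python
def redact_with_gate(text, detect_spans, leak_check):
    """Stage 1: detect_spans(text) returns spans with start/end/entity_group.
    Stage 2: deterministic placeholder redaction, applied right to left so
    earlier character offsets stay valid. Stage 3: leak_check gates output."""
    spans = detect_spans(text)
    redacted = text
    for item in sorted(spans, key=lambda s: s["start"], reverse=True):
        redacted = redacted[: item["start"]] + f"[{item['entity_group']}]" + redacted[item["end"] :]
    if not leak_check(redacted):
        raise ValueError("leak check failed; route to human review")
    return redacted
```

Redacting right to left is deliberate: replacing a span changes the string length, so processing spans in reverse start order keeps the remaining offsets correct.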

Companion model: bharathjanumpally/phi-leak-checker-deberta-v3

Model at a glance

  • Task: token classification
  • Architecture: DebertaV2ForTokenClassification
  • Base model: microsoft/deberta-v3-base
  • Max sequence length: 512
  • Labeling scheme: BIO
  • Training data: synthetic text only
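Because the max sequence length is 512 tokens, longer documents need to be chunked before inference. A minimal character-window sketch with overlap (the window sizes and overlap strategy are assumptions, not part of this model's training setup):

```python
def chunk_text(text, window=1000, overlap=100):
    """Split long text into overlapping character windows.

    Yields (offset, chunk) pairs. Downstream span offsets must be shifted
    by the chunk's offset, and duplicate detections in the overlap region
    should be de-duplicated before redaction.
    """
    step = window - overlap
    for start in range(0, max(len(text), 1), step):
        yield start, text[start : start + window]
```

The overlap exists so that a PHI span straddling a chunk boundary is seen whole in at least one window.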

PHI label set

The model predicts the following entity families:

Label     Meaning
NAME      patient or person names
DATE      visit dates, birth dates, service dates
AGE       age mentions that may be identifying in context
PHONE     phone and callback numbers
EMAIL     email addresses
ADDRESS   street or mailing addresses
ID        MRN, account, encounter, record, or similar identifiers
PROVIDER  clinician or provider names
FACILITY  hospitals, clinics, centers, departments
LOCATION  city, state, and other place references

Token-level outputs use BIO labels from the model config:

O, B-*, and I-* across the ten PHI families above.
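In a BIO scheme, each family contributes a B- tag (begins a span) and an I- tag (continues it), plus the single O tag for everything else, giving 21 labels in total. A quick sketch of the inventory:

```python
FAMILIES = [
    "NAME", "DATE", "AGE", "PHONE", "EMAIL",
    "ADDRESS", "ID", "PROVIDER", "FACILITY", "LOCATION",
]

# BIO scheme: B- opens a span, I- continues it, O is outside any span.
LABELS = ["O"] + [f"{prefix}-{fam}" for fam in FAMILIES for prefix in ("B", "I")]

# Example tagging for "Call Jane Doe today":
#   Call -> O,  Jane -> B-NAME,  Doe -> I-NAME,  today -> O
```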

How the training data was built

This model was trained on synthetic examples to keep the project openly shareable.

High-level training recipe:

  1. Generate synthetic clinical notes and log-like text with templates.
  2. Insert PHI-like fields such as names, dates, IDs, facilities, phone numbers, and addresses.
  3. Convert gold character spans into BIO token labels for token classification.

This provides clean supervision without exposing real patient data, but it also means real-world formatting and writing styles may differ from training-time distributions.
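Step 3 of the recipe, converting gold character spans into BIO token labels, can be sketched with a fast tokenizer's offset mapping (`return_offsets_mapping=True`). This is a simplified version: a real converter also has to handle special tokens and decide a policy for partial token/span overlaps.

```python
def spans_to_bio(offsets, spans):
    """offsets: list of (start, end) character offsets per token.
    spans: list of (start, end, family) gold character spans.
    Returns one BIO label per token."""
    labels = ["O"] * len(offsets)
    for s_start, s_end, fam in spans:
        first = True
        for i, (t_start, t_end) in enumerate(offsets):
            if t_start >= s_end or t_end <= s_start:
                continue  # token does not overlap this gold span
            labels[i] = ("B-" if first else "I-") + fam
            first = False
    return labels
```

For example, with tokens "Call", "Jane", "Smith", "now" and a gold NAME span covering "Jane Smith", the function emits O, B-NAME, I-NAME, O.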

Evaluation

The repository includes a full seqeval_report.txt. Key held-out results from that report are summarized below.

Overall metrics

Metric           Value
Micro precision  0.6657
Micro recall     0.6394
Micro F1         0.6523
Macro precision  0.6583
Macro recall     0.6224
Macro F1         0.6362
Weighted F1      0.6495
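As a sanity check, micro F1 is the harmonic mean of micro precision and recall, and the reported values are consistent. (Macro F1 averages per-label F1 scores, so it is not the harmonic mean of macro precision and recall.)

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

micro_f1 = f1(0.6657, 0.6394)  # ~0.6523, matching the reported micro F1
```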

Per-label metrics

Label     Precision  Recall  F1      Support
ADDRESS   0.6652     0.6481  0.6565   233
AGE       0.6758     0.3834  0.4893   386
DATE      0.6553     0.6492  0.6522  1297
EMAIL     0.6474     0.6455  0.6465   347
FACILITY  0.6320     0.6494  0.6406   656
ID        0.6652     0.6519  0.6585   451
LOCATION  0.6600     0.6600  0.6600   350
NAME      0.7810     0.7802  0.7806  1001
PHONE     0.5358     0.5025  0.5186   595
PROVIDER  0.6652     0.6537  0.6594   231

Interpretation

  • Strongest label in the current report: NAME
  • Weakest labels in the current report: PHONE and AGE
  • The model is usable as a PHI span detector for research and tooling, but it should be paired with deterministic rules and internal evaluation before higher-stakes deployment

Intended use

Appropriate uses:

  • PHI span detection in research prototypes
  • de-identification pipelines when paired with deterministic redaction
  • zero-trust logging guardrails
  • preprocessing before a secondary PHI leak checker

Not intended for:

  • medical diagnosis or treatment advice
  • sole control for HIPAA, GDPR, or other compliance decisions
  • unsupervised high-stakes production usage without internal validation

Limitations and failure modes

  • The model was trained on synthetic text, so real clinical documentation may include unseen abbreviations, formatting quirks, shorthand, OCR noise, and edge cases.
  • Numeric strings may be over-flagged when they resemble IDs, dates, or phone numbers.
  • Some rare PHI patterns may be missed if they were not well represented in the synthetic templates.
  • Partial tokens and tokenizer boundary effects can require careful post-processing in downstream systems.
  • Label performance is uneven; current metrics suggest extra caution around PHONE and AGE.

Recommended mitigations:

  • add regex backstops for structured entities like email, phone, and date
  • apply deterministic placeholder redaction after detection
  • run a second PHI leak-check model before downstream release
  • evaluate on an internal, policy-approved test set that matches your real document style
  • keep a human-review path for ambiguous or high-risk content
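The regex-backstop mitigation can be sketched as a union of model spans and pattern matches, keeping a regex hit only when the model found nothing overlapping it. The patterns below are illustrative only; production rules need locale-aware formats for phone numbers and dates.

```python
import re

# Illustrative backstop patterns; not production-grade.
BACKSTOPS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def backstop_spans(text, model_spans):
    """Union model spans with regex matches, skipping regex hits that
    overlap a span the model already found."""
    spans = list(model_spans)
    for label, pattern in BACKSTOPS.items():
        for m in pattern.finditer(text):
            overlaps = any(
                m.start() < s["end"] and m.end() > s["start"] for s in spans
            )
            if not overlaps:
                spans.append({"start": m.start(), "end": m.end(), "label": label})
    return sorted(spans, key=lambda s: s["start"])
```

This ordering matters: model spans win on overlap, so the backstop only fills gaps rather than fighting the model's boundaries.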

Usage

Transformers pipeline

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="bharathjanumpally/phi-span-detector-deberta-v3",
    aggregation_strategy="simple",
)

text = (
    "Patient John Smith (MRN: 001-23-4567) visited "
    "Boston Medical Center on 12/19/2025."
)

print(ner(text))

AutoModel and AutoTokenizer

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_id = "bharathjanumpally/phi-span-detector-deberta-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

print(ner("Call Jane Doe at 617-555-0182 before 04/14/2025."))

Deterministic redaction example

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="bharathjanumpally/phi-span-detector-deberta-v3",
    aggregation_strategy="simple",
)

text = (
    "Patient John Smith (MRN: 001-23-4567) visited "
    "Boston Medical Center on 12/19/2025."
)

spans = ner(text)

redacted = text
for item in sorted(spans, key=lambda x: x["start"], reverse=True):
    label = item["entity_group"]
    redacted = redacted[: item["start"]] + f"[{label}]" + redacted[item["end"] :]

print(spans)
print(redacted)

Example output schema

For downstream systems, a practical span schema is:

[
  {"start": 8, "end": 18, "label": "NAME", "score": 0.97},
  {"start": 25, "end": 36, "label": "ID", "score": 0.94},
  {"start": 45, "end": 66, "label": "FACILITY", "score": 0.91},
  {"start": 70, "end": 80, "label": "DATE", "score": 0.89}
]
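The transformers pipeline emits `entity_group` and non-JSON-native float types, so a small adapter (a hypothetical helper, not part of this repository) maps its output onto the schema above:

```python
def to_schema(pipeline_output):
    """Map transformers token-classification output (with
    aggregation_strategy="simple") onto a plain-JSON span schema."""
    return [
        {
            "start": int(item["start"]),
            "end": int(item["end"]),
            "label": item["entity_group"],
            "score": round(float(item["score"]), 2),
        }
        for item in pipeline_output
    ]
```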

Safety and privacy

This model was trained on synthetic data and is published for research and tooling purposes. Do not send real PHI to public demos or public inference endpoints. Use private infrastructure, access controls, and organization-approved evaluation workflows for real deployments.

Citation

@misc{janumpally_phi_span_detector_2025,
  title        = {PHI Span Detector (Synthetic)},
  author       = {Bharath Kumar Reddy Janumpally},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {Model on Hugging Face}
}

Final note

If you use this model in a serious workflow, validate it against your own internal test cases and document the operating policy around false positives, false negatives, and escalation paths.
