PHI Span Detector (BIO NER) - Synthetic

phi-span-detector-deberta-v3 is a DeBERTa v3 token-classification model for detecting Protected Health Information (PHI) spans in clinical-note-like text and log-like text using BIO tagging.

It is designed for privacy tooling workflows such as:

  • deterministic redaction pipelines
  • pre-log and post-log PHI guardrails
  • research prototypes for de-identification

Recommended pipeline:

  1. detect PHI spans
  2. apply deterministic redaction
  3. run a secondary leak-check gate before downstream use
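The three stages above can be sketched as one function. This is a minimal sketch, not the card's reference implementation: `detect_spans` stands in for the span model's pipeline call, and `leak_check` is a hypothetical callable wrapping the companion leak-checker model.

```python
def redact_with_gate(text, detect_spans, leak_check):
    """Stage 1: detect_spans(text) returns spans with start/end/entity_group.
    Stage 2: deterministic placeholder redaction, applied right to left so
    earlier character offsets stay valid. Stage 3: leak_check gates output."""
    spans = detect_spans(text)
    redacted = text
    for item in sorted(spans, key=lambda s: s["start"], reverse=True):
        redacted = redacted[: item["start"]] + f"[{item['entity_group']}]" + redacted[item["end"] :]
    if not leak_check(redacted):
        raise ValueError("leak check failed; route to human review")
    return redacted
```

Redacting right to left is deliberate: replacing a span changes the string length, so processing spans in reverse start order keeps the remaining offsets correct.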

Companion model: bharathjanumpally/phi-leak-checker-deberta-v3

Model at a glance

  • Task: token classification
  • Architecture: DebertaV2ForTokenClassification
  • Base model: microsoft/deberta-v3-base
  • Max sequence length: 512
  • Labeling scheme: BIO
  • Training data: synthetic text only
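Because the max sequence length is 512 tokens, longer documents need to be chunked before inference. A minimal character-window sketch with overlap (the window sizes and overlap strategy are assumptions, not part of this model's training setup):

```python
def chunk_text(text, window=1000, overlap=100):
    """Split long text into overlapping character windows.

    Yields (offset, chunk) pairs. Downstream span offsets must be shifted
    by the chunk's offset, and duplicate detections in the overlap region
    should be de-duplicated before redaction.
    """
    step = window - overlap
    for start in range(0, max(len(text), 1), step):
        yield start, text[start : start + window]
```

The overlap exists so that a PHI span straddling a chunk boundary is seen whole in at least one window.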

PHI label set

The model predicts the following entity families:

Label     Meaning
NAME      patient or person names
DATE      visit dates, birth dates, service dates
AGE       age mentions that may be identifying in context
PHONE     phone and callback numbers
EMAIL     email addresses
ADDRESS   street or mailing addresses
ID        MRN, account, encounter, record, or similar identifiers
PROVIDER  clinician or provider names
FACILITY  hospitals, clinics, centers, departments
LOCATION  city, state, and other place references

Token-level outputs use BIO labels from the model config:

O, B-*, and I-* across the ten PHI families above.
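In a BIO scheme, each family contributes a B- tag (begins a span) and an I- tag (continues it), plus the single O tag for everything else, giving 21 labels in total. A quick sketch of the inventory:

```python
FAMILIES = [
    "NAME", "DATE", "AGE", "PHONE", "EMAIL",
    "ADDRESS", "ID", "PROVIDER", "FACILITY", "LOCATION",
]

# BIO scheme: B- opens a span, I- continues it, O is outside any span.
LABELS = ["O"] + [f"{prefix}-{fam}" for fam in FAMILIES for prefix in ("B", "I")]

# Example tagging for "Call Jane Doe today":
#   Call -> O,  Jane -> B-NAME,  Doe -> I-NAME,  today -> O
```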

How the training data was built

This model was trained on synthetic examples to keep the project openly shareable.

High-level training recipe:

  1. Generate synthetic clinical notes and log-like text with templates.
  2. Insert PHI-like fields such as names, dates, IDs, facilities, phone numbers, and addresses.
  3. Convert gold character spans into BIO token labels for token classification.

This provides clean supervision without exposing real patient data, but it also means real-world formatting and writing styles may differ from training-time distributions.
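Step 3 of the recipe, converting gold character spans into BIO token labels, can be sketched with a fast tokenizer's offset mapping (`return_offsets_mapping=True`). This is a simplified version: a real converter also has to handle special tokens and decide a policy for partial token/span overlaps.

```python
def spans_to_bio(offsets, spans):
    """offsets: list of (start, end) character offsets per token.
    spans: list of (start, end, family) gold character spans.
    Returns one BIO label per token."""
    labels = ["O"] * len(offsets)
    for s_start, s_end, fam in spans:
        first = True
        for i, (t_start, t_end) in enumerate(offsets):
            if t_start >= s_end or t_end <= s_start:
                continue  # token does not overlap this gold span
            labels[i] = ("B-" if first else "I-") + fam
            first = False
    return labels
```

For example, with tokens "Call", "Jane", "Smith", "now" and a gold NAME span covering "Jane Smith", the function emits O, B-NAME, I-NAME, O.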

Evaluation

The repository includes a full seqeval_report.txt. Key held-out results from that report are summarized below.

Overall metrics

Metric           Value
Micro precision  0.6657
Micro recall     0.6394
Micro F1         0.6523
Macro precision  0.6583
Macro recall     0.6224
Macro F1         0.6362
Weighted F1      0.6495
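As a sanity check, micro F1 is the harmonic mean of micro precision and recall, and the reported values are consistent. (Macro F1 averages per-label F1 scores, so it is not the harmonic mean of macro precision and recall.)

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

micro_f1 = f1(0.6657, 0.6394)  # ~0.6523, matching the reported micro F1
```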

Per-label metrics

Label     Precision  Recall  F1      Support
ADDRESS   0.6652     0.6481  0.6565   233
AGE       0.6758     0.3834  0.4893   386
DATE      0.6553     0.6492  0.6522  1297
EMAIL     0.6474     0.6455  0.6465   347
FACILITY  0.6320     0.6494  0.6406   656
ID        0.6652     0.6519  0.6585   451
LOCATION  0.6600     0.6600  0.6600   350
NAME      0.7810     0.7802  0.7806  1001
PHONE     0.5358     0.5025  0.5186   595
PROVIDER  0.6652     0.6537  0.6594   231

Interpretation

  • Strongest label in the current report: NAME
  • Weakest labels in the current report: PHONE and AGE
  • The model is usable as a PHI span detector for research and tooling, but it should be paired with deterministic rules and internal evaluation before higher-stakes deployment

Intended use

Appropriate uses:

  • PHI span detection in research prototypes
  • de-identification pipelines when paired with deterministic redaction
  • zero-trust logging guardrails
  • preprocessing before a secondary PHI leak checker

Not intended for:

  • medical diagnosis or treatment advice
  • sole control for HIPAA, GDPR, or other compliance decisions
  • unsupervised high-stakes production usage without internal validation

Limitations and failure modes

  • The model was trained on synthetic text, so real clinical documentation may include unseen abbreviations, formatting quirks, shorthand, OCR noise, and edge cases.
  • Numeric strings may be over-flagged when they resemble IDs, dates, or phone numbers.
  • Some rare PHI patterns may be missed if they were not well represented in the synthetic templates.
  • Partial tokens and tokenizer boundary effects can require careful post-processing in downstream systems.
  • Label performance is uneven; current metrics suggest extra caution around PHONE and AGE.

Recommended mitigations:

  • add regex backstops for structured entities like email, phone, and date
  • apply deterministic placeholder redaction after detection
  • run a second PHI leak-check model before downstream release
  • evaluate on an internal, policy-approved test set that matches your real document style
  • keep a human-review path for ambiguous or high-risk content
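The regex-backstop mitigation can be sketched as a union of model spans and pattern matches, keeping a regex hit only when the model found nothing overlapping it. The patterns below are illustrative only; production rules need locale-aware formats for phone numbers and dates.

```python
import re

# Illustrative backstop patterns; not production-grade.
BACKSTOPS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def backstop_spans(text, model_spans):
    """Union model spans with regex matches, skipping regex hits that
    overlap a span the model already found."""
    spans = list(model_spans)
    for label, pattern in BACKSTOPS.items():
        for m in pattern.finditer(text):
            overlaps = any(
                m.start() < s["end"] and m.end() > s["start"] for s in spans
            )
            if not overlaps:
                spans.append({"start": m.start(), "end": m.end(), "label": label})
    return sorted(spans, key=lambda s: s["start"])
```

This ordering matters: model spans win on overlap, so the backstop only fills gaps rather than fighting the model's boundaries.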

Usage

Transformers pipeline

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="bharathjanumpally/phi-span-detector-deberta-v3",
    aggregation_strategy="simple",
)

text = (
    "Patient John Smith (MRN: 001-23-4567) visited "
    "Boston Medical Center on 12/19/2025."
)

print(ner(text))

AutoModel and AutoTokenizer

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_id = "bharathjanumpally/phi-span-detector-deberta-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

print(ner("Call Jane Doe at 617-555-0182 before 04/14/2025."))

Deterministic redaction example

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="bharathjanumpally/phi-span-detector-deberta-v3",
    aggregation_strategy="simple",
)

text = (
    "Patient John Smith (MRN: 001-23-4567) visited "
    "Boston Medical Center on 12/19/2025."
)

spans = ner(text)

redacted = text
for item in sorted(spans, key=lambda x: x["start"], reverse=True):
    label = item["entity_group"]
    redacted = redacted[: item["start"]] + f"[{label}]" + redacted[item["end"] :]

print(spans)
print(redacted)

Example output schema

For downstream systems, a practical span schema is:

[
  {"start": 8, "end": 18, "label": "NAME", "score": 0.97},
  {"start": 25, "end": 36, "label": "ID", "score": 0.94},
  {"start": 45, "end": 66, "label": "FACILITY", "score": 0.91},
  {"start": 70, "end": 80, "label": "DATE", "score": 0.89}
]
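The transformers pipeline emits `entity_group` and non-JSON-native float types, so a small adapter (a hypothetical helper, not part of this repository) maps its output onto the schema above:

```python
def to_schema(pipeline_output):
    """Map transformers token-classification output (with
    aggregation_strategy="simple") onto a plain-JSON span schema."""
    return [
        {
            "start": int(item["start"]),
            "end": int(item["end"]),
            "label": item["entity_group"],
            "score": round(float(item["score"]), 2),
        }
        for item in pipeline_output
    ]
```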

Safety and privacy

This model was trained on synthetic data and is published for research and tooling purposes. Do not send real PHI to public demos or public inference endpoints. Use private infrastructure, access controls, and organization-approved evaluation workflows for real deployments.

Citation

@misc{janumpally_phi_span_detector_2025,
  title        = {PHI Span Detector (Synthetic)},
  author       = {Bharath Kumar Reddy Janumpally},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {Model on Hugging Face}
}

Final note

If you use this model in a serious workflow, validate it against your own internal test cases and document the operating policy around false positives, false negatives, and escalation paths.
