ClarioScope PHI detector v1

A 125M-parameter RoBERTa-base fine-tune that locates protected health information (PHI) in inbound patient text. It tags spans across the 18 HIPAA Safe Harbor identifier categories at roughly 39–69× lower single-example latency than the frontier APIs benchmarked below (28.6 ms on CPU vs 1.1–2.0 s) and $0 per inference when self-hosted. This is model 2 of the ClarioScope SLM Suite, a three-model intake intelligence pipeline for healthcare practices.

A note on naming. The repo is named clarioscope-phi-deberta-v1 because the original plan was DeBERTa-v3. During training, DeBERTa-v3-base produced reproducible NaN gradients on this 37-label token classification setup (forward loss healthy, backward NaN on the first step, across both fp16 and bf16, with explicit classifier head re-init and gradient clipping). RoBERTa-base trained stably with the same training script and is what's actually published here. The repo name is kept for URL stability.

This model is NOT a HIPAA-compliance tool. It detects 18 PHI entity types as defined in the HIPAA Safe Harbor rule. Whether your overall system, infrastructure, and data flow are HIPAA-compliant is a regulatory determination that no ML model can make. Use this model as a building block, not as a compliance verdict.

TL;DR

Property	Value
Task	Token classification over 18 HIPAA Safe Harbor PHI entity types (BIO tags, 37 labels)
Base model	FacebookAI/roberta-base (125M params)
Training data	8,594 train / 952 val synthetic examples (after cleanup), generated via `gpt-4o-mini-2024-07-18`
Test data	548 synthetic examples generated by Claude — different generation prompt to mitigate train/test leakage
Val macro F1	85.61%
Val weighted F1	91.12%
Test macro F1 vs frontier	63.01% vs claude-sonnet-4-6 89.46% — frontier wins on aggregate
Where SLM beats frontier	LOC 0.82 vs 0.29–0.33, AGE_OVER_89 0.98 vs 0.84–0.97
Where SLM matches frontier	NAME, DATE, PHONE, FAX, IP: all within about 5 pp of the best frontier score
Where SLM loses badly	MRN (0.28 vs 1.00), LICENSE (0.17 vs 0.93–1.00), HEALTH_PLAN (0.26 vs 0.72–0.98), attributed to structured-ID memorization
Test latency	28.6 ms/example on CPU — 45× faster than Claude Haiku, 70× faster than Sonnet
Per-inference cost	$0 self-hosted vs $1.00–$2.53 per 1K for frontier APIs
License	MIT

Why this model exists and where to actually use it

PHI detection over inbound patient text is one of those tasks where the obvious 2026 answer — "throw a frontier LLM at it" — is correct on aggregate accuracy but expensive in latency, dollars, and data residency. A self-hosted small model is the inverse: free per inference, ~30 ms per example on a CPU, never sends patient text to a third party.

The interesting result from the benchmark below is that the trade-off is not uniform across entity types:

For linguistic entities (geographic locations, ages, person names, dates, phone numbers, fax numbers, IP addresses), this model beats Claude Haiku 4.5, Claude Sonnet 4.6, and GPT-4o on locations and ages over 89, and stays within about 5 F1 points of the best of them on the rest.
For structured-ID entities (medical record numbers, health-plan IDs, license numbers, device serial numbers, biometric references), frontier models dominate, most likely because they have memorized far more ID format conventions than roughly 8,600 synthetic examples can teach.

The honest production recipe is hybrid: use this model for the seven categories where it ties or beats frontier, and pair it with regex matchers and/or a frontier-API fallback for the structured-ID categories. The cost and latency of the hybrid are dominated by this small model; the frontier API only runs when a structured-ID hit is suspected.

This is model 2 of the ClarioScope SLM Suite:

Intent classifier (clarioscope-intent-deberta-v1) — what does this message want? (91.16% accuracy, 22× faster than frontier)
PHI detector (this model) — where is protected information in this message?
Insurance extractor (in development) — what billing-relevant structured data is in this message?

The 18 entity types (HIPAA Safe Harbor)

Label	What it captures
`NAME`	Patient, family member, or provider names
`LOC`	Geographic subdivisions smaller than a state — street, city, county, ZIP
`DATE`	Calendar dates associated with an individual (DOB, appointment dates)
`PHONE`	Telephone numbers
`FAX`	Fax numbers
`EMAIL`	Email addresses
`SSN`	Social Security numbers (full or partial)
`MRN`	Medical record numbers
`HEALTH_PLAN`	Health-plan / insurance member IDs
`ACCOUNT`	Account numbers (financial, patient billing)
`LICENSE`	Certificate or license numbers
`VEHICLE`	Vehicle identifiers (license plate, VIN)
`DEVICE`	Medical device identifiers and serial numbers
`URL`	Personal URLs
`IP`	IP addresses
`BIOMETRIC`	Text references to biometric identifiers (fingerprint IDs, voiceprint IDs)
`PHOTO_REF`	Text references to a full-face photograph or comparable image
`AGE_OVER_89`	Ages over 89 only (ages 89 and under are not PHI under Safe Harbor)

The model outputs BIO-style token labels (37 total: O + 18 × {B-, I-}), which downstream code converts back into character-offset spans.
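
As a minimal sketch of how the 37-label space decomposes, the snippet below builds a BIO label list from the 18 entity types. The ordering here is an assumption made for illustration; the authoritative mapping is the `id2label` dictionary in the published model config.

```python
# The 18 Safe Harbor entity types from the table above.
ENTITY_TYPES = [
    "NAME", "LOC", "DATE", "PHONE", "FAX", "EMAIL", "SSN", "MRN",
    "HEALTH_PLAN", "ACCOUNT", "LICENSE", "VEHICLE", "DEVICE", "URL",
    "IP", "BIOMETRIC", "PHOTO_REF", "AGE_OVER_89",
]


def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types."""
    labels = ["O"]
    for ent in entity_types:
        labels.append(f"B-{ent}")
        labels.append(f"I-{ent}")
    return labels


# Hypothetical ordering; the real model's config.id2label may differ.
LABELS = build_bio_labels(ENTITY_TYPES)
ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

print(len(LABELS))  # 37
```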

Architecture

Standard RoBERTa-base encoder with a token-classification head: a linear layer over the contextualized representation of each token, producing 37 logits per token. All 125M parameters are fine-tuned. Training uses fp32 (not mixed precision — see the naming note at the top), the explicit adamw_torch optimizer with max_grad_norm=1.0, a re-initialized classifier head (std=0.02 normal, zero bias), batch size 8, sequence length 256 tokens, learning rate 1e-5 with cosine schedule and 10% warmup, weight decay 0.01, and early stopping (patience 2 on macro F1). Five epochs run in ~5 minutes on a single RTX A4000.
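
To make the learning-rate schedule concrete, here is a rough sketch in plain Python. It assumes linear warmup to the peak rate followed by a half-cosine decay to zero, one optimizer step per batch (no gradient accumulation), and a full five-epoch run over the 8,594 training examples. The step counts are derived from those assumptions, not taken from the training logs, and early stopping would shorten the run.

```python
import math

PEAK_LR = 1e-5
BATCH_SIZE = 8
TRAIN_EXAMPLES = 8594
EPOCHS = 5
WARMUP_RATIO = 0.10

# Assumption: the last partial batch of each epoch is kept.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(total_steps * WARMUP_RATIO)


def lr_at(step):
    """Learning rate at a given optimizer step under the assumed schedule."""
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak rate.
        return PEAK_LR * step / max(1, warmup_steps)
    # Half-cosine decay from the peak rate down to 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(total_steps, warmup_steps)  # 5375 537
```

Under these assumptions the rate peaks at step 537 and reaches zero at step 5,375.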

The training script also runs a one-batch sanity probe before letting the Trainer take over, catching any NaN-gradient or label-shape issues before they consume 5 minutes of compute.

Training data — synthetic, transparent about it

All training and evaluation data is synthetic. There is no real patient data in this model or its evaluation. Using synthetic data for v1 sidesteps HIPAA constraints entirely and ships a fast first iteration. A v2 trained on real PHI would require HIPAA-eligible training infrastructure (AWS SageMaker or Azure ML with a Business Associate Agreement); that's a separate, more careful project.

Training set (8,594 examples after data cleanup). Generated via the OpenAI API (gpt-4o-mini-2024-07-18, JSON-object response format, temperature 1.0) across 18 entity types × healthcare practice types × channels. Each generation produces both the inquiry text and a list of entity annotations as {text, label} pairs; a post-processing step locates each entity text in the inquiry and assigns character offsets. The generation prompt enforces a realism mix — roughly 40% polished, 40% casual, 20% messy — and channel-specific scaling (SMS messiest, voicemail second messiest, email and web forms cleaner).
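
The offset-assignment step described above could look roughly like the following. This is an illustrative sketch, not the repo's actual post-processing code: the function name `locate_entities` and the left-to-right search strategy are assumptions.

```python
def locate_entities(inquiry, entities):
    """Attach character offsets to {text, label} annotations.

    Searches left to right so that a repeated string maps to successive
    occurrences. Annotations whose text cannot be found are skipped.
    """
    located = []
    cursor = 0
    for ent in entities:
        if not ent["text"]:
            continue
        start = inquiry.find(ent["text"], cursor)
        if start == -1:
            # Annotations may not be in reading order; retry from the start.
            start = inquiry.find(ent["text"])
        if start == -1:
            continue
        end = start + len(ent["text"])
        located.append({**ent, "start": start, "end": end})
        cursor = max(cursor, end)
    return located


inquiry = "Call Ana at 555-1234. Ana's MRN is 8472301."
annotations = [
    {"text": "Ana", "label": "NAME"},
    {"text": "555-1234", "label": "PHONE"},
    {"text": "Ana", "label": "NAME"},
    {"text": "8472301", "label": "MRN"},
]
for ent in locate_entities(inquiry, annotations):
    print(ent)
```

The moving cursor is what lets the two "Ana" annotations land on different occurrences instead of both resolving to the first one.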

Data cleanup. A first version of this model trained on raw generated annotations scored only 0.57 macro F1 on test. Investigating the gap exposed a pervasive label-noise pattern: the LLM was returning entity texts that included the cue word that introduced the entity ("MRN 8472301" labeled as the MRN span, "SSN" labeled as the SSN span, "phone 555-1234" as the PHONE span). The training data had 1,676 entities like this — about 8.6% of all spans — while the test set, generated by a different model with a stricter prompt, had only 2. The cleanup script (clean_data.py in the repo) strips known cue-word prefixes from entity texts, re-locates the cleaned text in the inquiry, and drops entities that no longer have a valid span. Importantly, it preserves natural prefix characters like the opening ( in (617) 555-1234.
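
A hedged sketch of that cleanup logic is below. It is not a copy of clean_data.py: the cue-word list, the regex, and the `clean_entity` name are assumptions chosen to reproduce the behavior described above (strip a leading cue word, re-locate the span, drop it if nothing valid remains, and leave natural prefixes such as an opening parenthesis alone).

```python
import re

# Hypothetical cue-word list; the real one lives in clean_data.py.
CUE_WORDS = ["mrn", "ssn", "phone", "fax", "email", "member id", "dob"]

# A cue word only counts when followed by whitespace, ":" or "#", or end of
# string, so an ID such as "MRN-44291" is left untouched.
CUE_PREFIX = re.compile(
    r"^(?:%s)(?:[:#]?\s+|[:#]|$)" % "|".join(re.escape(w) for w in CUE_WORDS),
    re.IGNORECASE,
)


def clean_entity(inquiry, entity):
    """Strip a leading cue word and re-locate the span; None means drop it."""
    cleaned = CUE_PREFIX.sub("", entity["text"]).strip()
    if not cleaned:
        return None  # the annotation was only a cue word, e.g. "SSN"
    start = inquiry.find(cleaned)  # simplification: first occurrence only
    if start == -1:
        return None  # cleaned text no longer appears in the inquiry
    return {"text": cleaned, "label": entity["label"],
            "start": start, "end": start + len(cleaned)}


inquiry = "My MRN 8472301, call (617) 555-1234. SSN on file."
print(clean_entity(inquiry, {"text": "MRN 8472301", "label": "MRN"}))
print(clean_entity(inquiry, {"text": "(617) 555-1234", "label": "PHONE"}))
print(clean_entity(inquiry, {"text": "SSN", "label": "SSN"}))
```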

Test set (548 examples). Generated by Claude with a deliberately different prompt style. This cross-model split mitigates the failure mode where train and test come from the same generation distribution and the benchmark inflates. The test set also has tighter, more uniform structured-ID formats than the training set, which is partly responsible for the train/test gap on structured-ID entities (the model never saw the specific test-set ID formats during training).

Benchmark — entity-level F1, span boundaries matter

Evaluated on the 548-example held-out test set using seqeval, which requires both the entity type AND the exact span boundary to match for a true positive. Latency is wall-clock single-example latency. Cost is API-token spend per 1,000 inferences.
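
To illustrate what strict matching means in practice, the sketch below scores predictions as exact `(label, start, end)` tuples. It is a simplified stand-in and not seqeval itself, which operates on tag sequences and reports per-type and averaged scores. The function name and the example spans are made up for illustration.

```python
def span_prf(gold, pred):
    """Micro precision, recall, and F1 over exact (label, start, end) matches."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = [("NAME", 0, 3), ("MRN", 10, 17)]
# The MRN prediction is over-extended to include the cue word, so it counts
# as both a false positive and a false negative.
pred = [("NAME", 0, 3), ("MRN", 6, 17)]
print(span_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

A boundary that is off by a few characters is penalized twice, which is why span-boundary errors on unfamiliar ID formats pull entity-level F1 down so sharply.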

Model	Macro F1	Weighted F1	Latency / example	Cost / 1K inferences
`raihan-js/clarioscope-phi-deberta-v1` (CPU)	0.6301	0.7639	28.6 ms	$0.00
`claude-haiku-4-5-20251001`	0.8492	0.9213	1294 ms	$1.00
`claude-sonnet-4-6`	0.8946	0.9396	1980 ms	$2.53
`gpt-4o-2024-11-20`	0.8094	0.8912	1111 ms	$1.64
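
The speedup and per-call cost figures follow directly from the table. A small sketch of the arithmetic, using only the numbers reported above:

```python
SLM_LATENCY_MS = 28.6

# Latency and cost columns from the benchmark table.
FRONTIER = {
    "claude-haiku-4-5-20251001": {"latency_ms": 1294, "cost_per_1k": 1.00},
    "claude-sonnet-4-6": {"latency_ms": 1980, "cost_per_1k": 2.53},
    "gpt-4o-2024-11-20": {"latency_ms": 1111, "cost_per_1k": 1.64},
}


def speedup(frontier_latency_ms, slm_latency_ms=SLM_LATENCY_MS):
    """How many times faster the local model is for a single example."""
    return frontier_latency_ms / slm_latency_ms


for name, row in FRONTIER.items():
    per_call = row["cost_per_1k"] / 1000
    print(f"{name}: {speedup(row['latency_ms']):.1f}x faster, ${per_call:.5f} per call")
# claude-haiku-4-5-20251001: 45.2x faster, $0.00100 per call
# claude-sonnet-4-6: 69.2x faster, $0.00253 per call
# gpt-4o-2024-11-20: 38.8x faster, $0.00164 per call
```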

The aggregate F1 favors frontier models. What's actually happening underneath is more interesting than the macro number suggests.

Per-entity breakdown — where small wins, where small loses

Entities are grouped by what kind of feature they require the model to recognize.

Linguistic entities — small model matches or beats frontier:

Entity	This model	Haiku	Sonnet	GPT-4o
`PHONE`	0.983	1.000	0.994	1.000
`AGE_OVER_89`	0.976	0.967	0.967	0.836
`NAME`	0.961	0.996	0.994	0.980
`IP`	0.949	1.000	1.000	0.967
`FAX`	0.949	1.000	0.984	1.000
`DATE`	0.945	0.949	0.970	0.909
`LOC`	0.818	0.328	0.289	0.301

LOC is the standout. The fine-tuned model scores roughly 2.5–2.8× the frontier APIs' F1 on geographic locations (0.818 vs 0.289–0.328). Frontier models systematically under-flag informal location mentions ("at the Roxbury location," "she lives in Allston"), plausibly because they are uncertain whether such informal context cues constitute PHI. A specialized model trained explicitly to tag these does not hesitate.

AGE_OVER_89 is another quiet win: narrow over Haiku and Sonnet (0.976 vs 0.967), clearer over GPT-4o (0.836). Frontier models occasionally tag ages 89-and-under as PHI (they aren't, under Safe Harbor) or miss the "over 89" qualifier ("she's 96") that determines whether the age is reportable. The fine-tuned model learned the rule directly.

Structured-ID entities — frontier wins, often by a wide margin:

Entity	This model	Haiku	Sonnet	GPT-4o
`MRN`	0.276	1.000	1.000	0.997
`LICENSE`	0.170	1.000	1.000	0.933
`HEALTH_PLAN`	0.264	0.855	0.983	0.717
`BIOMETRIC`	0.095	0.410	1.000	0.314
`DEVICE`	0.341	0.732	1.000	0.800
`VEHICLE`	0.640	1.000	1.000	0.970
`SSN`	0.583	0.983	0.949	0.915
`ACCOUNT`	0.759	0.985	0.969	1.000
`EMAIL`	0.815	1.000	1.000	1.000
`URL`	0.738	0.967	0.967	0.931

Structured-ID entities follow surface conventions that vary widely across institutions: an MRN might look like OMK-44291, RMR-882034, DENT-12345-A, or just 8472301. The model can only learn the patterns it saw during training; if the test-set MRNs use a slightly different convention (and they do, because the test set was generated by a different model with a different ID-format style), the fine-tuned model produces over- or under-extended span boundaries even when it correctly identifies that "something MRN-like is here."

Frontier models win these categories most likely because they've seen a much wider distribution of ID formats during pretraining, and because they appear better at anchoring an ID span to its context cue ("MRN" or "member ID") regardless of the specific token pattern that follows.

Hard for everyone: PHOTO_REF. "She sent a photo of her rash" is technically PHI under Safe Harbor (photographic references), but neither this model nor any frontier API tags it reliably. F1 is 0.08 / 0.12 / 0.04 / 0.00, listed in the column order of the tables above (this model, then Haiku, Sonnet, GPT-4o). The category is borderline and the data sparse.

Recommended production pattern: hybrid

Given the per-entity breakdown, the right production architecture is not "use this model alone" or "use a frontier API alone." It's:

Run this model first for every inbound message. ~30 ms, $0, never sends text off-host. Captures NAME, LOC, DATE, PHONE, FAX, IP, AGE_OVER_89 reliably.
Add regex matchers for the highly-structured patterns the model misses: SSN (\d{3}-\d{2}-\d{4}), basic MRN patterns, account numbers. Regex is fast, free, and brittle but correct when it matches.
Fall back to a frontier API only when the message contains likely structured-ID content the local pipeline didn't resolve. This pays the latency and dollar cost only on a small fraction of traffic.
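
The three steps above can be sketched as a small routing layer. Everything here is illustrative: the cue-word list, the set of structured labels, and the function names are assumptions, and `model_spans` stands for whatever span dictionaries the local model produces. The sketch does not de-duplicate overlapping spans from the model and the regex.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# Hypothetical cue list; tune it to the ID vocabulary of your own traffic.
ID_CUE_RE = re.compile(r"\b(?:mrn|member id|policy|license|serial)\b", re.IGNORECASE)
STRUCTURED_LABELS = {"SSN", "MRN", "HEALTH_PLAN", "LICENSE", "DEVICE", "ACCOUNT"}


def regex_spans(text):
    """Step 2: cheap pattern matchers (only SSN shown here)."""
    return [{"text": m.group(), "label": "SSN", "start": m.start(), "end": m.end()}
            for m in SSN_RE.finditer(text)]


def needs_frontier_fallback(text, spans):
    """Step 3: an ID cue word appears but no structured-ID span was resolved."""
    has_cue = ID_CUE_RE.search(text) is not None
    resolved = any(s["label"] in STRUCTURED_LABELS for s in spans)
    return has_cue and not resolved


def hybrid_detect(text, model_spans):
    """Combine local model spans (step 1) with regex hits, then decide routing."""
    spans = list(model_spans) + regex_spans(text)
    return spans, needs_frontier_fallback(text, spans)


spans, escalate = hybrid_detect("My MRN is OMK-44291, please call back.", [])
print(spans, escalate)  # [] True
```

Only messages where `escalate` is true would be sent to a frontier API; everything else stays on the host.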

For a practice receiving 10,000 messages a day where ~10% have unresolved structured-ID content, the hybrid sends 1,000 calls/day to Haiku (about $1.00/day at the benchmarked $1.00 per 1K inferences) instead of 10,000 (about $10.00/day), and most messages never leave the host at all.

How to use

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_id = "raihan-js/clarioscope-phi-deberta-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
model.eval()

text = "Hi Dr. Okafor, this is Iniko Adeleke, DOB 11/03/1985. My phone is 312.555.7820, email jordan@workmail.io. I live in Brookline."

enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt", truncation=True, max_length=256)
offsets = enc.pop("offset_mapping")[0].tolist()
with torch.no_grad():
    pred_ids = model(**enc).logits.argmax(dim=-1)[0].tolist()

# Decode BIO tags back to character spans
id2label = model.config.id2label
spans = []
i = 0
while i < len(pred_ids):
    label = id2label[pred_ids[i]]
    start, end = offsets[i]
    # Special tokens (<s>, </s>) carry a (0, 0) offset; never start a span on them
    if label.startswith("B-") and end > start:
        ent_type = label[2:]
        j = i + 1
        # Extend the span over the following I- tags of the same entity type
        while j < len(pred_ids) and id2label[pred_ids[j]] == f"I-{ent_type}":
            end = max(end, offsets[j][1])
            j += 1
        spans.append({"text": text[start:end], "label": ent_type, "start": start, "end": end})
        i = j
    else:
        # "O" tags and stray I- tags without a preceding B- are skipped
        i += 1

for s in spans:
    print(s)
# {'text': 'Dr. Okafor', 'label': 'NAME', ...}
# {'text': 'Iniko Adeleke', 'label': 'NAME', ...}
# {'text': '11/03/1985', 'label': 'DATE', ...}
# {'text': '312.555.7820', 'label': 'PHONE', ...}
# {'text': 'jordan@workmail.io', 'label': 'EMAIL', ...}
# {'text': 'Brookline', 'label': 'LOC', ...}
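
Once spans are decoded, a redaction step might look like the sketch below. The `redact` helper is hypothetical and not part of the published repo. It assumes non-overlapping spans and works on any list of dictionaries with `label`, `start`, and `end` keys, such as the `spans` list built above.

```python
def redact(text, spans):
    """Replace each span with a [LABEL] placeholder.

    Spans are applied right to left so earlier offsets stay valid.
    Assumes the spans do not overlap.
    """
    out = text
    for s in sorted(spans, key=lambda s: s["start"], reverse=True):
        out = out[:s["start"]] + f"[{s['label']}]" + out[s["end"]:]
    return out


example = "Call Iniko at 312.555.7820."
example_spans = [
    {"label": "NAME", "start": 5, "end": 10},
    {"label": "PHONE", "start": 14, "end": 26},
]
print(redact(example, example_spans))  # Call [NAME] at [PHONE].
```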

Limitations

All training and evaluation data is synthetic. This model has not been evaluated on real patient inquiries. Production deployment should include a calibration check against a sample of real inbound text.
English only.
Structured-ID weakness. Per the benchmark above, this model is materially worse than frontier APIs on MRN, LICENSE, HEALTH_PLAN, BIOMETRIC, and several others. Do not deploy it as a sole PHI redactor for these categories — pair with regex matchers or a frontier-API fallback.
Span-boundary brittleness on novel formats. When the model sees an ID format it didn't encounter during training, it tends to produce span boundaries that include or exclude one or two tokens incorrectly — enough to register as an entity-level miss under seqeval's strict matching.
PHOTO_REF is unreliable for every model on this benchmark, including this one. The category itself is borderline.
Synthetic-data style bias. Despite using two different LLMs for train and test, both are LLM-generated and may share systematic biases (overconfident phrasing, well-formed scenarios) that don't fully match real-world tail distribution.

Intended use

A first-pass PHI tagger in a hybrid redaction pipeline for healthcare practice intake software. Strongest on the linguistic entity categories (NAME, LOC, DATE, PHONE, FAX, IP, AGE_OVER_89). Should be paired with regex matchers and/or a frontier-API fallback for structured-ID entities.

Out-of-scope use

HIPAA compliance verification. Detecting PHI entity types and certifying HIPAA compliance are different problems. This model only does the first.
Sole reliance for high-stakes redaction. A model that scores 0.28 F1 on MRNs under strict span matching is not a complete redactor on its own.
Adversarial inputs. Not hardened against prompt injection or adversarial text.
Non-healthcare text. Trained only on healthcare practice inbound text; not evaluated on personal identifiers in other domains.

Citation

@misc{sikder2026clarioscope_phi,
  author = {Sikder, Akteruzzaman Raihan},
  title  = {ClarioScope PHI detector v1: a 125M-parameter RoBERTa fine-tune for Safe Harbor PHI span detection},
  year   = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/raihan-js/clarioscope-phi-deberta-v1}},
}

A detailed methodology writeup — including the data-cleanup story, the cross-model train/test split, and the per-entity benchmark interpretation — is on dev.to: Where small models beat frontier LLMs (and where they don't): a 125M PHI detector.

Author

Built by Akteruzzaman Raihan Sikder — CTO, ClarioScope AI. Part of the broader ClarioScope SLM Suite (intent classifier, PHI detector, insurance extractor) — a three-model intake intelligence pipeline.
