pii-modernbert-large-v2

A PII/PHI named-entity recognition (NER) model, fine-tuned from answerdotai/ModernBERT-large on Vrandan/pii-harmonized-corpus-v2, the v2 English-only corpus that combines harmonized and synthetically augmented data.

Eval (held-out test split)

Metric	Value
F1	0.5975
Precision	0.5341
Recall	0.6780
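
As a quick consistency check, F1 should be the harmonic mean of precision and recall. This holds when all three figures are computed over the same set of predictions (for example, micro-averaged); the card does not state the averaging mode, so treat that as an assumption.

```python
precision = 0.5341
recall = 0.6780

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.5975
```

The result matches the reported F1, so the three values are at least mutually consistent.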

Architecture vs v1

46 ML labels (was 49) — 3 dropped to regex layer (HTTP_COOKIE, MAC_ADDRESS, BLOOD_TYPE)
BILOU tagging (was BIO) — 1 + 4×46 = 185 labels
English-only training data (v1's training data leaked multilingual text)
Rebalanced spans: PERSON capped at 150K, rare classes floored
Synthetic data blended (Tier-A six-failure-mode + Tier-B malformation training)
Class weight cap 1.5× mean (was 3×)
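
The label count in the BILOU item above can be reproduced with a short sketch. The entity names here are placeholders, since the card does not list the 46 types, and the helper name `bilou_labels` is hypothetical.

```python
def bilou_labels(entity_types):
    """Build a BILOU tag set: one shared "O" tag plus B-, I-, L-, U- per entity type."""
    labels = ["O"]
    for entity in entity_types:
        for prefix in ("B", "I", "L", "U"):
            labels.append(f"{prefix}-{entity}")
    return labels


# Placeholder names: the real 46 entity types are not listed in this card.
entity_types = [f"TYPE_{i}" for i in range(46)]
labels = bilou_labels(entity_types)
print(len(labels))  # 185
```

Under BIO the same 46 types would need only 1 + 2×46 = 93 labels; BILOU adds explicit last-token (L) and single-token (U) tags.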

Recipe

Context: native 8192 tokens (ModernBERT alternating attention)
Optimizer: AdamW (fused on CUDA), cosine LR schedule, peak LR 2e-05, warmup ratio 0.1
Effective batch: 32 (per-device 8 × grad-accum 4 × world 1)
Precision: bf16, gradient checkpointing (use_reentrant=False), SDPA
Loss: class-weighted CE, capped at 1.5× mean (rebalanced data)
Epochs: 3, early-stop patience 4
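
One way the capped class weighting in the loss could be computed is sketched below. This is not the author's training code. Inverse-frequency weighting, the function name `capped_class_weights`, the label `B-RARE`, and the counts are all assumptions; only the 1.5× mean cap comes from the card.

```python
def capped_class_weights(counts, cap=1.5):
    """Inverse-frequency class weights, clipped at `cap` times the mean raw weight."""
    total = sum(counts.values())
    # Assumption: raw weight is inversely proportional to class frequency.
    raw = {label: total / n for label, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {label: min(weight, cap * mean) for label, weight in raw.items()}


# Hypothetical token counts, for illustration only.
counts = {"O": 9000, "B-PERSON": 800, "B-RARE": 200}
weights = capped_class_weights(counts)
for label, weight in weights.items():
    print(label, round(weight, 3))
```

Here only the rarest class hits the cap; the other weights pass through unchanged. Weights like these could then be passed, in label-id order, as the `weight` tensor of `torch.nn.CrossEntropyLoss`.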

Inference

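from transformers import pipeline

# The repository is gated: accept its conditions on the model page and
# authenticate with a Hugging Face token before loading.
pii = pipeline(
    "token-classification",
    model="Vrandan/pii-modernbert-large-v2",
    aggregation_strategy="simple",  # group adjacent tokens into entity spans
)
entities = pii("Patient John Smith, MRN-2024-88432, called 555-FAKE-1234 about Rx refill.")
for entity in entities:
    print(entity["entity_group"], entity["word"], entity["start"], entity["end"])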

Hybrid pipeline note

This model is the ML half of a regex+ML hybrid. Merge its inference output with the regex layer's matches for the 3 dropped labels (HTTP_COOKIE, MAC_ADDRESS, BLOOD_TYPE), which the model is trained not to fire on.
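
A minimal sketch of such a merge follows. The actual regex layer is not published in this card, so the patterns, the function names, and the precedence rule (regex hits win over overlapping ML spans) are illustrative assumptions. HTTP_COOKIE is omitted because its format depends heavily on context. The entity dicts mirror the `entity_group`, `start`, `end`, and `word` keys that the token-classification pipeline returns with an aggregation strategy.

```python
import re

# Illustrative patterns only; not the project's real regex layer.
REGEX_PATTERNS = {
    "MAC_ADDRESS": re.compile(r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b"),
    "BLOOD_TYPE": re.compile(r"\b(?:AB|A|B|O)[+-](?!\w)"),
}


def regex_entities(text):
    """Return regex matches as dicts shaped like pipeline output."""
    found = []
    for label, pattern in REGEX_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append({
                "entity_group": label,
                "start": match.start(),
                "end": match.end(),
                "word": match.group(),
            })
    return found


def merge_entities(ml_entities, rx_entities):
    """Keep every regex hit, drop ML spans overlapping one, and sort by start offset."""
    def overlaps(a, b):
        return a["start"] < b["end"] and b["start"] < a["end"]

    kept = [e for e in ml_entities if not any(overlaps(e, r) for r in rx_entities)]
    return sorted(kept + rx_entities, key=lambda e: e["start"])


text = "John Smith, blood type O+, device 00:1A:2B:3C:4D:5E."
# Stand-in for the model's output; in practice this would be pii(text).
ml_entities = [{"entity_group": "PERSON", "start": 0, "end": 10, "word": "John Smith"}]
merged = merge_entities(ml_entities, regex_entities(text))
print([e["entity_group"] for e in merged])  # ['PERSON', 'BLOOD_TYPE', 'MAC_ADDRESS']
```

Giving the regex layer precedence is one possible policy; a deployment could equally prefer the longer span or the higher-confidence source.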

Safetensors: model size 0.4B params, tensor type BF16
