# PII Detection Model | 149M Parameters | Open Source
OpenMed-PII-BioClinicalModern-Base-149M-v1 is a transformer-based token classification model fine-tuned for Personally Identifiable Information (PII) detection in text. This model identifies and classifies 54 types of sensitive information including names, addresses, SSNs, medical record numbers, and more.
Evaluated on a stratified 2,000-sample test set from NVIDIA Nemotron-PII:
| Metric | Score |
|---|---|
| Micro F1 | 0.9509 |
| Precision | 0.9611 |
| Recall | 0.9409 |
| Macro F1 | 0.9523 |
| Weighted F1 | 0.9489 |
| Accuracy | 0.9932 |
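The gap between micro and macro F1 reflects how the two averages weight entity types: micro F1 pools true/false positives across all classes, while macro F1 averages per-class scores so rare classes count as much as frequent ones. A minimal sketch with illustrative counts (these are not the model's actual confusion statistics):

```python
# Micro vs. macro F1 from hypothetical per-entity counts.
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# entity -> (true positives, false positives, false negatives)
counts = {
    "email":      (740, 5, 5),
    "occupation": (362, 152, 326),
}

# Micro F1: pool counts across entities, then compute F1 once.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = f1(tp, fp, fn)

# Macro F1: compute F1 per entity, then average unweighted.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

print(f"micro={micro_f1:.3f} macro={macro_f1:.3f}")
```

Because a weak, high-support class drags both averages in different ways, reporting both (as the table above does) gives a fuller picture than either alone.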
| Rank | Model | F1 | Precision | Recall |
|---|---|---|---|---|
| 1 | OpenMed-PII-SuperClinical-Large-434M-v1 | 0.9608 | 0.9685 | 0.9532 |
| 2 | OpenMed-PII-BigMed-Large-560M-v1 | 0.9604 | 0.9644 | 0.9565 |
| 3 | OpenMed-PII-EuroMed-210M-v1 | 0.9600 | 0.9681 | 0.9521 |
| 4 | OpenMed-PII-SnowflakeMed-568M-v1 | 0.9594 | 0.9640 | 0.9548 |
| 5 | OpenMed-PII-SuperMedical-Large-355M-v1 | 0.9592 | 0.9632 | 0.9553 |
| 6 | OpenMed-PII-ClinicalBGE-568M-v1 | 0.9587 | 0.9636 | 0.9538 |
| 7 | OpenMed-PII-mClinicalE5-Large-560M-v1 | 0.9582 | 0.9631 | 0.9533 |
| 8 | OpenMed-PII-ModernMed-Large-395M-v1 | 0.9579 | 0.9639 | 0.9520 |
| 9 | OpenMed-PII-BioClinicalModern-Large-395M-v1 | 0.9579 | 0.9656 | 0.9502 |
| 10 | OpenMed-PII-ClinicalE5-Large-335M-v1 | 0.9577 | 0.9604 | 0.9550 |
The best-performing entity types:

| Entity | F1 | Precision | Recall | Support |
|---|---|---|---|---|
| biometric_identifier | 0.998 | 0.996 | 1.000 | 228 |
| credit_debit_card | 0.998 | 0.995 | 1.000 | 213 |
| race_ethnicity | 0.997 | 1.000 | 0.995 | 193 |
| blood_type | 0.996 | 0.993 | 1.000 | 133 |
| email | 0.993 | 0.993 | 0.993 | 745 |
These entity types have lower performance and may benefit from additional post-processing:
| Entity | F1 | Precision | Recall | Support |
|---|---|---|---|---|
| unique_id | 0.889 | 0.919 | 0.861 | 79 |
| education_level | 0.875 | 0.916 | 0.837 | 196 |
| fax_number | 0.856 | 0.786 | 0.939 | 98 |
| time | 0.848 | 0.886 | 0.813 | 460 |
| occupation | 0.602 | 0.704 | 0.526 | 688 |
This model detects 54 PII entity types organized into categories:
| Entity | Description |
|---|---|
| account_number | Account Number |
| api_key | API Key |
| bank_routing_number | Bank Routing Number |
| certificate_license_number | Certificate / License Number |
| credit_debit_card | Credit / Debit Card |
| cvv | CVV |
| employee_id | Employee ID |
| health_plan_beneficiary_number | Health Plan Beneficiary Number |
| mac_address | MAC Address |
| medical_record_number | Medical Record Number |
| ... | and 6 more |
| Entity | Description |
|---|---|
| age | Age |
| biometric_identifier | Biometric Identifier |
| blood_type | Blood Type |
| date_of_birth | Date of Birth |
| education_level | Education Level |
| first_name | First Name |
| last_name | Last Name |
| gender | Gender |
| language | Language |
| occupation | Occupation |
| ... | and 4 more |
| Entity | Description |
|---|---|
| email | Email |
| phone_number | Phone Number |
| fax_number | Fax Number |
| url | URL |
| Entity | Description |
|---|---|
| city | City |
| coordinate | Coordinate |
| country | Country |
| county | County |
| state | State |
| street_address | Street Address |
| Entity | Description |
|---|---|
| device_identifier | Device Identifier |
| ipv4 | IPv4 |
| ipv6 | IPv6 |
| Entity | Description |
|---|---|
| date | Date |
| date_time | Date Time |
| time | Time |
| Entity | Description |
|---|---|
| company_name | Company Name |
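De-identification policy often differs by category rather than by individual entity type (e.g., dates may be generalized or shifted rather than removed outright). A sketch of routing entity types to category-level handlers; the grouping dict below is an abridged assumption based on the tables above, not part of the model:

```python
# Hypothetical entity-type -> category routing (abridged).
CATEGORY = {
    "medical_record_number": "identifiers",
    "credit_debit_card": "identifiers",
    "first_name": "demographics",
    "date_of_birth": "demographics",
    "email": "contact",
    "phone_number": "contact",
    "street_address": "location",
    "ipv4": "network",
    "date": "temporal",
    "company_name": "organization",
}

def policy_for(entity_group):
    # Unknown types fall back to the strictest handling.
    cat = CATEGORY.get(entity_group, "identifiers")
    # Example policy: temporal fields are generalized, everything else removed.
    return "generalize" if cat == "temporal" else "remove"

print(policy_for("date"))        # generalize
print(policy_for("first_name"))  # remove
```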
```python
from transformers import pipeline

# Load the PII detection pipeline
ner = pipeline(
    "ner",
    model="openmed/OpenMed-PII-BioClinicalModern-Base-149M-v1",
    aggregation_strategy="simple",
)

text = """
Patient John Smith (DOB: 03/15/1985, SSN: 123-45-6789) was seen today.
Contact: john.smith@email.com, Phone: (555) 123-4567.
Address: 456 Oak Street, Boston, MA 02108.
"""

entities = ner(text)
for entity in entities:
    print(f"{entity['entity_group']}: {entity['word']} (score: {entity['score']:.3f})")
```
```python
def redact_pii(text, entities):
    """Replace detected PII spans with their entity-type labels."""
    # Sort entities by start position (descending) so earlier character
    # offsets remain valid while replacements are spliced in.
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)
    redacted = text
    for ent in sorted_entities:
        redacted = redacted[:ent['start']] + f"[{ent['entity_group']}]" + redacted[ent['end']:]
    return redacted

# Apply de-identification
redacted_text = redact_pii(text, entities)
print(redacted_text)
```
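An alternative to blunt redaction is consistent pseudonymization: each distinct PII value maps to a stable surrogate, so repeated mentions stay linked for downstream NLP. A sketch built on the same `entities` output shape (`entity_group`, `word`, `start`, `end`); the surrogate format is an assumption:

```python
def pseudonymize(text, entities):
    """Replace each PII span with a stable surrogate like [FIRST_NAME_1],
    so repeated mentions of the same value share one placeholder."""
    counters, surrogates = {}, {}
    # Assign surrogate numbers in reading order for stable numbering.
    for ent in sorted(entities, key=lambda e: e["start"]):
        key = (ent["entity_group"], ent["word"])
        if key not in surrogates:
            n = counters.get(ent["entity_group"], 0) + 1
            counters[ent["entity_group"]] = n
            surrogates[key] = f"[{ent['entity_group'].upper()}_{n}]"
    # Splice replacements right-to-left so character offsets stay valid.
    out = text
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        out = (out[:ent["start"]]
               + surrogates[(ent["entity_group"], ent["word"])]
               + out[ent["end"]:])
    return out

demo_entities = [
    {"entity_group": "first_name", "word": "John", "start": 0, "end": 4},
    {"entity_group": "first_name", "word": "John", "start": 9, "end": 13},
]
print(pseudonymize("John met John.", demo_entities))
# [FIRST_NAME_1] met [FIRST_NAME_1].
```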
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model_name = "openmed/OpenMed-PII-BioClinicalModern-Base-149M-v1"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

texts = [
    "Contact Dr. Jane Doe at jane.doe@hospital.org",
    "Patient SSN: 987-65-4321, MRN: 12345678",
]

inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Per-token label IDs; map through model.config.id2label for tag names
predictions = torch.argmax(outputs.logits, dim=-1)
```
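The argmax output above is a tensor of label IDs per token; turning it into entity spans still requires mapping IDs through `model.config.id2label` and merging consecutive BIO tags. A minimal sketch of that aggregation step, using a hypothetical label map and pre-tokenized input (the real model's label set and subword handling differ):

```python
# Hypothetical stand-in for model.config.id2label (assumption).
id2label = {0: "O", 1: "B-first_name", 2: "I-first_name", 3: "B-email"}

def aggregate(tokens, pred_ids):
    """Merge per-token BIO predictions into entity spans."""
    spans, current = [], None
    for tok, pid in zip(tokens, pred_ids):
        label = id2label[pid]
        if label.startswith("B-"):
            if current:
                spans.append(current)
            current = {"entity_group": label[2:], "tokens": [tok]}
        elif (label.startswith("I-") and current
              and current["entity_group"] == label[2:]):
            current["tokens"].append(tok)
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans

spans = aggregate(["Contact", "Jane", "Doe", "at"], [0, 1, 2, 0])
print(spans)
# [{'entity_group': 'first_name', 'tokens': ['Jane', 'Doe']}]
```

The `pipeline(..., aggregation_strategy="simple")` path shown earlier performs this merging for you; the manual route is mainly useful when you need custom handling of subword tokens or conflicting tags.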
⚠️ Important: This model is intended as an assistive tool, not a replacement for human review. Entity types such as occupation, time, and sexuality have lower F1 scores, and predictions for them warrant extra scrutiny.

If you use this model, please cite:

```bibtex
@misc{openmed-pii-2026,
  title = {OpenMed-PII-BioClinicalModern-Base-149M-v1: PII Detection Model},
  author = {OpenMed Science},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/openmed/OpenMed-PII-BioClinicalModern-Base-149M-v1}
}
```