OpenMed-PII-BigMed-Large-560M-v1
PII Detection Model | 560M Parameters | Open Source

Model Description
OpenMed-PII-BigMed-Large-560M-v1 is a transformer-based token classification model fine-tuned for Personally Identifiable Information (PII) detection in text. This model identifies and classifies 54 types of sensitive information including names, addresses, SSNs, medical record numbers, and more.
Key Features
- High Accuracy: Achieves strong F1 scores across diverse PII categories
- Comprehensive Coverage: Detects 50+ entity types spanning personal, financial, medical, and contact information
- Privacy-Focused: Designed for de-identification and compliance with HIPAA, GDPR, and other privacy regulations
- Production-Ready: Optimized for real-world text processing pipelines
Performance
Evaluated on a stratified 2,000-sample test set from NVIDIA Nemotron-PII:
| Metric | Score |
|---|---|
| Micro F1 | 0.9604 |
| Precision | 0.9644 |
| Recall | 0.9565 |
| Macro F1 | 0.9647 |
| Weighted F1 | 0.9597 |
| Accuracy | 0.9940 |
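For readers unfamiliar with the aggregate metrics above, micro-averaged precision, recall, and F1 pool true/false positives across all entity types before computing ratios. A minimal sketch (illustrative only, with toy counts; this is not the evaluation script used for the table):

```python
def micro_prf(counts):
    """Compute micro-averaged precision, recall, and F1 from
    per-entity (tp, fp, fn) counts by pooling all entities."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy counts for two entity types: (tp, fp, fn)
p, r, f1 = micro_prf([(90, 5, 10), (40, 5, 10)])
```

Micro averaging weights each prediction equally, so frequent entity types dominate; macro F1 (also reported above) instead averages per-entity F1 scores, weighting each type equally.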
Best Performing Entities
| Entity | F1 | Precision | Recall | Support |
|---|---|---|---|---|
| biometric_identifier | 1.000 | 1.000 | 1.000 | 218 |
| blood_type | 1.000 | 1.000 | 1.000 | 132 |
| credit_debit_card | 1.000 | 1.000 | 1.000 | 211 |
| health_plan_beneficiary_number | 1.000 | 1.000 | 1.000 | 210 |
| medical_record_number | 0.998 | 0.996 | 1.000 | 262 |
Challenging Entities
These entity types have lower performance and may benefit from additional post-processing:
| Entity | F1 | Precision | Recall | Support |
|---|---|---|---|---|
| language | 0.907 | 0.988 | 0.838 | 198 |
| unique_id | 0.907 | 0.919 | 0.895 | 76 |
| sexuality | 0.878 | 0.837 | 0.923 | 78 |
| time | 0.859 | 0.852 | 0.867 | 450 |
| occupation | 0.679 | 0.724 | 0.639 | 682 |
Supported Entity Types
This model detects 54 PII entity types organized into categories:
Identifiers (16 types)
| Entity | Description |
|---|---|
| account_number | Account Number |
| api_key | API Key |
| bank_routing_number | Bank Routing Number |
| certificate_license_number | Certificate/License Number |
| credit_debit_card | Credit/Debit Card |
| cvv | CVV |
| employee_id | Employee ID |
| health_plan_beneficiary_number | Health Plan Beneficiary Number |
| mac_address | MAC Address |
| medical_record_number | Medical Record Number |
| ... | and 6 more |
Personal Info (14 types)
| Entity | Description |
|---|---|
| age | Age |
| biometric_identifier | Biometric Identifier |
| blood_type | Blood Type |
| date_of_birth | Date of Birth |
| education_level | Education Level |
| first_name | First Name |
| last_name | Last Name |
| gender | Gender |
| language | Language |
| occupation | Occupation |
| ... | and 4 more |
Contact Info (4 types)
| Entity | Description |
|---|---|
| email | Email |
| phone_number | Phone Number |
| fax_number | Fax Number |
| url | URL |
Location (6 types)
| Entity | Description |
|---|---|
| city | City |
| coordinate | Coordinate |
| country | Country |
| county | County |
| state | State |
| street_address | Street Address |
Network Info (3 types)
| Entity | Description |
|---|---|
| device_identifier | Device Identifier |
| ipv4 | IPv4 Address |
| ipv6 | IPv6 Address |
Temporal (3 types)
| Entity | Description |
|---|---|
| date | Date |
| date_time | Date Time |
| time | Time |
Organization (1 type)
| Entity | Description |
|---|---|
| company_name | Company Name |
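The full label inventory can also be recovered programmatically from the model config's `id2label` mapping by stripping the BIO prefixes. A minimal sketch (the mapping below is a mocked subset for illustration; in practice it comes from `AutoModelForTokenClassification.from_pretrained(...).config.id2label`):

```python
# Mocked subset of the model's id2label mapping, for illustration only.
id2label = {
    0: "O",
    1: "B-first_name", 2: "I-first_name",
    3: "B-email", 4: "I-email",
    5: "B-city", 6: "I-city",
}

# Strip the B-/I- prefix and deduplicate to recover the entity types.
entity_types = sorted({lbl.split("-", 1)[1] for lbl in id2label.values() if lbl != "O"})
print(entity_types)  # ['city', 'email', 'first_name']
```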
Usage
Quick Start
```python
from transformers import pipeline

ner = pipeline(
    "ner",
    model="openmed/OpenMed-PII-BigMed-Large-560M-v1",
    aggregation_strategy="simple",
)

text = """
Patient John Smith (DOB: 03/15/1985, SSN: 123-45-6789) was seen today.
Contact: john.smith@email.com, Phone: (555) 123-4567.
Address: 456 Oak Street, Boston, MA 02108.
"""

entities = ner(text)
for entity in entities:
    print(f"{entity['entity_group']}: {entity['word']} (score: {entity['score']:.3f})")
```
De-identification Example
```python
def redact_pii(text, entities):
    """Replace each detected PII span with its entity-type placeholder."""
    # Process entities right-to-left so earlier character offsets stay valid.
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)
    redacted = text
    for ent in sorted_entities:
        redacted = redacted[:ent['start']] + f"[{ent['entity_group']}]" + redacted[ent['end']:]
    return redacted

redacted_text = redact_pii(text, entities)
print(redacted_text)
```
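When downstream auditing needs to reverse the redaction, a variant that keeps a placeholder-to-original mapping may be useful. A minimal sketch, assuming entities in the pipeline's output format (`start`, `end`, `entity_group` keys); the mock text and offsets below are illustrative:

```python
def pseudonymize(text, entities):
    """Replace each detected span with a numbered placeholder and keep
    a mapping so the substitution is reversible for audit purposes."""
    mapping = {}
    redacted = text
    # Process right-to-left so earlier character offsets stay valid.
    for i, ent in enumerate(sorted(entities, key=lambda x: x['start'], reverse=True)):
        tag = f"[{ent['entity_group']}_{i}]"
        mapping[tag] = text[ent['start']:ent['end']]
        redacted = redacted[:ent['start']] + tag + redacted[ent['end']:]
    return redacted, mapping

# Illustrative entities with hand-computed character offsets.
sample = "Call Jane at 555-0100."
sample_entities = [
    {'start': 5, 'end': 9, 'entity_group': 'first_name'},
    {'start': 13, 'end': 21, 'entity_group': 'phone_number'},
]
redacted, mapping = pseudonymize(sample, sample_entities)
```

Note that the mapping itself contains the original PII, so it must be stored as securely as the source text.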
Batch Processing
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model_name = "openmed/OpenMed-PII-BigMed-Large-560M-v1"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

texts = [
    "Contact Dr. Jane Doe at jane.doe@hospital.org",
    "Patient SSN: 987-65-4321, MRN: 12345678",
]

inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Per-token label IDs; map to label strings via model.config.id2label.
predictions = torch.argmax(outputs.logits, dim=-1)
```
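Turning per-token label IDs into entity spans requires grouping consecutive B-/I- tags. A minimal, self-contained sketch over already-decoded token labels and character offsets (the labels and offsets below are mocked inputs; this is not the pipeline's exact aggregation logic):

```python
def bio_to_spans(labels, offsets):
    """Group BIO token labels into (entity_type, char_start, char_end)
    spans. `labels` are per-token strings like 'B-email'; `offsets` are
    (char_start, char_end) pairs for the same tokens."""
    spans = []
    current = None  # (entity_type, start, end) of the span being built
    for label, (start, end) in zip(labels, offsets):
        if label.startswith("B-"):
            if current:
                spans.append(current)
            current = (label[2:], start, end)
        elif label.startswith("I-") and current and label[2:] == current[0]:
            current = (current[0], current[1], end)  # extend the span
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans

# Mocked token labels and character offsets for illustration.
token_labels = ["O", "B-first_name", "B-email", "I-email", "O"]
token_offsets = [(0, 4), (5, 9), (13, 17), (17, 28), (28, 29)]
spans = bio_to_spans(token_labels, token_offsets)
```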
Training Details
Dataset
- Source: NVIDIA Nemotron-PII
- Format: BIO-tagged token classification
- Labels: 106 total (53 entity types × 2 BIO tags + O)
- Splits: 50K train / 5K validation / 45K test
Training Configuration
- Max Sequence Length: 384 tokens
- Label Strategy: First token only (`label_all_tokens=False`)
- Framework: Hugging Face Transformers + Trainer API
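The first-token-only strategy assigns a real label to the first sub-token of each word and `-100` (ignored by the loss) to continuations and special tokens. A sketch over a mocked `word_ids` sequence like the one `tokenizer(...).word_ids()` returns (the label values are illustrative):

```python
def align_labels(word_labels, word_ids):
    """Map word-level labels onto sub-tokens: the first sub-token of a
    word gets the word's label; continuation sub-tokens and special
    tokens get -100 so the loss function ignores them."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:            # special tokens ([CLS], [SEP], padding)
            aligned.append(-100)
        elif wid != previous:      # first sub-token of a new word
            aligned.append(word_labels[wid])
        else:                      # continuation sub-token
            aligned.append(-100)
        previous = wid
    return aligned

# Two words, the second split into two sub-tokens.
aligned = align_labels([1, 3], [None, 0, 1, 1, None])
```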
Intended Use & Limitations
Intended Use
- De-identification: Automated redaction of PII in clinical notes, medical records, and documents
- Compliance: Supporting HIPAA, GDPR, and privacy regulation compliance
- Data Preprocessing: Preparing datasets for research by removing sensitive information
- Audit Support: Identifying PII in document collections
Limitations
⚠️ Important: This model is intended as an assistive tool, not a replacement for human review.
- False Negatives: Some PII may not be detected; always verify critical applications
- Context Sensitivity: Performance may vary with domain-specific terminology
- Challenging Categories: `occupation`, `time`, and `sexuality` have lower F1 scores
- Language: Primarily trained on English text
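Because low-confidence predictions are a known risk in the challenging categories, one simple mitigation is to filter by score before redaction. A minimal sketch over the pipeline's output format; the threshold value is illustrative, not a recommendation:

```python
def filter_entities(entities, threshold=0.5):
    """Keep only predictions whose confidence meets the threshold.
    For recall-critical redaction, lower the threshold (fewer missed
    spans); raise it to reduce false-positive redactions."""
    return [e for e in entities if e['score'] >= threshold]

# Mocked pipeline output for illustration.
sample_entities = [
    {'entity_group': 'first_name', 'score': 0.99, 'word': 'Jane'},
    {'entity_group': 'occupation', 'score': 0.41, 'word': 'clerk'},
]
kept = filter_entities(sample_entities)
```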
Citation
```bibtex
@misc{openmed-pii-2026,
  title     = {OpenMed-PII-BigMed-Large-560M-v1: PII Detection Model},
  author    = {OpenMed Science},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/openmed/OpenMed-PII-BigMed-Large-560M-v1}
}
```