---
language:
- en
license: apache-2.0
base_model: openmedscience/BioClinical-ModernBERT-base
tags:
- token-classification
- ner
- pii
- pii-detection
- de-identification
- privacy
- healthcare
- medical
- clinical
- phi
- hipaa
- pytorch
- transformers
- openmed
datasets:
- nvidia/Nemotron-PII
pipeline_tag: token-classification
library_name: transformers
metrics:
- f1
- precision
- recall
model-index:
- name: OpenMed-PII-BioClinicalModern-Base-149M-v1
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: nvidia/Nemotron-PII (test_strat)
      type: nvidia/Nemotron-PII
      split: test
    metrics:
    - type: f1
      value: 0.9509
      name: F1 (micro)
    - type: precision
      value: 0.9611
      name: Precision
    - type: recall
      value: 0.9409
      name: Recall
widget:
- text: "Dr. Sarah Johnson (SSN: 123-45-6789) can be reached at sarah.johnson@hospital.org or 555-123-4567. She lives at 123 Oak Street, Boston, MA 02108."
  example_title: Clinical Note with PII
---
# OpenMed-PII-BioClinicalModern-Base-149M-v1
**PII Detection Model** | 149M Parameters | Open Source
## Model Description
**OpenMed-PII-BioClinicalModern-Base-149M-v1** is a transformer-based token classification model fine-tuned for **Personally Identifiable Information (PII) detection** in text. This model identifies and classifies **54 types of sensitive information** including names, addresses, SSNs, medical record numbers, and more.
### Key Features
- **High Accuracy**: Achieves strong F1 scores across diverse PII categories
- **Comprehensive Coverage**: Detects 50+ entity types spanning personal, financial, medical, and contact information
- **Privacy-Focused**: Designed for de-identification and compliance with HIPAA, GDPR, and other privacy regulations
- **Production-Ready**: Optimized for real-world text processing pipelines
## Performance
Evaluated on a stratified 2,000-sample test set from NVIDIA Nemotron-PII:
| Metric | Score |
|:---|:---:|
| **Micro F1** | **0.9509** |
| Precision | 0.9611 |
| Recall | 0.9409 |
| Macro F1 | 0.9523 |
| Weighted F1 | 0.9489 |
| Accuracy | 0.9932 |
### Top 10 PII Models
| Rank | Model | F1 | Precision | Recall |
|:---:|:---|:---:|:---:|:---:|
| 1 | [OpenMed-PII-SuperClinical-Large-434M-v1](https://huggingface.co/openmed/OpenMed-PII-SuperClinical-Large-434M-v1) | 0.9608 | 0.9685 | 0.9532 |
| 2 | [OpenMed-PII-BigMed-Large-560M-v1](https://huggingface.co/openmed/OpenMed-PII-BigMed-Large-560M-v1) | 0.9604 | 0.9644 | 0.9565 |
| 3 | [OpenMed-PII-EuroMed-210M-v1](https://huggingface.co/openmed/OpenMed-PII-EuroMed-210M-v1) | 0.9600 | 0.9681 | 0.9521 |
| 4 | [OpenMed-PII-SnowflakeMed-568M-v1](https://huggingface.co/openmed/OpenMed-PII-SnowflakeMed-568M-v1) | 0.9594 | 0.9640 | 0.9548 |
| 5 | [OpenMed-PII-SuperMedical-Large-355M-v1](https://huggingface.co/openmed/OpenMed-PII-SuperMedical-Large-355M-v1) | 0.9592 | 0.9632 | 0.9553 |
| 6 | [OpenMed-PII-ClinicalBGE-568M-v1](https://huggingface.co/openmed/OpenMed-PII-ClinicalBGE-568M-v1) | 0.9587 | 0.9636 | 0.9538 |
| 7 | [OpenMed-PII-mClinicalE5-Large-560M-v1](https://huggingface.co/openmed/OpenMed-PII-mClinicalE5-Large-560M-v1) | 0.9582 | 0.9631 | 0.9533 |
| 8 | [OpenMed-PII-ModernMed-Large-395M-v1](https://huggingface.co/openmed/OpenMed-PII-ModernMed-Large-395M-v1) | 0.9579 | 0.9639 | 0.9520 |
| 9 | [OpenMed-PII-BioClinicalModern-Large-395M-v1](https://huggingface.co/openmed/OpenMed-PII-BioClinicalModern-Large-395M-v1) | 0.9579 | 0.9656 | 0.9502 |
| 10 | [OpenMed-PII-ClinicalE5-Large-335M-v1](https://huggingface.co/openmed/OpenMed-PII-ClinicalE5-Large-335M-v1) | 0.9577 | 0.9604 | 0.9550 |
### Best Performing Entities
| Entity | F1 | Precision | Recall | Support |
|:---|:---:|:---:|:---:|:---:|
| `biometric_identifier` | 0.998 | 0.996 | 1.000 | 228 |
| `credit_debit_card` | 0.998 | 0.995 | 1.000 | 213 |
| `race_ethnicity` | 0.997 | 1.000 | 0.995 | 193 |
| `blood_type` | 0.996 | 0.993 | 1.000 | 133 |
| `email` | 0.993 | 0.993 | 0.993 | 745 |
### Challenging Entities
These entity types have lower performance and may benefit from additional post-processing:
| Entity | F1 | Precision | Recall | Support |
|:---|:---:|:---:|:---:|:---:|
| `unique_id` | 0.889 | 0.919 | 0.861 | 79 |
| `education_level` | 0.875 | 0.916 | 0.837 | 196 |
| `fax_number` | 0.856 | 0.786 | 0.939 | 98 |
| `time` | 0.848 | 0.886 | 0.813 | 460 |
| `occupation` | 0.602 | 0.704 | 0.526 | 688 |
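One way to post-process pattern-like categories such as `fax_number` is to cross-check low-confidence spans against a regex before accepting them. The sketch below is illustrative only: the threshold, the pattern (North-American-style numbers), and the span dictionaries are assumptions, not part of the model.

```python
import re

# Illustrative pattern for North-American-style fax/phone numbers; adjust per locale.
FAX_RE = re.compile(r"^\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$")

def filter_fax_spans(spans, min_score=0.5):
    """Keep fax_number spans that either score high or match the pattern."""
    kept = []
    for s in spans:
        if s["entity_group"] != "fax_number":
            kept.append(s)  # other entity types pass through unchanged
        elif s["score"] >= min_score or FAX_RE.match(s["word"]):
            kept.append(s)  # confident, or at least shaped like a fax number
    return kept

spans = [
    {"entity_group": "fax_number", "word": "(555) 867-5309", "score": 0.41},
    {"entity_group": "fax_number", "word": "next Tuesday", "score": 0.40},
]
print(filter_fax_spans(spans))  # keeps only the span that matches the pattern
```

The same idea extends to other structured types (`ipv4`, `credit_debit_card`, `email`) where a validator can cheaply reject implausible spans.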
## Supported Entity Types
This model detects **54 PII entity types** organized into categories:
#### Identifiers (16 types)
| Entity | Description |
|:---|:---|
| `account_number` | Account Number |
| `api_key` | API Key |
| `bank_routing_number` | Bank Routing Number |
| `certificate_license_number` | Certificate License Number |
| `credit_debit_card` | Credit Debit Card |
| `cvv` | CVV |
| `employee_id` | Employee ID |
| `health_plan_beneficiary_number` | Health Plan Beneficiary Number |
| `mac_address` | MAC Address |
| `medical_record_number` | Medical Record Number |
| ... | *and 6 more* |
#### Personal Info (14 types)
| Entity | Description |
|:---|:---|
| `age` | Age |
| `biometric_identifier` | Biometric Identifier |
| `blood_type` | Blood Type |
| `date_of_birth` | Date of Birth |
| `education_level` | Education Level |
| `first_name` | First Name |
| `last_name` | Last Name |
| `gender` | Gender |
| `language` | Language |
| `occupation` | Occupation |
| ... | *and 4 more* |
#### Contact Info (4 types)
| Entity | Description |
|:---|:---|
| `email` | Email |
| `phone_number` | Phone Number |
| `fax_number` | Fax Number |
| `url` | URL |
#### Location (6 types)
| Entity | Description |
|:---|:---|
| `city` | City |
| `coordinate` | Coordinate |
| `country` | Country |
| `county` | County |
| `state` | State |
| `street_address` | Street Address |
#### Network Info (3 types)
| Entity | Description |
|:---|:---|
| `device_identifier` | Device Identifier |
| `ipv4` | IPv4 |
| `ipv6` | IPv6 |
#### Temporal (3 types)
| Entity | Description |
|:---|:---|
| `date` | Date |
| `date_time` | Date Time |
| `time` | Time |
#### Organization (1 type)
| Entity | Description |
|:---|:---|
| `company_name` | Company Name |
## Usage
### Quick Start
```python
from transformers import pipeline
# Load the PII detection pipeline
ner = pipeline("ner", model="openmed/OpenMed-PII-BioClinicalModern-Base-149M-v1", aggregation_strategy="simple")
text = """
Patient John Smith (DOB: 03/15/1985, SSN: 123-45-6789) was seen today.
Contact: john.smith@email.com, Phone: (555) 123-4567.
Address: 456 Oak Street, Boston, MA 02108.
"""
entities = ner(text)
for entity in entities:
    print(f"{entity['entity_group']}: {entity['word']} (score: {entity['score']:.3f})")
```
### De-identification Example
```python
def redact_pii(text, entities):
    """Replace detected PII spans with their entity-type labels."""
    # Process entities from last to first so earlier character offsets stay valid
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)
    redacted = text
    for ent in sorted_entities:
        redacted = redacted[:ent['start']] + f"[{ent['entity_group']}]" + redacted[ent['end']:]
    return redacted
# Apply de-identification
redacted_text = redact_pii(text, entities)
print(redacted_text)
```
### Batch Processing
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
model_name = "openmed/OpenMed-PII-BioClinicalModern-Base-149M-v1"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
texts = [
"Contact Dr. Jane Doe at jane.doe@hospital.org",
"Patient SSN: 987-65-4321, MRN: 12345678",
]
inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)
```
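The resulting `predictions` are per-token class ids; turning them into readable BIO labels goes through `model.config.id2label`. A minimal sketch with a hypothetical label map and token list (the real values come from the loaded model and tokenizer):

```python
# Hypothetical id2label map; in practice use model.config.id2label.
id2label = {0: "O", 1: "B-first_name", 2: "I-first_name", 3: "B-email"}

def decode_predictions(pred_ids, tokens, id2label):
    """Pair each token with its predicted BIO label, skipping special tokens."""
    return [
        (tok, id2label[int(i)])
        for tok, i in zip(tokens, pred_ids)
        if tok not in ("[CLS]", "[SEP]", "[PAD]")
    ]

pred_ids = [0, 1, 2, 0, 3, 0]
tokens = ["[CLS]", "Jane", "Doe", "at", "jane.doe@hospital.org", "[SEP]"]
print(decode_predictions(pred_ids, tokens, id2label))
# [('Jane', 'B-first_name'), ('Doe', 'I-first_name'), ('at', 'O'), ('jane.doe@hospital.org', 'B-email')]
```

For production use, the `pipeline` API shown in Quick Start handles this decoding (plus subword aggregation) automatically.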
## Training Details
### Dataset
- **Source**: [NVIDIA Nemotron-PII](https://huggingface.co/datasets/nvidia/Nemotron-PII)
- **Format**: BIO-tagged token classification
- **Labels**: 107 total (53 entity types × 2 BIO tags + O)
- **Splits**: 50K train / 5K validation / 45K test
### Training Configuration
- **Max Sequence Length**: 384 tokens
- **Label Strategy**: First token only (`label_all_tokens=False`)
- **Framework**: Hugging Face Transformers + Trainer API
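The "first token only" strategy above means that when a word is split into several subword tokens, only the first subword carries the word's label and the rest are masked with `-100` so the loss ignores them. A sketch of that alignment, where `word_ids` stands in for what `tokenizer(..., is_split_into_words=True).word_ids()` would return (the example values are hypothetical):

```python
def align_labels(word_labels, word_ids):
    """Expand word-level labels to subword tokens, masking all but the first subword."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:              # special tokens ([CLS], [SEP], padding)
            aligned.append(-100)
        elif wid != prev:            # first subword of a new word keeps the label
            aligned.append(word_labels[wid])
        else:                        # continuation subwords are ignored by the loss
            aligned.append(-100)
        prev = wid
    return aligned

word_labels = [1, 2, 0]                     # one label id per word
word_ids = [None, 0, 0, 1, 2, None]         # first word split into two subwords
print(align_labels(word_labels, word_ids))  # [-100, 1, -100, 2, 0, -100]
```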
## Intended Use & Limitations
### Intended Use
- **De-identification**: Automated redaction of PII in clinical notes, medical records, and documents
- **Compliance**: Supporting HIPAA, GDPR, and privacy regulation compliance
- **Data Preprocessing**: Preparing datasets for research by removing sensitive information
- **Audit Support**: Identifying PII in document collections
### Limitations
⚠️ **Important**: This model is intended as an **assistive tool**, not a replacement for human review.
- **False Negatives**: Some PII may not be detected; always verify critical applications
- **Context Sensitivity**: Performance may vary with domain-specific terminology
- **Challenging Categories**: `occupation`, `time`, and `sexuality` have lower F1 scores
- **Language**: Primarily trained on English text
## Citation
```bibtex
@misc{openmed-pii-2026,
  title     = {OpenMed-PII-BioClinicalModern-Base-149M-v1: PII Detection Model},
  author    = {OpenMed Science},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/openmed/OpenMed-PII-BioClinicalModern-Base-149M-v1}
}
```
## Links
- **Organization**: [OpenMed](https://huggingface.co/OpenMed)