metadata
language:
  - en
license: apache-2.0
base_model: openmedscience/BioClinical-ModernBERT-base
tags:
  - token-classification
  - ner
  - pii
  - pii-detection
  - de-identification
  - privacy
  - healthcare
  - medical
  - clinical
  - phi
  - hipaa
  - pytorch
  - transformers
  - openmed
datasets:
  - nvidia/Nemotron-PII
pipeline_tag: token-classification
library_name: transformers
metrics:
  - f1
  - precision
  - recall
model-index:
  - name: OpenMed-PII-BioClinicalModern-Base-149M-v1
    results:
      - task:
          type: token-classification
          name: Named Entity Recognition
        dataset:
          name: nvidia/Nemotron-PII (test_strat)
          type: nvidia/Nemotron-PII
          split: test
        metrics:
          - type: f1
            value: 0.9509
            name: F1 (micro)
          - type: precision
            value: 0.9611
            name: Precision
          - type: recall
            value: 0.9409
            name: Recall
widget:
  - text: >-
      Dr. Sarah Johnson (SSN: 123-45-6789) can be reached at
      sarah.johnson@hospital.org or 555-123-4567. She lives at 123 Oak Street,
      Boston, MA 02108.
    example_title: Clinical Note with PII

OpenMed-PII-BioClinicalModern-Base-149M-v1

PII Detection Model | 149M Parameters | Open Source


Model Description

OpenMed-PII-BioClinicalModern-Base-149M-v1 is a transformer-based token classification model fine-tuned for Personally Identifiable Information (PII) detection in text. This model identifies and classifies 54 types of sensitive information including names, addresses, SSNs, medical record numbers, and more.

Key Features

  • High Accuracy: Micro F1 of 0.9509 (precision 0.9611, recall 0.9409) on the Nemotron-PII test set
  • Comprehensive Coverage: Detects 54 entity types spanning personal, financial, medical, and contact information
  • Privacy-Focused: Designed for de-identification workflows that support compliance with HIPAA, GDPR, and other privacy regulations
  • Production-Ready: Optimized for real-world text processing pipelines

Performance

Evaluated on a stratified 2,000-sample test set from NVIDIA Nemotron-PII:

| Metric | Score |
|---|---|
| Micro F1 | 0.9509 |
| Precision | 0.9611 |
| Recall | 0.9409 |
| Macro F1 | 0.9523 |
| Weighted F1 | 0.9489 |
| Accuracy | 0.9932 |
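
As a quick sanity check, the micro F1 can be reproduced from the precision and recall rows, since F1 is their harmonic mean. The sketch below assumes the Precision and Recall rows are micro-averaged; the helper name `f1_from_pr` is illustrative and not part of any library.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the Performance table.
micro_f1 = f1_from_pr(0.9611, 0.9409)
print(f"{micro_f1:.4f}")  # 0.9509
```

The same check works for the per-entity rows, for example precision 0.704 and recall 0.526 for occupation give an F1 of about 0.602.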

Top 10 PII Models

| Rank | Model | F1 | Precision | Recall |
|---|---|---|---|---|
| 1 | OpenMed-PII-SuperClinical-Large-434M-v1 | 0.9608 | 0.9685 | 0.9532 |
| 2 | OpenMed-PII-BigMed-Large-560M-v1 | 0.9604 | 0.9644 | 0.9565 |
| 3 | OpenMed-PII-EuroMed-210M-v1 | 0.9600 | 0.9681 | 0.9521 |
| 4 | OpenMed-PII-SnowflakeMed-568M-v1 | 0.9594 | 0.9640 | 0.9548 |
| 5 | OpenMed-PII-SuperMedical-Large-355M-v1 | 0.9592 | 0.9632 | 0.9553 |
| 6 | OpenMed-PII-ClinicalBGE-568M-v1 | 0.9587 | 0.9636 | 0.9538 |
| 7 | OpenMed-PII-mClinicalE5-Large-560M-v1 | 0.9582 | 0.9631 | 0.9533 |
| 8 | OpenMed-PII-ModernMed-Large-395M-v1 | 0.9579 | 0.9639 | 0.9520 |
| 9 | OpenMed-PII-BioClinicalModern-Large-395M-v1 | 0.9579 | 0.9656 | 0.9502 |
| 10 | OpenMed-PII-ClinicalE5-Large-335M-v1 | 0.9577 | 0.9604 | 0.9550 |

Best Performing Entities

| Entity | F1 | Precision | Recall | Support |
|---|---|---|---|---|
| biometric_identifier | 0.998 | 0.996 | 1.000 | 228 |
| credit_debit_card | 0.998 | 0.995 | 1.000 | 213 |
| race_ethnicity | 0.997 | 1.000 | 0.995 | 193 |
| blood_type | 0.996 | 0.993 | 1.000 | 133 |
| email | 0.993 | 0.993 | 0.993 | 745 |

Challenging Entities

These entity types have lower performance and may benefit from additional post-processing:

| Entity | F1 | Precision | Recall | Support |
|---|---|---|---|---|
| unique_id | 0.889 | 0.919 | 0.861 | 79 |
| education_level | 0.875 | 0.916 | 0.837 | 196 |
| fax_number | 0.856 | 0.786 | 0.939 | 98 |
| time | 0.848 | 0.886 | 0.813 | 460 |
| occupation | 0.602 | 0.704 | 0.526 | 688 |
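
One possible form of post-processing is to route low-confidence predictions for the weaker entity types to human review instead of accepting them automatically. The sketch below is only an illustration: the threshold values, the `REVIEW_THRESHOLDS` table, and the `split_for_review` helper are assumptions, not part of the model or the Transformers library. It operates on dictionaries shaped like the pipeline output with aggregation_strategy="simple".

```python
from typing import Dict, List, Tuple

# Hypothetical per-label confidence thresholds; tune them on your own validation data.
REVIEW_THRESHOLDS: Dict[str, float] = {
    "occupation": 0.90,
    "time": 0.80,
    "fax_number": 0.80,
}
DEFAULT_THRESHOLD = 0.50  # also an assumption


def split_for_review(entities: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split pipeline output into (accepted, needs_review) by confidence score."""
    accepted, needs_review = [], []
    for ent in entities:
        threshold = REVIEW_THRESHOLDS.get(ent["entity_group"], DEFAULT_THRESHOLD)
        if ent["score"] >= threshold:
            accepted.append(ent)
        else:
            needs_review.append(ent)
    return accepted, needs_review


# Hand-written entities in the shape the pipeline returns.
sample = [
    {"entity_group": "email", "score": 0.99, "word": "a@b.org", "start": 0, "end": 7},
    {"entity_group": "occupation", "score": 0.62, "word": "nurse", "start": 20, "end": 25},
]
accepted, needs_review = split_for_review(sample)
print([e["entity_group"] for e in accepted])      # ['email']
print([e["entity_group"] for e in needs_review])  # ['occupation']
```

Note that this only triages spans the model did predict. It does not recover entities the model missed, which matters most for low-recall types such as occupation.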

Supported Entity Types

This model detects 54 PII entity types organized into categories:

Identifiers (16 types)

| Entity | Description |
|---|---|
| account_number | Account Number |
| api_key | API Key |
| bank_routing_number | Bank Routing Number |
| certificate_license_number | Certificate/License Number |
| credit_debit_card | Credit/Debit Card |
| cvv | CVV |
| employee_id | Employee ID |
| health_plan_beneficiary_number | Health Plan Beneficiary Number |
| mac_address | MAC Address |
| medical_record_number | Medical Record Number |

... and 6 more

Personal Info (14 types)

| Entity | Description |
|---|---|
| age | Age |
| biometric_identifier | Biometric Identifier |
| blood_type | Blood Type |
| date_of_birth | Date of Birth |
| education_level | Education Level |
| first_name | First Name |
| last_name | Last Name |
| gender | Gender |
| language | Language |
| occupation | Occupation |

... and 4 more

Contact Info (4 types)

| Entity | Description |
|---|---|
| email | Email |
| phone_number | Phone Number |
| fax_number | Fax Number |
| url | URL |

Location (6 types)

| Entity | Description |
|---|---|
| city | City |
| coordinate | Coordinate |
| country | Country |
| county | County |
| state | State |
| street_address | Street Address |

Network Info (3 types)

| Entity | Description |
|---|---|
| device_identifier | Device Identifier |
| ipv4 | IPv4 |
| ipv6 | IPv6 |

Temporal (3 types)

| Entity | Description |
|---|---|
| date | Date |
| date_time | Date Time |
| time | Time |

Organization (1 type)

| Entity | Description |
|---|---|
| company_name | Company Name |

Usage

Quick Start

from transformers import pipeline

# Load the PII detection pipeline
ner = pipeline("ner", model="openmed/OpenMed-PII-BioClinicalModern-Base-149M-v1", aggregation_strategy="simple")

text = """
Patient John Smith (DOB: 03/15/1985, SSN: 123-45-6789) was seen today.
Contact: john.smith@email.com, Phone: (555) 123-4567.
Address: 456 Oak Street, Boston, MA 02108.
"""

entities = ner(text)
for entity in entities:
    print(f"{entity['entity_group']}: {entity['word']} (score: {entity['score']:.3f})")

De-identification Example

def redact_pii(text, entities, placeholder=None):
    """Replace detected PII spans with placeholders.

    By default each span becomes its entity label, e.g. [email]. Pass
    placeholder='[REDACTED]' to use one fixed string for every span.
    """
    # Process spans from right to left so earlier offsets stay valid
    redacted = text
    for ent in sorted(entities, key=lambda e: e['start'], reverse=True):
        replacement = placeholder if placeholder is not None else f"[{ent['entity_group']}]"
        redacted = redacted[:ent['start']] + replacement + redacted[ent['end']:]
    return redacted

# Apply de-identification (uses text and entities from the Quick Start)
redacted_text = redact_pii(text, entities)
print(redacted_text)

Batch Processing

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model_name = "openmed/OpenMed-PII-BioClinicalModern-Base-149M-v1"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

texts = [
    "Contact Dr. Jane Doe at jane.doe@hospital.org",
    "Patient SSN: 987-65-4321, MRN: 12345678",
]

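# Pad to the longest text in the batch and truncate over-long inputs
inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)  # shape: (batch, seq_len)

# Map predicted label ids back to label names, skipping padding tokens
for i, original in enumerate(texts):
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][i])
    mask = inputs['attention_mask'][i].tolist()
    labels = [model.config.id2label[p] for p in predictions[i].tolist()]
    print(original)
    print([(tok, lab) for tok, lab, keep in zip(tokens, labels, mask) if keep])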

Training Details

Dataset

  • Source: NVIDIA Nemotron-PII
  • Format: BIO-tagged token classification
  • Labels: 106 total (53 entity types × 2 BIO tags + O)
  • Splits: 50K train / 5K validation / 45K test
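
To illustrate the BIO format, the sketch below converts a sequence of BIO tags into entity spans. It assumes label strings of the form `B-<entity>` / `I-<entity>` (for example `B-first_name`); the exact label strings are stored in the model's config and may differ. The `bio_to_spans` helper is illustrative, not a library function.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO tags to (entity_type, start, end) spans; end is exclusive."""
    spans = []
    current_type, start = None, 0
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the entity type matches
        if tag.startswith("I-") and current_type == tag[2:]:
            continue
        # Any other tag closes the span that is currently open
        if current_type is not None:
            spans.append((current_type, start, i))
            current_type = None
        # B- opens a new span; a stray I- is leniently treated as a start too
        if tag.startswith(("B-", "I-")):
            current_type, start = tag[2:], i
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans


words = ["Lives", "at", "456", "Oak", "Street", "in", "Boston"]
tags = ["O", "O", "B-street_address", "I-street_address", "I-street_address", "O", "B-city"]
for entity_type, start, end in bio_to_spans(tags):
    print(entity_type, words[start:end])
# street_address ['456', 'Oak', 'Street']
# city ['Boston']
```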

Training Configuration

  • Max Sequence Length: 384 tokens
  • Label Strategy: First token only (label_all_tokens=False)
  • Framework: Hugging Face Transformers + Trainer API
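
The first-token label strategy can be sketched as follows. This follows the common Hugging Face token-classification recipe rather than the authors' actual training script, which is not shown here. The `word_ids` list is what a fast tokenizer's `BatchEncoding.word_ids()` returns; the `align_labels` helper and the label ids are illustrative assumptions.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Label only the first sub-token of each word and mask everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens and padding carry no label
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First sub-token of a new word gets the word's label
            aligned.append(word_labels[word_id])
        else:
            # Continuation sub-tokens are ignored by the loss
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Three words, the second split into two sub-tokens, wrapped in special tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 5, 0]  # hypothetical label ids
print(align_labels(word_ids, word_labels))  # [-100, 0, 5, -100, 0, -100]
```

With label_all_tokens=True, the continuation branch would instead repeat the word's label on every sub-token.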

Intended Use & Limitations

Intended Use

  • De-identification: Automated redaction of PII in clinical notes, medical records, and documents
  • Compliance: Supporting HIPAA, GDPR, and privacy regulation compliance
  • Data Preprocessing: Preparing datasets for research by removing sensitive information
  • Audit Support: Identifying PII in document collections

Limitations

⚠️ Important: This model is intended as an assistive tool, not a replacement for human review.

  • False Negatives: Some PII may not be detected; always verify the output in critical applications
  • Context Sensitivity: Performance may vary with domain-specific terminology
  • Challenging Categories: occupation (F1 0.602), time (0.848), and fax_number (0.856) have the lowest F1 scores
  • Language: Primarily trained on English text
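
Because false negatives are possible, one way to support human review is to cross-check the model output against simple patterns for highly structured identifiers and flag anything the model did not cover. The sketch below is a hedged illustration: the regular expressions, the `FALLBACK_PATTERNS` table, and the `find_uncovered` helper are assumptions, cover only two identifier types, and are not part of the model.

```python
import re
from typing import Dict, List

# Illustrative patterns only; real deployments need broader, locale-aware rules.
FALLBACK_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def find_uncovered(text: str, entities: List[dict]) -> List[dict]:
    """Return pattern matches that do not overlap any model-detected span."""
    covered = [(e["start"], e["end"]) for e in entities]
    missed = []
    for label, pattern in FALLBACK_PATTERNS.items():
        for match in pattern.finditer(text):
            overlaps = any(
                match.start() < end and start < match.end() for start, end in covered
            )
            if not overlaps:
                missed.append({
                    "entity_group": label,
                    "start": match.start(),
                    "end": match.end(),
                    "word": match.group(),
                })
    return missed


sample_text = "SSN: 123-45-6789, email: jane.doe@hospital.org"
# Pretend the model found the email but missed the SSN.
model_entities = [{"entity_group": "email", "start": 25, "end": 46, "score": 0.99}]
missed = find_uncovered(sample_text, model_entities)
print(missed)
# [{'entity_group': 'ssn', 'start': 5, 'end': 16, 'word': '123-45-6789'}]
```

Flagged spans are candidates for review, not confirmed PII; patterns like these can over-match as well as under-match.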

Citation

@misc{openmed-pii-2026,
  title = {OpenMed-PII-BioClinicalModern-Base-149M-v1: PII Detection Model},
  author = {OpenMed Science},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/openmed/OpenMed-PII-BioClinicalModern-Base-149M-v1}
}

Links