---
language:
- en
license: apache-2.0
base_model: openmedscience/BioClinical-ModernBERT-base
tags:
- token-classification
- ner
- pii
- pii-detection
- de-identification
- privacy
- healthcare
- medical
- clinical
- phi
- hipaa
- pytorch
- transformers
- openmed
datasets:
- nvidia/Nemotron-PII
pipeline_tag: token-classification
library_name: transformers
metrics:
- f1
- precision
- recall
model-index:
- name: OpenMed-PII-BioClinicalModern-Base-149M-v1
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: nvidia/Nemotron-PII (test_strat)
      type: nvidia/Nemotron-PII
      split: test
    metrics:
    - type: f1
      value: 0.9509
      name: F1 (micro)
    - type: precision
      value: 0.9611
      name: Precision
    - type: recall
      value: 0.9409
      name: Recall
widget:
- text: "Dr. Sarah Johnson (SSN: 123-45-6789) can be reached at sarah.johnson@hospital.org or 555-123-4567. She lives at 123 Oak Street, Boston, MA 02108."
  example_title: Clinical Note with PII
---

# OpenMed-PII-BioClinicalModern-Base-149M-v1

**PII Detection Model** | 149M Parameters | Open Source

[![F1 Score](https://img.shields.io/badge/F1-95.09%25-brightgreen)]()
[![Precision](https://img.shields.io/badge/Precision-96.11%25-blue)]()
[![Recall](https://img.shields.io/badge/Recall-94.09%25-orange)]()

## Model Description

**OpenMed-PII-BioClinicalModern-Base-149M-v1** is a transformer-based token classification model fine-tuned for **Personally Identifiable Information (PII) detection** in text. This model identifies and classifies **54 types of sensitive information**, including names, addresses, SSNs, medical record numbers, and more.
### Key Features

- **High Accuracy**: Achieves strong F1 scores across diverse PII categories
- **Comprehensive Coverage**: Detects 50+ entity types spanning personal, financial, medical, and contact information
- **Privacy-Focused**: Designed for de-identification and compliance with HIPAA, GDPR, and other privacy regulations
- **Production-Ready**: Optimized for real-world text processing pipelines

## Performance

Evaluated on a stratified 2,000-sample test set from NVIDIA Nemotron-PII:

| Metric | Score |
|:---|:---:|
| **Micro F1** | **0.9509** |
| Precision | 0.9611 |
| Recall | 0.9409 |
| Macro F1 | 0.9523 |
| Weighted F1 | 0.9489 |
| Accuracy | 0.9932 |

### Top 10 PII Models

| Rank | Model | F1 | Precision | Recall |
|:---:|:---|:---:|:---:|:---:|
| 1 | [OpenMed-PII-SuperClinical-Large-434M-v1](https://huggingface.co/openmed/OpenMed-PII-SuperClinical-Large-434M-v1) | 0.9608 | 0.9685 | 0.9532 |
| 2 | [OpenMed-PII-BigMed-Large-560M-v1](https://huggingface.co/openmed/OpenMed-PII-BigMed-Large-560M-v1) | 0.9604 | 0.9644 | 0.9565 |
| 3 | [OpenMed-PII-EuroMed-210M-v1](https://huggingface.co/openmed/OpenMed-PII-EuroMed-210M-v1) | 0.9600 | 0.9681 | 0.9521 |
| 4 | [OpenMed-PII-SnowflakeMed-568M-v1](https://huggingface.co/openmed/OpenMed-PII-SnowflakeMed-568M-v1) | 0.9594 | 0.9640 | 0.9548 |
| 5 | [OpenMed-PII-SuperMedical-Large-355M-v1](https://huggingface.co/openmed/OpenMed-PII-SuperMedical-Large-355M-v1) | 0.9592 | 0.9632 | 0.9553 |
| 6 | [OpenMed-PII-ClinicalBGE-568M-v1](https://huggingface.co/openmed/OpenMed-PII-ClinicalBGE-568M-v1) | 0.9587 | 0.9636 | 0.9538 |
| 7 | [OpenMed-PII-mClinicalE5-Large-560M-v1](https://huggingface.co/openmed/OpenMed-PII-mClinicalE5-Large-560M-v1) | 0.9582 | 0.9631 | 0.9533 |
| 8 | [OpenMed-PII-ModernMed-Large-395M-v1](https://huggingface.co/openmed/OpenMed-PII-ModernMed-Large-395M-v1) | 0.9579 | 0.9639 | 0.9520 |
| 9 | [OpenMed-PII-BioClinicalModern-Large-395M-v1](https://huggingface.co/openmed/OpenMed-PII-BioClinicalModern-Large-395M-v1) | 0.9579 | 0.9656 | 0.9502 |
| 10 | [OpenMed-PII-ClinicalE5-Large-335M-v1](https://huggingface.co/openmed/OpenMed-PII-ClinicalE5-Large-335M-v1) | 0.9577 | 0.9604 | 0.9550 |

### Best Performing Entities

| Entity | F1 | Precision | Recall | Support |
|:---|:---:|:---:|:---:|:---:|
| `biometric_identifier` | 0.998 | 0.996 | 1.000 | 228 |
| `credit_debit_card` | 0.998 | 0.995 | 1.000 | 213 |
| `race_ethnicity` | 0.997 | 1.000 | 0.995 | 193 |
| `blood_type` | 0.996 | 0.993 | 1.000 | 133 |
| `email` | 0.993 | 0.993 | 0.993 | 745 |

### Challenging Entities

These entity types have lower performance and may benefit from additional post-processing:

| Entity | F1 | Precision | Recall | Support |
|:---|:---:|:---:|:---:|:---:|
| `unique_id` | 0.889 | 0.919 | 0.861 | 79 |
| `education_level` | 0.875 | 0.916 | 0.837 | 196 |
| `fax_number` | 0.856 | 0.786 | 0.939 | 98 |
| `time` | 0.848 | 0.886 | 0.813 | 460 |
| `occupation` | 0.602 | 0.704 | 0.526 | 688 |

## Supported Entity Types

This model detects **54 PII entity types**, organized into the categories below:
### Identifiers (16 types)

| Entity | Description |
|:---|:---|
| `account_number` | Account Number |
| `api_key` | API Key |
| `bank_routing_number` | Bank Routing Number |
| `certificate_license_number` | Certificate/License Number |
| `credit_debit_card` | Credit/Debit Card |
| `cvv` | CVV |
| `employee_id` | Employee ID |
| `health_plan_beneficiary_number` | Health Plan Beneficiary Number |
| `mac_address` | MAC Address |
| `medical_record_number` | Medical Record Number |
| ... | *and 6 more* |
### Personal Info (14 types)

| Entity | Description |
|:---|:---|
| `age` | Age |
| `biometric_identifier` | Biometric Identifier |
| `blood_type` | Blood Type |
| `date_of_birth` | Date of Birth |
| `education_level` | Education Level |
| `first_name` | First Name |
| `last_name` | Last Name |
| `gender` | Gender |
| `language` | Language |
| `occupation` | Occupation |
| ... | *and 4 more* |
### Contact Info (4 types)

| Entity | Description |
|:---|:---|
| `email` | Email |
| `phone_number` | Phone Number |
| `fax_number` | Fax Number |
| `url` | URL |
### Location (6 types)

| Entity | Description |
|:---|:---|
| `city` | City |
| `coordinate` | Coordinate |
| `country` | Country |
| `county` | County |
| `state` | State |
| `street_address` | Street Address |
### Network Info (3 types)

| Entity | Description |
|:---|:---|
| `device_identifier` | Device Identifier |
| `ipv4` | IPv4 Address |
| `ipv6` | IPv6 Address |
### Temporal (3 types)

| Entity | Description |
|:---|:---|
| `date` | Date |
| `date_time` | Date and Time |
| `time` | Time |
### Organization (1 type)

| Entity | Description |
|:---|:---|
| `company_name` | Company Name |
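When only certain kinds of PII need to be redacted, the labels above can drive a post-processing filter on pipeline output. The sketch below assumes `aggregation_strategy="simple"` output dicts with `entity_group` and `score` keys; the category mapping covers only a subset of the labels, and the threshold values are illustrative, not tuned.

```python
# Illustrative post-processing for pipeline output; mapping and thresholds
# are examples only, not part of the model or its evaluation.

CATEGORIES = {
    "contact": {"email", "phone_number", "fax_number", "url"},
    "location": {"city", "coordinate", "country", "county", "state", "street_address"},
    "network": {"device_identifier", "ipv4", "ipv6"},
}

# Entity types from the "Challenging Entities" table, held to a higher bar
STRICT_TYPES = {"occupation", "time", "unique_id", "education_level", "fax_number"}

def filter_entities(entities, categories=None, threshold=0.5, strict_threshold=0.85):
    """Keep entities matching the requested categories that clear a score threshold."""
    allowed = None
    if categories is not None:
        allowed = set().union(*(CATEGORIES[c] for c in categories))
    kept = []
    for ent in entities:
        if allowed is not None and ent["entity_group"] not in allowed:
            continue
        bar = strict_threshold if ent["entity_group"] in STRICT_TYPES else threshold
        if ent["score"] >= bar:
            kept.append(ent)
    return kept
```

Raising the bar for low-F1 types trades recall for precision; for high-stakes redaction the opposite trade (a lower threshold plus human review) may be more appropriate.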
## Usage

### Quick Start

```python
from transformers import pipeline

# Load the PII detection pipeline
ner = pipeline(
    "ner",
    model="openmed/OpenMed-PII-BioClinicalModern-Base-149M-v1",
    aggregation_strategy="simple",
)

text = """
Patient John Smith (DOB: 03/15/1985, SSN: 123-45-6789) was seen today.
Contact: john.smith@email.com, Phone: (555) 123-4567.
Address: 456 Oak Street, Boston, MA 02108.
"""

entities = ner(text)
for entity in entities:
    print(f"{entity['entity_group']}: {entity['word']} (score: {entity['score']:.3f})")
```

### De-identification Example

```python
def redact_pii(text, entities):
    """Replace each detected PII span with its entity-type label."""
    # Process entities from last to first so earlier offsets stay valid
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)
    redacted = text
    for ent in sorted_entities:
        redacted = (
            redacted[:ent['start']]
            + f"[{ent['entity_group']}]"
            + redacted[ent['end']:]
        )
    return redacted

# Apply de-identification
redacted_text = redact_pii(text, entities)
print(redacted_text)
```

### Batch Processing

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "openmed/OpenMed-PII-BioClinicalModern-Base-149M-v1"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

texts = [
    "Contact Dr. Jane Doe at jane.doe@hospital.org",
    "Patient SSN: 987-65-4321, MRN: 12345678",
]

inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)
```

## Training Details

### Dataset

- **Source**: [NVIDIA Nemotron-PII](https://huggingface.co/datasets/nvidia/Nemotron-PII)
- **Format**: BIO-tagged token classification
- **Labels**: 106 total (53 entity types × 2 BIO tags + O)
- **Splits**: 50K train / 5K validation / 45K test

### Training Configuration

- **Max Sequence Length**: 384 tokens
- **Label Strategy**: First token only (`label_all_tokens=False`)
- **Framework**: Hugging Face Transformers + Trainer API

## Intended Use & Limitations

### Intended Use

- **De-identification**: Automated redaction of PII in clinical notes, medical records, and documents
- **Compliance**: Supporting HIPAA, GDPR, and privacy regulation compliance
- **Data Preprocessing**: Preparing datasets for research by removing sensitive information
- **Audit Support**: Identifying PII in document collections

### Limitations

⚠️ **Important**: This model is intended as an **assistive tool**, not a replacement for human review.

- **False Negatives**: Some PII may go undetected; always verify results in critical applications
- **Context Sensitivity**: Performance may vary with domain-specific terminology
- **Challenging Categories**: `occupation`, `time`, and `sexuality` have lower F1 scores
- **Language**: Primarily trained on English text

## Citation

```bibtex
@misc{openmed-pii-2026,
  title = {OpenMed-PII-BioClinicalModern-Base-149M-v1: PII Detection Model},
  author = {OpenMed Science},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/openmed/OpenMed-PII-BioClinicalModern-Base-149M-v1}
}
```

## Links

- **Organization**: [OpenMed](https://huggingface.co/OpenMed)
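The batch-processing example in the usage section stops at raw label IDs. Turning those into character spans means mapping each ID through `model.config.id2label` and merging tokens along BIO boundaries; the helper below is a minimal sketch of that merge step, using illustrative label strings and token offsets rather than real model output.

```python
# Sketch: merge per-token BIO labels into (entity_type, start, end) character
# spans. Labels of the form "B-<type>" / "I-<type>" / "O" and the offsets
# (e.g. from a tokenizer's return_offsets_mapping=True) are illustrative.

def bio_to_spans(labels, offsets):
    """Merge B-/I- tagged tokens into (entity_type, start, end) spans."""
    spans = []
    current = None  # (entity_type, start, end) being built
    for label, (start, end) in zip(labels, offsets):
        if label.startswith("B-"):
            if current:
                spans.append(current)
            current = (label[2:], start, end)
        elif label.startswith("I-") and current and label[2:] == current[0]:
            # Continuation of the current entity: extend its end offset
            current = (current[0], current[1], end)
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans
```

The spans this returns can be fed straight into a redaction routine such as `redact_pii` above (each span supplying the `start`/`end` offsets).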