---
license: apache-2.0
language:
- en
tags:
- pii
- ner
- token-classification
- privacy
- deberta
datasets:
- ai4privacy/pii-masking-200k
- nvidia/Nemotron-PII
- gretelai/synthetic_pii_finance_multilingual
- gretelai/gretel-pii-masking-en-v1
metrics:
- f1
- precision
- recall
pipeline_tag: token-classification
model-index:
- name: pii-small-en
  results:
  - task:
      type: token-classification
      name: PII NER
    metrics:
    - name: F1
      type: f1
      value: 0.8894
    - name: Precision
      type: precision
      value: 0.8698
    - name: Recall
      type: recall
      value: 0.9098
---
DataFog PII-Small-EN (v1.4)
A compact PII (Personally Identifiable Information) Named Entity Recognition model for English text.
Architecture
- Backbone: DeBERTa-v3-xsmall (22M params)
- Head: CharCNN + CRF
- Total params: ~45M
- Entity types: 41 PII categories across 4 sensitivity tiers
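The card does not state which tagging scheme the CRF head decodes over. Assuming a standard BIO scheme (an assumption, not confirmed here), the 41 entity types would expand to 83 token labels, and a CRF is commonly used to rule out invalid sequences such as an `I-` tag with no matching `B-` tag before it. The helper names `bio_labels` and `is_valid_transition` below are hypothetical and only illustrate that constraint; they are not the model's actual code.

```python
from typing import List


def bio_labels(entity_types: List[str]) -> List[str]:
    """Expand entity types into BIO labels: one 'O' plus B-/I- per type."""
    labels = ["O"]
    for entity in entity_types:
        labels.extend([f"B-{entity}", f"I-{entity}"])
    return labels


def is_valid_transition(prev: str, curr: str) -> bool:
    """Under BIO, I-X may only follow B-X or I-X; anything else may follow anything."""
    if curr.startswith("I-"):
        entity = curr[2:]
        return prev in (f"B-{entity}", f"I-{entity}")
    return True


# Small subset of the card's entity types, for illustration only.
subset = ["SSN", "EMAIL", "PERSON"]
labels = bio_labels(subset)

# Boolean mask a CRF could use to forbid invalid label-to-label moves.
transition_mask = [[is_valid_transition(p, c) for c in labels] for p in labels]

print(len(labels), "labels for", len(subset), "entity types")
print("O -> I-SSN allowed:", is_valid_transition("O", "I-SSN"))
print("B-SSN -> I-SSN allowed:", is_valid_transition("B-SSN", "I-SSN"))
```

If the model uses a different scheme (for example BIOES), the label count and the transition rules would change accordingly.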
Performance (v1.4, Epoch 3)
| Tier | Recall | Target |
|---|---|---|
| T1 (Critical: SSN, Credit Card, etc.) | 0.814 | >=0.98 |
| T2 (High: Person, Email, Phone, etc.) | 0.937 | >=0.95 |
| T3 (Moderate: Username, Date, Location, etc.) | 0.945 | >=0.90 |
| T4 (Domain-specific: Employee ID, Crypto, etc.) | 0.937 | >=0.85 |

Overall F1: 0.889 (precision 0.870, recall 0.910). T3 and T4 meet their recall targets; T1 and T2 are still below theirs.
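
The reported numbers can be cross-checked with a short script. The sketch below recomputes F1 as the harmonic mean of the precision and recall listed in the card's metadata, then lists the gap between each tier's recall and its target. All figures are copied from this card; the variable names are illustrative.

```python
# Overall precision and recall as listed in the card's metadata.
precision, recall = 0.8698, 0.9098


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


overall_f1 = f1_score(precision, recall)

# Per-tier (observed recall, target recall) from the performance table.
tier_recall = {
    "T1": (0.814, 0.98),
    "T2": (0.937, 0.95),
    "T3": (0.945, 0.90),
    "T4": (0.937, 0.85),
}

# A positive gap means the tier is still short of its target.
gaps = {tier: round(target - got, 3) for tier, (got, target) in tier_recall.items()}
below_target = [tier for tier, gap in gaps.items() if gap > 0]

print(f"F1 = {overall_f1:.4f}")
for tier, gap in gaps.items():
    status = "below target" if gap > 0 else "meets target"
    print(f"{tier}: gap {gap:+.3f} ({status})")
```

The recomputed F1 agrees with the reported 0.8894 to within rounding, and the largest shortfall is in T1, the tier with the strictest target.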
Training Data
- AI4Privacy (~43K examples, English subset)
- NVIDIA Nemotron-PII (100K examples)
- Gretel Synthetic PII Finance (26K examples)
- Gretel PII Masking EN v1 (50K examples)
- Synthetic data for rare entity types (22K examples)
- Total: ~241K examples
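To see how the training mix is weighted, the approximate counts above can be turned into shares. This is plain arithmetic on the rounded figures from the list, so the percentages are approximate as well; the dictionary keys are just display labels.

```python
# Approximate example counts in thousands, as listed in the card.
sources_k = {
    "AI4Privacy (English subset)": 43,
    "NVIDIA Nemotron-PII": 100,
    "Gretel Synthetic PII Finance": 26,
    "Gretel PII Masking EN v1": 50,
    "Synthetic (rare entity types)": 22,
}

total_k = sum(sources_k.values())

# Fraction of the training set contributed by each source.
shares = {name: count / total_k for name, count in sources_k.items()}

print(f"Total: ~{total_k}K examples")
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:32s} {share:6.1%}")
```

The counts sum to the stated ~241K, with Nemotron-PII as the largest single source.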
Entity Types (41)
Tier 1 — Critical PII
SSN, CREDIT_CARD, BANK_ACCOUNT, PASSPORT_NUMBER, DRIVERS_LICENSE, TAX_ID
Tier 2 — High Sensitivity
PERSON, EMAIL, PHONE, DATE_OF_BIRTH, STREET_ADDRESS, IP_ADDRESS
Tier 3 — Moderate Sensitivity
USERNAME, DATE, LOCATION, ORGANIZATION, URL, LICENSE_PLATE, AGE, NATIONALITY, GENDER, RELIGION, MARITAL_STATUS
Tier 4 — Domain-Specific
MEDICAL_RECORD, EMPLOYEE_ID, STUDENT_ID, ACCOUNT_NUMBER, PIN, PASSWORD, BIOMETRIC, VEHICLE_ID, DEVICE_ID, CRYPTO_WALLET, IBAN, SWIFT_CODE, INSURANCE_NUMBER, SALARY, CRIMINAL_RECORD, POLITICAL_AFFILIATION, SEXUAL_ORIENTATION, HEALTH_CONDITION
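The tiers lend themselves to tier-based redaction policies. The sketch below builds a label-to-tier lookup from the lists above and masks every detected span at or above a chosen sensitivity. It assumes detections arrive as `(start, end, label)` character spans that do not overlap; the card does not describe the model's output format, so that shape and the helpers `span_of` and `redact` are hypothetical.

```python
from typing import List, Tuple

# Entity types per tier, copied from the card.
TIERS = {
    1: ["SSN", "CREDIT_CARD", "BANK_ACCOUNT", "PASSPORT_NUMBER",
        "DRIVERS_LICENSE", "TAX_ID"],
    2: ["PERSON", "EMAIL", "PHONE", "DATE_OF_BIRTH", "STREET_ADDRESS",
        "IP_ADDRESS"],
    3: ["USERNAME", "DATE", "LOCATION", "ORGANIZATION", "URL",
        "LICENSE_PLATE", "AGE", "NATIONALITY", "GENDER", "RELIGION",
        "MARITAL_STATUS"],
    4: ["MEDICAL_RECORD", "EMPLOYEE_ID", "STUDENT_ID", "ACCOUNT_NUMBER",
        "PIN", "PASSWORD", "BIOMETRIC", "VEHICLE_ID", "DEVICE_ID",
        "CRYPTO_WALLET", "IBAN", "SWIFT_CODE", "INSURANCE_NUMBER", "SALARY",
        "CRIMINAL_RECORD", "POLITICAL_AFFILIATION", "SEXUAL_ORIENTATION",
        "HEALTH_CONDITION"],
}
LABEL_TO_TIER = {label: tier for tier, labels in TIERS.items() for label in labels}

Span = Tuple[int, int, str]  # (start, end, label); hypothetical output shape


def span_of(text: str, value: str, label: str) -> Span:
    """Build a span for the first occurrence of `value` (stand-in for model output)."""
    start = text.index(value)
    return (start, start + len(value), label)


def redact(text: str, spans: List[Span], max_tier: int = 2) -> str:
    """Replace spans whose tier number is <= max_tier with a [LABEL] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        if LABEL_TO_TIER[label] <= max_tier:
            text = text[:start] + f"[{label}]" + text[end:]
    return text


text = "Contact Jane Doe at jane@example.com from Berlin."
spans = [
    span_of(text, "Jane Doe", "PERSON"),
    span_of(text, "jane@example.com", "EMAIL"),
    span_of(text, "Berlin", "LOCATION"),
]
print(redact(text, spans, max_tier=2))
print(redact(text, spans, max_tier=3))
```

With `max_tier=2` only the Tier 1 and Tier 2 spans are masked, so the Tier 3 location is left in place; raising `max_tier` to 3 masks it as well.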
License
Apache 2.0