---
license: apache-2.0
language:
- en
tags:
- pii
- ner
- token-classification
- privacy
- deberta
datasets:
- ai4privacy/pii-masking-200k
- nvidia/Nemotron-PII
- gretelai/synthetic_pii_finance_multilingual
- gretelai/gretel-pii-masking-en-v1
metrics:
- f1
- precision
- recall
pipeline_tag: token-classification
model-index:
- name: pii-small-en
  results:
  - task:
      type: token-classification
      name: PII NER
    metrics:
    - name: F1
      type: f1
      value: 0.8894
    - name: Precision
      type: precision
      value: 0.8698
    - name: Recall
      type: recall
      value: 0.9098
---
# DataFog PII-Small-EN (v1.4)
A compact PII (Personally Identifiable Information) Named Entity Recognition model for English text.
## Architecture
- **Backbone:** DeBERTa-v3-xsmall (22M params)
- **Head:** CharCNN + CRF
- **Total params:** ~45M
- **Entity types:** 41 PII categories across 4 sensitivity tiers
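
A CRF head decodes a whole tag sequence rather than independent per-token labels. This card does not state the tagging scheme, so the sketch below assumes plain BIO tagging (an assumption, not confirmed here) to illustrate how 41 entity types would translate into a tag set and which transitions a CRF would typically be expected to rule out. `bio_tags` and `is_valid_transition` are hypothetical helper names, not part of this model's code.

```python
def bio_tags(entity_types):
    """Build a BIO tag set: one 'O' tag plus a B- and I- tag per entity type."""
    tags = ["O"]
    for etype in entity_types:
        tags.append(f"B-{etype}")
        tags.append(f"I-{etype}")
    return tags


def is_valid_transition(prev_tag, next_tag):
    """Under BIO (assumed scheme), I-X may only follow B-X or I-X of the same type."""
    if not next_tag.startswith("I-"):
        return True  # 'O' and any 'B-' tag may follow any tag
    etype = next_tag[2:]
    return prev_tag in (f"B-{etype}", f"I-{etype}")


# Small illustrative subset of the entity types listed in this card.
example_tags = bio_tags(["SSN", "EMAIL", "PERSON"])
print(example_tags)

# With 41 entity types, BIO tagging would give 2 * 41 + 1 tags.
print(len(bio_tags([f"TYPE_{i}" for i in range(41)])))
```

If the model actually uses a different scheme (for example BIOES), the tag count and the allowed transitions change accordingly.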
## Performance (v1.4, Epoch 3)
| Tier | Recall | Target recall |
|------|--------|---------------|
| T1 (Critical: SSN, Credit Card, etc.) | 0.814 | >=0.98 |
| T2 (High: Person, Email, Phone, etc.) | 0.937 | >=0.95 |
| T3 (Moderate: Username, Date, Location, etc.) | 0.945 | >=0.90 |
| T4 (Domain-specific: Employee ID, Crypto, etc.) | 0.937 | >=0.85 |
| **Overall F1** (F1 score, not recall) | **0.889** | |
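
The numbers above can be sanity-checked with a few lines of Python. This is a minimal sketch: the per-tier values are copied from the table, the precision and recall values come from this card's metadata (0.8698 and 0.9098), and the helper names `f1_score` and `tiers_below_target` are made up for illustration.

```python
# (recall, target recall) per tier, copied from the table above.
TIER_RECALL = {
    "T1": (0.814, 0.98),
    "T2": (0.937, 0.95),
    "T3": (0.945, 0.90),
    "T4": (0.937, 0.85),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def tiers_below_target(tiers):
    """Return the names of tiers whose recall is under its target."""
    return [name for name, (recall, target) in tiers.items() if recall < target]


overall_f1 = f1_score(0.8698, 0.9098)
print(f"Overall F1: {overall_f1:.4f}")
print("Below target:", tiers_below_target(TIER_RECALL))
```

With the reported values, T1 and T2 fall short of their recall targets while T3 and T4 meet theirs, which is worth keeping in mind before relying on the model for critical identifiers such as SSNs or credit card numbers.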
## Training Data
- AI4Privacy (~43K examples, English subset)
- NVIDIA Nemotron-PII (100K examples)
- Gretel Synthetic PII Finance (26K examples)
- Gretel PII Masking EN v1 (50K examples)
- Synthetic data for rare entity types (22K examples)
- **Total: ~241K examples**
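
To see how the training mixture is weighted, the approximate sizes listed above can be turned into shares of the total. This is only an illustrative calculation on the rounded figures from this card; the dictionary keys and the `mixture_shares` helper are hypothetical.

```python
# Approximate example counts in thousands, as listed above.
DATASET_SIZES_K = {
    "AI4Privacy (English subset)": 43,
    "NVIDIA Nemotron-PII": 100,
    "Gretel Synthetic PII Finance": 26,
    "Gretel PII Masking EN v1": 50,
    "Synthetic rare-entity data": 22,
}


def mixture_shares(sizes):
    """Return each source's fraction of the combined training set."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total_k = sum(DATASET_SIZES_K.values())
shares = mixture_shares(DATASET_SIZES_K)
print(f"Total: ~{total_k}K examples")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```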
## Entity Types (41)
### Tier 1 — Critical PII
SSN, CREDIT_CARD, BANK_ACCOUNT, PASSPORT_NUMBER, DRIVERS_LICENSE, TAX_ID
### Tier 2 — High Sensitivity
PERSON, EMAIL, PHONE, DATE_OF_BIRTH, STREET_ADDRESS, IP_ADDRESS
### Tier 3 — Moderate Sensitivity
USERNAME, DATE, LOCATION, ORGANIZATION, URL, LICENSE_PLATE, AGE, NATIONALITY, GENDER, RELIGION, MARITAL_STATUS
### Tier 4 — Domain-Specific
MEDICAL_RECORD, EMPLOYEE_ID, STUDENT_ID, ACCOUNT_NUMBER, PIN, PASSWORD, BIOMETRIC, VEHICLE_ID, DEVICE_ID, CRYPTO_WALLET, IBAN, SWIFT_CODE, INSURANCE_NUMBER, SALARY, CRIMINAL_RECORD, POLITICAL_AFFILIATION, SEXUAL_ORIENTATION, HEALTH_CONDITION
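
Downstream code often needs to map a predicted entity type back to its sensitivity tier, for example to decide how aggressively to redact. The sketch below encodes the four lists above as a lookup table; `TIERS`, `TIER_OF`, and `most_sensitive_tier` are hypothetical names and are not shipped with the model.

```python
# Entity types per sensitivity tier, copied from the lists above.
TIERS = {
    1: ["SSN", "CREDIT_CARD", "BANK_ACCOUNT", "PASSPORT_NUMBER",
        "DRIVERS_LICENSE", "TAX_ID"],
    2: ["PERSON", "EMAIL", "PHONE", "DATE_OF_BIRTH", "STREET_ADDRESS",
        "IP_ADDRESS"],
    3: ["USERNAME", "DATE", "LOCATION", "ORGANIZATION", "URL",
        "LICENSE_PLATE", "AGE", "NATIONALITY", "GENDER", "RELIGION",
        "MARITAL_STATUS"],
    4: ["MEDICAL_RECORD", "EMPLOYEE_ID", "STUDENT_ID", "ACCOUNT_NUMBER",
        "PIN", "PASSWORD", "BIOMETRIC", "VEHICLE_ID", "DEVICE_ID",
        "CRYPTO_WALLET", "IBAN", "SWIFT_CODE", "INSURANCE_NUMBER", "SALARY",
        "CRIMINAL_RECORD", "POLITICAL_AFFILIATION", "SEXUAL_ORIENTATION",
        "HEALTH_CONDITION"],
}

# Reverse lookup: entity type -> tier number.
TIER_OF = {etype: tier for tier, types in TIERS.items() for etype in types}


def most_sensitive_tier(entity_types):
    """Return the lowest tier number (most sensitive) among the given types.

    Returns None when the input is empty.
    """
    return min((TIER_OF[etype] for etype in entity_types), default=None)


print(len(TIER_OF))                         # number of distinct entity types
print(most_sensitive_tier(["URL", "SSN"]))  # tier of the most sensitive hit
```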
## License
Apache 2.0