DataFog PII-Small-EN (v1.4)
A compact PII (Personally Identifiable Information) Named Entity Recognition model for English text.
Architecture
- Backbone: DeBERTa-v3-xsmall (22M params)
- Head: CharCNN + CRF
- Total params: ~45M
- Entity types: 41 PII categories across 4 sensitivity tiers
Performance (v1.4, Epoch 3)
| Tier | Recall | Target |
|---|---|---|
| T1 (Critical: SSN, Credit Card, etc.) | 0.814 | >=0.98 |
| T2 (High: Person, Email, Phone, etc.) | 0.937 | >=0.95 |
| T3 (Moderate: Username, Date, Location, etc.) | 0.945 | >=0.90 |
| T4 (Domain-specific: Employee ID, Crypto, etc.) | 0.937 | >=0.85 |

Overall F1: 0.889
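
The per-tier comparison against targets can be checked mechanically. The following is a minimal sketch that copies the recall and target values from the table above; `TIER_RESULTS` and `tiers_below_target` are illustrative names, not part of any DataFog API.

```python
# Per-tier recall and target values copied from the table above (v1.4, epoch 3).
TIER_RESULTS = {
    "T1": {"recall": 0.814, "target": 0.98},
    "T2": {"recall": 0.937, "target": 0.95},
    "T3": {"recall": 0.945, "target": 0.90},
    "T4": {"recall": 0.937, "target": 0.85},
}


def tiers_below_target(results):
    """Return (tier, gap) pairs for tiers whose recall is below its target."""
    return [
        (tier, round(vals["target"] - vals["recall"], 3))
        for tier, vals in results.items()
        if vals["recall"] < vals["target"]
    ]


if __name__ == "__main__":
    for tier, gap in tiers_below_target(TIER_RESULTS):
        print(f"{tier} is {gap:.3f} below its recall target")
```

With these values, T1 and T2 fall short of their recall targets, while T3 and T4 meet theirs.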
Training Data
- AI4Privacy (~43K examples, English subset)
- NVIDIA Nemotron-PII (100K examples)
- Gretel Synthetic PII Finance (26K examples)
- Gretel PII Masking EN v1 (50K examples)
- Synthetic data for rare entity types (22K examples)
- Total: ~241K examples
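
As a quick sanity check on the data mix, the sketch below sums the approximate counts listed above and computes each source's share of the total. The dictionary keys are shortened labels chosen for illustration, and the counts are the rounded figures from the list, so the shares are approximate.

```python
# Approximate example counts (in thousands) copied from the list above.
DATASET_SIZES_K = {
    "AI4Privacy (English subset)": 43,
    "NVIDIA Nemotron-PII": 100,
    "Gretel Synthetic PII Finance": 26,
    "Gretel PII Masking EN v1": 50,
    "Synthetic rare-entity data": 22,
}


def mix_shares(sizes):
    """Return each dataset's fraction of the total example count."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


if __name__ == "__main__":
    print(f"Total: ~{sum(DATASET_SIZES_K.values())}K examples")
    for name, share in mix_shares(DATASET_SIZES_K).items():
        print(f"{name}: {share:.1%}")
```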
Entity Types (41)
Tier 1 - Critical PII
SSN, CREDIT_CARD, BANK_ACCOUNT, PASSPORT_NUMBER, DRIVERS_LICENSE, TAX_ID
Tier 2 - High Sensitivity
PERSON, EMAIL, PHONE, DATE_OF_BIRTH, STREET_ADDRESS, IP_ADDRESS
Tier 3 - Moderate Sensitivity
USERNAME, DATE, LOCATION, ORGANIZATION, URL, LICENSE_PLATE, AGE, NATIONALITY, GENDER, RELIGION, MARITAL_STATUS
Tier 4 - Domain-Specific
MEDICAL_RECORD, EMPLOYEE_ID, STUDENT_ID, ACCOUNT_NUMBER, PIN, PASSWORD, BIOMETRIC, VEHICLE_ID, DEVICE_ID, CRYPTO_WALLET, IBAN, SWIFT_CODE, INSURANCE_NUMBER, SALARY, CRIMINAL_RECORD, POLITICAL_AFFILIATION, SEXUAL_ORIENTATION, HEALTH_CONDITION
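
Downstream code often needs to map a predicted label back to its tier, for example to apply a stricter redaction policy to some tiers. The sketch below encodes the lists above as a lookup table. The `(label, text)` tuple format for detected entities is an assumption made for illustration; this card does not document the model's actual output format.

```python
# Tier assignments copied from the entity type lists above.
TIERS = {
    1: ["SSN", "CREDIT_CARD", "BANK_ACCOUNT", "PASSPORT_NUMBER",
        "DRIVERS_LICENSE", "TAX_ID"],
    2: ["PERSON", "EMAIL", "PHONE", "DATE_OF_BIRTH", "STREET_ADDRESS",
        "IP_ADDRESS"],
    3: ["USERNAME", "DATE", "LOCATION", "ORGANIZATION", "URL",
        "LICENSE_PLATE", "AGE", "NATIONALITY", "GENDER", "RELIGION",
        "MARITAL_STATUS"],
    4: ["MEDICAL_RECORD", "EMPLOYEE_ID", "STUDENT_ID", "ACCOUNT_NUMBER",
        "PIN", "PASSWORD", "BIOMETRIC", "VEHICLE_ID", "DEVICE_ID",
        "CRYPTO_WALLET", "IBAN", "SWIFT_CODE", "INSURANCE_NUMBER", "SALARY",
        "CRIMINAL_RECORD", "POLITICAL_AFFILIATION", "SEXUAL_ORIENTATION",
        "HEALTH_CONDITION"],
}

# Reverse lookup: entity label -> tier number.
ENTITY_TO_TIER = {
    label: tier for tier, labels in TIERS.items() for label in labels
}


def filter_by_tiers(entities, tiers):
    """Keep (label, text) pairs whose label belongs to one of the given tiers.

    `entities` is a hypothetical list of (label, text) tuples.
    """
    wanted = set(tiers)
    return [
        (label, text)
        for label, text in entities
        if ENTITY_TO_TIER.get(label) in wanted
    ]


if __name__ == "__main__":
    detected = [
        ("SSN", "123-45-6789"),
        ("DATE", "2024-01-05"),
        ("EMAIL", "a@example.com"),
    ]
    print(filter_by_tiers(detected, tiers={1, 2}))
```

The filter takes an explicit set of tiers rather than a threshold because Tier 4 is defined by domain rather than by lower sensitivity.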
License
Apache 2.0
Evaluation results
- F1 (self-reported): 0.889
- Precision (self-reported): 0.870
- Recall (self-reported): 0.910
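
F1 is the harmonic mean of precision and recall, so the three figures above can be cross-checked against each other. A minimal sketch, where `f1_score` is a local helper written for this check rather than a library call:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall as reported above (rounded to three decimals).
    print(f"F1 from reported precision/recall: {f1_score(0.870, 0.910):.4f}")
```

Using the rounded precision and recall, the result agrees with the reported F1 of 0.889 to within rounding of the inputs.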