---
license: apache-2.0
language:
- en
tags:
- pii
- ner
- token-classification
- privacy
- deberta
datasets:
- ai4privacy/pii-masking-200k
- nvidia/Nemotron-PII
- gretelai/synthetic_pii_finance_multilingual
- gretelai/gretel-pii-masking-en-v1
metrics:
- f1
- precision
- recall
pipeline_tag: token-classification
model-index:
- name: pii-small-en
  results:
  - task:
      type: token-classification
      name: PII NER
    metrics:
    - name: F1
      type: f1
      value: 0.8894
    - name: Precision
      type: precision
      value: 0.8698
    - name: Recall
      type: recall
      value: 0.9098
---

# DataFog PII-Small-EN (v1.4)

A compact named entity recognition (NER) model that detects personally identifiable information (PII) in English text.

## Architecture

- **Backbone:** DeBERTa-v3-xsmall (22M params)
- **Head:** CharCNN + CRF
- **Total params:** ~45M
- **Entity types:** 41 PII categories across 4 sensitivity tiers
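
To illustrate what a CRF head adds over picking the best tag for each token independently, here is a minimal pure-Python sketch of Viterbi decoding. It is a generic textbook version, not this model's implementation. The tag names, scores, and the `viterbi_decode` helper are all assumptions made up for the example.

```python
def viterbi_decode(emissions, transitions, start, end):
    """Return the highest-scoring tag index sequence for one sentence.

    emissions:   per-token score lists, shape [seq_len][num_tags]
    transitions: transitions[i][j] = score of moving from tag i to tag j
    start, end:  per-tag scores for beginning / ending the sequence
    """
    num_tags = len(start)
    # score[j] = best score of any path ending in tag j at the current token
    score = [start[j] + emissions[0][j] for j in range(num_tags)]
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(num_tags):
            best_prev = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_prev] + transitions[best_prev][j] + emit[j])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Pick the best final tag, then follow the backpointers in reverse.
    last = max(range(num_tags), key=lambda j: score[j] + end[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy example with made-up scores (not taken from the model).
TAGS = ["O", "B-SSN", "I-SSN"]
NEG = -1e4  # large penalty standing in for a forbidden transition
emissions = [[2.0, 0.0, 0.0], [0.0, 1.0, 1.5], [0.0, 0.0, 2.0]]
transitions = [[0.0, 0.0, NEG], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # O -> I-SSN is forbidden
start = [0.0, 0.0, NEG]  # a sequence may not start with I-SSN
end = [0.0, 0.0, 0.0]
decoded = [TAGS[i] for i in viterbi_decode(emissions, transitions, start, end)]
print(decoded)
```

In this toy case a per-token argmax would label the second token `I-SSN`, which is invalid directly after `O`. The transition penalty steers the decoder to `O, B-SSN, I-SSN` instead.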

## Performance (v1.4, Epoch 3)

| Tier | Recall | Target |
|------|--------|--------|
| T1 (Critical: SSN, Credit Card, etc.) | 0.814 | >=0.98 |
| T2 (High: Person, Email, Phone, etc.) | 0.937 | >=0.95 |
| T3 (Moderate: Username, Date, Location, etc.) | 0.945 | >=0.90 |
| T4 (Domain-specific: Employee ID, Crypto, etc.) | 0.937 | >=0.85 |

**Overall F1: 0.889** (precision 0.870, recall 0.910)

Recall meets the target for T3 and T4 but falls short for T1 and T2.
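
As a quick way to read the table, the sketch below computes the gap between measured recall and target for each tier. The numbers are copied from the table. The `recall_gaps` helper is a hypothetical name introduced only for this example.

```python
# (recall, target) per tier, copied from the table above.
tier_recall = {
    "T1": (0.814, 0.98),
    "T2": (0.937, 0.95),
    "T3": (0.945, 0.90),
    "T4": (0.937, 0.85),
}


def recall_gaps(results):
    """Map each tier to recall minus target; a negative gap means below target."""
    return {tier: round(recall - target, 3) for tier, (recall, target) in results.items()}


gaps = recall_gaps(tier_recall)
below_target = [tier for tier, gap in gaps.items() if gap < 0]

for tier, gap in gaps.items():
    print(f"{tier}: {gap:+.3f}")
print("Below target:", below_target)
```

The largest shortfall is in T1, so detections in that tier may be worth pairing with an additional check (for example, pattern-based validation) in applications where missed critical PII is costly.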

## Training Data

- AI4Privacy (~43K examples, English subset)
- NVIDIA Nemotron-PII (100K examples)
- Gretel Synthetic PII Finance (26K examples)
- Gretel PII Masking EN v1 (50K examples)
- Synthetic data for rare entity types (22K examples)
- **Total: ~241K examples**
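
The mix above can be summarized as per-source shares. This is a small arithmetic sketch using the counts as listed (in thousands, with the AI4Privacy figure approximate). It says nothing about how examples were sampled or weighted during training.

```python
# Example counts in thousands, as listed above.
examples_k = {
    "AI4Privacy (English subset)": 43,
    "NVIDIA Nemotron-PII": 100,
    "Gretel Synthetic PII Finance": 26,
    "Gretel PII Masking EN v1": 50,
    "Synthetic rare-entity data": 22,
}

total_k = sum(examples_k.values())
shares = {name: count / total_k for name, count in examples_k.items()}

print(f"Total: ~{total_k}K examples")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```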

## Entity Types (41)

### Tier 1: Critical PII
SSN, CREDIT_CARD, BANK_ACCOUNT, PASSPORT_NUMBER, DRIVERS_LICENSE, TAX_ID

### Tier 2: High Sensitivity
PERSON, EMAIL, PHONE, DATE_OF_BIRTH, STREET_ADDRESS, IP_ADDRESS

### Tier 3: Moderate Sensitivity
USERNAME, DATE, LOCATION, ORGANIZATION, URL, LICENSE_PLATE, AGE, NATIONALITY, GENDER, RELIGION, MARITAL_STATUS

### Tier 4: Domain-Specific
MEDICAL_RECORD, EMPLOYEE_ID, STUDENT_ID, ACCOUNT_NUMBER, PIN, PASSWORD, BIOMETRIC, VEHICLE_ID, DEVICE_ID, CRYPTO_WALLET, IBAN, SWIFT_CODE, INSURANCE_NUMBER, SALARY, CRIMINAL_RECORD, POLITICAL_AFFILIATION, SEXUAL_ORIENTATION, HEALTH_CONDITION
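
Downstream code often needs to map a predicted entity type back to its tier, for example to redact only the most sensitive categories. The sketch below encodes the lists above as a lookup table. The `should_redact` policy and the BIO label scheme in `bio_labels` are assumptions for illustration, since the label format the model actually uses is not stated here.

```python
TIERS = {
    1: ["SSN", "CREDIT_CARD", "BANK_ACCOUNT", "PASSPORT_NUMBER", "DRIVERS_LICENSE", "TAX_ID"],
    2: ["PERSON", "EMAIL", "PHONE", "DATE_OF_BIRTH", "STREET_ADDRESS", "IP_ADDRESS"],
    3: ["USERNAME", "DATE", "LOCATION", "ORGANIZATION", "URL", "LICENSE_PLATE",
        "AGE", "NATIONALITY", "GENDER", "RELIGION", "MARITAL_STATUS"],
    4: ["MEDICAL_RECORD", "EMPLOYEE_ID", "STUDENT_ID", "ACCOUNT_NUMBER", "PIN",
        "PASSWORD", "BIOMETRIC", "VEHICLE_ID", "DEVICE_ID", "CRYPTO_WALLET", "IBAN",
        "SWIFT_CODE", "INSURANCE_NUMBER", "SALARY", "CRIMINAL_RECORD",
        "POLITICAL_AFFILIATION", "SEXUAL_ORIENTATION", "HEALTH_CONDITION"],
}

# Flat lookup: entity type -> tier number.
ENTITY_TIER = {entity: tier for tier, entities in TIERS.items() for entity in entities}


def should_redact(entity_type, max_tier=2):
    """Hypothetical policy: redact entities in tiers 1..max_tier."""
    return ENTITY_TIER[entity_type] <= max_tier


def bio_labels():
    """Label set under an assumed BIO scheme: O plus B-/I- per entity type."""
    labels = ["O"]
    for entity in ENTITY_TIER:
        labels += [f"B-{entity}", f"I-{entity}"]
    return labels


print(len(ENTITY_TIER), "entity types")
print(should_redact("SSN"), should_redact("SALARY"))
```

With the default `max_tier=2`, `SSN` is redacted and `SALARY` is not. Raising `max_tier` to 4 would cover every category.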

## License

Apache 2.0