---
license: apache-2.0
language:
- en
tags:
- pii
- ner
- token-classification
- privacy
- deberta
datasets:
- ai4privacy/pii-masking-200k
- nvidia/Nemotron-PII
- gretelai/synthetic_pii_finance_multilingual
- gretelai/gretel-pii-masking-en-v1
metrics:
- f1
- precision
- recall
pipeline_tag: token-classification
model-index:
- name: pii-small-en
  results:
  - task:
      type: token-classification
      name: PII NER
    metrics:
    - name: F1
      type: f1
      value: 0.8894
    - name: Precision
      type: precision
      value: 0.8698
    - name: Recall
      type: recall
      value: 0.9098
---

# DataFog PII-Small-EN (v1.4)

A compact named entity recognition (NER) model that detects personally identifiable information (PII) in English text.

## Architecture

- **Backbone:** DeBERTa-v3-xsmall (22M params)
- **Head:** CharCNN + CRF
- **Total params:** ~45M
- **Entity types:** 41 PII categories across 4 sensitivity tiers
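
To illustrate what a CRF head adds on top of per-token scores, the sketch below runs a generic textbook Viterbi decode over a toy three-label problem. It is not this model's implementation: the function name `viterbi_decode`, the BIO label layout, and all scores are assumptions chosen for illustration, and the card does not describe how decoding is actually done.

```python
# Generic Viterbi decoding sketch (illustrative only, not the model's own code).
# Assumed toy label set in BIO style: 0 = "O", 1 = "B-SSN", 2 = "I-SSN".

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label path.

    emissions:   per-token label scores, shape [num_tokens][num_labels]
    transitions: transitions[i][j] is the score of moving from label i to label j
    """
    num_labels = len(transitions)
    scores = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for emission in emissions[1:]:
        step_scores, step_back = [], []
        for j in range(num_labels):
            # Best previous label for a path that ends in label j at this token.
            best_prev = max(range(num_labels), key=lambda i: scores[i] + transitions[i][j])
            step_scores.append(scores[best_prev] + transitions[best_prev][j] + emission[j])
            step_back.append(best_prev)
        scores = step_scores
        backpointers.append(step_back)
    # Walk the backpointers from the best final label to recover the path.
    path = [max(range(num_labels), key=lambda j: scores[j])]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return path

# Hypothetical scores: per-token argmax would pick "O" then "I-SSN",
# which is an invalid BIO sequence.
emissions = [[1.0, 0.8, 0.0], [0.0, 0.0, 2.0]]
transitions = [[0.0, 0.0, -1e4],  # "O" -> "I-SSN" is heavily penalized
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
decoded = viterbi_decode(emissions, transitions)
print(decoded)
```

With these made-up scores the decoder returns the path B-SSN, I-SSN instead of the invalid O, I-SSN, which is the kind of sequence-level consistency a CRF layer is typically used for.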

## Performance (v1.4, Epoch 3)

| Tier | Recall | Target |
|------|--------|--------|
| T1 (Critical: SSN, Credit Card, etc.) | 0.814 | >=0.98 |
| T2 (High: Person, Email, Phone, etc.) | 0.937 | >=0.95 |
| T3 (Moderate: Username, Date, Location, etc.) | 0.945 | >=0.90 |
| T4 (Domain-specific: Employee ID, Crypto, etc.) | 0.937 | >=0.85 |

**Overall F1: 0.889** (precision 0.870, recall 0.910)

T3 and T4 meet their recall targets. T1 and T2 fall short of theirs, with T1 furthest from target (0.814 against 0.98).
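
The reported numbers can be cross-checked with a few lines of Python. This is a minimal sketch: the per-tier values are copied from the table above, precision and recall come from the card's metadata, and the helper names `f1_score` and `tiers_below_target` are hypothetical.

```python
# Per-tier (recall, target) pairs copied from the performance table.
TIER_RECALL = {
    "T1": (0.814, 0.98),
    "T2": (0.937, 0.95),
    "T3": (0.945, 0.90),
    "T4": (0.937, 0.85),
}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def tiers_below_target(tier_recall):
    """Return the tiers whose recall is under the stated target."""
    return [tier for tier, (recall, target) in tier_recall.items() if recall < target]

# Precision and recall as listed in the card's metadata.
overall_f1 = f1_score(0.8698, 0.9098)
missed = tiers_below_target(TIER_RECALL)

print(f"Overall F1: {overall_f1:.4f}")
print("Tiers below target:", missed)
```

The computed F1 agrees with the reported 0.889, and the check flags T1 and T2 as the tiers still under target.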

## Training Data

- AI4Privacy (~43K examples, English subset)
- NVIDIA Nemotron-PII (100K examples)
- Gretel Synthetic PII Finance (26K examples)
- Gretel PII Masking EN v1 (50K examples)
- Synthetic data for rare entity types (22K examples)
- **Total: ~241K examples**
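
As a quick sanity check on the mix, the sketch below tallies the listed sources and prints each one's share of the total. The counts are the approximate figures from the list above, so the shares are approximate as well; the variable names are illustrative.

```python
# Approximate example counts, as listed in the Training Data section.
TRAINING_SOURCES = {
    "AI4Privacy (English subset)": 43_000,
    "NVIDIA Nemotron-PII": 100_000,
    "Gretel Synthetic PII Finance": 26_000,
    "Gretel PII Masking EN v1": 50_000,
    "Synthetic rare-entity data": 22_000,
}

total_examples = sum(TRAINING_SOURCES.values())
source_shares = {name: count / total_examples for name, count in TRAINING_SOURCES.items()}

# Print sources from largest to smallest share.
for name, share in sorted(source_shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:32s} {share:6.1%}")
print(f"Total: {total_examples:,} examples")
```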

## Entity Types (41)

### Tier 1 — Critical PII
SSN, CREDIT_CARD, BANK_ACCOUNT, PASSPORT_NUMBER, DRIVERS_LICENSE, TAX_ID

### Tier 2 — High Sensitivity
PERSON, EMAIL, PHONE, DATE_OF_BIRTH, STREET_ADDRESS, IP_ADDRESS

### Tier 3 — Moderate Sensitivity
USERNAME, DATE, LOCATION, ORGANIZATION, URL, LICENSE_PLATE, AGE, NATIONALITY, GENDER, RELIGION, MARITAL_STATUS

### Tier 4 — Domain-Specific
MEDICAL_RECORD, EMPLOYEE_ID, STUDENT_ID, ACCOUNT_NUMBER, PIN, PASSWORD, BIOMETRIC, VEHICLE_ID, DEVICE_ID, CRYPTO_WALLET, IBAN, SWIFT_CODE, INSURANCE_NUMBER, SALARY, CRIMINAL_RECORD, POLITICAL_AFFILIATION, SEXUAL_ORIENTATION, HEALTH_CONDITION
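
Downstream code often needs to map a predicted label back to its sensitivity tier, for example to redact only the most sensitive categories. The sketch below builds that lookup from the lists above. The helper names `tier_of` and `labels_up_to_tier` are hypothetical, and the handling of `B-`/`I-` prefixes assumes BIO-style tags, which this card does not confirm.

```python
# Entity types grouped by sensitivity tier, copied from the lists above.
TIERS = {
    1: ["SSN", "CREDIT_CARD", "BANK_ACCOUNT", "PASSPORT_NUMBER", "DRIVERS_LICENSE", "TAX_ID"],
    2: ["PERSON", "EMAIL", "PHONE", "DATE_OF_BIRTH", "STREET_ADDRESS", "IP_ADDRESS"],
    3: ["USERNAME", "DATE", "LOCATION", "ORGANIZATION", "URL", "LICENSE_PLATE",
        "AGE", "NATIONALITY", "GENDER", "RELIGION", "MARITAL_STATUS"],
    4: ["MEDICAL_RECORD", "EMPLOYEE_ID", "STUDENT_ID", "ACCOUNT_NUMBER", "PIN", "PASSWORD",
        "BIOMETRIC", "VEHICLE_ID", "DEVICE_ID", "CRYPTO_WALLET", "IBAN", "SWIFT_CODE",
        "INSURANCE_NUMBER", "SALARY", "CRIMINAL_RECORD", "POLITICAL_AFFILIATION",
        "SEXUAL_ORIENTATION", "HEALTH_CONDITION"],
}

# Flat lookup from entity type to tier number.
ENTITY_TIER = {entity: tier for tier, entities in TIERS.items() for entity in entities}

def tier_of(label):
    """Return the tier for a label, or None if unknown.

    Assumption: labels may carry a BIO prefix such as "B-" or "I-".
    """
    if label[:2] in ("B-", "I-"):
        label = label[2:]
    return ENTITY_TIER.get(label)

def labels_up_to_tier(max_tier):
    """Return all entity types in tiers 1..max_tier, e.g. for a redaction policy."""
    return sorted(entity for entity, tier in ENTITY_TIER.items() if tier <= max_tier)

print(len(ENTITY_TIER))
print(tier_of("B-EMAIL"))
print(labels_up_to_tier(1))
```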

## License

Apache 2.0