DataFog PII-Small-EN (v1.4)

A compact PII (Personally Identifiable Information) Named Entity Recognition model for English text.

Architecture

  • Backbone: DeBERTa-v3-xsmall (22M params)
  • Head: CharCNN + CRF
  • Total params: ~45M
  • Entity types: 41 PII categories across 4 sensitivity tiers

Performance (v1.4, Epoch 3)

Tier                                              Recall   Target
T1 (Critical: SSN, Credit Card, etc.)             0.814    >=0.98
T2 (High: Person, Email, Phone, etc.)             0.937    >=0.95
T3 (Moderate: Username, Date, Location, etc.)     0.945    >=0.90
T4 (Domain-specific: Employee ID, Crypto, etc.)   0.937    >=0.85

Overall F1: 0.889
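
The gap between measured recall and target can be read off the figures above. The following is a minimal sketch of that comparison; the names TIER_RESULTS and tiers_below_target are illustrative assumptions, not part of any DataFog API.

```python
# Recall and target values copied from the v1.4 (epoch 3) results above.
# TIER_RESULTS and tiers_below_target are illustrative names only.
TIER_RESULTS = {
    "T1": {"recall": 0.814, "target": 0.98},
    "T2": {"recall": 0.937, "target": 0.95},
    "T3": {"recall": 0.945, "target": 0.90},
    "T4": {"recall": 0.937, "target": 0.85},
}


def tiers_below_target(results):
    """Return {tier: shortfall} for tiers whose recall is under its target."""
    return {
        tier: round(r["target"] - r["recall"], 3)
        for tier, r in results.items()
        if r["recall"] < r["target"]
    }


if __name__ == "__main__":
    for tier, gap in tiers_below_target(TIER_RESULTS).items():
        print(f"{tier}: {gap:.3f} below target")
```

On these numbers, T1 and T2 fall short of their targets (by 0.166 and 0.013 respectively), while T3 and T4 exceed theirs.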

Training Data

  • AI4Privacy (~43K examples, English subset)
  • NVIDIA Nemotron-PII (100K examples)
  • Gretel Synthetic PII Finance (26K examples)
  • Gretel PII Masking EN v1 (50K examples)
  • Synthetic data for rare entity types (22K examples)
  • Total: ~241K examples
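
As a quick check on the training mix, the per-dataset counts listed above can be summed and expressed as shares. This is a rough sketch: the counts are the rounded figures from the list (AI4Privacy is approximate), and the names TRAINING_SETS_K and share are illustrative.

```python
# Example counts in thousands, as listed above (AI4Privacy is approximate).
# TRAINING_SETS_K and share are illustrative names only.
TRAINING_SETS_K = {
    "AI4Privacy (English subset)": 43,
    "NVIDIA Nemotron-PII": 100,
    "Gretel Synthetic PII Finance": 26,
    "Gretel PII Masking EN v1": 50,
    "Synthetic rare-entity data": 22,
}

total_k = sum(TRAINING_SETS_K.values())


def share(name):
    """Fraction of the training mix contributed by one dataset."""
    return TRAINING_SETS_K[name] / total_k


if __name__ == "__main__":
    print(f"total: ~{total_k}K examples")
    for name in TRAINING_SETS_K:
        print(f"{name}: {share(name):.1%}")
```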

Entity Types (41)

Tier 1: Critical PII

SSN, CREDIT_CARD, BANK_ACCOUNT, PASSPORT_NUMBER, DRIVERS_LICENSE, TAX_ID

Tier 2: High Sensitivity

PERSON, EMAIL, PHONE, DATE_OF_BIRTH, STREET_ADDRESS, IP_ADDRESS

Tier 3: Moderate Sensitivity

USERNAME, DATE, LOCATION, ORGANIZATION, URL, LICENSE_PLATE, AGE, NATIONALITY, GENDER, RELIGION, MARITAL_STATUS

Tier 4: Domain-Specific

MEDICAL_RECORD, EMPLOYEE_ID, STUDENT_ID, ACCOUNT_NUMBER, PIN, PASSWORD, BIOMETRIC, VEHICLE_ID, DEVICE_ID, CRYPTO_WALLET, IBAN, SWIFT_CODE, INSURANCE_NUMBER, SALARY, CRIMINAL_RECORD, POLITICAL_AFFILIATION, SEXUAL_ORIENTATION, HEALTH_CONDITION
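
Downstream code often needs to map a predicted label back to its sensitivity tier, for example to redact only the most sensitive findings. The sketch below builds that lookup from the lists above. The names TIERS, ENTITY_TO_TIER and filter_by_tier are illustrative, and the detected-entity format (a dict with "label" and "text" keys) is an assumption, not the model's documented output.

```python
# Entity labels grouped by tier, copied from the lists above.
# TIERS, ENTITY_TO_TIER and filter_by_tier are illustrative names,
# not part of the model's API.
TIERS = {
    1: ["SSN", "CREDIT_CARD", "BANK_ACCOUNT", "PASSPORT_NUMBER",
        "DRIVERS_LICENSE", "TAX_ID"],
    2: ["PERSON", "EMAIL", "PHONE", "DATE_OF_BIRTH", "STREET_ADDRESS",
        "IP_ADDRESS"],
    3: ["USERNAME", "DATE", "LOCATION", "ORGANIZATION", "URL",
        "LICENSE_PLATE", "AGE", "NATIONALITY", "GENDER", "RELIGION",
        "MARITAL_STATUS"],
    4: ["MEDICAL_RECORD", "EMPLOYEE_ID", "STUDENT_ID", "ACCOUNT_NUMBER",
        "PIN", "PASSWORD", "BIOMETRIC", "VEHICLE_ID", "DEVICE_ID",
        "CRYPTO_WALLET", "IBAN", "SWIFT_CODE", "INSURANCE_NUMBER",
        "SALARY", "CRIMINAL_RECORD", "POLITICAL_AFFILIATION",
        "SEXUAL_ORIENTATION", "HEALTH_CONDITION"],
}

# Flat lookup: label -> tier number (1 is the most sensitive).
ENTITY_TO_TIER = {
    label: tier for tier, labels in TIERS.items() for label in labels
}


def filter_by_tier(entities, max_tier):
    """Keep entities whose tier number is <= max_tier.

    `entities` is assumed to be a list of dicts with a "label" key.
    """
    return [e for e in entities if ENTITY_TO_TIER[e["label"]] <= max_tier]


if __name__ == "__main__":
    detected = [
        {"label": "SSN", "text": "123-45-6789"},
        {"label": "AGE", "text": "42"},
    ]
    # Keep only Tier 1 and Tier 2 findings.
    print(filter_by_tier(detected, max_tier=2))
```

Building the lookup from the tier lists also gives a cheap consistency check: the flattened mapping should contain exactly 41 labels.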

License

Apache 2.0

Safetensors model size: 71.1M params (tensor type F32)