v1.4: epoch 3 best checkpoint (F1=0.889, 41 entity types, 241K training examples)
metadata
license: apache-2.0
language:
  - en
tags:
  - pii
  - ner
  - token-classification
  - privacy
  - deberta
datasets:
  - ai4privacy/pii-masking-200k
  - nvidia/Nemotron-PII
  - gretelai/synthetic_pii_finance_multilingual
  - gretelai/gretel-pii-masking-en-v1
metrics:
  - f1
  - precision
  - recall
pipeline_tag: token-classification
model-index:
  - name: pii-small-en
    results:
      - task:
          type: token-classification
          name: PII NER
        metrics:
          - name: F1
            type: f1
            value: 0.8894
          - name: Precision
            type: precision
            value: 0.8698
          - name: Recall
            type: recall
            value: 0.9098

DataFog PII-Small-EN (v1.4)

A compact named entity recognition (NER) model that detects personally identifiable information (PII) in English text.

Architecture

  • Backbone: DeBERTa-v3-xsmall (22M params)
  • Head: CharCNN + CRF
  • Total params: ~45M
  • Entity types: 41 PII categories across 4 sensitivity tiers
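A CRF head scores whole tag sequences instead of tagging each token independently, which lets it rule out invalid transitions. The sketch below illustrates this with plain Viterbi decoding. It assumes a BIO tagging scheme, and the tag set, transition scores, and emission scores are toy values invented for illustration; the model's actual label scheme and learned parameters are not documented in this card.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring sequence of tag indices.

    emissions:   per-token tag scores, shape [seq_len][num_tags]
    transitions: transitions[i][j] is the score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    # Best score of any path ending in each tag at the first token.
    scores = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        step_scores, step_back = [], []
        for j in range(num_tags):
            # Best previous tag for reaching tag j at this token.
            best_prev = max(range(num_tags),
                            key=lambda i: scores[i] + transitions[i][j])
            step_scores.append(scores[best_prev] + transitions[best_prev][j] + emit[j])
            step_back.append(best_prev)
        scores = step_scores
        backpointers.append(step_back)
    # Follow the backpointers from the best final tag to recover the path.
    path = [max(range(num_tags), key=lambda j: scores[j])]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return path


# Hypothetical three-tag label set (assumption: BIO scheme).
TAGS = ["O", "B-PERSON", "I-PERSON"]
FORBIDDEN = -1e4
TRANSITIONS = [
    [0.0, 0.0, FORBIDDEN],  # from O: an entity cannot start with I-PERSON
    [0.0, 0.0, 0.0],        # from B-PERSON
    [0.0, 0.0, 0.0],        # from I-PERSON
]
# Toy emission scores for the tokens "call", "Jane", "Doe".
EMISSIONS = [
    [2.0, 0.1, 0.3],
    [0.2, 1.0, 1.5],  # I-PERSON scores highest locally for "Jane"
    [0.1, 0.4, 2.0],
]

decoded = [TAGS[i] for i in viterbi_decode(EMISSIONS, TRANSITIONS)]
print(decoded)  # ['O', 'B-PERSON', 'I-PERSON']
```

A per-token argmax would tag "Jane" as I-PERSON directly after O. The transition penalty steers the decoder to B-PERSON instead, which is the kind of structural consistency a CRF layer is typically added for.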

Performance (v1.4, Epoch 3)

| Tier | Recall | Target recall |
|------|--------|---------------|
| T1 (Critical: SSN, Credit Card, etc.) | 0.814 | >=0.98 |
| T2 (High: Person, Email, Phone, etc.) | 0.937 | >=0.95 |
| T3 (Moderate: Username, Date, Location, etc.) | 0.945 | >=0.90 |
| T4 (Domain-specific: Employee ID, Crypto, etc.) | 0.937 | >=0.85 |

Overall F1: 0.889
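The figures above can be cross-checked with a few lines of arithmetic. This sketch compares each tier's recall with its target and recomputes F1 as the harmonic mean of precision and recall. It assumes the precision (0.8698) and recall (0.9098) in the model card metadata are the overall values behind the reported F1; the variable names are illustrative.

```python
# (recall, target) per tier, copied from the performance table.
TIER_RECALL = {
    "T1": (0.814, 0.98),
    "T2": (0.937, 0.95),
    "T3": (0.945, 0.90),
    "T4": (0.937, 0.85),
}

# Tiers whose measured recall is below the stated target.
below_target = [tier for tier, (recall, target) in TIER_RECALL.items()
                if recall < target]

# F1 is the harmonic mean of precision and recall.
precision, recall = 0.8698, 0.9098
f1 = 2 * precision * recall / (precision + recall)

print(below_target)   # ['T1', 'T2']
print(round(f1, 3))   # 0.889
```

By these numbers, T3 and T4 meet their recall targets in v1.4, while T1 and T2 fall short of theirs.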

Training Data

  • AI4Privacy (~43K examples, English subset)
  • NVIDIA Nemotron-PII (100K examples)
  • Gretel Synthetic PII Finance (26K examples)
  • Gretel PII Masking EN v1 (50K examples)
  • Synthetic data for rare entity types (22K examples)
  • Total: ~241K examples
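As a quick sanity check on the mix, the listed counts can be summed and turned into percentage shares. The counts are the rounded figures from the list above (in thousands), so the shares are approximate.

```python
# Approximate example counts in thousands, as listed above.
TRAINING_EXAMPLES_K = {
    "AI4Privacy (English subset)": 43,
    "NVIDIA Nemotron-PII": 100,
    "Gretel Synthetic PII Finance": 26,
    "Gretel PII Masking EN v1": 50,
    "Synthetic rare-entity data": 22,
}

total_k = sum(TRAINING_EXAMPLES_K.values())
shares = {name: round(100 * count / total_k, 1)
          for name, count in TRAINING_EXAMPLES_K.items()}

print(total_k)  # 241
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```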

Entity Types (41)

Tier 1 — Critical PII

SSN, CREDIT_CARD, BANK_ACCOUNT, PASSPORT_NUMBER, DRIVERS_LICENSE, TAX_ID

Tier 2 — High Sensitivity

PERSON, EMAIL, PHONE, DATE_OF_BIRTH, STREET_ADDRESS, IP_ADDRESS

Tier 3 — Moderate Sensitivity

USERNAME, DATE, LOCATION, ORGANIZATION, URL, LICENSE_PLATE, AGE, NATIONALITY, GENDER, RELIGION, MARITAL_STATUS

Tier 4 — Domain-Specific

MEDICAL_RECORD, EMPLOYEE_ID, STUDENT_ID, ACCOUNT_NUMBER, PIN, PASSWORD, BIOMETRIC, VEHICLE_ID, DEVICE_ID, CRYPTO_WALLET, IBAN, SWIFT_CODE, INSURANCE_NUMBER, SALARY, CRIMINAL_RECORD, POLITICAL_AFFILIATION, SEXUAL_ORIENTATION, HEALTH_CONDITION
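One way to use the tiers downstream is a label-to-tier lookup that drives a redaction policy. The sketch below is illustrative only: the `(start, end, label)` span format and the `redact` helper are assumptions, not the model's documented output or API. Only the label names and tier assignments come from the lists above.

```python
ENTITY_TIERS = {
    1: ["SSN", "CREDIT_CARD", "BANK_ACCOUNT", "PASSPORT_NUMBER",
        "DRIVERS_LICENSE", "TAX_ID"],
    2: ["PERSON", "EMAIL", "PHONE", "DATE_OF_BIRTH", "STREET_ADDRESS",
        "IP_ADDRESS"],
    3: ["USERNAME", "DATE", "LOCATION", "ORGANIZATION", "URL",
        "LICENSE_PLATE", "AGE", "NATIONALITY", "GENDER", "RELIGION",
        "MARITAL_STATUS"],
    4: ["MEDICAL_RECORD", "EMPLOYEE_ID", "STUDENT_ID", "ACCOUNT_NUMBER",
        "PIN", "PASSWORD", "BIOMETRIC", "VEHICLE_ID", "DEVICE_ID",
        "CRYPTO_WALLET", "IBAN", "SWIFT_CODE", "INSURANCE_NUMBER", "SALARY",
        "CRIMINAL_RECORD", "POLITICAL_AFFILIATION", "SEXUAL_ORIENTATION",
        "HEALTH_CONDITION"],
}

# Reverse lookup: entity label -> tier number.
TIER_OF = {label: tier
           for tier, labels in ENTITY_TIERS.items()
           for label in labels}


def redact(text, spans, max_tier=2):
    """Replace spans whose tier is <= max_tier with a [LABEL] placeholder.

    spans: non-overlapping (start, end, label) character offsets
           (hypothetical format, assumed for this example).
    """
    pieces, cursor = [], 0
    for start, end, label in sorted(spans):
        if TIER_OF[label] <= max_tier:
            pieces.append(text[cursor:start])
            pieces.append(f"[{label}]")
            cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "Jane Doe, age 34, SSN 123-45-6789"
spans = [(0, 8, "PERSON"), (14, 16, "AGE"), (22, 33, "SSN")]
print(len(TIER_OF))         # 41
print(redact(text, spans))  # [PERSON], age 34, SSN [SSN]
```

With `max_tier=2`, Tier 1 and Tier 2 entities are masked and the Tier 3 `AGE` span is left in place. Raising `max_tier` to 4 would mask every detected entity.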

License

Apache 2.0