PII Detection: DistilBERT (English)

distilbert-base-cased fine-tuned on the English subset of ai4privacy/pii-masking-300k to detect PII (personally identifiable information).

Results

  • F1: 0.96
  • 30k training examples, 5k validation, 6 epochs on a Colab T4

Entity types

PERSON, EMAIL, PHONE, USERNAME, ID_NUMBER, ADDRESS, IP_ADDRESS, URL, DATE_TIME, PASSWORD, DEMOGRAPHIC, OTHER

Usage

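from transformers import pipeline

# "simple" aggregation merges sub-word tokens into whole entity spans
ner = pipeline("ner", model="munibz/pii-distilbert-en", aggregation_strategy="simple")

text = "Hi, I'm John Smith. Email me at john.smith@gmail.com."
# Each result is a dict with entity_group, score, word, start, and end
print(ner(text))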
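
A common next step is masking the detected spans. The sketch below assumes the pipeline returns dicts carrying `entity_group`, `score`, `start`, and `end`, as the token-classification pipeline does when an aggregation strategy is set. The `redact` helper, the `min_score` threshold, and the sample predictions are hypothetical and written by hand for illustration; they are not real model output.

```python
def redact(text, entities, min_score=0.5):
    """Replace each detected span with a [LABEL] placeholder.

    `entities` is expected to look like the aggregated pipeline output:
    dicts with "entity_group", "score", "start", and "end" keys.
    """
    kept = [e for e in entities if e["score"] >= min_score]
    # Work right to left so earlier character offsets stay valid
    for ent in sorted(kept, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


def fake_entity(text, label, word, score):
    """Build a hand-written stand-in for one pipeline prediction."""
    start = text.index(word)
    return {"entity_group": label, "score": score, "word": word,
            "start": start, "end": start + len(word)}


text = "Hi, I'm John Smith. Email me at john.smith@gmail.com."
# Illustrative predictions only; scores are made up
sample_entities = [
    fake_entity(text, "PERSON", "John Smith", 0.99),
    fake_entity(text, "EMAIL", "john.smith@gmail.com", 0.99),
]
print(redact(text, sample_entities))
```

With real predictions, `ner(text)` would be passed in place of `sample_entities`. Replacing spans from the end of the string backwards avoids recomputing offsets after each substitution.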

Training

  • Base model: distilbert-base-cased
  • Max length: 384
  • Batch size: 32
  • Learning rate: 3e-5 (cosine schedule, 10% warmup)
  • Epochs: 6 (early stopping patience 2)
  • FP16 on T4 GPU
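
To make the learning-rate line concrete, here is a rough sketch of a linear-warmup, cosine-decay schedule using the numbers listed above. It assumes exactly 30,000 training examples, no gradient accumulation, and a schedule planned for all 6 epochs even if early stopping ends the run sooner. The function name `lr_at` is hypothetical, and the author's actual training script may differ in detail.

```python
import math


def lr_at(step, total_steps, base_lr=3e-5, warmup_ratio=0.10):
    """Learning rate at a given optimizer step.

    Linear warmup from 0 to base_lr, then half-cosine decay back to 0.
    """
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Assumed sizes: 30,000 examples, batch size 32, 6 epochs
steps_per_epoch = math.ceil(30_000 / 32)
total_steps = steps_per_epoch * 6

for step in (0, total_steps // 10, total_steps // 2, total_steps):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

Under these assumptions the peak rate of 3e-5 is reached at the end of warmup, roughly a tenth of the way through training, and the rate falls to zero at the final step.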

Model size: 65.2M parameters (Safetensors, F32 tensors)