---
license: mit
base_model: distilbert-base-uncased
tags:
- token-classification
- pii
- privacy
- personal-information
- bert
- distilbert
language:
- en
pipeline_tag: token-classification
library_name: transformers
datasets:
- ai4privacy/pii-masking-200k
metrics:
- f1
- precision
- recall
widget:
- text: "Hi, my name is John Smith and my email is john.smith@company.com"
  example_title: "Example with PII"
---

# BERT PII Detection Model

Fine-tuned DistilBERT model for detecting and classifying Personally Identifiable Information (PII) in text.

## Model Details

- **Base Model**: `distilbert-base-uncased`
- **Task**: Token Classification (Named Entity Recognition)
- **Languages**: English
- **License**: MIT
- **Fine-tuned on**: AI4Privacy PII-200k dataset (`ai4privacy/pii-masking-200k`)

## Supported PII Entity Types

This model detects 56 PII entity types, including:

**Personal Information:**
- FIRSTNAME, LASTNAME, MIDDLENAME
- EMAIL, PHONENUMBER, USERNAME
- DATE, TIME, DOB, AGE

**Address Information:**
- STREET, CITY, STATE, COUNTY
- ZIPCODE, BUILDINGNUMBER
- SECONDARYADDRESS

**Financial Information:**
- CREDITCARDNUMBER, CREDITCARDISSUER, CREDITCARDCVV
- ACCOUNTNAME, ACCOUNTNUMBER, IBAN, BIC
- AMOUNT, CURRENCY, CURRENCYCODE, CURRENCYSYMBOL

**Identification:**
- SSN, PIN, PASSWORD
- IP, IPV4, IPV6, MAC
- ETHEREUMADDRESS, BITCOINADDRESS, LITECOINADDRESS

**Professional Information:**
- JOBTITLE, JOBTYPE, JOBAREA, COMPANYNAME
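
The groups above are a summary; the exact label set is stored in the checkpoint configuration. The following is a minimal sketch for listing the distinct entity types from an id-to-label mapping. It assumes BIO-style labels (`B-EMAIL`, `I-EMAIL`, `O`), which this card does not confirm, and `sample_id2label` is a hypothetical mapping used only for illustration.

```python
def entity_types(id2label):
    """Return the sorted entity types found in an id-to-label mapping.

    Assumes BIO-style labels such as "B-EMAIL" / "I-EMAIL". Labels without
    a prefix are kept as-is, and the outside label "O" is dropped.
    """
    types = set()
    for label in id2label.values():
        if label == "O":
            continue
        # Strip a leading "B-" or "I-" prefix if present.
        if label[:2] in ("B-", "I-"):
            label = label[2:]
        types.add(label)
    return sorted(types)


# Hypothetical mapping for illustration only; not the model's real label set.
sample_id2label = {0: "O", 1: "B-EMAIL", 2: "I-EMAIL", 3: "B-FIRSTNAME", 4: "SSN"}
print(entity_types(sample_id2label))
```

With a loaded model, the same helper could be called on `model.config.id2label`.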


## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load model and tokenizer
model_name = "SoelMgd/bert-pii-detection"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create NER pipeline
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # merge sub-word tokens into entity spans
)

# Example usage
text = "Hi, my name is John Smith and my email is john.smith@company.com"
entities = ner_pipeline(text)
print(entities)
```

## Training Data

- **Dataset**: AI4Privacy PII-200k
- **Size**: ~209k examples
- **Languages**: English, French, German, Italian (this model: English only)
- **Entity Types**: 56 different PII categories

## Performance

The model is reported to achieve good precision and recall across entity types. This card does not list numeric scores for the `f1`, `precision`, and `recall` metrics.

## Intended Use

This model is designed for:
- PII detection and masking in text
- Privacy compliance applications
- Data anonymization pipelines
- Content moderation systems
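
As an illustration of the masking use case, here is a minimal sketch that replaces detected spans with `[LABEL]` placeholders. It assumes entities in the aggregated pipeline output format (dicts with `entity_group`, `start`, and `end`) and non-overlapping spans. The `mask_pii` helper and the hand-written entity list are hypothetical; real predictions may differ.

```python
def mask_pii(text, entities):
    """Replace each detected span in `text` with a "[LABEL]" placeholder."""
    # Work from the end of the string so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Hand-written entities in the pipeline's aggregated output format
# (illustrative only; not actual model predictions).
text = "Hi, my name is John Smith and my email is john.smith@company.com"
entities = [
    {"entity_group": "FIRSTNAME", "start": 15, "end": 19},
    {"entity_group": "LASTNAME", "start": 20, "end": 25},
    {"entity_group": "EMAIL", "start": 42, "end": 64},
]
masked = mask_pii(text, entities)
print(masked)
```

In practice, `entities` would come from a token-classification pipeline call on `text`, optionally filtered by `score` before masking.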

## Limitations

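- Trained on English text only; other languages are not supported
- May not generalize to domain-specific jargon
- Performance may vary on very short or very long texts; inputs beyond DistilBERT's 512-token limit must be truncated or split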
- Should be validated on your specific use case
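
One way to validate the model on your own data is to compare predicted spans against hand-labeled ones. The sketch below computes exact-match precision, recall, and F1 over `(start, end, label)` tuples. It is an assumption about how you might evaluate, not the procedure used for this model, and the `span_prf` helper and sample annotations are hypothetical.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    # F1 is the harmonic mean; define it as 0 when there are no matches.
    f1 = (2 * precision * recall / (precision + recall)) if true_pos else 0.0
    return precision, recall, f1


# Hypothetical annotations for one document (illustrative values only).
gold = [(15, 19, "FIRSTNAME"), (20, 25, "LASTNAME"), (42, 64, "EMAIL")]
predicted = [(15, 19, "FIRSTNAME"), (42, 64, "EMAIL"), (0, 2, "USERNAME")]
precision, recall, f1 = span_prf(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is strict: a prediction with the right label but a boundary off by one character counts as both a false positive and a false negative, so partial-overlap scoring may be a better fit for some masking applications.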