---
license: mit
base_model: distilbert-base-uncased
tags:
- token-classification
- pii
- privacy
- personal-information
- bert
- distilbert
language:
- en
pipeline_tag: token-classification
library_name: transformers
datasets:
- ai4privacy/pii-masking-200k
metrics:
- f1
- precision
- recall
widget:
- text: "Hi, my name is John Smith and my email is john.smith@company.com"
  example_title: "Example with PII"
---
# BERT PII Detection Model
Fine-tuned DistilBERT model for detecting and classifying Personally Identifiable Information (PII) in text.
## Model Details
- **Base Model**: `distilbert-base-uncased`
- **Task**: Token Classification (Named Entity Recognition)
- **Languages**: English
- **License**: MIT
- **Fine-tuned on**: AI4Privacy PII-200k dataset (`ai4privacy/pii-masking-200k`)
## Supported PII Entity Types
This model can detect 56 different types of PII entities, including:

**Personal Information:**
- FIRSTNAME, LASTNAME, MIDDLENAME
- EMAIL, PHONENUMBER, USERNAME
- DATE, TIME, DOB, AGE

**Address Information:**
- STREET, CITY, STATE, COUNTY
- ZIPCODE, BUILDINGNUMBER
- SECONDARYADDRESS

**Financial Information:**
- CREDITCARDNUMBER, CREDITCARDISSUER, CREDITCARDCVV
- ACCOUNTNAME, ACCOUNTNUMBER, IBAN, BIC
- AMOUNT, CURRENCY, CURRENCYCODE, CURRENCYSYMBOL

**Identification:**
- SSN, PIN, PASSWORD
- IP, IPV4, IPV6, MAC
- ETHEREUMADDRESS, BITCOINADDRESS, LITECOINADDRESS

**Professional Information:**
- JOBTITLE, JOBTYPE, JOBAREA, COMPANYNAME
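
The lists above are only a selection of the 56 types. The full label set ships with the model configuration (`model.config.id2label` in `transformers`). The following is a minimal sketch for turning such a mapping into a list of entity types. It assumes the labels may carry `B-`/`I-` prefixes and an `O` class, which this card does not confirm; the `example_id2label` mapping is hypothetical and used for illustration only.

```python
def entity_types(id2label):
    """Return the sorted, unique entity types found in an id -> label mapping."""
    types = set()
    for label in id2label.values():
        if label == "O":
            # "O" marks tokens outside any entity (assumed convention).
            continue
        # Drop a leading "B-" or "I-" prefix if present; keep other labels as-is.
        if label[:2] in ("B-", "I-"):
            label = label[2:]
        types.add(label)
    return sorted(types)


# Hypothetical mapping for illustration; with the real model you would pass
# model.config.id2label instead.
example_id2label = {0: "O", 1: "B-FIRSTNAME", 2: "I-FIRSTNAME", 3: "B-EMAIL"}
print(entity_types(example_id2label))
```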
## Usage
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load model and tokenizer
model_name = "SoelMgd/bert-pii-detection"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create NER pipeline; "simple" aggregation groups adjacent tokens
# of the same entity into a single span
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

# Example usage
text = "Hi, my name is John Smith and my email is john.smith@company.com"
entities = ner_pipeline(text)
print(entities)
```
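
With an aggregation strategy set, the pipeline returns a list of dicts with `entity_group`, `score`, `word`, `start`, and `end` keys. One way to use that output for masking is sketched below. The `mask_pii` helper and its `min_score` parameter are hypothetical names, the spans are assumed not to overlap, and the sample entities are hand-written stand-ins for pipeline output, not actual model predictions.

```python
def mask_pii(text, entities, min_score=0.0):
    """Replace each detected span with a [LABEL] placeholder.

    Assumes the spans do not overlap.
    """
    kept = [e for e in entities if e["score"] >= min_score]
    # Replace from the end of the text so earlier offsets stay valid.
    for ent in sorted(kept, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[:ent["start"]] + placeholder + text[ent["end"]:]
    return text


sample_text = "Hi, my name is John Smith and my email is john.smith@company.com"


def fake_entity(word, label, score):
    """Build a pipeline-style dict for a word found in sample_text (illustration only)."""
    start = sample_text.index(word)
    return {"entity_group": label, "score": score, "word": word,
            "start": start, "end": start + len(word)}


# Hand-written entities with made-up scores, standing in for ner_pipeline(sample_text).
sample_entities = [
    fake_entity("John", "FIRSTNAME", 0.99),
    fake_entity("Smith", "LASTNAME", 0.98),
    fake_entity("john.smith@company.com", "EMAIL", 0.99),
]

print(mask_pii(sample_text, sample_entities))
```

Raising `min_score` trades recall for precision: fewer spans are masked, so fewer false positives are removed from the text, but more real PII may slip through.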
## Training Data
- **Dataset**: AI4Privacy PII-200k
- **Size**: ~209k examples
- **Languages**: English, French, German, Italian (this model: English only)
- **Entity Types**: 56 different PII categories
## Performance
The model is reported to reach good precision and recall across entity types, but this card does not list numeric scores. Measure precision, recall, and F1 on a held-out sample of your own data before relying on it.
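
This card gives no numeric scores, so a quick local check can be useful. The sketch below computes exact-match, span-level precision, recall, and F1 from `(start, end, label)` tuples. The `span_prf` helper and the sample spans are hypothetical; exact span matching is only one possible criterion and may differ from how the model was originally evaluated.

```python
def span_prf(predicted, gold):
    """Exact-match span-level precision, recall, and F1.

    Both arguments are iterables of (start, end, label) tuples.
    """
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up spans for illustration: two correct predictions, one spurious, one missed.
gold_spans = [(15, 19, "FIRSTNAME"), (20, 25, "LASTNAME"), (42, 64, "EMAIL")]
pred_spans = [(15, 19, "FIRSTNAME"), (42, 64, "EMAIL"), (0, 2, "USERNAME")]

precision, recall, f1 = span_prf(pred_spans, gold_spans)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```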
## Intended Use
This model is designed for:
- PII detection and masking in text
- Privacy compliance applications
- Data anonymization pipelines
- Content moderation systems
## Limitations
- Trained on English text only; other languages are out of scope
- May not generalize to domain-specific jargon
- Performance may vary on very short or very long texts
- Should be validated on your specific use case
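
For long inputs, one common workaround is to split the text into overlapping windows and shift the detected spans back to full-text offsets. The sketch below does this with character windows as a rough proxy for the token limit (DistilBERT accepts at most 512 tokens per input). The `chunk_text` and `detect_long` helpers, their parameters, and the regex-based `fake_detect` stand-in are all hypothetical; in practice `detect` would be the `ner_pipeline` from the Usage section.

```python
import re


def chunk_text(text, max_chars=1000, overlap=100):
    """Yield (offset, chunk) pairs that cover text with overlapping windows."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    step = max_chars - overlap
    start = 0
    while start < len(text):
        yield start, text[start:start + max_chars]
        if start + max_chars >= len(text):
            break  # this window already reaches the end of the text
        start += step


def detect_long(text, detect, max_chars=1000, overlap=100):
    """Run detect(chunk) per window and map spans back to full-text offsets."""
    seen = set()
    results = []
    for offset, chunk in chunk_text(text, max_chars, overlap):
        for ent in detect(chunk):
            shifted = dict(ent, start=ent["start"] + offset, end=ent["end"] + offset)
            key = (shifted["start"], shifted["end"], shifted["entity_group"])
            if key not in seen:  # drop duplicates found in overlapping regions
                seen.add(key)
                results.append(shifted)
    return results


def fake_detect(chunk):
    """Stand-in detector for illustration: flags anything shaped like x@y."""
    return [{"entity_group": "EMAIL", "start": m.start(), "end": m.end()}
            for m in re.finditer(r"\S+@\S+", chunk)]


long_text = "x " * 30 + "a@b.co" + " y" * 30
found = detect_long(long_text, fake_detect, max_chars=40, overlap=10)
print(found)
```

An entity that straddles a window boundary can still be truncated in one window, so the overlap should be larger than the longest entity you expect. Truncated spans from boundary windows are not merged here.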