---
license: mit
base_model: distilbert-base-uncased
tags:
- token-classification
- pii
- privacy
- personal-information
- bert
- distilbert
language:
- en
pipeline_tag: token-classification
library_name: transformers
datasets:
- ai4privacy/pii-masking-200k
metrics:
- f1
- precision
- recall
widget:
- text: "Hi, my name is John Smith and my email is john.smith@company.com"
  example_title: "Example with PII"
---

# BERT PII Detection Model

Fine-tuned DistilBERT model for detecting and classifying Personally Identifiable Information (PII).

## Model Details

- **Base Model**: `distilbert-base-uncased`
- **Task**: Token Classification (Named Entity Recognition)
- **Languages**: English
- **License**: MIT
- **Fine-tuned on**: AI4Privacy PII-200k dataset (`ai4privacy/pii-masking-200k`)

## Supported PII Entity Types

This model can detect 56 different types of PII entities, including:

**Personal Information:**
- FIRSTNAME, LASTNAME, MIDDLENAME
- EMAIL, PHONENUMBER, USERNAME
- DATE, TIME, DOB, AGE

**Address Information:**
- STREET, CITY, STATE, COUNTY
- ZIPCODE, BUILDINGNUMBER
- SECONDARYADDRESS

**Financial Information:**
- CREDITCARDNUMBER, CREDITCARDISSUER, CREDITCARDCVV
- ACCOUNTNAME, ACCOUNTNUMBER, IBAN, BIC
- AMOUNT, CURRENCY, CURRENCYCODE, CURRENCYSYMBOL

**Identification:**
- SSN, PIN, PASSWORD
- IP, IPV4, IPV6, MAC
- ETHEREUMADDRESS, BITCOINADDRESS, LITECOINADDRESS

**Professional Information:**
- JOBTITLE, JOBTYPE, JOBAREA, COMPANYNAME

## Usage

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_name = "SoelMgd/bert-pii-detection"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create the NER pipeline; "simple" aggregation merges sub-word tokens
# into whole entity spans with character offsets
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

# Example usage
text = "Hi, my name is John Smith and my email is john.smith@company.com"
entities = ner_pipeline(text)
print(entities)
```

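
With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries containing `entity_group`, `score`, `word`, `start`, and `end`. The sketch below shows one possible way to turn such a list into masked text. The helper names `mask_entities` and `fake_entity` are hypothetical and not part of this model or of `transformers`, and the sample entities are hand-written placeholders rather than real model output, so the example runs without downloading the model.

```python
from typing import Dict, List


def mask_entities(text: str, entities: List[Dict], min_score: float = 0.0) -> str:
    """Replace each detected span with a [LABEL] placeholder (illustrative helper)."""
    masked = text
    # Process spans right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["score"] < min_score:
            continue  # skip low-confidence detections
        masked = masked[: ent["start"]] + f"[{ent['entity_group']}]" + masked[ent["end"]:]
    return masked


text = "Hi, my name is John Smith and my email is john.smith@company.com"


def fake_entity(label: str, surface: str) -> Dict:
    """Build a pipeline-shaped dict for a substring of `text` (placeholder data)."""
    start = text.index(surface)
    return {
        "entity_group": label,
        "score": 1.0,  # placeholder score, not a model output
        "word": surface,
        "start": start,
        "end": start + len(surface),
    }


sample_entities = [
    fake_entity("FIRSTNAME", "John"),
    fake_entity("LASTNAME", "Smith"),
    fake_entity("EMAIL", "john.smith@company.com"),
]

masked = mask_entities(text, sample_entities)
print(masked)  # Hi, my name is [FIRSTNAME] [LASTNAME] and my email is [EMAIL]
```

In practice, `sample_entities` would be replaced by the output of `ner_pipeline(text)`. The right-to-left replacement assumes the spans do not overlap.
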
## Training Data

- **Dataset**: AI4Privacy PII-200k
- **Size**: ~209k examples
- **Languages**: English, French, German, Italian (this model: English only)
- **Entity Types**: 56 different PII categories

## Performance

The model achieves high performance on PII detection tasks with good precision and recall across different entity types.

## Intended Use

This model is designed for:
- PII detection and masking in text
- Privacy compliance applications
- Data anonymization pipelines
- Content moderation systems

## Limitations

- Trained primarily on English text
- May not generalize to domain-specific jargon
- Performance may vary on very short or very long texts
- Should be validated on your specific use case
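
One possible way to validate the model on your own data is to compare predicted spans against a small hand-labelled sample and compute exact-match precision, recall, and F1. The following is a minimal sketch under that assumption. The function `span_prf` and the `gold` and `pred` spans are hypothetical illustrations, not results from this model.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity label)


def span_prf(gold: Iterable[Span], pred: Iterable[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over labelled spans of one text."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative spans only: two predictions match the gold labels, one is spurious,
# and one gold span is missed.
gold = [(15, 19, "FIRSTNAME"), (20, 25, "LASTNAME"), (42, 64, "EMAIL")]
pred = [(15, 19, "FIRSTNAME"), (42, 64, "EMAIL"), (30, 32, "USERNAME")]

precision, recall, f1 = span_prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Predicted spans can be built from the pipeline output as `(e["start"], e["end"], e["entity_group"])`. Exact matching is strict, so a span that is off by one character counts as both a miss and a false positive.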