---
title: Multilingual PII Detection
emoji: 🌍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
---

# Multilingual PII Detection with BERT

This Space demonstrates a **multilingual BERT model** fine-tuned to detect Personally Identifiable Information (PII) in text across multiple languages.

## Model Details

- **Base Model**: [google-bert/bert-base-multilingual-uncased](https://huggingface.co/google-bert/bert-base-multilingual-uncased)
- **Task**: Token Classification / Named Entity Recognition (NER)
- **Number of Entity Types**: 39
- **Languages**: Supports 100+ languages, including English, Spanish, French, German, Chinese, and Arabic

## Detectable PII Types

The model can identify 39 different types of personal information:

### Identity Information
- NAME, USERNAME, PREFIX, GENDER, AGE, JOB, BLOODTYPE

### Contact Information
- EMAIL, PHONENUMBER, PHONEIMEI, STREET, ADDRESS, ZIPCODE, GEO, NEARBYGPSCOORDINATE

### Financial Information
- CREDITCARDNUMBER, CREDITCARDISSUER, IBAN, BIC, ACCOUNTNAME, CURRENCY, COINADDRESS

### Government IDs
- SSN (Social Security Number)

### Vehicle Information
- VEHICLEVIN (Vehicle Identification Number)
- VEHICLEVRM (Vehicle Registration Mark)

### Technical Information
- IP, MAC, URL, PASSWORD

### Organization
- ORG

### Temporal Information
- DATE, TIME

### Physical Attributes
- HEIGHT, WEIGHTS, COLOR

### Other
- NUM, ORDINALDIRECTION, MISC

## How It Works

1. **Input**: The user provides text that may contain personal information
2. **Tokenization**: The text is split into tokens using the BERT tokenizer
3. **Classification**: Each token is classified into one of the 39 entity types or "O" (no entity)
4. **Visualization**: Detected entities are highlighted, with a different color for each entity type

## Training Details

- Learning Rate: 5e-05
- Batch Size: 16 (train), 64 (eval)
- Epochs: 3
- Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
- Warmup Steps: 500

## Use Cases

- **Data Privacy**: Identify PII before sharing documents
- **Data Anonymization**: Find information that needs masking
- **Compliance**: Help meet GDPR and CCPA requirements
- **Security**: Detect sensitive information leaks

## Limitations

- Maximum input length: 512 tokens (longer inputs must be truncated or split)
- Optimized for English text; accuracy in other languages may be lower
- May not detect all variations of PII
- Performance depends on text format and quality

## Example Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "your-username/your-space-name"  # Update after deployment
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "My name is John Smith and my email is john@example.com"
# Truncate to the model's 512-token limit
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Highest-scoring label ID for each token
predictions = outputs.logits.argmax(dim=-1)
```

## License

Apache 2.0
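
## Sketch: Grouping Token Predictions into Entities

The classification step under How It Works yields one label per token, so adjacent tokens usually need to be merged before entities can be highlighted or masked. The following is a minimal sketch of that merge, not the Space's actual implementation. It assumes WordPiece-style tokens (continuation pieces start with `##`) and labels that are either plain type names or carry `B-`/`I-` prefixes; the label scheme of this model is not stated here, so both are handled. The helper names `strip_prefix` and `group_entities` and the sample tokens are hypothetical.

```python
from typing import List, Optional, Tuple

SPECIAL_TOKENS = ("[CLS]", "[SEP]", "[PAD]")


def strip_prefix(label: str) -> str:
    """Drop a B-/I- prefix if the label scheme uses one (an assumption)."""
    return label[2:] if label[:2] in ("B-", "I-") else label


def group_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens of the same entity type into (type, text) pairs."""
    entities: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_text = ""

    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue  # special tokens never belong to an entity
        entity_type = strip_prefix(label)
        is_subword = token.startswith("##")
        piece = token[2:] if is_subword else token

        if entity_type == "O":
            # Outside any entity: close the open span, if there is one
            if current_type is not None:
                entities.append((current_type, current_text))
            current_type, current_text = None, ""
        elif entity_type == current_type and not label.startswith("B-"):
            # Same entity continues: glue subwords, space-separate whole words
            current_text += piece if is_subword else " " + piece
        else:
            # A new entity starts: close the previous span first
            if current_type is not None:
                entities.append((current_type, current_text))
            current_type, current_text = entity_type, piece

    if current_type is not None:
        entities.append((current_type, current_text))
    return entities


# Hand-written tokens and labels for illustration (not real model output)
tokens = ["[CLS]", "my", "name", "is", "john", "smith", "##son", "[SEP]"]
labels = ["O", "O", "O", "O", "B-NAME", "I-NAME", "I-NAME", "O"]
entities = group_entities(tokens, labels)
print(entities)  # [('NAME', 'john smithson')]
```

Joining pieces with spaces is lossy for values such as email addresses, which the tokenizer splits at punctuation. When exact character spans are needed for masking, requesting `return_offsets_mapping=True` from a fast tokenizer and slicing the original text is likely the more reliable route.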