| title: Multilingual PII Detection | |
| emoji: 🌍 | |
| colorFrom: blue | |
| colorTo: purple | |
| sdk: gradio | |
| sdk_version: 4.44.0 | |
| app_file: app.py | |
| pinned: false | |
| license: apache-2.0 | |
| # Multilingual PII Detection with BERT | |
| This Space demonstrates a **multilingual BERT model** fine-tuned to detect Personally Identifiable Information (PII) in text across multiple languages. | |
| ## Model Details | |
| - **Base Model**: [google-bert/bert-base-multilingual-uncased](https://huggingface.co/google-bert/bert-base-multilingual-uncased) | |
| - **Task**: Token Classification / Named Entity Recognition (NER) | |
| - **Number of Entity Types**: 39 | |
| - **Languages**: Supports 100+ languages including English, Spanish, French, German, Chinese, Arabic, and more | |
| ## Detectable PII Types | |
| The model can identify 39 different types of personal information: | |
| ### Identity Information | |
| - NAME, USERNAME, PREFIX, GENDER, AGE, JOB, BLOODTYPE | |
| ### Contact Information | |
| - EMAIL, PHONENUMBER, PHONEIMEI, STREET, ADDRESS, ZIPCODE, GEO, NEARBYGPSCOORDINATE | |
| ### Financial Information | |
| - CREDITCARDNUMBER, CREDITCARDISSUER, IBAN, BIC, ACCOUNTNAME, CURRENCY, COINADDRESS | |
| ### Government IDs | |
| - SSN (Social Security Number) | |
| ### Vehicle Information | |
| - VEHICLEVIN (Vehicle Identification Number) | |
| - VEHICLEVRM (Vehicle Registration Mark) | |
| ### Technical Information | |
| - IP, MAC, URL, PASSWORD | |
| ### Organization | |
| - ORG | |
| ### Temporal Information | |
| - DATE, TIME | |
| ### Physical Attributes | |
| - HEIGHT, WEIGHTS, COLOR | |
| ### Other | |
| - NUM, ORDINALDIRECTION, MISC | |
| ## How It Works | |
| 1. **Input**: The user provides text that may contain personal information | |
| 2. **Tokenization**: The text is split into subword tokens using the BERT tokenizer | |
| 3. **Classification**: Each token is classified into one of the 39 entity types or "O" (no entity) | |
| 4. **Visualization**: Detected entities are highlighted in the text, with a different color per entity type | |
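The hand-off from step 3 to step 4 can be sketched as follows. This is a minimal illustration, not the Space's actual `app.py`: it assumes the detected entities arrive as dictionaries with `entity_group`, `start`, and `end` keys (the shape produced by the `transformers` token-classification pipeline when an aggregation strategy is set), and it emits a list of `(segment, label)` pairs, which is the input shape Gradio's `HighlightedText` component accepts. The function name `to_highlight_segments` is hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple


def to_highlight_segments(
    text: str, entities: List[Dict[str, Any]]
) -> List[Tuple[str, Optional[str]]]:
    """Split text into (segment, label) pairs; label is None outside entities."""
    segments: List[Tuple[str, Optional[str]]] = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        start, end = ent["start"], ent["end"]
        if start < cursor:
            # Skip a span that overlaps one already emitted
            continue
        if start > cursor:
            segments.append((text[cursor:start], None))
        segments.append((text[start:end], ent["entity_group"]))
        cursor = end
    if cursor < len(text):
        segments.append((text[cursor:], None))
    return segments


text = "My name is John Smith and my email is john@example.com"
name, email = "John Smith", "john@example.com"
# Hand-written spans standing in for model output (assumption for illustration)
entities = [
    {"entity_group": "NAME", "start": text.index(name), "end": text.index(name) + len(name)},
    {"entity_group": "EMAIL", "start": text.index(email), "end": text.index(email) + len(email)},
]
segments = to_highlight_segments(text, entities)
```

Unlabeled stretches carry `None`, so a renderer can leave them uncolored and assign one color per entity type.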
| ## Training Details | |
| - Learning Rate: 5e-05 | |
| - Batch Size: 16 (train), 64 (eval) | |
| - Epochs: 3 | |
| - Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08) | |
| - Warmup Steps: 500 | |
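To show how the learning rate and warmup steps above interact, here is a small sketch of a linear warmup followed by linear decay. The scheduler type is an assumption (the list above does not name one; linear decay is the default in the Hugging Face `Trainer`), and `total_steps=10_000` is a hypothetical value, since the real number depends on the dataset size.

```python
def lr_at_step(
    step: int,
    base_lr: float = 5e-5,      # learning rate from the list above
    warmup_steps: int = 500,    # warmup steps from the list above
    total_steps: int = 10_000,  # hypothetical; depends on dataset size
) -> float:
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for s in (0, 250, 500, 5_000, 10_000):
    print(s, lr_at_step(s))
```

Under these assumptions the rate peaks at 5e-05 on step 500 and reaches zero at the final step.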
| ## Use Cases | |
| - **Data Privacy**: Identify PII before sharing documents | |
| - **Data Anonymization**: Find information that needs masking | |
| - **Compliance**: Help meet GDPR and CCPA requirements | |
| - **Security**: Detect sensitive information leaks | |
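For the anonymization use case, masking can be sketched as a plain string operation once entity spans are available. The example below assumes non-overlapping spans given as dictionaries with `entity_group`, `start`, and `end` keys; the function name `mask_pii` and the `[LABEL]` placeholder style are illustrative choices, not part of this Space.

```python
from typing import Any, Dict, List


def mask_pii(text: str, entities: List[Dict[str, Any]]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    masked = text
    # Work from the last span to the first so earlier offsets stay valid
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        masked = masked[: ent["start"]] + placeholder + masked[ent["end"] :]
    return masked


text = "My name is John Smith and my email is john@example.com"
name, email = "John Smith", "john@example.com"
# Hand-written spans standing in for model output (assumption for illustration)
entities = [
    {"entity_group": "NAME", "start": text.index(name), "end": text.index(name) + len(name)},
    {"entity_group": "EMAIL", "start": text.index(email), "end": text.index(email) + len(email)},
]
masked = mask_pii(text, entities)
print(masked)
```

Replacing from the end of the string backwards avoids having to recompute offsets after each substitution.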
| ## Limitations | |
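| - Maximum input length: 512 tokens; longer text must be truncated or split before inference | |
| - Optimized for English text; detection quality in other languages may be lower | |
| - May not detect all variations of PII | |
| - Performance depends on text format and quality | |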
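One common way to work around the 512-token limit is to split long input into overlapping windows and run the model on each. The sketch below operates on a plain list of token IDs; the function name `split_into_windows` and the defaults (510 tokens per window, leaving room for the `[CLS]` and `[SEP]` special tokens, and a 128-token overlap) are assumptions for illustration, not settings used by this Space.

```python
from typing import List


def split_into_windows(
    token_ids: List[int], max_len: int = 510, stride: int = 128
) -> List[List[int]]:
    """Split token IDs into windows of at most max_len tokens.

    Consecutive windows share `stride` tokens so that an entity cut by
    one window boundary appears whole in the neighbouring window.
    """
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    if not token_ids:
        return []
    windows: List[List[int]] = []
    start = 0
    while True:
        windows.append(token_ids[start : start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += max_len - stride
    return windows


ids = list(range(1000))  # stand-in for real token IDs
windows = split_into_windows(ids)
print([len(w) for w in windows])
```

Hugging Face tokenizers can produce similar windows directly via `return_overflowing_tokens=True` together with `stride`; predictions from overlapping regions then need to be reconciled, for example by keeping the higher-scoring label.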
| ## Example Usage | |
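| ```python | |
| import torch | |
| from transformers import AutoModelForTokenClassification, AutoTokenizer | |
| model_name = "your-username/your-space-name" # Update after deployment | |
| tokenizer = AutoTokenizer.from_pretrained(model_name) | |
| model = AutoModelForTokenClassification.from_pretrained(model_name) | |
| text = "My name is John Smith and my email is john@example.com" | |
| inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512) | |
| with torch.no_grad():  # inference only, no gradients needed | |
|     outputs = model(**inputs) | |
| # Map each token's highest-scoring class ID to its label name | |
| predictions = outputs.logits.argmax(dim=-1)[0] | |
| tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]) | |
| for token, pred in zip(tokens, predictions): | |
|     print(token, model.config.id2label[pred.item()]) | |
| ``` | |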
| ## License | |
| Apache 2.0 | |