---
title: PII Detection with BERT
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
---
# PII Detection with BERT
This Space demonstrates a BERT model fine-tuned to detect Personally Identifiable Information (PII) in text.
## Model Details
- **Base Model**: [google-bert/bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased)
- **Training Dataset**: [ai4privacy/pii-masking-43k](https://huggingface.co/datasets/ai4privacy/pii-masking-43k)
- **Task**: Token Classification / Named Entity Recognition (NER)
- **Number of Entity Types**: 27
## Detectable PII Types
The model can identify 27 different types of personal information:
### Identity Information
- NAME, USERNAME, DISPLAYNAME, GENDER, JOB
### Contact Information
- EMAIL, STREET, ADDRESS, ZIPCODE, GEO, NEARBYGPSCOORDINATE
### Financial Information
- CREDITCARDNUM, CREDITCARDISSUER, IBAN, BIC
- ACCOUNTNAME, ACCOUNTNUM, CURRENCY, COINADDRESS
### Technical Information
- IP, MAC, URL, USERAGENT, PASSWORD
### Other
- NUM, ORDINALDIRECTION
## How It Works
1. **Input**: The user provides text that may contain personal information.
2. **Tokenization**: The text is split into subword tokens using the BERT WordPiece tokenizer.
3. **Classification**: Each token is classified as one of the 27 entity types or as "O" (no entity).
4. **Visualization**: Detected entities are highlighted in different colors.
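
The steps above can be sketched without loading the model. This is a minimal illustration, not the Space's actual `app.py`: the helper `group_entities` and the hand-written tokens and labels are hypothetical, and it assumes the labels carry no `B-`/`I-` prefixes. It merges per-token labels into entity spans and glues WordPiece continuation pieces (tokens starting with `##`) back onto the preceding word.

```python
from typing import List, Tuple

# Special tokens added by the BERT tokenizer; they never belong to an entity.
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def group_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens that share a non-"O" label into (text, type) pairs."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []  # words of the entity currently being built
    kind = "O"             # label of the previous regular token

    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith("##"):
            # WordPiece continuation: attach it to the previous word of the
            # open entity, or ignore it when no entity is open.
            if words:
                words[-1] += token[2:]
            continue
        if label != kind and words:
            # The label changed, so the open entity is complete.
            entities.append((" ".join(words), kind))
            words = []
        kind = label
        if label != "O":
            words.append(token)

    if words:
        entities.append((" ".join(words), kind))
    return entities


# Hand-written tokens and labels (illustrative, not real model output).
tokens = ["[CLS]", "my", "name", "is", "john", "smith", "and", "my", "email",
          "is", "john", "@", "example", ".", "com", "[SEP]"]
labels = ["O", "O", "O", "O", "NAME", "NAME", "O", "O", "O",
          "O", "EMAIL", "EMAIL", "EMAIL", "EMAIL", "EMAIL", "O"]

print(group_entities(tokens, labels))
# [('john smith', 'NAME'), ('john @ example . com', 'EMAIL')]
```

Joining tokens with spaces loses the original spacing, as the email above shows. A highlighter would more likely work from character offsets into the input text.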
## Training Details
- Learning Rate: 5e-05
- Batch Size: 16 (train), 64 (eval)
- Epochs: 3
- Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
- Warmup Steps: 500
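
To show how these hyperparameters interact, here is a small sketch that collects them in one place and computes the learning rate at a given step. The names `TrainConfig` and `lr_at_step` are hypothetical. The schedule shape (linear warmup followed by linear decay to zero) and the total step count are assumptions, since the README lists only the peak learning rate and the number of warmup steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters listed in the Training Details section."""
    learning_rate: float = 5e-5
    train_batch_size: int = 16
    eval_batch_size: int = 64
    epochs: int = 3
    warmup_steps: int = 500
    adam_beta1: float = 0.9
    adam_beta2: float = 0.999
    adam_epsilon: float = 1e-8


def lr_at_step(step: int, total_steps: int, cfg: TrainConfig = TrainConfig()) -> float:
    """Learning rate under an ASSUMED linear warmup + linear decay schedule."""
    if step < cfg.warmup_steps:
        # Ramp up linearly from 0 to the peak learning rate.
        return cfg.learning_rate * step / cfg.warmup_steps
    # Decay linearly from the peak down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return cfg.learning_rate * remaining / max(1, total_steps - cfg.warmup_steps)


# total_steps is a made-up value for illustration; the real number depends on
# the dataset size, the batch size, and the number of epochs.
total_steps = 5000
for step in (0, 250, 500, 2750, 5000):
    print(step, lr_at_step(step, total_steps))
```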
## Use Cases
- **Data Privacy**: Identify PII before sharing documents
- **Data Anonymization**: Find information that needs masking
- **Compliance**: Help meet GDPR and CCPA requirements
- **Security**: Detect leaks of sensitive information
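
As an illustration of the anonymization use case, the sketch below replaces detected spans with their entity type. The function `mask_text` and the `(start, end, label)` span format are hypothetical. The spans are built by hand here and stand in for whatever character offsets a detection step would produce.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity type)


def mask_text(text: str, spans: Iterable[Span]) -> str:
    """Replace each character span with a "[TYPE]" placeholder."""
    pieces: List[str] = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip spans that overlap one already masked.
            continue
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(f"[{label}]")        # placeholder for the PII
        cursor = end
    pieces.append(text[cursor:])           # remaining tail of the text
    return "".join(pieces)


text = "My name is John Smith and my email is john@example.com"
name, email = "John Smith", "john@example.com"
# Hand-built spans for illustration; a real run would take them from the model.
spans = [
    (text.index(name), text.index(name) + len(name), "NAME"),
    (text.index(email), text.index(email) + len(email), "EMAIL"),
]

print(mask_text(text, spans))
# My name is [NAME] and my email is [EMAIL]
```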
## Limitations
- Maximum input length: 512 tokens (the limit of the BERT base model)
- Optimized for English text
- May not detect all variations of PII
- Performance depends on text format and quality
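
One common way to work around the 512-token limit is to split a long token sequence into overlapping windows and run the model on each window. The sketch below is an assumption about how that could be done, not something this Space is documented to do. The function name `split_into_windows` and the overlap of 50 tokens are hypothetical choices.

```python
from typing import List


def split_into_windows(token_ids: List[int], max_len: int = 512,
                       overlap: int = 50) -> List[List[int]]:
    """Split token ids into overlapping windows that fit the model's limit."""
    body = max_len - 2  # reserve two positions for [CLS] and [SEP]
    if not 0 <= overlap < body:
        raise ValueError("overlap must be non-negative and smaller than the window body")
    if not token_ids:
        return []

    step = body - overlap  # how far each new window advances
    windows: List[List[int]] = []
    start = 0
    while True:
        windows.append(token_ids[start:start + body])
        if start + body >= len(token_ids):
            break  # this window reached the end of the input
        start += step
    return windows


# 1000 fake token ids stand in for a long tokenized document.
windows = split_into_windows(list(range(1000)))
print([len(w) for w in windows])
# [510, 510, 80]
```

The overlap gives an entity that straddles a window boundary a chance to appear whole in at least one window. Predictions in the overlapping region then need to be reconciled, for example by keeping the higher-scoring label.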
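## Example Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "your-username/your-space-name"  # Placeholder: update after deployment
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
text = "My name is John Smith and my email is john@example.com"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)
# Print each token next to its highest-scoring label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, outputs.logits.argmax(dim=-1)[0]):
    print(token, model.config.id2label[label_id.item()])
```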
## License
Apache 2.0