---
title: PII Detection with BERT
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
---
# PII Detection with BERT
This Space demonstrates a BERT model fine-tuned to detect Personally Identifiable Information (PII) in text.
## Model Details
- **Base Model**: [google-bert/bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased)
- **Training Dataset**: [ai4privacy/pii-masking-43k](https://huggingface.co/datasets/ai4privacy/pii-masking-43k)
- **Task**: Token Classification / Named Entity Recognition (NER)
- **Number of Entity Types**: 27
## Detectable PII Types
The model can identify 27 different types of personal information:
### Identity Information
- NAME, USERNAME, DISPLAYNAME, GENDER, JOB
### Contact Information
- EMAIL, STREET, ADDRESS, ZIPCODE, GEO, NEARBYGPSCOORDINATE
### Financial Information
- CREDITCARDNUM, CREDITCARDISSUER, IBAN, BIC
- ACCOUNTNAME, ACCOUNTNUM, CURRENCY, COINADDRESS
### Technical Information
- IP, MAC, URL, USERAGENT, PASSWORD
### Other
- NUM, ORDINALDIRECTION
## How It Works
1. **Input**: User provides text that may contain personal information
2. **Tokenization**: Text is split into tokens using BERT tokenizer
3. **Classification**: Each token is classified into one of 27 entity types or "O" (no entity)
4. **Visualization**: Detected entities are highlighted with different colors
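Steps 3 and 4 above can be sketched in plain Python: consecutive tokens that share the same non-"O" label are merged into a single span for highlighting. The tokens, labels, and helper name below are illustrative, not the model's actual output.

```python
# Sketch: merge consecutive tokens sharing a non-"O" label into entity spans.
def group_entities(tokens, labels):
    """Return a list of (text, label) spans from per-token predictions."""
    spans, current = [], None
    for token, label in zip(tokens, labels):
        if label == "O":
            if current:
                spans.append(current)
                current = None
        elif current and current[1] == label:
            # Same entity type continues: extend the current span
            current = (current[0] + " " + token, label)
        else:
            if current:
                spans.append(current)
            current = (token, label)
    if current:
        spans.append(current)
    return spans

tokens = ["My", "name", "is", "John", "Smith", "."]
labels = ["O", "O", "O", "NAME", "NAME", "O"]
print(group_entities(tokens, labels))  # [('John Smith', 'NAME')]
```

In the app, each returned span would be colored according to its label before being rendered back to the user.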
## Training Details
- Learning Rate: 5e-05
- Batch Size: 16 (train), 64 (eval)
- Epochs: 3
- Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
- Warmup Steps: 500
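With 500 warmup steps, the learning rate ramps linearly from 0 to the 5e-05 peak before following the scheduler's decay. A minimal sketch of that warmup-then-linear-decay shape (the total step count here is illustrative, not taken from the training run):

```python
# Sketch of a linear warmup + linear decay schedule, assuming the
# common Trainer default. total_steps is an illustrative value.
PEAK_LR = 5e-05
WARMUP_STEPS = 500

def scheduled_lr(step, total_steps=3000):
    if step < WARMUP_STEPS:
        # Ramp from 0 up to the peak over the warmup window
        return PEAK_LR * step / WARMUP_STEPS
    # After warmup, decay linearly back toward 0
    return PEAK_LR * (total_steps - step) / (total_steps - WARMUP_STEPS)

print(scheduled_lr(250))  # halfway through warmup -> 2.5e-05
print(scheduled_lr(500))  # peak -> 5e-05
```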
## Use Cases
- **Data Privacy**: Identify PII before sharing documents
- **Data Anonymization**: Find information that needs masking
- **Compliance**: Help meet GDPR, CCPA requirements
- **Security**: Detect sensitive information leaks
## Limitations
- Maximum input length: 512 tokens
- Optimized for English text
- May not detect all variations of PII
- Performance depends on text format and quality
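Inputs longer than the 512-token limit must be split before classification. One common workaround is a sliding window with overlap so entities near a chunk boundary are still seen in full by at least one window; the helper below is a sketch with illustrative window and stride values, not part of this Space's code.

```python
# Sketch: split a long token sequence into overlapping windows so each
# chunk fits the model's 512-token limit. Stride value is illustrative.
def chunk_tokens(tokens, max_len=512, stride=128):
    if len(tokens) <= max_len:
        return [tokens]
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        # Step forward, keeping `stride` tokens of overlap between windows
        start += max_len - stride
    return chunks

tokens = list(range(1000))
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [512, 512, 232]
```

Predictions from overlapping regions then need to be reconciled (e.g. preferring the window where the token sits farther from the edge).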
## Example Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "your-username/your-space-name"  # Update after deployment
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "My name is John Smith and my email is john@example.com"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Map each token's highest-scoring class back to its label name
predictions = outputs.logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    label = model.config.id2label[pred.item()]
    if label != "O":
        print(f"{token}: {label}")
```
## License
Apache 2.0