metadata
title: Multilingual PII Detection
emoji: 🌍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
Multilingual PII Detection with BERT
This Space demonstrates a multilingual BERT model fine-tuned to detect Personally Identifiable Information (PII) in text across multiple languages.
Model Details
- Base Model: google-bert/bert-base-multilingual-uncased
- Task: Token Classification / Named Entity Recognition (NER)
- Number of Entity Types: 39
- Languages: Supports 100+ languages including English, Spanish, French, German, Chinese, Arabic, and more
Detectable PII Types
The model can identify 39 different types of personal information:
Identity Information
- NAME, USERNAME, PREFIX, GENDER, AGE, JOB, BLOODTYPE
Contact Information
- EMAIL, PHONENUMBER, PHONEIMEI, STREET, ADDRESS, ZIPCODE, GEO, NEARBYGPSCOORDINATE
Financial Information
- CREDITCARDNUMBER, CREDITCARDISSUER, IBAN, BIC, ACCOUNTNAME, CURRENCY, COINADDRESS
Government IDs
- SSN (Social Security Number)
Vehicle Information
- VEHICLEVIN (Vehicle Identification Number)
- VEHICLEVRM (Vehicle Registration Mark)
Technical Information
- IP, MAC, URL, PASSWORD
Organization
- ORG
Temporal Information
- DATE, TIME
Physical Attributes
- HEIGHT, WEIGHTS, COLOR
Other
- NUM, ORDINALDIRECTION, MISC
How It Works
- Input: User provides text that may contain personal information
- Tokenization: Text is split into tokens using the BERT tokenizer
- Classification: Each token is classified into one of 39 entity types or "O" (no entity)
- Visualization: Detected entities are highlighted with different colors
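The classification step yields one label per token, so consecutive tokens usually need to be merged into entity spans before they can be highlighted. The following is a minimal sketch of that merging, not the Space's actual code. It assumes each token comes with a `(start, end)` character offset, that special tokens have empty offsets, and that labels may or may not carry `B-`/`I-` prefixes. The function name `group_entities` and the hand-written labels are hypothetical.

```python
from typing import Dict, List, Tuple


def group_entities(text: str, labels: List[str],
                   offsets: List[Tuple[int, int]]) -> List[Dict]:
    """Merge adjacent tokens that share a label into character-level spans."""
    spans: List[Dict] = []
    for label, (start, end) in zip(labels, offsets):
        # Skip non-entity tokens and special tokens with empty offsets.
        if label == "O" or start == end:
            continue
        begins = label.startswith("B-")
        name = label[2:] if label[:2] in ("B-", "I-") else label
        last = spans[-1] if spans else None
        # Extend the previous span only if the label matches and the tokens
        # touch or are separated by at most one character (a space).
        if last and not begins and last["label"] == name and start - last["end"] <= 1:
            last["end"] = end
        else:
            spans.append({"label": name, "start": start, "end": end})
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


# Hand-written token labels and offsets standing in for real model output.
text = "My name is John Smith and my email is john@example.com"
offsets = [(0, 0), (0, 2), (3, 7), (8, 10), (11, 15), (16, 21), (22, 25),
           (26, 28), (29, 34), (35, 37), (38, 42), (42, 43), (43, 50),
           (50, 51), (51, 54), (0, 0)]
labels = ["O", "O", "O", "O", "NAME", "NAME", "O", "O", "O", "O",
          "EMAIL", "EMAIL", "EMAIL", "EMAIL", "EMAIL", "O"]
spans = group_entities(text, labels, offsets)
for span in spans:
    print(span["label"], repr(span["text"]))
```

Requiring adjacency in addition to a matching label keeps two separate entities of the same type from being fused when ordinary words sit between them.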
Training Details
- Learning Rate: 5e-05
- Batch Size: 16 (train), 64 (eval)
- Epochs: 3
- Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
- Warmup Steps: 500
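For anyone trying to reproduce a similar run, the hyperparameters above map onto parameter names of `transformers.TrainingArguments` roughly as sketched below. This is an illustration under assumptions, not the original training script: it treats the batch sizes as per-device values, and the `output_dir` path is hypothetical.

```python
# Hyperparameters from the list above, keyed by the corresponding
# transformers.TrainingArguments parameter names.
training_kwargs = {
    "output_dir": "./pii-bert-output",   # hypothetical path, not from the source
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 16,   # assumes the listed size is per device
    "per_device_eval_batch_size": 64,    # assumes the listed size is per device
    "num_train_epochs": 3,
    "warmup_steps": 500,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
}

# With transformers installed, these could be passed on as:
#   from transformers import TrainingArguments
#   args = TrainingArguments(**training_kwargs)
for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```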
Use Cases
- Data Privacy: Identify PII before sharing documents
- Data Anonymization: Find information that needs masking
- Compliance: Help meet GDPR and CCPA requirements
- Security: Detect sensitive information leaks
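For the anonymization use case, detected spans can be replaced with placeholders once their character offsets are known. The snippet below is a small sketch of that idea. The `mask_pii` helper and the span dictionaries are hypothetical and stand in for whatever structure the detection step produces.

```python
from typing import Dict, List


def mask_pii(text: str, spans: List[Dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    masked = text
    # Work from right to left so earlier offsets stay valid after each edit.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = f"[{span['label']}]"
        masked = masked[:span["start"]] + placeholder + masked[span["end"]:]
    return masked


# Hand-written spans standing in for real model output.
example_text = "My name is John Smith and my email is john@example.com"
example_spans = [
    {"label": "NAME", "start": 11, "end": 21},
    {"label": "EMAIL", "start": 38, "end": 54},
]
print(mask_pii(example_text, example_spans))
# My name is [NAME] and my email is [EMAIL]
```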
Limitations
- Maximum input length: 512 tokens
- Optimized for English text; accuracy in other languages may be lower
- May not detect all variations of PII
- Performance depends on text format and quality
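One common way to work around the 512-token limit is to split long inputs into overlapping windows and run the model on each window separately. The sketch below shows the windowing arithmetic only. The `sliding_windows` helper, the overlap of 128 tokens, and the reservation of two positions for `[CLS]` and `[SEP]` are assumptions for illustration, not something this Space is known to do.

```python
from typing import List, Tuple


def sliding_windows(token_ids: List[int], max_len: int = 512,
                    overlap: int = 128,
                    num_special: int = 2) -> List[Tuple[int, List[int]]]:
    """Split token ids into overlapping windows that fit the model limit.

    Returns (start_index, window_ids) pairs; start_index lets predictions
    be mapped back to positions in the full sequence.
    """
    body = max_len - num_special  # room left after the special tokens
    if not 0 <= overlap < body:
        raise ValueError("overlap must be non-negative and smaller than the window body")
    step = body - overlap
    windows = []
    start = 0
    while True:
        windows.append((start, token_ids[start:start + body]))
        if start + body >= len(token_ids):
            break
        start += step
    return windows


# Integers standing in for real token ids.
long_input = list(range(1000))
for start, window in sliding_windows(long_input):
    print(start, len(window))
```

The overlap means an entity cut off at the end of one window appears whole in the next. Fast tokenizers in transformers offer comparable behaviour through the `return_overflowing_tokens` and `stride` arguments.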
Example Usage
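import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
model_name = "your-username/your-space-name" # Update after deployment
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
text = "My name is John Smith and my email is john@example.com"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
# Highest-scoring label id per token, mapped to label names
predicted_ids = outputs.logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predicted_ids):
    print(token, model.config.id2label[label_id])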
License
Apache 2.0