Spaces: vuminhtue / NER_PII_Bert_Multilingual


metadata

title: Multilingual PII Detection
emoji: 🌍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0

Multilingual PII Detection with BERT

This Space demonstrates a multilingual BERT model fine-tuned to detect Personally Identifiable Information (PII) in text across multiple languages.

Model Details

Base Model: google-bert/bert-base-multilingual-uncased
Task: Token Classification / Named Entity Recognition (NER)
Number of Entity Types: 39
Languages: Supports 100+ languages including English, Spanish, French, German, Chinese, Arabic, and more

Detectable PII Types

The model can identify 39 different types of personal information:

Identity Information

NAME, USERNAME, PREFIX, GENDER, AGE, JOB, BLOODTYPE

Contact Information

EMAIL, PHONENUMBER, PHONEIMEI, STREET, ADDRESS, ZIPCODE, GEO, NEARBYGPSCOORDINATE

Financial Information

CREDITCARDNUMBER, CREDITCARDISSUER, IBAN, BIC, ACCOUNTNAME, CURRENCY, COINADDRESS

Government IDs

SSN (Social Security Number)

Vehicle Information

VEHICLEVIN (Vehicle Identification Number)
VEHICLEVRM (Vehicle Registration Mark)

Technical Information

IP, MAC, URL, PASSWORD

Organization

Temporal Information

DATE, TIME

Physical Attributes

HEIGHT, WEIGHTS, COLOR

Other

NUM, ORDINALDIRECTION, MISC

How It Works

Input: User provides text that may contain personal information
Tokenization: Text is split into tokens using the BERT tokenizer
Classification: Each token is classified into one of the 39 entity types or "O" (no entity)
Visualization: Detected entities are highlighted with different colors
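
The classification step yields one label per token, so word pieces usually need to be merged back into whole entities before they can be highlighted. A minimal sketch of that grouping, assuming WordPiece continuation tokens start with `##` and that labels may carry BIO-style `B-`/`I-` prefixes. Neither the label scheme nor the helper name `group_entities` comes from this Space:

```python
def group_entities(tokens, labels):
    """Merge per-token labels into (entity_type, text) pairs.

    Hypothetical helper: assumes WordPiece tokens ("##" marks a continuation)
    and labels that are "O", a bare type, or a BIO-prefixed type.
    """
    entities = []
    current_type, current_text = None, ""
    for token, label in zip(tokens, labels):
        if token in ("[CLS]", "[SEP]", "[PAD]"):
            continue  # special tokens never belong to an entity
        starts_new = label.startswith("B-")
        ent_type = label[2:] if label[:2] in ("B-", "I-") else label
        if ent_type == "O":
            if current_type is not None:
                entities.append((current_type, current_text))
            current_type, current_text = None, ""
            continue
        if token.startswith("##") and current_type is not None:
            current_text += token[2:]  # glue word piece onto the previous token
        elif ent_type == current_type and not starts_new:
            current_text += " " + token  # same entity continues
        else:
            if current_type is not None:
                entities.append((current_type, current_text))
            current_type, current_text = ent_type, token
    if current_type is not None:
        entities.append((current_type, current_text))
    return entities
```

Because the base model is uncased, the recovered text is lowercase; character offsets from the tokenizer would be needed to highlight the original casing.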

Training Details

Learning Rate: 5e-05
Batch Size: 16 (train), 64 (eval)
Epochs: 3
Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
Warmup Steps: 500
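
To make the warmup setting concrete, here is a minimal sketch of the learning rate over the course of training, assuming linear warmup over the 500 steps followed by linear decay to zero. The decay shape and the total step count are assumptions; the settings above list only the peak rate and the warmup length.

```python
def learning_rate(step, total_steps, peak_lr=5e-5, warmup_steps=500):
    """Learning rate at a given optimizer step.

    Assumes linear warmup then linear decay to zero; the decay shape and
    total_steps are illustrative assumptions.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)
```

Under these assumptions the rate climbs from 0 to 5e-05 during the first 500 steps and then falls steadily until training ends.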

Use Cases

Data Privacy: Identify PII before sharing documents
Data Anonymization: Find information that needs masking
Compliance: Help meet GDPR and CCPA requirements
Security: Detect sensitive information leaks
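
For the anonymization use case, detected spans can be replaced with placeholders. A minimal sketch, assuming each detection is a dict with `start` and `end` character offsets and an `entity_group` key. That shape matches what the transformers token-classification pipeline returns when an aggregation strategy is set, but the helper name `mask_pii` and the sample detections below are illustrative:

```python
def mask_pii(text, detections):
    """Replace each detected span with a [TYPE] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement
    for det in sorted(detections, key=lambda d: d["start"], reverse=True):
        text = text[:det["start"]] + f"[{det['entity_group']}]" + text[det["end"]:]
    return text

text = "My name is John Smith and my email is john@example.com"
detections = [  # hand-written sample, not real model output
    {"entity_group": "NAME", "start": 11, "end": 21},
    {"entity_group": "EMAIL", "start": 38, "end": 54},
]
masked = mask_pii(text, detections)
# masked == "My name is [NAME] and my email is [EMAIL]"
```

Since the model may miss some PII, masked output should still be reviewed before a document is shared.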

Limitations

Maximum input length: 512 tokens
Optimized for English text; accuracy in other languages may be lower
May not detect all variations of PII
Performance depends on text format and quality
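
One common way to work around the 512-token limit is to split long inputs into overlapping windows and run the model on each window. A minimal sketch over a plain list of token IDs; the window size of 510 (leaving room for two special tokens), the overlap of 50, and the helper name `chunk_token_ids` are assumptions, not settings from this Space:

```python
def chunk_token_ids(token_ids, max_len=510, overlap=50):
    """Split token IDs into overlapping windows of at most max_len items."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    chunks = []
    step = max_len - overlap
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # last window already reaches the end
    return chunks
```

The overlap lowers the chance that an entity is cut in half at a window boundary, at the cost of having to deduplicate detections that appear in two windows.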

Example Usage

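import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "your-username/your-space-name"  # Update after deployment
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "My name is John Smith and my email is john@example.com"
# Truncate to the model's 512-token limit
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Pick the highest-scoring label for each token and print it
predictions = outputs.logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predictions):
    print(token, model.config.id2label[label_id.item()])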

License

Apache 2.0