Space: vuminhtue/Bert_base_NER_PII_43k

metadata

title: PII Detection with BERT
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0

PII Detection with BERT

This Space demonstrates a BERT model fine-tuned to detect Personally Identifiable Information (PII) in text.

Model Details

Base Model: google-bert/bert-base-uncased
Training Dataset: ai4privacy/pii-masking-43k
Task: Token Classification / Named Entity Recognition (NER)
Number of Entity Types: 27

Detectable PII Types

The model can identify 27 different types of personal information:

Identity Information

NAME, USERNAME, DISPLAYNAME, GENDER, JOB

Contact Information

EMAIL, STREET, ADDRESS, ZIPCODE, GEO, NEARBYGPSCOORDINATE

Financial Information

CREDITCARDNUM, CREDITCARDISSUER, IBAN, BIC, ACCOUNTNAME, ACCOUNTNUM, CURRENCY, COINADDRESS

Technical Information

IP, MAC, URL, USERAGENT, PASSWORD

Other

NUM, ORDINALDIRECTION
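
For downstream filtering it can help to have the groups above in machine-readable form. The snippet below is a minimal sketch that copies the labels exactly as listed in this README; the names `PII_CATEGORIES`, `LABEL_TO_CATEGORY`, and `category_of` are hypothetical, and the label strings the model actually emits may differ (for example, they may carry a prefix such as `B-` or `I-`).

```python
# Labels copied from the groups listed in this README.
PII_CATEGORIES = {
    "Identity Information": ["NAME", "USERNAME", "DISPLAYNAME", "GENDER", "JOB"],
    "Contact Information": [
        "EMAIL", "STREET", "ADDRESS", "ZIPCODE", "GEO", "NEARBYGPSCOORDINATE",
    ],
    "Financial Information": [
        "CREDITCARDNUM", "CREDITCARDISSUER", "IBAN", "BIC",
        "ACCOUNTNAME", "ACCOUNTNUM", "CURRENCY", "COINADDRESS",
    ],
    "Technical Information": ["IP", "MAC", "URL", "USERAGENT", "PASSWORD"],
    "Other": ["NUM", "ORDINALDIRECTION"],
}

# Reverse lookup: label -> category name.
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in PII_CATEGORIES.items()
    for label in labels
}


def category_of(label):
    """Return the category for a label, or "Unknown" if it is not listed."""
    return LABEL_TO_CATEGORY.get(label, "Unknown")


print(category_of("IBAN"))  # Financial Information
```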

How It Works

Input: The user provides text that may contain personal information
Tokenization: The text is split into tokens with the BERT tokenizer
Classification: Each token is assigned one of the 27 entity types or "O" (no entity)
Visualization: Detected entities are highlighted in different colors
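
Between classification and visualization, per-token labels usually have to be merged back into whole entities, because BERT's WordPiece tokenizer can split one word into several pieces (continuation pieces start with `##`). The following is a rough sketch of that merging step, not the code used in `app.py`. The function name `group_entities` is hypothetical, the tokens and labels in the example are hand-written rather than real model output, and the sketch assumes plain label strings such as `NAME` without `B-`/`I-` prefixes.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def group_entities(tokens, labels):
    """Merge consecutive tokens that share a non-"O" label into entities.

    Returns a list of (label, text) tuples in order of appearance.
    """
    entities = []  # each item is [label, text], mutable while we build it
    prev_label = "O"
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS or label == "O":
            prev_label = "O"
            continue
        is_piece = token.startswith("##")
        piece = token[2:] if is_piece else token
        if label == prev_label and entities:
            # Same entity continues: glue word pieces, space-separate words.
            entities[-1][1] += piece if is_piece else " " + piece
        else:
            entities.append([label, piece])
        prev_label = label
    return [(label, text) for label, text in entities]


# Hand-written example input (hypothetical, not real model output).
tokens = ["[CLS]", "my", "name", "is", "john", "smith", "##son", "[SEP]"]
labels = ["O", "O", "O", "O", "NAME", "NAME", "NAME", "O"]
print(group_entities(tokens, labels))  # [('NAME', 'john smithson')]
```

Joining tokens with spaces does not restore the original spelling of values such as email addresses; working with character offsets from the tokenizer is more faithful when the exact source span is needed.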

Training Details

Learning Rate: 5e-05
Batch Size: 16 (train), 64 (eval)
Epochs: 3
Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
Warmup Steps: 500
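
To make the warmup setting concrete, the sketch below computes the learning rate at a given optimizer step. It assumes a linear warmup to the peak rate followed by a linear decay to zero; this README does not state the scheduler type, so that shape is an assumption. `lr_at_step` is a hypothetical helper, and `total_steps` depends on the dataset size and is supplied by the caller.

```python
BASE_LR = 5e-05     # peak learning rate from this README
WARMUP_STEPS = 500  # warmup steps from this README


def lr_at_step(step, total_steps, base_lr=BASE_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at `step`, assuming linear warmup then linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup period.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)


# Example with an illustrative total of 8000 steps (not taken from the source).
for step in (0, 250, 500, 4250, 8000):
    print(step, lr_at_step(step, total_steps=8000))
```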

Use Cases

Data Privacy: Identify PII before sharing documents
Data Anonymization: Find information that needs masking
Compliance: Help meet GDPR and CCPA requirements
Security: Detect sensitive information leaks
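
As an illustration of the anonymization use case, the sketch below replaces detected spans with a placeholder such as `[NAME]`. It assumes the entities are already available as non-overlapping `(start, end, label)` character offsets; `mask_entities` is a hypothetical helper, and the offsets in the example are computed by hand rather than produced by the model.

```python
def mask_entities(text, entities):
    """Replace each (start, end, label) span in `text` with "[LABEL]".

    Spans are applied from right to left so that earlier offsets stay valid
    while the string changes length.
    """
    masked = text
    for start, end, label in sorted(entities, reverse=True):
        masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked


text = "My name is John Smith and my email is john@example.com"
name, email = "John Smith", "john@example.com"
# Hand-built spans standing in for model predictions.
entities = [
    (text.index(name), text.index(name) + len(name), "NAME"),
    (text.index(email), text.index(email) + len(email), "EMAIL"),
]
print(mask_entities(text, entities))
# My name is [NAME] and my email is [EMAIL]
```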

Limitations

Maximum input length: 512 tokens
Optimized for English text
May not detect all variations of PII
Performance depends on text format and quality
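
One common way to work around the 512-token limit is to split long inputs into overlapping windows and run the model on each window. The sketch below shows only the windowing step on a plain list of token IDs. `chunk_tokens` is a hypothetical helper; the default of 510 assumes two positions are reserved for the `[CLS]` and `[SEP]` tokens, and the overlap of 50 is an arbitrary illustrative value.

```python
def chunk_tokens(token_ids, max_len=510, overlap=50):
    """Split `token_ids` into windows of at most `max_len` items.

    Consecutive windows share `overlap` items so that an entity cut at one
    window boundary is seen whole in the neighbouring window.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end of the input
    return chunks


chunks = chunk_tokens(list(range(1000)))
print([len(c) for c in chunks])  # [510, 510, 80]
```

Predictions from overlapping regions still need to be reconciled, for example by keeping the prediction from the window where the token is farther from the edge.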

Example Usage

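import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "your-username/your-space-name"  # Placeholder: update after deployment
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "My name is John Smith and my email is john@example.com"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Take the highest-scoring label for each token and print it next to the token
predictions = outputs.logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(token, model.config.id2label[pred.item()])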

License

Apache 2.0