metadata
title: Multilingual PII Detection
emoji: 🌍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0

Multilingual PII Detection with BERT

This Space demonstrates a multilingual BERT model fine-tuned to detect Personally Identifiable Information (PII) in text across multiple languages.

Model Details

  • Base Model: google-bert/bert-base-multilingual-uncased
  • Task: Token Classification / Named Entity Recognition (NER)
  • Number of Entity Types: 39
  • Languages: The base model covers 100+ languages, including English, Spanish, French, German, Chinese, and Arabic

Detectable PII Types

The model can identify 39 different types of personal information:

Identity Information

  • NAME, USERNAME, PREFIX, GENDER, AGE, JOB, BLOODTYPE

Contact Information

  • EMAIL, PHONENUMBER, PHONEIMEI, STREET, ADDRESS, ZIPCODE, GEO, NEARBYGPSCOORDINATE

Financial Information

  • CREDITCARDNUMBER, CREDITCARDISSUER, IBAN, BIC, ACCOUNTNAME, CURRENCY, COINADDRESS

Government IDs

  • SSN (Social Security Number)

Vehicle Information

  • VEHICLEVIN (Vehicle Identification Number)
  • VEHICLEVRM (Vehicle Registration Mark)

Technical Information

  • IP, MAC, URL, PASSWORD

Organization

  • ORG

Temporal Information

  • DATE, TIME

Physical Attributes

  • HEIGHT, WEIGHTS, COLOR

Other

  • NUM, ORDINALDIRECTION, MISC

How It Works

  1. Input: The user provides text that may contain personal information
  2. Tokenization: The text is split into tokens using the BERT tokenizer
  3. Classification: Each token is classified into one of the 39 entity types or "O" (no entity)
  4. Visualization: Detected entities are highlighted with different colors
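
As a rough illustration of how steps 3 and 4 connect, the sketch below merges per-token labels into character spans that a highlighter could color. It is not the Space's actual code: the function name group_entities is hypothetical, the labels are hand-written stand-ins for model predictions, and it assumes plain labels without B-/I- prefixes. The offsets have the shape a fast tokenizer returns with return_offsets_mapping=True.

```python
def group_entities(offsets, labels):
    """Merge per-token labels into (start, end, label) character spans."""
    spans = []
    for (start, end), label in zip(offsets, labels):
        if start == end or label == "O":
            continue  # special tokens have empty offsets; "O" means no entity
        if spans and spans[-1][2] == label and start - spans[-1][1] <= 1:
            # Same label and touching (or one space apart): extend the span.
            spans[-1] = (spans[-1][0], end, label)
        else:
            spans.append((start, end, label))
    return spans


text = "My name is John Smith and my email is john@example.com"
# Character offsets per token; (0, 0) marks [CLS] and [SEP].
offsets = [(0, 0), (0, 2), (3, 7), (8, 10), (11, 15), (16, 21), (22, 25),
           (26, 28), (29, 34), (35, 37), (38, 42), (42, 43), (43, 50),
           (50, 51), (51, 54), (0, 0)]
# Hand-written stand-in for the model's per-token predictions.
labels = ["O", "O", "O", "O", "NAME", "NAME", "O", "O", "O", "O",
          "EMAIL", "EMAIL", "EMAIL", "EMAIL", "EMAIL", "O"]

spans = group_entities(offsets, labels)
for start, end, label in spans:
    print(label, repr(text[start:end]))
```

Working with character offsets rather than token strings keeps the original casing and spacing, so the spans can be passed straight to a highlighting component.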

Training Details

  • Learning Rate: 5e-05
  • Batch Size: 16 (train), 64 (eval)
  • Epochs: 3
  • Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
  • Warmup Steps: 500
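
For anyone reproducing the run, the settings above could be written down as keyword arguments. The sketch below uses the parameter names of transformers.TrainingArguments; treating the stated batch sizes as per-device values is an assumption, and the name training_kwargs is only illustrative.

```python
# Hyperparameters from the list above, keyed by transformers.TrainingArguments
# parameter names. Assumption: the batch sizes are per-device values.
training_kwargs = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 64,
    "num_train_epochs": 3,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "warmup_steps": 500,
}

# With transformers installed, these could be passed along as
# TrainingArguments(output_dir="out", **training_kwargs),
# where "out" is a hypothetical output directory.
for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```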

Use Cases

  • Data Privacy: Identify PII before sharing documents
  • Data Anonymization: Find information that needs masking
  • Compliance: Help meet GDPR and CCPA requirements
  • Security: Detect sensitive information leaks
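
To make the anonymization use case concrete, here is a minimal masking sketch. The helper mask_pii is hypothetical, and the entity spans are hand-written stand-ins for what the model might return for this sentence.

```python
def mask_pii(text, entities):
    """Replace each (start, end, label) span with a [LABEL] placeholder."""
    masked = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(entities, reverse=True):
        masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked


text = "My name is John Smith and my email is john@example.com"
entities = [(11, 21, "NAME"), (38, 54, "EMAIL")]  # stand-in for model output
print(mask_pii(text, entities))
# prints: My name is [NAME] and my email is [EMAIL]
```

Because the model may miss some PII, masked output like this is best treated as a first pass rather than a guarantee.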

Limitations

  • Maximum input length: 512 tokens
  • Optimized for English text; accuracy in other languages may be lower
  • May not detect all variations of PII
  • Performance depends on text format and quality
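
One common way to work around the 512-token limit is to split long inputs into overlapping windows and run the model on each. The sketch below shows the idea on a plain list of token IDs; the function name chunk_token_ids and the overlap of 32 tokens are illustrative choices, not something this Space is known to do.

```python
def chunk_token_ids(token_ids, max_len=512, overlap=32, num_special=2):
    """Split token IDs into overlapping windows that fit the model.

    num_special reserves room for [CLS] and [SEP], which BERT adds per input.
    """
    window = max_len - num_special
    if overlap >= window:
        raise ValueError("overlap must be smaller than the usable window")
    step = window - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the input
    return chunks


ids = list(range(1200))  # stand-in for a long tokenized document
chunks = chunk_token_ids(ids)
print([len(c) for c in chunks])
```

The overlap gives an entity that straddles a window boundary a chance to appear whole in at least one window. Hugging Face tokenizers can also produce such windows directly through the return_overflowing_tokens and stride arguments.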

Example Usage

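import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "your-username/your-space-name"  # Placeholder: replace with the model repo ID after deployment
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "My name is John Smith and my email is john@example.com"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # inference only, so gradients are not needed
    outputs = model(**inputs)

# Pick the highest-scoring label for each token and print it next to the token
predictions = outputs.logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predictions):
    print(token, model.config.id2label[label_id.item()])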

License

Apache 2.0