	---
	title: Multilingual PII Detection
	emoji: 🌍
	colorFrom: blue
	colorTo: purple
	sdk: gradio
	sdk_version: 4.44.0
	app_file: app.py
	pinned: false
	license: apache-2.0
	---

	# Multilingual PII Detection with BERT

	This Space demonstrates a multilingual BERT model fine-tuned to detect Personally Identifiable Information (PII) in text across multiple languages.

	## Model Details

	- Base Model: [google-bert/bert-base-multilingual-uncased](https://huggingface.co/google-bert/bert-base-multilingual-uncased)
	- Task: Token Classification / Named Entity Recognition (NER)
	- Number of Entity Types: 39
	- Languages: 100+ languages, including English, Spanish, French, German, Chinese, and Arabic

	## Detectable PII Types

	The model can identify 39 different types of personal information:

	### Identity Information
	- NAME, USERNAME, PREFIX, GENDER, AGE, JOB, BLOODTYPE

	### Contact Information
	- EMAIL, PHONENUMBER, PHONEIMEI, STREET, ADDRESS, ZIPCODE, GEO, NEARBYGPSCOORDINATE

	### Financial Information
	- CREDITCARDNUMBER, CREDITCARDISSUER, IBAN, BIC, ACCOUNTNAME, CURRENCY, COINADDRESS

	### Government IDs
	- SSN (Social Security Number)

	### Vehicle Information
	- VEHICLEVIN (Vehicle Identification Number)
	- VEHICLEVRM (Vehicle Registration Mark)

	### Technical Information
	- IP, MAC, URL, PASSWORD

	### Organization
	- ORG

	### Temporal Information
	- DATE, TIME

	### Physical Attributes
	- HEIGHT, WEIGHTS, COLOR

	### Other
	- NUM, ORDINALDIRECTION, MISC
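
For downstream code it can help to have the groups above as a lookup table. The following is a minimal sketch: the labels are copied from the lists above, while the names `PII_CATEGORIES`, `LABEL_TO_CATEGORY`, and `category_of` are illustrative and not part of the Space. Whether the model emits BIO-prefixed labels (`B-`, `I-`) is an assumption, so the helper accepts both forms.

```python
# Label groups copied from the lists above; the dict and helper names are hypothetical.
PII_CATEGORIES = {
    "Identity Information": ["NAME", "USERNAME", "PREFIX", "GENDER", "AGE", "JOB", "BLOODTYPE"],
    "Contact Information": ["EMAIL", "PHONENUMBER", "PHONEIMEI", "STREET", "ADDRESS",
                            "ZIPCODE", "GEO", "NEARBYGPSCOORDINATE"],
    "Financial Information": ["CREDITCARDNUMBER", "CREDITCARDISSUER", "IBAN", "BIC",
                              "ACCOUNTNAME", "CURRENCY", "COINADDRESS"],
    "Government IDs": ["SSN"],
    "Vehicle Information": ["VEHICLEVIN", "VEHICLEVRM"],
    "Technical Information": ["IP", "MAC", "URL", "PASSWORD"],
    "Organization": ["ORG"],
    "Temporal Information": ["DATE", "TIME"],
    "Physical Attributes": ["HEIGHT", "WEIGHTS", "COLOR"],
    "Other": ["NUM", "ORDINALDIRECTION", "MISC"],
}

# Reverse index: label -> category name.
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in PII_CATEGORIES.items()
    for label in labels
}


def category_of(label: str) -> str:
    """Return the category of a label, tolerating an optional B-/I- prefix (assumed)."""
    bare = label[2:] if label[:2] in ("B-", "I-") else label
    return LABEL_TO_CATEGORY.get(bare, "Unknown")
```

For example, `category_of("EMAIL")` returns `"Contact Information"`, and the non-entity tag `"O"` maps to `"Unknown"`.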

	## How It Works

	1. Input: The user provides text that may contain personal information
	2. Tokenization: The text is split into tokens using the BERT tokenizer
	3. Classification: Each token is classified into one of the 39 entity types or "O" (no entity)
	4. Visualization: Detected entities are highlighted with different colors
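
The step from per-token labels to highlighted entities can be sketched without loading the model. This is a minimal illustration, not the code in `app.py`: it assumes WordPiece continuation pieces start with `##`, that a word takes the label of its first piece, and that labels may or may not carry BIO prefixes. The function name `group_entities` and the sample tokens and labels are hand-written for illustration.

```python
from typing import List, Tuple


def group_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens that share a label into (text, label) pairs."""
    entities = []               # finished (text, label) pairs
    words, current = [], None   # words and label of the entity being built
    for token, label in zip(tokens, labels):
        if token.startswith("##"):
            # Continuation piece: glue it onto the previous word if an entity
            # is open; the label of the word's first piece wins (assumption).
            if words:
                words[-1] += token[2:]
            continue
        bare = label[2:] if label[:2] in ("B-", "I-") else label
        # Close the open entity when the label changes or a new B- tag starts.
        if current is not None and (bare != current or label.startswith("B-")):
            entities.append((" ".join(words), current))
            words, current = [], None
        if bare != "O":
            words.append(token)
            current = bare
    if current is not None:
        entities.append((" ".join(words), current))
    return entities


# Hand-written tokens and labels, not real model output.
tokens = ["my", "name", "is", "john", "sm", "##ith", "from", "dallas"]
labels = ["O", "O", "O", "NAME", "NAME", "NAME", "O", "GEO"]
print(group_entities(tokens, labels))
# [('john smith', 'NAME'), ('dallas', 'GEO')]
```

Joining words with spaces is a simplification; a real visualization would more likely use the tokenizer's character offsets to highlight the original text.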

	## Training Details

	- Learning Rate: 5e-05
	- Batch Size: 16 (train), 64 (eval)
	- Epochs: 3
	- Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
	- Warmup Steps: 500
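
The list above gives the peak learning rate and the warmup length but not the shape of the schedule. The sketch below assumes linear warmup followed by linear decay to zero, which is the default scheduler in the Hugging Face `Trainer`; if a different scheduler was used, the curve differs. `total_steps` is a hypothetical input because the dataset size is not stated.

```python
PEAK_LR = 5e-05      # from Training Details
WARMUP_STEPS = 500   # from Training Details


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at an optimizer step, assuming linear warmup then linear decay."""
    if step < WARMUP_STEPS:
        # Ramp up from 0 to PEAK_LR over the warmup steps.
        return PEAK_LR * step / WARMUP_STEPS
    # Decay linearly from PEAK_LR to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / max(1, total_steps - WARMUP_STEPS)
```

With a hypothetical `total_steps=3000`, the rate is half the peak at step 250, reaches `5e-05` at step 500, and falls to zero at step 3000.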

	## Use Cases

	- Data Privacy: Identify PII before sharing documents
	- Data Anonymization: Find information that needs masking
	- Compliance: Help meet GDPR and CCPA requirements
	- Security: Detect sensitive information leaks
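
As an illustration of the anonymization use case, detected spans can be replaced with placeholders. This is a minimal sketch that assumes entities are available as `(start, end, label)` character spans; the function name `mask_pii` and the hand-picked spans are hypothetical stand-ins for model output.

```python
from typing import List, Tuple


def mask_pii(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) character span with a [LABEL] placeholder."""
    # Work from right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


text = "My name is John Smith and my email is john@example.com"

# Hand-picked spans standing in for model predictions.
name, email = "John Smith", "john@example.com"
spans = [
    (text.index(name), text.index(name) + len(name), "NAME"),
    (text.index(email), text.index(email) + len(email), "EMAIL"),
]

print(mask_pii(text, spans))
# My name is [NAME] and my email is [EMAIL]
```

The sketch assumes the spans do not overlap; overlapping predictions would need to be merged first.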

	## Limitations

	- Maximum input length: 512 tokens
	- Optimized for English text; results in other languages may be weaker
	- May not detect all variations of PII
	- Performance depends on text format and quality
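
One common way to work around the 512-token limit is to split long inputs into overlapping windows and run the model on each. The sketch below is an assumption about how that could be done, not something the Space implements: it operates on a plain list of token ids, reserves two positions for the `[CLS]` and `[SEP]` tokens that BERT adds, and uses a hypothetical overlap of 64 tokens.

```python
from typing import List

MAX_LEN = 512   # from Limitations
SPECIAL = 2     # [CLS] and [SEP] added around each window


def chunk_token_ids(ids: List[int], stride: int = 64) -> List[List[int]]:
    """Split token ids (without special tokens) into overlapping windows."""
    window = MAX_LEN - SPECIAL
    if not 0 <= stride < window:
        raise ValueError("stride must be between 0 and the window size")
    step = window - stride   # how far each window advances
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(ids[start:start + window])
        if start + window >= len(ids):
            break   # the last window already reaches the end
    return chunks
```

Tokens in the overlap are predicted twice, so the per-window predictions need to be reconciled, for example by keeping the prediction from the window where the token is farther from the edge.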

	## Example Usage

	```python
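	import torch
	from transformers import AutoTokenizer, AutoModelForTokenClassification

	model_name = "your-username/your-space-name"  # Placeholder: replace with the model repository ID after deployment
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForTokenClassification.from_pretrained(model_name)

	text = "My name is John Smith and my email is john@example.com"
	inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
	with torch.no_grad():  # inference only, no gradients needed
	    outputs = model(**inputs)

	# Pick the highest-scoring label for each token and print it next to the token
	predictions = outputs.logits.argmax(dim=-1)[0]
	tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
	for token, label_id in zip(tokens, predictions):
	    print(token, model.config.id2label[label_id.item()])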
	```

	## License

	Apache 2.0