	---
	title: PII Detection with BERT
	emoji: 🔍
	colorFrom: blue
	colorTo: purple
	sdk: gradio
	sdk_version: 4.44.0
	app_file: app.py
	pinned: false
	license: apache-2.0
	---

	# PII Detection with BERT

	This Space demonstrates a BERT model fine-tuned to detect Personally Identifiable Information (PII) in text.

	## Model Details

	- Base Model: [google-bert/bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased)
	- Training Dataset: [ai4privacy/pii-masking-43k](https://huggingface.co/datasets/ai4privacy/pii-masking-43k)
	- Task: Token Classification / Named Entity Recognition (NER)
	- Number of Entity Types: 27

	## Detectable PII Types

	The model can identify 27 different types of personal information:

	### Identity Information
	- NAME, USERNAME, DISPLAYNAME, GENDER, JOB

	### Contact Information
	- EMAIL, STREET, ADDRESS, ZIPCODE, GEO, NEARBYGPSCOORDINATE

	### Financial Information
	- CREDITCARDNUM, CREDITCARDISSUER, IBAN, BIC
	- ACCOUNTNAME, ACCOUNTNUM, CURRENCY, COINADDRESS

	### Technical Information
	- IP, MAC, URL, USERAGENT, PASSWORD

	### Other
	- NUM, ORDINALDIRECTION

	## How It Works

	1. Input: The user provides text that may contain personal information
	2. Tokenization: The text is split into tokens using the BERT tokenizer
	3. Classification: Each token is assigned one of the 27 entity types or "O" (no entity)
	4. Visualization: Detected entities are highlighted in different colors
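
The classification and highlighting steps can be sketched in plain Python. This is a minimal illustration, not the Space's actual `app.py`: the function names, the toy scores, the two-entry `id2label` map, and the assumption that labels are plain entity names without `B-`/`I-` prefixes are all hypothetical. Special tokens such as `[CLS]` and `[SEP]` are assumed to have been filtered out already.

```python
def argmax(scores):
    """Return the index of the largest score in a list."""
    return max(range(len(scores)), key=scores.__getitem__)


def group_entities(text, offsets, token_scores, id2label):
    """Merge runs of consecutive tokens that share a non-"O" label into character spans."""
    spans = []
    prev_label = "O"
    for (start, end), scores in zip(offsets, token_scores):
        label = id2label[argmax(scores)]
        if label != "O":
            if label == prev_label:
                # Same entity type as the previous token: extend the open span
                spans[-1]["end"] = end
            else:
                spans.append({"label": label, "start": start, "end": end})
        prev_label = label
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


# Toy input standing in for tokenizer offsets and model logits (hypothetical values)
text = "My name is John Smith"
offsets = [(0, 2), (3, 7), (8, 10), (11, 15), (16, 21)]
token_scores = [[2.0, 0.1], [1.5, 0.2], [1.8, 0.1], [0.2, 2.5], [0.3, 2.1]]
id2label = {0: "O", 1: "NAME"}

spans = group_entities(text, offsets, token_scores, id2label)
print(spans)
```

The resulting character spans are what a highlighting component would need: a label plus start and end positions in the original text.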

	## Training Details

	- Learning Rate: 5e-05
	- Batch Size: 16 (train), 64 (eval)
	- Epochs: 3
	- Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
	- Warmup Steps: 500
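
To make the warmup setting concrete, the sketch below computes the learning rate at a given step. It assumes a linear warmup followed by a linear decay to zero, which is the default schedule of the Hugging Face `Trainer`; the README does not state which schedule was used, and `total_steps=8000` is a hypothetical value chosen only for illustration.

```python
def learning_rate(step, total_steps, peak_lr=5e-5, warmup_steps=500):
    """Linear warmup to peak_lr, then linear decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 8000  # hypothetical; depends on dataset size, batch size, and epochs
for step in (0, 250, 500, 4250, 8000):
    print(step, learning_rate(step, total_steps))
```

Under these assumptions the rate rises from 0 to 5e-05 over the first 500 steps and then falls steadily back to 0 by the final step.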

	## Use Cases

	- Data Privacy: Identify PII before sharing documents
	- Data Anonymization: Find information that needs masking
	- Compliance: Help meet GDPR and CCPA requirements
	- Security: Detect sensitive information leaks
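
As an illustration of the anonymization use case, the sketch below replaces detected spans with `[LABEL]` placeholders. The `mask_pii` and `span_for` helpers are hypothetical, and the spans are written by hand to stand in for model output.

```python
def mask_pii(text, spans):
    """Replace each detected span with a [LABEL] placeholder."""
    # Work from right to left so earlier character offsets stay valid
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[:span["start"]] + f"[{span['label']}]" + text[span["end"]:]
    return text


def span_for(text, value, label):
    """Build a span dict for the first occurrence of value (stand-in for a model prediction)."""
    start = text.index(value)
    return {"label": label, "start": start, "end": start + len(value)}


text = "My name is John Smith and my email is john@example.com"
spans = [
    span_for(text, "John Smith", "NAME"),
    span_for(text, "john@example.com", "EMAIL"),
]
masked = mask_pii(text, spans)
print(masked)
```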

	## Limitations

	- Maximum input length: 512 tokens
	- Optimized for English text
	- May not detect all variations of PII
	- Performance depends on text format and quality
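
One common way to work around the 512-token limit is to split long inputs into overlapping windows and run the model on each window. The sketch below is one possible approach, not something the Space is known to do; the `stride` of 64 and the reservation of 2 positions for `[CLS]` and `[SEP]` are assumptions.

```python
def chunk_token_ids(token_ids, max_length=512, stride=64, num_special_tokens=2):
    """Split token IDs into overlapping windows that fit the model's input limit."""
    window = max_length - num_special_tokens  # leave room for [CLS] and [SEP]
    if not 0 <= stride < window:
        raise ValueError("stride must be non-negative and smaller than the window")
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
        # Step forward, keeping `stride` tokens of overlap with the previous window
        start += window - stride
    return chunks


token_ids = list(range(1200))  # dummy IDs standing in for a long tokenized document
chunks = chunk_token_ids(token_ids)
print([len(chunk) for chunk in chunks])
```

Hugging Face fast tokenizers can produce similar windows directly when called with `truncation=True`, `stride`, and `return_overflowing_tokens=True`. Predictions in the overlapping regions then need to be reconciled, for example by keeping the higher-scoring label.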

	## Example Usage

	```python
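	import torch
	from transformers import AutoTokenizer, AutoModelForTokenClassification

	model_name = "your-username/your-space-name" # Update after deployment
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForTokenClassification.from_pretrained(model_name)

	text = "My name is John Smith and my email is john@example.com"
	inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
	with torch.no_grad():  # inference only, no gradients needed
	    outputs = model(**inputs)

	# Map each token's highest-scoring class ID to its label name
	predictions = outputs.logits.argmax(dim=-1)[0].tolist()
	tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
	for token, pred in zip(tokens, predictions):
	    print(token, model.config.id2label[pred])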
	```

	## License

	Apache 2.0