
	---
	license: mit
	base_model: distilbert-base-uncased
	tags:
	- token-classification
	- pii
	- privacy
	- personal-information
	- bert
	- distilbert
	language:
	- en
	pipeline_tag: token-classification
	library_name: transformers
	datasets:
	- ai4privacy/pii-masking-200k
	metrics:
	- f1
	- precision
	- recall
	widget:
	- text: "Hi, my name is John Smith and my email is john.smith@company.com"
	  example_title: "Example with PII"
	---

	# BERT PII Detection Model

	A fine-tuned DistilBERT model for detecting and classifying Personally Identifiable Information (PII) in text.

	## Model Details

	- Base Model: `distilbert-base-uncased`
	- Task: Token Classification (Named Entity Recognition)
	- Languages: English
	- License: MIT
	- Fine-tuned on: AI4Privacy PII-200k dataset (`ai4privacy/pii-masking-200k`)

	## Supported PII Entity Types

	This model detects 56 types of PII entities, including:

	Personal Information:
	- FIRSTNAME, LASTNAME, MIDDLENAME
	- EMAIL, PHONENUMBER, USERNAME
	- DATE, TIME, DOB, AGE

	Address Information:
	- STREET, CITY, STATE, COUNTY
	- ZIPCODE, BUILDINGNUMBER
	- SECONDARYADDRESS

	Financial Information:
	- CREDITCARDNUMBER, CREDITCARDISSUER, CREDITCARDCVV
	- ACCOUNTNAME, ACCOUNTNUMBER, IBAN, BIC
	- AMOUNT, CURRENCY, CURRENCYCODE, CURRENCYSYMBOL

	Identification:
	- SSN, PIN, PASSWORD
	- IP, IPV4, IPV6, MAC
	- ETHEREUMADDRESS, BITCOINADDRESS, LITECOINADDRESS

	Professional Information:
	- JOBTITLE, JOBTYPE, JOBAREA, COMPANYNAME
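
The groups above can be turned into a lookup table for keeping or dropping whole categories of detections. This is a minimal sketch, assuming the pipeline's `entity_group` values match the label names listed here (not checked against the model's `id2label` config). `PII_CATEGORIES`, `category_of`, and `filter_by_category` are hypothetical names, and the table covers only the entity types named in this card, not all 56.

```python
# Category lookup built from the entity types listed in this card.
# Assumption: the pipeline's "entity_group" values use these exact names.
PII_CATEGORIES = {
    "personal": {"FIRSTNAME", "LASTNAME", "MIDDLENAME", "EMAIL", "PHONENUMBER",
                 "USERNAME", "DATE", "TIME", "DOB", "AGE"},
    "address": {"STREET", "CITY", "STATE", "COUNTY", "ZIPCODE",
                "BUILDINGNUMBER", "SECONDARYADDRESS"},
    "financial": {"CREDITCARDNUMBER", "CREDITCARDISSUER", "CREDITCARDCVV",
                  "ACCOUNTNAME", "ACCOUNTNUMBER", "IBAN", "BIC",
                  "AMOUNT", "CURRENCY", "CURRENCYCODE", "CURRENCYSYMBOL"},
    "identification": {"SSN", "PIN", "PASSWORD", "IP", "IPV4", "IPV6", "MAC",
                       "ETHEREUMADDRESS", "BITCOINADDRESS", "LITECOINADDRESS"},
    "professional": {"JOBTITLE", "JOBTYPE", "JOBAREA", "COMPANYNAME"},
}


def category_of(label):
    """Return the category name for a label, or None if it is not in the table."""
    for category, labels in PII_CATEGORIES.items():
        if label in labels:
            return category
    return None


def filter_by_category(entities, categories):
    """Keep only detections whose label belongs to one of the given categories."""
    wanted = set(categories)
    return [e for e in entities if category_of(e["entity_group"]) in wanted]


# Hand-written detections in the shape returned by an aggregated NER pipeline.
detections = [
    {"entity_group": "EMAIL", "start": 42, "end": 64},
    {"entity_group": "IBAN", "start": 80, "end": 102},
]
financial_only = filter_by_category(detections, ["financial"])
```

Labels missing from the table map to `None`, so they are dropped by the filter rather than raising an error.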


	## Usage

	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification
	from transformers import pipeline

	# Load model and tokenizer
	model_name = "SoelMgd/bert-pii-detection"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForTokenClassification.from_pretrained(model_name)

	# Create NER pipeline
	ner_pipeline = pipeline(
	    "ner",
	    model=model,
	    tokenizer=tokenizer,
	    aggregation_strategy="simple",
	)

	# Example usage
	text = "Hi, my name is John Smith and my email is john.smith@company.com"
	entities = ner_pipeline(text)
	print(entities)
	```

	## Training Data

	- Dataset: AI4Privacy PII-200k
	- Size: ~209k examples
	- Languages: English, French, German, Italian (this model: English only)
	- Entity Types: 56 different PII categories

	## Performance

	The author reports good precision and recall across entity types. This card does not list numeric scores for F1, precision, or recall.
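
Precision, recall, and F1 for span detection are commonly computed at the entity level with exact matching on position and label. The sketch below illustrates that calculation on hand-written spans. It is not the author's evaluation script, and `span_prf` is a hypothetical helper.

```python
def span_prf(gold, pred):
    """Entity-level precision, recall, and F1 with exact (start, end, label) matching."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Spans are (start, end, label) character offsets; the values are illustrative only.
gold = [(15, 19, "FIRSTNAME"), (20, 25, "LASTNAME"), (42, 64, "EMAIL")]
pred = [(15, 19, "FIRSTNAME"), (42, 64, "EMAIL"), (0, 2, "USERNAME")]
precision, recall, f1 = span_prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is strict: a prediction that covers only part of a gold span counts as both a false positive and a false negative.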

	## Intended Use

	This model is designed for:
	- PII detection and masking in text
	- Privacy compliance applications
	- Data anonymization pipelines
	- Content moderation systems
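
For the masking and anonymization use cases, detections can be replaced by their labels using the character offsets that an aggregated `transformers` NER pipeline returns (`start`, `end`, `entity_group`). This is a minimal sketch: `mask_entities` is a hypothetical helper, the detections are hand-written rather than real model output, and spans are assumed not to overlap.

```python
def mask_entities(text, entities):
    """Replace each detected span with its label in square brackets."""
    masked = text
    # Work from right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        masked = masked[:ent["start"]] + placeholder + masked[ent["end"]:]
    return masked


text = "Hi, my name is John Smith and my email is john.smith@company.com"
# Illustrative detections; real output also carries "score" and "word" keys.
entities = [
    {"entity_group": "FIRSTNAME", "start": 15, "end": 19},
    {"entity_group": "LASTNAME", "start": 20, "end": 25},
    {"entity_group": "EMAIL", "start": 42, "end": 64},
]
print(mask_entities(text, entities))
# Hi, my name is [FIRSTNAME] [LASTNAME] and my email is [EMAIL]
```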

	## Limitations

	- Trained on English text only
	- May not generalize to domain-specific jargon
	- Performance may vary on very short or very long texts
	- Should be validated on your specific use case
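
One way to handle long inputs is to split the text into overlapping windows, run detection on each window, and shift the offsets back to the full text. This is a sketch under stated assumptions: the base model `distilbert-base-uncased` accepts at most 512 tokens, the character budget below is only a rough stand-in for that token limit, and `chunk_text` and `detect_long` are hypothetical helpers that are not part of `transformers`.

```python
def chunk_text(text, max_chars=1000, overlap=100):
    """Yield (offset, chunk) pairs of overlapping character windows."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to cut at a space so words are not split across windows.
            cut = text.rfind(" ", start, end)
            if cut > start + overlap:
                end = cut
        yield start, text[start:end]
        if end == len(text):
            break
        start = end - overlap


def detect_long(text, ner, max_chars=1000, overlap=100):
    """Run `ner` on each window and return detections with full-text offsets."""
    seen = set()
    results = []
    for offset, chunk in chunk_text(text, max_chars, overlap):
        for ent in ner(chunk):
            shifted = dict(ent, start=ent["start"] + offset, end=ent["end"] + offset)
            key = (shifted["start"], shifted["end"], shifted["entity_group"])
            # The overlap region is processed twice, so drop exact duplicates.
            if key not in seen:
                seen.add(key)
                results.append(shifted)
    return results


# Usage, with `ner_pipeline` built as in the Usage section:
# entities = detect_long(long_text, ner_pipeline)
```

An entity that straddles a window boundary may still be detected only partially in one window, so the overlap should be longer than the longest entity you expect.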