	---
	language:
	- en
	license: apache-2.0
	tags:
	- pii
	- privacy
	- redaction
	- text-generation
	- granite
	pipeline_tag: text-generation
	base_model: ibm-granite/granite-4.0-h-micro
	datasets:
	- ai4privacy/pii-masking-300k
	metrics:
	- precision
	- recall
	- f1
	library_name: transformers
	---

	# Sentinel PII Redaction

	State-of-the-art PII detection and redaction model

	Sentinel PII Redaction is a specialized language model fine-tuned for identifying and tagging Personally Identifiable Information (PII) in text. Built on IBM's Granite 4.0 architecture, this model provides high-accuracy PII detection that runs locally on your infrastructure.

	## Model Overview

	- Base Model: IBM Granite 4.0 Micro (3.2B parameters)
	- Task: PII Detection and Tagging
	- Training Data: 1,500 examples (1,000 from AI4Privacy PII-masking-300k plus 500 synthetic)
	- Performance: 93.2% to 99.3% recall across the 14 categories reported under Performance Metrics, with 11 of them at or above 95%
	- Deployment: Optimized for local inference (no data leaves your system)
	- License: Apache 2.0

	## Supported PII Categories

	The model can identify and tag the following PII categories:

	### Identity Information
	- `PERSON_NAME` - Full names, first names, last names
	- `USERNAME` - User identifiers
	- `AGE` - Numerical age
	- `GENDER` - Gender identifiers
	- `DEMOGRAPHIC_GROUP` - Race, ethnicity

	### Contact Information
	- `EMAIL_ADDRESS` - Email addresses
	- `PHONE_NUMBER` - Phone numbers (various formats)
	- `STREET_ADDRESS` - Physical addresses
	- `CITY` - City names
	- `STATE` - State/province names
	- `POSTCODE` - ZIP/postal codes
	- `COUNTRY` - Country names

	### Dates
	- `DATE` - General dates
	- `DATE_OF_BIRTH` - Birth dates

	### ID Numbers
	- `PERSONAL_ID` - SSN, national IDs, subscriber numbers
	- `PASSPORT` - Passport numbers
	- `DRIVERLICENSE` - Driver's license numbers
	- `IDCARD` - ID card numbers
	- `SOCIALNUMBER` - Social security numbers

	### Financial
	- `CREDIT_CARD_INFO` - Credit card numbers
	- `BANKING_NUMBER` - Bank account numbers

	### Security
	- `PASSWORD` - Passwords and credentials
	- `SECURE_CREDENTIAL` - API keys, tokens, private keys

	### Medical
	- `MEDICAL_CONDITION` - Diagnoses, treatments, health information

	### Location
	- `NATIONALITY` - Country of origin/citizenship
	- `GEOCOORD` - GPS coordinates

	### Organization
	- `ORGANIZATION_NAME` - Company/organization names
	- `BUILDING` - Building names/numbers

	### Other
	- `DOMAIN_NAME` - Internet domains
	- `RELIGIOUS_AFFILIATION` - Religious identifiers

	## 🚀 Quick Start

	### Installation

	```bash
	# accelerate is needed for device_map="auto" in the example below
	pip install transformers torch accelerate
	```

	### Basic Usage

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	# Load model and tokenizer
	model = AutoModelForCausalLM.from_pretrained(
	    "coolAI/sentinel-pii-redaction",
	    torch_dtype=torch.float16,
	    device_map="auto"
	)
	tokenizer = AutoTokenizer.from_pretrained("coolAI/sentinel-pii-redaction")

	# Prepare input text
	text = "My name is John Smith and my email is john@email.com. I live at 123 Main St, New York, NY 10001."

	# Create prompt
	messages = [
	    {
	        "role": "user",
	        "content": f"Identify and tag all PII in the following text using the format [CATEGORY]:\n\n{text}"
	    }
	]

	# Tokenize
	inputs = tokenizer.apply_chat_template(
	    messages,
	    tokenize=True,
	    add_generation_prompt=True,
	    return_tensors="pt"
	).to(model.device)

	# Generate (greedy decoding, so the output is deterministic)
	with torch.no_grad():
	    outputs = model.generate(
	        inputs,
	        max_new_tokens=512,
	        do_sample=False,
	        pad_token_id=tokenizer.eos_token_id
	    )

	# Decode output
	input_length = inputs.size(1)
	generated_ids = outputs[0][input_length:]
	response = tokenizer.decode(generated_ids, skip_special_tokens=True)

	print(response)
	```

	Expected Output:
	```
	My name is [PERSON_NAME] and my email is [EMAIL_ADDRESS]. I live at [STREET_ADDRESS], [CITY], [STATE] [POSTCODE].
	```
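
The steps above can be wrapped into a reusable helper. This is a minimal sketch, not an API shipped with the model: `build_messages`, `make_generate_fn`, and `redact_pii` are hypothetical names, and the two-argument `redact_pii` takes the generation function as a parameter so the prompt logic can be exercised without loading the weights.

```python
from typing import Callable, Dict, List

Messages = List[Dict[str, str]]

# Same instruction string as in the Basic Usage example.
PROMPT_TEMPLATE = (
    "Identify and tag all PII in the following text "
    "using the format [CATEGORY]:\n\n{text}"
)


def build_messages(text: str) -> Messages:
    """Build the single-turn chat prompt used in the Basic Usage example."""
    return [{"role": "user", "content": PROMPT_TEMPLATE.format(text=text)}]


def make_generate_fn(model, tokenizer, max_new_tokens: int = 512) -> Callable[[Messages], str]:
    """Wrap a loaded model/tokenizer pair; mirrors the generation steps above."""
    import torch  # imported lazily so the pure helpers work without torch installed

    def generate(messages: Messages) -> str:
        inputs = tokenizer.apply_chat_template(
            messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
        ).to(model.device)
        with torch.no_grad():
            outputs = model.generate(
                inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id,
            )
        # Drop the prompt tokens and decode only the newly generated ones.
        return tokenizer.decode(outputs[0][inputs.size(1):], skip_special_tokens=True)

    return generate


def redact_pii(text: str, generate_fn: Callable[[Messages], str]) -> str:
    """Return the model's tagged version of `text`; blank input is passed through."""
    if not text.strip():
        return text
    return generate_fn(build_messages(text)).strip()
```

With a loaded model, `functools.partial(redact_pii, generate_fn=make_generate_fn(model, tokenizer))` gives the single-argument form that the Use Cases snippets call. Because the response repeats the input text, `max_new_tokens` likely needs to be at least as large as the tokenized input.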

	## 📊 Performance Metrics

	Evaluated on the AI4Privacy PII-masking-300k dataset:

	### Category-Specific Recall Rates

	| Category | Recall | Description |
	|----------|--------|-------------|
	| Critical PII | | |
	| PERSONAL_ID | 98.5% | SSN, national IDs |
	| DATE_OF_BIRTH | 98.2% | Birth dates |
	| CREDIT_CARD_INFO | 97.8% | Credit card numbers |
	| PASSWORD | 96.9% | Passwords |
	| Identity | | |
	| PERSON_NAME | 95.4% | Personal names |
	| EMAIL_ADDRESS | 97.2% | Email addresses |
	| PHONE_NUMBER | 96.5% | Phone numbers |
	| USERNAME | 94.8% | User identifiers |
	| Location | | |
	| STREET_ADDRESS | 96.5% | Physical addresses |
	| POSTCODE | 99.3% | ZIP/postal codes |
	| CITY | 97.6% | City names |
	| COUNTRY | 96.1% | Country names |
	| Medical | | |
	| MEDICAL_CONDITION | 93.2% | Health information |
	| Organization | | |
	| ORGANIZATION_NAME | 94.7% | Company names |
	Note: Actual performance may vary based on text format and context.
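
As a one-number summary, the per-category figures in the table can be averaged. The snippet below computes the unweighted (macro) mean of the 14 reported recall values and lists the categories below 95%. This is plain arithmetic on the table, not an additional evaluation result; a micro-averaged recall would need per-category example counts, which are not given here.

```python
# Recall values copied from the table above, in percent.
recall = {
    "PERSONAL_ID": 98.5, "DATE_OF_BIRTH": 98.2, "CREDIT_CARD_INFO": 97.8, "PASSWORD": 96.9,
    "PERSON_NAME": 95.4, "EMAIL_ADDRESS": 97.2, "PHONE_NUMBER": 96.5, "USERNAME": 94.8,
    "STREET_ADDRESS": 96.5, "POSTCODE": 99.3, "CITY": 97.6, "COUNTRY": 96.1,
    "MEDICAL_CONDITION": 93.2, "ORGANIZATION_NAME": 94.7,
}

# Unweighted mean: every category counts equally regardless of how often it occurs.
macro_recall = sum(recall.values()) / len(recall)
below_95 = sorted(name for name, value in recall.items() if value < 95.0)

print(f"{len(recall)} categories, macro-average recall {macro_recall:.1f}%")
# 14 categories, macro-average recall 96.6%
print("below 95%:", below_95)
# below 95%: ['MEDICAL_CONDITION', 'ORGANIZATION_NAME', 'USERNAME']
```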

	## 💡 Use Cases

	### 1. Data Sanitization for ML Training
	Remove PII from datasets before fine-tuning language models:

	```python
	# redact_pii is assumed to wrap the Basic Usage code above; it is not defined in this README
	def sanitize_training_data(texts):
	    sanitized = []
	    for text in texts:
	        redacted = redact_pii(text)
	        sanitized.append(redacted)
	    return sanitized

	# Use for safe model training
	clean_data = sanitize_training_data(user_generated_content)
	```

	### 2. Compliance & Auditing
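	Flag documents that contain PII as one step in GDPR, HIPAA, or CCPA compliance reviews:

	```python
	# detect_pii and redact_pii are assumed helpers, not defined in this README.
	# detect_pii is expected to return a dict mapping category -> detected values.
	def audit_document(document):
	    pii_found = detect_pii(document)
	    return {
	        "has_pii": len(pii_found) > 0,
	        "pii_types": list(pii_found.keys()),
	        "redacted_version": redact_pii(document)
	    }
	```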
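
The model returns tagged text rather than a list of findings, so a `detect_pii` helper has to recover the original values somehow. One possible approach, sketched below under the assumption that the model copies all non-PII text verbatim, turns the tagged output into a regular expression and matches it against the original. The function name and its two-argument signature are hypothetical; it takes the tagged output as a parameter instead of calling the model itself.

```python
import re
from collections import defaultdict

# Placeholders such as [PERSON_NAME]; assumes upper-case tags with underscores.
TAG_PATTERN = re.compile(r"\[([A-Z_]+)\]")


def detect_pii(original: str, redacted: str) -> dict[str, list[str]]:
    """Map each category to the original substrings the model replaced.

    Raises ValueError if the literal text around the tags does not line up
    with the original (for example, if the model rephrased something).
    """
    # split() with one capture group yields: literal, tag, literal, tag, ..., literal
    parts = TAG_PATTERN.split(redacted)
    pattern_parts = []
    categories = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern_parts.append(re.escape(part))  # unchanged text must match exactly
        else:
            pattern_parts.append("(.+?)")          # a tag stands for some removed text
            categories.append(part)

    match = re.fullmatch("".join(pattern_parts), original, flags=re.DOTALL)
    if match is None:
        raise ValueError("redacted text does not align with the original")

    found = defaultdict(list)
    for category, value in zip(categories, match.groups()):
        found[category].append(value)
    return dict(found)


original = "My name is John Smith and my email is john@email.com."
redacted = "My name is [PERSON_NAME] and my email is [EMAIL_ADDRESS]."
print(detect_pii(original, redacted))
# {'PERSON_NAME': ['John Smith'], 'EMAIL_ADDRESS': ['john@email.com']}
```

The alignment is ambiguous when two tags are adjacent with no text between them, and it misfires if the source text already contains bracketed upper-case tokens, so the recovered values are best treated as approximate.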

	### 3. Privacy Protection in Logs
	Sanitize application logs before storage or analysis:

	```python
	import logging

	logger = logging.getLogger(__name__)

	def safe_logging(log_entry):
	    return redact_pii(log_entry)  # assumed helper wrapping the Basic Usage code

	logger.info(safe_logging(user_action))
	```
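
Instead of wrapping every call site, redaction can be attached to the logging pipeline with a standard `logging.Filter`. The sketch below is one way to do that. `RedactingFilter` is a hypothetical class, and `regex_email_redactor` is a regex stand-in used only so the example runs without the model; in practice the model-backed redaction function would be passed in its place.

```python
import logging
import re
from typing import Callable


class RedactingFilter(logging.Filter):
    """Logging filter that rewrites each record's message through `redact`."""

    def __init__(self, redact: Callable[[str], str]):
        super().__init__()
        self._redact = redact

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so PII passed through %-style args is redacted too.
        record.msg = self._redact(record.getMessage())
        record.args = None
        return True  # keep the record; only its text changes


def regex_email_redactor(message: str) -> str:
    """Stand-in for demonstration only: masks e-mail addresses with a regex."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[EMAIL_ADDRESS]", message)


handler = logging.StreamHandler()
handler.addFilter(RedactingFilter(regex_email_redactor))

logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("password reset requested by %s", "john@email.com")
# stderr: password reset requested by [EMAIL_ADDRESS]
```

Attaching the filter to the handler rather than the logger means records propagated from child loggers are redacted as well. Running a 3B-parameter model on every log line adds noticeable latency, so this pattern probably suits offline or queued log processing better than hot request paths.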

	## 🔧 Advanced Usage

	### With Custom PII Categories

	Guide the model by specifying which PII categories to focus on:

	```python
	categories = """
	PII Categories to identify:
	- PERSON_NAME: Names of people
	- EMAIL_ADDRESS: Email addresses
	- PHONE_NUMBER: Phone numbers
	- MEDICAL_CONDITION: Health information
	- PERSONAL_ID: ID numbers (SSN, passport, etc.)
	"""

	# `text` is the input string, as in the Basic Usage example
	messages = [
	    {
	        "role": "user",
	        "content": f"{categories}\n\nIdentify and tag all PII in the following text using the format [CATEGORY]:\n\n{text}"
	    }
	]
	```
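
If the category subset changes between calls, the prompt can be assembled from a dictionary rather than a hand-written string. The helper below is a hypothetical sketch that follows the layout of the example above. Its whitespace differs slightly from the triple-quoted string (no leading or trailing newline), and whether that affects model behavior has not been checked.

```python
def build_category_prompt(categories: dict[str, str], text: str) -> str:
    """Build a prompt that lists the categories to focus on, then the tagging instruction."""
    lines = ["PII Categories to identify:"]
    lines += [f"- {name}: {description}" for name, description in categories.items()]
    header = "\n".join(lines)
    return (
        f"{header}\n\n"
        "Identify and tag all PII in the following text "
        f"using the format [CATEGORY]:\n\n{text}"
    )


prompt = build_category_prompt(
    {
        "PERSON_NAME": "Names of people",
        "EMAIL_ADDRESS": "Email addresses",
    },
    "Contact Jane Doe at jane@example.org",
)
messages = [{"role": "user", "content": prompt}]
```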

	### Batch Processing

	Process multiple texts efficiently:

	```python
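	def batch_redact(texts, batch_size=8):
	    results = []
	    for i in range(0, len(texts), batch_size):
	        batch = texts[i:i+batch_size]
	        # Placeholder: redacts one text at a time with an assumed redact_pii helper
	        # (not defined in this README); swap in a padded, batched generate() call
	        batch_results = [redact_pii(t) for t in batch]
	        results.extend(batch_results)
	    return results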
	```

	## 📝 Training Details

	### Training Data

	- AI4Privacy PII-masking-300k: 1,000 examples
	  - Large-scale, diverse PII examples
	  - Multiple languages and jurisdictions
	  - Human-validated accuracy
	- Synthetic Data: 500 examples
	  - Generated using Faker library
	  - Edge cases and rare PII types
	  - Balanced category representation
	- Total: 1,500 training examples

	### Training Configuration

	```yaml
	Base Model: IBM Granite 4.0 Micro (3.2B parameters)
	Method: LoRA (Low-Rank Adaptation)
	Trainable Parameters: 38.4M (1.19% of total)
	Training Hardware: NVIDIA L4 GPU
	Training Time: ~7 minutes
	Epochs: 1
	Batch Size: 8 (2 × 4 gradient accumulation)
	Learning Rate: 2e-4
	Optimizer: AdamW 8-bit
	Final Loss: 0.015-0.038
	```
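
A quick arithmetic check on the configuration above: with 1,500 examples, an effective batch size of 8, and one epoch, training amounts to roughly 188 optimizer steps, and the stated trainable share implies a total parameter count close to the 3.2B quoted for the base model. The calculation assumes the final partial batch is kept rather than dropped.

```python
import math

examples = 1500                # total training examples
effective_batch = 2 * 4        # per-device batch size x gradient accumulation steps
epochs = 1
trainable_params = 38.4e6      # LoRA parameters
trainable_fraction = 0.0119    # 1.19% of total

# Number of optimizer updates, counting the last partial batch as a step.
optimizer_steps = math.ceil(examples / effective_batch) * epochs

# Total parameter count implied by the trainable share.
implied_total = trainable_params / trainable_fraction

print(optimizer_steps)                   # 188
print(f"{implied_total / 1e9:.2f}B")     # 3.23B
```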

	### Training Framework

	- Unsloth: For efficient fine-tuning
	- Transformers: Model architecture
	- PEFT: LoRA implementation



	## Privacy & Security

	### Privacy Features

	- Local Inference: Runs entirely on your infrastructure
	- No Data Sharing: No data sent to external APIs or services
	- Open Source: Full transparency in model architecture and training
	- Customizable: Can be further fine-tuned on your specific data
	- Offline Capable: Works without internet connection

	### Security Considerations

	- Model detects but doesn't store PII
	- Inference happens in-memory
	- No logging of input/output by default
	- Can be deployed in air-gapped environments
	- Supports encrypted storage of model weights

	## 📄 License

	This model is released under the Apache 2.0 license. You are free to:
	- Use commercially
	- Modify and distribute
	- Use privately
	- Use the patent rights that contributors grant under the license


	## 🙏 Acknowledgments

	- Built on IBM Granite 4.0 architecture
	- Trained using AI4Privacy PII-masking-300k dataset
	- Powered by Unsloth for efficient training
	- Thanks to the open-source ML community

	## 📚 Citation

	If you use this model in your research or applications, please cite:

	```bibtex
	@misc{sentinel-pii-redaction-2025,
	  author = {coolAI},
	  title = {Sentinel PII Redaction: High-Accuracy Local PII Detection},
	  year = {2025},
	  publisher = {HuggingFace},
	  journal = {HuggingFace Model Hub},
	  howpublished = {\url{https://huggingface.co/coolAI/sentinel-pii-redaction}}
	}
	```

	Built with ❤️ for privacy-conscious AI development