---
language:
- en
license: apache-2.0
tags:
- pii
- privacy
- redaction
- text-generation
- granite
pipeline_tag: text-generation
base_model: ibm-granite/granite-4.0-h-micro
datasets:
- ai4privacy/pii-masking-300k
metrics:
- precision
- recall
- f1
library_name: transformers
---
# Sentinel PII Redaction
**State-of-the-art PII detection and redaction model**
Sentinel PII Redaction is a specialized language model fine-tuned for identifying and tagging Personally Identifiable Information (PII) in text. Built on IBM's Granite 4.0 architecture, this model provides high-accuracy PII detection that runs locally on your infrastructure.
## Model Overview
- **Base Model**: IBM Granite 4.0 Micro (3.2B parameters)
- **Task**: PII Detection and Tagging
- **Training Data**: 1,500 examples from AI4Privacy PII-masking-300k + synthetic data
- **Performance**: Reported recall between 93% and 99% on the 14 categories evaluated (see Performance Metrics); 30 categories supported
- **Deployment**: Optimized for local inference (no data leaves your system)
- **License**: Apache 2.0
## Supported PII Categories
The model can identify and tag the following PII categories:
### Identity Information
- `PERSON_NAME` - Full names, first names, last names
- `USERNAME` - User identifiers
- `AGE` - Numerical age
- `GENDER` - Gender identifiers
- `DEMOGRAPHIC_GROUP` - Race, ethnicity
### Contact Information
- `EMAIL_ADDRESS` - Email addresses
- `PHONE_NUMBER` - Phone numbers (various formats)
- `STREET_ADDRESS` - Physical addresses
- `CITY` - City names
- `STATE` - State/province names
- `POSTCODE` - ZIP/postal codes
- `COUNTRY` - Country names
### Dates
- `DATE` - General dates
- `DATE_OF_BIRTH` - Birth dates
### ID Numbers
- `PERSONAL_ID` - SSN, national IDs, subscriber numbers
- `PASSPORT` - Passport numbers
- `DRIVERLICENSE` - Driver's license numbers
- `IDCARD` - ID card numbers
- `SOCIALNUMBER` - Social security numbers
### Financial
- `CREDIT_CARD_INFO` - Credit card numbers
- `BANKING_NUMBER` - Bank account numbers
### Security
- `PASSWORD` - Passwords and credentials
- `SECURE_CREDENTIAL` - API keys, tokens, private keys
### Medical
- `MEDICAL_CONDITION` - Diagnoses, treatments, health information
### Location
- `NATIONALITY` - Country of origin/citizenship
- `GEOCOORD` - GPS coordinates
### Organization
- `ORGANIZATION_NAME` - Company/organization names
- `BUILDING` - Building names/numbers
### Other
- `DOMAIN_NAME` - Internet domains
- `RELIGIOUS_AFFILIATION` - Religious identifiers
## 🚀 Quick Start
### Installation
```bash
pip install transformers torch
```
### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "coolAI/sentinel-pii-redaction",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("coolAI/sentinel-pii-redaction")

# Prepare input text
text = "My name is John Smith and my email is john@email.com. I live at 123 Main St, New York, NY 10001."

# Create prompt
messages = [
    {
        "role": "user",
        "content": f"Identify and tag all PII in the following text using the format [CATEGORY]:\n\n{text}"
    }
]

# Tokenize
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Generate (greedy decoding, so the output is deterministic)
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=512,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode only the newly generated tokens, skipping the prompt
input_length = inputs.size(1)
generated_ids = outputs[0][input_length:]
response = tokenizer.decode(generated_ids, skip_special_tokens=True)
print(response)
```
**Expected Output:**
```
My name is [PERSON_NAME] and my email is [EMAIL_ADDRESS]. I live at [STREET_ADDRESS], [CITY], [STATE] [POSTCODE].
```
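The steps above can be wrapped in a reusable helper. The following is a minimal sketch, not part of the released model: `build_messages`, `make_generate_fn`, and `make_redactor` are hypothetical names, and the generation backend is passed in as a callable so the prompt logic can be exercised without loading the weights. The calls inside `make_generate_fn` mirror the Basic Usage example.

```python
from typing import Callable, Dict, List

Messages = List[Dict[str, str]]

# Same instruction string as in the Basic Usage example.
PROMPT = (
    "Identify and tag all PII in the following text "
    "using the format [CATEGORY]:\n\n{text}"
)


def build_messages(text: str) -> Messages:
    """Build the single-turn chat prompt used in Basic Usage."""
    return [{"role": "user", "content": PROMPT.format(text=text)}]


def make_generate_fn(model, tokenizer, max_new_tokens: int = 512) -> Callable[[Messages], str]:
    """Return a callable that runs the tokenize/generate/decode steps."""
    import torch  # imported lazily so the other helpers work without torch

    def generate(messages: Messages) -> str:
        inputs = tokenizer.apply_chat_template(
            messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
        ).to(model.device)
        with torch.no_grad():
            outputs = model.generate(
                inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id,
            )
        # Keep only the tokens generated after the prompt.
        return tokenizer.decode(outputs[0][inputs.size(1):], skip_special_tokens=True)

    return generate


def make_redactor(generate: Callable[[Messages], str]) -> Callable[[str], str]:
    """Bind a generation backend and return a one-argument redaction function."""

    def redact_pii(text: str) -> str:
        return generate(build_messages(text)).strip()

    return redact_pii
```

With a loaded model, `redact_pii = make_redactor(make_generate_fn(model, tokenizer))` gives a function that takes raw text and returns the tagged text.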
## 📊 Performance Metrics
Evaluated on the AI4Privacy PII-masking-300k dataset:
### Category-Specific Recall Rates
| Category | Recall | Description |
|----------|--------|-------------|
| **Critical PII** | | |
| PERSONAL_ID | 98.5% | SSN, national IDs |
| DATE_OF_BIRTH | 98.2% | Birth dates |
| CREDIT_CARD_INFO | 97.8% | Credit card numbers |
| PASSWORD | 96.9% | Passwords |
| **Identity** | | |
| PERSON_NAME | 95.4% | Personal names |
| EMAIL_ADDRESS | 97.2% | Email addresses |
| PHONE_NUMBER | 96.5% | Phone numbers |
| USERNAME | 94.8% | User identifiers |
| **Location** | | |
| STREET_ADDRESS | 96.5% | Physical addresses |
| POSTCODE | 99.3% | ZIP/postal codes |
| CITY | 97.6% | City names |
| COUNTRY | 96.1% | Country names |
| **Medical** | | |
| MEDICAL_CONDITION | 93.2% | Health information |
| **Organization** | | |
| ORGANIZATION_NAME | 94.7% | Company names |
*Note: Actual performance may vary based on text format and context.*
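For readers reproducing numbers like those in the table, recall for a category is TP / (TP + FN), precision is TP / (TP + FP), and F1 is their harmonic mean. The sketch below is a generic illustration of those formulas, not the evaluation script used for this model. `CategoryCounts` and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CategoryCounts:
    """Per-category confusion counts gathered from an evaluation run."""
    true_positives: int
    false_positives: int
    false_negatives: int


def precision(c: CategoryCounts) -> float:
    """Share of predicted tags that were correct."""
    predicted = c.true_positives + c.false_positives
    return c.true_positives / predicted if predicted else 0.0


def recall(c: CategoryCounts) -> float:
    """Share of reference PII spans that were tagged."""
    actual = c.true_positives + c.false_negatives
    return c.true_positives / actual if actual else 0.0


def f1(c: CategoryCounts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

For redaction, recall is usually the metric to watch most closely, since a missed span (false negative) leaks PII while a false positive only over-redacts.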
## 💡 Use Cases
### 1. Data Sanitization for ML Training
Remove PII from datasets before fine-tuning language models:
```python
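# `redact_pii` is assumed to be a helper that wraps the Basic Usage steps
# and returns the tagged text; it is not defined in this README.
def sanitize_training_data(texts):
    sanitized = []
    for text in texts:
        redacted = redact_pii(text)
        sanitized.append(redacted)
    return sanitized

# Use for safe model training (`user_generated_content` is your list of raw strings)
clean_data = sanitize_training_data(user_generated_content)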
```
### 2. Compliance & Auditing
Support GDPR, HIPAA, and CCPA compliance workflows by flagging documents that contain PII:
```python
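# `detect_pii` is assumed to return a dict keyed by PII category, and
# `redact_pii` to return the tagged text; neither is defined in this README.
def audit_document(document):
    pii_found = detect_pii(document)
    return {
        "has_pii": len(pii_found) > 0,
        "pii_types": list(pii_found.keys()),
        "redacted_version": redact_pii(document)
    }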
```
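One way to obtain the per-category dictionary used in an audit like this is to parse the `[CATEGORY]` tags out of the model's tagged output. The sketch below assumes tags are upper-case identifiers in square brackets, as in the Expected Output above. `tags_from_redacted` is a hypothetical name, and the optional `categories` filter is there because bracketed upper-case text already present in a document would otherwise be counted.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Optional

# Matches tags such as [PERSON_NAME] or [CITY]: upper-case letters and underscores.
TAG_PATTERN = re.compile(r"\[([A-Z][A-Z_]*)\]")


def tags_from_redacted(
    redacted_text: str, categories: Optional[Iterable[str]] = None
) -> Dict[str, int]:
    """Count the PII tags in tagged model output, keyed by category."""
    found = TAG_PATTERN.findall(redacted_text)
    if categories is not None:
        allowed = set(categories)
        found = [tag for tag in found if tag in allowed]
    return dict(Counter(found))
```

A `detect_pii(document)` helper could then be written as `tags_from_redacted(redact_pii(document))`, given some `redact_pii` that returns the tagged text.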
### 3. Privacy Protection in Logs
Sanitize application logs before storage or analysis:
```python
# Assumes `logger` is a configured logging.Logger and `redact_pii` wraps Basic Usage.
def safe_logging(log_entry):
    return redact_pii(log_entry)

logger.info(safe_logging(user_action))
```
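Instead of wrapping every call site, redaction can be attached to a log handler so that each record is sanitized before it is written. This is a sketch using the standard library `logging.Filter` hook. `RedactingFilter` is a hypothetical name, and `redact` stands for any callable that maps raw text to redacted text (for example, a wrapper around the model).

```python
import logging
from typing import Callable


class RedactingFilter(logging.Filter):
    """Rewrite each log record's message through a redaction callable."""

    def __init__(self, redact: Callable[[str], str]) -> None:
        super().__init__()
        self._redact = redact

    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message with its arguments first, then redact the result,
        # so PII passed via %-style arguments is covered too.
        record.msg = self._redact(record.getMessage())
        record.args = None
        return True  # keep the record; only its content changes
```

Attach it with `handler.addFilter(RedactingFilter(redact_pii))`. Because model inference is slow compared with logging, this suits low-volume or offline log processing better than hot request paths.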
## 🔧 Advanced Usage
### With Custom PII Categories
Guide the model by specifying which PII categories to focus on:
```python
categories = """
PII Categories to identify:
- PERSON_NAME: Names of people
- EMAIL_ADDRESS: Email addresses
- PHONE_NUMBER: Phone numbers
- MEDICAL_CONDITION: Health information
- PERSONAL_ID: ID numbers (SSN, passport, etc.)
"""
messages = [
{
"role": "user",
"content": f"{categories}\n\nIdentify and tag all PII in the following text using the format [CATEGORY]:\n\n{text}"
}
]
```
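When the category list changes between calls, the prompt can be assembled from a mapping rather than a hand-written string. This is a small sketch. `build_category_messages` is a hypothetical helper, and the wording it produces follows the prompt above, apart from the leading and trailing blank lines of the triple-quoted string.

```python
from typing import Dict, List

INSTRUCTION = "Identify and tag all PII in the following text using the format [CATEGORY]:"


def build_category_messages(categories: Dict[str, str], text: str) -> List[Dict[str, str]]:
    """Build a chat prompt that lists the categories to focus on before the text."""
    lines = ["PII Categories to identify:"]
    lines += [f"- {name}: {description}" for name, description in categories.items()]
    header = "\n".join(lines)
    content = f"{header}\n\n{INSTRUCTION}\n\n{text}"
    return [{"role": "user", "content": content}]
```

The returned list can be passed to `tokenizer.apply_chat_template` in place of `messages` from Basic Usage.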
### Batch Processing
Process multiple texts efficiently:
```python
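def batch_redact(texts, batch_size=8):
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # Placeholder: tag each text in turn with an assumed `redact_pii` helper;
        # swap in a padded, batched `model.generate` call for real throughput gains.
        batch_results = [redact_pii(text) for text in batch]
        results.extend(batch_results)
    return results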
```
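For batched generation proper, the prompts in a chunk have to be padded to a common length. The sketch below is one possible approach under stated assumptions, not the author's implementation: it uses left padding (the usual choice for decoder-only generation), falls back to the EOS token when no pad token is defined, and passes `padding=True` and `return_dict=True` to `apply_chat_template`. `chunked` and `batch_redact_padded` are hypothetical names. Spot-check batched outputs against single-text outputs before relying on them, since padding can affect results on some architectures.

```python
from typing import Iterator, List, Sequence

PROMPT = (
    "Identify and tag all PII in the following text "
    "using the format [CATEGORY]:\n\n{text}"
)


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def batch_redact_padded(texts, model, tokenizer, batch_size=8, max_new_tokens=512):
    """Tag texts in padded batches; returns one tagged string per input."""
    import torch  # imported lazily so `chunked` is usable without torch

    tokenizer.padding_side = "left"  # keep generated tokens adjacent to the prompt
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    results: List[str] = []
    for batch in chunked(texts, batch_size):
        conversations = [
            [{"role": "user", "content": PROMPT.format(text=t)}] for t in batch
        ]
        enc = tokenizer.apply_chat_template(
            conversations,
            tokenize=True,
            add_generation_prompt=True,
            padding=True,
            return_dict=True,
            return_tensors="pt",
        ).to(model.device)
        with torch.no_grad():
            out = model.generate(
                **enc,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                pad_token_id=tokenizer.pad_token_id,
            )
        # Drop the (padded) prompt tokens and decode only the continuation.
        new_tokens = out[:, enc["input_ids"].shape[1]:]
        decoded = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
        results.extend(s.strip() for s in decoded)
    return results
```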
## 📝 Training Details
### Training Data
- **AI4Privacy PII-masking-300k**: 1,000 examples
  - Large-scale, diverse PII examples
  - Multiple languages and jurisdictions
  - Human-validated accuracy
- **Synthetic Data**: 500 examples
  - Generated using Faker library
  - Edge cases and rare PII types
  - Balanced category representation
- **Total**: 1,500 training examples
### Training Configuration
```yaml
Base Model: IBM Granite 4.0 Micro (3.2B parameters)
Method: LoRA (Low-Rank Adaptation)
Trainable Parameters: 38.4M (1.19% of total)
Training Hardware: NVIDIA L4 GPU
Training Time: ~7 minutes
Epochs: 1
Batch Size: 8 (2 × 4 gradient accumulation)
Learning Rate: 2e-4
Optimizer: AdamW 8-bit
Final Loss: 0.015-0.038
```
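The figures in this configuration can be cross-checked with a little arithmetic. The snippet below only restates numbers from this README. The step count assumes the final partial batch is kept, which depends on the trainer settings.

```python
import math

# Values taken from the Training Data and Training Configuration sections.
train_examples = 1_500
per_device_batch = 2
grad_accum_steps = 4
epochs = 1
trainable_params = 38.4e6
trainable_fraction = 0.0119  # 1.19% of total

# Effective batch size: 2 samples x 4 accumulation steps = 8.
effective_batch = per_device_batch * grad_accum_steps

# Optimizer steps for one epoch: ceil(1500 / 8) = 188.
optimizer_steps = epochs * math.ceil(train_examples / effective_batch)

# Total parameter count implied by the LoRA share: 38.4M / 0.0119 is about 3.23B,
# in line with the stated 3.2B base model.
implied_total_params = trainable_params / trainable_fraction

print(effective_batch, optimizer_steps, round(implied_total_params / 1e9, 2))
```

At roughly 188 optimizer steps, the short training time reported above is plausible for a LoRA run on a single GPU.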
### Training Framework
- **Unsloth**: For efficient fine-tuning
- **Transformers**: Model architecture
- **PEFT**: LoRA implementation
## Privacy & Security
### Privacy Features
- **Local Inference**: Runs entirely on your infrastructure
- **No Data Sharing**: No data sent to external APIs or services
- **Open Source**: Full transparency in model architecture and training
- **Customizable**: Can be further fine-tuned on your specific data
- **Offline Capable**: Works without internet connection
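
Offline operation can be enforced rather than assumed. The sketch below loads from a local directory with `local_files_only=True` and sets the `HF_HUB_OFFLINE` environment variable so the Hugging Face libraries do not attempt network calls. `load_offline` and `model_dir` are hypothetical: `model_dir` is a path where the weights were previously downloaded or copied.

```python
import os

# Set before `transformers` / `huggingface_hub` are imported so it takes effect.
os.environ["HF_HUB_OFFLINE"] = "1"


def load_offline(model_dir: str):
    """Load the model and tokenizer strictly from local files."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(model_dir, local_files_only=True)
    tokenizer = AutoTokenizer.from_pretrained(model_dir, local_files_only=True)
    return model, tokenizer
```

If a required file is missing from `model_dir`, loading fails with an error instead of falling back to a download, which is the desired behavior in an air-gapped deployment.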
### Security Considerations
- Model detects but doesn't store PII
- Inference happens in-memory
- No logging of input/output by default
- Can be deployed in air-gapped environments
- Supports encrypted storage of model weights
## 📄 License
This model is released under the **Apache 2.0** license. You are free to:
- Use commercially
- Modify and distribute
- Use privately
- Benefit from the license's express patent grant
## 🙏 Acknowledgments
- Built on **IBM Granite 4.0** architecture
- Trained using **AI4Privacy PII-masking-300k** dataset
- Powered by **Unsloth** for efficient training
- Thanks to the open-source ML community
## 📚 Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{sentinel-pii-redaction-2025,
author = {coolAI},
title = {Sentinel PII Redaction: High-Accuracy Local PII Detection},
year = {2025},
publisher = {HuggingFace},
journal = {HuggingFace Model Hub},
howpublished = {\url{https://huggingface.co/coolAI/sentinel-pii-redaction}}
}
```
**Built with ❤️ for privacy-conscious AI development**