PII Detector (MiniLM-L6-v2)


Model Description

This model is a highly efficient, lightweight Token Classification model designed to detect Personally Identifiable Information (PII). It is a fine-tuned version of sentence-transformers/all-MiniLM-L6-v2 trained on the nvidia/Nemotron-PII dataset.

Because it is based on the MiniLM architecture, the model is small (~90 MB, 22.6M parameters) and fast on CPU, making it well suited to CPU-only machines, edge devices, and environments with strict data-privacy constraints where cloud-based APIs cannot be used. This model is available for commercial use.

Evaluation Results (Epoch 4)

This checkpoint represents Epoch 4 and shows excellent generalization on the 20,000-sample validation split.

Epoch | Training Loss | Validation Loss | Precision | Recall   | F1 Score | Accuracy
------|---------------|-----------------|-----------|----------|----------|---------
4     | 0.031307      | 0.029043        | 0.933025  | 0.947065 | 0.939993 | 0.992261
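As a quick sanity check, the reported F1 score is consistent with the precision and recall above via the harmonic mean, F1 = 2PR / (P + R):

```python
# Verify the reported F1 against precision and recall (harmonic mean).
precision, recall = 0.933025, 0.947065
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))  # matches the reported 0.939993
```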

⚠️ Production Disclaimer

Please note that automated PII detection is not completely foolproof, and accuracy will vary depending on your specific data context and formatting. We strongly advise thoroughly validating the model on your own data and incorporating human oversight to ensure it meets your intended purpose before any production deployment.
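One lightweight form of human oversight is to auto-accept only high-confidence detections and route the rest to a reviewer. The snippet below is a generic sketch, not part of the model: the 0.90 threshold is an arbitrary illustration (tune it on your own data), and it operates on entity dictionaries shaped like the pipeline output shown later in this card.

```python
# Sketch: split detected entities into auto-accepted vs. needs-human-review
# using an illustrative confidence threshold (not a tuned value).
REVIEW_THRESHOLD = 0.90

def triage(entities, threshold=REVIEW_THRESHOLD):
    accepted = [e for e in entities if e["score"] >= threshold]
    needs_review = [e for e in entities if e["score"] < threshold]
    return accepted, needs_review

# Example entities shaped like the pipeline output
sample = [
    {"word": "john", "entity_group": "first_name", "score": 0.9984},
    {"word": "45", "entity_group": "age", "score": 0.8117},
]
accepted, review = triage(sample)
print(f"auto-accepted: {len(accepted)}, flagged for review: {len(review)}")
```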

Training Parameters

The model was trained with the following parameters:

  • Base Model: sentence-transformers/all-MiniLM-L6-v2
  • Dataset: nvidia/Nemotron-PII (180k Training / 20k Validation split)
  • Learning Rate: 2e-5
  • Batch Size: 64 (per device)
  • Weight Decay: 0.01
  • Max Sequence Length: 512
  • Number of Epochs: 4
  • Task: Token Classification (BIO Tagging format)
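As a rough sketch, the parameters above map onto a standard Hugging Face `Trainer` setup like the following. The actual training script is not published, so treat this as an illustration: `num_labels`, `train_ds`, and `val_ds` are placeholders for the Nemotron-PII BIO label count and tokenized splits.

```python
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer,
                          DataCollatorForTokenClassification)

base = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(base, model_max_length=512)
model = AutoModelForTokenClassification.from_pretrained(
    base, num_labels=num_labels  # placeholder: size of the BIO label set
)

args = TrainingArguments(
    output_dir="minilm-pii",
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    num_train_epochs=4,
    eval_strategy="epoch",  # `evaluation_strategy` on older transformers versions
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # placeholder: tokenized 180k training split
    eval_dataset=val_ds,     # placeholder: tokenized 20k validation split
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```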

πŸ’» How to Use: Local Inference

Because privacy is critical when handling PII, this model is meant to be downloaded and run locally. You can easily test it using the Hugging Face pipeline.

Prerequisites

Make sure you have the transformers library installed in your local environment:

pip install transformers torch

Python Inference Script

from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

model_id = "Negative-Star-Innovators/MiniLM-L6-finetuned-pii-detection"

print("Downloading/Loading model locally...")
# Load the tokenizer and model locally
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Initialize the pipeline
# aggregation_strategy="simple" merges B- and I- tags into single coherent words/phrases
pii_pipeline = pipeline(
    "token-classification", 
    model=model, 
    tokenizer=tokenizer, 
    aggregation_strategy="simple" 
)

# Text containing dummy PII for testing
sample_text = (
    "John Doe's bank routing number is 123456789. "
    "He is 45 years old and his email is john.doe@example.com."
)

print("\nRunning inference locally...")
results = pii_pipeline(sample_text)

# Display the detected PII entities
print("\nDetected PII Entities:")
for entity in results:
    print(f"- Entity: {entity['word']}")
    print(f"  Label:  {entity['entity_group']}")
    print(f"  Score:  {entity['score']:.4f}\n")

Expected Output Format

The pipeline will extract the entities based on the Nemotron-PII label mappings, yielding output like:

- Entity: john
  Label:  first_name
  Score:  0.9984

- Entity: doe
  Label:  last_name
  Score:  0.9971

- Entity: 123456789
  Label:  bank_routing_number
  Score:  0.9688

- Entity: 45
  Label:  age
  Score:  0.9117

- Entity: john. doe @ example. com
  Label:  email
  Score:  0.9993
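Note that the spaces inside the email entity ("john. doe @ example. com") are an artifact of WordPiece detokenization; for exact spans, use the `start`/`end` character offsets each pipeline entity also carries. As an illustration (this helper is not part of the model, just a sketch), those offsets can drive a simple redaction pass:

```python
# Sketch: redact detected PII using the start/end character offsets
# returned by the token-classification pipeline.
def redact(text, entities):
    # Replace spans from right to left so earlier offsets stay valid.
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:e["start"]] + f"[{e['entity_group'].upper()}]" + text[e["end"]:]
    return text

# Example entities shaped like real pipeline output (offsets match this string)
text = "John Doe's email is john.doe@example.com."
entities = [
    {"entity_group": "first_name", "start": 0, "end": 4},
    {"entity_group": "last_name", "start": 5, "end": 8},
    {"entity_group": "email", "start": 20, "end": 40},
]
print(redact(text, entities))
# -> [FIRST_NAME] [LAST_NAME]'s email is [EMAIL].
```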

πŸ“¬ Contact

Please reach out if you have questions or feedback. We also take on custom projects, consulting, freelance work, and collaborations.

Email: thieves@negativestarinnovators.com

πŸ’– Support This Project

If you find this PII detector useful for your projects or business, please support our work!

Buy Me A Coffee

Model size: 22.6M parameters (Safetensors, F32)