Dataset: ai4privacy/pii-masking-400k
How to use HikmaAI/hikmaai-distilbert-pii with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("token-classification", model="HikmaAI/hikmaai-distilbert-pii")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("HikmaAI/hikmaai-distilbert-pii")
model = AutoModelForTokenClassification.from_pretrained("HikmaAI/hikmaai-distilbert-pii")
```

A fine-tuned DistilBERT model for Named Entity Recognition (NER) of Personally Identifiable Information (PII) in multilingual text. It was trained on 6 languages (English, Italian, French, German, Spanish, Dutch) and detects 17 entity types across personal, contact, financial, and identity categories.

| Property | Value |
|---|---|
| Base model | distilbert-base-uncased |
| Task | Token classification (BIO tagging) |
| Entity types | 17 (35 labels: B- and I- for each + O) |
| Parameters | ~66M |
| ONNX INT8 size | 129 MB |
| Max sequence length | 512 |
| Input tensors | input_ids, attention_mask |
| Output | logits shape [batch, seq_len, 35] |
| License | Apache 2.0 |

| Category | Entity | Label | Examples |
|---|---|---|---|
| Personal | Given name | GIVENNAME | John, Maria, Ahmed |
| Personal | Surname | SURNAME | Smith, Garcia, Chen |
| Personal | Date of birth | DATEOFBIRTH | 1990-01-15, 15/03/1985 |
| Personal | Username | USERNAME | john_doe, user123 |
| Personal | Password | PASSWORD | P@ssw0rd!, mySecret |
| Contact | Email | EMAIL | john@example.com |
| Contact | Phone | TELEPHONENUM | +1-555-0123, 06 12345678 |
| Contact | Street | STREET | 123 Main St, Via Roma 42 |
| Contact | City | CITY | New York, London, Milano |
| Contact | ZIP code | ZIPCODE | 10001, SW1A 1AA |
| Contact | Building number | BUILDINGNUM | Apt 4B, Suite 200 |
| Financial | Credit card | CREDITCARDNUMBER | 4111-1111-1111-1111 |
| Financial | Account number | ACCOUNTNUM | GB29 NWBK 6016 1331 9268 19 |
| Financial | Tax number | TAXNUM | SSN, TIN, codice fiscale |
| Identity | Social security | SOCIALNUM | 123-45-6789 |
| Identity | Driver license | DRIVERLICENSENUM | D1234567 |
| Identity | ID card | IDCARDNUM | National ID numbers |

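
The relationship between the 17 entity types and the 35 BIO labels can be checked with a short sketch. The helper `build_bio_labels` and the label ordering below are assumptions for illustration only; the authoritative index-to-label mapping is whatever the model's `config.json` defines. The `EMAIL` label is taken from the count example elsewhere in this card.

```python
# Entity labels as listed in the table. The order is illustrative, NOT the
# model's real label order; read id2label from config.json for that.
ENTITY_TYPES = [
    "GIVENNAME", "SURNAME", "DATEOFBIRTH", "USERNAME", "PASSWORD",
    "EMAIL", "TELEPHONENUM", "STREET", "CITY", "ZIPCODE", "BUILDINGNUM",
    "CREDITCARDNUMBER", "ACCOUNTNUM", "TAXNUM",
    "SOCIALNUM", "DRIVERLICENSENUM", "IDCARDNUM",
]


def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")  # first token of an entity
        labels.append(f"I-{entity}")  # continuation token of the same entity
    return labels


LABELS = build_bio_labels(ENTITY_TYPES)
print(len(ENTITY_TYPES), len(LABELS))  # 17 35
```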
```python
from transformers import pipeline

ner = pipeline(
    "ner",
    model="hikmaai-io/hikmaai-distilbert-pii",
    aggregation_strategy="simple",
)

results = ner("Contact John Smith at john@example.com or call +1-555-0123")
for entity in results:
    print(f"{entity['word']:20s} -> {entity['entity_group']:20s} ({entity['score']:.2f})")
```
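
When only aggregate information should leave the process, the aggregated pipeline output can be reduced to per-type counts so that the matched text is never retained. The following is a minimal sketch, not part of the model's own tooling: `count_entities` is a hypothetical helper, and `sample_results` is made-up data in the shape the pipeline returns with `aggregation_strategy="simple"`.

```python
from collections import Counter


def count_entities(results):
    """Count detected entities per type, discarding the matched text."""
    return dict(Counter(entity["entity_group"] for entity in results))


# Hypothetical pipeline output; real scores and words will differ.
sample_results = [
    {"entity_group": "GIVENNAME", "word": "john", "score": 0.99},
    {"entity_group": "SURNAME", "word": "smith", "score": 0.98},
    {"entity_group": "EMAIL", "word": "john@example.com", "score": 0.99},
    {"entity_group": "TELEPHONENUM", "word": "+1-555-0123", "score": 0.97},
]

counts = count_entities(sample_results)
print(counts)
```

The returned dictionary contains only label names and integers, so it can be logged without exposing the underlying values.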
```python
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hikmaai-io/hikmaai-distilbert-pii")
session = ort.InferenceSession("onnx/model_quantized.onnx")

inputs = tokenizer("Contact John Smith at john@example.com", return_tensors="np")
outputs = session.run(None, {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
})
# outputs[0] shape: [1, seq_len, 35] (logits for 35 BIO labels)
```
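
The raw logits still need to be turned into labels. A minimal sketch of that step follows, written in pure Python so it runs without the model: `decode_logits` and `softmax` are hypothetical helpers, and the three-label `id2label` mapping is a toy stand-in for the real 35-label mapping from the model configuration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one token's logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def decode_logits(token_logits, id2label):
    """Map logits of shape [seq_len, num_labels] to (label, confidence) pairs."""
    decoded = []
    for row in token_logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax
        decoded.append((id2label[best], probs[best]))
    return decoded


# Toy data: 3 tokens, 3 labels (the real model emits 35 labels per token).
id2label = {0: "O", 1: "B-EMAIL", 2: "I-EMAIL"}
toy_logits = [[4.0, 0.5, 0.1], [0.2, 3.5, 0.3], [0.1, 0.4, 2.9]]

decoded = decode_logits(toy_logits, id2label)
for label, confidence in decoded:
    print(label, round(confidence, 2))
```

With the ONNX session above, the same function could be applied to `outputs[0][0].tolist()`, assuming `id2label` is loaded with integer keys.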
The model is designed for high-performance inference in Go applications via yalue/onnxruntime_go:

```go
// 2 input tensors: input_ids, attention_mask
// Output: logits [1, seq_len, 35]
// Post-processing: argmax per token -> BIO label -> entity spans
```
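
The last post-processing step, grouping per-token BIO labels into entity spans, is independent of the runtime. It is sketched here in Python to match the other examples in this card; `bio_to_spans` is a hypothetical helper, not code shipped with the model. The sketch is lenient by assumption: an `I-` label that does not continue the current entity is treated as the start of a new span.

```python
def bio_to_spans(labels):
    """Group BIO labels into (entity_type, start_token, end_token_exclusive)."""
    spans = []
    current_type, start = None, None
    for i, label in enumerate(labels):
        prefix, _, entity = label.partition("-")
        # A span starts on B-, or on an I- that does not match the open span.
        starts_new = prefix == "B" or (prefix == "I" and entity != current_type)
        if current_type is not None and (label == "O" or starts_new):
            spans.append((current_type, start, i))  # close the open span
            current_type, start = None, None
        if label != "O" and starts_new:
            current_type, start = entity, i
    if current_type is not None:  # span running to the end of the sequence
        spans.append((current_type, start, len(labels)))
    return spans


token_labels = [
    "O", "B-GIVENNAME", "B-SURNAME", "O",
    "B-EMAIL", "I-EMAIL", "I-EMAIL", "O",
]
spans = bio_to_spans(token_labels)
print(spans)
# [('GIVENNAME', 1, 2), ('SURNAME', 2, 3), ('EMAIL', 4, 7)]
```

Token indices can then be mapped back to character offsets using the tokenizer's offset mapping, and special tokens such as `[CLS]` and `[SEP]` should be excluded before grouping.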
```
├── config.json                 # Model configuration
├── model.safetensors           # FP32 weights (PyTorch/safetensors)
├── tokenizer.json              # Fast tokenizer
├── tokenizer_config.json       # Tokenizer configuration
└── onnx/
    ├── model.onnx              # FP32 ONNX
    └── model_quantized.onnx    # INT8 ONNX (recommended for inference)
```
This model is designed for privacy-preserving PII detection in AI security gateways: only per-type entity counts are reported (e.g. `{"EMAIL": 2, "PHONE": 1}`), never the detected text.

```bibtex
@misc{hikmaai-distilbert-pii,
  author = {HikmaAI},
  title = {HikmaAI DistilBERT PII NER: Fine-tuned model for PII detection in AI security gateways},
  year = {2026},
  url = {https://huggingface.co/hikmaai-io/hikmaai-distilbert-pii}
}
```