HikmaAI DistilBERT PII NER

A fine-tuned DistilBERT model for Named Entity Recognition (NER) of Personally Identifiable Information (PII) in multilingual text. It was trained on 6 languages (English, Italian, French, German, Spanish, Dutch) and detects 17 entity types across four categories: personal, contact, financial, and identity.

Model Details

| Property | Value |
| --- | --- |
| Base model | distilbert-base-uncased |
| Task | Token classification (BIO tagging) |
| Entity types | 17 (35 labels: B- and I- for each + O) |
| Parameters | ~66M |
| ONNX INT8 size | 129 MB |
| Max sequence length | 512 tokens |
| Input tensors | input_ids, attention_mask |
| Output | logits, shape [batch, seq_len, 35] |
| License | Apache 2.0 |

Entity Types

| Category | Entity | Label | Examples |
| --- | --- | --- | --- |
| Personal | Given name | GIVENNAME | John, Maria, Ahmed |
| Personal | Surname | SURNAME | Smith, Garcia, Chen |
| Personal | Date of birth | DATEOFBIRTH | 1990-01-15, 15/03/1985 |
| Personal | Username | USERNAME | john_doe, user123 |
| Personal | Password | PASSWORD | P@ssw0rd!, mySecret |
| Contact | Email | EMAIL | john@example.com |
| Contact | Phone | TELEPHONENUM | +1-555-0123, 06 12345678 |
| Contact | Street | STREET | 123 Main St, Via Roma 42 |
| Contact | City | CITY | New York, London, Milano |
| Contact | ZIP code | ZIPCODE | 10001, SW1A 1AA |
| Contact | Building number | BUILDINGNUM | Apt 4B, Suite 200 |
| Financial | Credit card | CREDITCARDNUMBER | 4111-1111-1111-1111 |
| Financial | Account number | ACCOUNTNUM | GB29 NWBK 6016 1331 9268 19 |
| Financial | Tax number | TAXNUM | SSN, TIN, codice fiscale |
| Identity | Social security | SOCIALNUM | 123-45-6789 |
| Identity | Driver license | DRIVERLICENSENUM | D1234567 |
| Identity | ID card | IDCARDNUM | National ID numbers |
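
The label count in Model Details follows directly from this list: one B- and one I- tag per entity type, plus O. The short sketch below reproduces that arithmetic. The helper name `build_bio_labels` and the ordering of the resulting list are illustrative assumptions; the authoritative index-to-label mapping is the `id2label` entry in the model's config.json.

```python
# Entity labels copied from the Entity Types table.
ENTITY_TYPES = [
    "GIVENNAME", "SURNAME", "DATEOFBIRTH", "USERNAME", "PASSWORD",
    "EMAIL", "TELEPHONENUM", "STREET", "CITY", "ZIPCODE", "BUILDINGNUM",
    "CREDITCARDNUMBER", "ACCOUNTNUM", "TAXNUM",
    "SOCIALNUM", "DRIVERLICENSENUM", "IDCARDNUM",
]


def build_bio_labels(entity_types):
    """Return "O" plus a B- and an I- tag for every entity type."""
    labels = ["O"]
    for name in entity_types:
        labels.append(f"B-{name}")
        labels.append(f"I-{name}")
    return labels


# ASSUMPTION: this ordering is illustrative only. The real mapping from
# logit index to label is the id2label entry in config.json.
LABELS = build_bio_labels(ENTITY_TYPES)
print(len(ENTITY_TYPES), len(LABELS))  # prints: 17 35
```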

Usage

Transformers

from transformers import pipeline

ner = pipeline(
    "ner",
    model="hikmaai-io/hikmaai-distilbert-pii",
    aggregation_strategy="simple"
)

results = ner("Contact John Smith at john@example.com or call +1-555-0123")
for entity in results:
    print(f"{entity['word']:20s} -> {entity['entity_group']:20s} ({entity['score']:.2f})")

ONNX Runtime

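import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

REPO_ID = "hikmaai-io/hikmaai-distilbert-pii"

tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
# Fetch the INT8 model from the Hub; a local path to onnx/model_quantized.onnx also works
model_path = hf_hub_download(REPO_ID, "onnx/model_quantized.onnx")
session = ort.InferenceSession(model_path)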

inputs = tokenizer("Contact John Smith at john@example.com", return_tensors="np")
outputs = session.run(None, {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
})
# outputs[0] shape: [1, seq_len, 35] (logits for 35 BIO labels)
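
The logits still need post-processing before they are useful: an argmax per token, a lookup of the BIO label, and a merge of B-/I- runs into spans. A minimal pure-Python sketch of those steps follows. The helper names `argmax` and `decode_spans` and the toy `id2label` mapping are illustrative assumptions, not part of the model's API; the real mapping is the `id2label` entry in config.json.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])


def decode_spans(token_logits, id2label):
    """Turn per-token logits into (entity_type, start_token, end_token) spans.

    end_token is exclusive. A span starts at a B- tag and continues over
    the following I- tags of the same entity type.
    """
    spans = []
    current = None  # [entity_type, start, end] of the span being built
    for idx, logits in enumerate(token_logits):
        label = id2label[argmax(logits)]
        prefix, _, entity = label.partition("-")
        if prefix == "I" and current is not None and current[0] == entity:
            current[2] = idx + 1  # extend the open span
            continue
        if current is not None:
            spans.append(tuple(current))
            current = None
        if prefix in ("B", "I"):
            # Lenient choice (an assumption): a stray I- tag opens a new span.
            current = [entity, idx, idx + 1]
    if current is not None:
        spans.append(tuple(current))
    return spans


# Toy example with a hypothetical 4-label mapping (the real model has 35).
id2label = {0: "O", 1: "B-GIVENNAME", 2: "I-GIVENNAME", 3: "B-EMAIL"}
toy_logits = [
    [9.0, 0.1, 0.1, 0.1],  # O
    [0.1, 9.0, 0.1, 0.1],  # B-GIVENNAME
    [0.1, 0.1, 9.0, 0.1],  # I-GIVENNAME
    [9.0, 0.1, 0.1, 0.1],  # O
    [0.1, 0.1, 0.1, 9.0],  # B-EMAIL
]
spans = decode_spans(toy_logits, id2label)
print(spans)  # [('GIVENNAME', 1, 3), ('EMAIL', 4, 5)]
```

With the session above, a call along the lines of `decode_spans(outputs[0][0].tolist(), id2label)` should work once `id2label` is loaded from the model configuration. The returned positions are token indices, so mapping them back to characters needs the tokenizer's offset mapping, which this sketch leaves out.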

ONNX Runtime (Go)

The model is designed for high-performance inference in Go applications via yalue/onnxruntime_go:

// 2 input tensors: input_ids, attention_mask
// Output: logits [1, seq_len, 35]
// Post-processing: argmax per token -> BIO label -> entity spans

Training

  • Base model: distilbert-base-uncased
  • Dataset: ai4privacy/pii-masking-400k (325K annotated examples, 6 languages: EN 21%, IT 20%, FR 20%, DE 20%, ES 10%, NL 10%)
  • Checkpoints: 15,000 steps
  • Quantization: INT8 dynamic quantization via ONNX Runtime (FP32 also available)
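
For readers who want to reproduce the quantization step, the sketch below shows how INT8 dynamic quantization is typically invoked through ONNX Runtime's `quantize_dynamic`. The exact settings used for this model are not stated here, so the `weight_type` choice, the helper names `quantized_path` and `quantize_to_int8`, and the output naming scheme are assumptions.

```python
from pathlib import Path


def quantized_path(fp32_path):
    """Sibling path for the INT8 file: model.onnx -> model_quantized.onnx."""
    p = Path(fp32_path)
    return p.with_name(f"{p.stem}_quantized{p.suffix}")


def quantize_to_int8(fp32_path):
    """Apply ONNX Runtime dynamic quantization with INT8 weights."""
    # Imported lazily so this module loads even without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    out_path = quantized_path(fp32_path)
    # ASSUMPTION: QInt8 weights; the original export settings are not documented.
    quantize_dynamic(str(fp32_path), str(out_path), weight_type=QuantType.QInt8)
    return out_path
```

Calling `quantize_to_int8("onnx/model.onnx")` would write `onnx/model_quantized.onnx` next to the FP32 file, matching the names in File Structure.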

File Structure

├── config.json              # Model configuration
├── model.safetensors        # FP32 weights (PyTorch/safetensors)
├── tokenizer.json           # Fast tokenizer
├── tokenizer_config.json    # Tokenizer configuration
├── onnx/
│   ├── model.onnx           # FP32 ONNX
│   └── model_quantized.onnx # INT8 ONNX (recommended for inference)

Privacy Design

This model is designed for privacy-preserving PII detection in AI security gateways:

  • Detection only: the model identifies entity spans and types. It does not store, transmit, or log the actual PII values
  • Counts, not content: downstream systems receive entity type counts (e.g., {"EMAIL": 2, "TELEPHONENUM": 1}), never the detected text
  • Local inference: runs entirely on-device via ONNX Runtime, no external API calls
  • Stateless: no request data is retained between inferences
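
The "counts, not content" rule can be sketched as a small reduction step between the detector and anything downstream. The function name `entity_counts` and the sample detections are hypothetical; the input is assumed to look like the Transformers pipeline output with `aggregation_strategy="simple"`.

```python
from collections import Counter


def entity_counts(entities):
    """Count detected entities per type and drop the matched text."""
    return dict(Counter(e["entity_group"] for e in entities))


# Illustrative detections shaped like the pipeline output; values are made up.
detected = [
    {"entity_group": "GIVENNAME", "word": "john", "score": 0.99},
    {"entity_group": "EMAIL", "word": "john@example.com", "score": 0.98},
    {"entity_group": "EMAIL", "word": "j.smith@example.org", "score": 0.97},
]
summary = entity_counts(detected)
print(summary)  # {'GIVENNAME': 1, 'EMAIL': 2}
```

Only `summary` would leave the process in this design; the `word` values stay local and are discarded with the request.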

Limitations

  • Trained on 6 languages (EN, IT, FR, DE, ES, NL); performance on other languages is untested
  • Maximum sequence length: 512 tokens (longer texts require chunking)
  • Regex-based PII detection (emails, phone numbers, credit cards) may outperform NER for well-structured patterns; this model adds value for unstructured PII (names, addresses, free-text credentials)
  • INT8 quantization may slightly reduce accuracy on edge cases compared to FP32
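
The chunking mentioned above can be done with overlapping windows, so that an entity cut at one boundary is seen whole in the neighbouring window. The sketch below works on plain token id lists. The function name `chunk_token_ids`, the overlap of 64 tokens, and the default special-token ids (101 and 102, the [CLS] and [SEP] ids in the BERT uncased vocabulary) are assumptions; in practice the ids should be read from the tokenizer.

```python
def chunk_token_ids(token_ids, max_len=512, stride=64, cls_id=101, sep_id=102):
    """Split token ids into overlapping windows of at most max_len tokens.

    token_ids must not contain special tokens. Each window is wrapped in
    [CLS] ... [SEP]. stride is the number of tokens shared by consecutive
    windows. Returns a list of (start_offset, window_ids) pairs.
    """
    body = max_len - 2  # room left after adding [CLS] and [SEP]
    if not 0 <= stride < body:
        raise ValueError("stride must be in [0, max_len - 2)")
    step = body - stride
    chunks = []
    start = 0
    while True:
        window = token_ids[start:start + body]
        chunks.append((start, [cls_id] + window + [sep_id]))
        if start + body >= len(token_ids):
            break
        start += step
    return chunks


# 1,200 made-up token ids stand in for a long tokenized document.
ids = list(range(1000, 2200))
chunks = chunk_token_ids(ids)
print([(start, len(window)) for start, window in chunks])
```

The start offsets make it possible to map per-window predictions back to positions in the full sequence and to deduplicate entities found in the overlap. Hugging Face fast tokenizers offer a comparable built-in mechanism through `return_overflowing_tokens=True` together with `stride`.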

Citation

@misc{hikmaai-distilbert-pii,
  author = {HikmaAI},
  title = {HikmaAI DistilBERT PII NER: Fine-tuned model for PII detection in AI security gateways},
  year = {2026},
  url = {https://huggingface.co/hikmaai-io/hikmaai-distilbert-pii}
}