# Kiji PII Detection Model (ONNX Quantized)

INT8-quantized ONNX version of the Kiji PII detection model for efficient CPU inference. Detects Personally Identifiable Information (PII) in text with coreference resolution.
## Source Model

This is a quantized version of DataikuNLP/kiji-pii-model, a multi-task DistilBERT model fine-tuned for PII detection with coreference resolution.
## Model Summary

| Property | Value |
|---|---|
| Format | ONNX (INT8 quantized) |
| Architecture | Shared DistilBERT encoder + two classification heads |
| Tasks | PII token classification (53 labels) + coreference detection (7 labels) |
| PII entity types | 26 |
| Max sequence length | 512 tokens |
| Runtime | ONNX Runtime |
## Files

| File | Size |
|---|---|
| `model_quantized.onnx` | 63.3 MB |
| `model.onnx.data` | 248.9 MB |
| `ort_config.json` | 0.7 KB |
| `label_mappings.json` | 2.9 KB |
| `model_manifest.json` | 1.6 KB |
| `tokenizer_config.json` | 1.3 KB |
| `tokenizer.json` | 653.2 KB |
| `vocab.txt` | 208.4 KB |
| `special_tokens_map.json` | 0.7 KB |
## Quantization Details

| Property | Value |
|---|---|
| Method | Dynamic quantization (ONNX Runtime / Optimum) |
| Weights | QInt8 (symmetric, per-channel) |
| Activations | QUInt8 (asymmetric, per-tensor) |
| Mode | IntegerOps |
| Format | QOperator |
| Operators quantized | Conv, MatMul, Attention, LSTM, Gather, Transpose, EmbedLayerNormalization |
## Usage

```python
import numpy as np
from huggingface_hub import hf_hub_download
from onnxruntime import InferenceSession
from transformers import AutoTokenizer

# Load tokenizer and model (hf_hub_download fetches the file from the Hub;
# pass a local file path to InferenceSession instead if you already have one)
tokenizer = AutoTokenizer.from_pretrained("DataikuNLP/kiji-pii-model-onnx")
model_path = hf_hub_download("DataikuNLP/kiji-pii-model-onnx", "model_quantized.onnx")
session = InferenceSession(model_path)

# Tokenize
text = "Contact John Smith at john.smith@example.com or call +1-555-123-4567."
inputs = tokenizer(text, return_tensors="np", truncation=True, max_length=512)

# Run inference
outputs = session.run(None, dict(inputs))
pii_logits, coref_logits = outputs  # (1, seq_len, 53), (1, seq_len, 7)

# Decode PII predictions
pii_predictions = np.argmax(pii_logits, axis=-1)[0]
# See label_mappings.json for the label ID -> label name mapping
```
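The per-token predictions can then be grouped into entity spans by walking the BIO tags. A minimal sketch (the `bio_to_spans` helper and the example label sequence are illustrative, not part of this repo):

```python
def bio_to_spans(labels):
    """Group a per-token BIO label sequence into (entity_type, start, end) spans.

    `end` is exclusive. Assumes labels look like "O", "B-EMAIL", "I-EMAIL".
    """
    spans = []
    current, start = None, None
    for i, label in enumerate(labels):
        if label.startswith("B-"):
            if current is not None:
                spans.append((current, start, i))
            current, start = label[2:], i
        elif label.startswith("I-") and current == label[2:]:
            continue  # entity continues
        else:
            if current is not None:
                spans.append((current, start, i))
            current, start = None, None
    if current is not None:
        spans.append((current, start, len(labels)))
    return spans

# Example: token-level labels for "Contact John Smith at john.smith@example.com"
labels = ["O", "B-FIRSTNAME", "B-SURNAME", "O", "B-EMAIL", "I-EMAIL", "I-EMAIL"]
print(bio_to_spans(labels))  # [('FIRSTNAME', 1, 2), ('SURNAME', 2, 3), ('EMAIL', 4, 7)]
```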
## PII Labels (BIO tagging)

The model uses BIO tagging with 26 entity types:
| Label | Description |
|---|---|
| AGE | Age |
| BUILDINGNUM | Building number |
| CITY | City |
| COMPANYNAME | Company name |
| COUNTRY | Country |
| CREDITCARDNUMBER | Credit card number |
| DATEOFBIRTH | Date of birth |
| DRIVERLICENSENUM | Driver's license number |
| EMAIL | Email address |
| FIRSTNAME | First name |
| IBAN | IBAN |
| IDCARDNUM | ID card number |
| LICENSEPLATENUM | License plate number |
| NATIONALID | National ID |
| PASSPORTID | Passport ID |
| PASSWORD | Password |
| PHONENUMBER | Phone number |
| SECURITYTOKEN | API security token |
| SSN | Social Security Number |
| STATE | State |
| STREET | Street |
| SURNAME | Last name |
| TAXNUM | Tax number |
| URL | URL |
| USERNAME | Username |
| ZIP | Zip code |
Each entity type has B- (beginning) and I- (inside) variants, plus O for non-PII tokens.
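The 53-label figure in the model summary follows directly from this scheme: a B- and an I- variant for each of the 26 entity types, plus the single O tag. A quick sanity check:

```python
entity_types = 26
# Two positional variants (B-, I-) per entity type, plus the "O" (non-PII) tag
num_labels = entity_types * 2 + 1
print(num_labels)  # 53
```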
## Coreference Labels

| Label | Description |
|---|---|
| NO_COREF | Token is not part of a coreference cluster |
| CLUSTER_0 – CLUSTER_3 | Token belongs to coreference cluster 0–3 |
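Decoded coreference labels can be turned into clusters by collecting the token positions that share a cluster label. A minimal sketch, assuming label names as above (check `label_mappings.json` for the exact ID-to-name mapping); the example label sequence is illustrative:

```python
from collections import defaultdict

def group_coref_clusters(labels):
    """Collect token indices per coreference cluster, skipping NO_COREF tokens."""
    clusters = defaultdict(list)
    for i, label in enumerate(labels):
        if label.startswith("CLUSTER_"):
            clusters[label].append(i)
    return dict(clusters)

# Illustrative per-token labels: tokens 0, 3, and 5 refer to the same entity
labels = ["CLUSTER_0", "NO_COREF", "NO_COREF", "CLUSTER_0", "NO_COREF", "CLUSTER_0"]
print(group_coref_clusters(labels))  # {'CLUSTER_0': [0, 3, 5]}
```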
## Training Data

The source model was trained on the DataikuNLP/kiji-pii-training-data dataset, a synthetic multilingual PII dataset with entity annotations and coreference resolution.
## Lineage
| Stage | Repository |
|---|---|
| Dataset | DataikuNLP/kiji-pii-training-data |
| Trained model | DataikuNLP/kiji-pii-model |
| Quantized model | DataikuNLP/kiji-pii-model-onnx (this repo) |
## Limitations

- Trained on synthetically generated data, so it may not generalize to all real-world text
- Coreference head supports up to 4 clusters per sequence
- Optimized for the 6 languages in the training data (English, German, French, Spanish, Dutch, Danish)
- Max sequence length is 512 tokens
- Quantization may slightly reduce accuracy compared to the full-precision model
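For text longer than the 512-token limit, a common workaround (not provided by this repo) is to split the token sequence into overlapping windows, run each window through the model, and reconcile the overlapping predictions. A sketch of the windowing step:

```python
def chunk_with_overlap(token_ids, max_len=512, stride=128):
    """Split a long token-id sequence into overlapping windows.

    Each window holds at most `max_len` tokens; consecutive windows overlap
    by `stride` tokens so an entity near a boundary appears whole in at
    least one window. Predictions in overlaps must be merged downstream.
    """
    if len(token_ids) <= max_len:
        return [token_ids]
    step = max_len - stride
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids) - stride, step)]

windows = chunk_with_overlap(list(range(1000)), max_len=512, stride=128)
print(len(windows), [len(w) for w in windows])  # 3 [512, 512, 232]
```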
## Model Tree for DataikuNLP/kiji-pii-model-onnx

- Base model: distilbert/distilbert-base-cased
- Fine-tuned: DataikuNLP/kiji-pii-model