Kiji PII Detection Model (ONNX Quantized)

INT8-quantized ONNX version of the Kiji PII detection model for efficient CPU inference. Detects Personally Identifiable Information (PII) in text with coreference resolution.

Source Model

This is a quantized version of DataikuNLP/kiji-pii-model, a multi-task DistilBERT model fine-tuned for PII detection with coreference resolution.

Model Summary

| Property | Value |
|---|---|
| Format | ONNX (INT8 quantized) |
| Architecture | Shared DistilBERT encoder + two classification heads |
| Tasks | PII token classification (53 labels) + coreference detection (7 labels) |
| PII entity types | 26 |
| Max sequence length | 512 tokens |
| Runtime | ONNX Runtime |

Files

| File | Size |
|---|---|
| model_quantized.onnx | 63.3 MB |
| model.onnx.data | 248.9 MB |
| ort_config.json | 0.7 KB |
| label_mappings.json | 2.9 KB |
| model_manifest.json | 1.6 KB |
| tokenizer_config.json | 1.3 KB |
| tokenizer.json | 653.2 KB |
| vocab.txt | 208.4 KB |
| special_tokens_map.json | 0.7 KB |

Quantization Details

| Setting | Value |
|---|---|
| Method | Dynamic quantization (ONNX Runtime / Optimum) |
| Weights | QInt8 (symmetric, per-channel) |
| Activations | QUInt8 (asymmetric, per-tensor) |
| Mode | IntegerOps |
| Format | QOperator |
| Operators quantized | Conv, MatMul, Attention, LSTM, Gather, Transpose, EmbedLayerNormalization |

Usage

```python
import numpy as np
from huggingface_hub import hf_hub_download
from onnxruntime import InferenceSession
from transformers import AutoTokenizer

# Load tokenizer and model (InferenceSession needs a local file path,
# so fetch the ONNX file from the Hub first, or point at a local copy)
tokenizer = AutoTokenizer.from_pretrained("DataikuNLP/kiji-pii-model-onnx")
model_path = hf_hub_download("DataikuNLP/kiji-pii-model-onnx", "model_quantized.onnx")
session = InferenceSession(model_path)

# Tokenize
text = "Contact John Smith at john.smith@example.com or call +1-555-123-4567."
inputs = tokenizer(text, return_tensors="np", truncation=True, max_length=512)

# Run inference
outputs = session.run(None, dict(inputs))
pii_logits, coref_logits = outputs  # (1, seq_len, 53), (1, seq_len, 7)

# Decode PII predictions per token
pii_predictions = np.argmax(pii_logits, axis=-1)[0]

# See label_mappings.json for the label ID -> label name mapping
```
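To turn per-token predictions into entity spans, the BIO tags can be grouped into contiguous runs. A minimal pure-Python sketch (the tag strings below are illustrative; the canonical ID-to-label mapping lives in label_mappings.json):

```python
def bio_to_spans(labels):
    """Group a sequence of BIO tags into (entity_type, start, end) spans.

    `labels` is a list of string tags such as "B-EMAIL", "I-EMAIL", or "O";
    `end` is exclusive, in token positions.
    """
    spans, current = [], None
    for i, tag in enumerate(labels):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = [tag[2:], i, i + 1]  # start a new entity
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[2] = i + 1  # extend the open entity
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [tuple(s) for s in spans]

tags = ["O", "B-FIRSTNAME", "B-SURNAME", "O", "B-EMAIL", "I-EMAIL", "O"]
print(bio_to_spans(tags))
# -> [('FIRSTNAME', 1, 2), ('SURNAME', 2, 3), ('EMAIL', 4, 6)]
```

Token positions can then be mapped back to character offsets via the tokenizer's offset mapping if needed.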

PII Labels (BIO tagging)

The model uses BIO tagging with 26 entity types:

| Label | Description |
|---|---|
| AGE | Age |
| BUILDINGNUM | Building number |
| CITY | City |
| COMPANYNAME | Company name |
| COUNTRY | Country |
| CREDITCARDNUMBER | Credit card number |
| DATEOFBIRTH | Date of birth |
| DRIVERLICENSENUM | Driver's license number |
| EMAIL | Email |
| FIRSTNAME | First name |
| IBAN | IBAN |
| IDCARDNUM | ID card number |
| LICENSEPLATENUM | License plate number |
| NATIONALID | National ID |
| PASSPORTID | Passport ID |
| PASSWORD | Password |
| PHONENUMBER | Phone number |
| SECURITYTOKEN | API security token |
| SSN | Social Security Number |
| STATE | State |
| STREET | Street |
| SURNAME | Last name |
| TAXNUM | Tax number |
| URL | URL |
| USERNAME | Username |
| ZIP | Zip code |

Each entity type has B- (beginning) and I- (inside) variants, plus O for non-PII tokens.
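The 53-label space therefore decomposes as 26 entity types × 2 BIO variants plus O. A quick sketch reconstructing the full tag list (the canonical label ordering is defined in label_mappings.json, not here):

```python
ENTITY_TYPES = [
    "AGE", "BUILDINGNUM", "CITY", "COMPANYNAME", "COUNTRY",
    "CREDITCARDNUMBER", "DATEOFBIRTH", "DRIVERLICENSENUM", "EMAIL",
    "FIRSTNAME", "IBAN", "IDCARDNUM", "LICENSEPLATENUM", "NATIONALID",
    "PASSPORTID", "PASSWORD", "PHONENUMBER", "SECURITYTOKEN", "SSN",
    "STATE", "STREET", "SURNAME", "TAXNUM", "URL", "USERNAME", "ZIP",
]

# O plus a B-/I- pair per entity type: 1 + 2 * 26 = 53 labels
pii_labels = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]
print(len(pii_labels))  # -> 53
```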

Coreference Labels

| Label | Description |
|---|---|
| NO_COREF | Token is not part of a coreference cluster |
| CLUSTER_0 - CLUSTER_3 | Token belongs to coreference cluster 0-3 |
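The coreference head's per-token predictions can be grouped into mention clusters by collecting token positions per cluster label. A minimal sketch, assuming the logits have already been argmaxed and mapped to label strings:

```python
from collections import defaultdict

def group_clusters(coref_tags):
    """Map cluster labels such as "CLUSTER_0" to the token positions they cover.

    Tokens tagged NO_COREF are skipped; returns {cluster_label: [positions]}.
    """
    clusters = defaultdict(list)
    for i, tag in enumerate(coref_tags):
        if tag != "NO_COREF":
            clusters[tag].append(i)
    return dict(clusters)

tags = ["CLUSTER_0", "NO_COREF", "CLUSTER_0", "CLUSTER_1", "NO_COREF"]
print(group_clusters(tags))  # -> {'CLUSTER_0': [0, 2], 'CLUSTER_1': [3]}
```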

Training Data

The source model was trained on the DataikuNLP/kiji-pii-training-data dataset, a synthetic multilingual PII dataset with entity annotations and coreference resolution.

Lineage

| Stage | Repository |
|---|---|
| Dataset | DataikuNLP/kiji-pii-training-data |
| Trained model | DataikuNLP/kiji-pii-model |
| Quantized model | DataikuNLP/kiji-pii-model-onnx (this repo) |

Limitations

  • Trained on synthetically generated data, so it may not generalize to all real-world text
  • Coreference head supports up to 4 clusters per sequence
  • Optimized for the 6 languages in the training data (English, German, French, Spanish, Dutch, Danish)
  • Max sequence length is 512 tokens
  • Quantization may slightly reduce accuracy compared to the full-precision model
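For inputs longer than the 512-token limit, one common workaround (not part of this repo) is to run the model over overlapping windows and merge the per-token predictions. A sketch of the windowing arithmetic on a list of token IDs; the stride value is illustrative:

```python
def sliding_windows(token_ids, max_len=512, stride=128):
    """Split token_ids into overlapping windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens so that entities near a
    window boundary appear whole in at least one window. Returns a list of
    (start_position, chunk) pairs; predictions for overlapping regions must
    be reconciled by the caller (e.g. keep the window where the token is
    farther from the edge).
    """
    if len(token_ids) <= max_len:
        return [(0, list(token_ids))]
    windows, start = [], 0
    step = max_len - stride
    while start < len(token_ids):
        windows.append((start, list(token_ids[start:start + max_len])))
        if start + max_len >= len(token_ids):
            break
        start += step
    return windows

ids = list(range(1000))
wins = sliding_windows(ids)
print([w[0] for w in wins])  # -> [0, 384, 768]
```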