# Kiji PII Detection Model

Multi-task DistilBERT model for detecting Personally Identifiable Information (PII) in text, with coreference resolution. Fine-tuned from `distilbert-base-cased`.
## Model Summary

| Property | Value |
|---|---|
| Base model | `distilbert-base-cased` |
| Architecture | Shared DistilBERT encoder + two linear classification heads |
| Parameters | ~66M |
| Model size | 249 MB (SafeTensors) |
| Tasks | PII token classification (53 labels) + coreference detection (7 labels) |
| PII entity types | 26 |
| Max sequence length | 512 tokens |
## Architecture

```
Input (input_ids, attention_mask)
               |
DistilBERT Encoder (shared, hidden_size=768)
               |
          +----+----+
          |         |
      PII Head  Coref Head
     (768->53)   (768->7)
```
The model uses multi-task learning: a shared DistilBERT encoder feeds into two independent linear classification heads. Both tasks are trained simultaneously with equal loss weighting, which acts as regularization and improves PII detection generalization.
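The two-head layout above can be sketched in a few lines of PyTorch. This is an illustrative stand-in, not the actual `MultiTaskPIIDetectionModel` implementation: the real model wraps a pretrained DistilBERT encoder, which is replaced here by random hidden states so the sketch is self-contained.

```python
import torch
import torch.nn as nn

class TwoHeadSketch(nn.Module):
    """Illustrative sketch: two independent linear heads over shared
    encoder hidden states. The real model feeds DistilBERT output
    (hidden_size=768) into both heads."""

    def __init__(self, hidden_size=768, num_pii_labels=53, num_coref_labels=7):
        super().__init__()
        self.pii_head = nn.Linear(hidden_size, num_pii_labels)      # 768 -> 53
        self.coref_head = nn.Linear(hidden_size, num_coref_labels)  # 768 -> 7

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) from the shared encoder
        return self.pii_head(hidden_states), self.coref_head(hidden_states)

heads = TwoHeadSketch()
hidden = torch.randn(1, 16, 768)  # stand-in for DistilBERT encoder output
pii_logits, coref_logits = heads(hidden)
print(tuple(pii_logits.shape), tuple(coref_logits.shape))  # (1, 16, 53) (1, 16, 7)
```

Because both heads read the same hidden states, each token gets a PII prediction and a coreference prediction from a single encoder pass.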
## Usage

```python
import torch
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("DataikuNLP/kiji-pii-model")

# The model uses a custom MultiTaskPIIDetectionModel architecture,
# so the weights are loaded manually:
from safetensors.torch import load_file
weights = load_file("DataikuNLP/kiji-pii-model/model.safetensors")  # path to a local clone of the repo

# Tokenize
text = "Contact John Smith at john.smith@example.com or call +1-555-123-4567."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# See the label_mappings.json file for PII label definitions
```
## PII Labels (BIO tagging)

The model uses BIO tagging with 26 entity types:
| Label | Description |
|---|---|
| AGE | Age |
| BUILDINGNUM | Building number |
| CITY | City |
| COMPANYNAME | Company name |
| COUNTRY | Country |
| CREDITCARDNUMBER | Credit card number |
| DATEOFBIRTH | Date of birth |
| DRIVERLICENSENUM | Driver's license number |
| EMAIL | Email address |
| FIRSTNAME | First name |
| IBAN | IBAN |
| IDCARDNUM | ID card number |
| LICENSEPLATENUM | License plate number |
| NATIONALID | National ID |
| PASSPORTID | Passport ID |
| PASSWORD | Password |
| PHONENUMBER | Phone number |
| SECURITYTOKEN | API security token |
| SSN | Social Security Number |
| STATE | State |
| STREET | Street |
| SURNAME | Last name |
| TAXNUM | Tax number |
| URL | URL |
| USERNAME | Username |
| ZIP | Zip code |
Each entity type has B- (beginning) and I- (inside) variants, plus O for non-PII tokens.
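To make the BIO scheme concrete, here is a small dependency-free sketch of decoding a per-token tag sequence into entity spans. The helper name and tag sequence are illustrative, not part of the model's API:

```python
def bio_to_spans(tags):
    """Collapse a BIO tag sequence into (entity_type, start, end) spans.

    `end` is exclusive. Illustrative only; not shipped with the model.
    """
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:          # close any open span
                spans.append((etype, start, i))
            start, etype = i, tag[2:]      # open a new span
        elif tag.startswith("I-") and etype == tag[2:]:
            continue                       # extend the current span
        else:                              # "O" or a mismatched I- tag
            if start is not None:
                spans.append((etype, start, i))
            start, etype = None, None
    if start is not None:
        spans.append((etype, start, len(tags)))
    return spans

tags = ["O", "B-FIRSTNAME", "B-SURNAME", "O", "B-EMAIL", "I-EMAIL", "O"]
print(bio_to_spans(tags))
# [('FIRSTNAME', 1, 2), ('SURNAME', 2, 3), ('EMAIL', 4, 6)]
```

Note how back-to-back `B-` tags start separate entities, while `I-` tags only extend a span of the same type.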
## Coreference Labels

| Label | Description |
|---|---|
| NO_COREF | Token is not part of a coreference cluster |
| CLUSTER_0-CLUSTER_3 | Token belongs to coreference cluster 0-3 |
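Per-token cluster labels can be turned into mention groups with a trivial pass over the sequence. A minimal sketch (the helper name and example tokens are hypothetical; real predictions come from the coreference head):

```python
def group_coref_clusters(tokens, coref_tags):
    """Group (index, token) pairs by cluster label, skipping NO_COREF tokens."""
    clusters = {}
    for i, tag in enumerate(coref_tags):
        if tag != "NO_COREF":
            clusters.setdefault(tag, []).append((i, tokens[i]))
    return clusters

tokens = ["John", "called", ";", "he", "was", "late"]
tags = ["CLUSTER_0", "NO_COREF", "NO_COREF", "CLUSTER_0", "NO_COREF", "NO_COREF"]
print(group_coref_clusters(tokens, tags))
# {'CLUSTER_0': [(0, 'John'), (3, 'he')]}
```

In this sketch, "John" and "he" end up in the same cluster, which is the kind of link that lets a downstream redaction step treat a pronoun as PII-adjacent.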
## Training

| Setting | Value |
|---|---|
| Epochs | 15 (with early stopping) |
| Batch size | 16 |
| Learning rate | 3e-5 |
| Weight decay | 0.01 |
| Warmup steps | 200 |
| Early stopping | patience=3, threshold=1% |
| Loss | Multi-task: PII cross-entropy + coreference cross-entropy (equal weights) |
| Optimizer | AdamW |
| Metric | Weighted F1 (PII task) |
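The equal-weight multi-task loss in the table amounts to summing two token-level cross-entropies. A minimal sketch with random logits and labels (tensor shapes assumed from the architecture above; padding/ignore-index handling omitted):

```python
import torch
import torch.nn.functional as F

batch, seq_len = 2, 8
pii_logits = torch.randn(batch, seq_len, 53)    # from the PII head
coref_logits = torch.randn(batch, seq_len, 7)   # from the coref head
pii_labels = torch.randint(0, 53, (batch, seq_len))
coref_labels = torch.randint(0, 7, (batch, seq_len))

# cross_entropy expects (N, C) logits, so flatten batch and sequence dims
pii_loss = F.cross_entropy(pii_logits.view(-1, 53), pii_labels.view(-1))
coref_loss = F.cross_entropy(coref_logits.view(-1, 7), coref_labels.view(-1))

# Equal weighting, per the training recipe above
loss = pii_loss + coref_loss
print(float(loss))
```

With equal weights, neither task dominates the gradient, which is what lets the coreference objective act as a regularizer for PII detection.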
## Training Data

Trained on the `DataikuNLP/kiji-pii-training-data` dataset, a synthetic multilingual PII dataset with entity annotations and coreference resolution.
## Derived Models
| Variant | Format | Repository |
|---|---|---|
| Quantized (INT8) | ONNX | DataikuNLP/kiji-pii-model-onnx |
## Limitations

- Trained on synthetically generated data, so it may not generalize to all real-world text
- The coreference head supports at most 4 clusters per sequence
- Optimized for the 6 languages in the training data (English, German, French, Spanish, Dutch, Danish)
- Max sequence length is 512 tokens
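One common way to work around the 512-token cap is to run the model over overlapping windows of token ids and merge the predictions. A dependency-free sketch (the helper name and window/stride values are illustrative choices, not part of the model package):

```python
def sliding_windows(token_ids, max_len=512, stride=384):
    """Yield (start_offset, chunk) windows covering the whole sequence.

    Consecutive windows overlap by max_len - stride tokens, so tokens near
    a chunk boundary also appear away from the edge of a neighboring
    window. Illustrative only; merging overlapping predictions is up to
    the caller.
    """
    if len(token_ids) <= max_len:
        yield 0, token_ids
        return
    start = 0
    while start < len(token_ids):
        yield start, token_ids[start:start + max_len]
        if start + max_len >= len(token_ids):
            break
        start += stride

chunks = list(sliding_windows(list(range(1000)), max_len=512, stride=384))
print([(start, len(chunk)) for start, chunk in chunks])
# [(0, 512), (384, 512), (768, 232)]
```

The offsets let per-window token predictions be mapped back to positions in the full sequence before resolving disagreements in the overlap regions.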