Kiji PII Detection Model

Multi-task DistilBERT model that detects Personally Identifiable Information (PII) in text and performs coreference resolution. Fine-tuned from distilbert-base-cased.

Model Summary

Base model distilbert-base-cased
Architecture Shared DistilBERT encoder + two linear classification heads
Parameters ~66M
Model size 249 MB (SafeTensors)
Tasks PII token classification (53 labels) + coreference detection (7 labels)
PII entity types 26
Max sequence length 512 tokens

Architecture

Input (input_ids, attention_mask)
        |
  DistilBERT Encoder (shared, hidden_size=768)
        |
   +----+----+
   |         |
PII Head  Coref Head
(768->53)  (768->7)

The model uses multi-task learning: a shared DistilBERT encoder feeds into two independent linear classification heads. Both tasks are trained simultaneously with equal loss weighting, which acts as regularization and improves PII detection generalization.
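The layout above can be sketched in PyTorch. The class and attribute names below are hypothetical stand-ins (the actual implementation class is MultiTaskPIIDetectionModel), and a random tensor replaces the DistilBERT encoder output so the snippet runs standalone:

```python
import torch
import torch.nn as nn

class TwoHeadSketch(nn.Module):
    """Sketch of the shared-encoder, two-head layout described above.

    A real implementation wraps the DistilBERT encoder; here the encoder
    output is taken as a given tensor of shape (batch, seq_len, 768).
    """
    def __init__(self, hidden_size=768, num_pii_labels=53, num_coref_labels=7):
        super().__init__()
        self.pii_head = nn.Linear(hidden_size, num_pii_labels)      # 768 -> 53
        self.coref_head = nn.Linear(hidden_size, num_coref_labels)  # 768 -> 7

    def forward(self, hidden_states):
        # Both heads see the same shared representation
        return self.pii_head(hidden_states), self.coref_head(hidden_states)

heads = TwoHeadSketch()
hidden = torch.randn(1, 12, 768)  # stand-in for DistilBERT output
pii_logits, coref_logits = heads(hidden)
print(pii_logits.shape, coref_logits.shape)
```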

Usage

import torch
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("DataikuNLP/kiji-pii-model")

# The model uses a custom MultiTaskPIIDetectionModel architecture, so
# AutoModel cannot load it directly. Download and load the weights manually:
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

weights_path = hf_hub_download("DataikuNLP/kiji-pii-model", "model.safetensors")
weights = load_file(weights_path)  # or pass a local file path directly

# Tokenize
text = "Contact John Smith at john.smith@example.com or call +1-555-123-4567."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# See the label_mappings.json file for PII label definitions
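Once the model produces token-level logits, predictions can be decoded against the mappings in label_mappings.json. A minimal sketch, using an illustrative id2label subset and random logits in place of real model output:

```python
import torch

# Illustrative subset of label_mappings.json; the real file covers all 53 labels.
id2label = {0: "O", 1: "B-EMAIL", 2: "I-EMAIL"}

# Random logits stand in for the PII head's output: (batch, seq_len, num_labels)
logits = torch.randn(1, 5, len(id2label))

# Per-token prediction = argmax over the label dimension
pred_ids = logits.argmax(dim=-1)[0].tolist()
pred_labels = [id2label[i] for i in pred_ids]
print(pred_labels)
```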

PII Labels (BIO tagging)

The model uses BIO tagging with 26 entity types:

Label Description
AGE Age
BUILDINGNUM Building number
CITY City
COMPANYNAME Company name
COUNTRY Country
CREDITCARDNUMBER Credit card number
DATEOFBIRTH Date of birth
DRIVERLICENSENUM Driver's license number
EMAIL Email
FIRSTNAME First name
IBAN IBAN
IDCARDNUM ID card number
LICENSEPLATENUM License plate number
NATIONALID National ID
PASSPORTID Passport ID
PASSWORD Password
PHONENUMBER Phone number
SECURITYTOKEN API security token
SSN Social Security Number
STATE State
STREET Street
SURNAME Last name
TAXNUM Tax number
URL URL
USERNAME Username
ZIP ZIP code

Each entity type has B- (beginning) and I- (inside) variants, plus O for non-PII tokens.
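A minimal, illustrative sketch of how B-/I-/O tags can be decoded into entity spans (the decoding logic below follows the standard BIO convention and is not taken from this repository):

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into entity spans."""
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span
            if current:
                spans.append(current)
            current = {"type": tag[2:], "tokens": [tok]}
        elif tag.startswith("I-") and current and current["type"] == tag[2:]:
            # An I- tag continues the open span of the same type
            current["tokens"].append(tok)
        else:
            # O (or a mismatched I-) closes any open span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans

tokens = ["Contact", "John", "Smith", "at", "john.smith@example.com"]
tags = ["O", "B-FIRSTNAME", "B-SURNAME", "O", "B-EMAIL"]
print(bio_to_spans(tokens, tags))
```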

Coreference Labels

Label Description
NO_COREF Token is not part of a coreference cluster
CLUSTER_0-CLUSTER_3 Token belongs to coreference cluster 0-3
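Tokens sharing a CLUSTER_n label can be grouped into mention clusters. A small illustrative sketch (token and label sequences are made up for the example):

```python
from collections import defaultdict

def group_clusters(tokens, coref_labels):
    """Collect tokens by coreference cluster, skipping NO_COREF."""
    clusters = defaultdict(list)
    for tok, lab in zip(tokens, coref_labels):
        if lab != "NO_COREF":
            clusters[lab].append(tok)
    return dict(clusters)

tokens = ["John", "Smith", "said", "he", "would", "call"]
labels = ["CLUSTER_0", "CLUSTER_0", "NO_COREF", "CLUSTER_0", "NO_COREF", "NO_COREF"]
print(group_clusters(tokens, labels))  # {'CLUSTER_0': ['John', 'Smith', 'he']}
```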

Training

Epochs 15 (with early stopping)
Batch size 16
Learning rate 3e-5
Weight decay 0.01
Warmup steps 200
Early stopping patience=3, threshold=1%
Loss Multi-task: PII cross-entropy + coreference cross-entropy (equal weights)
Optimizer AdamW
Metric Weighted F1 (PII task)
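The multi-task loss row above corresponds to summing two cross-entropy terms with equal weights. A sketch with random logits and labels, with shapes and label counts taken from this card (variable names are hypothetical):

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()

# Stand-ins for head outputs: (batch, seq_len, num_labels)
pii_logits = torch.randn(2, 10, 53)
coref_logits = torch.randn(2, 10, 7)
pii_labels = torch.randint(0, 53, (2, 10))
coref_labels = torch.randint(0, 7, (2, 10))

# Flatten to (batch * seq_len, num_labels) for token-level cross-entropy,
# then sum the two task losses with equal weights
loss = ce(pii_logits.view(-1, 53), pii_labels.view(-1)) \
     + ce(coref_logits.view(-1, 7), coref_labels.view(-1))
```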

Training Data

Trained on the DataikuNLP/kiji-pii-training-data dataset, a synthetic multilingual PII dataset with entity annotations and coreference resolution.

Derived Models

Variant Format Repository
Quantized (INT8) ONNX DataikuNLP/kiji-pii-model-onnx

Limitations

  • Trained on synthetically generated data, so it may not generalize to all real-world text
  • Coreference head supports up to 4 clusters per sequence
  • Optimized for the 6 languages in the training data (English, German, French, Spanish, Dutch, Danish)
  • Max sequence length is 512 tokens