---
language:
- da
- de
- en
- es
- fr
- nl
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
- pii
- privacy
- ner
- coreference-resolution
- distilbert
- multi-task
base_model: distilbert-base-cased
---

# Kiji PII Detection Model

A multi-task DistilBERT model for detecting Personally Identifiable Information (PII) in text, with coreference resolution. Fine-tuned from [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased).

## Model Summary

| | |
|---|---|
| **Base model** | [distilbert-base-cased](https://huggingface.co/distilbert-base-cased) |
| **Architecture** | Shared DistilBERT encoder + two linear classification heads |
| **Parameters** | ~66M |
| **Model size** | 249 MB (SafeTensors) |
| **Tasks** | PII token classification (53 labels) + coreference detection (7 labels) |
| **PII entity types** | 26 |
| **Max sequence length** | 512 tokens |

## Architecture

```
Input (input_ids, attention_mask)
                |
DistilBERT Encoder (shared, hidden_size=768)
                |
           +----+----+
           |         |
       PII Head  Coref Head
       (768->53)   (768->7)
```

The model uses multi-task learning: a shared DistilBERT encoder feeds into two independent linear classification heads. Both tasks are trained simultaneously with equal loss weighting, which acts as a regularizer and improves PII detection generalization.

## Usage

```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("DataikuNLP/kiji-pii-model")

# The model uses a custom MultiTaskPIIDetectionModel architecture, so the
# weights must be loaded manually rather than via AutoModel:
weights_path = hf_hub_download("DataikuNLP/kiji-pii-model", "model.safetensors")
weights = load_file(weights_path)  # or pass a local file path directly

# Tokenize
text = "Contact John Smith at john.smith@example.com or call +1-555-123-4567."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# See the label_mappings.json file for PII label definitions
```

## PII Labels (BIO tagging)

The model uses BIO tagging with 26 entity types:

| Label | Description |
|-------|-------------|
| `AGE` | Age |
| `BUILDINGNUM` | Building number |
| `CITY` | City |
| `COMPANYNAME` | Company name |
| `COUNTRY` | Country |
| `CREDITCARDNUMBER` | Credit card number |
| `DATEOFBIRTH` | Date of birth |
| `DRIVERLICENSENUM` | Driver's license number |
| `EMAIL` | Email address |
| `FIRSTNAME` | First name |
| `IBAN` | IBAN |
| `IDCARDNUM` | ID card number |
| `LICENSEPLATENUM` | License plate number |
| `NATIONALID` | National ID |
| `PASSPORTID` | Passport ID |
| `PASSWORD` | Password |
| `PHONENUMBER` | Phone number |
| `SECURITYTOKEN` | API security token |
| `SSN` | Social Security number |
| `STATE` | State |
| `STREET` | Street |
| `SURNAME` | Last name |
| `TAXNUM` | Tax number |
| `URL` | URL |
| `USERNAME` | Username |
| `ZIP` | ZIP code |

Each entity type has `B-` (beginning) and `I-` (inside) variants, plus `O` for non-PII tokens.

## Coreference Labels

| Label | Description |
|-------|-------------|
| `NO_COREF` | Token is not part of a coreference cluster |
| `CLUSTER_0`-`CLUSTER_3` | Token belongs to coreference cluster 0-3 |

## Training

| | |
|---|---|
| **Epochs** | 15 (with early stopping) |
| **Batch size** | 16 |
| **Learning rate** | 3e-5 |
| **Weight decay** | 0.01 |
| **Warmup steps** | 200 |
| **Early stopping** | patience=3, threshold=1% |
| **Loss** | Multi-task: PII cross-entropy + coreference cross-entropy (equal weights) |
| **Optimizer** | AdamW |
| **Metric** | Weighted F1 (PII task) |

## Training Data

Trained on the [DataikuNLP/kiji-pii-training-data](https://huggingface.co/datasets/DataikuNLP/kiji-pii-training-data) dataset, a synthetic multilingual PII dataset with entity annotations and coreference resolution.
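The BIO tags listed above can be decoded into entity spans with a small helper. The function below is illustrative only (it is not shipped with the model) and operates on already-predicted label strings:

```python
def bio_to_spans(tokens, labels):
    """Merge token-level BIO tags into (entity_type, start, end) spans.

    `labels` are strings like "B-EMAIL", "I-EMAIL", or "O"; `end` is exclusive.
    """
    spans, start, etype = [], None, None
    for i, label in enumerate(labels):
        if label.startswith("B-"):
            # A new entity begins; close any entity currently open.
            if etype is not None:
                spans.append((etype, start, i))
            etype, start = label[2:], i
        elif label.startswith("I-") and etype == label[2:]:
            # Continuation of the current entity.
            continue
        else:
            # "O" (or an inconsistent I- tag) closes the current entity.
            if etype is not None:
                spans.append((etype, start, i))
            etype, start = None, None
    if etype is not None:
        spans.append((etype, start, len(labels)))
    return spans


tokens = ["Contact", "John", "Smith", "at", "john.smith@example.com"]
labels = ["O", "B-FIRSTNAME", "B-SURNAME", "O", "B-EMAIL"]
print(bio_to_spans(tokens, labels))
# [('FIRSTNAME', 1, 2), ('SURNAME', 2, 3), ('EMAIL', 4, 5)]
```

In practice you would map each span's token indices back to character offsets via the tokenizer's `offset_mapping` before redacting or replacing the PII.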
## Derived Models

| Variant | Format | Repository |
|---------|--------|------------|
| Quantized (INT8) | ONNX | [DataikuNLP/kiji-pii-model-onnx](https://huggingface.co/DataikuNLP/kiji-pii-model-onnx) |

## Limitations

- Trained on **synthetically generated** data; may not generalize to all real-world text
- The coreference head supports at most 4 clusters per sequence
- Optimized for the 6 languages in the training data (English, German, French, Spanish, Dutch, Danish)
- Maximum sequence length is 512 tokens
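Because the repository does not ship an `AutoModel`-loadable class, the shared-encoder/two-head architecture can be reproduced as a minimal PyTorch sketch. The class name, head attribute names, and forward signature below are assumptions inferred from this card, not the exact released implementation; a randomly initialized config is used so the sketch runs without downloading weights:

```python
import torch
from transformers import DistilBertConfig, DistilBertModel


class MultiTaskPIIDetectionModel(torch.nn.Module):
    """Sketch of the architecture described above: one shared DistilBERT
    encoder feeding two independent linear classification heads."""

    def __init__(self, config, num_pii_labels=53, num_coref_labels=7):
        super().__init__()
        self.encoder = DistilBertModel(config)
        self.pii_head = torch.nn.Linear(config.dim, num_pii_labels)
        self.coref_head = torch.nn.Linear(config.dim, num_coref_labels)

    def forward(self, input_ids, attention_mask=None):
        # Per-token hidden states from the shared encoder (batch, seq, 768).
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Each head produces per-token logits for its own label set.
        return self.pii_head(hidden), self.coref_head(hidden)


config = DistilBertConfig()  # defaults match distilbert-base (dim=768)
model = MultiTaskPIIDetectionModel(config)
input_ids = torch.randint(0, config.vocab_size, (1, 12))
pii_logits, coref_logits = model(input_ids)
print(pii_logits.shape, coref_logits.shape)
# torch.Size([1, 12, 53]) torch.Size([1, 12, 7])
```

To use the released checkpoint, you would load the SafeTensors weights from the repository into a module whose parameter names match those stored in the file (inspect `weights.keys()` to confirm the exact layout).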