---
language:
- da
- de
- en
- es
- fr
- nl
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
- pii
- privacy
- ner
- coreference-resolution
- distilbert
- multi-task
base_model: distilbert-base-cased
---
# Kiji PII Detection Model
Multi-task DistilBERT model for detecting Personally Identifiable Information (PII) in text with coreference resolution. Fine-tuned from [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased).
## Model Summary
| | |
|---|---|
| **Base model** | [distilbert-base-cased](https://huggingface.co/distilbert-base-cased) |
| **Architecture** | Shared DistilBERT encoder + two linear classification heads |
| **Parameters** | ~66M |
| **Model size** | 249 MB (SafeTensors) |
| **Tasks** | PII token classification (53 labels) + coreference detection (7 labels) |
| **PII entity types** | 26 |
| **Max sequence length** | 512 tokens |
## Architecture
```
Input (input_ids, attention_mask)
|
DistilBERT Encoder (shared, hidden_size=768)
|
+----+----+
| |
PII Head Coref Head
(768->53) (768->7)
```
The model uses multi-task learning: a shared DistilBERT encoder feeds two independent linear classification heads. Both heads are trained simultaneously with equal loss weighting; the auxiliary coreference task acts as a regularizer and improves the generalization of PII detection.
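The released `MultiTaskPIIDetectionModel` class is custom and not reproduced here; the following is a minimal sketch of the architecture described above, using a randomly initialized `DistilBertConfig` (no weight download) just to illustrate the shared encoder and the two head dimensions:

```python
import torch
from transformers import DistilBertConfig, DistilBertModel

class MultiTaskPIIModel(torch.nn.Module):
    """Sketch of a shared DistilBERT encoder with two linear heads.

    Not the exact release code: head names and label counts follow the
    model card (53 PII labels, 7 coreference labels).
    """

    def __init__(self, config, num_pii_labels=53, num_coref_labels=7):
        super().__init__()
        self.encoder = DistilBertModel(config)  # shared encoder, hidden_size=768
        self.pii_head = torch.nn.Linear(config.dim, num_pii_labels)
        self.coref_head = torch.nn.Linear(config.dim, num_coref_labels)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state  # (batch, seq_len, 768)
        return self.pii_head(hidden), self.coref_head(hidden)

# Randomly initialized config, only to check output shapes
config = DistilBertConfig()
model = MultiTaskPIIModel(config)
ids = torch.randint(0, config.vocab_size, (1, 16))
pii_logits, coref_logits = model(ids)
print(pii_logits.shape, coref_logits.shape)
```

Each forward pass returns one logit vector per token for each task, so both heads can be supervised from the same encoder output.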
## Usage
```python
import torch
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("DataikuNLP/kiji-pii-model")

# The model uses a custom MultiTaskPIIDetectionModel architecture,
# so the weights must be loaded manually:
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

weights_path = hf_hub_download("DataikuNLP/kiji-pii-model", "model.safetensors")
weights = load_file(weights_path)

# Tokenize
text = "Contact John Smith at john.smith@example.com or call +1-555-123-4567."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# See the label_mappings.json file for PII label definitions
```
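Once the model produces per-token logits, taking the argmax over the PII head and mapping label ids to tag strings yields BIO tags. The exact structure of `label_mappings.json` is not reproduced here, so this sketch builds a small illustrative `id2label` mapping inline:

```python
# Illustrative id2label mapping — the real one lives in label_mappings.json
id2label = {0: "O", 1: "B-FIRSTNAME", 2: "I-FIRSTNAME", 3: "B-EMAIL", 4: "I-EMAIL"}

def decode_predictions(pred_ids, id2label):
    """Map per-token predicted label ids (argmax of the PII logits) to BIO tags."""
    return [id2label[i] for i in pred_ids]

tags = decode_predictions([0, 1, 2, 0, 3, 4], id2label)
print(tags)  # ['O', 'B-FIRSTNAME', 'I-FIRSTNAME', 'O', 'B-EMAIL', 'I-EMAIL']
```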
## PII Labels (BIO tagging)
The model uses BIO tagging with 26 entity types:
| Label | Description |
|-------|-------------|
| `AGE` | Age |
| `BUILDINGNUM` | Building number |
| `CITY` | City |
| `COMPANYNAME` | Company name |
| `COUNTRY` | Country |
| `CREDITCARDNUMBER` | Credit card number |
| `DATEOFBIRTH` | Date of birth |
| `DRIVERLICENSENUM` | Driver's license number |
| `EMAIL` | Email |
| `FIRSTNAME` | First name |
| `IBAN` | IBAN |
| `IDCARDNUM` | ID card number |
| `LICENSEPLATENUM` | License plate number |
| `NATIONALID` | National ID |
| `PASSPORTID` | Passport ID |
| `PASSWORD` | Password |
| `PHONENUMBER` | Phone number |
| `SECURITYTOKEN` | API security token |
| `SSN` | Social Security Number |
| `STATE` | State |
| `STREET` | Street |
| `SURNAME` | Last name |
| `TAXNUM` | Tax Number |
| `URL` | URL |
| `USERNAME` | Username |
| `ZIP` | Zip code |
Each entity type has `B-` (beginning) and `I-` (inside) variants, plus `O` for non-PII tokens.
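A typical post-processing step is merging adjacent `B-`/`I-` tags back into entity spans. Here is a minimal sketch (not part of the released code) that groups tokens into `(entity_type, text)` pairs:

```python
def bio_to_spans(tokens, tags):
    """Merge BIO-tagged tokens into (entity_type, text) spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])  # start a new span
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(token)      # continue the open span
        else:
            if current:
                spans.append(current)
            current = None                # O tag or mismatched I- tag
    if current:
        spans.append(current)
    return [(etype, " ".join(toks)) for etype, toks in spans]

tokens = ["Contact", "John", "Smith", "at", "john.smith@example.com"]
tags = ["O", "B-FIRSTNAME", "B-SURNAME", "O", "B-EMAIL"]
print(bio_to_spans(tokens, tags))
# [('FIRSTNAME', 'John'), ('SURNAME', 'Smith'), ('EMAIL', 'john.smith@example.com')]
```

Note that with subword tokenizers you would additionally merge word pieces back into whole words before joining.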
## Coreference Labels
| Label | Description |
|-------|-------------|
| `NO_COREF` | Token is not part of a coreference cluster |
| `CLUSTER_0`-`CLUSTER_3` | Token belongs to coreference cluster 0-3 |
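Since the coreference head assigns a cluster label per token, grouping mentions is a simple bucketing step. A minimal sketch (illustrative, not the released code):

```python
def group_coref_clusters(tokens, coref_tags):
    """Group tokens by predicted coreference cluster, skipping NO_COREF."""
    clusters = {}
    for token, tag in zip(tokens, coref_tags):
        if tag != "NO_COREF":
            clusters.setdefault(tag, []).append(token)
    return clusters

tokens = ["John", "Smith", "said", "he", "would", "call"]
tags = ["CLUSTER_0", "CLUSTER_0", "NO_COREF", "CLUSTER_0", "NO_COREF", "NO_COREF"]
print(group_coref_clusters(tokens, tags))
# {'CLUSTER_0': ['John', 'Smith', 'he']}
```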
## Training
| | |
|---|---|
| **Epochs** | 15 (with early stopping) |
| **Batch size** | 16 |
| **Learning rate** | 3e-5 |
| **Weight decay** | 0.01 |
| **Warmup steps** | 200 |
| **Early stopping** | patience=3, threshold=1% |
| **Loss** | Multi-task: PII cross-entropy + coreference cross-entropy (equal weights) |
| **Optimizer** | AdamW |
| **Metric** | Weighted F1 (PII task) |
## Training Data
Trained on the [DataikuNLP/kiji-pii-training-data](https://huggingface.co/datasets/DataikuNLP/kiji-pii-training-data) dataset — a synthetic multilingual PII dataset with entity annotations and coreference resolution.
## Derived Models
| Variant | Format | Repository |
|---------|--------|------------|
| Quantized (INT8) | ONNX | [DataikuNLP/kiji-pii-model-onnx](https://huggingface.co/DataikuNLP/kiji-pii-model-onnx) |
## Limitations
- Trained on **synthetically generated** data — may not generalize to all real-world text
- Coreference head supports up to 4 clusters per sequence
- Optimized for the 6 languages in the training data (English, German, French, Spanish, Dutch, Danish)
- Max sequence length is 512 tokens
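For inputs longer than the 512-token limit, a common workaround (not shipped with the model) is to split the token sequence into overlapping windows and run each window separately. A minimal sketch:

```python
def sliding_windows(token_ids, max_len=512, stride=128):
    """Split a long token-id sequence into windows of at most max_len tokens,
    overlapping by `stride` tokens so entities near window edges are not cut."""
    if len(token_ids) <= max_len:
        return [token_ids]
    windows, start = [], 0
    while start < len(token_ids):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += max_len - stride
    return windows

chunks = sliding_windows(list(range(1000)), max_len=512, stride=128)
print([len(c) for c in chunks])  # [512, 512, 232]
```

Predictions in the overlap region can then be reconciled, for example by keeping the label from the window where the token is farther from the edge.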