---
language:
- da
- de
- en
- es
- fr
- nl
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
- pii
- privacy
- ner
- coreference-resolution
- distilbert
- multi-task
base_model: distilbert-base-cased
---
|
|
|
|
|
# Kiji PII Detection Model |
|
|
|
|
|
Multi-task DistilBERT model for detecting Personally Identifiable Information (PII) in text with coreference resolution. Fine-tuned from [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased). |
|
|
|
|
|
## Model Summary |
|
|
|
|
|
| Property | Value |
|
|
|---|---| |
|
|
| **Base model** | [distilbert-base-cased](https://huggingface.co/distilbert-base-cased) | |
|
|
| **Architecture** | Shared DistilBERT encoder + two linear classification heads | |
|
|
| **Parameters** | ~66M | |
|
|
| **Model size** | 249 MB (SafeTensors) | |
|
|
| **Tasks** | PII token classification (53 labels) + coreference detection (7 labels) | |
|
|
| **PII entity types** | 26 | |
|
|
| **Max sequence length** | 512 tokens | |
|
|
|
|
|
## Architecture |
|
|
|
|
|
```
Input (input_ids, attention_mask)
                |
DistilBERT Encoder (shared, hidden_size=768)
                |
           +----+----+
           |         |
       PII Head  Coref Head
       (768->53)  (768->7)
```
|
|
|
|
|
The model uses multi-task learning: a shared DistilBERT encoder feeds into two independent linear classification heads. Both tasks are trained simultaneously with equal loss weighting, which acts as regularization and improves PII detection generalization. |
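The shared-encoder design above can be sketched in PyTorch. The class name `MultiTaskPIIDetectionModel` comes from the usage section below, but the constructor signature and internals here are assumptions; a randomly initialized config-based encoder is used so the sketch runs offline, whereas the real model starts from `distilbert-base-cased` weights.

```python
import torch
import torch.nn as nn
from transformers import DistilBertConfig, DistilBertModel


class MultiTaskPIIDetectionModel(nn.Module):
    """Sketch: one shared DistilBERT encoder feeding two linear heads."""

    def __init__(self, num_pii_labels=53, num_coref_labels=7):
        super().__init__()
        # In practice: DistilBertModel.from_pretrained("distilbert-base-cased").
        # A bare config (hidden size 768) keeps this sketch offline-runnable.
        self.encoder = DistilBertModel(DistilBertConfig())
        hidden = self.encoder.config.dim  # 768
        self.pii_head = nn.Linear(hidden, num_pii_labels)
        self.coref_head = nn.Linear(hidden, num_coref_labels)

    def forward(self, input_ids, attention_mask=None):
        # (batch, seq_len, 768) token representations from the shared encoder
        h = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Each head produces per-token logits for its own label set
        return self.pii_head(h), self.coref_head(h)


model = MultiTaskPIIDetectionModel()
ids = torch.randint(0, 1000, (1, 8))
pii_logits, coref_logits = model(ids)
print(pii_logits.shape, coref_logits.shape)
```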
|
|
|
|
|
## Usage |
|
|
|
|
|
```python
import torch
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("DataikuNLP/kiji-pii-model")

# The model uses a custom MultiTaskPIIDetectionModel architecture, so it
# cannot be loaded with AutoModel. Download and load the weights manually:
weights_path = hf_hub_download("DataikuNLP/kiji-pii-model", "model.safetensors")
weights = load_file(weights_path)  # dict of tensor name -> torch.Tensor

# Tokenize
text = "Contact John Smith at john.smith@example.com or call +1-555-123-4567."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# See the label_mappings.json file for PII label definitions
```
|
|
|
|
|
## PII Labels (BIO tagging) |
|
|
|
|
|
The model uses BIO tagging with 26 entity types: |
|
|
|
|
|
| Label | Description | |
|
|
|-------|-------------| |
|
|
| `AGE` | Age | |
|
|
| `BUILDINGNUM` | Building number | |
|
|
| `CITY` | City | |
|
|
| `COMPANYNAME` | Company name | |
|
|
| `COUNTRY` | Country | |
|
|
| `CREDITCARDNUMBER` | Credit Card Number | |
|
|
| `DATEOFBIRTH` | Date of birth | |
|
|
| `DRIVERLICENSENUM` | Driver's License Number | |
|
|
| `EMAIL` | Email | |
|
|
| `FIRSTNAME` | First name | |
|
|
| `IBAN` | IBAN | |
|
|
| `IDCARDNUM` | ID Card Number | |
|
|
| `LICENSEPLATENUM` | License Plate Number | |
|
|
| `NATIONALID` | National ID | |
|
|
| `PASSPORTID` | Passport ID | |
|
|
| `PASSWORD` | Password | |
|
|
| `PHONENUMBER` | Phone number | |
|
|
| `SECURITYTOKEN` | API Security Tokens | |
|
|
| `SSN` | Social Security Number | |
|
|
| `STATE` | State | |
|
|
| `STREET` | Street | |
|
|
| `SURNAME` | Last name | |
|
|
| `TAXNUM` | Tax Number | |
|
|
| `URL` | URL | |
|
|
| `USERNAME` | Username | |
|
|
| `ZIP` | Zip code | |
|
|
|
|
|
|
|
|
Each entity type has `B-` (beginning) and `I-` (inside) variants, plus `O` for non-PII tokens. |
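To illustrate the BIO scheme, here is a minimal decoder that groups per-token tags back into entity mentions. The function name and the token/tag list representation are illustrative assumptions, not part of the model's API.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, token_span) entity mentions."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:          # close the previous entity
                spans.append((etype, tokens[start:i]))
            start, etype = i, tag[2:]      # open a new entity
        elif tag.startswith("I-") and etype == tag[2:]:
            continue                        # extend the current entity
        else:                               # "O" or a stray I- tag
            if start is not None:
                spans.append((etype, tokens[start:i]))
            start, etype = None, None
    if start is not None:                   # entity running to the end
        spans.append((etype, tokens[start:]))
    return spans


tokens = ["Contact", "John", "Smith", "at", "john.smith@example.com"]
tags = ["O", "B-FIRSTNAME", "B-SURNAME", "O", "B-EMAIL"]
print(bio_to_spans(tokens, tags))
```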
|
|
|
|
|
## Coreference Labels |
|
|
|
|
|
| Label | Description | |
|
|
|-------|-------------| |
|
|
| `NO_COREF` | Token is not part of a coreference cluster | |
|
|
| `CLUSTER_0`-`CLUSTER_3` | Token belongs to coreference cluster 0-3 | |
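A minimal sketch of turning per-token cluster labels into mention groups. The helper is hypothetical; only the label names (`NO_COREF`, `CLUSTER_0`-`CLUSTER_3`) come from the table above.

```python
def group_coref_clusters(tokens, coref_labels):
    """Collect tokens sharing a CLUSTER_* label into one group each."""
    clusters = {}
    for tok, lab in zip(tokens, coref_labels):
        if lab != "NO_COREF":
            clusters.setdefault(lab, []).append(tok)
    return clusters


tokens = ["John", "said", "he", "was", "late"]
labels = ["CLUSTER_0", "NO_COREF", "CLUSTER_0", "NO_COREF", "NO_COREF"]
print(group_coref_clusters(tokens, labels))
```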
|
|
|
|
|
## Training |
|
|
|
|
|
| Hyperparameter | Value |
|
|
|---|---| |
|
|
| **Epochs** | 15 (with early stopping) | |
|
|
| **Batch size** | 16 | |
|
|
| **Learning rate** | 3e-5 | |
|
|
| **Weight decay** | 0.01 | |
|
|
| **Warmup steps** | 200 | |
|
|
| **Early stopping** | patience=3, threshold=1% | |
|
|
| **Loss** | Multi-task: PII cross-entropy + coreference cross-entropy (equal weights) | |
|
|
| **Optimizer** | AdamW | |
|
|
| **Metric** | Weighted F1 (PII task) | |
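The equally weighted multi-task loss from the table can be sketched as below. This is an illustration, not the training code; the `ignore_index=-100` convention for masking special and padded tokens is an assumption borrowed from common `transformers` practice.

```python
import torch
import torch.nn.functional as F


def multi_task_loss(pii_logits, coref_logits, pii_labels, coref_labels):
    """Sum of the two per-token cross-entropy losses with equal weights."""
    # Flatten (batch, seq_len, num_labels) -> (batch*seq_len, num_labels)
    pii_loss = F.cross_entropy(
        pii_logits.reshape(-1, pii_logits.size(-1)),
        pii_labels.reshape(-1),
        ignore_index=-100,
    )
    coref_loss = F.cross_entropy(
        coref_logits.reshape(-1, coref_logits.size(-1)),
        coref_labels.reshape(-1),
        ignore_index=-100,
    )
    return pii_loss + coref_loss  # equal weighting


pii_logits = torch.randn(2, 4, 53)
coref_logits = torch.randn(2, 4, 7)
pii_labels = torch.randint(0, 53, (2, 4))
coref_labels = torch.randint(0, 7, (2, 4))
loss = multi_task_loss(pii_logits, coref_logits, pii_labels, coref_labels)
print(loss)
```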
|
|
|
|
|
## Training Data |
|
|
|
|
|
Trained on the [DataikuNLP/kiji-pii-training-data](https://huggingface.co/datasets/DataikuNLP/kiji-pii-training-data) dataset — a synthetic multilingual PII dataset with entity annotations and coreference resolution. |
|
|
|
|
|
## Derived Models |
|
|
|
|
|
| Variant | Format | Repository | |
|
|
|---------|--------|------------| |
|
|
| Quantized (INT8) | ONNX | [DataikuNLP/kiji-pii-model-onnx](https://huggingface.co/DataikuNLP/kiji-pii-model-onnx) | |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Trained on **synthetically generated** data — may not generalize to all real-world text |
|
|
- Coreference head supports up to 4 clusters per sequence |
|
|
- Optimized for the 6 languages in the training data (English, German, French, Spanish, Dutch, Danish) |
|
|
- Max sequence length is 512 tokens |
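For inputs longer than 512 tokens, one common workaround is to run the model over overlapping windows of token ids and merge the predictions. A minimal sketch (the window and overlap sizes, and the merging strategy left to the caller, are illustrative assumptions):

```python
def sliding_windows(token_ids, max_len=512, overlap=64):
    """Split a token-id list into overlapping windows of at most max_len."""
    step = max_len - overlap  # overlap lets boundary entities appear whole
    windows = []
    start = 0
    while start < len(token_ids):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += step
    return windows


ids = list(range(1000))
w = sliding_windows(ids, max_len=512, overlap=64)
print(len(w), len(w[0]), w[-1][-1])
```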
|
|
|