---
language:
- da
- de
- en
- es
- fr
- nl
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
- pii
- privacy
- ner
- coreference-resolution
- distilbert
- multi-task
base_model: distilbert-base-cased
---

# Kiji PII Detection Model

A multi-task DistilBERT model for detecting Personally Identifiable Information (PII) in text, with coreference resolution. Fine-tuned from [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased).

## Model Summary

| | |
|---|---|
| **Base model** | [distilbert-base-cased](https://huggingface.co/distilbert-base-cased) |
| **Architecture** | Shared DistilBERT encoder + two linear classification heads |
| **Parameters** | ~66M |
| **Model size** | 249 MB (SafeTensors) |
| **Tasks** | PII token classification (53 labels) + coreference detection (7 labels) |
| **PII entity types** | 26 |
| **Max sequence length** | 512 tokens |

## Architecture

```
Input (input_ids, attention_mask)
                |
DistilBERT Encoder (shared, hidden_size=768)
                |
           +----+----+
           |         |
       PII Head  Coref Head
       (768->53)   (768->7)
```

The model uses multi-task learning: a shared DistilBERT encoder feeds into two independent linear classification heads. Both tasks are trained simultaneously with equal loss weighting, which acts as a regularizer and improves PII detection generalization.

## Usage

```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("DataikuNLP/kiji-pii-model")

# The model uses a custom MultiTaskPIIDetectionModel architecture, so the
# weights must be loaded manually rather than via AutoModel:
weights_path = hf_hub_download("DataikuNLP/kiji-pii-model", "model.safetensors")
weights = load_file(weights_path)  # or pass a local file path directly

# Tokenize
text = "Contact John Smith at john.smith@example.com or call +1-555-123-4567."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# See the label_mappings.json file for PII label definitions
```

## PII Labels (BIO tagging)

The model uses BIO tagging with 26 entity types:

| Label | Description |
|-------|-------------|
| `AGE` | Age |
| `BUILDINGNUM` | Building number |
| `CITY` | City |
| `COMPANYNAME` | Company name |
| `COUNTRY` | Country |
| `CREDITCARDNUMBER` | Credit card number |
| `DATEOFBIRTH` | Date of birth |
| `DRIVERLICENSENUM` | Driver's license number |
| `EMAIL` | Email address |
| `FIRSTNAME` | First name |
| `IBAN` | IBAN |
| `IDCARDNUM` | ID card number |
| `LICENSEPLATENUM` | License plate number |
| `NATIONALID` | National ID |
| `PASSPORTID` | Passport ID |
| `PASSWORD` | Password |
| `PHONENUMBER` | Phone number |
| `SECURITYTOKEN` | API security token |
| `SSN` | Social Security number |
| `STATE` | State |
| `STREET` | Street |
| `SURNAME` | Last name |
| `TAXNUM` | Tax number |
| `URL` | URL |
| `USERNAME` | Username |
| `ZIP` | ZIP code |

Each entity type has `B-` (beginning) and `I-` (inside) variants, plus `O` for non-PII tokens.

## Coreference Labels

| Label | Description |
|-------|-------------|
| `NO_COREF` | Token is not part of a coreference cluster |
| `CLUSTER_0`-`CLUSTER_3` | Token belongs to coreference cluster 0-3 |

## Training

| | |
|---|---|
| **Epochs** | 15 (with early stopping) |
| **Batch size** | 16 |
| **Learning rate** | 3e-5 |
| **Weight decay** | 0.01 |
| **Warmup steps** | 200 |
| **Early stopping** | patience=3, threshold=1% |
| **Loss** | Multi-task: PII cross-entropy + coreference cross-entropy (equal weights) |
| **Optimizer** | AdamW |
| **Metric** | Weighted F1 (PII task) |

## Training Data

Trained on the [DataikuNLP/kiji-pii-training-data](https://huggingface.co/datasets/DataikuNLP/kiji-pii-training-data) dataset, a synthetic multilingual PII dataset with entity annotations and coreference resolution.
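The BIO tags listed above can be decoded into entity spans with a small helper. The function below is illustrative only (it is not shipped with the model) and operates on already-predicted label strings:

```python
def bio_to_spans(tokens, labels):
    """Merge token-level BIO tags into (entity_type, start, end) spans.

    `labels` are strings like "B-EMAIL", "I-EMAIL", or "O"; `end` is exclusive.
    """
    spans, start, etype = [], None, None
    for i, label in enumerate(labels):
        if label.startswith("B-"):
            # A new entity begins; close any entity currently open.
            if etype is not None:
                spans.append((etype, start, i))
            etype, start = label[2:], i
        elif label.startswith("I-") and etype == label[2:]:
            # Continuation of the current entity.
            continue
        else:
            # "O" (or an inconsistent I- tag) closes the current entity.
            if etype is not None:
                spans.append((etype, start, i))
            etype, start = None, None
    if etype is not None:
        spans.append((etype, start, len(labels)))
    return spans


tokens = ["Contact", "John", "Smith", "at", "john.smith@example.com"]
labels = ["O", "B-FIRSTNAME", "B-SURNAME", "O", "B-EMAIL"]
print(bio_to_spans(tokens, labels))
# [('FIRSTNAME', 1, 2), ('SURNAME', 2, 3), ('EMAIL', 4, 5)]
```

In practice you would map each span's token indices back to character offsets via the tokenizer's `offset_mapping` before redacting or replacing the PII.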
## Derived Models

| Variant | Format | Repository |
|---------|--------|------------|
| Quantized (INT8) | ONNX | [DataikuNLP/kiji-pii-model-onnx](https://huggingface.co/DataikuNLP/kiji-pii-model-onnx) |

## Limitations

- Trained on **synthetically generated** data; may not generalize to all real-world text
- The coreference head supports at most 4 clusters per sequence
- Optimized for the 6 languages in the training data (English, German, French, Spanish, Dutch, Danish)
- Maximum sequence length is 512 tokens
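Because the repository does not ship an `AutoModel`-loadable class, the shared-encoder/two-head architecture can be reproduced as a minimal PyTorch sketch. The class name, head attribute names, and forward signature below are assumptions inferred from this card, not the exact released implementation; a randomly initialized config is used so the sketch runs without downloading weights:

```python
import torch
from transformers import DistilBertConfig, DistilBertModel


class MultiTaskPIIDetectionModel(torch.nn.Module):
    """Sketch of the architecture described above: one shared DistilBERT
    encoder feeding two independent linear classification heads."""

    def __init__(self, config, num_pii_labels=53, num_coref_labels=7):
        super().__init__()
        self.encoder = DistilBertModel(config)
        self.pii_head = torch.nn.Linear(config.dim, num_pii_labels)
        self.coref_head = torch.nn.Linear(config.dim, num_coref_labels)

    def forward(self, input_ids, attention_mask=None):
        # Per-token hidden states from the shared encoder (batch, seq, 768).
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Each head produces per-token logits for its own label set.
        return self.pii_head(hidden), self.coref_head(hidden)


config = DistilBertConfig()  # defaults match distilbert-base (dim=768)
model = MultiTaskPIIDetectionModel(config)
input_ids = torch.randint(0, config.vocab_size, (1, 12))
pii_logits, coref_logits = model(input_ids)
print(pii_logits.shape, coref_logits.shape)
# torch.Size([1, 12, 53]) torch.Size([1, 12, 7])
```

To use the released checkpoint, you would load the SafeTensors weights from the repository into a module whose parameter names match those stored in the file (inspect `weights.keys()` to confirm the exact layout).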