---
language:
- da
- de
- en
- es
- fr
- nl
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
- pii
- privacy
- ner
- coreference-resolution
- distilbert
- multi-task
base_model: distilbert-base-cased
---
# Kiji PII Detection Model
Multi-task DistilBERT model for detecting Personally Identifiable Information (PII) in text with coreference resolution. Fine-tuned from [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased).
## Model Summary
| | |
|---|---|
| **Base model** | [distilbert-base-cased](https://huggingface.co/distilbert-base-cased) |
| **Architecture** | Shared DistilBERT encoder + two linear classification heads |
| **Parameters** | ~66M |
| **Model size** | 249 MB (SafeTensors) |
| **Tasks** | PII token classification (53 labels) + coreference detection (7 labels) |
| **PII entity types** | 26 |
| **Max sequence length** | 512 tokens |
## Architecture
```
Input (input_ids, attention_mask)
|
DistilBERT Encoder (shared, hidden_size=768)
|
+----+----+
| |
PII Head Coref Head
(768->53) (768->7)
```
The model uses multi-task learning: a shared DistilBERT encoder feeds two independent linear classification heads. Both heads are trained simultaneously with equal loss weighting; the auxiliary coreference task acts as a regularizer and improves the generalization of PII detection.
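The released `MultiTaskPIIDetectionModel` class is custom and not reproduced here; the following is a minimal sketch of the architecture described above, using a randomly initialized `DistilBertConfig` (no weight download) just to illustrate the shared encoder and the two head dimensions:

```python
import torch
from transformers import DistilBertConfig, DistilBertModel

class MultiTaskPIIModel(torch.nn.Module):
    """Sketch of a shared DistilBERT encoder with two linear heads.

    Not the exact release code: head names and label counts follow the
    model card (53 PII labels, 7 coreference labels).
    """

    def __init__(self, config, num_pii_labels=53, num_coref_labels=7):
        super().__init__()
        self.encoder = DistilBertModel(config)  # shared encoder, hidden_size=768
        self.pii_head = torch.nn.Linear(config.dim, num_pii_labels)
        self.coref_head = torch.nn.Linear(config.dim, num_coref_labels)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state  # (batch, seq_len, 768)
        return self.pii_head(hidden), self.coref_head(hidden)

# Randomly initialized config, only to check output shapes
config = DistilBertConfig()
model = MultiTaskPIIModel(config)
ids = torch.randint(0, config.vocab_size, (1, 16))
pii_logits, coref_logits = model(ids)
print(pii_logits.shape, coref_logits.shape)
```

Each forward pass returns one logit vector per token for each task, so both heads can be supervised from the same encoder output.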
## Usage
```python
import torch
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("DataikuNLP/kiji-pii-model")

# The model uses a custom MultiTaskPIIDetectionModel architecture,
# so the weights must be loaded manually:
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

weights_path = hf_hub_download("DataikuNLP/kiji-pii-model", "model.safetensors")
weights = load_file(weights_path)

# Tokenize
text = "Contact John Smith at john.smith@example.com or call +1-555-123-4567."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# See the label_mappings.json file for PII label definitions
```
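Once the model produces per-token logits, taking the argmax over the PII head and mapping label ids to tag strings yields BIO tags. The exact structure of `label_mappings.json` is not reproduced here, so this sketch builds a small illustrative `id2label` mapping inline:

```python
# Illustrative id2label mapping — the real one lives in label_mappings.json
id2label = {0: "O", 1: "B-FIRSTNAME", 2: "I-FIRSTNAME", 3: "B-EMAIL", 4: "I-EMAIL"}

def decode_predictions(pred_ids, id2label):
    """Map per-token predicted label ids (argmax of the PII logits) to BIO tags."""
    return [id2label[i] for i in pred_ids]

tags = decode_predictions([0, 1, 2, 0, 3, 4], id2label)
print(tags)  # ['O', 'B-FIRSTNAME', 'I-FIRSTNAME', 'O', 'B-EMAIL', 'I-EMAIL']
```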
## PII Labels (BIO tagging)
The model uses BIO tagging with 26 entity types:
| Label | Description |
|-------|-------------|
| `AGE` | Age |
| `BUILDINGNUM` | Building number |
| `CITY` | City |
| `COMPANYNAME` | Company name |
| `COUNTRY` | Country |
| `CREDITCARDNUMBER` | Credit card number |
| `DATEOFBIRTH` | Date of birth |
| `DRIVERLICENSENUM` | Driver's license number |
| `EMAIL` | Email |
| `FIRSTNAME` | First name |
| `IBAN` | IBAN |
| `IDCARDNUM` | ID card number |
| `LICENSEPLATENUM` | License plate number |
| `NATIONALID` | National ID |
| `PASSPORTID` | Passport ID |
| `PASSWORD` | Password |
| `PHONENUMBER` | Phone number |
| `SECURITYTOKEN` | API security token |
| `SSN` | Social Security Number |
| `STATE` | State |
| `STREET` | Street |
| `SURNAME` | Last name |
| `TAXNUM` | Tax Number |
| `URL` | URL |
| `USERNAME` | Username |
| `ZIP` | Zip code |
Each entity type has `B-` (beginning) and `I-` (inside) variants, plus `O` for non-PII tokens.
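A typical post-processing step is merging adjacent `B-`/`I-` tags back into entity spans. Here is a minimal sketch (not part of the released code) that groups tokens into `(entity_type, text)` pairs:

```python
def bio_to_spans(tokens, tags):
    """Merge BIO-tagged tokens into (entity_type, text) spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])  # start a new span
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(token)      # continue the open span
        else:
            if current:
                spans.append(current)
            current = None                # O tag or mismatched I- tag
    if current:
        spans.append(current)
    return [(etype, " ".join(toks)) for etype, toks in spans]

tokens = ["Contact", "John", "Smith", "at", "john.smith@example.com"]
tags = ["O", "B-FIRSTNAME", "B-SURNAME", "O", "B-EMAIL"]
print(bio_to_spans(tokens, tags))
# [('FIRSTNAME', 'John'), ('SURNAME', 'Smith'), ('EMAIL', 'john.smith@example.com')]
```

Note that with subword tokenizers you would additionally merge word pieces back into whole words before joining.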
## Coreference Labels
| Label | Description |
|-------|-------------|
| `NO_COREF` | Token is not part of a coreference cluster |
| `CLUSTER_0`-`CLUSTER_3` | Token belongs to coreference cluster 0-3 |
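Since the coreference head assigns a cluster label per token, grouping mentions is a simple bucketing step. A minimal sketch (illustrative, not the released code):

```python
def group_coref_clusters(tokens, coref_tags):
    """Group tokens by predicted coreference cluster, skipping NO_COREF."""
    clusters = {}
    for token, tag in zip(tokens, coref_tags):
        if tag != "NO_COREF":
            clusters.setdefault(tag, []).append(token)
    return clusters

tokens = ["John", "Smith", "said", "he", "would", "call"]
tags = ["CLUSTER_0", "CLUSTER_0", "NO_COREF", "CLUSTER_0", "NO_COREF", "NO_COREF"]
print(group_coref_clusters(tokens, tags))
# {'CLUSTER_0': ['John', 'Smith', 'he']}
```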
## Training
| | |
|---|---|
| **Epochs** | 15 (with early stopping) |
| **Batch size** | 16 |
| **Learning rate** | 3e-5 |
| **Weight decay** | 0.01 |
| **Warmup steps** | 200 |
| **Early stopping** | patience=3, threshold=1% |
| **Loss** | Multi-task: PII cross-entropy + coreference cross-entropy (equal weights) |
| **Optimizer** | AdamW |
| **Metric** | Weighted F1 (PII task) |
## Training Data
Trained on the [DataikuNLP/kiji-pii-training-data](https://huggingface.co/datasets/DataikuNLP/kiji-pii-training-data) dataset — a synthetic multilingual PII dataset with entity annotations and coreference resolution.
## Derived Models
| Variant | Format | Repository |
|---------|--------|------------|
| Quantized (INT8) | ONNX | [DataikuNLP/kiji-pii-model-onnx](https://huggingface.co/DataikuNLP/kiji-pii-model-onnx) |
## Limitations
- Trained on **synthetically generated** data — may not generalize to all real-world text
- Coreference head supports up to 4 clusters per sequence
- Optimized for the 6 languages in the training data (English, German, French, Spanish, Dutch, Danish)
- Max sequence length is 512 tokens
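For inputs longer than the 512-token limit, a common workaround (not shipped with the model) is to split the token sequence into overlapping windows and run each window separately. A minimal sketch:

```python
def sliding_windows(token_ids, max_len=512, stride=128):
    """Split a long token-id sequence into windows of at most max_len tokens,
    overlapping by `stride` tokens so entities near window edges are not cut."""
    if len(token_ids) <= max_len:
        return [token_ids]
    windows, start = [], 0
    while start < len(token_ids):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += max_len - stride
    return windows

chunks = sliding_windows(list(range(1000)), max_len=512, stride=128)
print([len(c) for c in chunks])  # [512, 512, 232]
```

Predictions in the overlap region can then be reconciled, for example by keeping the label from the window where the token is farther from the edge.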