hanneshapke committed on
Commit 6d35bc0 · verified · 1 Parent(s): df7d65b

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,3 +1,146 @@
- ---
- license: apache-2.0
- ---
+ ---
+ language:
+ - da
+ - de
+ - en
+ - es
+ - fr
+ - nl
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: token-classification
+ tags:
+ - pii
+ - privacy
+ - ner
+ - coreference-resolution
+ - distilbert
+ - multi-task
+ base_model: distilbert-base-cased
+ ---
+
+ # Kiji PII Detection Model
+
+ Multi-task DistilBERT model for detecting Personally Identifiable Information (PII) in text, with token-level coreference resolution. Fine-tuned from [`distilbert-base-cased`](https://huggingface.co/distilbert-base-cased).
+
+ ## Model Summary
+
+ | | |
+ |---|---|
+ | **Base model** | [distilbert-base-cased](https://huggingface.co/distilbert-base-cased) |
+ | **Architecture** | Shared DistilBERT encoder + two linear classification heads |
+ | **Parameters** | ~66M |
+ | **Model size** | 249 MB (SafeTensors) |
+ | **Tasks** | PII token classification (53 labels) + coreference detection (7 labels) |
+ | **PII entity types** | 26 |
+ | **Max sequence length** | 512 tokens |
+
+ ## Architecture
+
+ ```
+ Input (input_ids, attention_mask)
+               |
+ DistilBERT Encoder (shared, hidden_size=768)
+               |
+          +----+----+
+          |         |
+      PII Head   Coref Head
+      (768->53)  (768->7)
+ ```
+
+ The model uses multi-task learning: a shared DistilBERT encoder feeds into two independent linear classification heads. Both tasks are trained simultaneously with equal loss weighting, which acts as regularization and improves PII detection generalization.
+
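+ The repository does not register this architecture with `transformers`, so the class must be defined in user code. Below is a minimal sketch of the two-head design described above; the class name comes from this card, but the constructor arguments and attribute names are assumptions and may not match the original training code:
+
+ ```python
+ import torch.nn as nn
+ from transformers import AutoModel
+
+ class MultiTaskPIIDetectionModel(nn.Module):
+     """Shared DistilBERT encoder with two token-level heads (sketch)."""
+
+     def __init__(self, base_model="distilbert-base-cased",
+                  num_pii_labels=53, num_coref_labels=7):
+         super().__init__()
+         self.encoder = AutoModel.from_pretrained(base_model)
+         hidden = self.encoder.config.hidden_size                # 768
+         self.pii_head = nn.Linear(hidden, num_pii_labels)       # 768 -> 53
+         self.coref_head = nn.Linear(hidden, num_coref_labels)   # 768 -> 7
+
+     def forward(self, input_ids, attention_mask):
+         # One shared encoding, two independent per-token classifications
+         states = self.encoder(input_ids=input_ids,
+                               attention_mask=attention_mask).last_hidden_state
+         return self.pii_head(states), self.coref_head(states)
+ ```
+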
+ ## Usage
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer
+ from huggingface_hub import hf_hub_download
+ from safetensors.torch import load_file
+
+ # Load tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("DataikuNLP/kiji-pii-model")
+
+ # The model uses a custom MultiTaskPIIDetectionModel architecture (see the
+ # sketch in the Architecture section), so AutoModel cannot load it directly.
+ # Download the checkpoint and load the weights manually:
+ weights_path = hf_hub_download("DataikuNLP/kiji-pii-model", "model.safetensors")
+ weights = load_file(weights_path)  # or pass a local path directly
+
+ # Tokenize
+ text = "Contact John Smith at john.smith@example.com or call +1-555-123-4567."
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
+
+ # See label_mappings.json in this repo for the PII label definitions;
+ # a full inference sketch follows below.
+ ```
+
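+ Continuing the snippet above, a hedged end-to-end inference pass could look like this. It relies on the `MultiTaskPIIDetectionModel` sketch from the Architecture section; the checkpoint's state-dict key names are an assumption, so inspect `weights.keys()` and remap them if loading fails:
+
+ ```python
+ import json
+
+ # id -> label mapping shipped with the model
+ mappings_path = hf_hub_download("DataikuNLP/kiji-pii-model", "label_mappings.json")
+ with open(mappings_path) as f:
+     id2label = json.load(f)["pii"]["id2label"]
+
+ model = MultiTaskPIIDetectionModel()          # sketch class from above
+ model.load_state_dict(weights, strict=False)  # `weights` from the Usage snippet
+ model.eval()
+
+ with torch.no_grad():
+     pii_logits, coref_logits = model(**inputs)  # `inputs` from the Usage snippet
+
+ # Decode token-level PII predictions
+ pred_ids = pii_logits.argmax(dim=-1)[0].tolist()
+ tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
+ for token, pred in zip(tokens, pred_ids):
+     label = id2label[str(pred)]
+     if label != "O":
+         print(token, label)
+ ```
+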
+ ## PII Labels (BIO tagging)
+
+ The model uses BIO tagging with 26 entity types:
+
+ | Label | Description |
+ |-------|-------------|
+ | `AGE` | Age |
+ | `BUILDINGNUM` | Building number |
+ | `CITY` | City |
+ | `COMPANYNAME` | Company name |
+ | `COUNTRY` | Country |
+ | `CREDITCARDNUMBER` | Credit card number |
+ | `DATEOFBIRTH` | Date of birth |
+ | `DRIVERLICENSENUM` | Driver's license number |
+ | `EMAIL` | Email |
+ | `FIRSTNAME` | First name |
+ | `IBAN` | IBAN |
+ | `IDCARDNUM` | ID card number |
+ | `LICENSEPLATENUM` | License plate number |
+ | `NATIONALID` | National ID |
+ | `PASSPORTID` | Passport ID |
+ | `PASSWORD` | Password |
+ | `PHONENUMBER` | Phone number |
+ | `SECURITYTOKEN` | API security token |
+ | `SSN` | Social Security number |
+ | `STATE` | State |
+ | `STREET` | Street |
+ | `SURNAME` | Last name |
+ | `TAXNUM` | Tax number |
+ | `URL` | URL |
+ | `USERNAME` | Username |
+ | `ZIP` | Zip code |
+
+ Each entity type has `B-` (beginning) and `I-` (inside) variants, plus `O` for non-PII tokens.
+
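+ For example, at the word level (illustrative only; the model actually predicts one label per WordPiece sub-token):
+
+ ```
+ Text:    Contact  John         Smith      at  john.smith@example.com
+ Labels:  O        B-FIRSTNAME  B-SURNAME  O   B-EMAIL
+ ```
+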
+ ## Coreference Labels
+
+ | Label | Description |
+ |-------|-------------|
+ | `NO_COREF` | Token is not part of a coreference cluster |
+ | `CLUSTER_0`-`CLUSTER_5` | Token belongs to coreference cluster 0-5 |
+
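+ Illustratively (word-level, hypothetical example), mentions of the same entity share a cluster label:
+
+ ```
+ Text:   John       Smith      said      he         would     call
+ Coref:  CLUSTER_0  CLUSTER_0  NO_COREF  CLUSTER_0  NO_COREF  NO_COREF
+ ```
+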
+ ## Training
+
+ | | |
+ |---|---|
+ | **Epochs** | 15 (with early stopping) |
+ | **Batch size** | 16 |
+ | **Learning rate** | 3e-5 |
+ | **Weight decay** | 0.01 |
+ | **Warmup steps** | 200 |
+ | **Early stopping** | patience=3, threshold=1% |
+ | **Loss** | Multi-task: PII cross-entropy + coreference cross-entropy (equal weights) |
+ | **Optimizer** | AdamW |
+ | **Metric** | Weighted F1 (PII task) |
+
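+ A minimal sketch of the combined loss described above. `ignore_index=-100` matches the `IGNORE` entry in `label_mappings.json`; applying the same convention to the coreference labels is an assumption, and the exact training loop is not published:
+
+ ```python
+ import torch.nn.functional as F
+
+ def multi_task_loss(pii_logits, coref_logits, pii_labels, coref_labels):
+     # Flatten (batch, seq_len, num_labels) -> (batch * seq_len, num_labels)
+     # and skip positions labelled -100 (special tokens / padding).
+     pii_loss = F.cross_entropy(
+         pii_logits.view(-1, pii_logits.size(-1)), pii_labels.view(-1),
+         ignore_index=-100,
+     )
+     coref_loss = F.cross_entropy(
+         coref_logits.view(-1, coref_logits.size(-1)), coref_labels.view(-1),
+         ignore_index=-100,  # assumption: same masking for the coref task
+     )
+     return pii_loss + coref_loss  # equal weighting, per the table above
+ ```
+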
+ ## Training Data
+
+ Trained on the [DataikuNLP/kiji-pii-training-data](https://huggingface.co/datasets/DataikuNLP/kiji-pii-training-data) dataset: a synthetic multilingual PII dataset with entity annotations and coreference resolution.
+
+ ## Derived Models
+
+ | Variant | Format | Repository |
+ |---------|--------|------------|
+ | Quantized (INT8) | ONNX | [DataikuNLP/kiji-pii-model-onnx](https://huggingface.co/DataikuNLP/kiji-pii-model-onnx) |
+
+ ## Limitations
+
+ - Trained on **synthetically generated** data, so it may not generalize to all real-world text
+ - The coreference head supports at most 6 clusters per sequence (`CLUSTER_0`-`CLUSTER_5`)
+ - Optimized for the 6 languages in the training data (English, German, French, Spanish, Dutch, Danish)
+ - Max sequence length is 512 tokens
label_mappings.json ADDED
@@ -0,0 +1,126 @@
+ {
+   "pii": {
+     "label2id": {
+       "O": 0,
+       "B-SURNAME": 1,
+       "I-SURNAME": 2,
+       "B-FIRSTNAME": 3,
+       "I-FIRSTNAME": 4,
+       "B-BUILDINGNUM": 5,
+       "I-BUILDINGNUM": 6,
+       "B-DATEOFBIRTH": 7,
+       "I-DATEOFBIRTH": 8,
+       "B-EMAIL": 9,
+       "I-EMAIL": 10,
+       "B-PHONENUMBER": 11,
+       "I-PHONENUMBER": 12,
+       "B-CITY": 13,
+       "I-CITY": 14,
+       "B-URL": 15,
+       "I-URL": 16,
+       "B-COMPANYNAME": 17,
+       "I-COMPANYNAME": 18,
+       "B-STATE": 19,
+       "I-STATE": 20,
+       "B-ZIP": 21,
+       "I-ZIP": 22,
+       "B-STREET": 23,
+       "I-STREET": 24,
+       "B-COUNTRY": 25,
+       "I-COUNTRY": 26,
+       "B-SSN": 27,
+       "I-SSN": 28,
+       "B-DRIVERLICENSENUM": 29,
+       "I-DRIVERLICENSENUM": 30,
+       "B-PASSPORTID": 31,
+       "I-PASSPORTID": 32,
+       "B-NATIONALID": 33,
+       "I-NATIONALID": 34,
+       "B-IDCARDNUM": 35,
+       "I-IDCARDNUM": 36,
+       "B-TAXNUM": 37,
+       "I-TAXNUM": 38,
+       "B-LICENSEPLATENUM": 39,
+       "I-LICENSEPLATENUM": 40,
+       "B-PASSWORD": 41,
+       "I-PASSWORD": 42,
+       "B-IBAN": 43,
+       "I-IBAN": 44,
+       "B-AGE": 45,
+       "I-AGE": 46,
+       "B-SECURITYTOKEN": 47,
+       "I-SECURITYTOKEN": 48,
+       "B-CREDITCARDNUMBER": 49,
+       "I-CREDITCARDNUMBER": 50,
+       "B-USERNAME": 51,
+       "I-USERNAME": 52
+     },
+     "id2label": {
+       "0": "O",
+       "1": "B-SURNAME",
+       "2": "I-SURNAME",
+       "3": "B-FIRSTNAME",
+       "4": "I-FIRSTNAME",
+       "5": "B-BUILDINGNUM",
+       "6": "I-BUILDINGNUM",
+       "7": "B-DATEOFBIRTH",
+       "8": "I-DATEOFBIRTH",
+       "9": "B-EMAIL",
+       "10": "I-EMAIL",
+       "11": "B-PHONENUMBER",
+       "12": "I-PHONENUMBER",
+       "13": "B-CITY",
+       "14": "I-CITY",
+       "15": "B-URL",
+       "16": "I-URL",
+       "17": "B-COMPANYNAME",
+       "18": "I-COMPANYNAME",
+       "19": "B-STATE",
+       "20": "I-STATE",
+       "21": "B-ZIP",
+       "22": "I-ZIP",
+       "23": "B-STREET",
+       "24": "I-STREET",
+       "25": "B-COUNTRY",
+       "26": "I-COUNTRY",
+       "27": "B-SSN",
+       "28": "I-SSN",
+       "29": "B-DRIVERLICENSENUM",
+       "30": "I-DRIVERLICENSENUM",
+       "31": "B-PASSPORTID",
+       "32": "I-PASSPORTID",
+       "33": "B-NATIONALID",
+       "34": "I-NATIONALID",
+       "35": "B-IDCARDNUM",
+       "36": "I-IDCARDNUM",
+       "37": "B-TAXNUM",
+       "38": "I-TAXNUM",
+       "39": "B-LICENSEPLATENUM",
+       "40": "I-LICENSEPLATENUM",
+       "41": "B-PASSWORD",
+       "42": "I-PASSWORD",
+       "43": "B-IBAN",
+       "44": "I-IBAN",
+       "45": "B-AGE",
+       "46": "I-AGE",
+       "47": "B-SECURITYTOKEN",
+       "48": "I-SECURITYTOKEN",
+       "49": "B-CREDITCARDNUMBER",
+       "50": "I-CREDITCARDNUMBER",
+       "51": "B-USERNAME",
+       "52": "I-USERNAME",
+       "-100": "IGNORE"
+     }
+   },
+   "coref": {
+     "id2label": {
+       "0": "NO_COREF",
+       "1": "CLUSTER_0",
+       "2": "CLUSTER_1",
+       "3": "CLUSTER_2",
+       "4": "CLUSTER_3",
+       "5": "CLUSTER_4",
+       "6": "CLUSTER_5"
+     }
+   }
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:246b7e3f3f1e0369155ad84c55efa9769ddd149861f9ce7f93a8f293ab58ee7e
+ size 260960440
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_lower_case": false,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "DistilBertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff