Upload folder using huggingface_hub

Files changed (11)

.gitattributes +1 -0
README.md +155 -3
label_mappings.json +126 -0
model.onnx.data +3 -0
model_manifest.json +49 -0
model_quantized.onnx +3 -0
ort_config.json +33 -0
special_tokens_map.json +37 -0
tokenizer.json +0 -0
tokenizer_config.json +60 -0
vocab.txt +0 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+model.onnx.data filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -1,3 +1,155 @@
----
-license: apache-2.0
----

+---
+language:
+- da
+- de
+- en
+- es
+- fr
+- nl
+license: apache-2.0
+library_name: onnx
+pipeline_tag: token-classification
+tags:
+- pii
+- privacy
+- ner
+- coreference-resolution
+- distilbert
+- multi-task
+- onnx
+- quantized
+- int8
+base_model: DataikuNLP/kiji-pii-model
+---
+# Kiji PII Detection Model (ONNX Quantized)
+INT8-quantized ONNX version of the Kiji PII detection model for efficient CPU inference. Detects Personally Identifiable Information (PII) in text with coreference resolution.
+## Source Model
+This is a quantized version of [DataikuNLP/kiji-pii-model](https://huggingface.co/DataikuNLP/kiji-pii-model), a multi-task DistilBERT model fine-tuned for PII detection with coreference resolution.
+## Model Summary
+| | |
+|---|---|
+| **Format** | ONNX (INT8 quantized) |
+| **Architecture** | Shared DistilBERT encoder + two classification heads |
+| **Tasks** | PII token classification (53 labels) + coreference detection (7 labels) |
+| **PII entity types** | 26 |
+| **Max sequence length** | 512 tokens |
+| **Runtime** | ONNX Runtime |
+## Files
+| File | Size |
+|------|------|
+| `model_quantized.onnx` | 63.3 MB |
+| `model.onnx.data` | 248.9 MB |
+| `ort_config.json` | 0.7 KB |
+| `label_mappings.json` | 2.9 KB |
+| `model_manifest.json` | 1.6 KB |
+| `tokenizer_config.json` | 1.3 KB |
+| `tokenizer.json` | 653.2 KB |
+| `vocab.txt` | 208.4 KB |
+| `special_tokens_map.json` | 0.7 KB |
+## Quantization Details
+| | |
+|---|---|
+| **Method** | Dynamic quantization (ONNX Runtime / Optimum) |
+| **Weights** | QInt8 (symmetric, per-channel) |
+| **Activations** | QUInt8 (asymmetric, per-tensor) |
+| **Mode** | IntegerOps |
+| **Format** | QOperator |
+| **Operators quantized** | Conv, MatMul, Attention, LSTM, Gather, Transpose, EmbedLayerNormalization |
+## Usage
+```python
+import numpy as np
+from huggingface_hub import hf_hub_download
+from onnxruntime import InferenceSession
+from transformers import AutoTokenizer
+repo_id = "DataikuNLP/kiji-pii-model-onnx"
+# Load the tokenizer from the Hub
+tokenizer = AutoTokenizer.from_pretrained(repo_id)
+# InferenceSession needs a local file path, so download the model file first
+model_path = hf_hub_download(repo_id, "model_quantized.onnx")  # or use a local path
+session = InferenceSession(model_path)
+# Tokenize
+text = "Contact John Smith at john.smith@example.com or call +1-555-123-4567."
+inputs = tokenizer(text, return_tensors="np", truncation=True, max_length=512)
+# Run inference
+outputs = session.run(None, dict(inputs))
+pii_logits, coref_logits = outputs  # (1, seq_len, 53), (1, seq_len, 7)
+# Decode PII predictions: one label ID per token
+pii_predictions = np.argmax(pii_logits, axis=-1)[0]
+# See label_mappings.json for label ID -> label name mapping
+```
+## PII Labels (BIO tagging)
+The model uses BIO tagging with 26 entity types:
+| Label | Description |
+|-------|-------------|
+| `AGE` | Age |
+| `BUILDINGNUM` | Building number |
+| `CITY` | City |
+| `COMPANYNAME` | Company name |
+| `COUNTRY` | Country |
+| `CREDITCARDNUMBER` | Credit Card Number |
+| `DATEOFBIRTH` | Date of birth |
+| `DRIVERLICENSENUM` | Driver's License Number |
+| `EMAIL` | Email |
+| `FIRSTNAME` | First name |
+| `IBAN` | IBAN |
+| `IDCARDNUM` | ID Card Number |
+| `LICENSEPLATENUM` | License Plate Number |
+| `NATIONALID` | National ID |
+| `PASSPORTID` | Passport ID |
+| `PASSWORD` | Password |
+| `PHONENUMBER` | Phone number |
+| `SECURITYTOKEN` | API Security Tokens |
+| `SSN` | Social Security Number |
+| `STATE` | State |
+| `STREET` | Street |
+| `SURNAME` | Last name |
+| `TAXNUM` | Tax Number |
+| `URL` | URL |
+| `USERNAME` | Username |
+| `ZIP` | Zip code |
+Each entity type has `B-` (beginning) and `I-` (inside) variants, plus `O` for non-PII tokens.
+## Coreference Labels
+| Label | Description |
+|-------|-------------|
+| `NO_COREF` | Token is not part of a coreference cluster |
+| `CLUSTER_0`-`CLUSTER_5` | Token belongs to coreference cluster 0-5 |
+## Training Data
+The source model was trained on [DataikuNLP/kiji-pii-training-data](https://huggingface.co/datasets/DataikuNLP/kiji-pii-training-data), a synthetic multilingual PII dataset with entity annotations and coreference resolution.
+## Lineage
+| Stage | Repository |
+|-------|------------|
+| Dataset | [DataikuNLP/kiji-pii-training-data](https://huggingface.co/datasets/DataikuNLP/kiji-pii-training-data) |
+| Trained model | [DataikuNLP/kiji-pii-model](https://huggingface.co/DataikuNLP/kiji-pii-model) |
+| **Quantized model** | **DataikuNLP/kiji-pii-model-onnx** (this repo) |
+## Limitations
+- Trained on **synthetically generated** data, so it may not generalize to all real-world text
+- Coreference head supports up to 6 clusters per sequence
+- Optimized for the 6 languages in the training data (English, German, French, Spanish, Dutch, Danish)
+- Max sequence length is 512 tokens
+- Quantization may slightly reduce accuracy compared to the full-precision model
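
The README's BIO tagging section explains the `B-`/`I-`/`O` scheme but stops at per-token label IDs. The following is a minimal sketch of how predicted labels could be merged into entity spans. The function name `bio_to_spans` and the word-level tokens are illustrative assumptions; the real model emits one label per WordPiece token, so sub-word pieces would need to be merged first.

```python
def bio_to_spans(tokens, labels):
    """Merge BIO labels into (entity_type, start, end, text) spans; end is exclusive."""
    spans = []
    entity, start = None, 0
    # A trailing "O" sentinel flushes a span that runs to the end of the input.
    for i, label in enumerate(list(labels) + ["O"]):
        prefix, _, name = label.partition("-")
        if prefix == "I" and name == entity:
            continue  # still inside the current entity
        if entity is not None:
            spans.append((entity, start, i, " ".join(tokens[start:i])))
        if prefix in ("B", "I"):
            # An orphan I- tag is treated leniently as the start of a new span.
            entity, start = name, i
        else:
            entity = None  # "O" or any non-BIO label such as "IGNORE"
    return spans


# Hypothetical word-level example; labels follow label_mappings.json naming.
tokens = ["Contact", "John", "Smith", "or", "call", "+1-555", "123-4567"]
labels = ["O", "B-FIRSTNAME", "B-SURNAME", "O", "O",
          "B-PHONENUMBER", "I-PHONENUMBER"]
spans = bio_to_spans(tokens, labels)
for span in spans:
    print(span)
```

Treating an orphan `I-` tag as a span start is one possible policy; a stricter decoder could discard such tokens instead.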

label_mappings.json ADDED

	@@ -0,0 +1,126 @@

+{
+  "pii": {
+    "label2id": {
+      "O": 0,
+      "B-SURNAME": 1,
+      "I-SURNAME": 2,
+      "B-FIRSTNAME": 3,
+      "I-FIRSTNAME": 4,
+      "B-BUILDINGNUM": 5,
+      "I-BUILDINGNUM": 6,
+      "B-DATEOFBIRTH": 7,
+      "I-DATEOFBIRTH": 8,
+      "B-EMAIL": 9,
+      "I-EMAIL": 10,
+      "B-PHONENUMBER": 11,
+      "I-PHONENUMBER": 12,
+      "B-CITY": 13,
+      "I-CITY": 14,
+      "B-URL": 15,
+      "I-URL": 16,
+      "B-COMPANYNAME": 17,
+      "I-COMPANYNAME": 18,
+      "B-STATE": 19,
+      "I-STATE": 20,
+      "B-ZIP": 21,
+      "I-ZIP": 22,
+      "B-STREET": 23,
+      "I-STREET": 24,
+      "B-COUNTRY": 25,
+      "I-COUNTRY": 26,
+      "B-SSN": 27,
+      "I-SSN": 28,
+      "B-DRIVERLICENSENUM": 29,
+      "I-DRIVERLICENSENUM": 30,
+      "B-PASSPORTID": 31,
+      "I-PASSPORTID": 32,
+      "B-NATIONALID": 33,
+      "I-NATIONALID": 34,
+      "B-IDCARDNUM": 35,
+      "I-IDCARDNUM": 36,
+      "B-TAXNUM": 37,
+      "I-TAXNUM": 38,
+      "B-LICENSEPLATENUM": 39,
+      "I-LICENSEPLATENUM": 40,
+      "B-PASSWORD": 41,
+      "I-PASSWORD": 42,
+      "B-IBAN": 43,
+      "I-IBAN": 44,
+      "B-AGE": 45,
+      "I-AGE": 46,
+      "B-SECURITYTOKEN": 47,
+      "I-SECURITYTOKEN": 48,
+      "B-CREDITCARDNUMBER": 49,
+      "I-CREDITCARDNUMBER": 50,
+      "B-USERNAME": 51,
+      "I-USERNAME": 52
+    },
+    "id2label": {
+      "0": "O",
+      "1": "B-SURNAME",
+      "2": "I-SURNAME",
+      "3": "B-FIRSTNAME",
+      "4": "I-FIRSTNAME",
+      "5": "B-BUILDINGNUM",
+      "6": "I-BUILDINGNUM",
+      "7": "B-DATEOFBIRTH",
+      "8": "I-DATEOFBIRTH",
+      "9": "B-EMAIL",
+      "10": "I-EMAIL",
+      "11": "B-PHONENUMBER",
+      "12": "I-PHONENUMBER",
+      "13": "B-CITY",
+      "14": "I-CITY",
+      "15": "B-URL",
+      "16": "I-URL",
+      "17": "B-COMPANYNAME",
+      "18": "I-COMPANYNAME",
+      "19": "B-STATE",
+      "20": "I-STATE",
+      "21": "B-ZIP",
+      "22": "I-ZIP",
+      "23": "B-STREET",
+      "24": "I-STREET",
+      "25": "B-COUNTRY",
+      "26": "I-COUNTRY",
+      "27": "B-SSN",
+      "28": "I-SSN",
+      "29": "B-DRIVERLICENSENUM",
+      "30": "I-DRIVERLICENSENUM",
+      "31": "B-PASSPORTID",
+      "32": "I-PASSPORTID",
+      "33": "B-NATIONALID",
+      "34": "I-NATIONALID",
+      "35": "B-IDCARDNUM",
+      "36": "I-IDCARDNUM",
+      "37": "B-TAXNUM",
+      "38": "I-TAXNUM",
+      "39": "B-LICENSEPLATENUM",
+      "40": "I-LICENSEPLATENUM",
+      "41": "B-PASSWORD",
+      "42": "I-PASSWORD",
+      "43": "B-IBAN",
+      "44": "I-IBAN",
+      "45": "B-AGE",
+      "46": "I-AGE",
+      "47": "B-SECURITYTOKEN",
+      "48": "I-SECURITYTOKEN",
+      "49": "B-CREDITCARDNUMBER",
+      "50": "I-CREDITCARDNUMBER",
+      "51": "B-USERNAME",
+      "52": "I-USERNAME",
+      "-100": "IGNORE"
+    }
+  },
+  "coref": {
+    "id2label": {
+      "0": "NO_COREF",
+      "1": "CLUSTER_0",
+      "2": "CLUSTER_1",
+      "3": "CLUSTER_2",
+      "4": "CLUSTER_3",
+      "5": "CLUSTER_4",
+      "6": "CLUSTER_5"
+    }
+  }
+}
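
The `coref` mapping above defines seven labels (`NO_COREF` plus `CLUSTER_0` to `CLUSTER_5`). As a rough sketch of how the coreference head's per-token predictions might be consumed, the snippet below groups tokens by cluster label. The helper `group_by_cluster` and the sample sentence are assumptions made for illustration, not part of the repository.

```python
from collections import defaultdict

# Copied from the "coref" -> "id2label" section of label_mappings.json.
COREF_ID2LABEL = {
    "0": "NO_COREF",
    "1": "CLUSTER_0",
    "2": "CLUSTER_1",
    "3": "CLUSTER_2",
    "4": "CLUSTER_3",
    "5": "CLUSTER_4",
    "6": "CLUSTER_5",
}


def group_by_cluster(tokens, coref_ids, id2label=COREF_ID2LABEL):
    """Collect the tokens assigned to each coreference cluster."""
    clusters = defaultdict(list)
    for token, label_id in zip(tokens, coref_ids):
        label = id2label[str(label_id)]  # JSON object keys are strings
        if label != "NO_COREF":
            clusters[label].append(token)
    return dict(clusters)


# Hypothetical predictions: "Anna" and "she" share a cluster, "Ben" has his own.
tokens = ["Anna", "said", "she", "called", "Ben", "."]
coref_ids = [1, 0, 1, 0, 2, 0]
clusters = group_by_cluster(tokens, coref_ids)
print(clusters)
```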

model.onnx.data ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d93759243fb646966fc12b3e1a2cacedaf0326a94e014194faafdb683a50d98b
+size 260976640
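
The three lines above are a Git LFS pointer, not the model weights themselves. As a small sketch, the pointer can be parsed to recover the hash algorithm, digest, and byte size. Converting the size to MiB reproduces the 248.9 figure listed in the README's file table. The function name `parse_lfs_pointer` is an illustrative assumption.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash, and size fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer content from the diff above, without the leading "+" markers.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:d93759243fb646966fc12b3e1a2cacedaf0326a94e014194faafdb683a50d98b
size 260976640
"""

pointer = parse_lfs_pointer(POINTER)
size_mib = pointer["size"] / 2**20  # bytes -> MiB
print(pointer["algo"], f"{size_mib:.1f} MiB")
```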

model_manifest.json ADDED

	@@ -0,0 +1,49 @@

+{
+  "model_path": "model/quantized",
+  "hashes": {
+    "sha256": "8e84ca14b7d7b344e0e9ca338ba1834ba09ff70e0d0342bafd398293aabb6ddd",
+    "sha512": "ef56b044de853c69f394c47f15b495cecd81114fdc52d5e4d04f1d83f20d1decc59fd688fe3a72fe88cabd90b2bed710d81e2481e73e20e86392ee33fb7fe8cb"
+  },
+  "files": [
+    {
+      "path": "label_mappings.json",
+      "sha256": "67c7dc80b62c1f0f83a6a77f9e1990572d5bfbfb0ae7673daad471f845b70ae5"
+    },
+    {
+      "path": "model.onnx",
+      "sha256": "aaf9ff3319b5c3d4cc4cc2ae3e6e1acaca6ff91924dc0abb823ca5655b751432"
+    },
+    {
+      "path": "model.onnx.data",
+      "sha256": "9a8c787f32551a3e5fafff4ddc32d3048edf7c23435e0030995dde2934c1b9b0"
+    },
+    {
+      "path": "model_manifest.json",
+      "sha256": "2539a2b206ebbacb3b5bc42b22dd00fd528c3a8505bfdd06afbd935e54f5ec97"
+    },
+    {
+      "path": "model_quantized.onnx",
+      "sha256": "61df2fc432ffbc9c96bb953dbcba4ccb399fb2e326994ca40a5e0d268428c003"
+    },
+    {
+      "path": "ort_config.json",
+      "sha256": "713d4cdeb867f45c821d5f2bc226ed33ac3afcce49a83b1b95dc7b45aca116f9"
+    },
+    {
+      "path": "special_tokens_map.json",
+      "sha256": "5d5b662e421ea9fac075174bb0688ee0d9431699900b90662acd44b2a350503a"
+    },
+    {
+      "path": "tokenizer.json",
+      "sha256": "cb26b43c98e8266ae3e99c2a583cf8315d73b33a17e6b20b4df7ff1f22392d34"
+    },
+    {
+      "path": "tokenizer_config.json",
+      "sha256": "91955163d30880d1ef0928cfdd729baaf413401dc4456f3ab1a48b3eec2ec624"
+    },
+    {
+      "path": "vocab.txt",
+      "sha256": "eeaa9875b23b04b4c54ef759d03db9d1ba1554838f8fb26c5d96fa551df93d02"
+    }
+  ]
+}
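
The manifest pairs each file path with a SHA-256 digest, which suggests an integrity check after download. The sketch below shows one way such a check could look; `verify_manifest` and its status strings are hypothetical, and skipping the manifest's own entry is an assumption (a file cannot contain its own final hash). Two things in this commit would presumably surface in such a check: the manifest lists `model.onnx`, which is not among the 11 uploaded files, and its digests for `model.onnx.data` and `model_quantized.onnx` differ from the LFS `oid` values shown in this diff.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(model_dir, manifest_name="model_manifest.json"):
    """Return {path: "ok" | "mismatch" | "missing" | "skipped"} per manifest entry."""
    model_dir = Path(model_dir)
    manifest = json.loads((model_dir / manifest_name).read_text())
    report = {}
    for entry in manifest["files"]:
        path = model_dir / entry["path"]
        if entry["path"] == manifest_name:
            report[entry["path"]] = "skipped"  # assumption: self-hash is not checked
        elif not path.is_file():
            report[entry["path"]] = "missing"
        elif sha256_of(path) == entry["sha256"]:
            report[entry["path"]] = "ok"
        else:
            report[entry["path"]] = "mismatch"
    return report


# Demo on a throwaway directory with made-up file contents and digests.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "vocab.txt").write_bytes(b"[PAD]\n[UNK]\n")
    (tmp / "ort_config.json").write_bytes(b"{}")
    demo_manifest = {"files": [
        {"path": "vocab.txt",
         "sha256": hashlib.sha256(b"[PAD]\n[UNK]\n").hexdigest()},
        {"path": "ort_config.json", "sha256": "0" * 64},
        {"path": "model.onnx", "sha256": "0" * 64},
        {"path": "model_manifest.json", "sha256": "0" * 64},
    ]}
    (tmp / "model_manifest.json").write_text(json.dumps(demo_manifest))
    report = verify_manifest(tmp)

print(report)
```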

model_quantized.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b5f3311391a74470ff26c5b700a422832e5dcdb74fedc205ad5a68e70dbf487c
+size 66338684

ort_config.json ADDED

	@@ -0,0 +1,33 @@

+{
+  "one_external_file": true,
+  "opset": null,
+  "optimization": {},
+  "quantization": {
+    "activations_dtype": "QUInt8",
+    "activations_symmetric": false,
+    "format": "QOperator",
+    "is_static": false,
+    "mode": "IntegerOps",
+    "nodes_to_exclude": [],
+    "nodes_to_quantize": [],
+    "operators_to_quantize": [
+      "Conv",
+      "MatMul",
+      "Attention",
+      "LSTM",
+      "Gather",
+      "Transpose",
+      "EmbedLayerNormalization"
+    ],
+    "per_channel": true,
+    "qdq_add_pair_to_weight": false,
+    "qdq_dedicated_pair": false,
+    "qdq_op_type_per_channel_support_to_axis": {
+      "MatMul": 1
+    },
+    "reduce_range": false,
+    "weights_dtype": "QInt8",
+    "weights_symmetric": true
+  },
+  "use_external_data_format": false
+}
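
The config above combines symmetric per-channel `QInt8` weights with asymmetric per-tensor `QUInt8` activations. To make those terms concrete, here is a simplified pure-Python illustration of both schemes. It is not ONNX Runtime's implementation: the function names, the choice of rows as channels, and the clamping range of [-127, 127] are assumptions made for the sketch.

```python
def quantize_weights_per_channel(weights, qmax=127):
    """Symmetric int8: one scale per channel (here, per row), zero point fixed at 0."""
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        quantized.append([max(-qmax, min(qmax, round(w / scale))) for w in row])
        scales.append(scale)
    return quantized, scales


def quantize_activations_uint8(values):
    """Asymmetric uint8: a single scale and zero point for the whole tensor."""
    lo = min(min(values), 0.0)  # the representable range must include zero
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


# Two channels with very different magnitudes: each still spans the int8 range.
weights = [[1.27, -0.64, 0.30], [0.0254, 0.0100, -0.0050]]
q_weights, scales = quantize_weights_per_channel(weights)
print(q_weights)

# Activations are quantized at run time in dynamic quantization.
q_acts, act_scale, act_zero = quantize_activations_uint8([-1.0, 0.0, 1.55])
print(q_acts, act_zero)
```

Per-channel scales are what let the second, small-magnitude row keep its resolution; a single per-tensor scale would map most of its values to a handful of integers.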

special_tokens_map.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,60 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "do_lower_case": false,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "max_length": 512,
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "stride": 0,
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "DistilBertTokenizer",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "[UNK]"
+}
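
With `model_max_length` set to 512, `truncation_side` set to `right`, and `stride` at 0, text beyond 512 tokens is cut off, so PII near the end of a long document would go undetected. One possible workaround, sketched here under stated assumptions, is to split the token IDs into overlapping windows and run the model on each. The helper `chunk_token_ids` and the overlap of 64 are hypothetical choices; the `[CLS]` and `[SEP]` IDs come from `added_tokens_decoder` above.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] IDs from tokenizer_config.json


def chunk_token_ids(token_ids, max_length=512, overlap=64):
    """Split token IDs (without special tokens) into overlapping windows.

    Returns a list of (start_offset, ids_with_special_tokens) pairs, where
    start_offset is the window's position in the original sequence.
    """
    body = max_length - 2  # leave room for [CLS] and [SEP]
    if not 0 <= overlap < body:
        raise ValueError("overlap must be in [0, max_length - 2)")
    step = body - overlap
    chunks = []
    for start in range(0, max(len(token_ids), 1), step):
        window = token_ids[start:start + body]
        chunks.append((start, [CLS_ID] + window + [SEP_ID]))
        if start + body >= len(token_ids):
            break  # this window already reaches the end of the input
    return chunks


# Hypothetical input: 1200 placeholder token IDs.
token_ids = list(range(1000, 2200))
chunks = chunk_token_ids(token_ids)
for start, ids in chunks:
    print(start, len(ids))
```

Hugging Face tokenizers can also produce such windows directly through `return_overflowing_tokens=True` together with a non-zero `stride`; the manual version is shown only to make the windowing arithmetic explicit.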

vocab.txt ADDED

The diff for this file is too large to render. See raw diff