gantz-ai committed
Commit ba4b7fc · verified · 1 Parent(s): 5023c57

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +139 -0
README.md ADDED
@@ -0,0 +1,139 @@
+ ---
+ library_name: onnxruntime
+ tags:
+ - ner
+ - pii
+ - gliner
+ - onnx
+ - privacy
+ - pdpa
+ - pdpd
+ - pipl
+ - multilingual
+ language:
+ - en
+ - ms
+ - ta
+ - zh
+ - id
+ - vi
+ - th
+ - hi
+ - bn
+ - ko
+ - de
+ - fr
+ - ru
+ license: agpl-3.0
+ pipeline_tag: token-classification
+ ---
+
+ # PII-Engineer-Multi-NER-v2.1
+
+ A multilingual PII detection model for privacy compliance scanning (PDPA, PDPD, PDP Law, PIPL), with a reported mean F1 of 0.902 across 9 PII types.
+
+ Fine-tuned with LoRA on the GLiNER2 architecture (mDeBERTa-v3-base encoder, ~280M parameters). Optimized for real-world PII detection across 13+ languages.
+
+ **Built by [PII Engineer](https://pii.engineer)**
+
+ ## Labels (9 PII types)
+
+ | Label | Description |
+ |-------|-------------|
+ | `person_name` | Full names, partial names |
+ | `phone_number` | Phone/mobile numbers (international) |
+ | `government_id` | NRIC, SSN, Aadhaar, NIK, CCCD, etc. |
+ | `street_address` | Physical addresses |
+ | `date_of_birth` | Birth dates in any format |
+ | `email_address` | Email addresses |
+ | `passport_number` | Passport numbers (multi-country) |
+ | `license_plate` | Vehicle license plates |
+ | `bank_account_number` | Bank account/routing numbers |
+
+ ## Performance
+
+ | Label | Precision | Recall | F1 |
+ |-------|-----------|--------|----|
+ | person_name | 0.808 | 0.838 | 0.823 |
+ | phone_number | 0.962 | 0.975 | 0.968 |
+ | government_id | 0.902 | 0.938 | 0.920 |
+ | street_address | 0.903 | 0.891 | 0.897 |
+ | date_of_birth | 0.901 | 0.901 | 0.901 |
+ | email_address | 0.974 | 0.966 | 0.970 |
+ | passport_number | 0.808 | 0.812 | 0.810 |
+ | license_plate | 0.837 | 0.847 | 0.842 |
+ | bank_account_number | 0.879 | 0.906 | 0.892 |
+ | **Mean** | | | **0.902** |
+
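The per-label numbers in the Performance table can be cross-checked with a few lines of Python. This is only a sketch: `f1` and `macro_f1` are illustrative helpers, not part of the released tooling. Note that the unweighted (macro) mean of the nine listed F1 values comes out near 0.891, so the 0.902 in the **Mean** row presumably uses a different averaging scheme (for example micro or support-weighted averaging). The README does not say which, so treat that reading as an assumption.

```python
# Per-label F1 scores copied from the Performance table.
F1_BY_LABEL = {
    "person_name": 0.823,
    "phone_number": 0.968,
    "government_id": 0.920,
    "street_address": 0.897,
    "date_of_birth": 0.901,
    "email_address": 0.970,
    "passport_number": 0.810,
    "license_plate": 0.842,
    "bank_account_number": 0.892,
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_f1(scores: dict) -> float:
    """Unweighted mean of per-label F1 scores (hypothetical helper)."""
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    # Recompute one row from its precision and recall, then the macro mean.
    print(round(f1(0.808, 0.838), 3))
    print(round(macro_f1(F1_BY_LABEL), 3))
```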
+ ## Architecture
+
+ - **Encoder:** mDeBERTa-v3-base (768 hidden, 12 layers, 12 heads)
+ - **Framework:** GLiNER2 span-based NER (5 ONNX models; the encoder ships in FP32 and INT8 variants)
+ - **Parameters:** ~280M
+ - **Inference:** ONNX Runtime (CPU or GPU)
+
+ ### ONNX Models
+
+ | File | Size | Description |
+ |------|------|-------------|
+ | encoder.onnx | 1.1 GB | Token encoder (FP32) |
+ | encoder_int8.onnx | 511 MB | Token encoder (INT8 quantized) |
+ | span_rep.onnx | 63 MB | Span representation |
+ | count_embed.onnx | 41 MB | Count embedding |
+ | count_pred.onnx | 4.6 MB | Count prediction |
+ | classifier.onnx | 4.5 MB | Classification head |
+
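To make the file table concrete, the sketch below picks the five files needed for one precision and adds up their sizes. It assumes only one encoder variant is loaded at a time and treats 1.1 GB as 1100 MB; both are assumptions, and `files_for` and `total_size_mb` are hypothetical helper names rather than anything shipped with the model.

```python
# Sizes in MB, taken from the ONNX Models table.
# Assumption: "1.1 GB" is approximated as 1100 MB.
ENCODERS_MB = {
    "fp32": ("encoder.onnx", 1100.0),
    "int8": ("encoder_int8.onnx", 511.0),
}

# The four non-encoder graphs are the same for both precisions.
HEADS_MB = {
    "span_rep.onnx": 63.0,
    "count_embed.onnx": 41.0,
    "count_pred.onnx": 4.6,
    "classifier.onnx": 4.5,
}


def files_for(precision: str) -> list:
    """Return the five file names needed for the chosen encoder precision."""
    encoder_name, _ = ENCODERS_MB[precision]
    return [encoder_name] + list(HEADS_MB)


def total_size_mb(precision: str) -> float:
    """Approximate on-disk footprint in MB for one precision."""
    _, encoder_size = ENCODERS_MB[precision]
    return encoder_size + sum(HEADS_MB.values())


if __name__ == "__main__":
    for precision in ("fp32", "int8"):
        print(precision, files_for(precision), round(total_size_mb(precision), 1))
```

Each selected file can then be opened with `onnxruntime.InferenceSession(path)`. The tensor names and the order in which the five graphs are chained are not documented in this README, so they are left out here.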
+ ## Quick Start
+
+ Use with [pii.engineer](https://github.com/gantz-ai/pii.engineer) (Rust server with auto-download):
+
+ ```bash
+ cargo build --release --package pii-engineer-server
+ cargo run --release --package pii-engineer-server
+ # Models download automatically on first run
+ ```
+
+ Or download manually:
+
+ ```bash
+ huggingface-cli download pii-engineer/PII-Engineer-Multi-NER-v2.1 --local-dir models/ner-v21
+ ```
+
+ ## Supported Languages
+
+ **Primary:** English, Malay, Tamil, Chinese, Indonesian, Vietnamese
+
+ **Secondary:** Thai, Hindi, Bengali, Korean, German, French, Russian
+
+ ## Use Cases
+
+ - PDPA (Singapore) compliance scanning
+ - PDPD (Vietnam) compliance scanning
+ - PDP Law (Indonesia) compliance scanning
+ - PIPL (China) compliance scanning
+ - PII detection in documents, chat logs, databases
+ - Pre-processing for data anonymization pipelines
+
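For the anonymization use case, detected spans are typically replaced with label placeholders before the text moves downstream. The sketch below assumes detections arrive as `(start, end, label)` character offsets. That format and the `redact` helper are hypothetical, since this README does not document the server's response schema.

```python
from typing import List, Tuple

# Hypothetical detection format: (start offset, end offset, label).
Span = Tuple[int, int, str]


def redact(text: str, spans: List[Span]) -> str:
    """Replace each detected span with an upper-cased [LABEL] placeholder.

    Spans are processed left to right; a span that overlaps one already
    redacted is skipped so that offsets stay valid.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # overlaps the previous span
        pieces.append(text[cursor:start])
        pieces.append(f"[{label.upper()}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    text = "Contact Tan Wei Ming at tan@example.com."
    name, email = "Tan Wei Ming", "tan@example.com"
    # Offsets are computed from the text here; in practice they would
    # come from the model's output.
    spans = [
        (text.index(name), text.index(name) + len(name), "person_name"),
        (text.index(email), text.index(email) + len(email), "email_address"),
    ]
    print(redact(text, spans))
```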
+ ## Limitations
+
+ - Optimized for structured/semi-structured text (forms, emails, documents)
+ - May underperform on highly informal social media text
+ - Date-of-birth detection requires contextual birth cues (e.g., "born", "DOB", "lahir")
+
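Because date-of-birth detection depends on nearby birth cues, a pipeline may want to flag inputs that contain no such cue and route any dates in them to manual review. The following is a minimal sketch of that pre-check, not a description of how the model works internally. The cue list holds only the three examples named above, and `has_birth_cue` is a hypothetical helper.

```python
import re

# Only the cues mentioned in the Limitations list; a real deployment
# would likely extend this per language (assumption).
BIRTH_CUES = ("born", "dob", "lahir")

# Whole-word, case-insensitive match so "stubborn" does not count as "born".
_CUE_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(cue) for cue in BIRTH_CUES) + r")\b",
    re.IGNORECASE,
)


def has_birth_cue(text: str) -> bool:
    """Return True if the text contains at least one known birth cue."""
    return _CUE_PATTERN.search(text) is not None


if __name__ == "__main__":
    for sample in ("DOB: 12/03/1990", "Invoice dated 12/03/1990"):
        print(sample, "->", has_birth_cue(sample))
```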
+ ## Citation
+
+ ```bibtex
+ @misc{pii-engineer-multi-ner-v2.1,
+   title={PII-Engineer-Multi-NER-v2.1: Multilingual PII Detection Model},
+   author={PII Engineer},
+   year={2026},
+   url={https://pii.engineer},
+   note={Fine-tuned on GLiNER2 architecture with LoRA}
+ }
+ ```
+
+ ## License
+
+ AGPL-3.0: free for open-source use. A commercial license is available at [pii.engineer](https://pii.engineer).
+
+ Built on [gliner2-multi-v1](https://huggingface.co/fastino/gliner2-multi-v1) (Apache 2.0) and [mDeBERTa-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base) (MIT).