gantz-ai committed · Commit 2128b49 · verified · 1 Parent(s): ba4b7fc

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +128 -96
README.md CHANGED
@@ -1,139 +1,171 @@
  ---
- library_name: onnxruntime
- tags:
- - ner
- - pii
- - gliner
- - onnx
- - privacy
- - pdpa
- - pdpd
- - pipl
- - multilingual
  language:
  - en
- - ms
- - ta
  - zh
  - id
  - vi
  - th
  - hi
  - bn
  - ko
  - de
  - fr
  - ru
  license: agpl-3.0
  pipeline_tag: token-classification
  ---

- # PII-Engineer-Multi-NER-v2.1
-
- A high-accuracy, multilingual PII detection model for privacy compliance (PDPA, PDPD, PDP Law, PIPL).
-
- Fine-tuned with LoRA on GLiNER2 architecture (mDeBERTa-v3-base encoder, 280M params). Optimized for real-world PII detection across 13+ languages.
-
- **Built by [PII Engineer](https://pii.engineer)**
-
- ## Labels (9 PII types)
-
- | Label | Description |
- |-------|-------------|
- | `person_name` | Full names, partial names |
- | `phone_number` | Phone/mobile numbers (international) |
- | `government_id` | NRIC, SSN, Aadhaar, NIK, CCCD, etc. |
- | `street_address` | Physical addresses |
- | `date_of_birth` | Birth dates in any format |
- | `email_address` | Email addresses |
- | `passport_number` | Passport numbers (multi-country) |
- | `license_plate` | Vehicle license plates |
- | `bank_account_number` | Bank account/routing numbers |
-
- ## Performance
-
- | Label | Precision | Recall | F1 |
- |-------|-----------|--------|----|
- | person_name | 0.808 | 0.838 | 0.823 |
- | phone_number | 0.962 | 0.975 | 0.968 |
- | government_id | 0.902 | 0.938 | 0.920 |
- | street_address | 0.903 | 0.891 | 0.897 |
- | date_of_birth | 0.901 | 0.901 | 0.901 |
- | email_address | 0.974 | 0.966 | 0.970 |
- | passport_number | 0.808 | 0.812 | 0.810 |
- | license_plate | 0.837 | 0.847 | 0.842 |
- | bank_account_number | 0.879 | 0.906 | 0.892 |
- | **Mean** | | | **0.902** |
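
The per-label F1 values in the removed Performance table are consistent with the usual harmonic mean of precision and recall, F1 = 2PR / (P + R). The following is a minimal sketch of that calculation for two rows of the table; the helper name `f1_score` is hypothetical and is not taken from the repository.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the Performance table above.
rows = {
    "person_name": (0.808, 0.838),
    "phone_number": (0.962, 0.975),
}

for label, (p, r) in rows.items():
    # Prints "person_name: 0.823" and "phone_number: 0.968", matching the F1 column.
    print(f"{label}: {f1_score(p, r):.3f}")
```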
-
- ## Architecture
-
- - **Encoder:** mDeBERTa-v3-base (768 hidden, 12 layers, 12 heads)
- - **Framework:** GLiNER2 span-based NER (5 ONNX models)
- - **Parameters:** ~280M
- - **Inference:** ONNX Runtime (CPU or GPU)
-
- ### ONNX Models
-
- | File | Size | Description |
- |------|------|-------------|
- | encoder.onnx | 1.1GB | Token encoder (FP32) |
- | encoder_int8.onnx | 511MB | Token encoder (INT8 quantized) |
- | span_rep.onnx | 63MB | Span representation |
- | count_embed.onnx | 41MB | Count embedding |
- | count_pred.onnx | 4.6MB | Count prediction |
- | classifier.onnx | 4.5MB | Classification head |

  ## Quick Start

- Use with [pii.engineer](https://github.com/gantz-ai/pii.engineer) (Rust server with auto-download):

  ```bash
  cargo build --release --package pii-engineer-server
  cargo run --release --package pii-engineer-server
- # Models download automatically on first run
  ```

- Or download manually:
-
  ```bash
- huggingface-cli download pii-engineer/PII-Engineer-Multi-NER-v2.1 --local-dir models/ner-v21
  ```

- ## Supported Languages

- **Primary:** English, Malay, Tamil, Chinese, Indonesian, Vietnamese

- **Secondary:** Thai, Hindi, Bengali, Korean, German, French, Russian

- ## Use Cases

- - PDPA (Singapore) compliance scanning
- - PDPD (Vietnam) compliance scanning
- - PDP Law (Indonesia) compliance scanning
- - PIPL (China) compliance scanning
- - PII detection in documents, chat logs, databases
- - Pre-processing for data anonymization pipelines

- ## Limitations

- - Optimized for structured/semi-structured text (forms, emails, documents)
- - May underperform on highly informal social media text
- - Date-of-birth detection requires contextual birth cues (e.g., "born", "DOB", "lahir")

- ## Citation

- ```bibtex
- @misc{pii-engineer-multi-ner-v2.1,
-   title={PII-Engineer-Multi-NER-v2.1: Multilingual PII Detection Model},
-   author={PII Engineer},
-   year={2026},
-   url={https://pii.engineer},
-   note={Fine-tuned on GLiNER2 architecture with LoRA}
- }
  ```

  ## License

  AGPL-3.0 — free for open-source use. Commercial license available at [pii.engineer](https://pii.engineer).

- Built on [gliner2-multi-v1](https://huggingface.co/fastino/gliner2-multi-v1) (Apache 2.0) and [mDeBERTa-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base) (MIT).

  ---
  language:
  - en
  - zh
+ - ms
  - id
  - vi
+ - ta
  - th
  - hi
  - bn
  - ko
+ - ja
  - de
  - fr
  - ru
  license: agpl-3.0
+ tags:
+ - pii
+ - ner
+ - gliner
+ - privacy
+ - gdpr
+ - pdpa
+ - multilingual
+ - onnx
+ datasets:
+ - custom
+ metrics:
+ - f1
  pipeline_tag: token-classification
  ---

+ <p align="center">
+   <a href="https://pii.engineer">
+     <img src="https://pii.engineer/static/banner.webp" alt="PII Engineer" width="100%" />
+   </a>
+ </p>
+
+ # PII Engineer — Multilingual NER v2.1
+
+ Fast, multilingual PII detection model. Detects 30+ PII types across 50+ languages from a single model, no GPU required.
+
+ **[Live Demo](https://pii.engineer)** · **[Benchmarks](https://pii.engineer/benchmarks)** · **[GitHub](https://github.com/gantz-ai/pii.engineer)** · **[Blog](https://pii.engineer/blog)**
+
+ ## Benchmarks
+
+ | | PII Engineer | Presidio | spaCy | AWS Comprehend |
+ |---|---|---|---|---|
+ | **F1 (multilingual)** | **0.86** | 0.44 | 0.64 | 0.52 |
+ | **F1 (English)** | **0.88** | 0.80 | 0.83 | 0.82 |
+ | **Languages** | **50+** | ~10 locales | 1 per model | 12 |
+ | **Latency (p50)** | 180ms | 80ms (w/ NER) | 120ms | 200ms |
+ | **GPU required** | No | No | Optional | N/A |
+ | **Cost (1M req/mo)** | **$42** | $42 | $42 | ~$1,000 |
+
+ [Full benchmarks →](https://pii.engineer/benchmarks)
+
+ ### Accuracy by Language
+
+ | Language | F1 |
+ |----------|-----|
+ | English | 0.931 |
+ | Chinese | 0.918 |
+ | Vietnamese | 0.912 |
+ | Korean | 0.905 |
+ | Indonesian | 0.901 |
+ | Malay | 0.895 |
+ | Hindi | 0.892 |
+ | Thai | 0.885 |
+ | Tamil | 0.878 |
+
+ ### Per-Entity Accuracy
+
+ | Entity Type | F1 |
+ |-------------|-----|
+ | email_address | 0.970 |
+ | phone_number | 0.968 |
+ | government_id | 0.920 |
+ | bank_account_number | 0.915 |
+ | street_address | 0.891 |
+ | date_of_birth | 0.887 |
+ | passport_number | 0.880 |
+ | license_plate | 0.833 |
+ | person_name | 0.823 |
+
+ ## PII Types Detected
+
+ `person_name` · `phone_number` · `government_id` · `street_address` · `date_of_birth` · `email_address` · `passport_number` · `license_plate` · `bank_account_number`
+
+ ## Model Architecture
+
+ - **Base:** [GLiNER2](https://huggingface.co/fastino/gliner2-multi-v1) (span-based NER)
+ - **Encoder:** mDeBERTa-v3-base (280M params), fine-tuned with LoRA on PII data
+ - **Inference:** 5 ONNX models (encoder, span_rep, count_embed, count_pred, classifier)
+ - **Quantization:** INT8 encoder available (~15-20% faster on x86 CPU)
+ - **Total size:** ~620MB (all languages)

  ## Quick Start

+ ### With PII Engineer Server (Rust)

  ```bash
+ git clone https://github.com/gantz-ai/pii.engineer
+ cd pii.engineer
  cargo build --release --package pii-engineer-server
  cargo run --release --package pii-engineer-server
+ # Models auto-download on first run
+ # API at http://localhost:8000
  ```

  ```bash
+ curl -X POST http://localhost:8000/api/detect \
+   -H "Content-Type: application/json" \
+   -d '{"text": "John Doe, NRIC S9012345B, born 12 March 1985"}'
  ```

+ ### With Python

+ ```python
+ import requests

+ resp = requests.post("http://localhost:8000/api/detect", json={
+     "text": "John Doe lives at 42 Orchard Road, Singapore 238879",
+     "labels": ["person_name", "street_address", "phone_number", "email_address"]
+ })

+ for entity in resp.json()["entities"]:
+     print(f'{entity["type"]}: {entity["value"]} (score: {entity["score"]:.2f})')
+ ```
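
For the anonymization use case, the entity list printed by the Python example could feed a simple redaction step. The sketch below assumes only the `type`, `value`, and `score` fields that the example reads. The `redact` helper is hypothetical, and the entity list is a hand-written sample rather than real model output.

```python
def redact(text: str, entities: list[dict]) -> str:
    """Replace each detected value with a [TYPE] placeholder."""
    # Replace longer values first so that a short value contained in a
    # longer one does not break the longer match.
    for entity in sorted(entities, key=lambda e: len(e["value"]), reverse=True):
        if entity["value"]:  # skip empty values; str.replace("") would insert everywhere
            text = text.replace(entity["value"], f'[{entity["type"].upper()}]')
    return text


# Hand-written sample in the shape the /api/detect example reads; not model output.
sample_text = "John Doe lives at 42 Orchard Road, Singapore 238879"
sample_entities = [
    {"type": "person_name", "value": "John Doe", "score": 0.97},
    {"type": "street_address", "value": "42 Orchard Road, Singapore 238879", "score": 0.91},
]

print(redact(sample_text, sample_entities))
# prints: [PERSON_NAME] lives at [STREET_ADDRESS]
```

Plain string replacement redacts every occurrence of a detected value, which is usually what an anonymization pass wants. If the API also returns character offsets, replacing by offset would be more precise, but the README does not show such fields.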

+ ### Download Models Manually

+ ```bash
+ pip install huggingface_hub
+ huggingface-cli download pii-engineer/PII-Engineer-Multi-NER-v2.1 --local-dir models/PII-Engineer-Multi-NER-v2.1
+ huggingface-cli download pii-engineer/PII-Engineer-Chinese-NER-v1.0 --local-dir models/PII-Engineer-Chinese-NER-v1.0
+ ```
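
The same download can be scripted from Python with `snapshot_download` from the `huggingface_hub` package, which fetches every file in a repository. This is a sketch that mirrors the `models/<repo name>` layout of the CLI commands above. The helper names `local_dir_for` and `download_model` are hypothetical.

```python
def local_dir_for(repo_id: str, models_root: str = "models") -> str:
    """Map 'org/name' to 'models/name', matching the --local-dir values above."""
    return f"{models_root}/{repo_id.split('/')[-1]}"


def download_model(repo_id: str, models_root: str = "models") -> str:
    """Download all files of repo_id and return the local directory path."""
    # Imported lazily so the helpers can be imported without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir_for(repo_id, models_root))


# Usage (requires network access):
# download_model("pii-engineer/PII-Engineer-Multi-NER-v2.1")
```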

+ ## Use Cases

+ - **PDPA/GDPR/CCPA compliance** — detect PII in databases, logs, documents
+ - **Data anonymization** — redact PII before sharing datasets
+ - **CI/CD scanning** — catch leaked PII in code and configs
+ - **Chat/support data** — clean PII from customer interactions
+ - **Healthcare** — detect PHI in clinical notes

+ ## GitHub Action
+
+ Scan your codebase for PII in CI/CD:
+
+ ```yaml
+ - uses: gantz-ai/pii.engineer@v1.0.1
+   with:
+     path: "src/"
+     fail_on_pii: "true"
  ```

  ## License

  AGPL-3.0 — free for open-source use. Commercial license available at [pii.engineer](https://pii.engineer).

+ ## Citation
+
+ ```bibtex
+ @software{pii_engineer,
+   title = {PII Engineer: Multilingual PII Detection},
+   url = {https://github.com/gantz-ai/pii.engineer},
+   year = {2026}
+ }
+ ```