---
language:
- en
- zh
- ms
- id
- vi
- ta
- th
- hi
- bn
- ko
- ja
- de
- fr
- ru
license: apache-2.0
tags:
- pii
- ner
- gliner
- privacy
- gdpr
- pdpa
- multilingual
- onnx
datasets:
- custom
metrics:
- f1
pipeline_tag: token-classification
---
# PII Engineer — Multilingual NER v2.1

A fast, multilingual PII detection model. It detects 30+ PII types across 50+ languages from a single model and requires no GPU.

**[Live Demo](https://pii.engineer)** · **[Benchmarks](https://pii.engineer/benchmarks)** · **[GitHub](https://github.com/gantz-ai/pii.engineer)** · **[Blog](https://pii.engineer/blog)**

## Benchmarks

| Metric | PII Engineer | Presidio | spaCy | AWS Comprehend |
|---|---|---|---|---|
| **F1 (multilingual)** | **0.86** | 0.44 | 0.64 | 0.52 |
| **F1 (English)** | **0.88** | 0.80 | 0.83 | 0.82 |
| **Languages** | **50+** | ~10 locales | 1 per model | 12 |
| **Latency (p50)** | 180ms | 80ms (w/ NER) | 120ms | 200ms |
| **GPU required** | No | No | Optional | N/A |
| **Cost (1M req/mo)** | **$42** | $42 | $42 | ~$1,000 |

[Full benchmarks →](https://pii.engineer/benchmarks)

### Accuracy by Language

| Language | F1 |
|----------|-----|
| English | 0.931 |
| Chinese | 0.918 |
| Vietnamese | 0.912 |
| Korean | 0.905 |
| Indonesian | 0.901 |
| Malay | 0.895 |
| Hindi | 0.892 |
| Thai | 0.885 |
| Tamil | 0.878 |

### Per-Entity Accuracy

| Entity Type | F1 |
|-------------|-----|
| email_address | 0.970 |
| phone_number | 0.968 |
| government_id | 0.920 |
| bank_account_number | 0.915 |
| street_address | 0.891 |
| date_of_birth | 0.887 |
| passport_number | 0.880 |
| license_plate | 0.833 |
| person_name | 0.823 |

## PII Types Detected

`person_name` · `phone_number` · `government_id` · `street_address` · `date_of_birth` · `email_address` · `passport_number` · `license_plate` · `bank_account_number`

## Model Architecture

- **Base:** [GLiNER2](https://huggingface.co/fastino/gliner2-multi-v1) (span-based NER)
- **Encoder:** mDeBERTa-v3-base (280M params), fine-tuned with LoRA on PII data
- **Inference:** 5 ONNX models (encoder, span_rep, count_embed, count_pred, classifier)
- **Quantization:** INT8 encoder available (~15-20% faster on x86 CPU)
- **Total size:** ~620MB (all languages)

## Quick Start

### With PII Engineer Server (Rust)

```bash
git clone https://github.com/gantz-ai/pii.engineer
cd pii.engineer
cargo build --release --package pii-engineer-server
cargo run --release --package pii-engineer-server
# Models auto-download on first run
# API at http://localhost:8000
```

```bash
curl -X POST http://localhost:8000/api/detect \
  -H "Content-Type: application/json" \
  -d '{"text": "John Doe, NRIC S9012345B, born 12 March 1985"}'
```

### With Python

```python
import requests

# Assumes the PII Engineer server from the previous step is running locally.
resp = requests.post(
    "http://localhost:8000/api/detect",
    json={
        "text": "John Doe lives at 42 Orchard Road, Singapore 238879",
        "labels": ["person_name", "street_address", "phone_number", "email_address"],
    },
    timeout=30,
)
resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body

for entity in resp.json()["entities"]:
    print(f'{entity["type"]}: {entity["value"]} (score: {entity["score"]:.2f})')
```

### Download Models Manually

```bash
pip install huggingface_hub
huggingface-cli download pii-engineer/PII-Engineer-Multi-NER-v2.1 --local-dir models/PII-Engineer-Multi-NER-v2.1
huggingface-cli download pii-engineer/PII-Engineer-Chinese-NER-v1.0 --local-dir models/PII-Engineer-Chinese-NER-v1.0
```

## Use Cases

- **PDPA/GDPR/CCPA compliance** — detect PII in databases, logs, documents
- **Data anonymization** — redact PII before sharing datasets
- **CI/CD scanning** — catch leaked PII in code and configs
- **Chat/support data** — clean PII from customer interactions

## License

AGPL-3.0 — free for open-source use. Commercial license available at [pii.engineer](https://pii.engineer).

## Citation

```bibtex
@software{pii_engineer,
  title = {PII Engineer: Multilingual PII Detection},
  url = {https://github.com/gantz-ai/pii.engineer},
  year = {2026}
}
```
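
## Example: Redacting Detected Values

For the data anonymization use case, the entities returned by `/api/detect` can drive a simple redaction step. The sketch below is illustrative rather than part of the project: `redact` and `min_score` are hypothetical names, and it assumes only the `type`, `value`, and `score` fields seen in the Python quick start. The sample entities are hand-written to show the shape and are not real model output.

```python
def redact(text, entities, min_score=0.5):
    """Replace each detected value in `text` with a [type] placeholder.

    `entities` is assumed to be a list of dicts with "type", "value" and
    "score" keys, as in the quick start response. `min_score` is a
    hypothetical threshold, not a server option.
    """
    # Keep confident, non-empty detections. Process the longest values first
    # so a short value contained in a longer one does not break the longer match.
    kept = [e for e in entities if e["score"] >= min_score and e["value"]]
    kept.sort(key=lambda e: len(e["value"]), reverse=True)
    for e in kept:
        # str.replace substitutes every occurrence of the value in the text.
        text = text.replace(e["value"], f'[{e["type"]}]')
    return text


# Hand-written sample in the assumed response shape (not real model output).
sample_text = "John Doe lives at 42 Orchard Road, Singapore 238879"
sample_entities = [
    {"type": "person_name", "value": "John Doe", "score": 0.97},
    {"type": "street_address", "value": "42 Orchard Road, Singapore 238879", "score": 0.91},
    {"type": "phone_number", "value": "238879", "score": 0.20},  # below threshold, ignored
]

print(redact(sample_text, sample_entities))
# prints: [person_name] lives at [street_address]
```

String replacement is the simplest option given only the fields shown here, but it replaces every occurrence of a value, including coincidental ones. If the server also returns character offsets (not shown in the examples in this card), those would be a more precise basis for redaction.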