---
license: apache-2.0
language:
- en
- de
- fr
- es
- it
- multilingual
base_model:
- urchade/gliner_multi_pii-v1
library_name: gliner
tags:
- pii
- privacy
- ner
- llm-safety
- gdpr
- pii-redaction
- multilingual
- onnx
pipeline_tag: token-classification
---
# nullpii

Multilingual PII detection. ONNX-exported GLiNER built on
[`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1)
(mDeBERTa-v3 base + GLiNER head, ~278M params). ~1.2 GB FP32. 12-class
span output.

🧪 **Hobby / experiment.** Nights-and-weekends project. No SLA.

Attribution: this model includes **NVIDIA Nemotron-PII (CC-BY-4.0)**
derivative content.

## Two F1 columns

The repo ships the raw ONNX model and tokenizer. F1 depends on which runtime you pair it with.

| Mode | What it is | F1 leader |
|---|---|---|
| `nullpii-bare` | this ONNX + GLiNER decoder + 1400-char chunking. No post-processing. | clean OOD splits |
| `nullpii` (full runtime) | npm package: this model + 70-pattern recognizer pack (AWS / GitHub / Stripe / IBAN / SSN / …) + adversarial-input preprocessor (NFKC / any-ascii / URL `%XX` / HTML entity / zero-width / spaced PII) + base64 decoder + never-PII filter + reversible vault | adversarial / token-shape PII / production round-trip |

Both numbers are published below so the model-vs-pipeline delta is explicit.

## Benchmark

Setup: Mac M5 Pro CPU, single seed. Metric: macro F1 with partial-match span scoring at IoU ≥ 0.5. Samples are capped at 5,000 per dataset (fewer where the dataset is smaller). Tools run one at a time (`--parallel-tools 1`, fair-serial). Third-party tools run bare: no nullpii post-processing is applied to competitor results. Full matrix CSV: [`packages/eval/published-bench/matrix.csv`](https://github.com/lBroth/nullpii/blob/main/packages/eval/published-bench/matrix.csv). Run: `packages/eval/scripts/bench_full.py`.

Results for v0.3.0 (M5 Pro CPU, full 9×16 matrix). OOD macro F1 for `nullpii` = **0.7784**, averaged over `presidio-synthetic`, `isotonic-{en,de,fr,it}-heldout`, `ai4privacy-300k-heldout`, and `tab-echr`.

| Dataset | n | **`nullpii`** | **`nullpii-bare`** | `nemotron-pii-raw` | `gliner-pii-large-v1` | `gliner-onnx-pii-fp32` | `deberta` | `piiranha` | `presidio` | `opf` |
|---|---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| `presidio-synthetic` | 5,000 | **0.9137** | 0.8487 | 0.7154 | 0.6749 | 0.5254 | 0.5111 | 0.3853 | 0.5511 § | 0.6530 |
| `isotonic-en-heldout` | 1,900 | 0.7197 | 0.5969 | **0.7518** | 0.6662 | 0.5485 | 0.6224 | 0.4124 | 0.4472 | 0.4095 |
| `isotonic-de-heldout` | 2,400 | **0.7297** | 0.6191 | 0.7271 | 0.6325 | 0.5432 | 0.3969 | 0.4112 | 0.3859 | 0.4155 |
| `isotonic-fr-heldout` | 2,800 | 0.7254 | 0.6001 | **0.7276** | 0.6663 | 0.5393 | 0.4824 | 0.4172 | 0.4042 | 0.4257 |
| `isotonic-it-heldout` | 2,200 | **0.7395** | 0.6148 | 0.7273 | 0.6605 | 0.5519 | 0.4509 | 0.4176 | 0.4057 | 0.4420 |
| `ai4privacy-300k-heldout` | 5,000 | **0.6966** | 0.5241 | 0.6608 | 0.4306 | 0.5131 | 0.2183 | 0.3266 | 0.4882 | 0.4630 |
| `tab-echr` ⚠ | 127 | 0.9239 | **0.9275** | 0.6026 | 0.6346 | 0.6463 | 0.2908 | 0.3163 | 0.7761 | 0.4166 |
| `nemotron-pii-test` ⚠ | 5,000 | 0.8063 | 0.6814 | **0.9286** ‡ | 0.7675 | 0.7352 | 0.4153 | 0.3286 | 0.4236 | 0.4005 |
| `nullpii-internal-bench` ⚐ | 2,361 | **0.4228** | 0.3090 | 0.3065 | 0.2851 | 0.2936 | 0.1711 | 0.1669 | 0.1436 | 0.2488 |

Full 16-row matrix at [github.com/lBroth/nullpii/tree/main/packages/eval/published-bench](https://github.com/lBroth/nullpii/tree/main/packages/eval/published-bench).

Legend:

- **bold** = row max
- ⚠ training-distribution overlap with at least one competitor in the row
- ⚐ in-distribution for `nullpii` itself (regression cell, **not** counted in the OOD headline)
- ‡ competitor on its own training distribution (best-case self-report)
- § Presidio benched on its own evaluator dataset (best-case self-report)
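
The OOD headline can be reproduced from the `nullpii` column of the table above as an unweighted mean over the seven OOD datasets. A minimal sketch, with the values copied by hand from that table; the names `OOD_F1` and `macro_average` are illustrative, and the unweighted averaging is inferred from the published numbers rather than taken from `bench_full.py`:

```python
# nullpii F1 per OOD dataset, copied from the benchmark table.
OOD_F1 = {
    "presidio-synthetic": 0.9137,
    "isotonic-en-heldout": 0.7197,
    "isotonic-de-heldout": 0.7297,
    "isotonic-fr-heldout": 0.7254,
    "isotonic-it-heldout": 0.7395,
    "ai4privacy-300k-heldout": 0.6966,
    "tab-echr": 0.9239,
}


def macro_average(scores: dict[str, float]) -> float:
    """Unweighted mean: every dataset counts equally, regardless of its n."""
    return sum(scores.values()) / len(scores)


headline = macro_average(OOD_F1)
print(f"{headline:.4f}")  # prints 0.7784
```

Because the mean is unweighted, the 127-sample `tab-echr` row moves the headline as much as any 5,000-sample row.
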
## Latency

M5 Pro CPU, Node 24, `nullpii` runtime, full pipeline:

| Input | p50 | p95 | p99 |
|---:|---:|---:|---:|
| 100 chars | 23 ms | 25 ms | 27 ms |
| 1,000 chars | 95 ms | 113 ms | 114 ms |
| 10,000 chars | 938 ms | 972 ms | 1,122 ms |

Cold start (first `sanitize()`, ONNX load included): ~756 ms.
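
For readers who want comparable numbers on their own hardware, the sketch below shows one way to turn raw timings into p50 / p95 / p99. The card does not state which percentile definition or run count was used, so the nearest-rank method, the `runs` value, and the stand-in workload are all assumptions; `percentile` and `time_calls` are hypothetical helper names.

```python
import math
import time
from typing import Callable


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of the data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[min(rank, len(ordered)) - 1]


def time_calls(fn: Callable[[], object], runs: int = 200) -> list[float]:
    """Wall-clock latency of fn() in milliseconds, one entry per run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


# Stand-in workload; replace the lambda with a call into your own pipeline.
timings = time_calls(lambda: sum(range(10_000)), runs=50)
for p in (50, 95, 99):
    print(f"p{p}: {percentile(timings, p):.3f} ms")
```

To keep cold start out of the percentiles, run the call once before timing and report that first run separately.
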
## When to pick which

- **`nullpii-bare`**: clean OOD splits, raw F1 priority, direct integration via Python or your own runtime.
- **`nullpii` (npm full runtime)**: production LLM proxy. Handles token-shape PII (Stripe / IBAN / SSN / 50+ secret patterns), adversarial inputs (zero-width / base64 / URL-encoded), and reversible-vault round-trips with session-bound placeholders. Install with `npm i nullpii`.

## Schema (12 classes)

| Label | Examples | Source |
|---|---|---|
| `private_person` | names | model |
| `private_email` | emails | model + regex |
| `private_phone` | int'l + IT / FR / ES / HIPAA-fax domestic | model + regex |
| `private_address` | street, city, ZIP | model |
| `private_date` | birth / hire dates | model |
| `private_url` | `http(s)://`, `www.` | model + regex |
| `private_ip` | IPv4, IPv6 (RFC 1918 / 5737 / loopback filtered) | regex post-pass |
| `private_mac` | MAC addresses (broadcast / multicast filtered) | regex post-pass |
| `private_passport` | US / IT / FR / ES / DE / UK + context-anchored generic (30 countries) | model (zero-shot) + regex post-pass |
| `private_driver_license` | US per-state + IT / EU per-country (context-anchored) | model (zero-shot) + regex post-pass |
| `private_vehicle_id` | VIN (ISO 3779 mod-11), plates IT / FR / DE / UK / ES / US | model (zero-shot) + regex (validated) |
| `private_geolocation` | lat/lon decimal pairs (range-validated) + DMS notation | model (zero-shot) + regex (validated) |
| `account_number` | IBAN mod-97, cards (Luhn), SSN, MRN, BTC / ETH, DNI / CPF / CF / EIN, Medicare MBI / HIC, NPI, insurance policy, IMEI | model + regex (validated) |
| `secret` | API keys (AWS / GitHub / OpenAI / Anthropic / Stripe / 30+), JWT, PEM, base64-wrapped PII | regex (50+) + base64 |

The GLiNER head is trained on 8 categories: the first 6 rows (`private_person` through `private_url`) plus `account_number` and `secret`. The other 4 model classes (`private_passport`, `private_driver_license`, `private_vehicle_id`, `private_geolocation`) are prompted zero-shot and paired with a validated regex post-pass. `private_ip` and `private_mac` are regex-only; the model is not trained on them, so the table lists 14 labels, 12 of which are model classes.
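
Several rows above are marked "validated", meaning a regex hit only counts if it also passes a checksum or range check. The sketch below illustrates three of the checks the table names (Luhn for cards, mod-97 for IBANs, range validation for lat/lon). It is a standalone illustration, not the npm package's code; the function names are hypothetical, and the sample IBAN is a commonly published Italian example.

```python
import re


def luhn_ok(number: str) -> bool:
    """Luhn checksum, as used for payment card numbers. Non-digits are ignored."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walking from the right, double every second digit; subtract 9 if it exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_ok(iban: str) -> bool:
    """IBAN mod-97 check: move the first four chars to the end, map A-Z to 10-35."""
    s = re.sub(r"\s+", "", iban).upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", s):
        return False
    rearranged = s[4:] + s[:4]
    as_digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(as_digits) % 97 == 1


def latlon_ok(lat: float, lon: float) -> bool:
    """Range check for a decimal latitude / longitude pair."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


print(luhn_ok("4111 1111 1111 1111"))                 # True
print(iban_ok("IT60 X054 2811 1010 0000 0123 456"))   # True
print(latlon_ok(91.0, 0.0))                           # False
```

Checks like these trade a little recall for precision: a random 16-digit run passes Luhn only about one time in ten, so most order numbers and IDs are not mislabelled as cards.
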
## Tricky inputs the npm runtime still catches

Cases where the adversarial-input preprocessor and recognizer pack recover PII that the bare model alone would miss:

| Surface | Input | Detected as |
|---|---|---|
| base64-wrapped secret | `(base64-encoded) c2stYW50LWFwaTAzLWFCY0RlRmcw…` | `sk-ant-api03-…` (Anthropic key) |
| HTML-entity-encoded secret | `sk-ant…` | `sk-ant-…` (Anthropic key) |
| double-URL-encoded email | `bob.jones%2540company.io` | `bob.jones@company.io` (email) |
| zero-width-obfuscated address | `221B Baker St`U+200B`re`U+200B`et `U+200B`London` | `221B Baker Street London` (address) |
| spaced-out email | `u s e r . 1 2 3 @ g m a i l . c o m` | `user.123@gmail.com` (email) |
| Cyrillic-homoglyph email | `pаyments@bank.com` (`а` = U+0430) | `payments@bank.com` (email) |
| fullwidth ASCII email | `USER.NAME@example.com` | `USER.NAME@example.com` (email) |
| Italian IBAN in prose | `IT60X0542811101000000123456` | `IT60X0542811101000000123456` (account_number, mod-97 verified) |

Five passes total: Unicode normalisation (NFKC + `any-ascii` transliteration), base64 decode-then-classify, iterative URL `%XX` + HTML-entity decode, zero-width strip with offset remap, and the 50+ pattern validated regex pack.
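
Four of these passes can be approximated with the Python standard library. The sketch below is an illustration of the ideas, not a port of the npm preprocessor: the function names, the zero-width character set, and the `max_rounds` cap are assumptions. The `any-ascii` transliteration (needed for the Cyrillic-homoglyph row) and the spaced-PII join are left out because they need a third-party library or heuristics the card does not describe.

```python
import base64
import binascii
import html
import unicodedata
from typing import Optional
from urllib.parse import unquote

# Assumed set of zero-width code points; the runtime's actual list is not documented here.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def strip_zero_width(text: str) -> tuple[str, list[int]]:
    """Remove zero-width chars. offsets[i] is the index in `text` of cleaned[i]."""
    kept, offsets = [], []
    for i, ch in enumerate(text):
        if ch not in ZERO_WIDTH:
            kept.append(ch)
            offsets.append(i)
    return "".join(kept), offsets


def decode_iteratively(text: str, max_rounds: int = 5) -> str:
    """Apply URL %XX and HTML-entity decoding until the text stops changing."""
    for _ in range(max_rounds):
        decoded = html.unescape(unquote(text))
        if decoded == text:
            break
        text = decoded
    return text


def try_base64(token: str) -> Optional[str]:
    """Return the decoded text if `token` is strict base64 holding UTF-8, else None."""
    try:
        return base64.b64decode(token, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return None


def normalise(text: str) -> str:
    """NFKC folds compatibility forms such as fullwidth letters into plain ASCII."""
    return unicodedata.normalize("NFKC", text)


cleaned, offsets = strip_zero_width("221B Baker St\u200bre\u200bet \u200bLondon")
print(cleaned)                                         # 221B Baker Street London
print(offsets[13])                                     # 14: cleaned[13] sat at index 14 in the input
print(decode_iteratively("bob.jones%2540company.io"))  # bob.jones@company.io
print(normalise("\uff35\uff33\uff25\uff32"))           # USER
print(try_base64(base64.b64encode(b"john@acme.io").decode()))  # john@acme.io
```

The offset list is what makes the zero-width pass reversible: a span found at `cleaned[a:b]` maps back to `text[offsets[a]:offsets[b - 1] + 1]`, so redaction can be applied to the original string.
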
## Intended use

- Pre-LLM PII redaction for prompts / RAG corpora / log scrubbing.
- Span-level PII tagging for batch redaction.
- Geographic scope: EU + Romance + English. Limited coverage outside.

## Out-of-scope

- **Implied / opinion-based attributes** (race, religion, health conditions, political views, sexual orientation). These need a different kind of model; this one only finds explicit text spans.
- **HIPAA PHI**: `account_number` catches MRN-shaped digit runs, but the model is not a HIPAA de-identifier. Diagnoses, ICD codes, dosages, and biometric / genetic identifiers are out of scope.
- **CJK / RTL / Indic scripts**: limited coverage; treat as out-of-scope.
- **Air-gapped first run**: the first call downloads the model artifacts, so offline hosts must point at a local mirror via `NULLPII_MODEL_DIR` or the `modelDir` config.

## Limitations

- Adversarial robustness comes from the npm runtime pipeline, not the model alone. The bare-model column does not include the adversarial preprocessor or the recognizer pack.
- Long inputs are chunked to stay within the 512-token window (npm word-chunker: 140 words with 30-word overlap; bare GLiNER chunker: 1400 chars with 200-char overlap). Spans duplicated at chunk boundaries are deduplicated via IoU.
- `nullpii-internal-bench` is in-distribution for the project pipeline: treat it as a regression test, not an OOD claim.
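
The chunk-and-dedupe step can be sketched as follows, using the 1400 / 200 character settings quoted for the bare chunker. Everything else is an assumption made for illustration: the function names, the 0.5 IoU threshold, the same-label requirement, and the keep-the-highest-score policy are not taken from the project's code.

```python
def chunk_text(text: str, size: int = 1400, overlap: int = 200) -> list[tuple[int, str]]:
    """Split text into (start_offset, chunk) windows of `size` chars sharing `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, text[start:start + size]))
        if start + size >= len(text):
            break
        start += step
    return chunks


def span_iou(a: tuple[int, int], b: tuple[int, int]) -> float:
    """Intersection over union of two half-open character spans (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def dedupe(entities: list[dict], min_iou: float = 0.5) -> list[dict]:
    """Among same-label spans overlapping at IoU >= min_iou, keep the highest-scoring one."""
    kept: list[dict] = []
    for ent in sorted(entities, key=lambda e: e["score"], reverse=True):
        clash = any(
            k["label"] == ent["label"]
            and span_iou((k["start"], k["end"]), (ent["start"], ent["end"])) >= min_iou
            for k in kept
        )
        if not clash:
            kept.append(ent)
    return sorted(kept, key=lambda e: e["start"])


print([start for start, _ in chunk_text("x" * 3000)])  # [0, 1200, 2400]
print(span_iou((0, 10), (2, 10)))                      # 0.8
```

Entities predicted inside a chunk carry chunk-local offsets, so add the chunk's `start_offset` to each `start` and `end` before calling `dedupe`; otherwise the two copies of a boundary span never line up.
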
## How to use

### npm (production path)

```bash
npm install nullpii onnxruntime-node
```

```ts
import { sanitize, restore, wrapForLLM } from 'nullpii';

// Replace detected PII with session-bound placeholders.
const safe = await sanitize('Email John Smith at john@acme.io about SSN 123-45-6789');
const prompt = wrapForLLM(safe, 'Translate to Italian');
// … LLM call: send `prompt` and keep the response text in `reply` …
// Map the placeholders in the reply back to the original values.
const back = restore(reply, safe.sessionId);
```

The first call downloads the artifacts from this repo into `~/.cache/nullpii/`. Pre-warm with `npx nullpii prefetch`.

### Python (bare model)

```python
from gliner import GLiNER

# Load the ONNX export from the Hub.
m = GLiNER.from_pretrained("lBroth/nullpii", load_onnx_model=True)

labels = [
    # categories the GLiNER head is trained on
    "account_number", "private_address", "private_date", "private_email",
    "private_person", "private_phone", "private_url", "secret",
    # zero-shot prompted (recall lower; pair with regex pack in production)
    "private_passport", "private_driver_license",
    "private_vehicle_id", "private_geolocation",
]

entities = m.predict_entities("Email John at john@acme.io", labels, threshold=0.5)
for e in entities:
    print(e["label"], e["text"], round(e["score"], 3))
```
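
For batch redaction, the predicted spans can be swapped for placeholders and later swapped back. The sketch below assumes entity dicts with `start`, `end`, and `label` keys, as returned by GLiNER's `predict_entities`. It is a simplified stand-in for the idea behind the npm runtime's reversible vault, not its implementation; the placeholder format and the names `redact` and `unredact` are made up for this example.

```python
def redact(text: str, entities: list[dict]) -> tuple[str, dict[str, str]]:
    """Replace each span with a numbered placeholder; return the new text and a placeholder-to-original map."""
    vault: dict[str, str] = {}
    counters: dict[str, int] = {}
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["start"] < cursor:
            continue  # skip a span that overlaps one already replaced
        counters[ent["label"]] = counters.get(ent["label"], 0) + 1
        placeholder = f"[{ent['label'].upper()}_{counters[ent['label']]}]"
        vault[placeholder] = text[ent["start"]:ent["end"]]
        pieces.append(text[cursor:ent["start"]])
        pieces.append(placeholder)
        cursor = ent["end"]
    pieces.append(text[cursor:])
    return "".join(pieces), vault


def unredact(text: str, vault: dict[str, str]) -> str:
    """Put the original values back wherever their placeholders appear."""
    for placeholder, original in vault.items():
        text = text.replace(placeholder, original)
    return text


# Hand-written spans so the example runs without downloading the model.
text = "Email John at john@acme.io"
entities = [
    {"start": 6, "end": 10, "label": "private_person"},
    {"start": 14, "end": 26, "label": "private_email"},
]
redacted, vault = redact(text, entities)
print(redacted)                   # Email [PRIVATE_PERSON_1] at [PRIVATE_EMAIL_1]
print(unredact(redacted, vault))  # Email John at john@acme.io
```

Building the output left to right from the original offsets avoids the index drift that in-place replacement would cause once the first placeholder changes the string length.
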
## License

Apache-2.0. Combined upstream attribution: Apache-2.0 (base model) + CC-BY-4.0 (Nemotron-PII derivative content; attribution required, see header). Commercial redistribution permitted subject to the Nemotron attribution.

## Citation

> nullpii contributors (2026). *nullpii: multilingual PII detection.* https://huggingface.co/lBroth/nullpii

Built on [`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1) (Zaratiana et al., NAACL 2024).