---
license: apache-2.0
language:
- en
- de
- fr
- es
- it
- multilingual
base_model:
- urchade/gliner_multi_pii-v1
library_name: gliner
tags:
- pii
- privacy
- ner
- llm-safety
- gdpr
- pii-redaction
- multilingual
- onnx
pipeline_tag: token-classification
---
# nullpii
Multilingual PII detection. ONNX-exported GLiNER built on
[`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1)
(mDeBERTa-v3 base + GLiNER head, ~278M params). ~1.2 GB FP32. 12-class
span output.
🧪 **Hobby / experiment.** Nights-and-weekends project. No SLA.
Attribution: this model includes **NVIDIA Nemotron-PII (CC-BY-4.0)**
derivative content.
## Two F1 columns
The repo ships the raw ONNX + tokenizer. F1 depends on which runtime you pair it with.
| Mode | What it is | Leads F1 on |
|---|---|---|
| `nullpii-bare` | this ONNX + GLiNER decoder + 1400-char chunking. No post-processing. | clean OOD splits |
| `nullpii` (full runtime) | npm package: this model + 70-pattern recognizer pack (AWS / GitHub / Stripe / IBAN / SSN / …) + adversarial-input preprocessor (NFKC / any-ascii / URL `%XX` / HTML entity / zero-width / spaced PII) + base64 decoder + never-PII filter + reversible vault | adversarial / token-shape PII / production round-trip |
Both numbers are published below so the model-vs-pipeline delta is explicit.
## Benchmark
Setup: Mac M5 Pro CPU, single seed. Metric: macro F1 with partial-match span scoring at IoU ≥ 0.5. Each dataset is capped at 5,000 samples (fewer where the dataset is smaller). Tools run serially (`--parallel-tools 1`) for a fair comparison. Third-party tools run bare (no nullpii post-processing on competitor rows). Full matrix CSV: [`packages/eval/published-bench/matrix.csv`](https://github.com/lBroth/nullpii/blob/main/packages/eval/published-bench/matrix.csv). Run: `packages/eval/scripts/bench_full.py`.
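To make the metric concrete, here is a minimal sketch of partial-match span scoring at IoU ≥ 0.5. It assumes character-offset spans, greedy one-to-one matching within a label, and a macro average taken as the plain mean of per-label F1. The function names are illustrative and the matching strategy in `bench_full.py` may differ.

```python
def span_iou(a, b):
    """Intersection-over-union of two (start, end) character spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def f1_for_label(pred, gold, threshold=0.5):
    """F1 for one label; a prediction counts if it overlaps an unmatched gold span."""
    matched = set()
    tp = 0
    for p in pred:
        for i, g in enumerate(gold):
            if i not in matched and span_iou(p, g) >= threshold:
                matched.add(i)  # each gold span can be matched only once
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def macro_f1(pred_by_label, gold_by_label, threshold=0.5):
    """Unweighted mean of per-label F1 over every label seen in pred or gold."""
    labels = sorted(set(pred_by_label) | set(gold_by_label))
    scores = [
        f1_for_label(pred_by_label.get(l, []), gold_by_label.get(l, []), threshold)
        for l in labels
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this scheme a prediction covering 8 of a 10-character gold span (IoU 0.8) is a hit, while one covering only 4 characters (IoU 0.4) is a miss.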
v0.3.0 (M5 Pro CPU, full 9×16 matrix). OOD macro F1 for `nullpii` = **0.7784** (`presidio-synthetic` + `isotonic-{en,de,fr,it}-heldout` + `ai4privacy-300k-heldout` + `tab-echr`).
| Dataset | n | **`nullpii`** | **`nullpii-bare`** | `nemotron-pii-raw` | `gliner-pii-large-v1` | `gliner-onnx-pii-fp32` | `deberta` | `piiranha` | `presidio` | `opf` |
|---|---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| `presidio-synthetic` | 5,000 | **0.9137** | 0.8487 | 0.7154 | 0.6749 | 0.5254 | 0.5111 | 0.3853 | 0.5511 § | 0.6530 |
| `isotonic-en-heldout` | 1,900 | 0.7197 | 0.5969 | **0.7518** | 0.6662 | 0.5485 | 0.6224 | 0.4124 | 0.4472 | 0.4095 |
| `isotonic-de-heldout` | 2,400 | **0.7297** | 0.6191 | 0.7271 | 0.6325 | 0.5432 | 0.3969 | 0.4112 | 0.3859 | 0.4155 |
| `isotonic-fr-heldout` | 2,800 | 0.7254 | 0.6001 | **0.7276** | 0.6663 | 0.5393 | 0.4824 | 0.4172 | 0.4042 | 0.4257 |
| `isotonic-it-heldout` | 2,200 | **0.7395** | 0.6148 | 0.7273 | 0.6605 | 0.5519 | 0.4509 | 0.4176 | 0.4057 | 0.4420 |
| `ai4privacy-300k-heldout` | 5,000 | **0.6966** | 0.5241 | 0.6608 | 0.4306 | 0.5131 | 0.2183 | 0.3266 | 0.4882 | 0.4630 |
| `tab-echr` ⚠ | 127 | 0.9239 | **0.9275** | 0.6026 | 0.6346 | 0.6463 | 0.2908 | 0.3163 | 0.7761 | 0.4166 |
| `nemotron-pii-test` ⚠ | 5,000 | 0.8063 | 0.6814 | **0.9286** ‡ | 0.7675 | 0.7352 | 0.4153 | 0.3286 | 0.4236 | 0.4005 |
| `nullpii-internal-bench` ⚐ | 2,361 | **0.4228** | 0.3090 | 0.3065 | 0.2851 | 0.2936 | 0.1711 | 0.1669 | 0.1436 | 0.2488 |
Full 16-row matrix at [github.com/lBroth/nullpii/tree/main/packages/eval/published-bench](https://github.com/lBroth/nullpii/tree/main/packages/eval/published-bench).
Legend:
- **bold** = row max
- ⚠ training-distribution overlap with at least one competitor in the row
- ⚐ in-distribution for `nullpii` itself (regression cell, **not** counted in the OOD headline)
- ‡ competitor on its own training distribution (best-case self-report)
- § Presidio benched on its own evaluator dataset (best-case self-report)
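The OOD headline can be reproduced from the table as the unweighted mean of the seven OOD rows. The snippet below is only a worked check using the values copied from the table above; the dictionary names are illustrative.

```python
# F1 values copied from the benchmark table (the seven OOD rows only).
nullpii_ood = {
    "presidio-synthetic": 0.9137,
    "isotonic-en-heldout": 0.7197,
    "isotonic-de-heldout": 0.7297,
    "isotonic-fr-heldout": 0.7254,
    "isotonic-it-heldout": 0.7395,
    "ai4privacy-300k-heldout": 0.6966,
    "tab-echr": 0.9239,
}
bare_ood = {
    "presidio-synthetic": 0.8487,
    "isotonic-en-heldout": 0.5969,
    "isotonic-de-heldout": 0.6191,
    "isotonic-fr-heldout": 0.6001,
    "isotonic-it-heldout": 0.6148,
    "ai4privacy-300k-heldout": 0.5241,
    "tab-echr": 0.9275,
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


headline = mean(nullpii_ood.values())   # full runtime
bare_mean = mean(bare_ood.values())     # bare model, same rows
print(f"nullpii OOD macro F1:      {headline:.4f}")
print(f"nullpii-bare OOD macro F1: {bare_mean:.4f}")
print(f"model-vs-pipeline delta:   {headline - bare_mean:.4f}")
```

The first printed value matches the published 0.7784; the bare-model mean is derived here from the table rows and is not a separately published figure.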
## Latency
M5 Pro CPU, Node 24, `nullpii` runtime full pipeline:
| Input | p50 | p95 | p99 |
|---:|---:|---:|---:|
| 100 chars | 23 ms | 25 ms | 27 ms |
| 1,000 chars | 95 ms | 113 ms | 114 ms |
| 10,000 chars | 938 ms | 972 ms | 1,122 ms |
Cold start (first `sanitize()`, ONNX load included): ~756 ms.
## When to pick which
- **`nullpii-bare`** — clean OOD splits, raw F1 priority, integrate directly via Python / your own runtime.
- **`nullpii` (npm full runtime)** — production LLM proxy. Token-shape PII (Stripe / IBAN / SSN / 50+ secret patterns), adversarial inputs (zero-width / base64 / URL-encoded), reversible-vault round-trip with session-bound placeholders. `npm i nullpii`.
## Schema (12 classes)
| Label | Examples | Source |
|---|---|---|
| `private_person` | names | model |
| `private_email` | emails | model + regex |
| `private_phone` | int'l + IT / FR / ES / HIPAA-fax domestic | model + regex |
| `private_address` | street, city, ZIP | model |
| `private_date` | birth / hire dates | model |
| `private_url` | `http(s)://`, `www.` | model + regex |
| `private_ip` | IPv4, IPv6 (RFC 1918 / 5737 / loopback filtered) | regex post-pass |
| `private_mac` | MAC addresses (broadcast / multicast filtered) | regex post-pass |
| `private_passport` | US / IT / FR / ES / DE / UK + context-anchored generic (30 countries) | model (zero-shot) + regex post-pass |
| `private_driver_license` | US per-state + IT / EU per-country (context-anchored) | model (zero-shot) + regex post-pass |
| `private_vehicle_id` | VIN (ISO 3779 mod-11), plates IT / FR / DE / UK / ES / US | model (zero-shot) + regex (validated) |
| `private_geolocation` | lat/lon decimal pairs (range-validated) + DMS notation | model (zero-shot) + regex (validated) |
| `account_number` | IBAN mod-97, cards (Luhn), SSN, MRN, BTC / ETH, DNI / CPF / CF / EIN, Medicare MBI / HIC, NPI, insurance policy, IMEI | model + regex (validated) |
| `secret` | API keys (AWS / GitHub / OpenAI / Anthropic / Stripe / 30+), JWT, PEM, base64-wrapped PII | regex (50+) + base64 |
The GLiNER head is trained on 8 categories (the first 6 rows + `account_number` + `secret`). Another 4 (`private_passport` / `private_driver_license` / `private_vehicle_id` / `private_geolocation`) are prompted zero-shot and paired with a validated regex post-pass, which gives the 12 model labels. `private_ip` / `private_mac` are regex-only; the model is not trained on them.
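"Validated" regex means a pattern hit is kept only if its checksum passes. As an illustration, the two checks named in the `account_number` row can be sketched as follows. These are the standard IBAN mod-97 and Luhn algorithms written from scratch, not code taken from the nullpii recognizer pack, and the function names are hypothetical.

```python
def iban_is_valid(iban: str) -> bool:
    """IBAN mod-97 check: move the first 4 chars to the end, map A-Z to 10-35."""
    s = iban.replace(" ", "").upper()
    if len(s) < 15 or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def luhn_is_valid(number: str) -> bool:
    """Luhn check used by payment card numbers; separators are ignored."""
    digits = [int(c) for c in number if c in "0123456789"]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# Commonly cited Italian example IBAN and a well-known Visa test number.
print(iban_is_valid("IT60 X054 2811 1010 0000 0123 456"))  # True
print(luhn_is_valid("4111 1111 1111 1111"))                # True
```

Checksum validation is what lets a broad digit-run pattern stay precise: a random 16-digit string passes Luhn only about one time in ten, and a random IBAN-shaped string passes mod-97 about one time in 97.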
## Tricky inputs the npm runtime still catches
Cases where the adversarial-input preprocessor + recognizer pack recover PII that the bare model alone would miss:
| Surface | Input | Detected as |
|---|---|---|
| base64-wrapped secret | `(base64-encoded) c2stYW50LWFwaTAzLWFCY0RlRmcw…` | `sk-ant-api03-…` (Anthropic key) |
| HTML-entity-encoded secret | `sk-ant…` | `sk-ant-…` (Anthropic key) |
| double-URL-encoded email | `bob.jones%2540company.io` | `bob.jones@company.io` (email) |
| zero-width-obfuscated address | `221B Baker St`U+200B`re`U+200B`et `U+200B`London` | `221B Baker Street London` (address) |
| spaced-out email | `u s e r . 1 2 3 @ g m a i l . c o m` | `user.123@gmail.com` (email) |
| Cyrillic-homoglyph email | `pаyments@bank.com` (`а` = U+0430) | `payments@bank.com` (email) |
| fullwidth ASCII email | `USER.NAME@example.com` | `USER.NAME@example.com` (email) |
| Italian IBAN in prose | `IT60X0542811101000000123456` | `IT60X0542811101000000123456` (account_number, mod-97 verified) |
Five passes total: Unicode normalisation (NFKC + `any-ascii` transliteration), base64 decode-then-classify, iterative URL `%XX` + HTML-entity decode, zero-width strip with offset remap, 50+ validated regex pack.
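The decoding passes can be approximated in a few lines of standard-library Python. This is a rough sketch, not the npm runtime's implementation: the small `HOMOGLYPHS` map is a stand-in for `any-ascii` transliteration, the round limit and the 16-character base64 threshold are assumptions, and the offset remap that the runtime performs after stripping characters is omitted.

```python
import base64
import binascii
import html
import re
import unicodedata
from urllib.parse import unquote

# Zero-width characters to delete (mapping a code point to None removes it).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))
# Tiny illustrative stand-in for any-ascii: Cyrillic a / e / o to Latin.
HOMOGLYPHS = str.maketrans({"\u0430": "a", "\u0435": "e", "\u043e": "o"})
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def normalise(text: str, max_rounds: int = 5) -> str:
    """NFKC, zero-width strip, homoglyph fold, then iterative URL/HTML decode."""
    text = unicodedata.normalize("NFKC", text)  # folds fullwidth ASCII, ligatures
    text = text.translate(ZERO_WIDTH).translate(HOMOGLYPHS)
    for _ in range(max_rounds):  # repeat so double-encoded input unwinds fully
        decoded = html.unescape(unquote(text))
        if decoded == text:
            break
        text = decoded
    return text


def decode_base64_candidates(text: str) -> list[str]:
    """Return printable-ASCII decodings of base64-looking runs in the text."""
    found = []
    for match in BASE64_RUN.finditer(text):
        try:
            raw = base64.b64decode(match.group(), validate=True)
            decoded = raw.decode("ascii")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64, or not text once decoded
        if decoded.isprintable():
            found.append(decoded)
    return found


print(normalise("bob.jones%2540company.io"))
print(decode_base64_candidates("token: c2stYW50LWFwaTAzLWFCY0RlRmcw"))
```

In a pipeline of this shape, the normalised text and any decoded base64 payloads would then be passed to the regex pack and the model, which is the "decode-then-classify" step.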
## Intended use
- Pre-LLM PII redaction for prompts / RAG corpora / log scrubbing.
- Span-level PII tagging for batch redaction.
- Geographic scope: EU + Romance + English. Limited coverage outside.
## Out-of-scope
- **Implied / opinion-based attributes** (race, religion, health conditions, political views, sexual orientation). These need a different kind of model — this one only finds explicit text spans.
- **HIPAA PHI** — `account_number` catches MRN-shaped digit runs but the model is not a HIPAA de-identifier. Diagnoses, ICD codes, dosages, biometric / genetic identifiers — out of scope.
- **CJK / RTL / Indic scripts** — limited coverage; treat as out-of-scope.
- **Air-gapped first-run** — point at a local mirror via `NULLPII_MODEL_DIR` or `modelDir` config.
## Limitations
- Adversarial robustness comes from the npm runtime pipeline, not the model alone. The bare-model column does not include the adversarial preprocessor or the recognizer pack.
- Long inputs are chunked to stay within the 512-token window (npm word chunker: 140 words with 30-word overlap; bare GLiNER chunker: 1,400 chars with 200-char overlap). Spans duplicated across chunk boundaries are deduplicated via IoU.
- `nullpii-internal-bench` is in-distribution for the project pipeline. Treat it as a regression test, not an OOD claim.
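The character chunker and the boundary dedupe described above can be sketched like this. The sketch assumes that "1400 / 200 chars" means a 1,400-character window with a 200-character overlap, that duplicates are resolved by keeping the higher-scoring span, and that the IoU threshold is 0.5. None of these details are confirmed by this card, and the function names are illustrative.

```python
def chunk_text(text: str, size: int = 1400, overlap: int = 200):
    """Split text into overlapping windows; returns (offset, chunk) pairs."""
    step = size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, text[start:start + size]))
        if start + size >= len(text):
            break
        start += step
    return chunks


def span_iou(a: dict, b: dict) -> float:
    inter = max(0, min(a["end"], b["end"]) - max(a["start"], b["start"]))
    union = (a["end"] - a["start"]) + (b["end"] - b["start"]) - inter
    return inter / union if union else 0.0


def dedupe(spans: list[dict], iou_threshold: float = 0.5) -> list[dict]:
    """Drop same-label spans that overlap an already kept, higher-scoring span.

    Spans must carry absolute offsets, i.e. chunk offset + in-chunk position.
    """
    kept = []
    for s in sorted(spans, key=lambda s: -s["score"]):
        if all(s["label"] != k["label"] or span_iou(s, k) < iou_threshold
               for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s["start"])


offsets = [offset for offset, _ in chunk_text("x" * 3000)]
print(offsets)  # [0, 1200, 2400]
```

An entity that falls inside an overlap region is detected once per chunk; after shifting each detection by its chunk offset, the two copies have near-identical spans and `dedupe` keeps one.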
## How to use
### npm (production path)
```bash
npm install nullpii onnxruntime-node
```
```ts
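import { sanitize, restore, wrapForLLM } from 'nullpii';

// Replace detected PII with session-bound placeholders before the text leaves your process.
const safe = await sanitize('Email John Smith at john@acme.io about SSN 123-45-6789');
const prompt = wrapForLLM(safe, 'Translate to Italian');
// … LLM call … (assign the model's response text to `reply`)
// Swap the placeholders in the reply back to the original values.
const back = restore(reply, safe.sessionId);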
```
The first call downloads this repo's artifacts into `~/.cache/nullpii/`. Pre-warm with `npx nullpii prefetch`.
### Python (bare model)
```python
from gliner import GLiNER

# load_onnx_model=True loads the ONNX export shipped in this repo.
m = GLiNER.from_pretrained("lBroth/nullpii", load_onnx_model=True)
labels = [
    "account_number", "private_address", "private_date", "private_email",
    "private_person", "private_phone", "private_url", "secret",
    # zero-shot prompted (recall lower; pair with regex pack in production)
    "private_passport", "private_driver_license",
    "private_vehicle_id", "private_geolocation",
]
entities = m.predict_entities("Email John at john@acme.io", labels, threshold=0.5)
# Each entity is a dict with "start", "end", "text", "label" and "score".
for e in entities:
    print(e["label"], e["start"], e["end"], e["text"])
```
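With the bare model, redaction is left to the caller. A minimal sketch of turning the entity dicts into a redacted string is shown below. The `redact` helper and the `[LABEL]` placeholder format are illustrative choices, not part of GLiNER or the nullpii runtime, and unlike the npm vault this is not reversible.

```python
def redact(text: str, entities: list[dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder.

    Entities are dicts with "start", "end" and "label" keys, as returned by
    GLiNER's predict_entities. Overlapping spans are skipped after the first.
    """
    out = []
    cursor = 0
    for e in sorted(entities, key=lambda e: e["start"]):
        if e["start"] < cursor:
            continue  # overlaps a span that was already replaced
        out.append(text[cursor:e["start"]])
        out.append(f"[{e['label'].upper()}]")
        cursor = e["end"]
    out.append(text[cursor:])
    return "".join(out)


# Hand-written entities standing in for model output on the example sentence.
sample = [
    {"start": 6, "end": 10, "label": "private_person"},
    {"start": 14, "end": 26, "label": "private_email"},
]
print(redact("Email John at john@acme.io", sample))
# Email [PRIVATE_PERSON] at [PRIVATE_EMAIL]
```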
## License
Apache-2.0. Combined upstream attribution: Apache-2.0 (base model) + CC-BY-4.0 (Nemotron-PII derivative content — attribution required, see header). Commercial redistribution permitted subject to the Nemotron attribution.
## Citation
> nullpii contributors (2026). *nullpii — multilingual PII detection.* https://huggingface.co/lBroth/nullpii
Built on [`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1) (Zaratiana et al., NAACL 2024).