---
license: apache-2.0
language:
- en
- de
- fr
- es
- it
- multilingual
base_model:
- urchade/gliner_multi_pii-v1
library_name: gliner
tags:
- pii
- privacy
- ner
- llm-safety
- gdpr
- pii-redaction
- multilingual
- onnx
pipeline_tag: token-classification
---
# nullpii
Multilingual PII detection. ONNX-exported GLiNER built on
[`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1)
(mDeBERTa-v3 base + GLiNER head, ~278M params). ~1.2 GB FP32. 12-class
span output.
🧪 **Hobby / experiment.** Nights-and-weekends project. No SLA.
Attribution: this model includes **NVIDIA Nemotron-PII (CC-BY-4.0)**
derivative content.
## Two F1 columns
The repo ships the raw ONNX + tokenizer. F1 depends on which runtime you pair it with.
| Mode | What it is | F1 leader |
|---|---|---|
| `nullpii-bare` | this ONNX + GLiNER decoder + 1400-char chunking. No post-processing. | clean OOD splits |
| `nullpii` (full runtime) | npm package: this model + 70-pattern recognizer pack (AWS / GitHub / Stripe / IBAN / SSN / …) + adversarial-input preprocessor (NFKC / any-ascii / URL `%XX` / HTML entity / zero-width / spaced PII) + base64 decoder + never-PII filter + reversible vault | adversarial / token-shape PII / production round-trip |
Both numbers are published below so the model-vs-pipeline delta is explicit.
## Benchmark
Mac M5 Pro CPU, single seed, macro F1 at IoU ≥ 0.5 partial-match span scoring. Cap 5,000 / dataset (less where the dataset is smaller). `--parallel-tools 1` fair-serial. Third-party tools run bare (no nullpii post-processing on competitor rows). Full matrix CSV: [`packages/eval/published-bench/matrix.csv`](https://github.com/lBroth/nullpii/blob/main/packages/eval/published-bench/matrix.csv). Run: `packages/eval/scripts/bench_full.py`.
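The scoring rule can be sketched in a few lines of Python. This is an illustration of macro F1 with IoU ≥ 0.5 span matching, not the code in `bench_full.py`: the greedy one-to-one matching, the `(start, end, label)` tuple layout, and the names `span_iou` / `macro_f1` are assumptions made for the example.

```python
def span_iou(a, b):
    """Character-level intersection-over-union of two (start, end, ...) spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def macro_f1(gold, pred, threshold=0.5):
    """gold / pred: lists of (start, end, label). Returns the mean per-label F1."""
    labels = {s[2] for s in gold} | {s[2] for s in pred}
    scores = []
    for label in sorted(labels):
        g = [s for s in gold if s[2] == label]
        p = [s for s in pred if s[2] == label]
        unmatched = list(g)
        tp = 0
        for span in p:
            # Assumption: greedy matching, each gold span can be claimed once.
            hit = next((x for x in unmatched if span_iou(span, x) >= threshold), None)
            if hit is not None:
                unmatched.remove(hit)
                tp += 1
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores) if scores else 0.0


# "Email John Smith at john@acme.io": name at 6..16, email at 20..32.
gold = [(6, 16, "private_person"), (20, 32, "private_email")]
loose = [(6, 12, "private_person"), (20, 32, "private_email")]  # name IoU 0.6
short = [(6, 10, "private_person"), (20, 32, "private_email")]  # name IoU 0.4
print(macro_f1(gold, loose), macro_f1(gold, short))
```

A prediction that covers 60% of the gold name still counts as a hit, while one that covers 40% does not, which is what "partial-match" means at this threshold.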
v0.3.0 (M5 Pro CPU, full 9×16 matrix). OOD macro F1 for `nullpii` = **0.7784**, the unweighted mean over seven datasets: `presidio-synthetic` + `isotonic-{en,de,fr,it}-heldout` + `ai4privacy-300k-heldout` + `tab-echr`.
| Dataset | n | **`nullpii`** | **`nullpii-bare`** | `nemotron-pii-raw` | `gliner-pii-large-v1` | `gliner-onnx-pii-fp32` | `deberta` | `piiranha` | `presidio` | `opf` |
|---|---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| `presidio-synthetic` | 5,000 | **0.9137** | 0.8487 | 0.7154 | 0.6749 | 0.5254 | 0.5111 | 0.3853 | 0.5511 § | 0.6530 |
| `isotonic-en-heldout` | 1,900 | 0.7197 | 0.5969 | **0.7518** | 0.6662 | 0.5485 | 0.6224 | 0.4124 | 0.4472 | 0.4095 |
| `isotonic-de-heldout` | 2,400 | **0.7297** | 0.6191 | 0.7271 | 0.6325 | 0.5432 | 0.3969 | 0.4112 | 0.3859 | 0.4155 |
| `isotonic-fr-heldout` | 2,800 | 0.7254 | 0.6001 | **0.7276** | 0.6663 | 0.5393 | 0.4824 | 0.4172 | 0.4042 | 0.4257 |
| `isotonic-it-heldout` | 2,200 | **0.7395** | 0.6148 | 0.7273 | 0.6605 | 0.5519 | 0.4509 | 0.4176 | 0.4057 | 0.4420 |
| `ai4privacy-300k-heldout` | 5,000 | **0.6966** | 0.5241 | 0.6608 | 0.4306 | 0.5131 | 0.2183 | 0.3266 | 0.4882 | 0.4630 |
| `tab-echr` ⚠ | 127 | 0.9239 | **0.9275** | 0.6026 | 0.6346 | 0.6463 | 0.2908 | 0.3163 | 0.7761 | 0.4166 |
| `nemotron-pii-test` ⚠ | 5,000 | 0.8063 | 0.6814 | **0.9286** ‡ | 0.7675 | 0.7352 | 0.4153 | 0.3286 | 0.4236 | 0.4005 |
| `nullpii-internal-bench` ⚐ | 2,361 | **0.4228** | 0.3090 | 0.3065 | 0.2851 | 0.2936 | 0.1711 | 0.1669 | 0.1436 | 0.2488 |
Full 16-row matrix at [github.com/lBroth/nullpii/tree/main/packages/eval/published-bench](https://github.com/lBroth/nullpii/tree/main/packages/eval/published-bench).
Legend:
- **bold** = row max
- ⚠ training-distribution overlap with at least one competitor in the row
- ⚐ in-distribution for `nullpii` itself (regression cell, **not** counted in the OOD headline)
- ‡ competitor on its own training distribution (best-case self-report)
- § Presidio benched on its own evaluator dataset (best-case self-report)
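The OOD headline can be reproduced from the table as the unweighted mean of the seven `nullpii` cells it names. A small check, with the values copied from the rows above:

```python
# `nullpii` column, OOD rows only (values copied from the benchmark table).
nullpii_ood = {
    "presidio-synthetic": 0.9137,
    "isotonic-en-heldout": 0.7197,
    "isotonic-de-heldout": 0.7297,
    "isotonic-fr-heldout": 0.7254,
    "isotonic-it-heldout": 0.7395,
    "ai4privacy-300k-heldout": 0.6966,
    "tab-echr": 0.9239,
}

ood_macro = sum(nullpii_ood.values()) / len(nullpii_ood)
print(f"{ood_macro:.4f}")  # 0.7784
```

The `nemotron-pii-test` and `nullpii-internal-bench` rows are left out, matching the dataset list in the headline.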
## Latency
M5 Pro CPU, Node 24, `nullpii` runtime full pipeline:
| Input | p50 | p95 | p99 |
|---:|---:|---:|---:|
| 100 chars | 23 ms | 25 ms | 27 ms |
| 1,000 chars | 95 ms | 113 ms | 114 ms |
| 10,000 chars | 938 ms | 972 ms | 1,122 ms |
Cold start (first `sanitize()`, ONNX load included): ~756 ms.
## When to pick which
- **`nullpii-bare`** — clean OOD splits, raw F1 priority, integrate directly via Python / your own runtime.
- **`nullpii` (npm full runtime)** — production LLM proxy. Token-shape PII (Stripe / IBAN / SSN / 50+ secret patterns), adversarial inputs (zero-width / base64 / URL-encoded), reversible-vault round-trip with session-bound placeholders. `npm i nullpii`.
## Schema (12 classes)
| Label | Examples | Source |
|---|---|---|
| `private_person` | names | model |
| `private_email` | emails | model + regex |
| `private_phone` | int'l + IT / FR / ES / HIPAA-fax domestic | model + regex |
| `private_address` | street, city, ZIP | model |
| `private_date` | birth / hire dates | model |
| `private_url` | `http(s)://`, `www.` | model + regex |
| `private_ip` | IPv4, IPv6 (RFC 1918 / 5737 / loopback filtered) | regex post-pass |
| `private_mac` | MAC addresses (broadcast / multicast filtered) | regex post-pass |
| `private_passport` | US / IT / FR / ES / DE / UK + context-anchored generic (30 countries) | model (zero-shot) + regex post-pass |
| `private_driver_license` | US per-state + IT / EU per-country (context-anchored) | model (zero-shot) + regex post-pass |
| `private_vehicle_id` | VIN (ISO 3779 mod-11), plates IT / FR / DE / UK / ES / US | model (zero-shot) + regex (validated) |
| `private_geolocation` | lat/lon decimal pairs (range-validated) + DMS notation | model (zero-shot) + regex (validated) |
| `account_number` | IBAN mod-97, cards (Luhn), SSN, MRN, BTC / ETH, DNI / CPF / CF / EIN, Medicare MBI / HIC, NPI, insurance policy, IMEI | model + regex (validated) |
| `secret` | API keys (AWS / GitHub / OpenAI / Anthropic / Stripe / 30+), JWT, PEM, base64-wrapped PII | regex (50+) + base64 |
The GLiNER head is trained on 8 categories: the first 6 rows (`private_person` through `private_url`) plus `account_number` and `secret`. Four more (`private_passport` / `private_driver_license` / `private_vehicle_id` / `private_geolocation`) are prompted zero-shot and paired with a validated regex post-pass, which gives the 12 model classes. `private_ip` / `private_mac` are regex-only (the model is not trained on them), so the table lists 14 labels in total.
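The "validated" regex entries pair a pattern match with a checksum, so that arbitrary digit runs are not flagged. Below is a minimal sketch of two such checks, IBAN mod-97 and Luhn for card numbers. The function names are hypothetical and this is not the recognizer pack's own code.

```python
def iban_mod97_ok(iban: str) -> bool:
    """IBAN check: move the first 4 chars to the end, map A..Z to 10..35, test mod 97 == 1."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)  # '0'-'9' stay, 'A' -> 10, ...
    return int(digits) % 97 == 1


def luhn_ok(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10 == 0."""
    digits = [int(c) for c in number if c in "0123456789"]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(iban_mod97_ok("IT60 X054 2811 1010 0000 0123 456"))  # True
print(iban_mod97_ok("IT60X0542811101000000123457"))        # False (last digit changed)
print(luhn_ok("4111 1111 1111 1111"))                       # True
```

Running the checksum after the regex match is what lets a pattern as loose as "two letters, two digits, then alphanumerics" stay precise in running prose.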
## Tricky inputs the npm runtime still catches
Where the adversarial-input preprocessor + recognizer pack pulls PII the bare model alone would miss:
| Surface | Input | Detected as |
|---|---|---|
| base64-wrapped secret | `(base64-encoded) c2stYW50LWFwaTAzLWFCY0RlRmcw…` | `sk-ant-api03-…` (Anthropic key) |
| HTML-entity-encoded secret | `sk-ant…` | `sk-ant-…` (Anthropic key) |
| double-URL-encoded email | `bob.jones%2540company.io` | `bob.jones@company.io` (email) |
| zero-width-obfuscated address | `221B Baker St`U+200B`re`U+200B`et `U+200B`London` | `221B Baker Street London` (address) |
| spaced-out email | `u s e r . 1 2 3 @ g m a i l . c o m` | `user.123@gmail.com` (email) |
| Cyrillic-homoglyph email | `pаyments@bank.com` (`а` = U+0430) | `payments@bank.com` (email) |
| fullwidth ASCII email | `USER.NAME@example.com` | `USER.NAME@example.com` (email) |
| Italian IBAN in prose | `IT60X0542811101000000123456` | `IT60X0542811101000000123456` (account_number, mod-97 verified) |
Five passes total: Unicode normalisation (NFKC + `any-ascii` transliteration), base64 decode-then-classify, iterative URL `%XX` + HTML-entity decode, zero-width strip with offset remap, 50+ validated regex pack.
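Three of these passes can be approximated with the Python standard library alone. The sketch below is an illustration, not the npm runtime's code: it skips the offset remap, the `any-ascii` transliteration (so Cyrillic homoglyphs are not folded) and spaced-PII handling, and the names `normalise`, `decode_base64_runs`, the 5-round cap and the 16-character minimum are all assumptions.

```python
import base64
import html
import re
import unicodedata
from urllib.parse import unquote

ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")  # assumed minimum run length


def normalise(text: str, max_rounds: int = 5) -> str:
    """NFKC-fold, strip zero-width characters, then undo nested %XX / HTML-entity encoding."""
    text = unicodedata.normalize("NFKC", text)
    text = ZERO_WIDTH.sub("", text)
    for _ in range(max_rounds):  # iterate so double-encoded input is fully unwrapped
        decoded = html.unescape(unquote(text))
        if decoded == text:
            break
        text = decoded
    return text


def decode_base64_runs(text: str) -> list[str]:
    """Return UTF-8 decodings of base64-looking runs, to be fed to the classifier."""
    out = []
    for m in B64_RUN.finditer(text):
        try:
            out.append(base64.b64decode(m.group(), validate=True).decode("utf-8"))
        except ValueError:  # bad padding or non-UTF-8 payload
            continue
    return out


print(normalise("bob.jones%2540company.io"))
print(normalise("221B Baker St\u200bre\u200bet \u200bLondon"))
print(decode_base64_runs("key: c2stYW50LWFwaTAzLWFCY0RlRmcw"))
```

Because this version discards character offsets, it can only tell you that PII is present in the normalised text; redacting the original string needs the offset remap that the runtime performs.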
## Intended use
- Pre-LLM PII redaction for prompts / RAG corpora / log scrubbing.
- Span-level PII tagging for batch redaction.
- Geographic scope: EU + Romance + English. Limited coverage outside.
## Out-of-scope
- **Implied / opinion-based attributes** (race, religion, health conditions, political views, sexual orientation). These need a different kind of model — this one only finds explicit text spans.
- **HIPAA PHI** — `account_number` catches MRN-shaped digit runs but the model is not a HIPAA de-identifier. Diagnoses, ICD codes, dosages, biometric / genetic identifiers — out of scope.
- **CJK / RTL / Indic scripts** — limited coverage; treat as out-of-scope.
- **Air-gapped first-run** — point at a local mirror via `NULLPII_MODEL_DIR` or `modelDir` config.
## Limitations
- Adversarial robustness comes from the npm runtime pipeline, not the model alone. The bare-model column does not include the adversarial preprocessor or the recognizer pack.
- Long inputs are chunked to fit the 512-token window (npm word-chunker: 140 words with 30 overlap; bare GLiNER chunker: 1400 chars with 200 overlap). Spans that appear in two overlapping chunks are deduplicated via IoU.
- `nullpii-internal-bench` is in-distribution for the project pipeline: treat it as a regression test, not an OOD claim.
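The character-based chunking and boundary dedupe can be sketched as follows. The 1400 / 200 figures come from the bare chunker described above; the 0.5 IoU threshold, the "keep the higher score" rule, the same-label condition, and the function names are assumptions for illustration only.

```python
def windows(n_chars: int, size: int = 1400, overlap: int = 200):
    """Return (start, end) character windows covering [0, n_chars) with the given overlap."""
    step = size - overlap
    out, start = [], 0
    while True:
        end = min(start + size, n_chars)
        out.append((start, end))
        if end >= n_chars:
            return out
        start += step


def iou(a, b) -> float:
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def dedupe(entities, threshold: float = 0.5):
    """Drop same-label spans that overlap an already-kept, higher-scoring span."""
    kept = []
    for ent in sorted(entities, key=lambda e: e["score"], reverse=True):
        span = (ent["start"], ent["end"])
        clash = any(
            k["label"] == ent["label"] and iou((k["start"], k["end"]), span) >= threshold
            for k in kept
        )
        if not clash:
            kept.append(ent)
    return sorted(kept, key=lambda e: e["start"])


print(windows(3000))  # [(0, 1400), (1200, 2600), (2400, 3000)]

# Offsets are document-level, i.e. already shifted by each window's start.
found = [
    {"start": 1300, "end": 1312, "label": "private_email", "score": 0.9},  # from window 0
    {"start": 1302, "end": 1312, "label": "private_email", "score": 0.6},  # from window 1
    {"start": 50, "end": 60, "label": "private_person", "score": 0.8},
]
print(dedupe(found))
```

Entities predicted inside a window carry window-local offsets, so each one needs its window's `start` added before `dedupe` is called.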
## How to use
### npm (production path)
```bash
npm install nullpii onnxruntime-node
```
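```ts
import { sanitize, restore, wrapForLLM } from 'nullpii';

// Replace PII with session-bound placeholders before the text leaves your process.
const safe = await sanitize('Email John Smith at john@acme.io about SSN 123-45-6789');
const prompt = wrapForLLM(safe, 'Translate to Italian');
// … LLM call: send `prompt`, receive `reply` …
// Swap the placeholders in the reply back to the original values.
const back = restore(reply, safe.sessionId);
```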
First call downloads the artifacts here into `~/.cache/nullpii/`. Pre-warm with `npx nullpii prefetch`.
### Python (bare model)
```python
from gliner import GLiNER

# Load the ONNX export from the Hub.
m = GLiNER.from_pretrained("lBroth/nullpii", load_onnx_model=True)

labels = [
    # trained categories
    "account_number", "private_address", "private_date", "private_email",
    "private_person", "private_phone", "private_url", "secret",
    # zero-shot prompted (recall lower; pair with regex pack in production)
    "private_passport", "private_driver_license",
    "private_vehicle_id", "private_geolocation",
]

entities = m.predict_entities("Email John at john@acme.io", labels, threshold=0.5)
for e in entities:
    print(e["label"], e["start"], e["end"], e["text"])
```
## License
Apache-2.0. Combined upstream attribution: Apache-2.0 (base model) + CC-BY-4.0 (Nemotron-PII derivative content — attribution required, see header). Commercial redistribution permitted subject to the Nemotron attribution.
## Citation
> nullpii contributors (2026). *nullpii — multilingual PII detection.* https://huggingface.co/lBroth/nullpii
Built on [`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1) (Zaratiana et al., NAACL 2024).