	---
	license: apache-2.0
	language:
	- en
	- de
	- fr
	- es
	- it
	- multilingual
	base_model:
	- urchade/gliner_multi_pii-v1
	library_name: gliner
	tags:
	- pii
	- privacy
	- ner
	- llm-safety
	- gdpr
	- pii-redaction
	- multilingual
	- onnx
	pipeline_tag: token-classification
	---

	# nullpii

	Multilingual PII detection. ONNX-exported GLiNER built on
	[`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1)
	(mDeBERTa-v3 base + GLiNER head, ~278M params). ~1.2 GB FP32. 12-class
	span output.

	🧪 Hobby / experiment. Nights-and-weekends project. No SLA.

	Attribution: this model includes NVIDIA Nemotron-PII (CC-BY-4.0)
	derivative content.

	## Two F1 columns

	The repo ships the raw ONNX + tokenizer. F1 depends on which runtime you pair it with.

| Mode | What it is | F1 leader |
|---|---|---|
| `nullpii-bare` | this ONNX + GLiNER decoder + 1400-char chunking. No post-processing. | clean OOD splits |
| `nullpii` (full runtime) | npm package: this model + 70-pattern recognizer pack (AWS / GitHub / Stripe / IBAN / SSN / …) + adversarial-input preprocessor (NFKC / any-ascii / URL `%XX` / HTML entity / zero-width / spaced PII) + base64 decoder + never-PII filter + reversible vault | adversarial / token-shape PII / production round-trip |

	Both numbers published below so the model-vs-pipeline delta is explicit.

	## Benchmark

	Mac M5 Pro CPU, single seed, macro F1 at IoU ≥ 0.5 partial-match span scoring. Cap 5,000 / dataset (less where the dataset is smaller). `--parallel-tools 1` fair-serial. Third-party tools run bare (no nullpii post-processing on competitor rows). Full matrix CSV: [`packages/eval/published-bench/matrix.csv`](https://github.com/lBroth/nullpii/blob/main/packages/eval/published-bench/matrix.csv). Run: `packages/eval/scripts/bench_full.py`.
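As an illustration of what "IoU ≥ 0.5 partial-match span scoring" can mean in practice, here is a minimal sketch. It is not the scorer in `bench_full.py`: the `Span` type, the greedy one-to-one matching, and the function names are assumptions made for this example.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def span_iou(a: Span, b: Span) -> float:
    """Character-level intersection-over-union of two spans."""
    inter = max(0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union else 0.0


def f1_at_iou(gold: list[Span], pred: list[Span], threshold: float = 0.5) -> float:
    """F1 where a prediction counts as a hit if it overlaps a not-yet-matched
    gold span of the same label with IoU >= threshold (greedy matching)."""
    unused = list(gold)
    tp = 0
    for p in pred:
        match = next(
            (g for g in unused if g.label == p.label and span_iou(g, p) >= threshold),
            None,
        )
        if match is not None:
            unused.remove(match)
            tp += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this reading, a prediction covering 6 of the 10 characters of a gold span (IoU 0.6) still counts, while one covering only 4 of them (IoU 0.4) does not.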

	v0.3.0 (M5 Pro CPU, full 9×16 matrix). OOD macro F1 for `nullpii` = 0.7784 (`presidio-synthetic` + `isotonic-{en,de,fr,it}-heldout` + `ai4privacy-300k-heldout` + `tab-echr`).

| Dataset | n | `nullpii` | `nullpii-bare` | `nemotron-pii-raw` | `gliner-pii-large-v1` | `gliner-onnx-pii-fp32` | `deberta` | `piiranha` | `presidio` | `opf` |
|---|---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| `presidio-synthetic` | 5,000 | **0.9137** | 0.8487 | 0.7154 | 0.6749 | 0.5254 | 0.5111 | 0.3853 | 0.5511 § | 0.6530 |
| `isotonic-en-heldout` | 1,900 | 0.7197 | 0.5969 | **0.7518** | 0.6662 | 0.5485 | 0.6224 | 0.4124 | 0.4472 | 0.4095 |
| `isotonic-de-heldout` | 2,400 | **0.7297** | 0.6191 | 0.7271 | 0.6325 | 0.5432 | 0.3969 | 0.4112 | 0.3859 | 0.4155 |
| `isotonic-fr-heldout` | 2,800 | 0.7254 | 0.6001 | **0.7276** | 0.6663 | 0.5393 | 0.4824 | 0.4172 | 0.4042 | 0.4257 |
| `isotonic-it-heldout` | 2,200 | **0.7395** | 0.6148 | 0.7273 | 0.6605 | 0.5519 | 0.4509 | 0.4176 | 0.4057 | 0.4420 |
| `ai4privacy-300k-heldout` | 5,000 | **0.6966** | 0.5241 | 0.6608 | 0.4306 | 0.5131 | 0.2183 | 0.3266 | 0.4882 | 0.4630 |
| `tab-echr` ⚠ | 127 | 0.9239 | **0.9275** | 0.6026 | 0.6346 | 0.6463 | 0.2908 | 0.3163 | 0.7761 | 0.4166 |
| `nemotron-pii-test` ⚠ | 5,000 | 0.8063 | 0.6814 | **0.9286** ‡ | 0.7675 | 0.7352 | 0.4153 | 0.3286 | 0.4236 | 0.4005 |
| `nullpii-internal-bench` ⚐ | 2,361 | **0.4228** | 0.3090 | 0.3065 | 0.2851 | 0.2936 | 0.1711 | 0.1669 | 0.1436 | 0.2488 |

	Full 16-row matrix at [github.com/lBroth/nullpii/tree/main/packages/eval/published-bench](https://github.com/lBroth/nullpii/tree/main/packages/eval/published-bench).

	Legend:
	- bold = row max
	- ⚠ training-distribution overlap with at least one competitor in the row
	- ⚐ in-distribution for `nullpii` itself (regression cell, not counted in the OOD headline)
	- ‡ competitor on its own training distribution (best-case self-report)
	- § Presidio benched on its own evaluator dataset (best-case self-report)
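The OOD headline for `nullpii` can be checked against the table. The sketch below assumes the headline is the unweighted mean of the seven OOD dataset scores (no weighting by `n`); that assumption reproduces the published figure to four decimals.

```python
# `nullpii` column, OOD rows only (the two ⚠/⚐ rows below tab-echr are excluded).
ood_f1 = {
    "presidio-synthetic": 0.9137,
    "isotonic-en-heldout": 0.7197,
    "isotonic-de-heldout": 0.7297,
    "isotonic-fr-heldout": 0.7254,
    "isotonic-it-heldout": 0.7395,
    "ai4privacy-300k-heldout": 0.6966,
    "tab-echr": 0.9239,
}

# Unweighted mean across datasets: 5.4485 / 7
ood_macro = sum(ood_f1.values()) / len(ood_f1)
print(f"{ood_macro:.4f}")  # 0.7784
```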

	## Latency

	M5 Pro CPU, Node 24, `nullpii` runtime full pipeline:

| Input | p50 | p95 | p99 |
|---:|---:|---:|---:|
| 100 chars | 23 ms | 25 ms | 27 ms |
| 1,000 chars | 95 ms | 113 ms | 114 ms |
| 10,000 chars | 938 ms | 972 ms | 1,122 ms |

	Cold start (first `sanitize()`, ONNX load included): ~756 ms.

	## When to pick which

	- `nullpii-bare` — clean OOD splits, raw F1 priority, integrate directly via Python / your own runtime.
	- `nullpii` (npm full runtime) — production LLM proxy. Token-shape PII (Stripe / IBAN / SSN / 50+ secret patterns), adversarial inputs (zero-width / base64 / URL-encoded), reversible-vault round-trip with session-bound placeholders. `npm i nullpii`.

	## Schema (12 classes)

| Label | Examples | Source |
|---|---|---|
| `private_person` | names | model |
| `private_email` | emails | model + regex |
| `private_phone` | int'l + IT / FR / ES / HIPAA-fax domestic | model + regex |
| `private_address` | street, city, ZIP | model |
| `private_date` | birth / hire dates | model |
| `private_url` | `http(s)://`, `www.` | model + regex |
| `private_ip` | IPv4, IPv6 (RFC 1918 / 5737 / loopback filtered) | regex post-pass |
| `private_mac` | MAC addresses (broadcast / multicast filtered) | regex post-pass |
| `private_passport` | US / IT / FR / ES / DE / UK + context-anchored generic (30 countries) | model (zero-shot) + regex post-pass |
| `private_driver_license` | US per-state + IT / EU per-country (context-anchored) | model (zero-shot) + regex post-pass |
| `private_vehicle_id` | VIN (ISO 3779 mod-11), plates IT / FR / DE / UK / ES / US | model (zero-shot) + regex (validated) |
| `private_geolocation` | lat/lon decimal pairs (range-validated) + DMS notation | model (zero-shot) + regex (validated) |
| `account_number` | IBAN mod-97, cards (Luhn), SSN, MRN, BTC / ETH, DNI / CPF / CF / EIN, Medicare MBI / HIC, NPI, insurance policy, IMEI | model + regex (validated) |
| `secret` | API keys (AWS / GitHub / OpenAI / Anthropic / Stripe / 30+), JWT, PEM, base64-wrapped PII | regex (50+) + base64 |

The GLiNER head is trained on 8 categories: the first 6 rows of the table plus `account_number` and `secret`. Another 4 (`private_passport`, `private_driver_license`, `private_vehicle_id`, `private_geolocation`) are prompted zero-shot and paired with a validated regex post-pass; together these make up the 12 model classes. `private_ip` and `private_mac` are regex-only (the model is not trained on them), which is why the table has 14 rows.
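"regex (validated)" in the table means a regex hit is only kept if a checksum also passes. The sketch below shows the two standard checks named in the `account_number` row, IBAN mod-97 and Luhn. The function names are hypothetical and this is not the recognizer pack's code; only the checksum algorithms themselves are standard.

```python
def iban_mod97_ok(iban: str) -> bool:
    """ISO 13616 check: move the first four characters to the end,
    map letters to 10..35, and require the number mod 97 to equal 1."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def luhn_ok(number: str) -> bool:
    """Luhn check used by payment card numbers: double every second digit
    from the right, subtract 9 from results above 9, sum must end in 0."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(iban_mod97_ok("IT60 X054 2811 1010 0000 0123 456"))  # True
print(luhn_ok("4111 1111 1111 1111"))                      # True
```

A checksum filter like this is what lets a pattern pack reject random digit runs that merely look like an IBAN or a card number.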

	## Tricky inputs the npm runtime still catches

	Where the adversarial-input preprocessor + recognizer pack pulls PII the bare model alone would miss:

| Surface | Input | Detected as |
|---|---|---|
| base64-wrapped secret | `(base64-encoded) c2stYW50LWFwaTAzLWFCY0RlRmcw…` | `sk-ant-api03-…` (Anthropic key) |
| HTML-entity-encoded secret | `sk-ant…` | `sk-ant-…` (Anthropic key) |
| double-URL-encoded email | `bob.jones%2540company.io` | `bob.jones@company.io` (email) |
| zero-width-obfuscated address | `221B Baker St`U+200B`re`U+200B`et `U+200B`London` | `221B Baker Street London` (address) |
| spaced-out email | `u s e r . 1 2 3 @ g m a i l . c o m` | `user.123@gmail.com` (email) |
| Cyrillic-homoglyph email | `pаyments@bank.com` (`а` = U+0430) | `payments@bank.com` (email) |
| fullwidth ASCII email | `ＵＳＥＲ．ＮＡＭＥ＠ｅｘａｍｐｌｅ．ｃｏｍ` | `USER.NAME@example.com` (email) |
| Italian IBAN in prose | `IT60X0542811101000000123456` | `IT60X0542811101000000123456` (account_number, mod-97 verified) |
	Five passes total: Unicode normalisation (NFKC + `any-ascii` transliteration), base64 decode-then-classify, iterative URL `%XX` + HTML-entity decode, zero-width strip with offset remap, 50+ validated regex pack.
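Three of these passes can be approximated with the Python standard library alone. The sketch below is an illustration of the idea, not the npm runtime's code: the function names, the zero-width character set, and the `max_rounds` cap are assumptions. It does not cover the `any-ascii` transliteration (so the Cyrillic-homoglyph row is out of its reach), the base64 pass, or the regex pack.

```python
import html
import unicodedata
from urllib.parse import unquote

# Assumed set of zero-width characters; the runtime's actual list may differ.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def strip_zero_width(text: str) -> tuple[str, list[int]]:
    """Drop zero-width characters. offsets[i] is the index in `text` of the
    i-th character of the cleaned string, so spans can be mapped back."""
    kept, offsets = [], []
    for i, ch in enumerate(text):
        if ch not in ZERO_WIDTH:
            kept.append(ch)
            offsets.append(i)
    return "".join(kept), offsets


def decode_layers(text: str, max_rounds: int = 5) -> str:
    """Repeat URL %XX and HTML-entity decoding until the text stops changing."""
    for _ in range(max_rounds):
        decoded = html.unescape(unquote(text))
        if decoded == text:
            break
        text = decoded
    return text


def normalise(text: str) -> str:
    """NFKC folds compatibility forms such as fullwidth ASCII to plain ASCII."""
    return unicodedata.normalize("NFKC", text)


print(decode_layers("bob.jones%2540company.io"))  # bob.jones@company.io
clean, offsets = strip_zero_width("221B Baker St\u200bre\u200bet \u200bLondon")
print(clean)  # 221B Baker Street London
```

The offset list is what makes the strip reversible: a span found at `clean[18:24]` ("London") maps back to `offsets[18] == 21` in the original string, so redaction can be applied to the text the user actually sent.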

	## Intended use

	- Pre-LLM PII redaction for prompts / RAG corpora / log scrubbing.
	- Span-level PII tagging for batch redaction.
	- Geographic scope: EU + Romance + English. Limited coverage outside.

	## Out-of-scope

	- Implied / opinion-based attributes (race, religion, health conditions, political views, sexual orientation). These need a different kind of model — this one only finds explicit text spans.
	- HIPAA PHI — `account_number` catches MRN-shaped digit runs but the model is not a HIPAA de-identifier. Diagnoses, ICD codes, dosages, biometric / genetic identifiers — out of scope.
	- CJK / RTL / Indic scripts — limited coverage; treat as out-of-scope.
	- Air-gapped first-run — point at a local mirror via `NULLPII_MODEL_DIR` or `modelDir` config.

	## Limitations

- Adversarial robustness comes from the npm runtime pipeline, not the model alone. The bare-model column does not include the adversarial preprocessor or the recognizer pack.
- Long inputs are chunked at 512-token boundaries (npm word chunker: 140 words with 30 overlap; bare GLiNER chunker: 1400 chars with 200 overlap). Spans duplicated across chunk boundaries are deduplicated via IoU.
- `nullpii-internal-bench` is in-distribution for the project pipeline; treat it as a regression test, not an OOD claim.
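To make the chunking limitation concrete, here is a minimal sketch of a 1400 / 200 character window with IoU-based deduplication of spans found twice in the overlap. The function names, the tuple layout, the 0.5 dedupe threshold, and the keep-highest-score rule are assumptions for illustration; the actual chunkers may differ.

```python
def chunk_chars(text: str, size: int = 1400, overlap: int = 200):
    """Yield (offset, chunk) windows of `size` chars, each sharing
    `overlap` chars with the previous window."""
    step = size - overlap
    start = 0
    while True:
        yield start, text[start:start + size]
        if start + size >= len(text):
            break
        start += step


def iou(a, b) -> float:
    """IoU of two spans given as tuples whose first two items are (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def dedupe(spans, threshold: float = 0.5):
    """spans: (start, end, label, score) tuples in document coordinates.
    Among same-label spans overlapping at IoU >= threshold, keep the best score."""
    kept = []
    for s in sorted(spans, key=lambda s: s[3], reverse=True):
        if all(k[2] != s[2] or iou(k, s) < threshold for k in kept):
            kept.append(s)
    return sorted(kept)


# A 3000-char input yields windows starting at 0, 1200 and 2400.
print([offset for offset, _ in chunk_chars("x" * 3000)])  # [0, 1200, 2400]
```

Spans predicted inside a chunk need their chunk offset added before deduplication, so that the same entity seen in two adjacent windows lands on the same document coordinates.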

	## How to use

	### npm (production path)

	```bash
	npm install nullpii onnxruntime-node
	```

	```ts
	import { sanitize, restore, wrapForLLM } from 'nullpii';

	const safe = await sanitize('Email John Smith at john@acme.io about SSN 123-45-6789');
	const prompt = wrapForLLM(safe, 'Translate to Italian');
	// … LLM call …
	const back = restore(reply, safe.sessionId);
	```

The first call downloads the artifacts from this repo into `~/.cache/nullpii/`. Pre-warm with `npx nullpii prefetch`.

	### Python (bare model)

```python
from gliner import GLiNER

# Load the ONNX export from the Hub.
model = GLiNER.from_pretrained("lBroth/nullpii", load_onnx_model=True)

labels = [
    # trained classes
    "account_number", "private_address", "private_date", "private_email",
    "private_person", "private_phone", "private_url", "secret",
    # zero-shot prompted (recall lower; pair with regex pack in production)
    "private_passport", "private_driver_license",
    "private_vehicle_id", "private_geolocation",
]

entities = model.predict_entities("Email John at john@acme.io", labels, threshold=0.5)
for ent in entities:
    print(ent["label"], ent["start"], ent["end"], ent["text"])
```
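The bare model only returns spans; turning them into redacted text is left to the caller. The sketch below is one possible way to do it, assuming entity dicts with `start`, `end`, and `label` keys. The `redact` helper and its placeholder format are hypothetical and unrelated to the npm runtime's vault.

```python
def redact(text: str, entities: list[dict]) -> tuple[str, dict[str, str]]:
    """Replace each span with a numbered placeholder. Returns the redacted
    text and a placeholder -> original mapping for later restoration."""
    mapping: dict[str, str] = {}
    out, cursor = [], 0
    for n, ent in enumerate(sorted(entities, key=lambda e: e["start"]), 1):
        if ent["start"] < cursor:  # skip spans overlapping one already replaced
            continue
        placeholder = f"[{ent['label'].upper()}_{n}]"
        out.append(text[cursor:ent["start"]])
        out.append(placeholder)
        mapping[placeholder] = text[ent["start"]:ent["end"]]
        cursor = ent["end"]
    out.append(text[cursor:])
    return "".join(out), mapping


# Hand-written spans standing in for model output.
example = [
    {"start": 6, "end": 10, "label": "private_person"},
    {"start": 14, "end": 26, "label": "private_email"},
]
redacted, mapping = redact("Email John at john@acme.io", example)
print(redacted)  # Email [PRIVATE_PERSON_1] at [PRIVATE_EMAIL_2]
```

Keeping the mapping outside the redacted text is what allows the original values to be restored after an LLM call without ever sending them.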

	## License

	Apache-2.0. Combined upstream attribution: Apache-2.0 (base model) + CC-BY-4.0 (Nemotron-PII derivative content — attribution required, see header). Commercial redistribution permitted subject to the Nemotron attribution.

	## Citation

	> nullpii contributors (2026). nullpii — multilingual PII detection. https://huggingface.co/lBroth/nullpii

	Built on [`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1) (Zaratiana et al., NAACL 2024).