nullpii
Multilingual PII detection. ONNX-exported GLiNER built on
urchade/gliner_multi_pii-v1
(mDeBERTa-v3 base + GLiNER head, ~278M params). ~1.2 GB FP32. 12-class
span output.
🧪 Hobby / experiment. Nights-and-weekends project. No SLA.
Attribution: this model includes NVIDIA Nemotron-PII (CC-BY-4.0) derivative content.
Two F1 columns
The repo ships the raw ONNX + tokenizer. F1 depends on which runtime you pair it with.
| Mode | What it is | F1 leader |
|---|---|---|
| nullpii-bare | this ONNX + GLiNER decoder + 1400-char chunking. No post-processing. | clean OOD splits |
| nullpii (full runtime) | npm package: this model + 70-pattern recognizer pack (AWS / GitHub / Stripe / IBAN / SSN / …) + adversarial-input preprocessor (NFKC / any-ascii / URL %XX / HTML entity / zero-width / spaced PII) + base64 decoder + never-PII filter + reversible vault | adversarial / token-shape PII / production round-trip |
Both numbers published below so the model-vs-pipeline delta is explicit.
Benchmark
Mac M5 Pro CPU, single seed, macro F1 at IoU ≥ 0.5 partial-match span scoring. Cap 5,000 / dataset (less where the dataset is smaller). --parallel-tools 1 fair-serial. Third-party tools run bare (no nullpii post-processing on competitor rows). Full matrix CSV: packages/eval/published-bench/matrix.csv. Run: packages/eval/scripts/bench_full.py.
v0.3.0 (M5 Pro CPU, full 9×16 matrix). OOD macro F1 for nullpii = 0.7784 (presidio-synthetic + isotonic-{en,de,fr,it}-heldout + ai4privacy-300k-heldout + tab-echr).
| Dataset | n | nullpii | nullpii-bare | nemotron-pii-raw | gliner-pii-large-v1 | gliner-onnx-pii-fp32 | deberta | piiranha | presidio | opf |
|---|---|---|---|---|---|---|---|---|---|---|
| presidio-synthetic | 5,000 | **0.9137** | 0.8487 | 0.7154 | 0.6749 | 0.5254 | 0.5111 | 0.3853 | 0.5511 § | 0.6530 |
| isotonic-en-heldout | 1,900 | 0.7197 | 0.5969 | **0.7518** | 0.6662 | 0.5485 | 0.6224 | 0.4124 | 0.4472 | 0.4095 |
| isotonic-de-heldout | 2,400 | **0.7297** | 0.6191 | 0.7271 | 0.6325 | 0.5432 | 0.3969 | 0.4112 | 0.3859 | 0.4155 |
| isotonic-fr-heldout | 2,800 | 0.7254 | 0.6001 | **0.7276** | 0.6663 | 0.5393 | 0.4824 | 0.4172 | 0.4042 | 0.4257 |
| isotonic-it-heldout | 2,200 | **0.7395** | 0.6148 | 0.7273 | 0.6605 | 0.5519 | 0.4509 | 0.4176 | 0.4057 | 0.4420 |
| ai4privacy-300k-heldout | 5,000 | **0.6966** | 0.5241 | 0.6608 | 0.4306 | 0.5131 | 0.2183 | 0.3266 | 0.4882 | 0.4630 |
| tab-echr ⚠ | 127 | 0.9239 | **0.9275** | 0.6026 | 0.6346 | 0.6463 | 0.2908 | 0.3163 | 0.7761 | 0.4166 |
| nemotron-pii-test ⚠ | 5,000 | 0.8063 | 0.6814 | **0.9286** ‡ | 0.7675 | 0.7352 | 0.4153 | 0.3286 | 0.4236 | 0.4005 |
| nullpii-internal-bench ⚐ | 2,361 | **0.4228** | 0.3090 | 0.3065 | 0.2851 | 0.2936 | 0.1711 | 0.1669 | 0.1436 | 0.2488 |
Full 16-row matrix at github.com/lBroth/nullpii/tree/main/packages/eval/published-bench.
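The OOD headline can be cross-checked against the table. The sketch below assumes the headline is the unweighted mean of the nullpii column over the seven OOD datasets listed in the benchmark note; the averaging method is an assumption, not something the bench script is confirmed to do, although the arithmetic agrees with the published figure to four decimals.

```python
from statistics import mean

# nullpii column, copied from the seven OOD rows of the benchmark table.
ood_f1 = {
    "presidio-synthetic": 0.9137,
    "isotonic-en-heldout": 0.7197,
    "isotonic-de-heldout": 0.7297,
    "isotonic-fr-heldout": 0.7254,
    "isotonic-it-heldout": 0.7395,
    "ai4privacy-300k-heldout": 0.6966,
    "tab-echr": 0.9239,
}

# Unweighted (per-dataset) mean: every dataset counts equally, regardless of n.
ood_macro = mean(ood_f1.values())
print(f"{ood_macro:.4f}")  # 0.7784
```

nemotron-pii-test and nullpii-internal-bench are left out, matching the dataset list given for the headline.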
Legend:
- bold = row max
- ⚠ training-distribution overlap with at least one competitor in the row
- ⚐ in-distribution for nullpii itself (regression cell, not counted in the OOD headline)
- ‡ competitor on its own training distribution (best-case self-report)
- § Presidio benched on its own evaluator dataset (best-case self-report)
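For readers unfamiliar with partial-match span scoring, here is a minimal sketch of macro F1 at IoU ≥ 0.5 on character spans. It is not the code in bench_full.py: the greedy one-to-one matching, the per-label macro averaging, and the names `span_iou` and `macro_f1` are all assumptions made for illustration.

```python
from collections import defaultdict


def span_iou(a, b):
    """Intersection-over-union of two half-open character spans (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def macro_f1(gold, pred, threshold=0.5):
    """gold / pred are lists of (start, end, label) tuples.

    A prediction is a true positive when it has the same label as a
    not-yet-matched gold span and their IoU reaches the threshold.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    unmatched = list(gold)
    for p in pred:
        hit = next(
            (g for g in unmatched
             if g[2] == p[2] and span_iou(g[:2], p[:2]) >= threshold),
            None,
        )
        if hit is None:
            fp[p[2]] += 1
        else:
            tp[p[2]] += 1
            unmatched.remove(hit)
    for g in unmatched:
        fn[g[2]] += 1

    labels = set(tp) | set(fp) | set(fn)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    # Macro average: each label contributes equally.
    return sum(scores) / len(scores) if scores else 0.0


# "Email John Smith at john@acme.io": the name is 6..16, the email 20..32.
gold = [(6, 16, "private_person"), (20, 32, "private_email")]
partial = [(6, 12, "private_person"), (20, 32, "private_email")]   # IoU 0.6
too_short = [(6, 10, "private_person"), (20, 32, "private_email")]  # IoU 0.4
print(macro_f1(gold, partial), macro_f1(gold, too_short))
```

Under this scheme a prediction that covers only part of a gold span still counts as long as the overlap is at least half of the combined extent.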
Latency
M5 Pro CPU, Node 24, nullpii runtime full pipeline:
| Input | p50 | p95 | p99 |
|---|---|---|---|
| 100 chars | 23 ms | 25 ms | 27 ms |
| 1,000 chars | 95 ms | 113 ms | 114 ms |
| 10,000 chars | 938 ms | 972 ms | 1,122 ms |
Cold start (first sanitize(), ONNX load included): ~756 ms.
When to pick which
- nullpii-bare: clean OOD splits, raw F1 priority, integrate directly via Python / your own runtime.
- nullpii (npm full runtime): production LLM proxy. Token-shape PII (Stripe / IBAN / SSN / 50+ secret patterns), adversarial inputs (zero-width / base64 / URL-encoded), reversible-vault round-trip with session-bound placeholders. Install with npm i nullpii.
Schema (12 classes)
| Label | Examples | Source |
|---|---|---|
| private_person | names | model |
| private_email | emails | model + regex |
| private_phone | int'l + IT / FR / ES / HIPAA-fax domestic | model + regex |
| private_address | street, city, ZIP | model |
| private_date | birth / hire dates | model |
| private_url | http(s)://, www. | model + regex |
| private_ip | IPv4, IPv6 (RFC 1918 / 5737 / loopback filtered) | regex post-pass |
| private_mac | MAC addresses (broadcast / multicast filtered) | regex post-pass |
| private_passport | US / IT / FR / ES / DE / UK + context-anchored generic (30 countries) | model (zero-shot) + regex post-pass |
| private_driver_license | US per-state + IT / EU per-country (context-anchored) | model (zero-shot) + regex post-pass |
| private_vehicle_id | VIN (ISO 3779 mod-11), plates IT / FR / DE / UK / ES / US | model (zero-shot) + regex (validated) |
| private_geolocation | lat/lon decimal pairs (range-validated) + DMS notation | model (zero-shot) + regex (validated) |
| account_number | IBAN mod-97, cards (Luhn), SSN, MRN, BTC / ETH, DNI / CPF / CF / EIN, Medicare MBI / HIC, NPI, insurance policy, IMEI | model + regex (validated) |
| secret | API keys (AWS / GitHub / OpenAI / Anthropic / Stripe / 30+), JWT, PEM, base64-wrapped PII | regex (50+) + base64 |
The GLiNER head is trained on 8 categories (the first 6 rows of the table + account_number + secret). The other 4 (private_passport / private_driver_license / private_vehicle_id / private_geolocation) are prompted zero-shot and paired with a validated regex post-pass. private_ip / private_mac are regex-only: the model is not trained on them, and they are not among the 12 model labels.
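"Validated" regex means a pattern match is only kept when its checksum passes. The sketch below shows the two checks named in the account_number row, IBAN mod-97 and Luhn, using only the standard library. The function names are hypothetical, and this is not the recognizer pack's code. The sample IBAN is the Italian example from the IBAN registry.

```python
def iban_mod97_ok(iban: str) -> bool:
    """ISO 13616 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), and require value % 97 == 1."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def luhn_ok(number: str) -> bool:
    """Luhn check for card-like numbers: double every second digit from the
    right, subtract 9 from results above 9, and require sum % 10 == 0."""
    s = number.replace(" ", "").replace("-", "")
    if len(s) < 2 or not (s.isascii() and s.isdigit()):
        return False
    total = 0
    for i, ch in enumerate(reversed(s)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(iban_mod97_ok("IT60 X054 2811 1010 0000 0123 456"))  # True
print(luhn_ok("4111 1111 1111 1111"))                      # True
```

A checksum gate like this trades a little recall (mistyped numbers are dropped) for far fewer false positives on arbitrary digit runs.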
Tricky inputs the npm runtime still catches
Where the adversarial-input preprocessor + recognizer pack pulls PII the bare model alone would miss:
| Surface | Input | Detected as |
|---|---|---|
| base64-wrapped secret | (base64-encoded) c2stYW50LWFwaTAzLWFCY0RlRmcw… | sk-ant-api03-… (Anthropic key) |
| HTML-entity-encoded secret | sk-ant… | sk-ant-… (Anthropic key) |
| double-URL-encoded email | bob.jones%2540company.io | bob.jones@company.io (email) |
| zero-width-obfuscated address | 221B Baker StU+200BreU+200Bet U+200BLondon | 221B Baker Street London (address) |
| spaced-out email | u s e r . 1 2 3 @ g m a i l . c o m | user.123@gmail.com (email) |
| Cyrillic-homoglyph email | pаyments@bank.com (а = U+0430) | payments@bank.com (email) |
| fullwidth ASCII email | USER.NAME@example.com | USER.NAME@example.com (email) |
| Italian IBAN in prose | IT60X0542811101000000123456 | IT60X0542811101000000123456 (account_number, mod-97 verified) |
Five passes total: Unicode normalisation (NFKC + any-ascii transliteration), base64 decode-then-classify, iterative URL %XX + HTML-entity decode, zero-width strip with offset remap, 50+ validated regex pack.
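Three of these passes can be approximated with the Python standard library. The sketch below is an illustration under stated assumptions, not the npm runtime's code: the function names, the zero-width character set, and the round limit are hypothetical, and the any-ascii transliteration, base64, and regex passes are left out.

```python
import html
import unicodedata
from urllib.parse import unquote

# Assumed set of zero-width characters; the runtime's list may differ.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def normalize(text):
    """NFKC folds compatibility forms such as fullwidth letters to ASCII."""
    return unicodedata.normalize("NFKC", text)


def decode_layers(text, max_rounds=5):
    """Repeat URL %XX and HTML-entity decoding until the text stops changing."""
    for _ in range(max_rounds):
        decoded = html.unescape(unquote(text))
        if decoded == text:
            break
        text = decoded
    return text


def strip_zero_width(text):
    """Drop zero-width characters; offsets[i] is the original index of clean[i]."""
    kept = [(i, ch) for i, ch in enumerate(text) if ch not in ZERO_WIDTH]
    clean = "".join(ch for _, ch in kept)
    offsets = [i for i, _ in kept]
    return clean, offsets


def remap_span(start, end, offsets):
    """Map a non-empty half-open span on the cleaned text back to the original."""
    return offsets[start], offsets[end - 1] + 1


raw = "221B Baker St\u200bre\u200bet \u200bLondon"
clean, offsets = strip_zero_width(raw)
start = clean.index("Street")
print(clean, remap_span(start, start + len("Street"), offsets))
print(decode_layers("bob.jones%2540company.io"))
```

The offset list is what lets a span found on the cleaned text be redacted in the original string. Note that NFKC alone does not fold Cyrillic homoglyphs, which is why a separate transliteration step is needed for that row of the table.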
Intended use
- Pre-LLM PII redaction for prompts / RAG corpora / log scrubbing.
- Span-level PII tagging for batch redaction.
- Geographic scope: EU + Romance + English. Limited coverage outside.
Out-of-scope
- Implied / opinion-based attributes (race, religion, health conditions, political views, sexual orientation). These need a different kind of model — this one only finds explicit text spans.
- HIPAA PHI: account_number catches MRN-shaped digit runs but the model is not a HIPAA de-identifier. Diagnoses, ICD codes, dosages, biometric / genetic identifiers are out of scope.
- CJK / RTL / Indic scripts: limited coverage; treat as out-of-scope.
- Air-gapped first-run: point at a local mirror via NULLPII_MODEL_DIR or modelDir config.
Limitations
- Adversarial robustness comes from the npm runtime pipeline, not the model alone. The bare-model column does not include the adversarial preprocessor or the recognizer pack.
- Long-input chunking at 512-token boundaries (npm word-chunker 140 words / 30 overlap; bare GLiNER chunker 1400 / 200 chars). Boundary spans dedupe via IoU.
- nullpii-bench is in-distribution for the project pipeline: treat as regression test, not OOD claim.
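The character chunker and IoU dedupe mentioned in the limitations can be sketched as follows. The 1400 / 200 figures come from this card; the function names, the dedupe threshold of 0.5, and the "highest score wins" rule are assumptions for illustration, not the shipped behaviour.

```python
def chunk_text(text, size=1400, overlap=200):
    """Return (offset, chunk) pairs; consecutive chunks share `overlap` chars."""
    step = size - overlap
    chunks, start = [], 0
    while True:
        chunks.append((start, text[start:start + size]))
        if start + size >= len(text):
            break
        start += step
    return chunks


def span_iou(a, b):
    """Intersection-over-union of two half-open character spans (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def dedupe(spans, threshold=0.5):
    """spans: dicts with start, end, label, score, already shifted by the
    chunk offset so positions refer to the full text. For same-label spans
    that overlap at or above the threshold, keep only the best-scoring one."""
    kept = []
    for s in sorted(spans, key=lambda s: s["score"], reverse=True):
        clash = any(
            k["label"] == s["label"]
            and span_iou((k["start"], k["end"]), (s["start"], s["end"])) >= threshold
            for k in kept
        )
        if not clash:
            kept.append(s)
    return sorted(kept, key=lambda s: s["start"])


chunks = chunk_text("x" * 3000)
print([(off, len(chunk)) for off, chunk in chunks])
```

A span that falls inside an overlap region is predicted twice, once per chunk, which is why the dedupe step is needed after offsets are shifted back to full-text positions.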
How to use
npm (production path)
npm install nullpii onnxruntime-node
import { sanitize, restore, wrapForLLM } from 'nullpii';
const safe = await sanitize('Email John Smith at john@acme.io about SSN 123-45-6789');
const prompt = wrapForLLM(safe, 'Translate to Italian');
// … LLM call …
const back = restore(reply, safe.sessionId);
The first call downloads this repository's artifacts into ~/.cache/nullpii/. Pre-warm with npx nullpii prefetch.
Python (bare model)
from gliner import GLiNER
m = GLiNER.from_pretrained("lBroth/nullpii", load_onnx_model=True)
labels = [
    "account_number", "private_address", "private_date", "private_email",
    "private_person", "private_phone", "private_url", "secret",
    # zero-shot prompted (recall lower; pair with regex pack in production)
    "private_passport", "private_driver_license",
    "private_vehicle_id", "private_geolocation",
]
m.predict_entities("Email John at john@acme.io", labels, threshold=0.5)
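GLiNER's predict_entities returns a list of dicts carrying start, end, text, label and score. The sketch below turns such a list into a redacted string plus a mapping that can undo the redaction. It is a hypothetical helper, not the npm runtime's vault: `redact_spans`, `restore_spans`, and the placeholder format are assumptions, and it expects non-overlapping spans. The entities are written by hand so the block runs without downloading the model.

```python
import secrets


def redact_spans(text, entities):
    """Replace each entity span with a placeholder; return (redacted, vault)."""
    session = secrets.token_hex(4)  # ties placeholders to this call
    vault = {}
    out = text
    # Splice right-to-left so earlier offsets stay valid.
    ordered = sorted(entities, key=lambda e: e["start"], reverse=True)
    for n, ent in enumerate(ordered):
        placeholder = f"[{ent['label'].upper()}_{session}_{n}]"
        vault[placeholder] = text[ent["start"]:ent["end"]]
        out = out[:ent["start"]] + placeholder + out[ent["end"]:]
    return out, vault


def restore_spans(text, vault):
    """Put the original values back wherever a placeholder survives."""
    for placeholder, original in vault.items():
        text = text.replace(placeholder, original)
    return text


text = "Email John at john@acme.io"
# Hand-written stand-in for the model output (same keys as predict_entities).
entities = [
    {"start": 6, "end": 10, "label": "private_person"},
    {"start": 14, "end": 26, "label": "private_email"},
]
redacted, vault = redact_spans(text, entities)
print(redacted)
print(restore_spans(redacted, vault))
```

In practice the bare model's spans would first be merged with any regex hits and deduplicated before redaction.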
License
Apache-2.0. Combined upstream attribution: Apache-2.0 (base model) + CC-BY-4.0 (Nemotron-PII derivative content — attribution required, see header). Commercial redistribution permitted subject to the Nemotron attribution.
Citation
nullpii contributors (2026). nullpii — multilingual PII detection. https://huggingface.co/lBroth/nullpii
Built on urchade/gliner_multi_pii-v1 (Zaratiana et al., NAACL 2024).