nullpii

Multilingual PII detection. ONNX-exported GLiNER built on urchade/gliner_multi_pii-v1 (mDeBERTa-v3 base + GLiNER head, ~278M params). ~1.2 GB FP32. 12-class span output.

🧪 Hobby / experiment. Nights-and-weekends project. No SLA.

Attribution: this model includes NVIDIA Nemotron-PII (CC-BY-4.0) derivative content.

Two F1 columns

The repo ships the raw ONNX + tokenizer. F1 depends on which runtime you pair it with.

| Mode | What it is | F1 leader |
| --- | --- | --- |
| nullpii-bare | this ONNX + GLiNER decoder + 1400-char chunking. No post-processing. | clean OOD splits |
| nullpii (full runtime) | npm package: this model + 70-pattern recognizer pack (AWS / GitHub / Stripe / IBAN / SSN / …) + adversarial-input preprocessor (NFKC / any-ascii / URL %XX / HTML entity / zero-width / spaced PII) + base64 decoder + never-PII filter + reversible vault | adversarial / token-shape PII / production round-trip |

Both numbers are published below so the model-vs-pipeline delta is explicit.

Benchmark

Mac M5 Pro CPU, single seed, macro F1 at IoU ≥ 0.5 partial-match span scoring. Cap 5,000 / dataset (less where the dataset is smaller). --parallel-tools 1 fair-serial. Third-party tools run bare (no nullpii post-processing on competitor rows). Full matrix CSV: packages/eval/published-bench/matrix.csv. Run: packages/eval/scripts/bench_full.py.
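As a rough illustration of what "macro F1 at IoU ≥ 0.5" can mean for span scoring, the sketch below matches predicted spans to gold spans and averages per-label F1. It is one plausible reading, not the logic of bench_full.py: the greedy one-to-one matching, the averaging over labels, and the names `span_iou` and `macro_f1` are assumptions made for this example.

```python
def span_iou(a, b):
    """Intersection-over-union of two (start, end, ...) character spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def macro_f1(gold, pred, iou_min=0.5):
    """gold / pred are lists of (start, end, label) tuples.

    Assumption: a prediction counts as a true positive when it overlaps an
    unmatched gold span of the same label with IoU >= iou_min.
    """
    labels = sorted({g[2] for g in gold} | {p[2] for p in pred})
    scores = []
    for label in labels:
        unmatched = [g for g in gold if g[2] == label]
        predicted = [p for p in pred if p[2] == label]
        n_gold = len(unmatched)
        tp = 0
        for span in predicted:
            hit = next((g for g in unmatched if span_iou(span, g) >= iou_min), None)
            if hit is not None:
                unmatched.remove(hit)  # each gold span can be matched once
                tp += 1
        fp = len(predicted) - tp
        fn = n_gold - tp
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


gold = [(0, 10, "private_person"), (20, 40, "private_email")]
pred = [(0, 8, "private_person"), (100, 110, "private_email")]
print(macro_f1(gold, pred))
```

In this toy case the person span overlaps at IoU 0.8 and counts as a match, while the email prediction misses entirely, so the two per-label scores average out to one half.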

v0.3.0 (M5 Pro CPU, full 9×16 matrix). OOD macro F1 for nullpii = 0.7784 (presidio-synthetic + isotonic-{en,de,fr,it}-heldout + ai4privacy-300k-heldout + tab-echr).

| Dataset | n | nullpii | nullpii-bare | nemotron-pii-raw | gliner-pii-large-v1 | gliner-onnx-pii-fp32 | deberta | piiranha | presidio | opf |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| presidio-synthetic | 5,000 | **0.9137** | 0.8487 | 0.7154 | 0.6749 | 0.5254 | 0.5111 | 0.3853 | 0.5511 § | 0.6530 |
| isotonic-en-heldout | 1,900 | 0.7197 | 0.5969 | **0.7518** | 0.6662 | 0.5485 | 0.6224 | 0.4124 | 0.4472 | 0.4095 |
| isotonic-de-heldout | 2,400 | **0.7297** | 0.6191 | 0.7271 | 0.6325 | 0.5432 | 0.3969 | 0.4112 | 0.3859 | 0.4155 |
| isotonic-fr-heldout | 2,800 | 0.7254 | 0.6001 | **0.7276** | 0.6663 | 0.5393 | 0.4824 | 0.4172 | 0.4042 | 0.4257 |
| isotonic-it-heldout | 2,200 | **0.7395** | 0.6148 | 0.7273 | 0.6605 | 0.5519 | 0.4509 | 0.4176 | 0.4057 | 0.4420 |
| ai4privacy-300k-heldout | 5,000 | **0.6966** | 0.5241 | 0.6608 | 0.4306 | 0.5131 | 0.2183 | 0.3266 | 0.4882 | 0.4630 |
| tab-echr | 127 | 0.9239 | **0.9275** | 0.6026 | 0.6346 | 0.6463 | 0.2908 | 0.3163 | 0.7761 | 0.4166 |
| nemotron-pii-test | 5,000 | 0.8063 | 0.6814 | **0.9286** | 0.7675 | 0.7352 | 0.4153 | 0.3286 | 0.4236 | 0.4005 |
| nullpii-internal-bench | 2,361 | **0.4228** | 0.3090 | 0.3065 | 0.2851 | 0.2936 | 0.1711 | 0.1669 | 0.1436 | 0.2488 |

Full 16-row matrix at github.com/lBroth/nullpii/tree/main/packages/eval/published-bench.
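The OOD headline of 0.7784 can be reproduced from the nullpii column of the seven OOD rows listed in the v0.3.0 line. The snippet below is only that arithmetic, an unweighted mean over datasets; the variable names are illustrative.

```python
# nullpii column, OOD rows only (values copied from the benchmark table).
ood_f1 = {
    "presidio-synthetic": 0.9137,
    "isotonic-en-heldout": 0.7197,
    "isotonic-de-heldout": 0.7297,
    "isotonic-fr-heldout": 0.7254,
    "isotonic-it-heldout": 0.7395,
    "ai4privacy-300k-heldout": 0.6966,
    "tab-echr": 0.9239,
}

# Unweighted mean across datasets, so tab-echr (n = 127) counts as much
# as a 5,000-sample dataset.
ood_macro = sum(ood_f1.values()) / len(ood_f1)
print(round(ood_macro, 4))
```

Because the mean is unweighted, the small tab-echr split has the same influence on the headline as the 5,000-sample datasets.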

Legend:

  • bold = row max
  • ⚠ training-distribution overlap with at least one competitor in the row
  • ⚐ in-distribution for nullpii itself (regression cell, not counted in the OOD headline)
  • ‡ competitor on its own training distribution (best-case self-report)
  • § Presidio benched on its own evaluator dataset (best-case self-report)

Latency

M5 Pro CPU, Node 24, nullpii runtime full pipeline:

| Input | p50 | p95 | p99 |
| --- | --- | --- | --- |
| 100 chars | 23 ms | 25 ms | 27 ms |
| 1,000 chars | 95 ms | 113 ms | 114 ms |
| 10,000 chars | 938 ms | 972 ms | 1,122 ms |

Cold start (first sanitize(), ONNX load included): ~756 ms.

When to pick which

  • nullpii-bare — clean OOD splits, raw F1 priority, integrate directly via Python / your own runtime.
  • nullpii (npm full runtime) — production LLM proxy. Token-shape PII (Stripe / IBAN / SSN / 50+ secret patterns), adversarial inputs (zero-width / base64 / URL-encoded), reversible-vault round-trip with session-bound placeholders. npm i nullpii.

Schema (12 classes)

| Label | Examples | Source |
| --- | --- | --- |
| private_person | names | model |
| private_email | emails | model + regex |
| private_phone | int'l + IT / FR / ES / HIPAA-fax domestic | model + regex |
| private_address | street, city, ZIP | model |
| private_date | birth / hire dates | model |
| private_url | http(s)://, www. | model + regex |
| private_ip | IPv4, IPv6 (RFC 1918 / 5737 / loopback filtered) | regex post-pass |
| private_mac | MAC addresses (broadcast / multicast filtered) | regex post-pass |
| private_passport | US / IT / FR / ES / DE / UK + context-anchored generic (30 countries) | model (zero-shot) + regex post-pass |
| private_driver_license | US per-state + IT / EU per-country (context-anchored) | model (zero-shot) + regex post-pass |
| private_vehicle_id | VIN (ISO 3779 mod-11), plates IT / FR / DE / UK / ES / US | model (zero-shot) + regex (validated) |
| private_geolocation | lat/lon decimal pairs (range-validated) + DMS notation | model (zero-shot) + regex (validated) |
| account_number | IBAN mod-97, cards (Luhn), SSN, MRN, BTC / ETH, DNI / CPF / CF / EIN, Medicare MBI / HIC, NPI, insurance policy, IMEI | model + regex (validated) |
| secret | API keys (AWS / GitHub / OpenAI / Anthropic / Stripe / 30+), JWT, PEM, base64-wrapped PII | regex (50+) + base64 |

The GLiNER head is trained on 8 categories (the first 6 rows plus account_number and secret). The other 4 (private_passport / private_driver_license / private_vehicle_id / private_geolocation) are prompted zero-shot and paired with a validated regex post-pass, which gives the 12 model classes. private_ip and private_mac are regex-only (the model is not trained on them), so the table lists 14 labels in total.
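The "regex (validated)" entries pair a pattern match with a checksum, so that a digit run which merely looks like an IBAN or card number is rejected. Below is a minimal sketch of that idea for IBAN mod-97 and Luhn. It is not the project's recognizer pack; `IBAN_SHAPE`, `iban_ok`, and `luhn_ok` are hypothetical names, and the sample values are the widely published Italian example IBAN and a common test card number.

```python
import re

# Loose shape check only: country code, two check digits, 11-30 BBAN characters.
IBAN_SHAPE = re.compile(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}")


def iban_ok(candidate):
    """ISO 13616 mod-97 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), and expect remainder 1."""
    s = candidate.replace(" ", "").upper()
    if not IBAN_SHAPE.fullmatch(s):
        return False
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def luhn_ok(candidate):
    """Luhn check used for payment card numbers."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if len(digits) < 12:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(iban_ok("IT60 X054 2811 1010 0000 0123 456"))
print(luhn_ok("4111 1111 1111 1111"))
```

A production recognizer would also check per-country IBAN lengths; the sketch only shows why a checksum step cuts false positives from pattern-only matching.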

Tricky inputs the npm runtime still catches

Where the adversarial-input preprocessor + recognizer pack pulls PII the bare model alone would miss:

| Surface | Input | Detected as |
| --- | --- | --- |
| base64-wrapped secret | (base64-encoded) c2stYW50LWFwaTAzLWFCY0RlRmcw… | sk-ant-api03-… (Anthropic key) |
| HTML-entity-encoded secret | sk-ant… | sk-ant-… (Anthropic key) |
| double-URL-encoded email | bob.jones%2540company.io | bob.jones@company.io (email) |
| zero-width-obfuscated address | 221B Baker StU+200BreU+200Bet U+200BLondon | 221B Baker Street London (address) |
| spaced-out email | u s e r . 1 2 3 @ g m a i l . c o m | user.123@gmail.com (email) |
| Cyrillic-homoglyph email | pаyments@bank.com (а = U+0430) | payments@bank.com (email) |
| fullwidth ASCII email | USER.NAME@example.com | USER.NAME@example.com (email) |
| Italian IBAN in prose | IT60X0542811101000000123456 | IT60X0542811101000000123456 (account_number, mod-97 verified) |

Five passes total: Unicode normalisation (NFKC + any-ascii transliteration), base64 decode-then-classify, iterative URL %XX + HTML-entity decode, zero-width strip with offset remap, 50+ validated regex pack.
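Three of these passes can be sketched with the Python standard library alone: NFKC normalisation, zero-width stripping with an offset map, and iterative URL / HTML-entity decoding. This is an illustrative approximation, not the npm runtime's code; the function names, the zero-width character set, and the `max_rounds` cap are assumptions. Homoglyph transliteration (any-ascii), base64 decoding, and the regex pack are left out.

```python
import html
import unicodedata
from urllib.parse import unquote

# Assumed set of zero-width characters; the real runtime may cover more.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def strip_zero_width(text):
    """Drop zero-width characters. Also return, for every kept character,
    its index in the input so detected spans can be mapped back."""
    kept, offsets = [], []
    for i, ch in enumerate(text):
        if ch not in ZERO_WIDTH:
            kept.append(ch)
            offsets.append(i)
    return "".join(kept), offsets


def decode_layers(text, max_rounds=5):
    """Apply URL %XX and HTML-entity decoding until the text stops changing."""
    for _ in range(max_rounds):
        decoded = html.unescape(unquote(text))
        if decoded == text:
            break
        text = decoded
    return text


def preprocess(text):
    text = unicodedata.normalize("NFKC", text)  # folds fullwidth forms to ASCII
    text, _offsets = strip_zero_width(text)
    return decode_layers(text)


print(preprocess("bob.jones%2540company.io"))
print(preprocess("221B Baker St\u200bre\u200bet \u200bLondon"))
```

The decoding loop matters for the double-URL-encoded case: the first round turns %2540 into %40 and the second turns %40 into @. The cap on rounds keeps a pathological input from looping for long.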

Intended use

  • Pre-LLM PII redaction for prompts / RAG corpora / log scrubbing.
  • Span-level PII tagging for batch redaction.
  • Geographic scope: EU + Romance + English. Limited coverage outside.

Out-of-scope

  • Implied / opinion-based attributes (race, religion, health conditions, political views, sexual orientation). These need a different kind of model — this one only finds explicit text spans.
  • HIPAA PHI: account_number catches MRN-shaped digit runs, but the model is not a HIPAA de-identifier. Diagnoses, ICD codes, dosages, and biometric / genetic identifiers are out of scope.
  • CJK / RTL / Indic scripts — limited coverage; treat as out-of-scope.
  • Air-gapped first-run — point at a local mirror via NULLPII_MODEL_DIR or modelDir config.

Limitations

  • Adversarial robustness comes from the npm runtime pipeline, not the model alone. The bare-model column does not include the adversarial preprocessor or the recognizer pack.
  • Long inputs are chunked at 512-token boundaries (npm word-chunker: 140 words / 30 overlap; bare GLiNER chunker: 1400 / 200 chars). Spans duplicated across chunk boundaries are deduplicated via IoU.
  • nullpii-internal-bench is in-distribution for the project pipeline; treat it as a regression test, not an OOD claim.
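The character chunker and the IoU-based deduplication mentioned in the limitations can be sketched as follows. The 1400 / 200 figures come from the text above; everything else is assumed for illustration, including the 0.5 IoU threshold, the keep-highest-score rule, and the names `chunk_text` and `dedupe_spans`.

```python
def chunk_text(text, size=1400, overlap=200):
    """Split text into overlapping windows; return (offset, chunk) pairs."""
    step = size - overlap
    chunks, start = [], 0
    while True:
        chunks.append((start, text[start:start + size]))
        if start + size >= len(text):
            break
        start += step
    return chunks


def span_iou(a, b):
    """Intersection-over-union of two spans given as (start, end, ...)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def dedupe_spans(spans, threshold=0.5):
    """spans: (start, end, label, score) in document offsets.

    Assumption: among same-label spans overlapping at IoU >= threshold,
    only the highest-scoring one is kept.
    """
    kept = []
    for span in sorted(spans, key=lambda s: -s[3]):
        if all(span[2] != k[2] or span_iou(span, k) < threshold for k in kept):
            kept.append(span)
    return sorted(kept)


offsets = [offset for offset, _ in chunk_text("a" * 3000)]
print(offsets)
```

Spans predicted inside a chunk need the chunk's offset added to their start and end before deduplication, so that the same entity seen in two overlapping windows lands on the same document coordinates.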

How to use

npm (production path)

npm install nullpii onnxruntime-node

import { sanitize, restore, wrapForLLM } from 'nullpii';

const safe = await sanitize('Email John Smith at john@acme.io about SSN 123-45-6789');
const prompt = wrapForLLM(safe, 'Translate to Italian');
// … LLM call: send `prompt` to your model and keep its text response as `reply` …
const back = restore(reply, safe.sessionId);

The first call downloads the artifacts from this repository into ~/.cache/nullpii/. Pre-warm with npx nullpii prefetch.

Python (bare model)

from gliner import GLiNER
m = GLiNER.from_pretrained("lBroth/nullpii", load_onnx_model=True)
labels = [
    "account_number", "private_address", "private_date", "private_email",
    "private_person", "private_phone", "private_url", "secret",
    # zero-shot prompted (recall lower; pair with regex pack in production)
    "private_passport", "private_driver_license",
    "private_vehicle_id", "private_geolocation",
]
m.predict_entities("Email John at john@acme.io", labels, threshold=0.5)
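predict_entities returns a list of dicts that include start, end, and label, which is enough to build a simple reversible redaction on the Python side. The sketch below is not the npm runtime's vault: the placeholder format and the names `redact_spans` and `restore_spans` are made up for this example, and the entity list is hand-written to stand in for model output.

```python
def redact_spans(text, entities):
    """Replace each detected span with a numbered placeholder.

    Returns the redacted text and a placeholder -> original mapping.
    Overlapping spans are skipped after the first one (an assumption).
    """
    vault, counters = {}, {}
    pieces, cursor = [], 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["start"] < cursor:
            continue
        counters[ent["label"]] = counters.get(ent["label"], 0) + 1
        placeholder = f"[{ent['label'].upper()}_{counters[ent['label']]}]"
        vault[placeholder] = text[ent["start"]:ent["end"]]
        pieces.append(text[cursor:ent["start"]])
        pieces.append(placeholder)
        cursor = ent["end"]
    pieces.append(text[cursor:])
    return "".join(pieces), vault


def restore_spans(text, vault):
    """Put the original values back in place of their placeholders."""
    for placeholder, original in vault.items():
        text = text.replace(placeholder, original)
    return text


text = "Email John at john@acme.io"
# Hand-written stand-in for the model's output on `text`.
entities = [
    {"start": 6, "end": 10, "label": "private_person"},
    {"start": 14, "end": 26, "label": "private_email"},
]
redacted, vault = redact_spans(text, entities)
print(redacted)
print(restore_spans(redacted, vault))
```

Unlike the session-bound placeholders described for the npm runtime, this mapping lives in a plain dict, so it is only suitable for single-process batch jobs.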

License

Apache-2.0. Combined upstream attribution: Apache-2.0 (base model) + CC-BY-4.0 (Nemotron-PII derivative content — attribution required, see header). Commercial redistribution permitted subject to the Nemotron attribution.

Citation

nullpii contributors (2026). nullpii — multilingual PII detection. https://huggingface.co/lBroth/nullpii

Built on urchade/gliner_multi_pii-v1 (Zaratiana et al., NAACL 2024).
