v0.3.0 model card — sync to 12-class schema + 9×16 bench + latency
README.md
CHANGED
|
@@ -26,48 +26,55 @@ pipeline_tag: token-classification
|
|
| 26 |
|
| 27 |
Multilingual PII detection. ONNX-exported GLiNER built on
|
| 28 |
[`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1)
|
| 29 |
-
(mDeBERTa-v3 base + GLiNER head, ~278M params). ~1.2 GB FP32.
|
| 30 |
-
|
| 31 |
|
| 32 |
-
**Hobby / experiment.**
|
| 33 |
-
roadmap commitments. If it helps you, great.
|
| 34 |
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
(Anthropic Messages API proxy).
|
| 38 |
-
Source: [github.com/lBroth/nullpii](https://github.com/lBroth/nullpii).
|
| 39 |
|
| 40 |
-
##
|
| 41 |
|
| 42 |
-
|
| 43 |
-
`private_date`, `private_email`, `private_person`, `private_phone`,
|
| 44 |
-
`private_url`, `secret`. Zero-shot prompted + regex post-pass (4):
|
| 45 |
-
`private_passport`, `private_driver_license`, `private_vehicle_id`,
|
| 46 |
-
`private_geolocation`. Pure regex (2): `private_ip`, `private_mac`.
|
| 47 |
|
| 48 |
-
|
| 49 |
-
|
| 50 |
|
| 51 |
-
|
| 52 |
|
| 53 |
-
|
| 54 |
-
depends on which runtime you pair it with:
|
| 55 |
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
|
| 61 |
-
|
| 62 |
-
`isotonic-{en,de,fr,it}-heldout`, `ai4privacy-300k-heldout`, `tab-echr`.
|
| 63 |
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
|
| 68 |
## Latency
|
| 69 |
|
| 70 |
-
M5 Pro CPU, Node 24, `nullpii` runtime:
|
| 71 |
|
| 72 |
| Input | p50 | p95 | p99 |
|
| 73 |
|---:|---:|---:|---:|
|
|
@@ -75,60 +82,80 @@ M5 Pro CPU, Node 24, `nullpii` runtime:
|
|
| 75 |
| 1,000 chars | 95 ms | 113 ms | 114 ms |
|
| 76 |
| 10,000 chars | 938 ms | 972 ms | 1,122 ms |
|
| 77 |
|
| 78 |
-
Cold start (first
|
| 79 |
|
| 80 |
-
|
| 81 |
|
| 82 |
```bash
|
| 83 |
npm install nullpii onnxruntime-node
|
| 84 |
```
|
| 85 |
|
| 86 |
```ts
|
| 87 |
-
import { sanitize, restore } from 'nullpii';
|
| 88 |
-
|
| 89 |
-
const safe = await sanitize('Email John Smith at john@acme.io');
|
| 90 |
-
// → 'Email {{PII_PRIVATE_PERSON_0_…}} at {{PII_PRIVATE_EMAIL_0_…}}'
|
| 91 |
-
|
| 92 |
-
// ... your LLM call (OpenAI, Anthropic, Gemini, anything) ...
|
| 93 |
|
| 94 |
const back = restore(reply, safe.sessionId);
|
| 95 |
-
// → original text restored
|
| 96 |
```
|
| 97 |
|
| 98 |
-
|
| 99 |
-
`~/.cache/nullpii/`. Pre-warm with `npx nullpii prefetch`.
|
| 100 |
|
| 101 |
-
##
|
| 102 |
|
| 103 |
-
```
|
| 104 |
-
# Python — minimum viable: tokenizer + ONNX inference + decoder
|
| 105 |
from gliner import GLiNER
|
| 106 |
-
m = GLiNER.from_pretrained(
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
)
|
| 111 |
```
|
| 112 |
|
| 113 |
-
##
|
| 114 |
-
|
| 115 |
-
- Detector misses (no model is 100% accurate).
|
| 116 |
-
- Not a HIPAA de-identifier — diagnoses, ICD codes, dosages,
|
| 117 |
-
biometric / genetic identifiers are out of scope.
|
| 118 |
-
- `private_ip` / `private_mac` come from the regex pack, not the model.
|
| 119 |
-
- Detection is best-effort. Defence in depth, not the sole privacy
|
| 120 |
-
control.
|
| 121 |
-
|
| 122 |
-
## Attribution
|
| 123 |
|
| 124 |
-
-
|
| 125 |
-
(GLiNER, Zaratiana et al., NAACL 2024) — Apache-2.0.
|
| 126 |
-
- mDeBERTa-v3 base — MIT.
|
| 127 |
-
- Includes Nemotron-PII derivative content — CC-BY-4.0.
|
| 128 |
|
| 129 |
-
|
| 130 |
-
attribution + license posture.
|
| 131 |
|
| 132 |
-
|
| 133 |
|
| 134 |
-
|
|
|
|
| 26 |
|
| 27 |
Multilingual PII detection. ONNX-exported GLiNER built on
|
| 28 |
[`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1)
|
| 29 |
+
(mDeBERTa-v3 base + GLiNER head, ~278M params). ~1.2 GB FP32. 12-class
|
| 30 |
+
span output.
|
| 31 |
|
| 32 |
+
🧪 **Hobby / experiment.** Nights-and-weekends project. No SLA.
|
|
|
|
| 33 |
|
| 34 |
+
Attribution: this model includes **NVIDIA Nemotron-PII (CC-BY-4.0)**
|
| 35 |
+
derivative content.
|
| 36 |
|
| 37 |
+
## Two F1 columns
|
| 38 |
|
| 39 |
+
The repo ships the raw ONNX + tokenizer. F1 depends on which runtime you pair it with.
|
| 40 |
|
| 41 |
+
| Mode | What it is | F1 leader |
|---|---|---|
| `nullpii-bare` | this ONNX + GLiNER decoder + 1400-char chunking. No post-processing. | clean OOD splits |
| `nullpii` (full runtime) | npm package: this model + 70-pattern recognizer pack (AWS / GitHub / Stripe / IBAN / SSN / …) + adversarial-input preprocessor (NFKC / any-ascii / URL `%XX` / HTML entity / zero-width / spaced PII) + base64 decoder + never-PII filter + reversible vault | adversarial / token-shape PII / production round-trip |
|
| 45 |
|
| 46 |
+
Both numbers are published below so that the model-vs-pipeline delta is explicit.
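To make the "adversarial-input preprocessor" row above more concrete, here is a minimal Python sketch of that kind of normalization: zero-width stripping, HTML-entity decoding, URL `%XX` decoding and NFKC folding. It is not the npm runtime's implementation. The function name `normalize_for_detection`, the step order and the zero-width character set are assumptions made for illustration, and a real redactor would also have to map offsets back to the original text, which this sketch ignores.

```python
import html
import re
import unicodedata
from urllib.parse import unquote

# Assumed set of zero-width / invisible characters; the real runtime may use a different list.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")


def normalize_for_detection(text: str) -> str:
    """Undo common obfuscations before running a PII detector (illustrative only)."""
    text = ZERO_WIDTH.sub("", text)             # drop zero-width characters
    text = html.unescape(text)                  # "&#64;" -> "@"
    text = unquote(text)                        # "%40"   -> "@"
    text = unicodedata.normalize("NFKC", text)  # fold full-width / compatibility forms
    return text


if __name__ == "__main__":
    for sample in ["john%40acme.io", "john&#64;acme.io", "jo\u200bhn@acme.io"]:
        print(repr(sample), "->", repr(normalize_for_detection(sample)))
```

Detection would then run on the normalized string, which is why the bare-model column cannot benefit from this step.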
|
| 47 |
|
| 48 |
+
## Benchmark
|
|
|
|
| 49 |
|
| 50 |
+
Mac M5 Pro CPU, single seed, macro F1 at IoU ≥ 0.5 partial-match span scoring. Cap 5,000 / dataset (less where the dataset is smaller). `--parallel-tools 1` fair-serial. Third-party tools run bare (no nullpii post-processing on competitor rows). Full matrix CSV: [`packages/eval/published-bench/matrix.csv`](https://github.com/lBroth/nullpii/blob/main/packages/eval/published-bench/matrix.csv). Run: `packages/eval/scripts/bench_full.py`.
|
| 51 |
+
|
| 52 |
+
v0.3.0 (M5 Pro CPU, full 9×16 matrix). OOD macro F1 for `nullpii` = **0.7784** (`presidio-synthetic` + `isotonic-{en,de,fr,it}-heldout` + `ai4privacy-300k-heldout` + `tab-echr`).
|
| 53 |
+
|
| 54 |
+
| Dataset | n | **`nullpii`** | **`nullpii-bare`** | `nemotron-pii-raw` | `gliner-pii-large-v1` | `gliner-onnx-pii-fp32` | `deberta` | `piiranha` | `presidio` | `opf` |
|---|---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| `presidio-synthetic` | 5,000 | **0.9137** | 0.8487 | 0.7154 | 0.6749 | 0.5254 | 0.5111 | 0.3853 | 0.5511 § | 0.6530 |
| `isotonic-en-heldout` | 1,900 | 0.7197 | 0.5969 | **0.7518** | 0.6662 | 0.5485 | 0.6224 | 0.4124 | 0.4472 | 0.4095 |
| `isotonic-de-heldout` | 2,400 | **0.7297** | 0.6191 | 0.7271 | 0.6325 | 0.5432 | 0.3969 | 0.4112 | 0.3859 | 0.4155 |
| `isotonic-fr-heldout` | 2,800 | 0.7254 | 0.6001 | **0.7276** | 0.6663 | 0.5393 | 0.4824 | 0.4172 | 0.4042 | 0.4257 |
| `isotonic-it-heldout` | 2,200 | **0.7395** | 0.6148 | 0.7273 | 0.6605 | 0.5519 | 0.4509 | 0.4176 | 0.4057 | 0.4420 |
| `ai4privacy-300k-heldout` | 5,000 | **0.6966** | 0.5241 | 0.6608 | 0.4306 | 0.5131 | 0.2183 | 0.3266 | 0.4882 | 0.4630 |
| `tab-echr` ⚠ | 127 | 0.9239 | **0.9275** | 0.6026 | 0.6346 | 0.6463 | 0.2908 | 0.3163 | 0.7761 | 0.4166 |
| `nemotron-pii-test` ⚠ | 5,000 | 0.8063 | 0.6814 | **0.9286** ‡ | 0.7675 | 0.7352 | 0.4153 | 0.3286 | 0.4236 | 0.4005 |
| `nullpii-internal-bench` ⚐ | 2,361 | **0.4228** | 0.3090 | 0.3065 | 0.2851 | 0.2936 | 0.1711 | 0.1669 | 0.1436 | 0.2488 |
|
| 65 |
|
| 66 |
+
Full 16-row matrix at [github.com/lBroth/nullpii/tree/main/packages/eval/published-bench](https://github.com/lBroth/nullpii/tree/main/packages/eval/published-bench).
|
|
|
|
| 67 |
|
| 68 |
+
Legend:
|
| 69 |
+
- **bold** = row max
|
| 70 |
+
- ⚠ training-distribution overlap with at least one competitor in the row
|
| 71 |
+
- ⚐ in-distribution for `nullpii` itself (regression cell, **not** counted in the OOD headline)
|
| 72 |
+
- ‡ competitor on its own training distribution (best-case self-report)
|
| 73 |
+
- § Presidio benched on its own evaluator dataset (best-case self-report)
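The scoring rule named in the benchmark description (macro F1 at IoU ≥ 0.5 with partial-match span scoring) can be sketched in Python as below. This is an illustration of the metric, not the code in `packages/eval/scripts/bench_full.py`. The greedy one-to-one matching, the `(start, end, label)` span layout and the function names are assumptions, so the published scorer may differ in details such as tie handling.

```python
# A span is (start, end, label) with an exclusive end offset (assumed layout).
Span = tuple[int, int, str]


def span_iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two character spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def macro_f1(gold: list[Span], pred: list[Span], threshold: float = 0.5) -> float:
    """Per-label F1 with IoU matching, averaged over every label seen in gold or pred."""
    labels = sorted({s[2] for s in gold} | {s[2] for s in pred})
    scores = []
    for label in labels:
        g = [s for s in gold if s[2] == label]
        p = [s for s in pred if s[2] == label]
        used: set[int] = set()
        tp = 0
        for ps in p:
            # Greedy matching: each gold span can satisfy at most one prediction.
            for i, gs in enumerate(g):
                if i not in used and span_iou(ps, gs) >= threshold:
                    used.add(i)
                    tp += 1
                    break
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    gold = [(6, 16, "private_person"), (20, 32, "private_email")]
    pred = [(6, 10, "private_person"), (20, 32, "private_email")]
    # The person prediction covers 4 of 10 chars (IoU 0.4), so it does not count as a match.
    print(macro_f1(gold, pred))
```

Because the average is taken per label, a rare class that is missed entirely pulls the score down as much as a frequent one.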
|
| 74 |
|
| 75 |
## Latency
|
| 76 |
|
| 77 |
+
M5 Pro CPU, Node 24, `nullpii` runtime full pipeline:
|
| 78 |
|
| 79 |
| Input | p50 | p95 | p99 |
|
| 80 |
|---:|---:|---:|---:|
|
|
|
|
| 82 |
| 1,000 chars | 95 ms | 113 ms | 114 ms |
|
| 83 |
| 10,000 chars | 938 ms | 972 ms | 1,122 ms |
|
| 84 |
|
| 85 |
+
Cold start (first `sanitize()`, ONNX load included): ~756 ms.
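For readers who want to reproduce p50 / p95 / p99 figures like the ones in the latency table on their own hardware, the sketch below shows one way to collect timings and take nearest-rank percentiles. It is an assumption-laden illustration: the published numbers come from the Node runtime, and the harness, warm-up policy and percentile definition used there are not stated in this card. `time_calls` and `percentile` are hypothetical helper names.

```python
import math
import time


def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def time_calls(fn, arg, runs: int = 100) -> list[float]:
    """Call fn(arg) `runs` times and return wall-clock durations in milliseconds."""
    timings = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(arg)
        timings.append((time.perf_counter() - t0) * 1000.0)
    return timings


if __name__ == "__main__":
    # Stand-in workload; replace `str.upper` with the detector call you want to measure.
    timings = time_calls(str.upper, "x" * 10_000, runs=200)
    for q in (50, 95, 99):
        print(f"p{q}: {percentile(timings, q):.3f} ms")
```

Discarding the first call keeps cold-start cost (model load) out of the steady-state percentiles, which is why the card reports it separately.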
|
| 86 |
+
|
| 87 |
+
## When to pick which
|
| 88 |
+
|
| 89 |
+
- **`nullpii-bare`** — clean OOD splits, raw F1 priority, integrate directly via Python / your own runtime.
|
| 90 |
+
- **`nullpii` (npm full runtime)** — production LLM proxy. Token-shape PII (Stripe / IBAN / SSN / 50+ secret patterns), adversarial inputs (zero-width / base64 / URL-encoded), reversible-vault round-trip with session-bound placeholders. `npm i nullpii`.
|
| 91 |
+
|
| 92 |
+
## Schema (12 classes)
|
| 93 |
+
|
| 94 |
+
ML-trained (8): `account_number` · `private_address` · `private_date` · `private_email` · `private_person` · `private_phone` · `private_url` · `secret`
|
| 95 |
+
|
| 96 |
+
Zero-shot prompted + regex post-pass (4): `private_passport` · `private_driver_license` · `private_vehicle_id` · `private_geolocation`
|
| 97 |
+
|
| 98 |
+
Pure regex post-pass (2, on top of the 12 model classes): `private_ip` · `private_mac` (the GLiNER head is not trained on them).
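As an illustration of what a pure-regex post-pass for `private_ip` and `private_mac` could look like, here is a small Python sketch. The patterns are assumptions written for this example (IPv4 only, colon- or hyphen-separated MAC addresses) and are not the ones shipped in the recognizer pack. The output dict layout mirrors GLiNER-style entities so the results could be merged with model spans.

```python
import ipaddress
import re

# Illustrative patterns only; the real recognizer pack may be broader (e.g. IPv6).
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
MAC = re.compile(r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b")


def regex_post_pass(text: str) -> list[dict]:
    """Return private_ip / private_mac spans found by regex, sorted by start offset."""
    spans = []
    for m in IPV4_CANDIDATE.finditer(text):
        try:
            ipaddress.IPv4Address(m.group())  # rejects octets above 255
        except ValueError:
            continue
        spans.append({"start": m.start(), "end": m.end(),
                      "text": m.group(), "label": "private_ip"})
    for m in MAC.finditer(text):
        spans.append({"start": m.start(), "end": m.end(),
                      "text": m.group(), "label": "private_mac"})
    return sorted(spans, key=lambda s: s["start"])


if __name__ == "__main__":
    for span in regex_post_pass("host 10.0.0.1 mac 00:1A:2B:3C:4D:5E bad 999.1.1.1"):
        print(span["label"], span["text"])
```

Validating candidates with `ipaddress` instead of a longer regex keeps the pattern readable while still dropping impossible addresses.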
|
| 99 |
+
|
| 100 |
+
## Intended use
|
| 101 |
+
|
| 102 |
+
- Pre-LLM PII redaction for prompts / RAG corpora / log scrubbing.
|
| 103 |
+
- Span-level PII tagging for batch redaction.
|
| 104 |
+
- Geographic scope: EU + Romance + English. Limited coverage outside.
|
| 105 |
+
|
| 106 |
+
## Out-of-scope
|
| 107 |
|
| 108 |
+
- **GDPR Article 9 special categories** (health, biometric, genetic, religious, political, trade-union, sexual orientation). Not represented in the schema.
|
| 109 |
+
- **HIPAA PHI** — `account_number` catches MRN-shaped digit runs but the model is not a HIPAA de-identifier.
|
| 110 |
+
- **CJK / RTL / Indic scripts** — limited coverage; treat as out-of-scope.
|
| 111 |
+
- **Air-gapped first-run** — point at a local mirror via `NULLPII_MODEL_DIR` or `modelDir` config.
|
| 112 |
+
|
| 113 |
+
## Limitations
|
| 114 |
+
|
| 115 |
+
- Adversarial robustness comes from the npm runtime pipeline, not the model alone. The bare-model column does not include the adversarial preprocessor or the recognizer pack.
|
| 116 |
+
- Long inputs are chunked to stay within the 512-token window (npm word-chunker: 140 words with 30 overlap; bare GLiNER chunker: 1400 chars with 200 overlap). Spans duplicated across chunk boundaries are deduplicated via IoU.
|
| 117 |
+
- `nullpii-internal-bench` is in-distribution for the project pipeline; treat it as a regression test, not an OOD claim.
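The chunking limitation above can be pictured with a short Python sketch: overlapping character windows plus IoU-based deduplication of spans found twice near a boundary. It assumes "1400 / 200 chars" means a 1400-character window with a 200-character overlap, and the 0.5 dedupe threshold, the keep-highest-score rule and the function names are all illustrative assumptions, not the project's actual chunker.

```python
def chunk_text(text: str, size: int = 1400, overlap: int = 200):
    """Yield (offset, chunk) windows of `size` chars, each overlapping the previous by `overlap`."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    start = 0
    while True:
        yield start, text[start:start + size]
        if start + size >= len(text):
            break
        start += step


def span_iou(a: dict, b: dict) -> float:
    """Intersection-over-union of two spans given as dicts with start/end offsets."""
    inter = max(0, min(a["end"], b["end"]) - max(a["start"], b["start"]))
    union = (a["end"] - a["start"]) + (b["end"] - b["start"]) - inter
    return inter / union if union else 0.0


def dedupe(spans: list[dict], threshold: float = 0.5) -> list[dict]:
    """Keep the highest-scoring span among same-label spans that overlap at IoU >= threshold.

    Spans must already be shifted into document coordinates (chunk offset added).
    """
    kept: list[dict] = []
    for s in sorted(spans, key=lambda s: -s["score"]):
        if not any(k["label"] == s["label"] and span_iou(k, s) >= threshold for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s["start"])


if __name__ == "__main__":
    print([offset for offset, _ in chunk_text("a" * 3000)])
```

A span that falls inside the overlap region is detected in two adjacent chunks, so without the dedupe step it would be reported twice.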
|
| 118 |
+
|
| 119 |
+
## How to use
|
| 120 |
+
|
| 121 |
+
### npm (production path)
|
| 122 |
|
| 123 |
```bash
npm install nullpii onnxruntime-node
```
|
| 126 |
|
| 127 |
```ts
import { sanitize, restore, wrapForLLM } from 'nullpii';

const safe = await sanitize('Email John Smith at john@acme.io about SSN 123-45-6789');
const prompt = wrapForLLM(safe, 'Translate to Italian');
// … LLM call: send `prompt` to your provider and keep the response text in `reply` …
const back = restore(reply, safe.sessionId);
```
|
| 135 |
|
| 136 |
+
The first call downloads the artifacts from this repo into `~/.cache/nullpii/`. Pre-warm with `npx nullpii prefetch`.
|
|
|
|
| 137 |
|
| 138 |
+
### Python (bare model)
|
| 139 |
|
| 140 |
+
```python
from gliner import GLiNER

# Load the ONNX export of the model from the Hub.
m = GLiNER.from_pretrained("lBroth/nullpii", load_onnx_model=True)
labels = [
    "account_number", "private_address", "private_date", "private_email",
    "private_person", "private_phone", "private_url", "secret",
    # zero-shot prompted (recall lower; pair with regex pack in production)
    "private_passport", "private_driver_license",
    "private_vehicle_id", "private_geolocation",
]
entities = m.predict_entities("Email John at john@acme.io", labels, threshold=0.5)
# Each entity is a dict with "start", "end", "text", "label" and "score".
for ent in entities:
    print(ent["label"], ent["text"])
```
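Once the bare model has produced spans, turning them into redacted text is a small step. The sketch below assumes entities shaped like GLiNER's `predict_entities` output (dicts with `start`, `end`, `text`, `label`, `score`) and assumes the spans do not overlap. The `[LABEL]` placeholder format and the `redact` helper are made up for this example and are not the session-bound placeholders the npm runtime uses. The sample entities are hand-written, not real model output.

```python
def redact(text: str, entities: list[dict]) -> str:
    """Replace each entity span with a [LABEL] placeholder (assumes non-overlapping spans)."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        out = out[:ent["start"]] + "[" + ent["label"].upper() + "]" + out[ent["end"]:]
    return out


if __name__ == "__main__":
    text = "Email John at john@acme.io"
    # Hand-written sample in the predict_entities layout; scores are placeholders.
    entities = [
        {"start": 6, "end": 10, "text": "John", "label": "private_person", "score": 0.9},
        {"start": 14, "end": 26, "text": "john@acme.io", "label": "private_email", "score": 0.9},
    ]
    print(redact(text, entities))
```

Unlike the reversible vault in the full runtime, this replacement is one-way: the original values are not stored anywhere.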
|
| 152 |
|
| 153 |
+
## License
|
| 154 |
|
| 155 |
+
Apache-2.0. Combined upstream attribution: Apache-2.0 (base model) + CC-BY-4.0 (Nemotron-PII derivative content — attribution required, see header). Commercial redistribution permitted subject to the Nemotron attribution.
|
| 156 |
|
| 157 |
+
## Citation
|
|
|
|
| 158 |
|
| 159 |
+
> nullpii contributors (2026). *nullpii — multilingual PII detection.* https://huggingface.co/lBroth/nullpii
|
| 160 |
|
| 161 |
+
Built on [`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1) (Zaratiana et al., NAACL 2024).
|