---
license: apache-2.0
language:
  - en
  - de
  - fr
  - es
  - it
  - multilingual
base_model:
  - urchade/gliner_multi_pii-v1
library_name: gliner
tags:
  - pii
  - privacy
  - ner
  - llm-safety
  - gdpr
  - pii-redaction
  - multilingual
  - onnx
pipeline_tag: token-classification
---

# nullpii

Multilingual PII detection. ONNX-exported GLiNER built on
[`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1)
(mDeBERTa-v3 base + GLiNER head, ~278M params). ~1.2 GB FP32. 12-class
span output.

🧪 **Hobby / experiment.** Nights-and-weekends project. No SLA.

Attribution: this model includes **NVIDIA Nemotron-PII (CC-BY-4.0)**
derivative content.

## Two F1 columns

The repo ships the raw ONNX + tokenizer. F1 depends on which runtime you pair it with.

| Mode | What it is | F1 leader |
|---|---|---|
| `nullpii-bare` | this ONNX + GLiNER decoder + 1400-char chunking. No post-processing. | clean OOD splits |
| `nullpii` (full runtime) | npm package: this model + 70-pattern recognizer pack (AWS / GitHub / Stripe / IBAN / SSN / …) + adversarial-input preprocessor (NFKC / any-ascii / URL `%XX` / HTML entity / zero-width / spaced PII) + base64 decoder + never-PII filter + reversible vault | adversarial / token-shape PII / production round-trip |

Both numbers published below so the model-vs-pipeline delta is explicit.

## Benchmark

Mac M5 Pro CPU, single seed, macro F1 at IoU ≥ 0.5 partial-match span scoring. Cap 5,000 / dataset (less where the dataset is smaller). `--parallel-tools 1` fair-serial. Third-party tools run bare (no nullpii post-processing on competitor rows). Full matrix CSV: [`packages/eval/published-bench/matrix.csv`](https://github.com/lBroth/nullpii/blob/main/packages/eval/published-bench/matrix.csv). Run: `packages/eval/scripts/bench_full.py`.
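As a rough illustration of "IoU ≥ 0.5 partial-match span scoring", the sketch below counts a predicted span as correct when it shares a label with a gold span and their character-level intersection-over-union reaches the threshold. The greedy one-to-one matching and the helper names (`span_iou`, `match_counts`, `f1`) are assumptions made for this example; the exact rule lives in the eval script and may differ.

```python
def span_iou(a, b):
    """Character-level IoU of two half-open (start, end) spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_counts(pred, gold, threshold=0.5):
    """Greedy one-to-one matching of (start, end, label) spans.

    Returns (true_positives, false_positives, false_negatives).
    """
    unmatched = list(gold)
    tp = 0
    for p in pred:
        hit = next(
            (g for g in unmatched
             if g[2] == p[2] and span_iou(p[:2], g[:2]) >= threshold),
            None,
        )
        if hit is not None:
            unmatched.remove(hit)  # each gold span can be matched once
            tp += 1
    return tp, len(pred) - tp, len(unmatched)


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# "Email John Smith at john@acme.io": the name prediction covers only "John".
gold = [(6, 16, "private_person"), (20, 32, "private_email")]
pred = [(6, 10, "private_person"), (20, 32, "private_email")]
print(match_counts(pred, gold))      # (1, 1, 1)
print(f1(*match_counts(pred, gold)))  # 0.5
```

The truncated name has IoU 0.4 with the gold span, so it is scored as one false positive plus one false negative. Macro F1 would then average the per-label F1 values.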

v0.3.0 (M5 Pro CPU, full 9×16 matrix). OOD macro F1 for `nullpii` = **0.7784** (`presidio-synthetic` + `isotonic-{en,de,fr,it}-heldout` + `ai4privacy-300k-heldout` + `tab-echr`).

| Dataset | n | **`nullpii`** | **`nullpii-bare`** | `nemotron-pii-raw` | `gliner-pii-large-v1` | `gliner-onnx-pii-fp32` | `deberta` | `piiranha` | `presidio` | `opf` |
|---|---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| `presidio-synthetic` | 5,000 | **0.9137** | 0.8487 | 0.7154 | 0.6749 | 0.5254 | 0.5111 | 0.3853 | 0.5511 § | 0.6530 |
| `isotonic-en-heldout` | 1,900 | 0.7197 | 0.5969 | **0.7518** | 0.6662 | 0.5485 | 0.6224 | 0.4124 | 0.4472 | 0.4095 |
| `isotonic-de-heldout` | 2,400 | **0.7297** | 0.6191 | 0.7271 | 0.6325 | 0.5432 | 0.3969 | 0.4112 | 0.3859 | 0.4155 |
| `isotonic-fr-heldout` | 2,800 | 0.7254 | 0.6001 | **0.7276** | 0.6663 | 0.5393 | 0.4824 | 0.4172 | 0.4042 | 0.4257 |
| `isotonic-it-heldout` | 2,200 | **0.7395** | 0.6148 | 0.7273 | 0.6605 | 0.5519 | 0.4509 | 0.4176 | 0.4057 | 0.4420 |
| `ai4privacy-300k-heldout` | 5,000 | **0.6966** | 0.5241 | 0.6608 | 0.4306 | 0.5131 | 0.2183 | 0.3266 | 0.4882 | 0.4630 |
| `tab-echr` ⚠ | 127 | 0.9239 | **0.9275** | 0.6026 | 0.6346 | 0.6463 | 0.2908 | 0.3163 | 0.7761 | 0.4166 |
| `nemotron-pii-test` ⚠ | 5,000 | 0.8063 | 0.6814 | **0.9286** ‡ | 0.7675 | 0.7352 | 0.4153 | 0.3286 | 0.4236 | 0.4005 |
| `nullpii-internal-bench` ⚐ | 2,361 | **0.4228** | 0.3090 | 0.3065 | 0.2851 | 0.2936 | 0.1711 | 0.1669 | 0.1436 | 0.2488 |

Full 16-row matrix at [github.com/lBroth/nullpii/tree/main/packages/eval/published-bench](https://github.com/lBroth/nullpii/tree/main/packages/eval/published-bench).

Legend:
- **bold** = row max
- ⚠ training-distribution overlap with at least one competitor in the row
- ⚐ in-distribution for `nullpii` itself (regression cell, **not** counted in the OOD headline)
- ‡ competitor on its own training distribution (best-case self-report)
- § Presidio benched on its own evaluator dataset (best-case self-report)
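The OOD headline can be reproduced from the `nullpii` column of the table. The sketch below assumes the headline is the unweighted mean over the seven listed OOD datasets; the values are copied from the rows above.

```python
# nullpii column, OOD rows only (values copied from the benchmark table)
ood_f1 = {
    "presidio-synthetic": 0.9137,
    "isotonic-en-heldout": 0.7197,
    "isotonic-de-heldout": 0.7297,
    "isotonic-fr-heldout": 0.7254,
    "isotonic-it-heldout": 0.7395,
    "ai4privacy-300k-heldout": 0.6966,
    "tab-echr": 0.9239,
}

# Unweighted mean: every dataset counts equally, regardless of n.
ood_macro = sum(ood_f1.values()) / len(ood_f1)
print(round(ood_macro, 4))  # 0.7784
```

Because the mean is unweighted, the 127-document `tab-echr` row carries the same weight as the 5,000-document rows.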

## Latency

M5 Pro CPU, Node 24, `nullpii` runtime full pipeline:

| Input | p50 | p95 | p99 |
|---:|---:|---:|---:|
| 100 chars | 23 ms | 25 ms | 27 ms |
| 1,000 chars | 95 ms | 113 ms | 114 ms |
| 10,000 chars | 938 ms | 972 ms | 1,122 ms |

Cold start (first `sanitize()`, ONNX load included): ~756 ms.

## When to pick which

- **`nullpii-bare`** — clean OOD splits, raw F1 priority, integrate directly via Python / your own runtime.
- **`nullpii` (npm full runtime)** — production LLM proxy. Token-shape PII (Stripe / IBAN / SSN / 50+ secret patterns), adversarial inputs (zero-width / base64 / URL-encoded), reversible-vault round-trip with session-bound placeholders. `npm i nullpii`.
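To picture the reversible-vault round-trip with session-bound placeholders, here is a toy in-memory version in Python. It is only a sketch of the idea: the class `ToyVault`, its method names, and the placeholder format are hypothetical and are not the npm package's API or storage format.

```python
import secrets


class ToyVault:
    """Toy vault: remembers, per session, which placeholder hides which text."""

    def __init__(self):
        self._sessions = {}

    def redact(self, text, spans):
        """Replace non-overlapping (start, end, label) spans with placeholders."""
        session_id = secrets.token_hex(8)
        mapping = {}
        out, cursor = [], 0
        for i, (start, end, label) in enumerate(sorted(spans)):
            placeholder = f"[{label.upper()}_{i}]"  # hypothetical format
            mapping[placeholder] = text[start:end]
            out.append(text[cursor:start])
            out.append(placeholder)
            cursor = end
        out.append(text[cursor:])
        self._sessions[session_id] = mapping
        return "".join(out), session_id

    def restore(self, text, session_id):
        """Swap placeholders back, using only this session's mapping."""
        for placeholder, original in self._sessions[session_id].items():
            text = text.replace(placeholder, original)
        return text


vault = ToyVault()
original = "Email John Smith at john@acme.io"
spans = [(6, 16, "private_person"), (20, 32, "private_email")]
safe_text, sid = vault.redact(original, spans)
print(safe_text)  # Email [PRIVATE_PERSON_0] at [PRIVATE_EMAIL_1]
print(vault.restore(safe_text, sid) == original)  # True
```

Binding the mapping to a session id means a placeholder from one request cannot be resolved against another request's data.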

## Schema (12 classes)

| Label | Examples | Source |
|---|---|---|
| `private_person` | names | model |
| `private_email` | emails | model + regex |
| `private_phone` | int'l + IT / FR / ES / HIPAA-fax domestic | model + regex |
| `private_address` | street, city, ZIP | model |
| `private_date` | birth / hire dates | model |
| `private_url` | `http(s)://`, `www.` | model + regex |
| `private_ip` | IPv4, IPv6 (RFC 1918 / 5737 / loopback filtered) | regex post-pass |
| `private_mac` | MAC addresses (broadcast / multicast filtered) | regex post-pass |
| `private_passport` | US / IT / FR / ES / DE / UK + context-anchored generic (30 countries) | model (zero-shot) + regex post-pass |
| `private_driver_license` | US per-state + IT / EU per-country (context-anchored) | model (zero-shot) + regex post-pass |
| `private_vehicle_id` | VIN (ISO 3779 mod-11), plates IT / FR / DE / UK / ES / US | model (zero-shot) + regex (validated) |
| `private_geolocation` | lat/lon decimal pairs (range-validated) + DMS notation | model (zero-shot) + regex (validated) |
| `account_number` | IBAN mod-97, cards (Luhn), SSN, MRN, BTC / ETH, DNI / CPF / CF / EIN, Medicare MBI / HIC, NPI, insurance policy, IMEI | model + regex (validated) |
| `secret` | API keys (AWS / GitHub / OpenAI / Anthropic / Stripe / 30+), JWT, PEM, base64-wrapped PII | regex (50+) + base64 |

The table has 14 rows because it also lists two regex-only labels. The GLiNER head is trained on 8 categories (the first 6 + `account_number` + `secret`). Another 4 (`private_passport` / `private_driver_license` / `private_vehicle_id` / `private_geolocation`) are prompted zero-shot and paired with a validated regex post-pass, which gives the 12 model classes. `private_ip` / `private_mac` are regex-only; the model is not trained on them.
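"Validated" regex matches are candidates that also pass a checksum. As an illustration, the two checks named in the `account_number` row can be sketched as below. These are the standard IBAN mod-97 and Luhn algorithms written from scratch; the function names are made up for this example and are not taken from the recognizer pack.

```python
def iban_is_valid(iban: str) -> bool:
    """IBAN mod-97: move the first 4 chars to the end, map A-Z to 10-35,
    and require the resulting integer to be congruent to 1 mod 97."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def luhn_is_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, subtract 9
    from results above 9, and require the sum to be divisible by 10."""
    digits = [int(c) for c in number if c in "0123456789"]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# Widely published Italian example IBAN and a common test card number.
print(iban_is_valid("IT60 X054 2811 1010 0000 0123 456"))  # True
print(luhn_is_valid("4111 1111 1111 1111"))                # True
```

A checksum pass cuts false positives from digit runs that merely look like an IBAN or card number, at the cost of rejecting mistyped real ones.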

## Tricky inputs the npm runtime still catches

Where the adversarial-input preprocessor + recognizer pack pulls PII the bare model alone would miss:

| Surface | Input | Detected as |
|---|---|---|
| base64-wrapped secret | `(base64-encoded) c2stYW50LWFwaTAzLWFCY0RlRmcw…` | `sk-ant-api03-…` (Anthropic key) |
| HTML-entity-encoded secret | `sk-ant…` | `sk-ant-…` (Anthropic key) |
| double-URL-encoded email | `bob.jones%2540company.io` | `bob.jones@company.io` (email) |
| zero-width-obfuscated address | `221B Baker St<U+200B>re<U+200B>et <U+200B>London` | `221B Baker Street London` (address) |
| spaced-out email | `u s e r . 1 2 3 @ g m a i l . c o m` | `user.123@gmail.com` (email) |
| Cyrillic-homoglyph email | `pаyments@bank.com` (`а` = U+0430) | `payments@bank.com` (email) |
| fullwidth ASCII email | `ＵＳＥＲ.ＮＡＭＥ@example.com` | `USER.NAME@example.com` (email) |
| Italian IBAN in prose | `IT60X0542811101000000123456` | `IT60X0542811101000000123456` (account_number, mod-97 verified) |

Five passes total: Unicode normalisation (NFKC + `any-ascii` transliteration), base64 decode-then-classify, iterative URL `%XX` + HTML-entity decode, zero-width strip with offset remap, 50+ validated regex pack.
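Three of these passes can be approximated with the Python standard library. The sketch below is not the npm runtime's code: the function names, the set of zero-width code points, and the `max_rounds` cap are assumptions, and the `any-ascii` transliteration, base64 pass, and regex pack are left out.

```python
import html
import unicodedata
from urllib.parse import unquote

# Assumed set of zero-width code points; the runtime's list may differ.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def strip_zero_width(text):
    """Remove zero-width characters; return (clean, offsets).

    offsets[i] is the index in `text` of clean[i], so a span found on the
    clean string can be mapped back to the original input.
    """
    kept = [(i, ch) for i, ch in enumerate(text) if ch not in ZERO_WIDTH]
    clean = "".join(ch for _, ch in kept)
    offsets = [i for i, _ in kept]
    return clean, offsets


def decode_layers(text, max_rounds=5):
    """Repeat URL %XX and HTML-entity decoding until the text stops changing."""
    for _ in range(max_rounds):
        decoded = html.unescape(unquote(text))
        if decoded == text:
            break
        text = decoded
    return text


def nfkc(text):
    """Fold compatibility forms such as fullwidth letters to their plain form."""
    return unicodedata.normalize("NFKC", text)


print(decode_layers("bob.jones%2540company.io"))  # bob.jones@company.io
clean, offsets = strip_zero_width("221B Baker St\u200bre\u200bet \u200bLondon")
print(clean)        # 221B Baker Street London
print(offsets[18])  # 21 (the "L" of London in the original string)
```

NFKC alone does not fold the Cyrillic homoglyph case in the table, since `а` (U+0430) has no compatibility mapping to Latin `a`; that row depends on the transliteration step.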

## Intended use

- Pre-LLM PII redaction for prompts / RAG corpora / log scrubbing.
- Span-level PII tagging for batch redaction.
- Geographic scope: EU + Romance + English. Limited coverage outside.

## Out-of-scope

- **Implied / opinion-based attributes** (race, religion, health conditions, political views, sexual orientation). These need a different kind of model — this one only finds explicit text spans.
- **HIPAA PHI** — `account_number` catches MRN-shaped digit runs but the model is not a HIPAA de-identifier. Diagnoses, ICD codes, dosages, biometric / genetic identifiers — out of scope.
- **CJK / RTL / Indic scripts** — limited coverage; treat as out-of-scope.
- **Air-gapped first run.** The first call downloads the model artifacts, so on a machine without network access point at a local mirror via `NULLPII_MODEL_DIR` or the `modelDir` config.

## Limitations

- Adversarial robustness comes from the npm runtime pipeline, not the model alone. The bare-model column does not include the adversarial preprocessor or the recognizer pack.
- Long-input chunking at 512-token boundaries (npm word-chunker 140 words / 30 overlap; bare GLiNER chunker 1400 / 200 chars). Boundary spans dedupe via IoU.
- `nullpii-internal-bench` is in-distribution for the project pipeline; treat it as a regression test, not an OOD claim.
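The bare-model chunking and boundary dedupe described above can be sketched as follows. The 1400 / 200 character window comes from this card; the 0.5 IoU threshold, the keep-highest-score rule, and the function names are assumptions for illustration only.

```python
def chunk_text(text, size=1400, overlap=200):
    """Yield (offset, chunk) windows of `size` chars overlapping by `overlap`."""
    step = size - overlap  # requires size > overlap
    start = 0
    while True:
        yield start, text[start:start + size]
        if start + size >= len(text):
            break
        start += step


def iou(a, b):
    """Character-level IoU of two half-open (start, end) spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def dedupe(spans, threshold=0.5):
    """Among same-label spans with IoU >= threshold, keep the best-scoring one.

    spans: dicts with absolute `start`, `end`, `label`, `score`
    (chunk-relative offsets must already be shifted by the chunk offset).
    """
    kept = []
    for s in sorted(spans, key=lambda s: -s["score"]):
        clash = any(
            k["label"] == s["label"]
            and iou((s["start"], s["end"]), (k["start"], k["end"])) >= threshold
            for k in kept
        )
        if not clash:
            kept.append(s)
    return sorted(kept, key=lambda s: s["start"])


print([off for off, _ in chunk_text("x" * 3000)])  # [0, 1200, 2400]
```

A span that falls inside the 200-character overlap is seen by two adjacent chunks, which is why the dedupe step is needed after the per-chunk predictions are merged.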

## How to use

### npm (production path)

```bash
npm install nullpii onnxruntime-node
```

```ts
import { sanitize, restore, wrapForLLM } from 'nullpii';

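// Replace detected PII with placeholders before the text reaches the LLM.
const safe = await sanitize('Email John Smith at john@acme.io about SSN 123-45-6789');
const prompt = wrapForLLM(safe, 'Translate to Italian');
// … LLM call: send `prompt` and store the response text in `reply` …
const back = restore(reply, safe.sessionId); // swap placeholders back to the originals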
```

The first call downloads the artifacts from this repo into `~/.cache/nullpii/`. Pre-warm with `npx nullpii prefetch`.

### Python (bare model)

```python
from gliner import GLiNER

# Load the ONNX export from the Hub.
m = GLiNER.from_pretrained("lBroth/nullpii", load_onnx_model=True)
labels = [
    "account_number", "private_address", "private_date", "private_email",
    "private_person", "private_phone", "private_url", "secret",
    # zero-shot prompted (recall lower; pair with regex pack in production)
    "private_passport", "private_driver_license",
    "private_vehicle_id", "private_geolocation",
]
entities = m.predict_entities("Email John at john@acme.io", labels, threshold=0.5)
# Each entity is a dict with the span text, label, character offsets, and score.
for e in entities:
    print(e["label"], e["text"], e["start"], e["end"], round(e["score"], 3))
```

## License

Apache-2.0. Combined upstream attribution: Apache-2.0 (base model) + CC-BY-4.0 (Nemotron-PII derivative content — attribution required, see header). Commercial redistribution permitted subject to the Nemotron attribution.

## Citation

> nullpii contributors (2026). *nullpii — multilingual PII detection.* https://huggingface.co/lBroth/nullpii

Built on [`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1) (Zaratiana et al., NAACL 2024).