lBroth committed · Commit 0e67121 · verified · 1 Parent(s): 554e4c8

v0.3.0 model card — sync to 12-class schema + 9×16 bench + latency

Files changed (1): README.md +93 -66
@@ -26,48 +26,55 @@ pipeline_tag: token-classification

  Multilingual PII detection. ONNX-exported GLiNER built on
  [`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1)
- (mDeBERTa-v3 base + GLiNER head, ~278M params). ~1.2 GB FP32.
- 12-category span output.

- **Hobby / experiment.** A nights-and-weekends project. No SLA, no
- roadmap commitments. If it helps you, great.

- Companion runtime: [`nullpii`](https://www.npmjs.com/package/nullpii)
- (npm) + [`@lbroth/nullpii-gateway`](https://www.npmjs.com/package/@lbroth/nullpii-gateway)
- (Anthropic Messages API proxy).
- Source: [github.com/lBroth/nullpii](https://github.com/lBroth/nullpii).

- ## What gets detected

- 12 categories. ML-trained (8): `account_number`, `private_address`,
- `private_date`, `private_email`, `private_person`, `private_phone`,
- `private_url`, `secret`. Zero-shot prompted + regex post-pass (4):
- `private_passport`, `private_driver_license`, `private_vehicle_id`,
- `private_geolocation`. Pure regex (2): `private_ip`, `private_mac`.

- Full label table + adversarial-input coverage matrix:
- [GitHub README §What gets caught](https://github.com/lBroth/nullpii#what-gets-caught).

- ## Two runtime modes

- This repo ships the raw ONNX + tokenizer + `gliner_config.json`. F1
- depends on which runtime you pair it with:

- | Mode | What it is | Best for |
- |---|---|---|
- | **`nullpii-bare`** | this ONNX + GLiNER decoder + 1400-char chunking. No post-processing. | benchmark parity with other GLiNER family models. |
- | **`nullpii`** | the npm package full runtime: adversarial-input preprocessor (NFKC + transliteration + URL/HTML decode + zero-width strip), 50+ recognizer regex pack with validators (IBAN mod-97, Luhn, VIN ISO 3779, etc.), base64 decode-then-classify, never-PII filter, boundary refinement, reversible in-memory vault. | production PII sanitization, OOD generalization. |

- Held-out OOD macro F1 (`nullpii`): **0.7784** across `presidio-synthetic`,
- `isotonic-{en,de,fr,it}-heldout`, `ai4privacy-300k-heldout`, `tab-echr`.

- Full 9-tool × 16-dataset matrix vs Piiranha, DeBERTa-PII, Presidio,
- Nemotron-PII, OpenAI privacy-filter, GLiNER native:
- [github.com/lBroth/nullpii/tree/main/packages/eval/published-bench](https://github.com/lBroth/nullpii/tree/main/packages/eval/published-bench).

  ## Latency

- M5 Pro CPU, Node 24, `nullpii` runtime:

  | Input | p50 | p95 | p99 |
  |---:|---:|---:|---:|
@@ -75,60 +82,80 @@ M5 Pro CPU, Node 24, `nullpii` runtime:
  | 1,000 chars | 95 ms | 113 ms | 114 ms |
  | 10,000 chars | 938 ms | 972 ms | 1,122 ms |

- Cold start (first call, includes ONNX load): ~756 ms.

- ## Usage

  ```bash
  npm install nullpii onnxruntime-node
  ```

  ```ts
- import { sanitize, restore } from 'nullpii';
-
- const safe = await sanitize('Email John Smith at john@acme.io');
- // → 'Email {{PII_PRIVATE_PERSON_0_…}} at {{PII_PRIVATE_EMAIL_0_…}}'
-
- // ... your LLM call (OpenAI, Anthropic, Gemini, anything) ...

  const back = restore(reply, safe.sessionId);
- // → original text restored
  ```

- The npm package downloads this model on first `sanitize()` into
- `~/.cache/nullpii/`. Pre-warm with `npx nullpii prefetch`.

- ## Direct ONNX (no runtime)

- ```py
- # Python — minimum viable: tokenizer + ONNX inference + decoder
  from gliner import GLiNER
- m = GLiNER.from_pretrained('lBroth/nullpii', load_onnx_model=True)
- spans = m.predict_entities(
-     text='Email John at john@acme.io',
-     labels=['private_person', 'private_email'],
- )
  ```

- ## Limitations
-
- - Detector misses (no model is 100% accurate).
- - Not a HIPAA de-identifier — diagnoses, ICD codes, dosages,
- biometric / genetic identifiers are out of scope.
- - `private_ip` / `private_mac` come from the regex pack, not the model.
- - Detection is best-effort. Defence in depth, not the sole privacy
- control.
-
- ## Attribution

- - Base model: [`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1)
- (GLiNER, Zaratiana et al., NAACL 2024) — Apache-2.0.
- - mDeBERTa-v3 base — MIT.
- - Includes Nemotron-PII derivative content — CC-BY-4.0.

- See [`NOTICE`](https://github.com/lBroth/nullpii/blob/main/NOTICE) for the full
- attribution + license posture.

- ## License

- Apache-2.0.
 
@@ -26,48 +26,55 @@ pipeline_tag: token-classification

  Multilingual PII detection. ONNX-exported GLiNER built on
  [`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1)
+ (mDeBERTa-v3 base + GLiNER head, ~278M params). ~1.2 GB FP32. 12-class
+ span output.

+ 🧪 **Hobby / experiment.** Nights-and-weekends project. No SLA.

+ Attribution: this model includes **NVIDIA Nemotron-PII (CC-BY-4.0)**
+ derivative content.

+ ## Two F1 columns

+ The repo ships the raw ONNX + tokenizer. F1 depends on which runtime you pair it with.

+ | Mode | What it is | F1 leader |
+ |---|---|---|
+ | `nullpii-bare` | this ONNX + GLiNER decoder + 1400-char chunking. No post-processing. | clean OOD splits |
+ | `nullpii` (full runtime) | npm package: this model + 70-pattern recognizer pack (AWS / GitHub / Stripe / IBAN / SSN / …) + adversarial-input preprocessor (NFKC / any-ascii / URL `%XX` / HTML entity / zero-width / spaced PII) + base64 decoder + never-PII filter + reversible vault | adversarial / token-shape PII / production round-trip |

+ Both numbers published below so the model-vs-pipeline delta is explicit.
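
Reviewer note: the adversarial-input preprocessor named in the table row above (NFKC, URL `%XX`, HTML entity, zero-width) can be illustrated with the Python standard library. This is a minimal sketch of the idea, not the npm package's code; the helper name `normalize_for_detection` and the order of the steps are assumptions, and the any-ascii and spaced-PII steps are left out.

```python
import html
import unicodedata
from urllib.parse import unquote

# Zero-width / invisible code points often used to split a token.
# Mapping a code point to None makes str.translate delete it.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalize_for_detection(text: str) -> str:
    """Hypothetical helper: undo common obfuscations before detection."""
    text = html.unescape(text)                  # "&#64;" becomes "@"
    text = unquote(text)                        # "%40" becomes "@"
    text = unicodedata.normalize("NFKC", text)  # fullwidth forms become ASCII
    return text.translate(ZERO_WIDTH)           # drop zero-width characters
```

Normalizing changes character offsets, so a real pipeline would also have to map detected spans back onto the original string. The sketch ignores that step.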

+ ## Benchmark

+ Mac M5 Pro CPU, single seed, macro F1 at IoU ≥ 0.5 partial-match span scoring. Cap 5,000 / dataset (less where the dataset is smaller). `--parallel-tools 1` fair-serial. Third-party tools run bare (no nullpii post-processing on competitor rows). Full matrix CSV: [`packages/eval/published-bench/matrix.csv`](https://github.com/lBroth/nullpii/blob/main/packages/eval/published-bench/matrix.csv). Run: `packages/eval/scripts/bench_full.py`.
+
+ v0.3.0 (M5 Pro CPU, full 9×16 matrix). OOD macro F1 for `nullpii` = **0.7784** (`presidio-synthetic` + `isotonic-{en,de,fr,it}-heldout` + `ai4privacy-300k-heldout` + `tab-echr`).
+
+ | Dataset | n | **`nullpii`** | **`nullpii-bare`** | `nemotron-pii-raw` | `gliner-pii-large-v1` | `gliner-onnx-pii-fp32` | `deberta` | `piiranha` | `presidio` | `opf` |
+ |---|---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+ | `presidio-synthetic` | 5,000 | **0.9137** | 0.8487 | 0.7154 | 0.6749 | 0.5254 | 0.5111 | 0.3853 | 0.5511 § | 0.6530 |
+ | `isotonic-en-heldout` | 1,900 | 0.7197 | 0.5969 | **0.7518** | 0.6662 | 0.5485 | 0.6224 | 0.4124 | 0.4472 | 0.4095 |
+ | `isotonic-de-heldout` | 2,400 | **0.7297** | 0.6191 | 0.7271 | 0.6325 | 0.5432 | 0.3969 | 0.4112 | 0.3859 | 0.4155 |
+ | `isotonic-fr-heldout` | 2,800 | 0.7254 | 0.6001 | **0.7276** | 0.6663 | 0.5393 | 0.4824 | 0.4172 | 0.4042 | 0.4257 |
+ | `isotonic-it-heldout` | 2,200 | **0.7395** | 0.6148 | 0.7273 | 0.6605 | 0.5519 | 0.4509 | 0.4176 | 0.4057 | 0.4420 |
+ | `ai4privacy-300k-heldout` | 5,000 | **0.6966** | 0.5241 | 0.6608 | 0.4306 | 0.5131 | 0.2183 | 0.3266 | 0.4882 | 0.4630 |
+ | `tab-echr` ⚠ | 127 | 0.9239 | **0.9275** | 0.6026 | 0.6346 | 0.6463 | 0.2908 | 0.3163 | 0.7761 | 0.4166 |
+ | `nemotron-pii-test` ⚠ | 5,000 | 0.8063 | 0.6814 | **0.9286** ‡ | 0.7675 | 0.7352 | 0.4153 | 0.3286 | 0.4236 | 0.4005 |
+ | `nullpii-internal-bench` ⚐ | 2,361 | **0.4228** | 0.3090 | 0.3065 | 0.2851 | 0.2936 | 0.1711 | 0.1669 | 0.1436 | 0.2488 |

+ Full 16-row matrix at [github.com/lBroth/nullpii/tree/main/packages/eval/published-bench](https://github.com/lBroth/nullpii/tree/main/packages/eval/published-bench).

+ Legend:
+ - **bold** = row max
+ - ⚠ training-distribution overlap with at least one competitor in the row
+ - ⚐ in-distribution for `nullpii` itself (regression cell, **not** counted in the OOD headline)
+ - ‡ competitor on its own training distribution (best-case self-report)
+ - § Presidio benched on its own evaluator dataset (best-case self-report)
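
Reviewer note: "macro F1 at IoU ≥ 0.5 partial-match span scoring" can be read as follows. A predicted span counts as a true positive when it overlaps a gold span of the same label with character-level intersection-over-union of at least 0.5, and the per-label F1 scores are averaged with equal weight. The sketch below shows that reading; the greedy one-to-one matching rule is an assumption, and `bench_full.py` may match spans differently.

```python
from collections import defaultdict

Span = tuple[int, int, str]  # (start, end_exclusive, label)


def iou(a: Span, b: Span) -> float:
    """Character-level intersection over union of two spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def macro_f1(gold: list[Span], pred: list[Span], threshold: float = 0.5) -> float:
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    unmatched = list(gold)
    for p in pred:
        # Best same-label gold span that has not been matched yet.
        best = max((g for g in unmatched if g[2] == p[2]),
                   key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= threshold:
            tp[p[2]] += 1
            unmatched.remove(best)
        else:
            fp[p[2]] += 1
    for g in unmatched:  # gold spans nobody predicted
        fn[g[2]] += 1
    labels = set(tp) | set(fp) | set(fn)
    f1s = [2 * tp[l] / (2 * tp[l] + fp[l] + fn[l]) for l in labels]
    return sum(f1s) / len(f1s) if f1s else 0.0
```

Macro averaging gives a rare label the same weight as a frequent one, which is why a single weak class can pull a row down noticeably.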

  ## Latency

+ M5 Pro CPU, Node 24, `nullpii` runtime full pipeline:

  | Input | p50 | p95 | p99 |
  |---:|---:|---:|---:|
@@ -75,60 +82,80 @@ M5 Pro CPU, Node 24, `nullpii` runtime:
  | 1,000 chars | 95 ms | 113 ms | 114 ms |
  | 10,000 chars | 938 ms | 972 ms | 1,122 ms |

+ Cold start (first `sanitize()`, ONNX load included): ~756 ms.
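
Reviewer note: for anyone reproducing the p50 / p95 / p99 columns on other hardware, a small harness along these lines is enough. It is a sketch under assumptions: the helper names `percentile` and `measure_ms` are made up here, the nearest-rank percentile definition is one of several in common use, and the README does not say how its own numbers were computed.

```python
import math
import time
from typing import Callable


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def measure_ms(fn: Callable[[], object], runs: int = 200, warmup: int = 5) -> dict[str, float]:
    """Time fn() in milliseconds and report p50 / p95 / p99."""
    for _ in range(warmup):  # keep cold-start cost out of the steady-state numbers
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}
```

Cold start is measured separately by timing the very first call before any warmup.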
+
+ ## When to pick which
+
+ - **`nullpii-bare`** — clean OOD splits, raw F1 priority, integrate directly via Python / your own runtime.
+ - **`nullpii` (npm full runtime)** — production LLM proxy. Token-shape PII (Stripe / IBAN / SSN / 50+ secret patterns), adversarial inputs (zero-width / base64 / URL-encoded), reversible-vault round-trip with session-bound placeholders. `npm i nullpii`.
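
Reviewer note: token-shape recognizers are usually paired with checksum validators so that a regex hit is kept only when the number is well formed. The removed README text names IBAN mod-97 and Luhn for this purpose. The two functions below are standard implementations of those checks, written as an illustration; the names `luhn_ok` and `iban_ok` are hypothetical and nothing here is taken from the recognizer pack itself.

```python
def luhn_ok(digits: str) -> bool:
    """Luhn checksum, as used by payment card numbers."""
    nums = [int(c) for c in digits if c.isdigit()]
    if len(nums) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(nums)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0


def iban_ok(iban: str) -> bool:
    """IBAN mod-97 check: rotate the first 4 chars to the end, map A-Z to 10-35."""
    s = iban.replace(" ", "").upper()
    if len(s) < 15 or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    as_number = int("".join(str(int(c, 36)) for c in rearranged))
    return as_number % 97 == 1
```

A validator like this only confirms the checksum. Country-specific IBAN lengths and card issuer prefixes would need separate checks.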
+
+ ## Schema (12 classes)
+
+ ML-trained (8): `account_number` · `private_address` · `private_date` · `private_email` · `private_person` · `private_phone` · `private_url` · `secret`
+
+ Zero-shot prompted + regex post-pass (4): `private_passport` · `private_driver_license` · `private_vehicle_id` · `private_geolocation`
+
+ Pure regex post-pass (2): `private_ip` · `private_mac` (GLiNER head not trained on them).
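
Reviewer note: since `private_ip` and `private_mac` never come from the model, anyone using the bare ONNX path has to supply that post-pass themselves. A minimal sketch of such a pass is shown here. The patterns, the function name `regex_post_pass`, and the IPv4-only scope are assumptions for illustration, not the patterns shipped in the npm recognizer pack.

```python
import ipaddress
import re

IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b")


def regex_post_pass(text: str) -> list[dict]:
    """Return IP and MAC spans in the same dict shape as model entities."""
    spans = []
    for m in IPV4_RE.finditer(text):
        try:
            ipaddress.IPv4Address(m.group())  # rejects octets above 255
        except ValueError:
            continue
        spans.append({"start": m.start(), "end": m.end(),
                      "label": "private_ip", "text": m.group()})
    for m in MAC_RE.finditer(text):
        spans.append({"start": m.start(), "end": m.end(),
                      "label": "private_mac", "text": m.group()})
    return sorted(spans, key=lambda s: s["start"])
```

The returned dicts can be concatenated with the output of `predict_entities` before redaction.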
+
+ ## Intended use
+
+ - Pre-LLM PII redaction for prompts / RAG corpora / log scrubbing.
+ - Span-level PII tagging for batch redaction.
+ - Geographic scope: EU + Romance + English. Limited coverage outside.
+
+ ## Out-of-scope

+ - **GDPR Article 9 special categories** (health, biometric, genetic, religious, political, trade-union, sexual orientation). Not represented in the schema.
+ - **HIPAA PHI** — `account_number` catches MRN-shaped digit runs but the model is not a HIPAA de-identifier.
+ - **CJK / RTL / Indic scripts** — limited coverage; treat as out-of-scope.
+ - **Air-gapped first-run** — point at a local mirror via `NULLPII_MODEL_DIR` or `modelDir` config.
+
+ ## Limitations
+
+ - Adversarial robustness comes from the npm runtime pipeline, not the model alone. The bare-model column does not include the adversarial preprocessor or the recognizer pack.
+ - Long-input chunking at 512-token boundaries (npm word-chunker 140 words / 30 overlap; bare GLiNER chunker 1400 / 200 chars). Boundary spans dedupe via IoU.
+ - `nullpii-bench` is in-distribution for the project pipeline — treat as regression test, not OOD claim.
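
Reviewer note: the chunking limitation above is easier to reason about with the mechanics written out. The sketch below uses the 1400 / 200 character figures quoted for the bare chunker, runs a caller-supplied detector on each window, shifts offsets back to the full text, and drops duplicate boundary spans by IoU. The 0.5 dedupe threshold, the keep-highest-score rule, and all function names are assumptions; the shipped chunkers may differ.

```python
from typing import Callable, Iterator


def chunk_chars(text: str, size: int = 1400, overlap: int = 200) -> Iterator[tuple[int, str]]:
    """Yield (offset, chunk) windows; consecutive windows share `overlap` chars."""
    step = size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        yield start, text[start:start + size]


def span_iou(a: dict, b: dict) -> float:
    inter = max(0, min(a["end"], b["end"]) - max(a["start"], b["start"]))
    union = (a["end"] - a["start"]) + (b["end"] - b["start"]) - inter
    return inter / union if union else 0.0


def detect_long(text: str, detect: Callable[[str], list[dict]],
                iou_threshold: float = 0.5) -> list[dict]:
    """Run `detect` per chunk, then dedupe same-label spans seen in two windows."""
    found = []
    for offset, chunk in chunk_chars(text):
        for e in detect(chunk):
            found.append({**e, "start": e["start"] + offset, "end": e["end"] + offset})
    kept: list[dict] = []
    for e in sorted(found, key=lambda e: -e["score"]):  # highest score wins
        if all(k["label"] != e["label"] or span_iou(k, e) < iou_threshold for k in kept):
            kept.append(e)
    return sorted(kept, key=lambda e: e["start"])
```

Here `detect` stands in for any function that returns entity dicts with `start`, `end`, `label` and `score` keys, for example a thin wrapper around `predict_entities`.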
+
+ ## How to use
+
+ ### npm (production path)

  ```bash
  npm install nullpii onnxruntime-node
  ```

  ```ts
+ import { sanitize, restore, wrapForLLM } from 'nullpii';

+ const safe = await sanitize('Email John Smith at john@acme.io about SSN 123-45-6789');
+ const prompt = wrapForLLM(safe, 'Translate to Italian');
+ // … LLM call …
  const back = restore(reply, safe.sessionId);
  ```

+ First call downloads the artifacts here into `~/.cache/nullpii/`. Pre-warm with `npx nullpii prefetch`.
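
Reviewer note: the `sanitize` / `restore` round-trip relies on a reversible vault keyed by session. The toy class below illustrates the concept in Python for readers on the bare-model path. It is not the npm package's implementation: the `Vault` class, its methods, and the placeholder suffix (the session id) are all hypothetical, since the README elides the real placeholder tail.

```python
import secrets


class Vault:
    """Toy in-memory vault: placeholder -> original value, scoped to one session."""

    def __init__(self) -> None:
        self.session_id = secrets.token_hex(4)
        self._values: dict[str, str] = {}
        self._counts: dict[str, int] = {}

    def redact(self, text: str, spans: list[dict]) -> str:
        # Replace right to left so the offsets of earlier spans stay valid.
        for s in sorted(spans, key=lambda s: s["start"], reverse=True):
            n = self._counts.get(s["label"], 0)
            self._counts[s["label"]] = n + 1
            key = "{{PII_%s_%d_%s}}" % (s["label"].upper(), n, self.session_id)
            self._values[key] = text[s["start"]:s["end"]]
            text = text[:s["start"]] + key + text[s["end"]:]
        return text

    def restore(self, text: str) -> str:
        # Placeholders that the LLM echoed back are swapped for the originals.
        for key, value in self._values.items():
            text = text.replace(key, value)
        return text
```

Because the session id is part of every placeholder, a reply can only be restored by the vault that produced it.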
 

+ ### Python (bare model)

+ ```python
  from gliner import GLiNER
+ m = GLiNER.from_pretrained("lBroth/nullpii", load_onnx_model=True)
+ labels = [
+     "account_number", "private_address", "private_date", "private_email",
+     "private_person", "private_phone", "private_url", "secret",
+     # zero-shot prompted (recall lower; pair with regex pack in production)
+     "private_passport", "private_driver_license",
+     "private_vehicle_id", "private_geolocation",
+ ]
+ m.predict_entities("Email John at john@acme.io", labels, threshold=0.5)
  ```

+ ## License

+ Apache-2.0. Combined upstream attribution: Apache-2.0 (base model) + CC-BY-4.0 (Nemotron-PII derivative content — attribution required, see header). Commercial redistribution permitted subject to the Nemotron attribution.

+ ## Citation

+ > nullpii contributors (2026). *nullpii — multilingual PII detection.* https://huggingface.co/lBroth/nullpii

+ Built on [`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1) (Zaratiana et al., NAACL 2024).