v0.3.0 model card
README.md
ADDED
@@ -0,0 +1,134 @@
---
license: apache-2.0
language:
- en
- de
- fr
- es
- it
- multilingual
base_model:
- urchade/gliner_multi_pii-v1
library_name: gliner
tags:
- pii
- privacy
- ner
- llm-safety
- gdpr
- pii-redaction
- multilingual
- onnx
pipeline_tag: token-classification
---

# nullpii

Multilingual PII detection. ONNX-exported GLiNER built on
[`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1)
(mDeBERTa-v3 base + GLiNER head, ~278M params). ~1.2 GB FP32.
12-category span output.

**Hobby / experiment.** A nights-and-weekends project. No SLA, no
roadmap commitments. If it helps you, great.

Companion runtime: [`nullpii`](https://www.npmjs.com/package/nullpii)
(npm) + [`@lbroth/nullpii-gateway`](https://www.npmjs.com/package/@lbroth/nullpii-gateway)
(Anthropic Messages API proxy).
Source: [github.com/lBroth/nullpii](https://github.com/lBroth/nullpii).

## What gets detected

12 model categories, plus 2 regex-only categories from the runtime.
ML-trained (8): `account_number`, `private_address`,
`private_date`, `private_email`, `private_person`, `private_phone`,
`private_url`, `secret`. Zero-shot prompted + regex post-pass (4):
`private_passport`, `private_driver_license`, `private_vehicle_id`,
`private_geolocation`. Pure regex, not model output (2): `private_ip`, `private_mac`.
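The grouping above can be written down as label lists for use with the GLiNER API. This is a minimal sketch: the label strings are copied from the list above, while the constant names (`ML_TRAINED`, `ZERO_SHOT`, `REGEX_ONLY`, `MODEL_LABELS`) are illustrative and not part of any nullpii or GLiNER API.

```python
# Label names copied from the list above; the constant names are
# illustrative assumptions, not identifiers exported by nullpii.
ML_TRAINED = [
    "account_number", "private_address", "private_date", "private_email",
    "private_person", "private_phone", "private_url", "secret",
]
ZERO_SHOT = [
    "private_passport", "private_driver_license",
    "private_vehicle_id", "private_geolocation",
]
REGEX_ONLY = ["private_ip", "private_mac"]

# Labels worth passing to the model as prompts. This card attributes the
# regex-only categories to the runtime's regex pack, so they are left out.
MODEL_LABELS = ML_TRAINED + ZERO_SHOT

print(len(MODEL_LABELS), len(REGEX_ONLY))  # prints "12 2"
```

`MODEL_LABELS` can then be passed as the `labels` argument of `predict_entities`.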

Full label table + adversarial-input coverage matrix:
[GitHub README §What gets caught](https://github.com/lBroth/nullpii#what-gets-caught).

## Two runtime modes

This repo ships the raw ONNX + tokenizer + `gliner_config.json`. F1
depends on which runtime you pair it with:

| Mode | What it is | Best for |
|---|---|---|
| **`nullpii-bare`** | this ONNX + GLiNER decoder + 1400-char chunking. No post-processing. | benchmark parity with other GLiNER family models. |
| **`nullpii`** | the npm package full runtime: adversarial-input preprocessor (NFKC + transliteration + URL/HTML decode + zero-width strip), 50+ recognizer regex pack with validators (IBAN mod-97, Luhn, VIN ISO 3779, etc.), base64 decode-then-classify, never-PII filter, boundary refinement, reversible in-memory vault. | production PII sanitization, OOD generalization. |
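To make the preprocessor and validator steps in the table more concrete, here is a rough Python sketch of three of them: Unicode/URL/HTML normalization, the Luhn checksum, and the IBAN mod-97 check. It is not the npm package's code; the function names and the order of the normalization steps are assumptions, and transliteration is omitted because the standard library has no equivalent.

```python
import html
import re
import unicodedata
from urllib.parse import unquote

# Zero-width space / non-joiner / joiner, word joiner, and BOM.
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")

def normalize(text: str) -> str:
    """NFKC-fold, URL/HTML-decode, and drop zero-width characters."""
    text = unicodedata.normalize("NFKC", text)
    text = html.unescape(unquote(text))
    return ZERO_WIDTH.sub("", text)

def luhn_ok(number: str) -> bool:
    """Luhn checksum as used for payment card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def iban_ok(iban: str) -> bool:
    """IBAN mod-97 check: rotate 4 chars, map A-Z to 10-35, test % 97 == 1."""
    s = re.sub(r"\s+", "", iban).upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{1,30}", s):
        return False
    rearranged = s[4:] + s[:4]
    # int(c, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    return int("".join(str(int(c, 36)) for c in rearranged)) % 97 == 1
```

A validator like this lets a regex candidate be rejected when its checksum fails, which cuts false positives on arbitrary digit strings.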

Held-out OOD macro F1 (`nullpii`): **0.7784** across `presidio-synthetic`,
`isotonic-{en,de,fr,it}-heldout`, `ai4privacy-300k-heldout`, `tab-echr`.

Full 9-tool × 16-dataset matrix vs Piiranha, DeBERTa-PII, Presidio,
Nemotron-PII, OpenAI privacy-filter, GLiNER native:
[github.com/lBroth/nullpii/tree/main/packages/eval/published-bench](https://github.com/lBroth/nullpii/tree/main/packages/eval/published-bench).

## Latency

M5 Pro CPU, Node 24, `nullpii` runtime:

| Input | p50 | p95 | p99 |
|---:|---:|---:|---:|
| 100 chars | 23 ms | 25 ms | 27 ms |
| 1,000 chars | 95 ms | 113 ms | 114 ms |
| 10,000 chars | 938 ms | 972 ms | 1,122 ms |

Cold start (first call, includes ONNX load): ~756 ms.
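For readers who want comparable numbers on their own hardware, percentiles like those in the table can be collected with a small timing harness. The sketch below is an assumption about method, not the harness used for the published figures; `measure_latency_ms` is a hypothetical helper that accepts any callable.

```python
import statistics
import time

def measure_latency_ms(fn, arg, runs=200, warmup=5):
    """Time fn(arg) repeatedly and return (p50, p95, p99) in milliseconds."""
    for _ in range(warmup):
        fn(arg)  # discarded, so a one-off cold start is not counted
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(arg)
        samples.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[49], cuts[94], cuts[98]
```

Any callable works as `fn`, for example a lambda that wraps `model.predict_entities` with a fixed label list.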

## Usage

```bash
npm install nullpii onnxruntime-node
```

```ts
import { sanitize, restore } from 'nullpii';

const safe = await sanitize('Email John Smith at john@acme.io');
// → 'Email {{PII_PRIVATE_PERSON_0_…}} at {{PII_PRIVATE_EMAIL_0_…}}'

// ... your LLM call (OpenAI, Anthropic, Gemini, anything) ...

const back = restore(reply, safe.sessionId);
// → original text restored
```
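The sanitize/restore round trip shown above relies on a reversible placeholder vault. The following Python sketch illustrates that idea only: `sanitize_spans`, `restore_text`, and `_VAULT` are hypothetical names, the spans are supplied by the caller rather than detected, and the placeholder format is simplified (the real tokens carry an extra suffix that is elided above).

```python
import uuid

# Hypothetical in-memory vault: session id -> {placeholder: original text}.
_VAULT: dict[str, dict[str, str]] = {}

def sanitize_spans(text: str, spans: list[dict]) -> tuple[str, str]:
    """Replace each span with a placeholder; return (safe_text, session_id).

    `spans` are dicts with "start", "end", and "label" keys, the shape
    returned by GLiNER's predict_entities. Spans must not overlap.
    """
    session_id = uuid.uuid4().hex
    mapping: dict[str, str] = {}
    counters: dict[str, int] = {}
    pieces, cursor = [], 0
    for span in sorted(spans, key=lambda s: s["start"]):
        label = span["label"].upper()
        index = counters.get(label, 0)
        counters[label] = index + 1
        # Simplified token format, assumed for illustration.
        token = f"{{{{PII_{label}_{index}}}}}"
        mapping[token] = text[span["start"]:span["end"]]
        pieces += [text[cursor:span["start"]], token]
        cursor = span["end"]
    pieces.append(text[cursor:])
    _VAULT[session_id] = mapping
    return "".join(pieces), session_id

def restore_text(text: str, session_id: str) -> str:
    """Swap placeholders in an LLM reply back to the original values."""
    for token, original in _VAULT.get(session_id, {}).items():
        text = text.replace(token, original)
    return text
```

Because the mapping lives only in process memory, the original values never leave the machine; the LLM sees placeholders and the reply is rehydrated locally.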

The npm package downloads this model on first `sanitize()` into
`~/.cache/nullpii/`. Pre-warm with `npx nullpii prefetch`.

## Direct ONNX (no runtime)

```py
# Python, minimum viable: tokenizer + ONNX inference + decoder
from gliner import GLiNER

# load_onnx_model=True runs the ONNX export through onnxruntime
model = GLiNER.from_pretrained('lBroth/nullpii', load_onnx_model=True)
spans = model.predict_entities(
    text='Email John at john@acme.io',
    labels=['private_person', 'private_email'],
)
# Each span is a dict with start/end offsets, text, label and score
for span in spans:
    print(span['label'], span['text'], span['score'])
```
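The direct call above handles a single short string. For long inputs, the `nullpii-bare` mode described earlier chunks at 1400 characters; a comparable approach can be sketched as below. The split-at-whitespace policy and the helper names `chunk_text` and `predict_long` are assumptions, not the runtime's actual chunker.

```python
def chunk_text(text: str, max_chars: int = 1400) -> list[tuple[int, str]]:
    """Split text into (offset, chunk) pieces of at most max_chars.

    Prefers to cut at the last whitespace inside the window so words (and
    most entities) stay whole; falls back to a hard cut otherwise.
    """
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1  # keep the whitespace with the left chunk
        chunks.append((start, text[start:end]))
        start = end
    return chunks

def predict_long(model, text: str, labels: list[str], **kwargs) -> list[dict]:
    """Run model.predict_entities per chunk and shift spans to global offsets."""
    results = []
    for offset, chunk in chunk_text(text):
        for span in model.predict_entities(chunk, labels, **kwargs):
            results.append({**span,
                            "start": span["start"] + offset,
                            "end": span["end"] + offset})
    return results
```

An entity that straddles a hard cut would still be split; overlapping windows are one way to reduce that, at the cost of deduplicating spans afterwards.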

## Limitations

- The detector will miss some PII; no model is 100% accurate.
- Not a HIPAA de-identifier: diagnoses, ICD codes, dosages,
  biometric / genetic identifiers are out of scope.
- `private_ip` / `private_mac` come from the regex pack, not the model.
- Detection is best-effort. Use it as one layer of defence in depth,
  not as the sole privacy control.
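Since `private_ip` and `private_mac` are not produced by the model, anyone using the ONNX file directly needs a separate pass for them. The sketch below shows one way to do that in Python; the patterns are illustrative assumptions (IPv4 only, colon- or hyphen-separated MACs) and are not the patterns shipped in the npm regex pack.

```python
import ipaddress
import re

# Loose IPv4 candidate; ipaddress does the real validation below.
IPV4_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?!\d|\.\d)")
# Six hex pairs joined by one consistent separator (':' or '-').
MAC = re.compile(
    r"(?<![0-9A-Fa-f:-])[0-9A-Fa-f]{2}([:-])"
    r"(?:[0-9A-Fa-f]{2}\1){4}[0-9A-Fa-f]{2}(?![0-9A-Fa-f:-])"
)

def find_ip_mac(text: str) -> list[dict]:
    """Return GLiNER-style span dicts for IPv4 and MAC addresses."""
    spans = []
    for m in IPV4_CANDIDATE.finditer(text):
        try:
            ipaddress.IPv4Address(m.group())  # rejects octets above 255
        except ValueError:
            continue
        spans.append({"start": m.start(), "end": m.end(),
                      "text": m.group(), "label": "private_ip"})
    for m in MAC.finditer(text):
        spans.append({"start": m.start(), "end": m.end(),
                      "text": m.group(), "label": "private_mac"})
    return sorted(spans, key=lambda s: s["start"])
```

The returned dicts use the same keys as model spans, so the two lists can be merged before redaction.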

## Attribution

- Base model: [`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1)
  (GLiNER, Zaratiana et al., NAACL 2024), Apache-2.0.
- mDeBERTa-v3 base, MIT.
- Includes Nemotron-PII derivative content, CC-BY-4.0.

See [`NOTICE`](https://github.com/lBroth/nullpii/blob/main/NOTICE) for the full
attribution + license posture.

## License

Apache-2.0.