Commit 554e4c8 by lBroth (verified) · 1 parent: 230a434

v0.3.0 model card

README.md ADDED
@@ -0,0 +1,134 @@
---
license: apache-2.0
language:
- en
- de
- fr
- es
- it
- multilingual
base_model:
- urchade/gliner_multi_pii-v1
library_name: gliner
tags:
- pii
- privacy
- ner
- llm-safety
- gdpr
- pii-redaction
- multilingual
- onnx
pipeline_tag: token-classification
---

# nullpii

Multilingual PII detection. ONNX-exported GLiNER built on
[`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1)
(mDeBERTa-v3 base + GLiNER head, ~278M params). ~1.2 GB FP32.
12-category span output.

**Hobby / experiment.** A nights-and-weekends project. No SLA, no
roadmap commitments. If it helps you, great.

Companion runtime: [`nullpii`](https://www.npmjs.com/package/nullpii)
(npm) + [`@lbroth/nullpii-gateway`](https://www.npmjs.com/package/@lbroth/nullpii-gateway)
(Anthropic Messages API proxy).
Source: [github.com/lBroth/nullpii](https://github.com/lBroth/nullpii).

## What gets detected

12 model categories, plus 2 regex-only categories added by the runtime.
ML-trained (8): `account_number`, `private_address`,
`private_date`, `private_email`, `private_person`, `private_phone`,
`private_url`, `secret`. Zero-shot prompted + regex post-pass (4):
`private_passport`, `private_driver_license`, `private_vehicle_id`,
`private_geolocation`. Pure regex, runtime only (2): `private_ip`, `private_mac`.

Full label table + adversarial-input coverage matrix:
[GitHub README §What gets caught](https://github.com/lBroth/nullpii#what-gets-caught).

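To make the "pure regex" categories concrete, here is a rough Python sketch of how `private_ip` and `private_mac` spans could be found without the model. The patterns and the `find_regex_pii` helper are hypothetical and only cover IPv4 and colon/dash-separated MACs; the actual regex pack in the npm runtime is not reproduced here.

```python
import re

# Hypothetical patterns for illustration only (IPv4 and 6-octet MAC).
IPV4 = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)
MAC = re.compile(r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b")


def find_regex_pii(text):
    """Return regex-only spans in the same dict shape GLiNER uses."""
    spans = []
    for label, pattern in (("private_ip", IPV4), ("private_mac", MAC)):
        for match in pattern.finditer(text):
            spans.append({
                "start": match.start(),
                "end": match.end(),
                "text": match.group(),
                "label": label,
            })
    # Sort by position so the spans can be merged with model output.
    return sorted(spans, key=lambda s: s["start"])
```

Calling `find_regex_pii("host 192.168.1.10 has mac 00:1A:2b:3C:4d:5E")` yields one `private_ip` span and one `private_mac` span, in that order.
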
## Two runtime modes

This repo ships the raw ONNX + tokenizer + `gliner_config.json`. F1
depends on which runtime you pair it with:

| Mode | What it is | Best for |
|---|---|---|
| **`nullpii-bare`** | this ONNX + GLiNER decoder + 1400-char chunking. No post-processing. | benchmark parity with other GLiNER-family models. |
| **`nullpii`** | the npm package full runtime: adversarial-input preprocessor (NFKC + transliteration + URL/HTML decode + zero-width strip), 50+ recognizer regex pack with validators (IBAN mod-97, Luhn, VIN ISO 3779, etc.), base64 decode-then-classify, never-PII filter, boundary refinement, reversible in-memory vault. | production PII sanitization, OOD generalization. |

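The preprocessing steps named in the table can be approximated with the Python standard library. This is a minimal sketch, not the npm runtime's code: `normalize_for_detection` is a hypothetical name, the step order is an assumption, and transliteration is left out.

```python
import html
import re
import unicodedata
from urllib.parse import unquote

# Common zero-width characters used to split tokens invisibly.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")


def normalize_for_detection(text):
    """Undo simple obfuscation before running a detector (assumed order)."""
    text = unquote(text)             # URL decode: %40 -> @
    text = html.unescape(text)       # HTML decode: &#64; -> @
    text = ZERO_WIDTH.sub("", text)  # strip zero-width characters
    # NFKC folds compatibility forms (e.g. fullwidth letters) to plain ones.
    return unicodedata.normalize("NFKC", text)
```

Normalization changes string length, so spans found on the normalized text need an offset map before they can be applied to the original input. That bookkeeping is omitted here.
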
Held-out OOD macro F1 (`nullpii`): **0.7784** across `presidio-synthetic`,
`isotonic-{en,de,fr,it}-heldout`, `ai4privacy-300k-heldout`, `tab-echr`.

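For readers unfamiliar with the metric, the sketch below shows one common reading of "macro F1 across datasets": compute F1 per dataset, then take the unweighted mean. Whether the published number averages per dataset or per label is an assumption here, and the counts in the usage note are made up for illustration.

```python
def f1(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(counts_by_dataset):
    """Unweighted mean of per-dataset F1; each value is (tp, fp, fn)."""
    scores = [f1(*counts) for counts in counts_by_dataset.values()]
    return sum(scores) / len(scores)
```

With illustrative counts `{"dataset_a": (8, 2, 2), "dataset_b": (5, 5, 0)}`, the per-dataset scores are 0.8 and about 0.667, so the macro average is about 0.733. Small datasets weigh as much as large ones under this averaging.
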
Full 9-tool × 16-dataset matrix vs Piiranha, DeBERTa-PII, Presidio,
Nemotron-PII, OpenAI privacy-filter, GLiNER native:
[github.com/lBroth/nullpii/tree/main/packages/eval/published-bench](https://github.com/lBroth/nullpii/tree/main/packages/eval/published-bench).

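Two of the checksum validators named in the runtime table, Luhn and IBAN mod-97, are standard algorithms. The Python sketch below shows how such validators can reject regex matches that merely look like card numbers or IBANs. The function names are hypothetical, and country-specific IBAN length checks are not included.

```python
def luhn_valid(number):
    """Luhn checksum, as used for payment card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_valid(iban):
    """IBAN mod-97 check: move the first 4 chars to the end, map A-Z to 10-35."""
    s = iban.replace(" ", "").upper()
    if not (15 <= len(s) <= 34) or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    # int(c, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    numeric = "".join(str(int(c, 36)) for c in rearranged)
    return int(numeric) % 97 == 1
```

A candidate span that matches the pattern but fails its checksum is most likely not a real identifier, which is how validators cut false positives from a regex pack.
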
## Latency

M5 Pro CPU, Node 24, `nullpii` runtime:

| Input | p50 | p95 | p99 |
|---:|---:|---:|---:|
| 100 chars | 23 ms | 25 ms | 27 ms |
| 1,000 chars | 95 ms | 113 ms | 114 ms |
| 10,000 chars | 938 ms | 972 ms | 1,122 ms |

Cold start (first call, includes ONNX load): ~756 ms.

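To reproduce this kind of table on your own hardware, a small harness along these lines is enough. It is a sketch under assumptions: the benchmark script behind the numbers above is not shown in this card, the run count is arbitrary, and nearest-rank is only one of several percentile definitions.

```python
import math
import time


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list, p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def measure(fn, arg, runs=50):
    """Call fn(arg) repeatedly and report p50/p95/p99 in milliseconds."""
    timings_ms = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(arg)
        timings_ms.append((time.perf_counter() - t0) * 1000.0)
    return {p: percentile(timings_ms, p) for p in (50, 95, 99)}
```

Run one warm-up call before `measure` so that model loading (the cold start) does not skew the percentiles.
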
## Usage

```bash
npm install nullpii onnxruntime-node
```

```ts
import { sanitize, restore } from 'nullpii';

const safe = await sanitize('Email John Smith at john@acme.io');
// → 'Email {{PII_PRIVATE_PERSON_0_…}} at {{PII_PRIVATE_EMAIL_0_…}}'

// ... your LLM call (OpenAI, Anthropic, Gemini, anything) returns `reply` ...

const back = restore(reply, safe.sessionId);
// → original text restored
```

The npm package downloads this model into `~/.cache/nullpii/` on the
first `sanitize()` call. Pre-warm with `npx nullpii prefetch`.

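The sanitize/restore round trip above is a reversible placeholder vault. The Python sketch below shows the idea only. `redact_spans` and `restore_text` are hypothetical names, the placeholder format is simplified (the real tokens carry an extra suffix that is elided in the example above), and overlapping spans are assumed not to occur.

```python
def redact_spans(text, spans):
    """Replace each detected span with a placeholder; return (safe_text, vault)."""
    vault = {}     # placeholder -> original text, kept in memory only
    counters = {}  # per-label running index
    parts = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s["start"]):
        index = counters.get(span["label"], 0)
        counters[span["label"]] = index + 1
        token = "{{PII_%s_%d}}" % (span["label"].upper(), index)
        vault[token] = text[span["start"]:span["end"]]
        parts.append(text[cursor:span["start"]])
        parts.append(token)
        cursor = span["end"]
    parts.append(text[cursor:])
    return "".join(parts), vault


def restore_text(text, vault):
    """Put the original values back wherever a placeholder survived."""
    for token, original in vault.items():
        text = text.replace(token, original)
    return text
```

Because the vault never leaves the process, the LLM only ever sees placeholders, and any placeholder it echoes back can be mapped to the original value.
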
## Direct ONNX (no runtime)

```py
# Python, minimum viable: tokenizer + ONNX inference + decoder
from gliner import GLiNER

# load_onnx_model=True runs the exported ONNX graph instead of PyTorch weights
model = GLiNER.from_pretrained('lBroth/nullpii', load_onnx_model=True)
spans = model.predict_entities(
    text='Email John at john@acme.io',
    labels=['private_person', 'private_email'],
)
# Each span is a dict with start/end offsets, text, label, and score
for span in spans:
    print(span['label'], span['text'], span['score'])
```

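Without the npm runtime you need to handle long inputs yourself. The sketch below shows one way to apply the 1400-character chunking mentioned for `nullpii-bare`. The 1400 limit comes from this card, but breaking at whitespace and the helper names `chunk_text` / `predict_long` are assumptions, and `predict` stands for any callable that returns GLiNER-style span dicts.

```python
def chunk_text(text, max_chars=1400):
    """Split text into (offset, chunk) pairs of at most max_chars each."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to break after the last space so words stay whole.
            ws = text.rfind(" ", start, end)
            if ws > start:
                end = ws + 1
        chunks.append((start, text[start:end]))
        start = end
    return chunks


def predict_long(text, predict, max_chars=1400):
    """Run predict() per chunk and shift spans back to document offsets."""
    spans = []
    for offset, chunk in chunk_text(text, max_chars):
        for span in predict(chunk):
            spans.append({
                **span,
                "start": span["start"] + offset,
                "end": span["end"] + offset,
            })
    return spans
```

With the model loaded as above, `predict` could be `lambda chunk: model.predict_entities(chunk, labels)`. An entity that straddles a chunk boundary can still be missed; overlapping windows would reduce that risk at the cost of deduplication.
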
## Limitations

- The detector will miss some PII; no model is 100% accurate.
- Not a HIPAA de-identifier: diagnoses, ICD codes, dosages, and
  biometric / genetic identifiers are out of scope.
- `private_ip` / `private_mac` come from the regex pack, not the model.
- Detection is best-effort. Use it as one layer of defence in depth,
  not as the sole privacy control.

## Attribution

- Base model: [`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1)
  (GLiNER, Zaratiana et al., NAACL 2024), Apache-2.0.
- mDeBERTa-v3 base, MIT.
- Includes Nemotron-PII derivative content, CC-BY-4.0.

See [`NOTICE`](https://github.com/lBroth/nullpii/blob/main/NOTICE) for the full
attribution + license posture.

## License

Apache-2.0.