How to use lBroth/nullpii with GLiNER:

    from gliner import GLiNER
    model = GLiNER.from_pretrained("lBroth/nullpii")
model card: full 12-class label table + adversarial coverage + sync with GitHub README
README.md CHANGED
@@ -91,11 +91,41 @@ Cold start (first `sanitize()`, ONNX load included): ~756 ms.

## Schema (12 classes)

| Label | Examples | Source |
|---|---|---|
| `private_person` | names | model |
| `private_email` | emails | model + regex |
| `private_phone` | int'l + IT / FR / ES / HIPAA-fax domestic | model + regex |
| `private_address` | street, city, ZIP | model |
| `private_date` | birth / hire dates | model |
| `private_url` | `http(s)://`, `www.` | model + regex |
| `private_ip` | IPv4, IPv6 (RFC 1918 / 5737 / loopback filtered) | regex post-pass |
| `private_mac` | MAC addresses (broadcast / multicast filtered) | regex post-pass |
| `private_passport` | US / IT / FR / ES / DE / UK + context-anchored generic (30 countries) | model (zero-shot) + regex post-pass |
| `private_driver_license` | US per-state + IT / EU per-country (context-anchored) | model (zero-shot) + regex post-pass |
| `private_vehicle_id` | VIN (ISO 3779 mod-11), plates IT / FR / DE / UK / ES / US | model (zero-shot) + regex (validated) |
| `private_geolocation` | lat/lon decimal pairs (range-validated) + DMS notation | model (zero-shot) + regex (validated) |
| `account_number` | IBAN mod-97, cards (Luhn), SSN, MRN, BTC / ETH, DNI / CPF / CF / EIN, Medicare MBI / HIC, NPI, insurance policy, IMEI | model + regex (validated) |
| `secret` | API keys (AWS / GitHub / OpenAI / Anthropic / Stripe / 30+), JWT, PEM, base64-wrapped PII | regex (50+) + base64 |

The GLiNER head is trained on 8 categories (the first 6 rows of the table plus `account_number` and `secret`). The other 4 (`private_passport`, `private_driver_license`, `private_vehicle_id`, `private_geolocation`) are prompted zero-shot and paired with a validated regex post-pass. `private_ip` and `private_mac` are regex-only; the model is not trained on them.
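
The "regex (validated)" sources in the table pair a loose pattern with a checksum, so that arbitrary digit runs are not flagged. The following is a minimal sketch of that idea for card numbers. The pattern and the names `CARD_CANDIDATE`, `luhn_ok` and `find_cards` are illustrative assumptions; the actual patterns in the nullpii recognizer pack are not shown here and may differ.

```python
import re

# Hypothetical candidate pattern: 13-19 digits, optionally separated by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_ok(digits: str) -> bool:
    """Return True if a digit string passes the Luhn checksum."""
    total = 0
    # Walk right to left, doubling every second digit.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_cards(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, label) spans for candidates that pass the checksum."""
    spans = []
    for m in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", m.group())
        if luhn_ok(digits):
            spans.append((m.start(), m.end(), "account_number"))
    return spans
```

With this sketch, `find_cards("card 4111 1111 1111 1111 ok")` yields one `account_number` span, while a digit run of the same shape that fails the checksum yields none.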
## Tricky inputs the npm runtime still catches

Where the adversarial-input preprocessor + recognizer pack pulls PII the bare model alone would miss:

| Surface | Input | Detected as |
|---|---|---|
| base64-wrapped secret | `(base64-encoded) c2stYW50LWFwaTAzLWFCY0RlRmcw…` | `sk-ant-api03-…` (Anthropic key) |
| HTML-entity-encoded secret | `sk-ant…` | `sk-ant-…` (Anthropic key) |
| double-URL-encoded email | `bob.jones%2540company.io` | `bob.jones@company.io` (email) |
| zero-width-obfuscated address | `221B Baker St`U+200B`re`U+200B`et `U+200B`London` | `221B Baker Street London` (address) |
| spaced-out email | `u s e r . 1 2 3 @ g m a i l . c o m` | `user.123@gmail.com` (email) |
| Cyrillic-homoglyph email | `pаyments@bank.com` (`а` = U+0430) | `payments@bank.com` (email) |
| fullwidth ASCII email | `USER.NAME@example.com` | `USER.NAME@example.com` (email) |
| Italian IBAN in prose | `IT60X0542811101000000123456` | `IT60X0542811101000000123456` (account_number, mod-97 verified) |
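
The mod-97 verification named in the IBAN row can be sketched as follows. This is the standard ISO 13616 check rather than the recognizer pack's own code, and `iban_ok` is a hypothetical helper name. The example uses the commonly published Italian sample IBAN `IT60X0542811101000000123456`.

```python
def iban_ok(iban: str) -> bool:
    """Check an IBAN with the ISO 13616 mod-97 rule."""
    s = iban.replace(" ", "").upper()
    if not (15 <= len(s) <= 34) or not (s.isascii() and s.isalnum()):
        return False
    # Move the country code and check digits to the end.
    rearranged = s[4:] + s[:4]
    # Map letters to numbers (A=10 ... Z=35); digits stay as they are.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    # A valid IBAN leaves a remainder of exactly 1.
    return int(numeric) % 97 == 1


print(iban_ok("IT60 X054 2811 1010 0000 0123 456"))
print(iban_ok("IT60X0542811101000001023456"))
```

The first call passes the check. The second string differs by a single shifted digit, leaves a remainder other than 1, and is rejected, which is the property that keeps IBAN-shaped noise from being labelled `account_number`.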

Five passes total: Unicode normalisation (NFKC + `any-ascii` transliteration), base64 decode-then-classify, iterative URL `%XX` + HTML-entity decode, zero-width strip with offset remap, 50+ validated regex pack.
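
Three of these passes can be approximated with the Python standard library. The sketch below is an illustration under stated assumptions, not the npm runtime's implementation: the function names, the `ZERO_WIDTH` set and the `max_rounds` limit are all hypothetical. The `any-ascii` transliteration step, which is what folds Cyrillic homoglyphs, is not reproduced here, since NFKC alone does not map `а` (U+0430) to `a`.

```python
import html
import unicodedata
from urllib.parse import unquote

# Zero-width characters often used to split tokens (illustrative, not exhaustive).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def nfkc(text: str) -> str:
    """Fold compatibility forms such as fullwidth ASCII to their plain equivalents."""
    return unicodedata.normalize("NFKC", text)


def iterative_decode(text: str, max_rounds: int = 5) -> str:
    """Apply URL %XX and HTML-entity decoding until the text stops changing."""
    for _ in range(max_rounds):
        decoded = html.unescape(unquote(text))
        if decoded == text:
            break
        text = decoded
    return text


def strip_zero_width(text: str) -> tuple[str, list[int]]:
    """Remove zero-width characters.

    Returns the cleaned text plus, for each cleaned character,
    its index in the original text.
    """
    kept, offsets = [], []
    for i, ch in enumerate(text):
        if ch not in ZERO_WIDTH:
            kept.append(ch)
            offsets.append(i)
    return "".join(kept), offsets


def remap_span(start: int, end: int, offsets: list[int]) -> tuple[int, int]:
    """Map a [start, end) span in the cleaned text back to the original text."""
    return offsets[start], offsets[end - 1] + 1
```

The offset list is what lets a detector run on the cleaned string while redaction is applied to the original one: a span found in `221B Baker Street London` maps back to the character range that still contains the U+200B characters.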

## Intended use

@@ -105,8 +135,8 @@ Pure regex post-pass (2): `private_ip` · `private_mac` (GLiNER head not trained

## Out-of-scope

- **Implied / opinion-based attributes** (race, religion, health conditions, political views, sexual orientation). These need a different kind of model — this one only finds explicit text spans.
- **HIPAA PHI** — `account_number` catches MRN-shaped digit runs but the model is not a HIPAA de-identifier. Diagnoses, ICD codes, dosages, biometric / genetic identifiers — out of scope.
- **CJK / RTL / Indic scripts** — limited coverage; treat as out-of-scope.
- **Air-gapped first-run** — point at a local mirror via `NULLPII_MODEL_DIR` or `modelDir` config.