lBroth committed
Commit dbfe77c · verified · 1 Parent(s): 0e67121

model card: full 12-class label table + adversarial coverage + sync with GitHub README

Files changed (1): README.md +36 -6
@@ -91,11 +91,41 @@ Cold start (first `sanitize()`, ONNX load included): ~756 ms.
 
  ## Schema (12 classes)
 
- ML-trained (8): `account_number` · `private_address` · `private_date` · `private_email` · `private_person` · `private_phone` · `private_url` · `secret`
-
- Zero-shot prompted + regex post-pass (4): `private_passport` · `private_driver_license` · `private_vehicle_id` · `private_geolocation`
 
- Pure regex post-pass (2): `private_ip` · `private_mac` (GLiNER head not trained on them).
 
  ## Intended use
 
@@ -105,8 +135,8 @@ Pure regex post-pass (2): `private_ip` · `private_mac` (GLiNER head not trained
 
  ## Out-of-scope
 
- - **GDPR Article 9 special categories** (health, biometric, genetic, religious, political, trade-union, sexual orientation). Not represented in the schema.
- - **HIPAA PHI** — `account_number` catches MRN-shaped digit runs but the model is not a HIPAA de-identifier.
  - **CJK / RTL / Indic scripts** — limited coverage; treat as out-of-scope.
  - **Air-gapped first-run** — point at a local mirror via `NULLPII_MODEL_DIR` or `modelDir` config.
 
 
 
  ## Schema (12 classes)
 
+ | Label | Examples | Source |
+ |---|---|---|
+ | `private_person` | names | model |
+ | `private_email` | emails | model + regex |
+ | `private_phone` | int'l + IT / FR / ES / HIPAA-fax domestic | model + regex |
+ | `private_address` | street, city, ZIP | model |
+ | `private_date` | birth / hire dates | model |
+ | `private_url` | `http(s)://`, `www.` | model + regex |
+ | `private_ip` | IPv4, IPv6 (RFC 1918 / 5737 / loopback filtered) | regex post-pass |
+ | `private_mac` | MAC addresses (broadcast / multicast filtered) | regex post-pass |
+ | `private_passport` | US / IT / FR / ES / DE / UK + context-anchored generic (30 countries) | model (zero-shot) + regex post-pass |
+ | `private_driver_license` | US per-state + IT / EU per-country (context-anchored) | model (zero-shot) + regex post-pass |
+ | `private_vehicle_id` | VIN (ISO 3779 mod-11), plates IT / FR / DE / UK / ES / US | model (zero-shot) + regex (validated) |
+ | `private_geolocation` | lat/lon decimal pairs (range-validated) + DMS notation | model (zero-shot) + regex (validated) |
+ | `account_number` | IBAN mod-97, cards (Luhn), SSN, MRN, BTC / ETH, DNI / CPF / CF / EIN, Medicare MBI / HIC, NPI, insurance policy, IMEI | model + regex (validated) |
+ | `secret` | API keys (AWS / GitHub / OpenAI / Anthropic / Stripe / 30+), JWT, PEM, base64-wrapped PII | regex (50+) + base64 |
+
+ The GLiNER head is trained on 8 categories (the first 6 rows of the table plus `account_number` and `secret`). The other 4 (`private_passport`, `private_driver_license`, `private_vehicle_id`, `private_geolocation`) are prompted zero-shot and paired with a validated regex post-pass. `private_ip` and `private_mac` are regex-only; the model is not trained on them.
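
To illustrate what a "validated" regex post-pass could look like, here is a minimal Python sketch of the two checksums named in the `account_number` row: Luhn for card numbers and mod-97 for IBANs. It is not the package's implementation (the runtime is an npm package); the names `luhn_ok`, `iban_ok`, `find_ibans` and the `IBAN_CANDIDATE` pattern are hypothetical and chosen only for this example.

```python
import re


def luhn_ok(digits: str) -> bool:
    """Return True if an ASCII digit string passes the Luhn checksum."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_ok(iban: str) -> bool:
    """Return True if the string passes the IBAN mod-97 check."""
    s = re.sub(r"\s+", "", iban).upper()
    if not re.fullmatch(r"[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}", s):
        return False
    # Move country code + check digits to the end, then map A..Z to 10..35.
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


# Hypothetical candidate pattern: loose on purpose, the checksum does the filtering.
IBAN_CANDIDATE = re.compile(r"\b[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}\b")


def find_ibans(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, value) for every candidate that passes mod-97."""
    return [
        (m.start(), m.end(), m.group())
        for m in IBAN_CANDIDATE.finditer(text)
        if iban_ok(m.group())
    ]
```

The design point is that the regex only proposes candidates; a span is reported only when the checksum agrees, which keeps arbitrary digit runs from being labelled as account numbers.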
+
+ ## Tricky inputs the npm runtime still catches
+
+ Where the adversarial-input preprocessor + recognizer pack pulls PII the bare model alone would miss:
+
+ | Surface | Input | Detected as |
+ |---|---|---|
+ | base64-wrapped secret | `(base64-encoded) c2stYW50LWFwaTAzLWFCY0RlRmcw…` | `sk-ant-api03-…` (Anthropic key) |
+ | HTML-entity-encoded secret | `sk-ant…` | `sk-ant-…` (Anthropic key) |
+ | double-URL-encoded email | `bob.jones%2540company.io` | `bob.jones@company.io` (email) |
+ | zero-width-obfuscated address | `221B Baker St`U+200B`re`U+200B`et `U+200B`London` | `221B Baker Street London` (address) |
+ | spaced-out email | `u s e r . 1 2 3 @ g m a i l . c o m` | `user.123@gmail.com` (email) |
+ | Cyrillic-homoglyph email | `pаyments@bank.com` (`а` = U+0430) | `payments@bank.com` (email) |
+ | fullwidth ASCII email | `USER.NAME@example.com` | `USER.NAME@example.com` (email) |
+ | Italian IBAN in prose | `IT60X0542811101000000123456` | `IT60X0542811101000000123456` (account_number, mod-97 verified) |
 
+ Five passes total: Unicode normalisation (NFKC + `any-ascii` transliteration), base64 decode-then-classify, iterative URL `%XX` + HTML-entity decode, zero-width strip with offset remap, 50+ validated regex pack.
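
As a rough illustration of four of these passes, the following Python sketch uses only the standard library. It is an assumption-laden approximation, not the runtime's code: the function names, the `ZERO_WIDTH` set, the `B64_RUN` pattern and the `max_rounds` cap are all hypothetical, and the `any-ascii` transliteration step (needed for the Cyrillic-homoglyph case) is left out because NFKC alone does not map homoglyphs.

```python
import base64
import html
import re
import unicodedata
from urllib.parse import unquote

# Assumed set of zero-width code points; the real list may differ.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
# Hypothetical pattern for base64-looking runs of 16+ characters.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def normalise(text: str) -> str:
    """NFKC folds compatibility forms such as fullwidth ASCII to plain ASCII."""
    return unicodedata.normalize("NFKC", text)


def decode_layers(text: str, max_rounds: int = 5) -> str:
    """Undo URL %XX and HTML-entity encoding until the text stops changing."""
    for _ in range(max_rounds):
        decoded = html.unescape(unquote(text))
        if decoded == text:
            break
        text = decoded
    return text


def strip_zero_width(text: str) -> tuple[str, list[int]]:
    """Drop zero-width characters; offsets[i] is the original index of cleaned[i]."""
    kept, offsets = [], []
    for i, ch in enumerate(text):
        if ch not in ZERO_WIDTH:
            kept.append(ch)
            offsets.append(i)
    return "".join(kept), offsets


def remap_span(offsets: list[int], start: int, end: int) -> tuple[int, int]:
    """Translate a [start, end) span in the cleaned text back to the original."""
    return offsets[start], offsets[end - 1] + 1


def decode_base64_runs(text: str) -> list[str]:
    """Decode base64-looking runs to UTF-8 so a later pass can classify them."""
    decoded = []
    for m in B64_RUN.finditer(text):
        core = m.group().rstrip("=")
        try:
            raw = base64.b64decode(core + "=" * (-len(core) % 4), validate=True)
            decoded.append(raw.decode("utf-8"))
        except ValueError:  # covers binascii.Error and UnicodeDecodeError
            continue
    return decoded
```

The offset map returned by `strip_zero_width` is what lets a detection on the cleaned text be reported (and redacted) at the right position in the original input; without it, spans found after stripping would point at the wrong characters.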
 
  ## Intended use
 
 
 
  ## Out-of-scope
 
+ - **Implied / opinion-based attributes** (race, religion, health conditions, political views, sexual orientation). These need a different kind of model — this one only finds explicit text spans.
+ - **HIPAA PHI** — `account_number` catches MRN-shaped digit runs but the model is not a HIPAA de-identifier. Diagnoses, ICD codes, dosages, biometric / genetic identifiers — out of scope.
  - **CJK / RTL / Indic scripts** — limited coverage; treat as out-of-scope.
  - **Air-gapped first-run** — point at a local mirror via `NULLPII_MODEL_DIR` or `modelDir` config.
 