hugging-hat committed · Commit 6ba6d38 · verified · 1 Parent(s): 90a6fbf

Upload README.md with huggingface_hub

Files changed: README.md (+150 -3)
---
license: apache-2.0
---
# NERPA — Fine-Tuned GLiNER2 for PII Anonymisation

A fine-tuned [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at [Overmind](https://overmindai.com).

## Why NERPA?

AWS Comprehend is a solid NER service, but it's a black box. The specific problem we hit was **date granularity**: Comprehend labels both a Date of Birth and an Appointment Date as `DATE`, but for PII anonymisation these require very different treatment. A DOB must be redacted; an appointment date is often essential debugging context.

GLiNER2 is a bi-encoder model that takes both text and entity label descriptions as input, enabling zero-shot entity detection for arbitrary types. We fine-tuned GLiNER2 Large to:

1. **Distinguish fine-grained date types** (`DATE_OF_BIRTH` vs `DATE_TIME`)
2. **Exceed AWS Comprehend accuracy** on our PII benchmark

| Model | Micro-Precision | Micro-Recall |
| --- | --- | --- |
| AWS Comprehend | 0.90 | 0.94 |
| GLiNER2 Large (off-the-shelf) | 0.84 | 0.89 |
| **NERPA (this model)** | **0.93** | **0.90** |

## Fine-Tuning Details

- **Base model:** [fastino/gliner2-large-v1](https://huggingface.co/fastino/gliner2-large-v1) (DeBERTa v3 Large backbone, 340M params)
- **Training data:** 1,210 synthetic snippets generated with Gemini 3 Pro + Python Faker, each containing 2–4 PII entities
- **Eval data:** 300 held-out snippets (no template overlap with training)
- **Strategy:** full-weight fine-tuning with differential learning rates:
  - Encoder (DeBERTa v3): `1e-7`
  - GLiNER-specific layers: `1e-6`
- **Batch size:** 64
- **Convergence:** 175 steps

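In PyTorch, differential learning rates like the ones above are usually expressed as optimizer parameter groups. A minimal sketch of that setup, with the caveat that the `encoder.` name prefix here is a hypothetical convention, not the actual module naming in GLiNER2 or the real NERPA training code:

```python
ENCODER_LR = 1e-7  # DeBERTa v3 backbone: barely nudged
HEAD_LR = 1e-6     # GLiNER-specific layers: 10x faster

def param_groups(named_params):
    """Split (name, param) pairs into two groups with different learning rates.

    Assumes encoder weights live under a hypothetical "encoder." prefix.
    """
    encoder, head = [], []
    for name, param in named_params:
        (encoder if name.startswith("encoder.") else head).append(param)
    return [
        {"params": encoder, "lr": ENCODER_LR},
        {"params": head, "lr": HEAD_LR},
    ]

# With a real model, the groups would be handed to an optimizer, e.g.:
# optimizer = torch.optim.AdamW(param_groups(model.named_parameters()))
```

The point of the split is that the pretrained backbone should move very little, while the task-specific span-scoring layers are allowed to adapt more aggressively.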
The synthetic data approach effectively distils the "knowledge" of a large LLM into a small, fast specialist model — what we call **indirect distillation**.

## Supported Entity Types

| Entity | Description |
| --- | --- |
| `PERSON_NAME` | Person name |
| `DATE_OF_BIRTH` | Date of birth |
| `DATE_TIME` | Generic date and time |
| `EMAIL` | Email address |
| `PHONE` | Phone numbers |
| `LOCATION` | Address, city, country, postcode, street |
| `AGE` | Age of a person |
| `BUSINESS_NAME` | Business name |
| `USERNAME` | Username |
| `URL` | Any URL |
| `BANK_ACCOUNT_DETAILS` | IBAN, SWIFT, routing numbers, etc. |
| `CARD_DETAILS` | Card number, CVV, expiration |
| `DIGITAL_KEYS` | Passwords, PINs, API keys |
| `PERSONAL_ID_NUMBERS` | Passport, driving licence, tax IDs |
| `TECHNICAL_ID_NUMBERS` | IP/MAC addresses, serial numbers |
| `VEHICLE_ID_NUMBERS` | License plates, VINs |

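Since GLiNER2 conditions on label descriptions, the table above corresponds directly to a label-to-description mapping passed at detection time. A sketch of that mapping, mirroring the table verbatim (the exact description wording used internally by `anonymise.py` may differ):

```python
# Label -> description mapping, copied from the entity table above.
PII_ENTITIES = {
    "PERSON_NAME": "Person name",
    "DATE_OF_BIRTH": "Date of birth",
    "DATE_TIME": "Generic date and time",
    "EMAIL": "Email address",
    "PHONE": "Phone numbers",
    "LOCATION": "Address, city, country, postcode, street",
    "AGE": "Age of a person",
    "BUSINESS_NAME": "Business name",
    "USERNAME": "Username",
    "URL": "Any URL",
    "BANK_ACCOUNT_DETAILS": "IBAN, SWIFT, routing numbers, etc.",
    "CARD_DETAILS": "Card number, CVV, expiration",
    "DIGITAL_KEYS": "Passwords, PINs, API keys",
    "PERSONAL_ID_NUMBERS": "Passport, driving licence, tax IDs",
    "TECHNICAL_ID_NUMBERS": "IP/MAC addresses, serial numbers",
    "VEHICLE_ID_NUMBERS": "License plates, VINs",
}
```

A dict in this shape is what the `entities=` keyword of `detect_entities` expects, as the "Detect a subset of entities" example below shows.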
## Quick Start

### Install dependencies

```bash
pip install gliner2 torch
```

### Anonymise text (CLI)

```bash
# Inline text
python anonymise.py "Dear John Smith, born 15/03/1990. Contact: john@acme.com"

# From file
python anonymise.py --file input.txt --output anonymised.txt

# Show detected entities
python anonymise.py --show-entities "Call me at 020-7946-0958, my IBAN is GB29NWBK60161331926819."
```

### Use in Python

```python
from anonymise import load_model, detect_entities, anonymise

model = load_model(".")  # path to this repo

text = (
    "Dear John Smith, your appointment is on 2025-03-15. "
    "Your date of birth (15/03/1990) has been verified. "
    "Please contact support at help@acme.com or call 020-7946-0958. "
    "Your account IBAN is GB29NWBK60161331926819. Regards, Acme Corp."
)

entities = detect_entities(model, text)
print(anonymise(text, entities))
```

Output:

```
Dear [PERSON_NAME], your appointment is on [DATE_TIME].
Your date of birth ([DATE_OF_BIRTH]) has been verified.
Please contact support at [EMAIL] or call [PHONE].
Your account IBAN is [BANK_ACCOUNT_DETAILS]. Regards, Acme Corp.
```

### Entity detection only

If you just need the raw entity offsets (e.g. for your own replacement logic):

```python
entities = detect_entities(model, text)
for e in entities:
    print(f'{e["type"]:25s} [{e["start"]}:{e["end"]}] score={e["score"]:.2f} "{text[e["start"]:e["end"]]}"')
```

```
PERSON_NAME               [5:15] score=1.00 "John Smith"
DATE_TIME                 [40:50] score=1.00 "2025-03-15"
DATE_OF_BIRTH             [72:82] score=1.00 "15/03/1990"
EMAIL                     [129:142] score=1.00 "help@acme.com"
PHONE                     [151:164] score=1.00 "020-7946-0958"
BANK_ACCOUNT_DETAILS      [187:209] score=1.00 "GB29NWBK60161331926819"
```

### Detect a subset of entities

```python
entities = detect_entities(model, text, entities={
    "PERSON_NAME": "Person name",
    "EMAIL": "Email",
})
```

## How It Works

The inference pipeline in `anonymise.py`:

1. **Chunking** — Long texts are split into 3,000-character chunks with 100-character overlap to stay within the model's context window.
2. **Batch prediction** — Chunks are fed through `GLiNER2.batch_extract_entities()` with `include_spans=True` to get character-level offsets.
3. **Date disambiguation** — `DATE_TIME` and `DATE_OF_BIRTH` are always detected together so the model can choose the best label per span.
4. **De-duplication** — Overlapping detections from chunk boundaries are merged, keeping the highest-confidence label for each position.
5. **Replacement** — Detected spans are replaced right-to-left with `[ENTITY_TYPE]` placeholders.

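The chunking, de-duplication, and replacement steps above can be sketched in plain Python. This is a simplified illustration under the stated chunk sizes, not the actual `anonymise.py` implementation (in particular, it assumes entity offsets have already been mapped back into full-text coordinates):

```python
CHUNK_SIZE = 3000
OVERLAP = 100

def chunk(text, size=CHUNK_SIZE, overlap=OVERLAP):
    """Split text into overlapping windows; returns (start_offset, chunk) pairs."""
    step = size - overlap
    return [(i, text[i:i + size]) for i in range(0, max(len(text), 1), step)]

def dedupe(entities):
    """Merge overlapping detections, keeping the highest-confidence span."""
    kept = []
    for e in sorted(entities, key=lambda e: -e["score"]):
        if all(e["end"] <= k["start"] or e["start"] >= k["end"] for k in kept):
            kept.append(e)
    return sorted(kept, key=lambda e: e["start"])

def redact(text, entities):
    """Replace spans right-to-left so earlier character offsets stay valid."""
    for e in sorted(entities, key=lambda e: -e["start"]):
        text = text[:e["start"]] + f'[{e["type"]}]' + text[e["end"]:]
    return text
```

Replacing right-to-left matters: editing the string left-to-right would shift every subsequent entity's `start`/`end` offsets as placeholders change the text length.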
## Notes

- **Confidence threshold:** Default is `0.25`. The model tends to be conservative, so a low threshold works well for high recall.
- **GLiNER2 version:** Requires `gliner2>=1.2.4`. Earlier versions had a bug where entity character offsets mapped to token positions instead of character positions; this is fixed in 1.2.4+.
- **Device:** Automatically uses CUDA > MPS > CPU.
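
If you want to trade recall for precision, detections can be re-filtered by score after the fact. A trivial sketch using the entity dict shape shown earlier (the `0.25` default mirrors the threshold noted above; any stricter cutoff is your choice):

```python
def filter_by_score(entities, threshold=0.25):
    """Drop detections whose confidence falls below the threshold."""
    return [e for e in entities if e["score"] >= threshold]
```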

## Citation

Built by [Akhat Rakishev](https://github.com/workhat) at [Overmind](https://overmindai.com).

Base model: [GLiNER2](https://huggingface.co/fastino/gliner2-large-v1) by Fastino AI.