---
language:
  - en
license: apache-2.0
library_name: gliner2
tags:
  - named-entity-recognition
  - ner
  - pii
  - anonymisation
  - gliner
  - gliner2
  - token-classification
  - privacy
datasets:
  - synthetic
base_model: fastino/gliner2-large-v1
model-index:
  - name: NERPA
    results:
      - task:
          type: token-classification
          name: Named Entity Recognition
        metrics:
          - type: precision
            value: 0.93
            name: Micro-Precision
          - type: recall
            value: 0.90
            name: Micro-Recall
pipeline_tag: token-classification
---

# NERPA: Fine-Tuned GLiNER2 for PII Anonymisation

A fine-tuned [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at [Overmind](https://overmindai.com).

## Why NERPA?

AWS Comprehend is a solid NER service, but it's a black box. The specific problem we hit was **date granularity**: Comprehend labels both a date of birth and an appointment date as `DATE`, but for PII anonymisation these require very different treatment. A DOB must be redacted; an appointment date is often essential debugging context.

GLiNER2 is a bi-encoder model that takes both text and entity label descriptions as input, enabling zero-shot entity detection for arbitrary types. We fine-tuned GLiNER2 Large to:

1. **Distinguish fine-grained date types** (DATE_OF_BIRTH vs DATE_TIME)
2. **Exceed AWS Comprehend accuracy** on our PII benchmark

| Model | Micro-Precision | Micro-Recall |
| --- | --- | --- |
| AWS Comprehend | 0.90 | 0.94 |
| GLiNER2 Large (off-the-shelf) | 0.84 | 0.89 |
| **NERPA (this model)** | **0.93** | **0.90** |
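
For reference, micro-averaged precision and recall pool true positives, false positives, and false negatives across all documents before dividing. The sketch below assumes exact span-and-label matching, which is a common convention but is not confirmed by this card; the function name and the toy spans are illustrative only.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label)


def micro_precision_recall(
    gold_docs: List[List[Span]], pred_docs: List[List[Span]]
) -> Tuple[float, float]:
    """Pool counts over all documents, then compute precision and recall."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold_set, pred_set = set(gold), set(pred)
        tp += len(gold_set & pred_set)   # exact span and label match
        fp += len(pred_set - gold_set)   # predicted but not in gold
        fn += len(gold_set - pred_set)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (made up): one date of birth is mislabelled as a generic date.
gold = [[(5, 15, "PERSON_NAME"), (72, 82, "DATE_OF_BIRTH")], [(0, 13, "EMAIL")]]
pred = [[(5, 15, "PERSON_NAME"), (72, 82, "DATE_TIME")], [(0, 13, "EMAIL")]]
precision, recall = micro_precision_recall(gold, pred)
```

Under exact matching, a span with the right boundaries but the wrong label counts as both a false positive and a false negative, which is why date-type confusion affects both metrics.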

## Fine-Tuning Details

- **Base model:** [fastino/gliner2-large-v1](https://huggingface.co/fastino/gliner2-large-v1) (DeBERTa v3 Large backbone, 340M params)
- **Training data:** 1,210 synthetic snippets generated with Gemini 3 Pro + Python Faker, each containing 2–4 PII entities
- **Eval data:** 300 held-out snippets (no template overlap with training)
- **Strategy:** Full weight fine-tuning with differential learning rates:
  - Encoder (DeBERTa v3): `1e-7`
  - GLiNER-specific layers: `1e-6`
- **Batch size:** 64
- **Convergence:** 175 steps
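
Differential learning rates are usually expressed as optimizer parameter groups. The following is a minimal sketch, not the training script used for NERPA: the `encoder.` name prefix and the helper `build_param_groups` are assumptions, and the optimizer itself is not stated in this card.

```python
from typing import Any, Dict, Iterable, List, Tuple

ENCODER_LR = 1e-7  # DeBERTa v3 encoder (value from this card)
HEAD_LR = 1e-6     # GLiNER-specific layers (value from this card)


def build_param_groups(
    named_params: Iterable[Tuple[str, Any]],
    encoder_prefix: str = "encoder.",  # hypothetical prefix, check your model
) -> List[Dict[str, Any]]:
    """Split parameters into encoder and non-encoder groups by name prefix."""
    encoder, head = [], []
    for name, param in named_params:
        (encoder if name.startswith(encoder_prefix) else head).append(param)
    return [
        {"params": encoder, "lr": ENCODER_LR},
        {"params": head, "lr": HEAD_LR},
    ]


# Strings stand in for tensors so the sketch runs without PyTorch.
named = [("encoder.layer.0.weight", "w0"), ("span_rep.weight", "w1")]
groups = build_param_groups(named)
```

With a real PyTorch model, a list of this shape can be passed as the first argument to an optimizer such as `torch.optim.AdamW`, for example `build_param_groups(model.named_parameters())`.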

The synthetic data approach effectively distils the "knowledge" of a large LLM into a small, fast specialist model, an approach we call **indirect distillation**.
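
This card does not describe the generation pipeline in detail. As a rough illustration of how a labelled snippet could be assembled from a template and generated values, here is a sketch in which the template, the `{LABEL}` slot syntax, and `fill_template` are all hypothetical, and hard-coded strings stand in for Faker output.

```python
import re
from typing import Dict, List, Tuple


def fill_template(template: str, values: Dict[str, str]) -> Tuple[str, List[dict]]:
    """Fill {LABEL} slots left to right and record character offsets."""
    parts: List[str] = []
    spans: List[dict] = []
    cursor = 0   # position in the template
    out_len = 0  # length of the text emitted so far
    for match in re.finditer(r"\{([A-Z_]+)\}", template):
        literal = template[cursor:match.start()]
        parts.append(literal)
        out_len += len(literal)
        label = match.group(1)
        value = values[label]
        spans.append({"type": label, "start": out_len, "end": out_len + len(value)})
        parts.append(value)
        out_len += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


# Hypothetical template; in practice the text would come from an LLM.
template = "Dear {PERSON_NAME}, your date of birth ({DATE_OF_BIRTH}) has been verified."
text, spans = fill_template(
    template, {"PERSON_NAME": "John Smith", "DATE_OF_BIRTH": "15/03/1990"}
)
```

Because the values are inserted programmatically, the span offsets are known exactly and no manual annotation is needed.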

## Supported Entity Types

| Entity | Description |
| --- | --- |
| `PERSON_NAME` | Person name |
| `DATE_OF_BIRTH` | Date of birth |
| `DATE_TIME` | Generic date and time |
| `EMAIL` | Email address |
| `PHONE` | Phone numbers |
| `LOCATION` | Address, city, country, postcode, street |
| `AGE` | Age of a person |
| `BUSINESS_NAME` | Business name |
| `USERNAME` | Username |
| `URL` | Any URL |
| `BANK_ACCOUNT_DETAILS` | IBAN, SWIFT, routing numbers, etc. |
| `CARD_DETAILS` | Card number, CVV, expiration |
| `DIGITAL_KEYS` | Passwords, PINs, API keys |
| `PERSONAL_ID_NUMBERS` | Passport, driving licence, tax IDs |
| `TECHNICAL_ID_NUMBERS` | IP/MAC addresses, serial numbers |
| `VEHICLE_ID_NUMBERS` | License plates, VINs |

## Quick Start

### Install dependencies

```bash
pip install gliner2 torch
```

### Anonymise text (CLI)

```bash
# Inline text
python anonymise.py "Dear John Smith, born 15/03/1990. Contact: john@acme.com"

# From file
python anonymise.py --file input.txt --output anonymised.txt

# Show detected entities
python anonymise.py --show-entities "Call me at 020-7946-0958, my IBAN is GB29NWBK60161331926819."
```

### Use in Python

```python
from anonymise import load_model, detect_entities, anonymise

model = load_model(".")  # path to this repo

text = (
    "Dear John Smith, your appointment is on 2025-03-15. "
    "Your date of birth (15/03/1990) has been verified. "
    "Please contact support at help@acme.com or call 020-7946-0958. "
    "Your account IBAN is GB29NWBK60161331926819. Regards, Acme Corp."
)

entities = detect_entities(model, text)
print(anonymise(text, entities))
```

Output:

```
Dear [PERSON_NAME], your appointment is on [DATE_TIME].
Your date of birth ([DATE_OF_BIRTH]) has been verified.
Please contact support at [EMAIL] or call [PHONE].
Your account IBAN is [BANK_ACCOUNT_DETAILS]. Regards, Acme Corp.
```

### Entity detection only

If you just need the raw entity offsets (e.g. for your own replacement logic):

```python
entities = detect_entities(model, text)
for e in entities:
    print(f'{e["type"]:25s} [{e["start"]}:{e["end"]}] score={e["score"]:.2f}  "{text[e["start"]:e["end"]]}"')
```

```
PERSON_NAME               [5:15] score=1.00  "John Smith"
DATE_TIME                 [40:50] score=1.00  "2025-03-15"
DATE_OF_BIRTH             [72:82] score=1.00  "15/03/1990"
EMAIL                     [129:142] score=1.00  "help@acme.com"
PHONE                     [151:164] score=1.00  "020-7946-0958"
BANK_ACCOUNT_DETAILS      [187:209] score=1.00  "GB29NWBK60161331926819"
```

### Detect a subset of entities

```python
entities = detect_entities(model, text, entities={
    "PERSON_NAME": "Person name",
    "EMAIL": "Email",
})
```

## How It Works

The inference pipeline in `anonymise.py`:

1. **Chunking:** Long texts are split into 3000-character chunks with 100-character overlap to stay within the model's context window.
2. **Batch prediction:** Chunks are fed through `GLiNER2.batch_extract_entities()` with `include_spans=True` to get character-level offsets.
3. **Date disambiguation:** Both `DATE_TIME` and `DATE_OF_BIRTH` are always detected together so the model can choose the best label per span.
4. **De-duplication:** Overlapping detections from chunk boundaries are merged, keeping the highest-confidence label for each position.
5. **Replacement:** Detected spans are replaced right-to-left with `[ENTITY_TYPE]` placeholders.
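
The chunking, de-duplication, and replacement steps can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `anonymise.py`: the helper names `chunk_text`, `deduplicate`, and `replace_spans` are hypothetical, and the greedy overlap rule is one simple way to keep the highest-confidence label.

```python
from typing import Dict, List, Tuple


def chunk_text(text: str, size: int = 3000, overlap: int = 100) -> List[Tuple[int, str]]:
    """Return (offset, chunk) pairs; consecutive chunks share `overlap` chars."""
    chunks, start = [], 0
    while True:
        chunks.append((start, text[start:start + size]))
        if start + size >= len(text):
            break
        start += size - overlap
    return chunks


def deduplicate(entities: List[Dict]) -> List[Dict]:
    """Greedily keep the highest-scoring entity among overlapping spans."""
    kept: List[Dict] = []
    for ent in sorted(entities, key=lambda e: -e["score"]):
        if all(ent["end"] <= k["start"] or ent["start"] >= k["end"] for k in kept):
            kept.append(ent)
    return sorted(kept, key=lambda e: e["start"])


def replace_spans(text: str, entities: List[Dict]) -> str:
    """Replace right-to-left so earlier offsets stay valid."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f'[{ent["type"]}]' + text[ent["end"]:]
    return text


# Offsets are document-level; chunk-local offsets would first be shifted
# by the chunk's starting offset returned from chunk_text().
sample = "Born 15/03/1990, email help@acme.com"
detected = [
    {"type": "DATE_TIME", "start": 5, "end": 15, "score": 0.60},
    {"type": "DATE_OF_BIRTH", "start": 5, "end": 15, "score": 0.95},
    {"type": "EMAIL", "start": 23, "end": 36, "score": 0.99},
]
result = replace_spans(sample, deduplicate(detected))
print(result)  # Born [DATE_OF_BIRTH], email [EMAIL]
```

Replacing from the end of the text backwards matters because each placeholder usually has a different length from the span it replaces, which would otherwise shift every later offset.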

## Notes

- **Confidence threshold:** Default is `0.25`. The model tends to be conservative, so a lower threshold works well for high recall.
- **GLiNER2 version:** Requires `gliner2>=1.2.4`. Earlier versions had a bug where entity character offsets mapped to token positions instead of character positions; this is fixed in 1.2.4+.
- **Device:** Automatically selects the first available of CUDA, MPS, and CPU, in that order.
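
The device preference order can be reproduced with standard PyTorch availability checks. The helper below is a sketch, not the code in `anonymise.py`; the name `pick_device` is hypothetical, and the `ImportError` fallback exists only so the snippet runs where PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu", in that order of preference."""
    try:
        import torch
    except ImportError:
        return "cpu"  # fallback for environments without PyTorch
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is absent in older PyTorch builds, hence getattr.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
```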

## Acknowledgements

This model is a fine-tuned version of [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) by [Fastino AI](https://fastino.ai). We thank the GLiNER2 authors for making their model and library openly available.

## Citation

If you use NERPA, please cite both this model and the original GLiNER2 paper:

```bibtex
@misc{nerpa2025,
  title={NERPA: Fine-Tuned GLiNER2 for PII Anonymisation},
  author={Akhat Rakishev},
  year={2025},
  url={https://huggingface.co/OvermindLab/nerpa},
}

@misc{zaratiana2025gliner2efficientmultitaskinformation,
  title={GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface},
  author={Urchade Zaratiana and Gil Pasternak and Oliver Boyd and George Hurn-Maloney and Ash Lewis},
  year={2025},
  eprint={2507.18546},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.18546},
}
```

Built by [Akhat Rakishev](https://github.com/workhat) at [Overmind](https://overmindai.com).

Overmind is infrastructure to make agents more reliable. Learn more at [overmindai.com](https://overmindai.com).