---
language:
- en
license: apache-2.0
library_name: gliner2
tags:
- named-entity-recognition
- ner
- pii
- anonymisation
- gliner
- gliner2
- token-classification
- privacy
datasets:
- synthetic
base_model: fastino/gliner2-large-v1
model-index:
- name: NERPA
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: precision
      value: 0.93
      name: Micro-Precision
    - type: recall
      value: 0.90
      name: Micro-Recall
pipeline_tag: token-classification
---
# NERPA - Fine-Tuned GLiNER2 for PII Anonymisation
A fine-tuned [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at [Overmind](https://overmindlab.ai).
## Fine-Tuning Details
- **Base model:** [fastino/gliner2-large-v1](https://huggingface.co/fastino/gliner2-large-v1) (DeBERTa v3 Large backbone, 340M params)
- **Training data:** 1,210 synthetic snippets generated with Gemini 3 Pro + Python Faker, each containing 2–4 PII entities
- **Eval data:** 300 held-out snippets (no template overlap with training)
- **Strategy:** Full weight fine-tuning with differential learning rates:
  - Encoder (DeBERTa v3): `1e-7`
  - GLiNER-specific layers: `1e-6`
- **Batch size:** 64
- **Convergence:** 175 steps
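The differential learning rates above can be expressed as optimizer parameter groups. The following is a minimal sketch, not the training code used for NERPA: the `encoder.` name prefix, the helper `build_param_groups`, and the stand-in parameter names are assumptions made for illustration. Only the two learning rates come from the list above.

```python
ENCODER_LR = 1e-7  # encoder (DeBERTa v3) rate from the list above
HEAD_LR = 1e-6     # GLiNER-specific layers rate from the list above


def build_param_groups(named_params, encoder_prefix="encoder."):
    """Split (name, param) pairs into two groups by name prefix.

    `encoder_prefix` is a hypothetical naming convention; inspect the real
    model's parameter names before relying on it.
    """
    encoder, head = [], []
    for name, param in named_params:
        (encoder if name.startswith(encoder_prefix) else head).append(param)
    return [
        {"params": encoder, "lr": ENCODER_LR},
        {"params": head, "lr": HEAD_LR},
    ]


# Stand-in parameters so the sketch runs without PyTorch installed.
demo_params = [("encoder.layer.0.weight", "w0"), ("span_rep.weight", "w1")]
groups = build_param_groups(demo_params)
print(groups)
```

With PyTorch, the same list-of-dicts shape can be built from `model.named_parameters()` and passed to an optimizer such as `torch.optim.AdamW`, which accepts per-group `lr` values.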
## Why NERPA?
NERPA combines two technical advantages that commercial NER services like AWS Comprehend cannot offer:
### 1. Bi-Encoder Architecture for Zero-Shot Entity Detection
GLiNER2 is a bi-encoder that takes both text and entity label descriptions as input, rather than treating entity types as fixed output classes. This architectural difference means you can define arbitrary entity types at inference time without retraining:
```python
# Standard PII entities
entities = detect_entities(model, text, entities={
    "PERSON_NAME": "Person name",
    "DATE_OF_BIRTH": "Date of birth",
    "EMAIL": "Email address",
})

# Add domain-specific entities on the fly
entities = detect_entities(model, text, entities={
    "PERSON_NAME": "Person name",
    "MEDICATION": "Drug or medication name",
    "DIAGNOSIS": "Medical condition or diagnosis",
    "LAB_VALUE": "Laboratory test result",
})
```
This isn't prompt engineering or few-shot learning. The model's bi-encoder architecture natively supports arbitrary entity schemas. Fine-tuning on PII improves precision on those specific types without degrading the zero-shot capability.
**Example:** Context-dependent entity distinction
```python
text = """Last weekend, I visited Riverside Farm & Wildlife Park with my family.
The kids were excited to see the tigers first—magnificent creatures pacing behind
the reinforced glass. My daughter Sarah kept comparing them to our tabby cat at home,
saying how similar their stripes looked, though obviously Mittens is much smaller and
sleeps on our couch rather than prowling through artificial jungle habitats."""
entities = detect_entities(model, text, entities={
    "ZOO": "Animals in a zoo or wildlife park",
    "PET": "Pet animals owned by someone",
})
```
Output:
```
Last weekend, I visited Riverside Farm & Wildlife Park with my family. The kids were
excited to see the [ZOO] first—magnificent creatures pacing behind the reinforced glass.
My daughter Sarah kept comparing them to our [PET] at home, saying how similar their
stripes looked, though obviously [PET] is much smaller and sleeps on our couch rather
than prowling through artificial jungle habitats.
```
The model correctly distinguishes tigers (zoo animals) from the tabby cat and even the cat's name Mittens (pets) based purely on contextual cues. No retraining required.
### 2. Superior Performance on Standard PII
Fine-tuning GLiNER2 Large on 1,210 synthetic PII examples produced a model that exceeds AWS Comprehend's precision on standard entity detection, with somewhat lower recall:
| Model | Micro-Precision | Micro-Recall |
| --- | --- | --- |
| AWS Comprehend | 0.90 | 0.94 |
| GLiNER2 Large (off-the-shelf) | 0.84 | 0.89 |
| **NERPA (this model)** | **0.93** | **0.90** |
NERPA achieves precision **3 percentage points higher** than AWS Comprehend (0.93 vs 0.90), at a recall 4 points lower (0.90 vs 0.94). The fine-tuning also enables fine-grained date disambiguation (`DATE_OF_BIRTH` vs `DATE_TIME`), which AWS Comprehend cannot do without custom model training.
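The table reports precision and recall only. If a single combined figure is useful, the harmonic mean (F1) can be derived from those two columns. The sketch below does that arithmetic on the rounded values from the table, so the results are approximate and are not figures reported for this model.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (micro-precision, micro-recall) copied from the table above.
reported = {
    "AWS Comprehend": (0.90, 0.94),
    "GLiNER2 Large (off-the-shelf)": (0.84, 0.89),
    "NERPA": (0.93, 0.90),
}

f1_scores = {name: f1(p, r) for name, (p, r) in reported.items()}
for name, score in f1_scores.items():
    print(f"{name:32s} F1 ~ {score:.3f}")
```

Which of precision, recall, or F1 matters most depends on the deployment: over-redaction costs precision, while missed PII costs recall.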
### The Architecture Advantage
AWS Comprehend treats entity types as fixed classification targets. Adding a new entity type requires:
1. Annotating thousands of examples
2. Training a custom model
3. Paying for model hosting
4. Managing model versioning
NERPA's bi-encoder architecture makes entity types a runtime parameter. Adding new entities is a single line of code.
## Pre-Optimised PII Entity Types
NERPA is fine-tuned on these entity types (but you can add more at inference time):
| Entity | Description |
| --- | --- |
| `PERSON_NAME` | Person name |
| `DATE_OF_BIRTH` | Date of birth |
| `DATE_TIME` | Generic date and time |
| `EMAIL` | Email address |
| `PHONE` | Phone numbers |
| `LOCATION` | Address, city, country, postcode, street |
| `AGE` | Age of a person |
| `BUSINESS_NAME` | Business name |
| `USERNAME` | Username |
| `URL` | Any URL |
| `BANK_ACCOUNT_DETAILS` | IBAN, SWIFT, routing numbers, etc. |
| `CARD_DETAILS` | Card number, CVV, expiration |
| `DIGITAL_KEYS` | Passwords, PINs, API keys |
| `PERSONAL_ID_NUMBERS` | Passport, driving licence, tax IDs |
| `TECHNICAL_ID_NUMBERS` | IP/MAC addresses, serial numbers |
| `VEHICLE_ID_NUMBERS` | License plates, VINs |
## Quick Start
### Install dependencies
```bash
pip install "gliner2>=1.2.4" torch
```
### Anonymise text (CLI)
```bash
# Inline text
python anonymise.py "Dear John Smith, born 15/03/1990. Contact: john@acme.com"
# From file
python anonymise.py --file input.txt --output anonymised.txt
# Show detected entities
python anonymise.py --show-entities "Call me at 020-7946-0958, my IBAN is GB29NWBK60161331926819."
```
### Use in Python
```python
from anonymise import load_model, detect_entities, anonymise
model = load_model(".") # path to this repo
text = (
    "Dear John Smith, your appointment is on 2025-03-15. "
    "Your date of birth (15/03/1990) has been verified. "
    "Please contact support at help@acme.com or call 020-7946-0958. "
)
entities = detect_entities(model, text)
print(anonymise(text, entities))
```
Output:
```
Dear [PERSON_NAME], your appointment is on [DATE_TIME].
Your date of birth ([DATE_OF_BIRTH]) has been verified.
Please contact support at [EMAIL] or call [PHONE].
```
### Entity detection only
If you just need the raw entity offsets (e.g. for your own replacement logic):
```python
entities = detect_entities(model, text)
for e in entities:
    print(f'{e["type"]:25s} [{e["start"]}:{e["end"]}] score={e["score"]:.2f} "{text[e["start"]:e["end"]]}"')
```
```
PERSON_NAME [5:15] score=1.00 "John Smith"
DATE_TIME [40:50] score=1.00 "2025-03-15"
DATE_OF_BIRTH [72:82] score=1.00 "15/03/1990"
EMAIL [129:142] score=1.00 "help@acme.com"
PHONE [151:164] score=1.00 "020-7946-0958"
BANK_ACCOUNT_DETAILS [187:209] score=1.00 "GB29NWBK60161331926819"
```
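As an example of custom replacement logic built on the raw offsets, the sketch below replaces each span with a numbered placeholder and returns a lookup table, so that repeated mentions of the same value share one tag. It assumes the entity dictionaries have the `type`, `start`, `end` and `score` keys shown above and that spans do not overlap. The function `pseudonymise` is hypothetical and is not part of `anonymise.py`.

```python
def pseudonymise(text, entities):
    """Replace spans with numbered tags; return (new_text, tag -> original)."""
    counters = {}  # entity type -> last number handed out
    tags = {}      # (type, surface text) -> placeholder
    # Number placeholders in reading order.
    for e in sorted(entities, key=lambda e: e["start"]):
        key = (e["type"], text[e["start"]:e["end"]])
        if key not in tags:
            counters[e["type"]] = counters.get(e["type"], 0) + 1
            tags[key] = f'[{e["type"]}_{counters[e["type"]]}]'
    # Substitute right-to-left so earlier offsets remain valid.
    out = text
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        tag = tags[(e["type"], text[e["start"]:e["end"]])]
        out = out[:e["start"]] + tag + out[e["end"]:]
    mapping = {tag: surface for (_, surface), tag in tags.items()}
    return out, mapping


sample = "John Smith emailed Jane Doe. John Smith replied."
spans = [  # hand-written spans standing in for detect_entities() output
    {"type": "PERSON_NAME", "start": 0, "end": 10, "score": 1.0},
    {"type": "PERSON_NAME", "start": 19, "end": 27, "score": 1.0},
    {"type": "PERSON_NAME", "start": 29, "end": 39, "score": 1.0},
]
masked, mapping = pseudonymise(sample, spans)
print(masked)
print(mapping)
```

The returned mapping contains the original PII, so it needs the same protection as the source text.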
### Detect a subset of entities
```python
entities = detect_entities(model, text, entities={
    "PERSON_NAME": "Person name",
    "EMAIL": "Email",
})
```
### Custom entities
You can detect additional entity types beyond the built-in PII set. Because the model is zero-shot, any label and description pair can be passed. Custom entities are detected and anonymised alongside the fine-tuned ones, though accuracy on them has not been tuned the way it has for the built-in set.
**CLI** — use `--extra-entities` / `-e`:
```bash
python anonymise.py -e PRODUCT="Product name" -e SKILL="Professional skill" \
"John Smith is a senior Python developer who bought a MacBook Pro."
```
Output:
```
[PERSON_NAME] is a senior [SKILL] developer who bought a [PRODUCT].
```
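For readers wiring a similar flag into their own script, repeated `LABEL=description` options can be collected with `argparse`. This is a sketch of one way to do it, not the code in `anonymise.py`; the helper `parse_extra_entities` is hypothetical.

```python
import argparse


def parse_extra_entities(pairs):
    """Turn ["PRODUCT=Product name", ...] into {"PRODUCT": "Product name", ...}."""
    extra = {}
    for pair in pairs or []:
        # partition() splits on the first "=", so descriptions may contain "=".
        label, sep, description = pair.partition("=")
        if not sep or not label.strip() or not description.strip():
            raise ValueError(f"expected LABEL=description, got {pair!r}")
        extra[label.strip()] = description.strip()
    return extra


parser = argparse.ArgumentParser()
parser.add_argument("-e", "--extra-entities", action="append", default=[],
                    metavar="LABEL=DESCRIPTION")
parser.add_argument("text")

# Passing an explicit argv list keeps the example runnable without a shell.
args = parser.parse_args([
    "-e", "PRODUCT=Product name",
    "-e", "SKILL=Professional skill",
    "John Smith is a senior Python developer who bought a MacBook Pro.",
])
extra = parse_extra_entities(args.extra_entities)
print(extra)
```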
**Python:**
```python
from anonymise import load_model, detect_entities, anonymise, PII_ENTITIES
model = load_model(".")
custom_entities = {
    **PII_ENTITIES,
    "PRODUCT": "Product name",
    "SKILL": "Professional skill",
}
text = "John Smith is a senior Python developer who bought a MacBook Pro."
entities = detect_entities(model, text, entities=custom_entities)
print(anonymise(text, entities))
```
## How It Works
The inference pipeline in `anonymise.py`:
1. **Chunking** — Long texts are split into 3000-character chunks with a 100-character overlap to stay within the model's context window. The chunk size is adjustable because DeBERTa-v3, the underlying encoder, uses relative position encoding. In our tests, 3000 characters worked as well as smaller sizes.
2. **Batch prediction** — Chunks are fed through `GLiNER2.batch_extract_entities()` with `include_spans=True` to get character-level offsets.
3. **Date disambiguation** — Both `DATE_TIME` and `DATE_OF_BIRTH` are always detected together so the model can choose the best label per span.
4. **De-duplication** — Overlapping detections from chunk boundaries are merged, keeping the highest-confidence label for each position.
5. **Replacement** — Detected spans are replaced right-to-left with `[ENTITY_TYPE]` placeholders.
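The chunking, de-duplication and replacement steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `anonymise.py`: the function names are hypothetical, the model call is omitted, and "keeping the highest-confidence label" is implemented here as a simple greedy pass over score-sorted spans.

```python
def chunk_text(text, size=3000, overlap=100):
    """Return (offset, chunk) pairs; consecutive chunks share `overlap` chars."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks, start = [], 0
    while True:
        chunks.append((start, text[start:start + size]))
        if start + size >= len(text):
            return chunks
        start += size - overlap


def to_global(entities, offset):
    """Shift chunk-local character offsets to document offsets."""
    return [{**e, "start": e["start"] + offset, "end": e["end"] + offset}
            for e in entities]


def deduplicate(entities):
    """Among overlapping spans, keep only the highest-scoring one."""
    kept = []
    for e in sorted(entities, key=lambda e: -e["score"]):
        if all(e["end"] <= k["start"] or e["start"] >= k["end"] for k in kept):
            kept.append(e)
    return sorted(kept, key=lambda e: e["start"])


def replace_spans(text, entities):
    """Substitute right-to-left so earlier offsets stay valid."""
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:e["start"]] + f'[{e["type"]}]' + text[e["end"]:]
    return text


doc = "Dear John Smith, born 15/03/1990."
detections = [  # hand-written stand-ins for model output, with made-up scores
    {"type": "PERSON_NAME", "start": 5, "end": 15, "score": 0.98},
    {"type": "PERSON_NAME", "start": 10, "end": 15, "score": 0.60},
    {"type": "DATE_TIME", "start": 22, "end": 32, "score": 0.40},
    {"type": "DATE_OF_BIRTH", "start": 22, "end": 32, "score": 0.95},
]
result = replace_spans(doc, deduplicate(detections))
print(result)
```

In a full pipeline, each chunk's detections would pass through `to_global` with that chunk's offset before de-duplication, so that spans found twice in an overlap region collapse to one.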
## Notes
- **Confidence threshold:** Default is `0.25`. The model tends to be conservative, so a low threshold helps recall.
- **GLiNER2 version:** Requires `gliner2>=1.2.4`. Earlier versions had a bug where entity character offsets mapped to token positions instead of character positions; this is fixed in 1.2.4+.
- **Device:** Automatically selects CUDA if available, then MPS, then CPU.
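Two of the notes above lend themselves to a short check in calling code. The sketch below filters detections at a stricter threshold after the fact and compares a version string against the `1.2.4` minimum. Both helpers are hypothetical, the `>=` comparison on scores is an assumption about how a threshold would be applied, and the version parser handles plain numeric versions only.

```python
def filter_by_score(entities, threshold=0.25):
    """Keep entities whose score is at or above `threshold`."""
    return [e for e in entities if e["score"] >= threshold]


def version_at_least(installed, minimum="1.2.4"):
    """Compare dotted numeric versions such as '1.2.4' (no pre-release tags)."""
    def as_tuple(version):
        return tuple(int(part) for part in version.split(".")[:3])
    return as_tuple(installed) >= as_tuple(minimum)


detections = [  # hand-written stand-ins for model output, with made-up scores
    {"type": "EMAIL", "start": 0, "end": 13, "score": 0.97},
    {"type": "USERNAME", "start": 0, "end": 4, "score": 0.31},
]
strict = filter_by_score(detections, threshold=0.5)
print([e["type"] for e in strict])
print(version_at_least("1.2.3"), version_at_least("1.10.0"))
```

The installed version string can be read with `importlib.metadata.version("gliner2")` from the standard library.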
## Acknowledgements
This model is a fine-tuned version of [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) by [Fastino AI](https://fastino.ai). We thank the GLiNER2 authors for making their model and library openly available.
## Citation
If you use NERPA, please cite both this model and the original GLiNER2 paper:
```bibtex
@misc{nerpa2025,
title={NERPA: Fine-Tuned GLiNER2 for PII Anonymisation},
author={Akhat Rakishev},
year={2025},
url={https://huggingface.co/OvermindLab/nerpa},
}
@misc{zaratiana2025gliner2efficientmultitaskinformation,
title={GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface},
author={Urchade Zaratiana and Gil Pasternak and Oliver Boyd and George Hurn-Maloney and Ash Lewis},
year={2025},
eprint={2507.18546},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.18546},
}
```
Built by [Akhat Rakishev](https://github.com/akhatre) at [Overmind](https://overmindlab.ai).
Overmind is infrastructure for end-to-end agent optimisation. Learn more at [overmindlab.ai](https://overmindlab.ai).