---
language:
- en
license: apache-2.0
library_name: gliner2
tags:
- named-entity-recognition
- ner
- pii
- anonymisation
- gliner
- gliner2
- token-classification
- privacy
datasets:
- synthetic
base_model: fastino/gliner2-large-v1
model-index:
- name: NERPA
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: precision
      value: 0.93
      name: Micro-Precision
    - type: recall
      value: 0.90
      name: Micro-Recall
pipeline_tag: token-classification
---

# NERPA — Fine-Tuned GLiNER2 for PII Anonymisation

A fine-tuned [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at [Overmind](https://overmindai.com).

## Why NERPA?

AWS Comprehend is a solid NER service, but it's a black box. The specific problem we hit was **date granularity**: Comprehend labels both a date of birth and an appointment date as `DATE`, but for PII anonymisation these require very different treatment. A DOB must be redacted; an appointment date is often essential debugging context.

GLiNER2 is a bi-encoder model that takes both the text and natural-language entity label descriptions as input, enabling zero-shot detection of arbitrary entity types. We fine-tuned GLiNER2 Large to:

1. **Distinguish fine-grained date types** (`DATE_OF_BIRTH` vs `DATE_TIME`)
2. **Exceed AWS Comprehend precision** on our internal PII benchmark

| Model | Micro-Precision | Micro-Recall |
| --- | --- | --- |
| AWS Comprehend | 0.90 | 0.94 |
| GLiNER2 Large (off-the-shelf) | 0.84 | 0.89 |
| **NERPA (this model)** | **0.93** | **0.90** |

## Fine-Tuning Details

- **Base model:** [fastino/gliner2-large-v1](https://huggingface.co/fastino/gliner2-large-v1) (DeBERTa v3 Large backbone, 340M params)
- **Training data:** 1,210 synthetic snippets generated with Gemini 3 Pro + Python Faker, each containing 2–4 PII entities
- **Eval data:** 300 held-out snippets (no template overlap with the training set)
- **Strategy:** full-weight fine-tuning with differential learning rates:
  - Encoder (DeBERTa v3): `1e-7`
  - GLiNER-specific layers: `1e-6`
- **Batch size:** 64
- **Convergence:** 175 steps
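
The differential learning rates above are the kind of setup you get from optimizer parameter groups: partition the model's parameters by module name and give each group its own learning rate. A minimal sketch (the `build_param_groups` helper and the `encoder.` name prefix are illustrative assumptions, not the actual training script; the returned list is the shape you would pass to e.g. `torch.optim.AdamW`):

```python
def build_param_groups(named_params, encoder_prefix="encoder.",
                       enc_lr=1e-7, head_lr=1e-6):
    """Split (name, param) pairs into two optimizer groups with different LRs."""
    encoder, head = [], []
    for name, param in named_params:
        (encoder if name.startswith(encoder_prefix) else head).append(param)
    return [
        {"params": encoder, "lr": enc_lr},  # DeBERTa v3 backbone: tiny LR
        {"params": head, "lr": head_lr},    # GLiNER-specific layers: 10x larger
    ]

# Toy example with strings standing in for parameter tensors:
groups = build_param_groups([
    ("encoder.layer.0.weight", "w0"),
    ("span_head.weight", "w1"),
])
```

Keeping the encoder LR an order of magnitude below the head LR preserves the pretrained backbone while letting the GLiNER-specific layers adapt to the new label set.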

The synthetic data approach effectively distils the "knowledge" of a large LLM into a small, fast specialist model — what we call **indirect distillation**.
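
The training-data recipe (templated snippets, each with a handful of PII entities and gold character spans) can be illustrated without the LLM in the loop. A toy generator, with hand-rolled values standing in for Gemini/Faker output; the `PROVIDERS`/`TEMPLATES` names and contents are made up for illustration:

```python
import random
import re

# Toy value providers standing in for Python Faker / LLM-generated values.
PROVIDERS = {
    "PERSON_NAME": ["John Smith", "Aisha Khan"],
    "DATE_OF_BIRTH": ["15/03/1990", "02/11/1985"],
    "EMAIL": ["john@acme.com", "aisha@example.org"],
}

TEMPLATES = [
    "Dear {PERSON_NAME}, born {DATE_OF_BIRTH}. Contact: {EMAIL}",
]

def generate_snippet(rng):
    """Fill a template and record gold character spans for each entity."""
    template = rng.choice(TEMPLATES)
    text, spans, last = "", [], 0
    for m in re.finditer(r"\{(\w+)\}", template):
        text += template[last:m.start()]
        value = rng.choice(PROVIDERS[m.group(1)])
        spans.append({"type": m.group(1),
                      "start": len(text), "end": len(text) + len(value)})
        text += value
        last = m.end()
    return text + template[last:], spans

text, spans = generate_snippet(random.Random(42))
```

Because the spans are recorded at generation time, every snippet comes with exact gold offsets, so no manual annotation pass is needed.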

## Supported Entity Types

| Entity | Description |
| --- | --- |
| `PERSON_NAME` | Person name |
| `DATE_OF_BIRTH` | Date of birth |
| `DATE_TIME` | Generic date and time |
| `EMAIL` | Email address |
| `PHONE` | Phone number |
| `LOCATION` | Address, city, country, postcode, street |
| `AGE` | Age of a person |
| `BUSINESS_NAME` | Business name |
| `USERNAME` | Username |
| `URL` | Any URL |
| `BANK_ACCOUNT_DETAILS` | IBAN, SWIFT, routing numbers, etc. |
| `CARD_DETAILS` | Card number, CVV, expiration |
| `DIGITAL_KEYS` | Passwords, PINs, API keys |
| `PERSONAL_ID_NUMBERS` | Passport, driving licence, tax IDs |
| `TECHNICAL_ID_NUMBERS` | IP/MAC addresses, serial numbers |
| `VEHICLE_ID_NUMBERS` | Licence plates, VINs |

## Quick Start

### Install dependencies

```bash
pip install gliner2 torch
```

### Anonymise text (CLI)

```bash
# Inline text
python anonymise.py "Dear John Smith, born 15/03/1990. Contact: john@acme.com"

# From file
python anonymise.py --file input.txt --output anonymised.txt

# Show detected entities
python anonymise.py --show-entities "Call me at 020-7946-0958, my IBAN is GB29NWBK60161331926819."
```

### Use in Python

```python
from anonymise import load_model, detect_entities, anonymise

model = load_model(".")  # path to this repo

text = (
    "Dear John Smith, your appointment is on 2025-03-15. "
    "Your date of birth (15/03/1990) has been verified. "
    "Please contact support at help@acme.com or call 020-7946-0958. "
    "Your account IBAN is GB29NWBK60161331926819. Regards, Acme Corp."
)

entities = detect_entities(model, text)
print(anonymise(text, entities))
```

Output:

```
Dear [PERSON_NAME], your appointment is on [DATE_TIME].
Your date of birth ([DATE_OF_BIRTH]) has been verified.
Please contact support at [EMAIL] or call [PHONE].
Your account IBAN is [BANK_ACCOUNT_DETAILS]. Regards, Acme Corp.
```

### Entity detection only

If you just need the raw entity offsets (e.g. for your own replacement logic):

```python
entities = detect_entities(model, text)
for e in entities:
    print(f'{e["type"]:25s} [{e["start"]}:{e["end"]}] score={e["score"]:.2f} "{text[e["start"]:e["end"]]}"')
```

```
PERSON_NAME               [5:15] score=1.00 "John Smith"
DATE_TIME                 [40:50] score=1.00 "2025-03-15"
DATE_OF_BIRTH             [72:82] score=1.00 "15/03/1990"
EMAIL                     [129:142] score=1.00 "help@acme.com"
PHONE                     [151:164] score=1.00 "020-7946-0958"
BANK_ACCOUNT_DETAILS      [187:209] score=1.00 "GB29NWBK60161331926819"
```

### Detect a subset of entities

```python
entities = detect_entities(model, text, entities={
    "PERSON_NAME": "Person name",
    "EMAIL": "Email",
})
```

## How It Works

The inference pipeline in `anonymise.py`:

1. **Chunking** — Long texts are split into 3,000-character chunks with 100-character overlap to stay within the model's context window.
2. **Batch prediction** — Chunks are fed through `GLiNER2.batch_extract_entities()` with `include_spans=True` to get character-level offsets.
3. **Date disambiguation** — `DATE_TIME` and `DATE_OF_BIRTH` are always detected together so the model can choose the best label for each span.
4. **De-duplication** — Overlapping detections from chunk boundaries are merged, keeping the highest-confidence label for each position.
5. **Replacement** — Detected spans are replaced right-to-left with `[ENTITY_TYPE]` placeholders, so earlier offsets stay valid.
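
Steps 1, 4, and 5 can be sketched in a few lines of plain Python. This is a simplified illustration, not the actual `anonymise.py` code; in the real pipeline the entity dicts come from the model, with chunk offsets added back in:

```python
def chunk_text(text, size=3000, overlap=100):
    """Split text into overlapping chunks, as (offset, chunk) pairs."""
    step = size - overlap
    return [(i, text[i:i + size]) for i in range(0, len(text), step)] or [(0, "")]

def dedupe(entities):
    """Merge overlapping detections, keeping the highest-scoring span."""
    kept = []
    for e in sorted(entities, key=lambda e: -e["score"]):
        # Keep e only if it doesn't overlap anything already kept.
        if all(e["end"] <= k["start"] or e["start"] >= k["end"] for k in kept):
            kept.append(e)
    return sorted(kept, key=lambda e: e["start"])

def replace(text, entities):
    """Substitute spans right-to-left so earlier offsets stay valid."""
    for e in sorted(entities, key=lambda e: -e["start"]):
        text = text[:e["start"]] + f'[{e["type"]}]' + text[e["end"]:]
    return text

ents = [
    {"type": "PERSON_NAME", "start": 5, "end": 15, "score": 0.99},
    {"type": "USERNAME", "start": 10, "end": 15, "score": 0.40},  # overlap, lower score
]
print(replace("Dear John Smith, hello.", dedupe(ents)))
# → Dear [PERSON_NAME], hello.
```

Sorting by score before the overlap check is what makes de-duplication keep the highest-confidence label; replacing right-to-left means each substitution only shifts text after offsets that have already been handled.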

## Notes

- **Confidence threshold:** default is `0.25`. The model tends to assign conservative scores, so a relatively low threshold works well for high recall.
- **GLiNER2 version:** requires `gliner2>=1.2.4`. Earlier versions had a bug where entity character offsets actually mapped to token positions; this is fixed in 1.2.4+.
- **Device:** automatically picks CUDA, then MPS, then CPU.

## Acknowledgements

This model is a fine-tuned version of [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) by [Fastino AI](https://fastino.ai). We thank the GLiNER2 authors for making their model and library openly available.

## Citation

If you use NERPA, please cite both this model and the original GLiNER2 paper:

```bibtex
@misc{nerpa2025,
  title={NERPA: Fine-Tuned GLiNER2 for PII Anonymisation},
  author={Akhat Rakishev},
  year={2025},
  url={https://huggingface.co/OvermindLab/nerpa}
}

@misc{zaratiana2025gliner2efficientmultitaskinformation,
  title={GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface},
  author={Urchade Zaratiana and Gil Pasternak and Oliver Boyd and George Hurn-Maloney and Ash Lewis},
  year={2025},
  eprint={2507.18546},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.18546}
}
```

Built by [Akhat Rakishev](https://github.com/workhat) at [Overmind](https://overmindai.com).

Overmind is infrastructure to make agents more reliable. Learn more at [overmindai.com](https://overmindai.com).
|
|