---
language:
  - en
license: apache-2.0
library_name: gliner2
tags:
  - named-entity-recognition
  - ner
  - pii
  - anonymisation
  - gliner
  - gliner2
  - token-classification
  - privacy
datasets:
  - synthetic
base_model: fastino/gliner2-large-v1
model-index:
  - name: NERPA
    results:
      - task:
          type: token-classification
          name: Named Entity Recognition
        metrics:
          - type: precision
            value: 0.93
            name: Micro-Precision
          - type: recall
            value: 0.90
            name: Micro-Recall
pipeline_tag: token-classification
---

# NERPA — Fine-Tuned GLiNER2 for PII Anonymisation

A fine-tuned [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at [Overmind](https://overmindai.com).

## Why NERPA?

AWS Comprehend is a solid NER service, but it's a black box. The specific problem we hit was **date granularity** — Comprehend labels both a Date of Birth and an Appointment Date as `DATE`, but for PII anonymisation these require very different treatment. A DOB must be redacted; an appointment date is often essential debugging context.

GLiNER2 is a bi-encoder model that takes both text and entity label descriptions as input, enabling zero-shot entity detection for arbitrary types. We fine-tuned GLiNER2 Large to:

1. **Distinguish fine-grained date types** (DATE_OF_BIRTH vs DATE_TIME)
2. **Exceed AWS Comprehend's precision** on our PII benchmark (at somewhat lower recall, as the table below shows)

| Model | Micro-Precision | Micro-Recall |
| --- | --- | --- |
| AWS Comprehend | 0.90 | 0.94 |
| GLiNER2 Large (off-the-shelf) | 0.84 | 0.89 |
| **NERPA (this model)** | **0.93** | **0.90** |
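
For reference, micro-averaged precision and recall pool true positives, false positives, and false negatives across all entity types before dividing. The sketch below assumes exact-match scoring on `(type, start, end)` tuples. The matching criterion actually used for the benchmark above is not stated here, and `micro_precision_recall` is a hypothetical helper name.

```python
def micro_precision_recall(gold_docs, pred_docs):
    """Pool TP/FP/FN over all documents and entity types (micro-averaging).

    Each document is a set of (entity_type, start, end) tuples.
    Assumption: a prediction counts only on an exact type-and-offset match.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        tp += len(gold & pred)   # predicted and correct
        fp += len(pred - gold)   # predicted but not in gold
        fn += len(gold - pred)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: the date span is found but given the wrong label.
gold = [{("PERSON_NAME", 5, 15), ("DATE_OF_BIRTH", 72, 82)}, {("EMAIL", 0, 13)}]
pred = [{("PERSON_NAME", 5, 15), ("DATE_TIME", 72, 82)}, {("EMAIL", 0, 13)}]
precision, recall = micro_precision_recall(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this strict criterion a mislabelled date counts as both a false positive and a false negative, which is why fine-grained date types affect both metrics.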

## Fine-Tuning Details

- **Base model:** [fastino/gliner2-large-v1](https://huggingface.co/fastino/gliner2-large-v1) (DeBERTa v3 Large backbone, 340M params)
- **Training data:** 1,210 synthetic snippets generated with Gemini 3 Pro + Python Faker, each containing 2–4 PII entities
- **Eval data:** 300 held-out snippets (no template overlap with training)
- **Strategy:** Full weight fine-tuning with differential learning rates:
  - Encoder (DeBERTa v3): `1e-7`
  - GLiNER-specific layers: `1e-6`
- **Batch size:** 64
- **Convergence:** 175 steps
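
Differential learning rates are typically implemented as optimizer parameter groups. The sketch below splits parameters by name prefix into the two groups listed above. The `encoder.` prefix, the `build_param_groups` helper, and the stand-in parameter names are assumptions for illustration, not the actual training script or GLiNER2's module layout.

```python
ENCODER_LR = 1e-7  # DeBERTa v3 encoder (value from the list above)
HEAD_LR = 1e-6     # GLiNER-specific layers (value from the list above)


def build_param_groups(named_params, encoder_prefix="encoder."):
    """Split (name, param) pairs into two groups with different learning rates.

    The returned list has the shape PyTorch optimizers accept, e.g.
    torch.optim.AdamW(build_param_groups(model.named_parameters())).
    Assumption: encoder weights are identifiable by a name prefix.
    """
    encoder, head = [], []
    for name, param in named_params:
        (encoder if name.startswith(encoder_prefix) else head).append(param)
    return [
        {"params": encoder, "lr": ENCODER_LR},
        {"params": head, "lr": HEAD_LR},
    ]


# Stand-in parameters so the sketch runs without a model loaded.
fake_params = [
    ("encoder.layer.0.weight", "w0"),
    ("encoder.layer.0.bias", "b0"),
    ("span_head.weight", "w1"),
]
groups = build_param_groups(fake_params)
for g in groups:
    print(f"lr={g['lr']:g} n_params={len(g['params'])}")
```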

The synthetic data approach effectively distils the "knowledge" of a large language model into a small, fast specialist model, a process we call **indirect distillation**.

## Supported Entity Types

| Entity | Description |
| --- | --- |
| `PERSON_NAME` | Person name |
| `DATE_OF_BIRTH` | Date of birth |
| `DATE_TIME` | Generic date and time |
| `EMAIL` | Email address |
| `PHONE` | Phone numbers |
| `LOCATION` | Address, city, country, postcode, street |
| `AGE` | Age of a person |
| `BUSINESS_NAME` | Business name |
| `USERNAME` | Username |
| `URL` | Any URL |
| `BANK_ACCOUNT_DETAILS` | IBAN, SWIFT, routing numbers, etc. |
| `CARD_DETAILS` | Card number, CVV, expiration |
| `DIGITAL_KEYS` | Passwords, PINs, API keys |
| `PERSONAL_ID_NUMBERS` | Passport, driving licence, tax IDs |
| `TECHNICAL_ID_NUMBERS` | IP/MAC addresses, serial numbers |
| `VEHICLE_ID_NUMBERS` | Licence plates, VINs |

## Quick Start

### Install dependencies

```bash
pip install "gliner2>=1.2.4" torch
```

### Anonymise text (CLI)

```bash
# Inline text
python anonymise.py "Dear John Smith, born 15/03/1990. Contact: john@acme.com"

# From file
python anonymise.py --file input.txt --output anonymised.txt

# Show detected entities
python anonymise.py --show-entities "Call me at 020-7946-0958, my IBAN is GB29NWBK60161331926819."
```

### Use in Python

```python
from anonymise import load_model, detect_entities, anonymise

model = load_model(".")  # path to this repo

text = (
    "Dear John Smith, your appointment is on 2025-03-15. "
    "Your date of birth (15/03/1990) has been verified. "
    "Please contact support at help@acme.com or call 020-7946-0958. "
    "Your account IBAN is GB29NWBK60161331926819. Regards, Acme Corp."
)

entities = detect_entities(model, text)
print(anonymise(text, entities))
```

Output:

```
Dear [PERSON_NAME], your appointment is on [DATE_TIME].
Your date of birth ([DATE_OF_BIRTH]) has been verified.
Please contact support at [EMAIL] or call [PHONE].
Your account IBAN is [BANK_ACCOUNT_DETAILS]. Regards, Acme Corp.
```

### Entity detection only

If you just need the raw entity offsets (e.g. for your own replacement logic):

```python
entities = detect_entities(model, text)
for e in entities:
    print(f'{e["type"]:25s} [{e["start"]}:{e["end"]}] score={e["score"]:.2f}  "{text[e["start"]:e["end"]]}"')
```

```
PERSON_NAME               [5:15] score=1.00  "John Smith"
DATE_TIME                 [40:50] score=1.00  "2025-03-15"
DATE_OF_BIRTH             [72:82] score=1.00  "15/03/1990"
EMAIL                     [129:142] score=1.00  "help@acme.com"
PHONE                     [151:164] score=1.00  "020-7946-0958"
BANK_ACCOUNT_DETAILS      [187:209] score=1.00  "GB29NWBK60161331926819"
```
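
As one example of custom replacement logic, the sketch below numbers each distinct value so that repeated mentions of the same entity map to the same placeholder. It assumes the entity dict shape shown above (`type`, `start`, `end`) and non-overlapping spans. `pseudonymise` is a hypothetical helper, not part of `anonymise.py`.

```python
def pseudonymise(text, entities):
    """Replace each span with a numbered placeholder such as [PERSON_NAME_1].

    Identical surface strings of the same type share one number.
    Assumption: spans do not overlap.
    """
    ordered = sorted(entities, key=lambda e: e["start"])
    ids = {}       # (type, surface string) -> placeholder number
    counters = {}  # type -> number of distinct values seen so far
    # Assign numbers in reading order so the first mention gets _1.
    for e in ordered:
        key = (e["type"], text[e["start"]:e["end"]])
        if key not in ids:
            counters[e["type"]] = counters.get(e["type"], 0) + 1
            ids[key] = counters[e["type"]]
    # Replace right-to-left so earlier offsets stay valid.
    out = text
    for e in reversed(ordered):
        key = (e["type"], text[e["start"]:e["end"]])
        out = out[:e["start"]] + f'[{e["type"]}_{ids[key]}]' + out[e["end"]:]
    return out


sample = "Ann emailed Bob. Bob replied to Ann."
sample_entities = [
    {"type": "PERSON_NAME", "start": 0, "end": 3},
    {"type": "PERSON_NAME", "start": 12, "end": 15},
    {"type": "PERSON_NAME", "start": 17, "end": 20},
    {"type": "PERSON_NAME", "start": 32, "end": 35},
]
result = pseudonymise(sample, sample_entities)
print(result)
```

Numbered placeholders keep references between mentions intact, which can help when the anonymised text is later used for debugging.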

### Detect a subset of entities

```python
entities = detect_entities(model, text, entities={
    "PERSON_NAME": "Person name",
    "EMAIL": "Email",
})
```

## How It Works

The inference pipeline in `anonymise.py`:

1. **Chunking** — Long texts are split into 3000-character chunks with 100-char overlap to stay within the model's context window.
2. **Batch prediction** — Chunks are fed through `GLiNER2.batch_extract_entities()` with `include_spans=True` to get character-level offsets.
3. **Date disambiguation** — Both `DATE_TIME` and `DATE_OF_BIRTH` are always detected together so the model can choose the best label per span.
4. **De-duplication** — Overlapping detections from chunk boundaries are merged, keeping the highest-confidence label for each position.
5. **Replacement** — Detected spans are replaced right-to-left with `[ENTITY_TYPE]` placeholders.
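
Steps 1, 4, and 5 can be sketched in plain Python as follows. This is a minimal illustration, not the code in `anonymise.py`: the helper names are hypothetical, the chunk size and overlap mirror the values above, and overlapping detections are resolved greedily by score, which is one plausible reading of step 4.

```python
def chunk_text(text, size=3000, overlap=100):
    """Yield (offset, chunk) pairs; consecutive chunks share `overlap` chars.

    Add `offset` to chunk-local entity positions to get document positions.
    """
    step = size - overlap
    for offset in range(0, max(len(text) - overlap, 1), step):
        yield offset, text[offset:offset + size]


def dedupe(entities):
    """Keep the highest-scoring entity wherever spans overlap."""
    kept = []
    for e in sorted(entities, key=lambda e: -e["score"]):
        # Accept e only if it is disjoint from every span already kept.
        if all(e["end"] <= k["start"] or e["start"] >= k["end"] for k in kept):
            kept.append(e)
    return sorted(kept, key=lambda e: e["start"])


def replace_spans(text, entities):
    """Replace spans right-to-left so earlier offsets remain valid."""
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:e["start"]] + f'[{e["type"]}]' + text[e["end"]:]
    return text


# Chunk boundaries for a 6000-character document.
bounds = [(offset, len(chunk)) for offset, chunk in chunk_text("x" * 6000)]
print(bounds)

# Made-up detections: two competing labels on the same date span.
text = "Born 15/03/1990 in Leeds."
detections = [
    {"type": "DATE_TIME", "start": 5, "end": 15, "score": 0.61},
    {"type": "DATE_OF_BIRTH", "start": 5, "end": 15, "score": 0.97},
    {"type": "LOCATION", "start": 19, "end": 24, "score": 0.99},
]
redacted = replace_spans(text, dedupe(detections))
print(redacted)
```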

## Notes

- **Confidence threshold:** Default is `0.25`. The model tends to be conservative, so a lower threshold works well for high recall.
- **GLiNER2 version:** Requires `gliner2>=1.2.4`. Earlier versions had a bug where entity character offsets mapped to token positions instead of character positions; this is fixed in 1.2.4+.
- **Device:** Automatically selects the first available device, in the order CUDA, MPS, CPU.
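
The device preference order can be sketched as below. `pick_device` is a hypothetical helper and not necessarily how `anonymise.py` does it; it falls back to CPU when PyTorch is not installed so the snippet runs anywhere.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu", preferring them in that order."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch only: no PyTorch means CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # The MPS backend only exists in PyTorch builds with Apple Silicon support.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```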

## Acknowledgements

This model is a fine-tuned version of [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) by [Fastino AI](https://fastino.ai). We thank the GLiNER2 authors for making their model and library openly available.

## Citation

If you use NERPA, please cite both this model and the original GLiNER2 paper:

```bibtex
@misc{nerpa2025,
  title={NERPA: Fine-Tuned GLiNER2 for PII Anonymisation},
  author={Akhat Rakishev},
  year={2025},
  url={https://huggingface.co/OvermindLab/nerpa},
}

@misc{zaratiana2025gliner2efficientmultitaskinformation,
  title={GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface},
  author={Urchade Zaratiana and Gil Pasternak and Oliver Boyd and George Hurn-Maloney and Ash Lewis},
  year={2025},
  eprint={2507.18546},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.18546},
}
```

Built by [Akhat Rakishev](https://github.com/workhat) at [Overmind](https://overmindai.com).

Overmind is infrastructure to make agents more reliable. Learn more at [overmindai.com](https://overmindai.com).