# NERPA: Fine-Tuned GLiNER2 for PII Anonymisation

A fine-tuned GLiNER2 Large (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at Overmind.

## Why NERPA?

AWS Comprehend is a solid NER service, but it's a black box. The specific problem we hit was date granularity: Comprehend labels both a Date of Birth and an Appointment Date as `DATE`, but for PII anonymisation these require very different treatment. A DOB must be redacted; an appointment date is often essential debugging context.

GLiNER2 is a bi-encoder model that takes both text and entity label descriptions as input, enabling zero-shot entity detection for arbitrary types. We fine-tuned GLiNER2 Large to:

  1. Distinguish fine-grained date types (DATE_OF_BIRTH vs DATE_TIME)
  2. Exceed AWS Comprehend accuracy on our PII benchmark

| Model | Micro-Precision | Micro-Recall |
|---|---|---|
| AWS Comprehend | 0.90 | 0.94 |
| GLiNER2 Large (off-the-shelf) | 0.84 | 0.89 |
| NERPA (this model) | 0.93 | 0.90 |
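The table reports micro-averaged scores, which pool true positives, false positives, and false negatives across all entity types before computing the ratios, so frequent types dominate the score. A minimal sketch of the metric itself (the example counts are made up, not our benchmark data):

```python
def micro_precision_recall(counts):
    """counts: iterable of (tp, fp, fn) tuples, one per entity type.

    Micro-averaging sums counts across all types first, then divides,
    unlike macro-averaging, which averages per-type scores.
    """
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Two entity types with (tp, fp, fn) counts:
p, r = micro_precision_recall([(90, 5, 10), (45, 10, 5)])
# p = 135/150 = 0.90, r = 135/150 = 0.90
```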

## Fine-Tuning Details

- **Base model**: `fastino/gliner2-large-v1` (DeBERTa v3 Large backbone, 340M params)
- **Training data**: 1,210 synthetic snippets generated with Gemini 3 Pro + Python Faker, each containing 2–4 PII entities
- **Eval data**: 300 held-out snippets (no template overlap with training)
- **Strategy**: full-weight fine-tuning with differential learning rates:
  - Encoder (DeBERTa v3): 1e-7
  - GLiNER-specific layers: 1e-6
- **Batch size**: 64
- **Convergence**: 175 steps
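Differential learning rates like these are typically implemented with optimizer parameter groups. A minimal sketch (the `encoder.` name prefix is an assumption for illustration; the actual GLiNER2 module names may differ), where the returned groups would be handed to e.g. `torch.optim.AdamW`:

```python
def split_param_groups(named_params, encoder_lr=1e-7, head_lr=1e-6):
    """Split (name, param) pairs into two groups with different learning rates.

    Parameters whose name starts with "encoder." (assumed here to be the
    DeBERTa v3 backbone) get the lower LR; everything else (the
    GLiNER-specific layers) gets the higher one.
    """
    encoder, head = [], []
    for name, param in named_params:
        (encoder if name.startswith("encoder.") else head).append(param)
    return [
        {"params": encoder, "lr": encoder_lr},
        {"params": head, "lr": head_lr},
    ]

# Usage sketch: torch.optim.AdamW(split_param_groups(model.named_parameters()))
```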

The synthetic data approach effectively distils the "knowledge" of a large LLM into a small, fast specialist model, an approach we call *indirect distillation*.
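To illustrate the shape of that pipeline: snippet templates carry labelled PII slots, and each slot is filled with a fake value. This stdlib-only sketch is illustrative; the real pipeline generates templates with Gemini and fills slots with Python Faker, and all names and values below are invented:

```python
import random

# Hypothetical templates with PII slots (the real ones come from Gemini).
TEMPLATES = [
    "Dear {name}, your date of birth ({dob}) has been verified.",
    "Contact {name} at {email} before {date}.",
]

# Hypothetical fake values (the real ones come from Python Faker).
FAKES = {
    "name": ["John Smith", "Jane Doe"],
    "dob": ["15/03/1990", "02/11/1985"],
    "email": ["john@acme.com", "jane@acme.com"],
    "date": ["2025-03-15", "2025-04-01"],
}

def make_snippet(rng=random):
    """Pick a template and fill every slot it mentions with a random fake."""
    template = rng.choice(TEMPLATES)
    return template.format(**{k: rng.choice(v) for k, v in FAKES.items()})
```

Because the generator knows exactly which span it inserted for each slot, every snippet comes with gold entity labels for free.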

## Supported Entity Types

| Entity | Description |
|---|---|
| `PERSON_NAME` | Person name |
| `DATE_OF_BIRTH` | Date of birth |
| `DATE_TIME` | Generic date and time |
| `EMAIL` | Email address |
| `PHONE` | Phone numbers |
| `LOCATION` | Address, city, country, postcode, street |
| `AGE` | Age of a person |
| `BUSINESS_NAME` | Business name |
| `USERNAME` | Username |
| `URL` | Any URL |
| `BANK_ACCOUNT_DETAILS` | IBAN, SWIFT, routing numbers, etc. |
| `CARD_DETAILS` | Card number, CVV, expiration |
| `DIGITAL_KEYS` | Passwords, PINs, API keys |
| `PERSONAL_ID_NUMBERS` | Passport, driving licence, tax IDs |
| `TECHNICAL_ID_NUMBERS` | IP/MAC addresses, serial numbers |
| `VEHICLE_ID_NUMBERS` | License plates, VINs |
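In code, this schema is just a label-to-description mapping, which is the shape `detect_entities` accepts for its `entities` argument (see the Quick Start below). Keeping the full set in one dict makes restricting to a subset trivial:

```python
# The full NERPA label schema, mirroring the table above.
PII_LABELS = {
    "PERSON_NAME": "Person name",
    "DATE_OF_BIRTH": "Date of birth",
    "DATE_TIME": "Generic date and time",
    "EMAIL": "Email address",
    "PHONE": "Phone numbers",
    "LOCATION": "Address, city, country, postcode, street",
    "AGE": "Age of a person",
    "BUSINESS_NAME": "Business name",
    "USERNAME": "Username",
    "URL": "Any URL",
    "BANK_ACCOUNT_DETAILS": "IBAN, SWIFT, routing numbers, etc.",
    "CARD_DETAILS": "Card number, CVV, expiration",
    "DIGITAL_KEYS": "Passwords, PINs, API keys",
    "PERSONAL_ID_NUMBERS": "Passport, driving licence, tax IDs",
    "TECHNICAL_ID_NUMBERS": "IP/MAC addresses, serial numbers",
    "VEHICLE_ID_NUMBERS": "License plates, VINs",
}

# e.g. restrict detection to contact details only
CONTACT_ONLY = {k: PII_LABELS[k] for k in ("PERSON_NAME", "EMAIL", "PHONE")}
```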

## Quick Start

### Install dependencies

```bash
pip install gliner2 torch
```

### Anonymise text (CLI)

```bash
# Inline text
python anonymise.py "Dear John Smith, born 15/03/1990. Contact: john@acme.com"

# From file
python anonymise.py --file input.txt --output anonymised.txt

# Show detected entities
python anonymise.py --show-entities "Call me at 020-7946-0958, my IBAN is GB29NWBK60161331926819."
```

### Use in Python

```python
from anonymise import load_model, detect_entities, anonymise

model = load_model(".")  # path to this repo

text = (
    "Dear John Smith, your appointment is on 2025-03-15. "
    "Your date of birth (15/03/1990) has been verified. "
    "Please contact support at help@acme.com or call 020-7946-0958. "
    "Your account IBAN is GB29NWBK60161331926819. Regards, Acme Corp."
)

entities = detect_entities(model, text)
print(anonymise(text, entities))
```

Output:

```text
Dear [PERSON_NAME], your appointment is on [DATE_TIME].
Your date of birth ([DATE_OF_BIRTH]) has been verified.
Please contact support at [EMAIL] or call [PHONE].
Your account IBAN is [BANK_ACCOUNT_DETAILS]. Regards, Acme Corp.
```

### Entity detection only

If you just need the raw entity offsets (e.g. for your own replacement logic):

```python
entities = detect_entities(model, text)
for e in entities:
    print(f'{e["type"]:25s} [{e["start"]}:{e["end"]}] score={e["score"]:.2f}  "{text[e["start"]:e["end"]]}"')
```

```text
PERSON_NAME               [5:15]  score=1.00  "John Smith"
DATE_TIME                 [40:50] score=1.00  "2025-03-15"
DATE_OF_BIRTH             [72:82] score=1.00  "15/03/1990"
EMAIL                     [129:142] score=1.00  "help@acme.com"
PHONE                     [151:164] score=1.00  "020-7946-0958"
BANK_ACCOUNT_DETAILS      [187:209] score=1.00  "GB29NWBK60161331926819"
```

### Detect a subset of entities

```python
entities = detect_entities(model, text, entities={
    "PERSON_NAME": "Person name",
    "EMAIL": "Email",
})
```

## How It Works

The inference pipeline in `anonymise.py`:

1. **Chunking**: long texts are split into 3000-character chunks with 100-char overlap to stay within the model's context window.
2. **Batch prediction**: chunks are fed through `GLiNER2.batch_extract_entities()` with `include_spans=True` to get character-level offsets.
3. **Date disambiguation**: both `DATE_TIME` and `DATE_OF_BIRTH` are always detected together so the model can choose the best label per span.
4. **De-duplication**: overlapping detections from chunk boundaries are merged, keeping the highest-confidence label for each position.
5. **Replacement**: detected spans are replaced right-to-left with `[ENTITY_TYPE]` placeholders.
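The chunking, de-duplication, and replacement steps can be sketched as plain-Python helpers. These are illustrative reimplementations of the ideas above, not the actual `anonymise.py` code:

```python
def chunk_text(text, size=3000, overlap=100):
    """Step 1: fixed-size chunks with overlap; yields (offset, chunk) pairs.

    The offset lets per-chunk entity spans be mapped back to positions
    in the full text.
    """
    step = size - overlap
    start = 0
    while start < len(text):
        yield start, text[start:start + size]
        if start + size >= len(text):
            break
        start += step

def dedupe(entities):
    """Step 4: keep the highest-scoring entity for each overlapping region."""
    kept = []
    for e in sorted(entities, key=lambda e: -e["score"]):
        # Keep e only if it does not overlap anything already kept.
        if all(e["end"] <= k["start"] or e["start"] >= k["end"] for k in kept):
            kept.append(e)
    return sorted(kept, key=lambda e: e["start"])

def replace_spans(text, entities):
    """Step 5: replace spans right-to-left so earlier offsets stay valid."""
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:e["start"]] + f"[{e['type']}]" + text[e["end"]:]
    return text
```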

## Notes

- **Confidence threshold**: the default is 0.25. The model tends to be conservative, so a lower threshold works well for high recall.
- **GLiNER2 version**: requires `gliner2>=1.2.4`. Earlier versions had a bug where entity character offsets mapped to token positions instead of character positions; this is fixed in 1.2.4+.
- **Device**: automatically uses CUDA > MPS > CPU.
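If you handle raw entities yourself, the threshold is just a post-filter on the `score` field shown in the entity-detection example above. A trivial sketch:

```python
def filter_by_score(entities, threshold=0.25):
    """Drop detections below the confidence threshold.

    Lower thresholds favour recall (fewer missed PII spans) at the cost
    of more false positives; 0.25 is this pipeline's default.
    """
    return [e for e in entities if e["score"] >= threshold]
```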

## Citation

Built by Akhat Rakishev at Overmind.

Base model: GLiNER2 by Fastino AI.