NERPA: Fine-Tuned GLiNER2 for PII Anonymisation
A fine-tuned GLiNER2 Large (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at Overmind.
Why NERPA?
AWS Comprehend is a solid NER service, but it's a black box. The specific problem we hit was date granularity: Comprehend labels both a Date of Birth and an Appointment Date as DATE, but for PII anonymisation these require very different treatment. A DOB must be redacted; an appointment date is often essential debugging context.
GLiNER2 is a bi-encoder model that takes both text and entity label descriptions as input, enabling zero-shot entity detection for arbitrary types. We fine-tuned GLiNER2 Large to:
- Distinguish fine-grained date types (DATE_OF_BIRTH vs DATE_TIME)
- Exceed AWS Comprehend's precision on our PII benchmark (0.93 vs 0.90 micro-precision, with micro-recall of 0.90 vs 0.94)
| Model | Micro-Precision | Micro-Recall |
|---|---|---|
| AWS Comprehend | 0.90 | 0.94 |
| GLiNER2 Large (off-the-shelf) | 0.84 | 0.89 |
| NERPA (this model) | 0.93 | 0.90 |
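For reference, micro-averaging pools true positives, false positives and false negatives across all entity types before dividing. The sketch below assumes exact-match scoring on (type, start, end) triples; the matching rule behind the table above is not stated here, and the function name is illustrative.

```python
def micro_precision_recall(gold_docs, pred_docs):
    """Micro-averaged precision and recall over a corpus.

    Each element of gold_docs / pred_docs is a set of (type, start, end)
    tuples for one document. Exact span-and-type matching is an assumption.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        tp += len(gold & pred)  # predicted and correct
        fp += len(pred - gold)  # predicted but not in the gold set
        fn += len(gold - pred)  # in the gold set but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: the second prediction has the right span but the wrong date label.
gold = [{("PERSON_NAME", 5, 15), ("DATE_OF_BIRTH", 72, 82)}, {("EMAIL", 0, 13)}]
pred = [{("PERSON_NAME", 5, 15), ("DATE_TIME", 72, 82)}, {("EMAIL", 0, 13)}]
p, r = micro_precision_recall(gold, pred)
```

Under this rule a mislabelled date counts as both a false positive and a false negative, which is why fine-grained date types affect both columns.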
Fine-Tuning Details
- Base model: fastino/gliner2-large-v1 (DeBERTa v3 Large backbone, 340M params)
- Training data: 1,210 synthetic snippets generated with Gemini 3 Pro + Python Faker, each containing 2 to 4 PII entities
- Eval data: 300 held-out snippets (no template overlap with training)
- Strategy: Full weight fine-tuning with differential learning rates:
  - Encoder (DeBERTa v3): 1e-7
  - GLiNER-specific layers: 1e-6
- Batch size: 64
- Convergence: 175 steps
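Differential learning rates are usually expressed as optimiser parameter groups. The following is a minimal sketch, not our training script: the `encoder.` name prefix and the helper name are hypothetical, so check `model.named_parameters()` for the real names in your checkpoint.

```python
def build_param_groups(named_params, encoder_prefix="encoder.",
                       encoder_lr=1e-7, head_lr=1e-6):
    """Split (name, param) pairs into two optimiser groups by name prefix.

    encoder_prefix is an assumed naming convention, not taken from GLiNER2.
    """
    encoder, head = [], []
    for name, param in named_params:
        (encoder if name.startswith(encoder_prefix) else head).append(param)
    return [
        {"params": encoder, "lr": encoder_lr},  # DeBERTa v3 backbone
        {"params": head, "lr": head_lr},        # GLiNER-specific layers
    ]


# With PyTorch, the result can be passed to an optimiser that accepts
# per-group options, for example (optimiser choice is an assumption):
#   optimizer = torch.optim.AdamW(build_param_groups(model.named_parameters()))
demo = [("encoder.layer.0.weight", "w0"), ("span_head.weight", "w1")]
groups = build_param_groups(demo)
```

Keeping the backbone rate an order of magnitude below the task-specific layers lets the new labels be learned while the pretrained encoder changes only slightly.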
The synthetic data approach effectively distils the "knowledge" of a large LLM into a small, fast specialist model, which we call indirect distillation.
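One way to picture this kind of data generation: an LLM writes templates with typed slots, and a value generator fills them while recording character spans as labels. The sketch below is an illustration under that assumption, not the pipeline we used; the template, the value pools (standing in for Faker) and the function name are all made up for the example.

```python
import random
import re

# Hypothetical template of the kind an LLM might write; slots name entity types.
TEMPLATE = ("Hi {PERSON_NAME}, we have your date of birth as {DATE_OF_BIRTH}. "
            "Reply to {EMAIL}.")

# Stand-in value pools; a generator such as Faker would supply these in practice.
VALUES = {
    "PERSON_NAME": ["John Smith", "Aisha Khan"],
    "DATE_OF_BIRTH": ["15/03/1990", "02/11/1984"],
    "EMAIL": ["help@acme.com", "info@example.org"],
}

SLOT = re.compile(r"\{([A-Z_]+)\}")


def fill_template(template, values, rng):
    """Fill each {TYPE} slot and record the character span of every value."""
    parts, spans, cursor, out_len = [], [], 0, 0
    for m in SLOT.finditer(template):
        literal = template[cursor:m.start()]  # text between slots
        value = rng.choice(values[m.group(1)])
        parts.append(literal)
        out_len += len(literal)
        spans.append({"type": m.group(1), "start": out_len,
                      "end": out_len + len(value)})
        parts.append(value)
        out_len += len(value)
        cursor = m.end()
    parts.append(template[cursor:])  # trailing text after the last slot
    return "".join(parts), spans


text, spans = fill_template(TEMPLATE, VALUES, random.Random(0))
```

Because the spans are computed while the text is assembled, the labels are exact by construction and need no manual annotation.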
Supported Entity Types
| Entity | Description |
|---|---|
| PERSON_NAME | Person name |
| DATE_OF_BIRTH | Date of birth |
| DATE_TIME | Generic date and time |
| EMAIL | Email address |
| PHONE | Phone numbers |
| LOCATION | Address, city, country, postcode, street |
| AGE | Age of a person |
| BUSINESS_NAME | Business name |
| USERNAME | Username |
| URL | Any URL |
| BANK_ACCOUNT_DETAILS | IBAN, SWIFT, routing numbers, etc. |
| CARD_DETAILS | Card number, CVV, expiration |
| DIGITAL_KEYS | Passwords, PINs, API keys |
| PERSONAL_ID_NUMBERS | Passport, driving licence, tax IDs |
| TECHNICAL_ID_NUMBERS | IP/MAC addresses, serial numbers |
| VEHICLE_ID_NUMBERS | License plates, VINs |
Quick Start
Install dependencies
pip install "gliner2>=1.2.4" torch
Anonymise text (CLI)
# Inline text
python anonymise.py "Dear John Smith, born 15/03/1990. Contact: john@acme.com"
# From file
python anonymise.py --file input.txt --output anonymised.txt
# Show detected entities
python anonymise.py --show-entities "Call me at 020-7946-0958, my IBAN is GB29NWBK60161331926819."
Use in Python
from anonymise import load_model, detect_entities, anonymise
model = load_model(".") # path to this repo
text = (
    "Dear John Smith, your appointment is on 2025-03-15. "
    "Your date of birth (15/03/1990) has been verified. "
    "Please contact support at help@acme.com or call 020-7946-0958. "
    "Your account IBAN is GB29NWBK60161331926819. Regards, Acme Corp."
)
entities = detect_entities(model, text)
print(anonymise(text, entities))
Output:
Dear [PERSON_NAME], your appointment is on [DATE_TIME].
Your date of birth ([DATE_OF_BIRTH]) has been verified.
Please contact support at [EMAIL] or call [PHONE].
Your account IBAN is [BANK_ACCOUNT_DETAILS]. Regards, Acme Corp.
Entity detection only
If you just need the raw entity offsets (e.g. for your own replacement logic):
entities = detect_entities(model, text)
for e in entities:
    print(f'{e["type"]:25s} [{e["start"]}:{e["end"]}] score={e["score"]:.2f} "{text[e["start"]:e["end"]]}"')
Output:
PERSON_NAME               [5:15] score=1.00 "John Smith"
DATE_TIME                 [40:50] score=1.00 "2025-03-15"
DATE_OF_BIRTH             [72:82] score=1.00 "15/03/1990"
EMAIL                     [129:142] score=1.00 "help@acme.com"
PHONE                     [151:164] score=1.00 "020-7946-0958"
BANK_ACCOUNT_DETAILS      [187:209] score=1.00 "GB29NWBK60161331926819"
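As an example of custom replacement logic built on these offsets, the sketch below swaps each span for a numbered placeholder so that repeated mentions of the same value stay linked. It relies only on the `type`, `start` and `end` keys shown above; the function name `pseudonymise` is ours for this example and is not part of anonymise.py, and non-overlapping spans are assumed.

```python
def pseudonymise(text, entities):
    """Replace each span with a numbered placeholder such as [PERSON_NAME_1].

    The same surface string always maps to the same placeholder.
    Assumes the spans do not overlap.
    """
    original = text
    ids = {}       # (type, surface string) -> placeholder
    counters = {}  # type -> number of distinct values seen so far
    # Assign numbers in reading order.
    for e in sorted(entities, key=lambda e: e["start"]):
        key = (e["type"], original[e["start"]:e["end"]])
        if key not in ids:
            counters[e["type"]] = counters.get(e["type"], 0) + 1
            ids[key] = f'[{e["type"]}_{counters[e["type"]]}]'
    # Substitute right-to-left so earlier offsets stay valid.
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        key = (e["type"], original[e["start"]:e["end"]])
        text = text[:e["start"]] + ids[key] + text[e["end"]:]
    return text


sample = "John Smith emailed John Smith's manager, Jane Doe."
sample_entities = [
    {"type": "PERSON_NAME", "start": 0, "end": 10},
    {"type": "PERSON_NAME", "start": 19, "end": 29},
    {"type": "PERSON_NAME", "start": 41, "end": 49},
]
masked = pseudonymise(sample, sample_entities)
```

Numbered placeholders keep a log readable (you can still tell two people apart) without exposing either name.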
Detect a subset of entities
entities = detect_entities(model, text, entities={
    "PERSON_NAME": "Person name",
    "EMAIL": "Email",
})
How It Works
The inference pipeline in anonymise.py:
- Chunking: Long texts are split into 3000-character chunks with 100-char overlap to stay within the model's context window.
- Batch prediction: Chunks are fed through GLiNER2.batch_extract_entities() with include_spans=True to get character-level offsets.
- Date disambiguation: Both DATE_TIME and DATE_OF_BIRTH are always detected together so the model can choose the best label per span.
- De-duplication: Overlapping detections from chunk boundaries are merged, keeping the highest-confidence label for each position.
- Replacement: Detected spans are replaced right-to-left with [ENTITY_TYPE] placeholders.
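The chunking, de-duplication and replacement steps can be sketched in plain Python as follows. This is an illustration of the idea rather than the code in anonymise.py: the helper names are hypothetical, a regex stands in for the model, and breaking score ties in favour of the longer span is our assumption for handling entities cut off at a chunk boundary.

```python
import re


def chunk_text(text, size=3000, overlap=100):
    """Yield (offset, chunk) pairs; consecutive chunks share `overlap` chars.

    Requires size > overlap.
    """
    step = size - overlap
    offset = 0
    while True:
        yield offset, text[offset:offset + size]
        if offset + size >= len(text):
            break
        offset += step


def merge_overlaps(entities):
    """Greedily keep the best entity among overlapping spans.

    Higher score wins; on a tie the longer span wins (an assumption).
    """
    ranked = sorted(entities,
                    key=lambda e: (-e["score"], -(e["end"] - e["start"])))
    kept = []
    for e in ranked:
        if all(e["end"] <= k["start"] or e["start"] >= k["end"] for k in kept):
            kept.append(e)
    return sorted(kept, key=lambda e: e["start"])


def replace_spans(text, entities):
    """Substitute placeholders right-to-left so earlier offsets stay valid."""
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:e["start"]] + f'[{e["type"]}]' + text[e["end"]:]
    return text


def anonymise_long(text, detect, size=3000, overlap=100):
    """detect(chunk) returns entity dicts with chunk-relative offsets."""
    found = []
    for offset, chunk in chunk_text(text, size, overlap):
        for e in detect(chunk):
            # Shift chunk-relative offsets back into document coordinates.
            found.append({**e, "start": e["start"] + offset,
                          "end": e["end"] + offset})
    return replace_spans(text, merge_overlaps(found))


def toy_detect(chunk):
    # Stand-in for the model: finds e-mail-like strings only.
    return [{"type": "EMAIL", "start": m.start(), "end": m.end(), "score": 1.0}
            for m in re.finditer(r"\S+@\S+", chunk)]


# Tiny chunk size so the address straddles a chunk boundary.
doc = "x" * 20 + " help@acme.com " + "y" * 20
result = anonymise_long(doc, toy_detect, size=30, overlap=15)
```

In the toy run the first chunk sees only a truncated "help@acme", while the overlapping second chunk sees the full address; the merge step keeps the full span, which is the reason for overlapping chunks in the first place.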
Notes
- Confidence threshold: Default is 0.25. The model tends to be conservative, so a lower threshold works well for high recall.
- GLiNER2 version: Requires gliner2>=1.2.4. Earlier versions had a bug where entity character offsets mapped to token positions instead of character positions; this is fixed in 1.2.4+.
- Device: Automatically uses CUDA > MPS > CPU.
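If you want to guard against the offset bug at start-up, or apply a stricter threshold after detection, something like the following could work. It is a sketch using only the standard library: the helper names are hypothetical, the version parser handles simple dotted versions only, and treating the threshold as inclusive is an assumption.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def version_tuple(v):
    """Turn '1.2.4' into (1, 2, 4), keeping only each part's leading digits."""
    out = []
    for piece in v.split(".")[:3]:
        m = re.match(r"\d+", piece)
        out.append(int(m.group()) if m else 0)
    return tuple(out)


def gliner2_is_recent_enough(minimum="1.2.4"):
    """True if the installed gliner2 distribution is at least `minimum`."""
    try:
        installed = version("gliner2")
    except PackageNotFoundError:
        return False
    return version_tuple(installed) >= version_tuple(minimum)


def filter_by_score(entities, threshold=0.25):
    """Keep entities whose score meets the threshold (inclusive, by assumption)."""
    return [e for e in entities if e["score"] >= threshold]


# Example start-up check:
#   if not gliner2_is_recent_enough():
#       raise RuntimeError("gliner2>=1.2.4 is required for correct offsets")
strict = filter_by_score(
    [{"type": "EMAIL", "score": 0.9}, {"type": "AGE", "score": 0.3}],
    threshold=0.5,
)
```

Raising the threshold trades recall for precision, so for anonymisation the lower default is usually the safer choice.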
Citation
Built by Akhat Rakishev at Overmind.
Base model: GLiNER2 by Fastino AI.