---
language:
- en
license: apache-2.0
library_name: gliner2
tags:
- named-entity-recognition
- ner
- pii
- anonymisation
- gliner
- gliner2
- token-classification
- privacy
datasets:
- synthetic
base_model: fastino/gliner2-large-v1
model-index:
- name: NERPA
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: precision
      value: 0.93
      name: Micro-Precision
    - type: recall
      value: 0.90
      name: Micro-Recall
pipeline_tag: token-classification
---
# NERPA — Fine-Tuned GLiNER2 for PII Anonymisation
A fine-tuned [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at [Overmind](https://overmindai.com).
## Why NERPA?
AWS Comprehend is a solid NER service, but it's a black box. The specific problem we hit was **date granularity** — Comprehend labels both a Date of Birth and an Appointment Date as `DATE`, but for PII anonymisation these require very different treatment. A DOB must be redacted; an appointment date is often essential debugging context.
GLiNER2 is a bi-encoder model that takes both text and entity label descriptions as input, enabling zero-shot entity detection for arbitrary types. We fine-tuned GLiNER2 Large to:
1. **Distinguish fine-grained date types** (DATE_OF_BIRTH vs DATE_TIME)
2. **Exceed AWS Comprehend accuracy** on our PII benchmark
| Model | Micro-Precision | Micro-Recall |
| --- | --- | --- |
| AWS Comprehend | 0.90 | 0.94 |
| GLiNER2 Large (off-the-shelf) | 0.84 | 0.89 |
| **NERPA (this model)** | **0.93** | **0.90** |
## Fine-Tuning Details
- **Base model:** [fastino/gliner2-large-v1](https://huggingface.co/fastino/gliner2-large-v1) (DeBERTa v3 Large backbone, 340M params)
- **Training data:** 1,210 synthetic snippets generated with Gemini 3 Pro + Python Faker, each containing 2–4 PII entities
- **Eval data:** 300 held-out snippets (no template overlap with training)
- **Strategy:** Full weight fine-tuning with differential learning rates:
- Encoder (DeBERTa v3): `1e-7`
- GLiNER-specific layers: `1e-6`
- **Batch size:** 64
- **Convergence:** 175 steps
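The differential learning-rate setup above can be sketched with PyTorch parameter groups. This is an illustrative reconstruction, not the actual training script; the `"encoder."` prefix used to identify backbone parameters is an assumption about the module layout, and the real GLiNER2 model may name its submodules differently.

```python
# Sketch of differential learning rates via optimizer parameter groups.
# The "encoder." name prefix is a hypothetical way to identify the
# DeBERTa v3 backbone; adjust to the actual module names.
import torch


def build_optimizer(model, encoder_lr=1e-7, head_lr=1e-6):
    encoder_params, head_params = [], []
    for name, param in model.named_parameters():
        if name.startswith("encoder."):
            encoder_params.append(param)  # backbone: very small LR
        else:
            head_params.append(param)     # GLiNER-specific layers: larger LR
    return torch.optim.AdamW([
        {"params": encoder_params, "lr": encoder_lr},
        {"params": head_params, "lr": head_lr},
    ])
```

Keeping the backbone learning rate an order of magnitude below the head's preserves the pretrained representations while the task-specific layers adapt.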
The synthetic data approach effectively distils the "knowledge" of a large LLM into a small, fast specialist model — what we call **indirect distillation**.
## Supported Entity Types
| Entity | Description |
| --- | --- |
| `PERSON_NAME` | Person name |
| `DATE_OF_BIRTH` | Date of birth |
| `DATE_TIME` | Generic date and time |
| `EMAIL` | Email address |
| `PHONE` | Phone numbers |
| `LOCATION` | Address, city, country, postcode, street |
| `AGE` | Age of a person |
| `BUSINESS_NAME` | Business name |
| `USERNAME` | Username |
| `URL` | Any URL |
| `BANK_ACCOUNT_DETAILS` | IBAN, SWIFT, routing numbers, etc. |
| `CARD_DETAILS` | Card number, CVV, expiration |
| `DIGITAL_KEYS` | Passwords, PINs, API keys |
| `PERSONAL_ID_NUMBERS` | Passport, driving licence, tax IDs |
| `TECHNICAL_ID_NUMBERS` | IP/MAC addresses, serial numbers |
| `VEHICLE_ID_NUMBERS` | License plates, VINs |
## Quick Start
### Install dependencies
```bash
pip install gliner2 torch
```
### Anonymise text (CLI)
```bash
# Inline text
python anonymise.py "Dear John Smith, born 15/03/1990. Contact: john@acme.com"
# From file
python anonymise.py --file input.txt --output anonymised.txt
# Show detected entities
python anonymise.py --show-entities "Call me at 020-7946-0958, my IBAN is GB29NWBK60161331926819."
```
### Use in Python
```python
from anonymise import load_model, detect_entities, anonymise
model = load_model(".") # path to this repo
text = (
    "Dear John Smith, your appointment is on 2025-03-15. "
    "Your date of birth (15/03/1990) has been verified. "
    "Please contact support at help@acme.com or call 020-7946-0958. "
    "Your account IBAN is GB29NWBK60161331926819. Regards, Acme Corp."
)
entities = detect_entities(model, text)
print(anonymise(text, entities))
```
Output:
```
Dear [PERSON_NAME], your appointment is on [DATE_TIME].
Your date of birth ([DATE_OF_BIRTH]) has been verified.
Please contact support at [EMAIL] or call [PHONE].
Your account IBAN is [BANK_ACCOUNT_DETAILS]. Regards, Acme Corp.
```
### Entity detection only
If you just need the raw entity offsets (e.g. for your own replacement logic):
```python
entities = detect_entities(model, text)
for e in entities:
    print(f'{e["type"]:25s} [{e["start"]}:{e["end"]}] score={e["score"]:.2f} "{text[e["start"]:e["end"]]}"')
```
```
PERSON_NAME [5:15] score=1.00 "John Smith"
DATE_TIME [40:50] score=1.00 "2025-03-15"
DATE_OF_BIRTH [72:82] score=1.00 "15/03/1990"
EMAIL [129:142] score=1.00 "help@acme.com"
PHONE [151:164] score=1.00 "020-7946-0958"
BANK_ACCOUNT_DETAILS [187:209] score=1.00 "GB29NWBK60161331926819"
```
### Detect a subset of entities
```python
entities = detect_entities(model, text, entities={
    "PERSON_NAME": "Person name",
    "EMAIL": "Email",
})
```
## How It Works
The inference pipeline in `anonymise.py`:
1. **Chunking** — Long texts are split into 3000-character chunks with 100-char overlap to stay within the model's context window.
2. **Batch prediction** — Chunks are fed through `GLiNER2.batch_extract_entities()` with `include_spans=True` to get character-level offsets.
3. **Date disambiguation** — Both `DATE_TIME` and `DATE_OF_BIRTH` are always detected together so the model can choose the best label per span.
4. **De-duplication** — Overlapping detections from chunk boundaries are merged, keeping the highest-confidence label for each position.
5. **Replacement** — Detected spans are replaced right-to-left with `[ENTITY_TYPE]` placeholders.
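Steps 1, 4, and 5 above can be sketched in a few lines. These are simplified illustrative functions, not the actual `anonymise.py` implementation; the entity dicts are assumed to carry `start`, `end`, `type`, and `score` keys as shown in the detection example earlier.

```python
def chunk_text(text, size=3000, overlap=100):
    """Step 1: split text into overlapping chunks, yielding (offset, chunk)
    pairs so entity offsets can be mapped back to the full text."""
    step = size - overlap
    return [(i, text[i:i + size]) for i in range(0, max(len(text), 1), step)]


def dedup(entities):
    """Step 4: keep only the highest-scoring entity for each overlapping
    span (overlaps arise at chunk boundaries)."""
    kept = []
    for e in sorted(entities, key=lambda e: -e["score"]):
        if all(e["end"] <= k["start"] or e["start"] >= k["end"] for k in kept):
            kept.append(e)
    return sorted(kept, key=lambda e: e["start"])


def replace(text, entities):
    """Step 5: replace spans right-to-left so earlier character offsets
    remain valid as the text shrinks or grows."""
    for e in sorted(entities, key=lambda e: -e["start"]):
        text = text[:e["start"]] + f'[{e["type"]}]' + text[e["end"]:]
    return text
```

The right-to-left order in `replace` is the key trick: substituting from the end of the string means earlier offsets never shift.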
## Notes
- **Confidence threshold:** Default is `0.25`. The model tends to be conservative, so a lower threshold works well for high recall.
- **GLiNER2 version:** Requires `gliner2>=1.2.4`. Earlier versions had a bug where entity character offsets mapped to token positions instead of character positions; this is fixed in 1.2.4+.
- **Device:** Automatically selects the best available device, preferring CUDA, then MPS, then CPU.
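The device fallback described above typically looks like the following sketch (an assumed implementation, not copied from `anonymise.py`):

```python
import torch


def pick_device():
    """Prefer CUDA, then Apple MPS, then CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```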
## Acknowledgements
This model is a fine-tuned version of [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) by [Fastino AI](https://fastino.ai). We thank the GLiNER2 authors for making their model and library openly available.
## Citation
If you use NERPA, please cite both this model and the original GLiNER2 paper:
```bibtex
@misc{nerpa2025,
  title={NERPA: Fine-Tuned GLiNER2 for PII Anonymisation},
  author={Akhat Rakishev},
  year={2025},
  url={https://huggingface.co/OvermindLab/nerpa},
}

@misc{zaratiana2025gliner2efficientmultitaskinformation,
  title={GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface},
  author={Urchade Zaratiana and Gil Pasternak and Oliver Boyd and George Hurn-Maloney and Ash Lewis},
  year={2025},
  eprint={2507.18546},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.18546},
}
```
Built by [Akhat Rakishev](https://github.com/workhat) at [Overmind](https://overmindai.com).
Overmind is infrastructure to make agents more reliable. Learn more at [overmindai.com](https://overmindai.com).