---
language:
- en
license: apache-2.0
library_name: gliner2
tags:
- named-entity-recognition
- ner
- pii
- anonymisation
- gliner
- gliner2
- token-classification
- privacy
datasets:
- synthetic
base_model: fastino/gliner2-large-v1
model-index:
- name: NERPA
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: precision
      value: 0.93
      name: Micro-Precision
    - type: recall
      value: 0.90
      name: Micro-Recall
pipeline_tag: token-classification
---

# NERPA - Fine-Tuned GLiNER2 for PII Anonymisation

A fine-tuned [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at [Overmind](https://overmindlab.ai).

## Why NERPA?

AWS Comprehend is a solid NER service, but it's a black box. The specific problem we hit was **date granularity** — Comprehend labels both a date of birth and an appointment date as `DATE`, but for PII anonymisation these require very different treatment. A DOB must be redacted; an appointment date is often essential debugging context.

GLiNER2 is a bi-encoder model that takes both text and entity label descriptions as input, enabling zero-shot entity detection for arbitrary types. We fine-tuned GLiNER2 Large to:

1. **Distinguish fine-grained date types** (`DATE_OF_BIRTH` vs `DATE_TIME`)
2. **Exceed AWS Comprehend accuracy** on our PII benchmark

| Model | Micro-Precision | Micro-Recall |
| --- | --- | --- |
| AWS Comprehend | 0.90 | 0.94 |
| GLiNER2 Large (off-the-shelf) | 0.84 | 0.89 |
| **NERPA (this model)** | **0.93** | **0.90** |

## Fine-Tuning Details

- **Base model:** [fastino/gliner2-large-v1](https://huggingface.co/fastino/gliner2-large-v1) (DeBERTa v3 Large backbone, 340M params)
- **Training data:** 1,210 synthetic snippets generated with Gemini 3 Pro + Python Faker, each containing 2–4 PII entities
- **Eval data:** 300 held-out snippets (no template overlap with training)
- **Strategy:** full-weight fine-tuning with differential learning rates (see the sketch after this list):
  - Encoder (DeBERTa v3): `1e-7`
  - GLiNER-specific layers: `1e-6`
- **Batch size:** 64
- **Convergence:** 175 steps

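As a rough illustration of that differential learning-rate setup, here is a minimal PyTorch sketch. The `encoder`/`head` modules below are toy stand-ins (in the real setup the encoder is the DeBERTa v3 Large backbone inside GLiNER2), and the parameter-group split is an assumption, not the actual NERPA training code:

```python
import torch
from torch import nn

# Toy stand-in for the real model: an "encoder" plus GLiNER-specific "head" layers.
model = nn.ModuleDict({
    "encoder": nn.Linear(768, 768),  # stands in for the DeBERTa v3 backbone
    "head": nn.Linear(768, 16),      # stands in for the GLiNER-specific layers
})

# Split parameters by module prefix and give each group its own learning rate.
encoder_params = [p for n, p in model.named_parameters() if n.startswith("encoder")]
head_params = [p for n, p in model.named_parameters() if not n.startswith("encoder")]

optimizer = torch.optim.AdamW([
    {"params": encoder_params, "lr": 1e-7},  # backbone: very gentle updates
    {"params": head_params, "lr": 1e-6},     # task layers: 10x higher rate
])
```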

The synthetic data approach effectively distils the "knowledge" of a large LLM into a small, fast specialist model — what we call **indirect distillation**.

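To make the data-generation step concrete, here is a hedged sketch of the Faker half of that pipeline. The template and locale are hypothetical examples, not the actual prompts or snippets used for training (the Gemini side, which writes varied surrounding prose, is not shown):

```python
from faker import Faker

fake = Faker("en_GB")

# One illustrative synthetic snippet containing several PII entities. During
# dataset creation, the inserted values and their spans would be recorded as labels.
snippet = (
    f"Dear {fake.name()}, your payment from IBAN {fake.iban()} was received. "
    f"We emailed a receipt to {fake.email()} on {fake.date_this_year()}."
)
print(snippet)
```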

## Supported Entity Types

| Entity | Description |
| --- | --- |
| `PERSON_NAME` | Person name |
| `DATE_OF_BIRTH` | Date of birth |
| `DATE_TIME` | Generic date and time |
| `EMAIL` | Email address |
| `PHONE` | Phone numbers |
| `LOCATION` | Address, city, country, postcode, street |
| `AGE` | Age of a person |
| `BUSINESS_NAME` | Business name |
| `USERNAME` | Username |
| `URL` | Any URL |
| `BANK_ACCOUNT_DETAILS` | IBAN, SWIFT, routing numbers, etc. |
| `CARD_DETAILS` | Card number, CVV, expiration |
| `DIGITAL_KEYS` | Passwords, PINs, API keys |
| `PERSONAL_ID_NUMBERS` | Passport, driving licence, tax IDs |
| `TECHNICAL_ID_NUMBERS` | IP/MAC addresses, serial numbers |
| `VEHICLE_ID_NUMBERS` | License plates, VINs |

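These labels and descriptions map directly onto the label-to-description dictionary that `detect_entities` accepts (see "Detect a subset of entities" below). A partial sketch of that mapping, in case you want to build your own subset:

```python
# Partial mapping taken from the table above; extend it with any other rows you need.
PII_ENTITIES = {
    "PERSON_NAME": "Person name",
    "DATE_OF_BIRTH": "Date of birth",
    "DATE_TIME": "Generic date and time",
    "EMAIL": "Email address",
    "PHONE": "Phone numbers",
    "BANK_ACCOUNT_DETAILS": "IBAN, SWIFT, routing numbers, etc.",
}
```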

## Quick Start

### Install dependencies

```bash
pip install gliner2 torch
```

### Anonymise text (CLI)

```bash
# Inline text
python anonymise.py "Dear John Smith, born 15/03/1990. Contact: john@acme.com"

# From file
python anonymise.py --file input.txt --output anonymised.txt

# Show detected entities
python anonymise.py --show-entities "Call me at 020-7946-0958, my IBAN is GB29NWBK60161331926819."
```

### Use in Python

```python
from anonymise import load_model, detect_entities, anonymise

model = load_model(".")  # path to this repo

text = (
    "Dear John Smith, your appointment is on 2025-03-15. "
    "Your date of birth (15/03/1990) has been verified. "
    "Please contact support at help@acme.com or call 020-7946-0958. "
    "Your IBAN on file is GB29NWBK60161331926819."
)

entities = detect_entities(model, text)
print(anonymise(text, entities))
```

Output:

```
Dear [PERSON_NAME], your appointment is on [DATE_TIME].
Your date of birth ([DATE_OF_BIRTH]) has been verified.
Please contact support at [EMAIL] or call [PHONE].
Your IBAN on file is [BANK_ACCOUNT_DETAILS].
```

### Entity detection only

If you just need the raw entity offsets (e.g. for your own replacement logic):

```python
entities = detect_entities(model, text)
for e in entities:
    print(f'{e["type"]:25s} [{e["start"]}:{e["end"]}] score={e["score"]:.2f} "{text[e["start"]:e["end"]]}"')
```

```
PERSON_NAME               [5:15] score=1.00 "John Smith"
DATE_TIME                 [40:50] score=1.00 "2025-03-15"
DATE_OF_BIRTH             [72:82] score=1.00 "15/03/1990"
EMAIL                     [129:142] score=1.00 "help@acme.com"
PHONE                     [151:164] score=1.00 "020-7946-0958"
BANK_ACCOUNT_DETAILS      [187:209] score=1.00 "GB29NWBK60161331926819"
```

### Detect a subset of entities

```python
entities = detect_entities(model, text, entities={
    "PERSON_NAME": "Person name",
    "EMAIL": "Email",
})
```

## How It Works

The inference pipeline in `anonymise.py`:

1. **Chunking** — Long texts are split into 3000-character chunks with 100-character overlap to stay within the model's context window. The chunk size can be varied, since the underlying DeBERTa v3 encoder uses relative position embeddings; in our tests, 3000 characters worked as well as smaller sizes.
2. **Batch prediction** — Chunks are fed through `GLiNER2.batch_extract_entities()` with `include_spans=True` to get character-level offsets.
3. **Date disambiguation** — `DATE_TIME` and `DATE_OF_BIRTH` are always requested together, so the model can choose the best label for each span.
4. **De-duplication** — Overlapping detections from chunk boundaries are merged, keeping the highest-confidence label for each position.
5. **Replacement** — Detected spans are replaced right-to-left with `[ENTITY_TYPE]` placeholders, so earlier offsets stay valid (see the sketch after this list).

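A minimal, illustrative sketch of steps 1, 4 and 5 (chunking, de-duplication and replacement). The helper names are hypothetical, not the actual `anonymise.py` internals; the entity dicts use the same `{"type", "start", "end", "score"}` shape shown earlier on this card:

```python
CHUNK_SIZE, OVERLAP = 3000, 100

def chunk_text(text: str) -> list[tuple[int, str]]:
    """Step 1: split into overlapping chunks, keeping each chunk's global offset."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append((start, text[start:start + CHUNK_SIZE]))
        if start + CHUNK_SIZE >= len(text):
            break
        start += CHUNK_SIZE - OVERLAP
    return chunks

def deduplicate(entities: list[dict]) -> list[dict]:
    """Step 4: drop overlapping detections, keeping the highest-confidence span."""
    kept = []
    for ent in sorted(entities, key=lambda e: -e["score"]):
        if all(ent["end"] <= k["start"] or ent["start"] >= k["end"] for k in kept):
            kept.append(ent)
    return sorted(kept, key=lambda e: e["start"])

def replace_spans(text: str, entities: list[dict]) -> str:
    """Step 5: replace right-to-left so earlier character offsets stay valid."""
    for ent in sorted(entities, key=lambda e: -e["start"]):
        text = text[:ent["start"]] + f'[{ent["type"]}]' + text[ent["end"]:]
    return text

# Tiny usage example with a hand-made detection:
ents = [{"type": "EMAIL", "start": 19, "end": 32, "score": 0.99}]
print(replace_spans("Contact support at help@acme.com.", ents))  # -> Contact support at [EMAIL].
```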

## Notes

- **Confidence threshold:** Default is `0.25`. The model tends to be conservative with its confidence scores, so a relatively low threshold works well for high recall.
- **GLiNER2 version:** Requires `gliner2>=1.2.4`. Earlier versions had a bug where entity offsets mapped to token positions instead of character positions; this is fixed in 1.2.4+.
- **Device:** Automatically uses CUDA > MPS > CPU (see the sketch after this list).

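One common way to implement that device preference in PyTorch; this is a sketch of the behaviour described above, not necessarily the exact code in `anonymise.py`:

```python
import torch

# Pick the best available device: CUDA first, then Apple Silicon (MPS), then CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
```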

## Acknowledgements

This model is a fine-tuned version of [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) by [Fastino AI](https://fastino.ai). We thank the GLiNER2 authors for making their model and library openly available.

## Citation

If you use NERPA, please cite both this model and the original GLiNER2 paper:

```bibtex
@misc{nerpa2025,
  title={NERPA: Fine-Tuned GLiNER2 for PII Anonymisation},
  author={Akhat Rakishev},
  year={2025},
  url={https://huggingface.co/OvermindLab/nerpa},
}

@misc{zaratiana2025gliner2efficientmultitaskinformation,
  title={GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface},
  author={Urchade Zaratiana and Gil Pasternak and Oliver Boyd and George Hurn-Maloney and Ash Lewis},
  year={2025},
  eprint={2507.18546},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.18546},
}
```

Built by [Akhat Rakishev](https://github.com/akhatre) at [Overmind](https://overmindlab.ai).

Overmind is infrastructure for end-to-end agent optimisation. Learn more at [overmindlab.ai](https://overmindlab.ai).