# NERPA: Fine-Tuned GLiNER2 for PII Anonymisation

A fine-tuned GLiNER2 Large (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at Overmind.

## Why NERPA?

AWS Comprehend is a solid NER service, but it's a black box. The specific problem we hit was date granularity: Comprehend labels both a Date of Birth and an Appointment Date as `DATE`, but for PII anonymisation these require very different treatment. A DOB must be redacted; an appointment date is often essential debugging context.

GLiNER2 is a bi-encoder model that takes both text and entity label descriptions as input, enabling zero-shot entity detection for arbitrary types. We fine-tuned GLiNER2 Large to:

  1. Distinguish fine-grained date types (DATE_OF_BIRTH vs DATE_TIME)
  2. Exceed AWS Comprehend accuracy on our PII benchmark

| Model | Micro-Precision | Micro-Recall |
|---|---|---|
| AWS Comprehend | 0.90 | 0.94 |
| GLiNER2 Large (off-the-shelf) | 0.84 | 0.89 |
| NERPA (this model) | 0.93 | 0.90 |
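The table reports micro-averaged scores, which pool true positives, false positives, and false negatives across all entity types before computing the ratios, so frequent types dominate the score. A minimal sketch of the metric itself (the example counts are made up, not our benchmark data):

```python
def micro_precision_recall(counts):
    """counts: iterable of (tp, fp, fn) tuples, one per entity type.

    Micro-averaging sums counts across all types first, then divides,
    unlike macro-averaging, which averages per-type scores.
    """
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Two entity types with (tp, fp, fn) counts:
p, r = micro_precision_recall([(90, 5, 10), (45, 10, 5)])
# p = 135/150 = 0.90, r = 135/150 = 0.90
```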

## Fine-Tuning Details

- **Base model**: `fastino/gliner2-large-v1` (DeBERTa v3 Large backbone, 340M params)
- **Training data**: 1,210 synthetic snippets generated with Gemini 3 Pro + Python Faker, each containing 2–4 PII entities
- **Eval data**: 300 held-out snippets (no template overlap with training)
- **Strategy**: full-weight fine-tuning with differential learning rates:
  - Encoder (DeBERTa v3): 1e-7
  - GLiNER-specific layers: 1e-6
- **Batch size**: 64
- **Convergence**: 175 steps
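Differential learning rates like these are typically implemented with optimizer parameter groups. A minimal sketch (the `encoder.` name prefix is an assumption for illustration; the actual GLiNER2 module names may differ), where the returned groups would be handed to e.g. `torch.optim.AdamW`:

```python
def split_param_groups(named_params, encoder_lr=1e-7, head_lr=1e-6):
    """Split (name, param) pairs into two groups with different learning rates.

    Parameters whose name starts with "encoder." (assumed here to be the
    DeBERTa v3 backbone) get the lower LR; everything else (the
    GLiNER-specific layers) gets the higher one.
    """
    encoder, head = [], []
    for name, param in named_params:
        (encoder if name.startswith("encoder.") else head).append(param)
    return [
        {"params": encoder, "lr": encoder_lr},
        {"params": head, "lr": head_lr},
    ]

# Usage sketch: torch.optim.AdamW(split_param_groups(model.named_parameters()))
```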

The synthetic data approach effectively distils the "knowledge" of a large LLM into a small, fast specialist model, an approach we call *indirect distillation*.
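To illustrate the shape of that pipeline: snippet templates carry labelled PII slots, and each slot is filled with a fake value. This stdlib-only sketch is illustrative; the real pipeline generates templates with Gemini and fills slots with Python Faker, and all names and values below are invented:

```python
import random

# Hypothetical templates with PII slots (the real ones come from Gemini).
TEMPLATES = [
    "Dear {name}, your date of birth ({dob}) has been verified.",
    "Contact {name} at {email} before {date}.",
]

# Hypothetical fake values (the real ones come from Python Faker).
FAKES = {
    "name": ["John Smith", "Jane Doe"],
    "dob": ["15/03/1990", "02/11/1985"],
    "email": ["john@acme.com", "jane@acme.com"],
    "date": ["2025-03-15", "2025-04-01"],
}

def make_snippet(rng=random):
    """Pick a template and fill every slot it mentions with a random fake."""
    template = rng.choice(TEMPLATES)
    return template.format(**{k: rng.choice(v) for k, v in FAKES.items()})
```

Because the generator knows exactly which span it inserted for each slot, every snippet comes with gold entity labels for free.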

## Supported Entity Types

| Entity | Description |
|---|---|
| `PERSON_NAME` | Person name |
| `DATE_OF_BIRTH` | Date of birth |
| `DATE_TIME` | Generic date and time |
| `EMAIL` | Email address |
| `PHONE` | Phone numbers |
| `LOCATION` | Address, city, country, postcode, street |
| `AGE` | Age of a person |
| `BUSINESS_NAME` | Business name |
| `USERNAME` | Username |
| `URL` | Any URL |
| `BANK_ACCOUNT_DETAILS` | IBAN, SWIFT, routing numbers, etc. |
| `CARD_DETAILS` | Card number, CVV, expiration |
| `DIGITAL_KEYS` | Passwords, PINs, API keys |
| `PERSONAL_ID_NUMBERS` | Passport, driving licence, tax IDs |
| `TECHNICAL_ID_NUMBERS` | IP/MAC addresses, serial numbers |
| `VEHICLE_ID_NUMBERS` | License plates, VINs |
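In code, this schema is just a label-to-description mapping, which is the shape `detect_entities` accepts for its `entities` argument (see the Quick Start below). Keeping the full set in one dict makes restricting to a subset trivial:

```python
# The full NERPA label schema, mirroring the table above.
PII_LABELS = {
    "PERSON_NAME": "Person name",
    "DATE_OF_BIRTH": "Date of birth",
    "DATE_TIME": "Generic date and time",
    "EMAIL": "Email address",
    "PHONE": "Phone numbers",
    "LOCATION": "Address, city, country, postcode, street",
    "AGE": "Age of a person",
    "BUSINESS_NAME": "Business name",
    "USERNAME": "Username",
    "URL": "Any URL",
    "BANK_ACCOUNT_DETAILS": "IBAN, SWIFT, routing numbers, etc.",
    "CARD_DETAILS": "Card number, CVV, expiration",
    "DIGITAL_KEYS": "Passwords, PINs, API keys",
    "PERSONAL_ID_NUMBERS": "Passport, driving licence, tax IDs",
    "TECHNICAL_ID_NUMBERS": "IP/MAC addresses, serial numbers",
    "VEHICLE_ID_NUMBERS": "License plates, VINs",
}

# e.g. restrict detection to contact details only
CONTACT_ONLY = {k: PII_LABELS[k] for k in ("PERSON_NAME", "EMAIL", "PHONE")}
```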

## Quick Start

### Install dependencies

```bash
pip install gliner2 torch
```

### Anonymise text (CLI)

```bash
# Inline text
python anonymise.py "Dear John Smith, born 15/03/1990. Contact: john@acme.com"

# From file
python anonymise.py --file input.txt --output anonymised.txt

# Show detected entities
python anonymise.py --show-entities "Call me at 020-7946-0958, my IBAN is GB29NWBK60161331926819."
```

### Use in Python

```python
from anonymise import load_model, detect_entities, anonymise

model = load_model(".")  # path to this repo

text = (
    "Dear John Smith, your appointment is on 2025-03-15. "
    "Your date of birth (15/03/1990) has been verified. "
    "Please contact support at help@acme.com or call 020-7946-0958. "
    "Your account IBAN is GB29NWBK60161331926819. Regards, Acme Corp."
)

entities = detect_entities(model, text)
print(anonymise(text, entities))
```

Output:

```text
Dear [PERSON_NAME], your appointment is on [DATE_TIME].
Your date of birth ([DATE_OF_BIRTH]) has been verified.
Please contact support at [EMAIL] or call [PHONE].
Your account IBAN is [BANK_ACCOUNT_DETAILS]. Regards, Acme Corp.
```

### Entity detection only

If you just need the raw entity offsets (e.g. for your own replacement logic):

```python
entities = detect_entities(model, text)
for e in entities:
    print(f'{e["type"]:25s} [{e["start"]}:{e["end"]}] score={e["score"]:.2f}  "{text[e["start"]:e["end"]]}"')
```

```text
PERSON_NAME               [5:15]  score=1.00  "John Smith"
DATE_TIME                 [40:50] score=1.00  "2025-03-15"
DATE_OF_BIRTH             [72:82] score=1.00  "15/03/1990"
EMAIL                     [129:142] score=1.00  "help@acme.com"
PHONE                     [151:164] score=1.00  "020-7946-0958"
BANK_ACCOUNT_DETAILS      [187:209] score=1.00  "GB29NWBK60161331926819"
```

### Detect a subset of entities

```python
entities = detect_entities(model, text, entities={
    "PERSON_NAME": "Person name",
    "EMAIL": "Email",
})
```

## How It Works

The inference pipeline in `anonymise.py`:

1. **Chunking**: long texts are split into 3000-character chunks with 100-char overlap to stay within the model's context window.
2. **Batch prediction**: chunks are fed through `GLiNER2.batch_extract_entities()` with `include_spans=True` to get character-level offsets.
3. **Date disambiguation**: both `DATE_TIME` and `DATE_OF_BIRTH` are always detected together so the model can choose the best label per span.
4. **De-duplication**: overlapping detections from chunk boundaries are merged, keeping the highest-confidence label for each position.
5. **Replacement**: detected spans are replaced right-to-left with `[ENTITY_TYPE]` placeholders.
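The chunking, de-duplication, and replacement steps can be sketched as plain-Python helpers. These are illustrative reimplementations of the ideas above, not the actual `anonymise.py` code:

```python
def chunk_text(text, size=3000, overlap=100):
    """Step 1: fixed-size chunks with overlap; yields (offset, chunk) pairs.

    The offset lets per-chunk entity spans be mapped back to positions
    in the full text.
    """
    step = size - overlap
    start = 0
    while start < len(text):
        yield start, text[start:start + size]
        if start + size >= len(text):
            break
        start += step

def dedupe(entities):
    """Step 4: keep the highest-scoring entity for each overlapping region."""
    kept = []
    for e in sorted(entities, key=lambda e: -e["score"]):
        # Keep e only if it does not overlap anything already kept.
        if all(e["end"] <= k["start"] or e["start"] >= k["end"] for k in kept):
            kept.append(e)
    return sorted(kept, key=lambda e: e["start"])

def replace_spans(text, entities):
    """Step 5: replace spans right-to-left so earlier offsets stay valid."""
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:e["start"]] + f"[{e['type']}]" + text[e["end"]:]
    return text
```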

## Notes

- **Confidence threshold**: the default is 0.25. The model tends to be conservative, so a lower threshold works well for high recall.
- **GLiNER2 version**: requires `gliner2>=1.2.4`. Earlier versions had a bug where entity character offsets mapped to token positions instead of character positions; this is fixed in 1.2.4+.
- **Device**: automatically uses CUDA > MPS > CPU.
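If you handle raw entities yourself, the threshold is just a post-filter on the `score` field shown in the entity-detection example above. A trivial sketch:

```python
def filter_by_score(entities, threshold=0.25):
    """Drop detections below the confidence threshold.

    Lower thresholds favour recall (fewer missed PII spans) at the cost
    of more false positives; 0.25 is this pipeline's default.
    """
    return [e for e in entities if e["score"] >= threshold]
```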

## Citation

Built by Akhat Rakishev at Overmind.

Base model: GLiNER2 by Fastino AI.