A fine-tuned [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at [Overmind](https://overmindlab.ai).

## Fine-Tuning Details

- **Base model:** [fastino/gliner2-large-v1](https://huggingface.co/fastino/gliner2-large-v1) (DeBERTa v3 Large backbone, 340M params)
- **Training data:** 1,210 synthetic snippets generated with Gemini 3 Pro + Python Faker, each containing 2–4 PII entities
- **Eval data:** 300 held-out snippets (no template overlap with training)
- **Strategy:** Full-weight fine-tuning with differential learning rates:
  - Encoder (DeBERTa v3): `1e-7`
  - GLiNER-specific layers: `1e-6`
- **Batch size:** 64
- **Convergence:** 175 steps
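The differential learning rates above amount to splitting the model's parameters into two optimizer groups. A minimal sketch of that grouping logic (the `encoder.` prefix is a hypothetical module name for illustration, not necessarily the checkpoint's real parameter naming):

```python
# Partition named parameters into two optimizer groups by module prefix,
# mirroring the encoder-vs-GLiNER-layer learning-rate split described above.
# The prefix is illustrative, not the checkpoint's actual parameter names.
def make_param_groups(named_params, encoder_prefix="encoder."):
    encoder, head = [], []
    for name, param in named_params:
        (encoder if name.startswith(encoder_prefix) else head).append(param)
    return [
        {"params": encoder, "lr": 1e-7},  # DeBERTa v3 backbone: tiny LR
        {"params": head, "lr": 1e-6},     # GLiNER-specific layers: 10x larger
    ]
```

With PyTorch, a list shaped like this can be passed directly to an optimizer constructor such as `torch.optim.AdamW`, which applies each group's `lr` to its own parameters.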

## Why NERPA?

NERPA combines two technical advantages that commercial NER services like AWS Comprehend cannot offer:

### 1. Bi-Encoder Architecture for Zero-Shot Entity Detection

GLiNER2 is a bi-encoder that takes both text and entity label descriptions as input, rather than treating entity types as fixed output classes. This architectural difference means you can define arbitrary entity types at inference time without retraining:

```python
# Standard PII entities
entities = detect_entities(model, text, entities={
    "PERSON_NAME": "Person name",
    "DATE_OF_BIRTH": "Date of birth",
    "EMAIL": "Email address",
})

# Add domain-specific entities on the fly
entities = detect_entities(model, text, entities={
    "PERSON_NAME": "Person name",
    "MEDICATION": "Drug or medication name",
    "DIAGNOSIS": "Medical condition or diagnosis",
    "LAB_VALUE": "Laboratory test result",
})

# Or even abstract analytical entities
entities = detect_entities(model, text, entities={
    "COMMITMENT": "A promise or obligation",
    "ASSUMPTION": "An unstated premise or belief",
    "RISK_FACTOR": "A potential source of risk or uncertainty",
})
```

This isn't prompt engineering or few-shot learning. The model's bi-encoder architecture natively supports arbitrary entity schemas. Fine-tuning on PII improves precision on those specific types without degrading the zero-shot capability.

**Example:** Context-dependent entity distinction

```python
text = """Last weekend, I visited Riverside Farm & Wildlife Park with my family.
The kids were excited to see the tigers first—magnificent creatures pacing behind
the reinforced glass. My daughter Sarah kept comparing them to our tabby cat at home,
saying how similar their stripes looked, though obviously Mittens is much smaller and
sleeps on our couch rather than prowling through artificial jungle habitats."""

entities = detect_entities(model, text, entities={
    "ZOO": "Animals in a zoo or wildlife park",
    "PET": "Pet animals owned by someone",
})
```

Output:

```
Last weekend, I visited Riverside Farm & Wildlife Park with my family. The kids were
excited to see the [ZOO] first—magnificent creatures pacing behind the reinforced glass.
My daughter Sarah kept comparing them to our [PET] at home, saying how similar their
stripes looked, though obviously [PET] is much smaller and sleeps on our couch rather
than prowling through artificial jungle habitats.
```

The model correctly distinguishes tigers (zoo animals) from the tabby cat and even the cat's name Mittens (pets) based purely on contextual cues. No retraining required.
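Bracketed output like the above is straightforward to produce from span predictions. A sketch of the substitution step, assuming the model yields `(start, end, label)` character spans (the exact return format of `detect_entities` is not shown in this card):

```python
# Replace each predicted (start, end, label) character span with a
# [LABEL] placeholder. Spans are applied right-to-left so that earlier
# character offsets remain valid as the string length changes.
def redact(text, spans):
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text
```

For example, `redact("My cat Mittens sleeps.", [(3, 6, "PET"), (7, 14, "PET")])` returns `"My [PET] [PET] sleeps."`.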

### 2. Superior Performance on Standard PII

Fine-tuning GLiNER2 Large on 1,210 synthetic PII examples produced a model that outperforms AWS Comprehend on standard entity detection:
| Model | Micro-Precision | Micro-Recall |
| --- | --- | --- |
| GLiNER2 Large (off-the-shelf) | 0.84 | 0.89 |
| **NERPA (this model)** | **0.93** | **0.90** |

NERPA achieves **3% higher precision** than AWS Comprehend while maintaining comparable recall. The fine-tuning also enables fine-grained date disambiguation (DATE_OF_BIRTH vs DATE_TIME), which AWS Comprehend cannot do without custom model training.
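For reference, micro-averaged precision and recall pool true positives, false positives, and false negatives across all entity types before computing the ratios, so frequent entity types dominate the score. A small sketch with toy counts (not the actual eval figures):

```python
# Micro-averaged precision/recall: sum TP/FP/FN over every entity type
# first, then compute the two ratios once on the pooled counts.
def micro_metrics(counts):
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    return tp / (tp + fp), tp / (tp + fn)

# Toy per-entity counts, for illustration only.
precision, recall = micro_metrics({
    "EMAIL": {"tp": 90, "fp": 5, "fn": 10},
    "PERSON_NAME": {"tp": 80, "fp": 10, "fn": 10},
})
```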

### The Architecture Advantage

AWS Comprehend treats entity types as fixed classification targets. Adding a new entity type requires:

1. Annotating thousands of examples
2. Training a custom model
3. Paying for model hosting
4. Managing model versioning

NERPA's bi-encoder architecture makes entity types a runtime parameter. Adding new entities is a single line of code.
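Concretely, extending the schema is just one more dictionary entry at call time, reusing the `detect_entities` call shape from the examples above (the `CUSTOMER_ID` entity here is a hypothetical example, not a built-in type):

```python
# Built-in PII schema plus one custom type added at runtime --
# no annotation, retraining, or redeployment involved.
pii_entities = {
    "PERSON_NAME": "Person name",
    "EMAIL": "Email address",
}

# The single added line:
pii_entities["CUSTOMER_ID"] = "Internal customer account identifier"

# entities = detect_entities(model, text, entities=pii_entities)
```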

## Pre-Optimised PII Entity Types

NERPA is fine-tuned on these entity types (but you can add more at inference time):

| Entity | Description |
| --- | --- |
| `TECHNICAL_ID_NUMBERS` | IP/MAC addresses, serial numbers |
| `VEHICLE_ID_NUMBERS` | License plates, VINs |

## Quick Start

### Install dependencies