# NERPA — Fine-Tuned GLiNER2 for PII Anonymisation

A fine-tuned [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at [Overmind](https://overmindai.com).

## Why NERPA?

AWS Comprehend is a solid NER service, but it's a black box. The specific problem we hit was **date granularity** — Comprehend labels both a date of birth and an appointment date as `DATE`, but for PII anonymisation these require very different treatment. A DOB must be redacted; an appointment date is often essential debugging context.

GLiNER2 is a bi-encoder model that takes both the text and entity label descriptions as input, enabling zero-shot detection of arbitrary entity types. We fine-tuned GLiNER2 Large to:

1. **Distinguish fine-grained date types** (`DATE_OF_BIRTH` vs `DATE_TIME`)
2. **Exceed AWS Comprehend accuracy** on our PII benchmark

| Model | Micro-Precision | Micro-Recall |
| --- | --- | --- |
| AWS Comprehend | 0.90 | 0.94 |
| GLiNER2 Large (off-the-shelf) | 0.84 | 0.89 |
| **NERPA (this model)** | **0.93** | **0.90** |
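
Micro-averaged scores like those in the table pool true positives, false positives, and false negatives across all entity types before dividing, rather than averaging per-type scores. A minimal sketch of the computation (the counts here are made up for illustration, not taken from our benchmark):

```python
def micro_precision_recall(counts):
    """counts: {entity_type: (tp, fp, fn)} pooled over the eval set."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return tp / (tp + fp), tp / (tp + fn)

# Illustrative counts only:
counts = {"PERSON_NAME": (90, 5, 4), "EMAIL": (50, 3, 8)}
p, r = micro_precision_recall(counts)
print(f"micro-P={p:.2f} micro-R={r:.2f}")  # → micro-P=0.95 micro-R=0.92
```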

## Fine-Tuning Details

- **Base model:** [fastino/gliner2-large-v1](https://huggingface.co/fastino/gliner2-large-v1) (DeBERTa v3 Large backbone, 340M params)
- **Training data:** 1,210 synthetic snippets generated with Gemini 3 Pro + Python Faker, each containing 2–4 PII entities
- **Eval data:** 300 held-out snippets (no template overlap with training)
- **Strategy:** Full-weight fine-tuning with differential learning rates:
  - Encoder (DeBERTa v3): `1e-7`
  - GLiNER-specific layers: `1e-6`
- **Batch size:** 64
- **Convergence:** 175 steps
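
Differential learning rates of this kind are typically implemented with optimizer parameter groups, splitting parameters by name prefix. A minimal sketch of the grouping logic (the `encoder.` prefix and parameter names below are hypothetical; the resulting groups would be passed to an optimizer such as `AdamW(groups)`):

```python
def build_param_groups(named_params, encoder_prefix="encoder.",
                       encoder_lr=1e-7, head_lr=1e-6):
    """Split (name, param) pairs into two optimizer groups with different LRs."""
    encoder, head = [], []
    for name, param in named_params:
        (encoder if name.startswith(encoder_prefix) else head).append(param)
    return [
        {"params": encoder, "lr": encoder_lr},  # backbone: small LR
        {"params": head, "lr": head_lr},        # task-specific layers: larger LR
    ]

# Hypothetical parameter names for illustration:
names = ["encoder.layer.0.weight", "encoder.layer.0.bias", "span_head.weight"]
groups = build_param_groups((n, object()) for n in names)
print([len(g["params"]) for g in groups])  # → [2, 1]
```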

The synthetic data approach effectively distils the "knowledge" of a large LLM into a small, fast specialist model — what we call **indirect distillation**.

## Supported Entity Types

| Entity | Description |
| --- | --- |
| `PERSON_NAME` | Person name |
| `DATE_OF_BIRTH` | Date of birth |
| `DATE_TIME` | Generic date and time |
| `EMAIL` | Email address |
| `PHONE` | Phone number |
| `LOCATION` | Address, city, country, postcode, street |
| `AGE` | Age of a person |
| `BUSINESS_NAME` | Business name |
| `USERNAME` | Username |
| `URL` | Any URL |
| `BANK_ACCOUNT_DETAILS` | IBAN, SWIFT, routing numbers, etc. |
| `CARD_DETAILS` | Card number, CVV, expiration |
| `DIGITAL_KEYS` | Passwords, PINs, API keys |
| `PERSONAL_ID_NUMBERS` | Passport, driving licence, tax IDs |
| `TECHNICAL_ID_NUMBERS` | IP/MAC addresses, serial numbers |
| `VEHICLE_ID_NUMBERS` | License plates, VINs |

## Quick Start

### Install dependencies

```bash
pip install gliner2 torch
```

### Anonymise text (CLI)

```bash
# Inline text
python anonymise.py "Dear John Smith, born 15/03/1990. Contact: john@acme.com"

# From file
python anonymise.py --file input.txt --output anonymised.txt

# Show detected entities
python anonymise.py --show-entities "Call me at 020-7946-0958, my IBAN is GB29NWBK60161331926819."
```

### Use in Python

```python
from anonymise import load_model, detect_entities, anonymise

model = load_model(".")  # path to this repo

text = (
    "Dear John Smith, your appointment is on 2025-03-15. "
    "Your date of birth (15/03/1990) has been verified. "
    "Please contact support at help@acme.com or call 020-7946-0958. "
    "Your account IBAN is GB29NWBK60161331926819. Regards, Acme Corp."
)

entities = detect_entities(model, text)
print(anonymise(text, entities))
```

Output:

```
Dear [PERSON_NAME], your appointment is on [DATE_TIME].
Your date of birth ([DATE_OF_BIRTH]) has been verified.
Please contact support at [EMAIL] or call [PHONE].
Your account IBAN is [BANK_ACCOUNT_DETAILS]. Regards, Acme Corp.
```

### Entity detection only

If you just need the raw entity offsets (e.g. for your own replacement logic):

```python
entities = detect_entities(model, text)
for e in entities:
    print(f'{e["type"]:25s} [{e["start"]}:{e["end"]}] score={e["score"]:.2f} "{text[e["start"]:e["end"]]}"')
```

```
PERSON_NAME               [5:15] score=1.00 "John Smith"
DATE_TIME                 [40:50] score=1.00 "2025-03-15"
DATE_OF_BIRTH             [72:82] score=1.00 "15/03/1990"
EMAIL                     [129:142] score=1.00 "help@acme.com"
PHONE                     [151:164] score=1.00 "020-7946-0958"
BANK_ACCOUNT_DETAILS      [187:209] score=1.00 "GB29NWBK60161331926819"
```

### Detect a subset of entities

```python
entities = detect_entities(model, text, entities={
    "PERSON_NAME": "Person name",
    "EMAIL": "Email",
})
```

## How It Works

The inference pipeline in `anonymise.py`:

1. **Chunking** — Long texts are split into 3000-character chunks with 100-character overlap to stay within the model's context window.
2. **Batch prediction** — Chunks are fed through `GLiNER2.batch_extract_entities()` with `include_spans=True` to get character-level offsets.
3. **Date disambiguation** — `DATE_TIME` and `DATE_OF_BIRTH` are always detected together so the model can choose the best label per span.
4. **De-duplication** — Overlapping detections from chunk boundaries are merged, keeping the highest-confidence label for each position.
5. **Replacement** — Detected spans are replaced right-to-left with `[ENTITY_TYPE]` placeholders.
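
The chunking, de-duplication, and replacement steps can be sketched as follows. This is a simplified illustration of the logic described above, not the actual `anonymise.py` implementation; entity dicts use the `{"type", "start", "end", "score"}` shape shown earlier:

```python
def chunk_text(text, size=3000, overlap=100):
    """Split text into overlapping windows, returning (offset, chunk) pairs."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append((start, text[start:start + size]))
        if start + size >= len(text):
            break
        start += size - overlap
    return chunks

def dedupe(entities):
    """Drop overlapping detections, keeping the highest-confidence span."""
    kept = []
    for e in sorted(entities, key=lambda e: -e["score"]):
        if all(e["end"] <= k["start"] or e["start"] >= k["end"] for k in kept):
            kept.append(e)
    return sorted(kept, key=lambda e: e["start"])

def replace_spans(text, entities):
    """Replace spans right-to-left so earlier offsets stay valid."""
    for e in sorted(entities, key=lambda e: -e["start"]):
        text = text[:e["start"]] + f'[{e["type"]}]' + text[e["end"]:]
    return text

ents = [
    {"type": "PERSON_NAME", "start": 5, "end": 15, "score": 1.0},
    {"type": "USERNAME", "start": 5, "end": 9, "score": 0.4},  # overlap, lower score
]
print(replace_spans("Dear John Smith, hello.", dedupe(ents)))
# → Dear [PERSON_NAME], hello.
```

Replacing right-to-left means each substitution only shifts text after offsets that have already been handled, so the remaining spans need no re-indexing.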

## Notes

- **Confidence threshold:** Default is `0.25`. The model tends to be conservative, so a lower threshold works well for high recall.
- **GLiNER2 version:** Requires `gliner2>=1.2.4`. Earlier versions returned token positions instead of character offsets for entity spans; this is fixed in 1.2.4+.
- **Device:** Automatically selects the best available device: CUDA, then MPS, then CPU.

## Citation

Built by [Akhat Rakishev](https://github.com/workhat) at [Overmind](https://overmindai.com).

Base model: [GLiNER2](https://huggingface.co/fastino/gliner2-large-v1) by Fastino AI.