TylerOvermind committed on
Commit a4107bb · verified · 1 Parent(s): 29ae185

Added customer entity examples

Files changed (1)
  1. README.md +82 -17
README.md CHANGED
@@ -35,14 +35,80 @@ pipeline_tag: token-classification
 
 A fine-tuned [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at [Overmind](https://overmindlab.ai).
 
 ## Why NERPA?
 
- AWS Comprehend is a solid NER service, but it's a black box. The specific problem we hit was **date granularity** — Comprehend labels both a Date of Birth and an Appointment Date as `DATE`, but for PII anonymisation these require very different treatment. A DOB must be redacted; an appointment date is often essential debugging context.
 
- GLiNER2 is a bi-encoder model that takes both text and entity label descriptions as input, enabling zero-shot entity detection for arbitrary types. We fine-tuned GLiNER2 Large to:
 
- 1. **Distinguish fine-grained date types** (DATE_OF_BIRTH vs DATE_TIME)
- 2. **Exceed AWS Comprehend accuracy** on our PII benchmark
 
 | Model | Micro-Precision | Micro-Recall |
 | --- | --- | --- |
@@ -50,20 +116,21 @@ GLiNER2 is a bi-encoder model that takes both text and entity label descriptions
 | GLiNER2 Large (off-the-shelf) | 0.84 | 0.89 |
 | **NERPA (this model)** | **0.93** | **0.90** |
 
- ## Fine-Tuning Details
-
- - **Base model:** [fastino/gliner2-large-v1](https://huggingface.co/fastino/gliner2-large-v1) (DeBERTa v3 Large backbone, 340M params)
- - **Training data:** 1,210 synthetic snippets generated with Gemini 3 Pro + Python Faker, each containing 2–4 PII entities
- - **Eval data:** 300 held-out snippets (no template overlap with training)
- - **Strategy:** Full weight fine-tuning with differential learning rates:
-   - Encoder (DeBERTa v3): `1e-7`
-   - GLiNER-specific layers: `1e-6`
- - **Batch size:** 64
- - **Convergence:** 175 steps
 
- The synthetic data approach effectively distils the "knowledge" of a large LLM into a small, fast specialist model — what we call **indirect distillation**.
 
- ## Supported Entity Types
 
 | Entity | Description |
 | --- | --- |
@@ -84,8 +151,6 @@ The synthetic data approach effectively distils the "knowledge" of a large LLM i
 | `TECHNICAL_ID_NUMBERS` | IP/MAC addresses, serial numbers |
 | `VEHICLE_ID_NUMBERS` | License plates, VINs |
 
- Since NERPA is built on GLiNER2 (a zero-shot bi-encoder), it is **not limited** to the entities above. You can pass any custom entity types alongside the built-in ones — the fine-tuning does not reduce the model's ability to detect arbitrary categories. See [Custom entities](#custom-entities) below.
-
 ## Quick Start
 
 ### Install dependencies
 
 A fine-tuned [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at [Overmind](https://overmindlab.ai).
 
+ ## Fine-Tuning Details
+
+ - **Base model:** [fastino/gliner2-large-v1](https://huggingface.co/fastino/gliner2-large-v1) (DeBERTa v3 Large backbone, 340M params)
+ - **Training data:** 1,210 synthetic snippets generated with Gemini 3 Pro + Python Faker, each containing 2–4 PII entities
+ - **Eval data:** 300 held-out snippets (no template overlap with training)
+ - **Strategy:** Full weight fine-tuning with differential learning rates:
+   - Encoder (DeBERTa v3): `1e-7`
+   - GLiNER-specific layers: `1e-6`
+ - **Batch size:** 64
+ - **Convergence:** 175 steps
+
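The differential learning rates above can be sketched as standard optimizer parameter groups. This is a minimal illustration, not NERPA's actual training code; the `encoder.` name prefix and the helper function are assumptions:

```python
# Sketch: split parameters into two optimizer groups with the LRs above.
# The "encoder." name prefix is an assumed convention, not NERPA's real layout.
ENCODER_LR = 1e-7  # DeBERTa v3 backbone
GLINER_LR = 1e-6   # GLiNER-specific layers

def build_param_groups(named_params, encoder_prefix="encoder."):
    """Partition (name, param) pairs into per-learning-rate optimizer groups."""
    encoder, gliner = [], []
    for name, param in named_params:
        (encoder if name.startswith(encoder_prefix) else gliner).append(param)
    return [
        {"params": encoder, "lr": ENCODER_LR},
        {"params": gliner, "lr": GLINER_LR},
    ]
```

With PyTorch, the returned groups would be passed directly to the optimizer, e.g. `torch.optim.AdamW(build_param_groups(model.named_parameters()))`.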
 ## Why NERPA?
 
+ NERPA combines two technical advantages that commercial NER services like AWS Comprehend cannot offer:
+
+ ### 1. Bi-Encoder Architecture for Zero-Shot Entity Detection
+
+ GLiNER2 is a bi-encoder that takes both text and entity label descriptions as input, rather than treating entity types as fixed output classes. This architectural difference means you can define arbitrary entity types at inference time without retraining:
+
+ ```python
+ # Standard PII entities
+ entities = detect_entities(model, text, entities={
+     "PERSON_NAME": "Person name",
+     "DATE_OF_BIRTH": "Date of birth",
+     "EMAIL": "Email address",
+ })
+
+ # Add domain-specific entities on the fly
+ entities = detect_entities(model, text, entities={
+     "PERSON_NAME": "Person name",
+     "MEDICATION": "Drug or medication name",
+     "DIAGNOSIS": "Medical condition or diagnosis",
+     "LAB_VALUE": "Laboratory test result",
+ })
+
+ # Or even abstract analytical entities
+ entities = detect_entities(model, text, entities={
+     "COMMITMENT": "A promise or obligation",
+     "ASSUMPTION": "An unstated premise or belief",
+     "RISK_FACTOR": "A potential source of risk or uncertainty",
+ })
+ ```
+
+ This isn't prompt engineering or few-shot learning. The model's bi-encoder architecture natively supports arbitrary entity schemas. Fine-tuning on PII improves precision on those specific types without degrading the zero-shot capability.
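For intuition, the bi-encoder idea can be reduced to a toy sketch: embed candidate spans and label descriptions into one shared vector space, then give each span the best-matching label above a threshold. Everything below (the hand-made vectors, the helper names) is illustrative; the real model uses learned transformer embeddings:

```python
import math

# Toy illustration of bi-encoder span labelling (not GLiNER2's actual code):
# spans and label descriptions live in one vector space, and a span takes
# the label whose embedding matches it best above a threshold.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def best_label(span_vec, label_vecs, threshold=0.5):
    """Return the best-scoring label for a span, or None if all fall below threshold."""
    best, best_score = None, threshold
    for label, vec in label_vecs.items():
        score = cosine(span_vec, vec)
        if score > best_score:
            best, best_score = label, score
    return best

# Hand-made vectors standing in for learned embeddings:
labels = {"PET": [1.0, 0.1], "ZOO": [0.1, 1.0]}
print(best_label([0.9, 0.2], labels))  # a "tabby cat"-like span matches PET
```

Because the label set is just a dictionary argument, swapping in a new schema requires no retraining, which is the property the fine-tuning preserves.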
 
+ **Example:** Context-dependent entity distinction
+
+ ```python
+ text = """Last weekend, I visited Riverside Farm & Wildlife Park with my family.
+ The kids were excited to see the tigers first—magnificent creatures pacing behind
+ the reinforced glass. My daughter Sarah kept comparing them to our tabby cat at home,
+ saying how similar their stripes looked, though obviously Mittens is much smaller and
+ sleeps on our couch rather than prowling through artificial jungle habitats."""
+
+ entities = detect_entities(model, text, entities={
+     "ZOO": "Animals in a zoo or wildlife park",
+     "PET": "Pet animals owned by someone",
+ })
+ ```
+
+ Output:
+
+ ```
+ Last weekend, I visited Riverside Farm & Wildlife Park with my family. The kids were
+ excited to see the [ZOO] first—magnificent creatures pacing behind the reinforced glass.
+ My daughter Sarah kept comparing them to our [PET] at home, saying how similar their
+ stripes looked, though obviously [PET] is much smaller and sleeps on our couch rather
+ than prowling through artificial jungle habitats.
+ ```
+
+ The model correctly distinguishes the tigers (zoo animals) from the tabby cat and even the cat's name Mittens (pets) based purely on contextual cues. No retraining required.
+
+ ### 2. Superior Performance on Standard PII
+
+ Fine-tuning GLiNER2 Large on 1,210 synthetic PII examples produced a model that outperforms AWS Comprehend on standard entity detection:
 
 | Model | Micro-Precision | Micro-Recall |
 | --- | --- | --- |
 
 | GLiNER2 Large (off-the-shelf) | 0.84 | 0.89 |
 | **NERPA (this model)** | **0.93** | **0.90** |
 
+ NERPA achieves **3% higher precision** than AWS Comprehend while maintaining comparable recall. The fine-tuning also enables fine-grained date disambiguation (DATE_OF_BIRTH vs DATE_TIME), which AWS Comprehend cannot do without custom model training.
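For reference, the micro-averaged metrics in the table pool true/false positives across all entity types before dividing, so frequent types weigh more than rare ones. A minimal sketch; the per-type counts here are invented for illustration, not benchmark data:

```python
# Micro-averaging: sum TP/FP/FN over every entity type, then divide once.
# These counts are made up for illustration only.
counts = {
    "PERSON_NAME": {"tp": 90, "fp": 5, "fn": 8},
    "DATE_OF_BIRTH": {"tp": 40, "fp": 6, "fn": 2},
}

tp = sum(c["tp"] for c in counts.values())  # 130
fp = sum(c["fp"] for c in counts.values())  # 11
fn = sum(c["fn"] for c in counts.values())  # 10

micro_precision = tp / (tp + fp)  # 130 / 141
micro_recall = tp / (tp + fn)     # 130 / 140
```

Macro-averaging would instead compute precision and recall per type and average the results, weighting every entity type equally.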
 
+ ### The Architecture Advantage
 
+ AWS Comprehend treats entity types as fixed classification targets. Adding a new entity type requires:
+
+ 1. Annotating thousands of examples
+ 2. Training a custom model
+ 3. Paying for model hosting
+ 4. Managing model versioning
 
+ NERPA's bi-encoder architecture makes entity types a runtime parameter. Adding new entities is a single line of code.
+
+ ## Pre-Optimised PII Entity Types
+
+ NERPA is fine-tuned on these entity types (but you can add more at inference time):
 
 | Entity | Description |
 | --- | --- |
 
 | `TECHNICAL_ID_NUMBERS` | IP/MAC addresses, serial numbers |
 | `VEHICLE_ID_NUMBERS` | License plates, VINs |
 
 ## Quick Start
 
 ### Install dependencies