akhatre committed · ac320eb · Parent(s): f4fe2f6
update readme

README.md CHANGED
@@ -33,7 +33,7 @@ pipeline_tag: token-classification

# NERPA – Fine-Tuned GLiNER2 for PII Anonymisation

- A fine-tuned [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at [Overmind](https://

## Why NERPA?
@@ -164,7 +164,7 @@ entities = detect_entities(model, text, entities={

The inference pipeline in `anonymise.py`:

- 1. **Chunking** – Long texts are split into 3000-character chunks with 100-character overlap to stay within the model's context window.
2. **Batch prediction** – Chunks are fed through `GLiNER2.batch_extract_entities()` with `include_spans=True` to get character-level offsets.
3. **Date disambiguation** – Both `DATE_TIME` and `DATE_OF_BIRTH` are always detected together so the model can choose the best label per span.
4. **De-duplication** – Overlapping detections from chunk boundaries are merged, keeping the highest-confidence label for each position.
@@ -172,7 +172,7 @@ The inference pipeline in `anonymise.py`:

## Notes

- - **Confidence threshold:** Default is `0.25`. The model tends to be conservative, so a lower threshold works well for high recall.
- **GLiNER2 version:** Requires `gliner2>=1.2.4`. Earlier versions had a bug where entity character offsets mapped to token positions instead of character positions; this is fixed in 1.2.4+.
- **Device:** Automatically uses CUDA > MPS > CPU.
@@ -203,6 +203,6 @@ If you use NERPA, please cite both this model and the original GLiNER2 paper:

}
```

- Built by [Akhat Rakishev](https://github.com/

- Overmind is infrastructure to make agents more reliable. Learn more at [
@@ -33,7 +33,7 @@ pipeline_tag: token-classification

# NERPA – Fine-Tuned GLiNER2 for PII Anonymisation

+ A fine-tuned [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at [Overmind](https://overmindlab.ai).

## Why NERPA?
@@ -164,7 +164,7 @@ entities = detect_entities(model, text, entities={

The inference pipeline in `anonymise.py`:

+ 1. **Chunking** – Long texts are split into 3000-character chunks with 100-character overlap to stay within the model's context window. The chunk size can be varied, since DeBERTa-v2 (the underlying encoder) uses relative position encoding; we found this size works as well as smaller ones.
2. **Batch prediction** – Chunks are fed through `GLiNER2.batch_extract_entities()` with `include_spans=True` to get character-level offsets.
3. **Date disambiguation** – Both `DATE_TIME` and `DATE_OF_BIRTH` are always detected together so the model can choose the best label per span.
4. **De-duplication** – Overlapping detections from chunk boundaries are merged, keeping the highest-confidence label for each position.
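The chunking and de-duplication steps above can be sketched roughly as follows. This is a minimal illustration with hypothetical helper names (`chunk_text`, `deduplicate`), not the actual `anonymise.py` code; the real model call via `GLiNER2.batch_extract_entities()` is omitted.

```python
# Sketch of steps 1 and 4 of the pipeline described above.
CHUNK_SIZE = 3000
OVERLAP = 100

def chunk_text(text):
    """Split text into CHUNK_SIZE windows with OVERLAP chars of overlap.

    Returns (offset, chunk) pairs so character-level spans detected in a
    chunk can be shifted back to positions in the full text."""
    step = CHUNK_SIZE - OVERLAP
    return [(i, text[i:i + CHUNK_SIZE]) for i in range(0, max(len(text), 1), step)]

def deduplicate(spans):
    """Merge overlapping detections from chunk boundaries, keeping the
    highest-confidence label for each overlapping region."""
    kept = []
    for span in sorted(spans, key=lambda s: -s["score"]):
        overlaps = any(
            s["start"] < span["end"] and span["start"] < s["end"] for s in kept
        )
        if not overlaps:
            kept.append(span)
    return sorted(kept, key=lambda s: s["start"])
```

Sorting by descending confidence before the overlap check means that whenever two chunks report conflicting labels for the same region, the higher-confidence detection wins.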
@@ -172,7 +172,7 @@ The inference pipeline in `anonymise.py`:

## Notes

+ - **Confidence threshold:** Default is `0.25`. The model can be conservative, so a lower threshold works well for high recall.
- **GLiNER2 version:** Requires `gliner2>=1.2.4`. Earlier versions had a bug where entity character offsets mapped to token positions instead of character positions; this is fixed in 1.2.4+.
- **Device:** Automatically uses CUDA > MPS > CPU.
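The CUDA > MPS > CPU fallback in the last note can be sketched as follows, assuming a PyTorch backend. This is a minimal illustration, not the actual implementation in `anonymise.py`.

```python
import torch

def pick_device() -> str:
    """Prefer CUDA, then Apple MPS, then CPU, as described above."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```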
@@ -203,6 +203,6 @@ If you use NERPA, please cite both this model and the original GLiNER2 paper:

}
```

+ Built by [Akhat Rakishev](https://github.com/akhatre) at [Overmind](https://overmindlab.ai).

+ Overmind is infrastructure to make agents more reliable. Learn more at [overmindlab.ai](https://overmindlab.ai).