akhatre committed · ac320eb · Parent(s): f4fe2f6
update readme

README.md CHANGED
@@ -33,7 +33,7 @@ pipeline_tag: token-classification

# NERPA – Fine-Tuned GLiNER2 for PII Anonymisation

- A fine-tuned [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at [Overmind](https://

## Why NERPA?
@@ -164,7 +164,7 @@ entities = detect_entities(model, text, entities={

The inference pipeline in `anonymise.py`:

- 1. **Chunking** – Long texts are split into 3000-character chunks with 100-character overlap to stay within the model's context window.
2. **Batch prediction** – Chunks are fed through `GLiNER2.batch_extract_entities()` with `include_spans=True` to get character-level offsets.
3. **Date disambiguation** – Both `DATE_TIME` and `DATE_OF_BIRTH` are always detected together so the model can choose the best label per span.
4. **De-duplication** – Overlapping detections from chunk boundaries are merged, keeping the highest-confidence label for each position.
@@ -172,7 +172,7 @@ The inference pipeline in `anonymise.py`:

## Notes

- - **Confidence threshold:** Default is `0.25`. The model tends to be conservative, so a lower threshold works well for high recall.
- **GLiNER2 version:** Requires `gliner2>=1.2.4`. Earlier versions had a bug where entity character offsets mapped to token positions instead of character positions; this is fixed in 1.2.4+.
- **Device:** Automatically uses CUDA > MPS > CPU.
@@ -203,6 +203,6 @@ If you use NERPA, please cite both this model and the original GLiNER2 paper:

}
```

- Built by [Akhat Rakishev](https://github.com/

- Overmind is infrastructure to make agents more reliable. Learn more at [
@@ -33,7 +33,7 @@ pipeline_tag: token-classification

# NERPA – Fine-Tuned GLiNER2 for PII Anonymisation

+ A fine-tuned [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at [Overmind](https://overmindlab.ai).

## Why NERPA?
@@ -164,7 +164,7 @@ entities = detect_entities(model, text, entities={

The inference pipeline in `anonymise.py`:

+ 1. **Chunking** – Long texts are split into 3000-character chunks with 100-character overlap to stay within the model's context window. The chunk size can be varied, since DeBERTa-v2 (the underlying encoder) uses relative position encoding; we found this size works as well as smaller ones.
2. **Batch prediction** – Chunks are fed through `GLiNER2.batch_extract_entities()` with `include_spans=True` to get character-level offsets.
3. **Date disambiguation** – Both `DATE_TIME` and `DATE_OF_BIRTH` are always detected together so the model can choose the best label per span.
4. **De-duplication** – Overlapping detections from chunk boundaries are merged, keeping the highest-confidence label for each position.
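The chunking and de-duplication steps above can be sketched roughly as follows. This is a minimal illustration with hypothetical helper names (`chunk_text`, `deduplicate`), not the actual `anonymise.py` code; the real model call via `GLiNER2.batch_extract_entities()` is omitted.

```python
# Sketch of steps 1 and 4 of the pipeline described above.
CHUNK_SIZE = 3000
OVERLAP = 100

def chunk_text(text):
    """Split text into CHUNK_SIZE windows with OVERLAP chars of overlap.

    Returns (offset, chunk) pairs so character-level spans detected in a
    chunk can be shifted back to positions in the full text."""
    step = CHUNK_SIZE - OVERLAP
    return [(i, text[i:i + CHUNK_SIZE]) for i in range(0, max(len(text), 1), step)]

def deduplicate(spans):
    """Merge overlapping detections from chunk boundaries, keeping the
    highest-confidence label for each overlapping region."""
    kept = []
    for span in sorted(spans, key=lambda s: -s["score"]):
        overlaps = any(
            s["start"] < span["end"] and span["start"] < s["end"] for s in kept
        )
        if not overlaps:
            kept.append(span)
    return sorted(kept, key=lambda s: s["start"])
```

Sorting by descending confidence before the overlap check means that whenever two chunks report conflicting labels for the same region, the higher-confidence detection wins.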
@@ -172,7 +172,7 @@ The inference pipeline in `anonymise.py`:

## Notes

+ - **Confidence threshold:** Default is `0.25`. The model can be conservative, so a lower threshold works well for high recall.
- **GLiNER2 version:** Requires `gliner2>=1.2.4`. Earlier versions had a bug where entity character offsets mapped to token positions instead of character positions; this is fixed in 1.2.4+.
- **Device:** Automatically uses CUDA > MPS > CPU.
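The CUDA > MPS > CPU fallback in the last note can be sketched as follows, assuming a PyTorch backend. This is a minimal illustration, not the actual implementation in `anonymise.py`.

```python
import torch

def pick_device() -> str:
    """Prefer CUDA, then Apple MPS, then CPU, as described above."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```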
@@ -203,6 +203,6 @@ If you use NERPA, please cite both this model and the original GLiNER2 paper:

}
```

+ Built by [Akhat Rakishev](https://github.com/akhatre) at [Overmind](https://overmindlab.ai).

+ Overmind is infrastructure to make agents more reliable. Learn more at [overmindlab.ai](https://overmindlab.ai).