akhatre committed
Commit ac320eb · 1 Parent(s): f4fe2f6

update readme

Files changed (1): README.md (+5 -5)
README.md CHANGED
@@ -33,7 +33,7 @@ pipeline_tag: token-classification
 
 # NERPA — Fine-Tuned GLiNER2 for PII Anonymisation
 
-A fine-tuned [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at [Overmind](https://overmindai.com).
+A fine-tuned [GLiNER2 Large](https://huggingface.co/fastino/gliner2-large-v1) (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at [Overmind](https://overmindlab.ai).
 
 ## Why NERPA?
 
@@ -164,7 +164,7 @@ entities = detect_entities(model, text, entities={
 
 The inference pipeline in `anonymise.py`:
 
-1. **Chunking** — Long texts are split into 3000-character chunks with 100-char overlap to stay within the model's context window.
+1. **Chunking** — Long texts are split into 3000-character chunks with 100-char overlap to stay within the model's context window. The chunk size can be varied, since the underlying DeBERTa-v2 encoder uses relative position encoding; we found this size performs as well as smaller ones.
 2. **Batch prediction** — Chunks are fed through `GLiNER2.batch_extract_entities()` with `include_spans=True` to get character-level offsets.
 3. **Date disambiguation** — Both `DATE_TIME` and `DATE_OF_BIRTH` are always detected together so the model can choose the best label per span.
 4. **De-duplication** — Overlapping detections from chunk boundaries are merged, keeping the highest-confidence label for each position.
@@ -172,7 +172,7 @@ The inference pipeline in `anonymise.py`:
 
 ## Notes
 
-- **Confidence threshold:** Default is `0.25`. The model tends to be conservative, so a lower threshold works well for high recall.
+- **Confidence threshold:** Default is `0.25`. The model can be conservative, so a lower threshold works well for high recall.
 - **GLiNER2 version:** Requires `gliner2>=1.2.4`. Earlier versions had a bug where entity character offsets mapped to token positions instead of character positions; this is fixed in 1.2.4+.
 - **Device:** Automatically uses CUDA > MPS > CPU.
 
@@ -203,6 +203,6 @@ If you use NERPA, please cite both this model and the original GLiNER2 paper:
 }
 ```
 
-Built by [Akhat Rakishev](https://github.com/workhat) at [Overmind](https://overmindai.com).
+Built by [Akhat Rakishev](https://github.com/akhatre) at [Overmind](https://overmindlab.ai).
 
-Overmind is infrastructure to make agents more reliable. Learn more at [overmindai.com](https://overmindai.com).
+Overmind is infrastructure to make agents more reliable. Learn more at [overmindlab.ai](https://overmindlab.ai).
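The chunking step described in the README (3000-character chunks, 100-character overlap) can be sketched as below. `chunk_text` is a hypothetical helper for illustration, not the actual code in `anonymise.py`; only the chunk size and overlap come from the README.

```python
def chunk_text(text: str, size: int = 3000, overlap: int = 100):
    """Split text into overlapping chunks, recording each chunk's start offset.

    The start offset lets chunk-local entity positions be mapped back to
    positions in the original text.
    """
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + size]))
        if start + size >= len(text):
            break
        # Step back by the overlap so entities straddling a boundary
        # appear whole in at least one chunk.
        start += size - overlap
    return chunks
```

With the default parameters a 7000-character input yields chunks starting at 0, 2900, and 5800, each sharing 100 characters with its neighbour.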
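The de-duplication step merges overlapping detections from chunk boundaries, keeping the highest-confidence label per position. One way to sketch it, using the hypothetical `(start, end, label, score)` span tuples from the mapping step above (not the actual `anonymise.py` logic):

```python
def dedupe_spans(spans):
    """Keep the highest-confidence span in each group of overlapping spans.

    Each span is a (start, end, label, score) tuple with offsets in the
    original (unchunked) text.
    """
    kept = []
    # Visit highest-score spans first, so the best detection in each
    # overlapping group is kept and the rest are dropped.
    for span in sorted(spans, key=lambda s: s[3], reverse=True):
        start, end = span[0], span[1]
        if all(end <= k[0] or start >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s[0])
```

A greedy highest-score-first pass like this is a common choice for span de-duplication; the README does not specify the exact merge strategy used.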