---
language:
- en
license: apache-2.0
tags:
- token-classification
- ner
- biology
- entomology
- natural-history
- deberta
base_model:
- microsoft/deberta-v3-small
- microsoft/deberta-v3-base
- microsoft/deberta-v3-large
pipeline_tag: token-classification
---

# ento-label-deberta

DeBERTa-v3 models fine-tuned for NER on insect collection labels. Given a raw label string, the model extracts semantic fields as verbatim character spans.

Three sizes are included in this repo: `small`, `base`, and `large` (subdirectories of the same name). ONNX exports are in `onnx/small`, `onnx/base`, and `onnx/large`.

## Entity types

| Label | Description |
|---|---|
| `country` | Country name |
| `state` | State, province, or region |
| `verbatim_locality` | Locality description |
| `verbatim_date` | Collection date as written |
| `verbatim_elevation` | Elevation as written |
| `verbatim_collectors` | Collector name(s) |
| `verbatim_habitat` | Habitat description |
| `verbatim_method` | Collection method |
| `verbatim_latitude` | Latitude as written |
| `verbatim_longitude` | Longitude as written |

## Evaluation results (macro F1 per entity)

| Entity | small | base | large |
|---|---|---|---|
| country | 0.9695 | 0.9749 | 0.9751 |
| state | 0.9046 | 0.9220 | 0.9212 |
| verbatim_locality | 0.8282 | 0.8499 | 0.8573 |
| verbatim_date | 0.9673 | 0.9700 | 0.9693 |
| verbatim_elevation | 0.9722 | 0.9742 | 0.9739 |
| verbatim_collectors | 0.4867 | 0.5393 | 0.5311 |
| verbatim_habitat | 0.7485 | 0.7751 | 0.7930 |
| verbatim_method | 0.9123 | 0.9205 | 0.9080 |
| verbatim_latitude | 0.7154 | 0.7145 | 0.6512 |
| verbatim_longitude | 0.8552 | 0.8528 | 0.7969 |
| **macro avg** | **0.8360** | **0.8493** | **0.8377** |

## Usage (PyTorch)

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    pipeline,
)

# A Hub repo ID has the form "namespace/name", so the model size is
# selected with `subfolder` rather than appended to the repo ID.
repo = "SpeciesFileGroup/ento-label-deberta"
size = "base"  # or "small" / "large"

tokenizer = AutoTokenizer.from_pretrained(repo, subfolder=size)
model = AutoModelForTokenClassification.from_pretrained(repo, subfolder=size)

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

results = ner("Sudan, Blue Nile: Abu Hashim, 23-24.XI.1962, coll. Linnavuori")
for r in results:
    print(r["entity_group"], repr(r["word"]))
# country 'Sudan'
# state 'Blue Nile'
# verbatim_locality 'Abu Hashim'
# verbatim_date '23-24.XI.1962'
# verbatim_collectors 'Linnavuori'
```

## Usage (ONNX / hugot)

ONNX models are compatible with [hugot](https://github.com/knights-analytics/hugot) and ONNX Runtime. Load from `onnx/small`, `onnx/base`, or `onnx/large`.

## Training

Fine-tuned for 5 epochs with the HuggingFace `Trainer`. Hyperparameters:

| Parameter | small / base | large |
|---|---|---|
| Learning rate | 5e-6 | 2e-6 |
| Batch size | 16 | 16 |
| LR scheduler | linear | linear |
| Warmup ratio | 0.06 | 0.06 |
| Weight decay | 0.01 | 0.01 |
| Max seq length | 128 | 128 |

Training data: ~22 000 insect collection label strings with character-span annotations for the 10 entity types above.
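
To make the idea of character-span annotations concrete, here is a minimal sketch in plain Python. The `(start, end, label)` tuple format and the helper names `find_span` and `spans_to_fields` are assumptions made for illustration; the actual training data may be stored differently. The label string is the one from the usage section.

```python
from collections import defaultdict

# Example label string taken from the usage section.
LABEL_TEXT = "Sudan, Blue Nile: Abu Hashim, 23-24.XI.1962, coll. Linnavuori"


def find_span(text, value, label):
    """Build a hypothetical (start, end, label) annotation for the first match of `value`."""
    start = text.index(value)  # raises ValueError if `value` is not verbatim in `text`
    return (start, start + len(value), label)


# Hypothetical annotations (assumed format, not the real dataset schema).
SPANS = [
    find_span(LABEL_TEXT, "Sudan", "country"),
    find_span(LABEL_TEXT, "Blue Nile", "state"),
    find_span(LABEL_TEXT, "Abu Hashim", "verbatim_locality"),
    find_span(LABEL_TEXT, "23-24.XI.1962", "verbatim_date"),
    find_span(LABEL_TEXT, "Linnavuori", "verbatim_collectors"),
]


def spans_to_fields(text, spans):
    """Group the verbatim substrings by label, in order of appearance."""
    fields = defaultdict(list)
    for start, end, label in sorted(spans):
        fields[label].append(text[start:end])
    return dict(fields)


fields = spans_to_fields(LABEL_TEXT, SPANS)
for label, values in fields.items():
    print(label, values)
```

Because every span is a pair of offsets into the raw string, slicing always returns the text exactly as written on the label. The same slicing can be applied to the `start` and `end` offsets that the `transformers` token-classification pipeline returns alongside each `entity_group`.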