---
language:
- en
license: apache-2.0
tags:
- token-classification
- ner
- biology
- entomology
- natural-history
- deberta
base_model:
- microsoft/deberta-v3-small
- microsoft/deberta-v3-base
- microsoft/deberta-v3-large
pipeline_tag: token-classification
---

# ento-label-deberta

DeBERTa-v3 models fine-tuned for named-entity recognition (NER) on insect
collection labels. Given a raw label string, the model extracts semantic
fields as verbatim character spans.

Three sizes are included in this repo: `small`, `base`, and `large`
(each in a subdirectory of the same name). ONNX exports are in `onnx/small`,
`onnx/base`, and `onnx/large`.

## Entity types

| Label | Description |
|---|---|
| `country` | Country name |
| `state` | State, province, or region |
| `verbatim_locality` | Locality description |
| `verbatim_date` | Collection date as written |
| `verbatim_elevation` | Elevation as written |
| `verbatim_collectors` | Collector name(s) |
| `verbatim_habitat` | Habitat description |
| `verbatim_method` | Collection method |
| `verbatim_latitude` | Latitude as written |
| `verbatim_longitude` | Longitude as written |

## Evaluation results (macro F1 per entity)

| Entity | small | base | large |
|---|---|---|---|
| country | 0.9695 | 0.9749 | 0.9751 |
| state | 0.9046 | 0.9220 | 0.9212 |
| verbatim_locality | 0.8282 | 0.8499 | 0.8573 |
| verbatim_date | 0.9673 | 0.9700 | 0.9693 |
| verbatim_elevation | 0.9722 | 0.9742 | 0.9739 |
| verbatim_collectors | 0.4867 | 0.5393 | 0.5311 |
| verbatim_habitat | 0.7485 | 0.7751 | 0.7930 |
| verbatim_method | 0.9123 | 0.9205 | 0.9080 |
| verbatim_latitude | 0.7154 | 0.7145 | 0.6512 |
| verbatim_longitude | 0.8552 | 0.8528 | 0.7969 |
| **macro avg** | **0.8360** | **0.8493** | **0.8377** |

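As a quick sanity check, the **macro avg** row can be reproduced as the unweighted mean of the ten per-entity scores. The sketch below hard-codes the values from the table; the names `SCORES`, `SIZES`, and `macro_average` are illustrative and not part of this repo.

```python
# Per-entity F1 copied from the table above, as (small, base, large).
SCORES = {
    "country": (0.9695, 0.9749, 0.9751),
    "state": (0.9046, 0.9220, 0.9212),
    "verbatim_locality": (0.8282, 0.8499, 0.8573),
    "verbatim_date": (0.9673, 0.9700, 0.9693),
    "verbatim_elevation": (0.9722, 0.9742, 0.9739),
    "verbatim_collectors": (0.4867, 0.5393, 0.5311),
    "verbatim_habitat": (0.7485, 0.7751, 0.7930),
    "verbatim_method": (0.9123, 0.9205, 0.9080),
    "verbatim_latitude": (0.7154, 0.7145, 0.6512),
    "verbatim_longitude": (0.8552, 0.8528, 0.7969),
}
SIZES = ("small", "base", "large")


def macro_average(scores):
    """Unweighted mean over entity types, one value per model size."""
    n = len(scores)
    return {
        size: sum(row[i] for row in scores.values()) / n
        for i, size in enumerate(SIZES)
    }


if __name__ == "__main__":
    for size, value in macro_average(SCORES).items():
        print(f"{size}: {value:.4f}")
```

Rounded to four decimals, the three means match the **macro avg** row. Because every entity type counts equally, the low `verbatim_collectors` scores pull the average down noticeably.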
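## Usage (PyTorch)

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Each size lives in a subdirectory of the repo, so it is selected with
# `subfolder`; a repo ID cannot carry a trailing "/base" path segment.
repo = "SpeciesFileGroup/ento-label-deberta"
size = "base"  # or "small" / "large"

tokenizer = AutoTokenizer.from_pretrained(repo, subfolder=size)
model = AutoModelForTokenClassification.from_pretrained(repo, subfolder=size)

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

results = ner("Sudan, Blue Nile: Abu Hashim, 23-24.XI.1962, coll. Linnavuori")
for r in results:
    print(r["entity_group"], repr(r["word"]))
# country 'Sudan'
# state 'Blue Nile'
# verbatim_locality 'Abu Hashim'
# verbatim_date '23-24.XI.1962'
# verbatim_collectors 'Linnavuori'
```
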
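Because the fields are meant to be verbatim character spans, it can be safer to slice the original string with the `start` and `end` offsets that the token-classification pipeline returns than to rely on the detokenized `word`. The following is a minimal sketch under that assumption: `spans_to_record` and `mock_entity` are hypothetical helpers, and `mock_results` is hand-built to mimic the pipeline output for the example label rather than produced by the model.

```python
from collections import defaultdict


def spans_to_record(text, results):
    """Group pipeline entities by type, slicing verbatim text by offsets."""
    record = defaultdict(list)
    for r in results:
        record[r["entity_group"]].append(text[r["start"]:r["end"]])
    return dict(record)


text = "Sudan, Blue Nile: Abu Hashim, 23-24.XI.1962, coll. Linnavuori"


def mock_entity(label, value):
    # Hand-built stand-in for one pipeline result (not real model output).
    start = text.index(value)
    return {"entity_group": label, "start": start, "end": start + len(value)}


mock_results = [
    mock_entity("country", "Sudan"),
    mock_entity("state", "Blue Nile"),
    mock_entity("verbatim_locality", "Abu Hashim"),
    mock_entity("verbatim_date", "23-24.XI.1962"),
    mock_entity("verbatim_collectors", "Linnavuori"),
]

record = spans_to_record(text, mock_results)
print(record["verbatim_date"])  # ['23-24.XI.1962']
```

Values are kept in lists because a label may contain more than one span of the same type, for example several collectors. With a real pipeline, pass `results` from the usage example in place of `mock_results`.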
## Usage (ONNX / hugot)

ONNX models are compatible with
[hugot](https://github.com/knights-analytics/hugot) and ONNX Runtime. Load
from `onnx/small`, `onnx/base`, or `onnx/large`.

## Training

Fine-tuned for 5 epochs with the Hugging Face `Trainer`. Hyperparameters:

| Parameter | small / base | large |
|---|---|---|
| Learning rate | 5e-6 | 2e-6 |
| Batch size | 16 | 16 |
| LR scheduler | linear | linear |
| Warmup ratio | 0.06 | 0.06 |
| Weight decay | 0.01 | 0.01 |
| Max seq length | 128 | 128 |

Training data: ~22 000 insect collection label strings with character-span
annotations for the 10 entity types above.

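To make the schedule concrete, the sketch below evaluates a linear-warmup, linear-decay learning-rate curve from the hyperparameters above. It assumes that all ~22 000 labels are used for training, on a single device, with no gradient accumulation. The card states none of this, so the step counts are illustrative only. `lr_at` is a hypothetical helper that mirrors the usual linear-with-warmup shape, not the exact training run.

```python
import math

# From the hyperparameter table (small / base column).
PEAK_LR = 5e-6
BATCH_SIZE = 16
EPOCHS = 5
WARMUP_RATIO = 0.06

# Assumption: every one of the ~22 000 examples is in the training split.
NUM_EXAMPLES = 22_000

steps_per_epoch = math.ceil(NUM_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step, peak_lr=PEAK_LR, warmup=warmup_steps, total=total_steps):
    """Learning rate at `step`: linear ramp to the peak, then linear decay to 0."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total - step) / (total - warmup))


if __name__ == "__main__":
    print(steps_per_epoch, total_steps, warmup_steps)
    for step in (0, warmup_steps, total_steps // 2, total_steps):
        print(step, lr_at(step))
```

Under these assumptions the run has 6875 optimizer steps, of which the first 413 are warmup. For the `large` model, pass `peak_lr=2e-6`.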