---
language:
- en
license: apache-2.0
tags:
- token-classification
- ner
- biology
- entomology
- natural-history
- deberta
base_model:
- microsoft/deberta-v3-small
- microsoft/deberta-v3-base
- microsoft/deberta-v3-large
pipeline_tag: token-classification
---

# ento-label-deberta

DeBERTa-v3 models fine-tuned for named-entity recognition (NER) on insect
collection labels. Given a raw label string, the model extracts semantic fields
as verbatim character spans.

Three sizes are included in this repo: `small`, `base`, and `large`
(subdirectories of the same name). ONNX exports are in `onnx/small`,
`onnx/base`, and `onnx/large`.

## Entity types

| Label | Description |
|---|---|
| `country` | Country name |
| `state` | State, province, or region |
| `verbatim_locality` | Locality description |
| `verbatim_date` | Collection date as written |
| `verbatim_elevation` | Elevation as written |
| `verbatim_collectors` | Collector name(s) |
| `verbatim_habitat` | Habitat description |
| `verbatim_method` | Collection method |
| `verbatim_latitude` | Latitude as written |
| `verbatim_longitude` | Longitude as written |

## Evaluation results (macro F1 per entity)

| Entity | small | base | large |
|---|---|---|---|
| country | 0.9695 | 0.9749 | 0.9751 |
| state | 0.9046 | 0.9220 | 0.9212 |
| verbatim_locality | 0.8282 | 0.8499 | 0.8573 |
| verbatim_date | 0.9673 | 0.9700 | 0.9693 |
| verbatim_elevation | 0.9722 | 0.9742 | 0.9739 |
| verbatim_collectors | 0.4867 | 0.5393 | 0.5311 |
| verbatim_habitat | 0.7485 | 0.7751 | 0.7930 |
| verbatim_method | 0.9123 | 0.9205 | 0.9080 |
| verbatim_latitude | 0.7154 | 0.7145 | 0.6512 |
| verbatim_longitude | 0.8552 | 0.8528 | 0.7969 |
| **macro avg** | **0.8360** | **0.8493** | **0.8377** |
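
The **macro avg** row is consistent with an unweighted mean of the ten per-entity scores. A minimal sketch that recomputes it from the values in the table (the variable names are illustrative only):

```python
# Per-entity F1 copied from the table above, in row order
# (country ... verbatim_longitude).
f1 = {
    "small": [0.9695, 0.9046, 0.8282, 0.9673, 0.9722,
              0.4867, 0.7485, 0.9123, 0.7154, 0.8552],
    "base":  [0.9749, 0.9220, 0.8499, 0.9700, 0.9742,
              0.5393, 0.7751, 0.9205, 0.7145, 0.8528],
    "large": [0.9751, 0.9212, 0.8573, 0.9693, 0.9739,
              0.5311, 0.7930, 0.9080, 0.6512, 0.7969],
}

# Macro average: every entity type counts equally, regardless of how
# often it occurs in the evaluation data.
macro = {size: sum(scores) / len(scores) for size, scores in f1.items()}

for size, value in macro.items():
    print(f"{size}: {value:.4f}")
```

Because each entity type carries equal weight, the low `verbatim_collectors` scores pull the average down noticeably for all three sizes.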

## Usage (PyTorch)

```python
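from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# The checkpoints live in subdirectories of one repo, so select the size with
# `subfolder` rather than appending it to the repo id.
repo = "SpeciesFileGroup/ento-label-deberta"
size = "base"  # or "small" / "large"

ner = pipeline(
    "token-classification",
    model=AutoModelForTokenClassification.from_pretrained(repo, subfolder=size),
    tokenizer=AutoTokenizer.from_pretrained(repo, subfolder=size),
    aggregation_strategy="simple",
)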

results = ner("Sudan, Blue Nile: Abu Hashim, 23-24.XI.1962, coll. Linnavuori")
for r in results:
    print(r["entity_group"], repr(r["word"]))
# country      'Sudan'
# state        'Blue Nile'
# verbatim_locality  'Abu Hashim'
# verbatim_date      '23-24.XI.1962'
# verbatim_collectors 'Linnavuori'
```
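
With an aggregation strategy set, each pipeline result also carries `start` and `end` character offsets. Since the `word` field is rebuilt from tokens and may not always match the source text exactly, slicing the original string by offset is one way to recover verbatim spans. The sketch below is illustrative only: `spans_to_fields` and `mock_result` are hypothetical helpers, and the results are hand-written stand-ins rather than real model output.

```python
from collections import defaultdict


def spans_to_fields(text, results):
    """Group aggregated NER results by entity type, slicing `text` verbatim."""
    fields = defaultdict(list)
    # Sort by start offset so repeated fields keep their reading order.
    for r in sorted(results, key=lambda r: r["start"]):
        fields[r["entity_group"]].append(text[r["start"]:r["end"]])
    return dict(fields)


label = "Sudan, Blue Nile: Abu Hashim, 23-24.XI.1962, coll. Linnavuori"


def mock_result(group, value):
    """Build a stand-in result dict (hypothetical; not produced by the model)."""
    start = label.index(value)
    return {"entity_group": group, "start": start, "end": start + len(value)}


mock_results = [
    mock_result("verbatim_collectors", "Linnavuori"),
    mock_result("country", "Sudan"),
    mock_result("state", "Blue Nile"),
    mock_result("verbatim_locality", "Abu Hashim"),
    mock_result("verbatim_date", "23-24.XI.1962"),
]

fields = spans_to_fields(label, mock_results)
print(fields)
```

In real use, pass the list returned by `ner(label)` in place of `mock_results`. Each field maps to a list because a label can contain more than one span of the same type.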

## Usage (ONNX / hugot)

ONNX models are compatible with
[hugot](https://github.com/knights-analytics/hugot) and ONNX Runtime. Load
from `onnx/small`, `onnx/base`, or `onnx/large`.

## Training

Fine-tuned for 5 epochs with the Hugging Face `Trainer`, using these hyperparameters:

| Parameter | small / base | large |
|---|---|---|
| Learning rate | 5e-6 | 2e-6 |
| Batch size | 16 | 16 |
| LR scheduler | linear | linear |
| Warmup ratio | 0.06 | 0.06 |
| Weight decay | 0.01 | 0.01 |
| Max seq length | 128 | 128 |

Training data: ~22 000 insect collection label strings with character-span
annotations for the 10 entity types above.
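
To make the schedule concrete, the sketch below works out approximate step counts and the learning rate at a given step for a linear schedule with warmup. It is not the training script. It assumes that all ~22 000 labels are used for training, that training runs on a single device without gradient accumulation, and that warmup steps are rounded up; none of these are stated above.

```python
import math

NUM_EXAMPLES = 22_000  # assumption: the full ~22 000 labels are the training split
BATCH_SIZE = 16
EPOCHS = 5
WARMUP_RATIO = 0.06
PEAK_LR = {"small / base": 5e-6, "large": 2e-6}

steps_per_epoch = math.ceil(NUM_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)  # assumption: rounded up


def linear_lr(step, peak_lr):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


print(steps_per_epoch, total_steps, warmup_steps)
for name, peak in PEAK_LR.items():
    print(name, linear_lr(total_steps // 2, peak))
```

Under these assumptions the run is a few thousand optimizer steps in total, with the learning rate peaking after the first 6% of them.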