
	---
	language:
	- en
	license: apache-2.0
	tags:
	- token-classification
	- ner
	- biology
	- entomology
	- natural-history
	- deberta
	base_model:
	- microsoft/deberta-v3-small
	- microsoft/deberta-v3-base
	- microsoft/deberta-v3-large
	pipeline_tag: token-classification
	---

	# ento-label-deberta

	DeBERTa-v3 models fine-tuned for named-entity recognition (NER) on insect
	collection labels. Given a raw label string, the model extracts semantic
	fields as verbatim character spans.

	Three sizes are included in this repo: `small`, `base`, and `large`
	(subdirectories of the same name). ONNX exports are in `onnx/small`,
	`onnx/base`, and `onnx/large`.

	## Entity types

	| Label | Description |
	|---|---|
	| `country` | Country name |
	| `state` | State, province, or region |
	| `verbatim_locality` | Locality description |
	| `verbatim_date` | Collection date as written |
	| `verbatim_elevation` | Elevation as written |
	| `verbatim_collectors` | Collector name(s) |
	| `verbatim_habitat` | Habitat description |
	| `verbatim_method` | Collection method |
	| `verbatim_latitude` | Latitude as written |
	| `verbatim_longitude` | Longitude as written |

	## Evaluation results (macro F1 per entity)

	| Entity | small | base | large |
	|---|---|---|---|
	| country | 0.9695 | 0.9749 | 0.9751 |
	| state | 0.9046 | 0.9220 | 0.9212 |
	| verbatim_locality | 0.8282 | 0.8499 | 0.8573 |
	| verbatim_date | 0.9673 | 0.9700 | 0.9693 |
	| verbatim_elevation | 0.9722 | 0.9742 | 0.9739 |
	| verbatim_collectors | 0.4867 | 0.5393 | 0.5311 |
	| verbatim_habitat | 0.7485 | 0.7751 | 0.7930 |
	| verbatim_method | 0.9123 | 0.9205 | 0.9080 |
	| verbatim_latitude | 0.7154 | 0.7145 | 0.6512 |
	| verbatim_longitude | 0.8552 | 0.8528 | 0.7969 |
	| macro avg | 0.8360 | 0.8493 | 0.8377 |
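
The `macro avg` row is consistent with an unweighted mean of the ten per-entity scores in each column. A short check in plain Python, using the values copied from the table (the variable and function names are illustrative):

```python
# Per-entity F1 scores copied from the table: (small, base, large).
scores = {
    "country": (0.9695, 0.9749, 0.9751),
    "state": (0.9046, 0.9220, 0.9212),
    "verbatim_locality": (0.8282, 0.8499, 0.8573),
    "verbatim_date": (0.9673, 0.9700, 0.9693),
    "verbatim_elevation": (0.9722, 0.9742, 0.9739),
    "verbatim_collectors": (0.4867, 0.5393, 0.5311),
    "verbatim_habitat": (0.7485, 0.7751, 0.7930),
    "verbatim_method": (0.9123, 0.9205, 0.9080),
    "verbatim_latitude": (0.7154, 0.7145, 0.6512),
    "verbatim_longitude": (0.8552, 0.8528, 0.7969),
}


def macro_average(column: int) -> float:
    """Unweighted mean of the per-entity F1 scores in one column."""
    values = [row[column] for row in scores.values()]
    return sum(values) / len(values)


for name, column in (("small", 0), ("base", 1), ("large", 2)):
    print(f"{name}: {macro_average(column):.4f}")
# small: 0.8360
# base: 0.8493
# large: 0.8377
```

Because every entity type carries equal weight, the low `verbatim_collectors` scores pull each average down noticeably.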

	## Usage (PyTorch)

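	```python
	from transformers import (
	    AutoModelForTokenClassification,
	    AutoTokenizer,
	    pipeline,
	)

	# Each size lives in a subdirectory of the same repo. A Hub repo id has the
	# form "namespace/name", so select the size with `subfolder` instead of
	# appending it to the repo id.
	repo, size = "SpeciesFileGroup/ento-label-deberta", "base"

	ner = pipeline(
	    "token-classification",
	    model=AutoModelForTokenClassification.from_pretrained(repo, subfolder=size),
	    tokenizer=AutoTokenizer.from_pretrained(repo, subfolder=size),
	    aggregation_strategy="simple",  # merge sub-word tokens into entity groups
	)

	results = ner("Sudan, Blue Nile: Abu Hashim, 23-24.XI.1962, coll. Linnavuori")
	for r in results:
	    print(r["entity_group"], repr(r["word"]))
	# country 'Sudan'
	# state 'Blue Nile'
	# verbatim_locality 'Abu Hashim'
	# verbatim_date '23-24.XI.1962'
	# verbatim_collectors 'Linnavuori'
	```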
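
Each pipeline result also carries `start` and `end` character offsets, which can be used to recover the verbatim span from the raw label instead of relying on `word` (which is rebuilt from tokens). The following is a minimal sketch of grouping results into one record per label; `to_record` and `span` are hypothetical helpers, and the mock results stand in for real pipeline output so the snippet runs without downloading a model.

```python
from collections import defaultdict


def to_record(text: str, results: list[dict]) -> dict[str, list[str]]:
    """Group pipeline results by entity type, slicing spans from the raw text."""
    record: dict[str, list[str]] = defaultdict(list)
    for r in results:
        # Slice by character offsets; strip() guards against offsets that
        # may include a leading space with SentencePiece tokenizers.
        record[r["entity_group"]].append(text[r["start"]:r["end"]].strip())
    return dict(record)


text = "Sudan, Blue Nile: Abu Hashim, 23-24.XI.1962, coll. Linnavuori"


def span(label: str, value: str) -> dict:
    """Hypothetical stand-in for one pipeline result (illustration only)."""
    start = text.index(value)
    return {"entity_group": label, "start": start, "end": start + len(value)}


mock_results = [
    span("country", "Sudan"),
    span("state", "Blue Nile"),
    span("verbatim_locality", "Abu Hashim"),
    span("verbatim_date", "23-24.XI.1962"),
    span("verbatim_collectors", "Linnavuori"),
]

record = to_record(text, mock_results)
print(record["verbatim_date"])  # ['23-24.XI.1962']
```

Values are kept as lists because a label may contain several spans of the same type, for example two collectors.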

	## Usage (ONNX / hugot)

	ONNX models are compatible with
	[hugot](https://github.com/knights-analytics/hugot) and ONNX Runtime. Load
	from `onnx/small`, `onnx/base`, or `onnx/large`.
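
A rough sketch of running one of the ONNX exports with ONNX Runtime from Python follows. It rests on several assumptions that this card does not confirm: that the export file is named `model.onnx`, that the tokenizer files and `config.json` (with `id2label`) sit next to it in the same subdirectory, and that the first model output holds the token logits. `onnx_subdir` and `tag_label` are hypothetical helper names.

```python
SIZES = ("small", "base", "large")


def onnx_subdir(size: str) -> str:
    """Return the repo subdirectory holding the ONNX export for a model size."""
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}, got {size!r}")
    return f"onnx/{size}"


def tag_label(text: str, size: str = "base", model_file: str = "model.onnx"):
    """Run one label through ONNX Runtime and return (token, label) pairs."""
    # Imported lazily so onnx_subdir() works without these packages installed.
    from pathlib import Path

    import numpy as np
    import onnxruntime as ort
    from huggingface_hub import snapshot_download
    from transformers import AutoConfig, AutoTokenizer

    repo = "SpeciesFileGroup/ento-label-deberta"
    subdir = onnx_subdir(size)
    # Download only the files of the chosen export.
    root = Path(snapshot_download(repo_id=repo, allow_patterns=[f"{subdir}/*"]))
    model_dir = root / subdir

    # Assumption: tokenizer files and config.json live beside the ONNX file.
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    config = AutoConfig.from_pretrained(model_dir)
    session = ort.InferenceSession(
        str(model_dir / model_file), providers=["CPUExecutionProvider"]
    )

    enc = tokenizer(text, return_tensors="np")
    # Feed only the inputs the exported graph actually declares.
    wanted = {i.name for i in session.get_inputs()}
    feeds = {k: v.astype(np.int64) for k, v in enc.items() if k in wanted}

    # Assumption: first output is logits of shape (1, seq_len, num_labels).
    logits = session.run(None, feeds)[0]
    label_ids = logits[0].argmax(axis=-1)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return [(tok, config.id2label[int(i)]) for tok, i in zip(tokens, label_ids)]
```

Unlike the `pipeline` call in the PyTorch section, this returns raw per-token labels; merging sub-word tokens into entity spans is left to the caller.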

	## Training

	Fine-tuned for 5 epochs with the Hugging Face `Trainer`. Hyperparameters:

	| Parameter | small / base | large |
	|---|---|---|
	| Learning rate | 5e-6 | 2e-6 |
	| Batch size | 16 | 16 |
	| LR scheduler | linear | linear |
	| Warmup ratio | 0.06 | 0.06 |
	| Weight decay | 0.01 | 0.01 |
	| Max seq length | 128 | 128 |
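
As an illustration, the table and the 5-epoch setting can be written as keyword arguments for `transformers.TrainingArguments`. This is a sketch, not the original training script: `training_kwargs` is a hypothetical helper, the field names follow `TrainingArguments` as of transformers 4.x, and mapping "Batch size" to `per_device_train_batch_size` is an assumption, since the card does not say whether 16 is per device or total.

```python
def training_kwargs(size: str) -> dict:
    """Keyword arguments for transformers.TrainingArguments, per model size."""
    if size not in ("small", "base", "large"):
        raise ValueError(f"unknown model size: {size!r}")
    return {
        "num_train_epochs": 5,
        # The large model uses a lower learning rate than small and base.
        "learning_rate": 2e-6 if size == "large" else 5e-6,
        # Assumption: 16 is the per-device batch size.
        "per_device_train_batch_size": 16,
        "lr_scheduler_type": "linear",
        "warmup_ratio": 0.06,
        "weight_decay": 0.01,
    }


# The sequence limit is applied at tokenization time rather than in
# TrainingArguments, e.g. tokenizer(..., truncation=True, max_length=128).
MAX_SEQ_LENGTH = 128

print(training_kwargs("large")["learning_rate"])  # 2e-06
```

With transformers installed, these would be passed as `TrainingArguments(output_dir="out", **training_kwargs("base"))`, where `output_dir="out"` is a placeholder.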

	Training data: ~22 000 insect collection label strings with character-span
	annotations for the 10 entity types above.
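
Token classification needs one label per token, so character-span annotations have to be projected onto token offsets before training. The card does not describe how this was done; the sketch below shows one common approach, assuming a BIO tagging scheme and token offsets such as those returned by a fast tokenizer with `return_offsets_mapping=True`. `spans_to_bio` is a hypothetical helper, and the offsets in the example are written by hand.

```python
def spans_to_bio(offsets, spans):
    """Assign a BIO label to each token from character-span annotations.

    offsets: list of (start, end) character offsets, one per token
    spans:   list of (start, end, entity_type) annotations
    """
    labels = []
    previous = None  # index of the span the previous token belonged to
    for tok_start, tok_end in offsets:
        current = None
        if tok_end > tok_start:  # special tokens have empty (0, 0) offsets
            for i, (s, e, _) in enumerate(spans):
                if tok_start < e and tok_end > s:  # token overlaps this span
                    current = i
                    break
        if current is None:
            labels.append("O")
        else:
            # First token of a span gets B-, continuation tokens get I-.
            prefix = "I" if current == previous else "B"
            labels.append(f"{prefix}-{spans[current][2]}")
        previous = current
    return labels


# "Sudan, Blue Nile" with hand-written offsets for
# [CLS] Sudan , Blue Nile [SEP]
offsets = [(0, 0), (0, 5), (5, 6), (7, 11), (12, 16), (0, 0)]
spans = [(0, 5, "country"), (7, 16, "state")]
print(spans_to_bio(offsets, spans))
# ['O', 'B-country', 'O', 'B-state', 'I-state', 'O']
```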