---
language:
- en
thumbnail: >-
  https://www.fusion.uni-jena.de/fusionmedia/fusionpictures/fusion-service/fusion-transp.png?height=383&width=680
tags:
- bert-base-cased
- biodiversity
- token-classification
- sequence-classification
license: apache-2.0
citation: "Abdelmageed, N., Löffler, F., & König-Ries, B. (2023). BiodivBERT: a Pre-Trained Language Model for the Biodiversity Domain."
paper: https://ceur-ws.org/Vol-3415/paper-7.pdf
metrics:
- f1
- precision
- recall
- accuracy
evaluation datasets:
- url: https://doi.org/10.5281/zenodo.6554208
- named entity recognition:
  - COPIOUS
  - QEMP
  - BiodivNER
  - LINNAEUS
  - Species800
- relation extraction:
  - GAD
  - EU-ADR
  - BiodivRE
  - BioRelEx
training_data:
- crawling-keywords:
  - biodivers
  - genetic diversity
  - omic diversity
  - phylogenetic diversity
  - soil diversity
  - population diversity
  - species diversity
  - ecosystem diversity
  - functional diversity
  - microbial diversity
- corpora:
  - (+Abs) Springer and Elsevier abstracts, 1990-2020
  - (+Abs+Full) Springer and Elsevier abstracts and open-access full publication text, 1990-2020
pre-training-hyperparams:
- MAX_LEN = 512
- MLM_PROP = 0.15
- num_train_epochs = 3
- per_device_train_batch_size = 16
- per_device_eval_batch_size = 16
- gradient_accumulation_steps = 4
---

# BiodivBERT

## Model description

* BiodivBERT is a domain-specific, cased BERT-based model for biodiversity literature.
* It uses the tokenizer from the BERT base cased model.
* BiodivBERT is pre-trained on abstracts and full text from biodiversity literature.
* BiodivBERT is fine-tuned on two downstream tasks: Named Entity Recognition and Relation Extraction in the biodiversity domain.
* Please visit our [GitHub Repo](https://github.com/fusion-jena/BiodivBERT) for more details.

## How to use

* You can use BiodivBERT via the Hugging Face `transformers` library as follows:

1. Masked Language Model

````python
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM

>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")

>>> model = AutoModelForMaskedLM.from_pretrained("NoYo25/BiodivBERT")
````
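
You can then query the masked-language-model head, e.g. via the `fill-mask` pipeline. A minimal sketch; the example sentence is ours, not from the paper:

````python
>>> from transformers import pipeline

>>> fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
>>> # cased BERT models use [MASK] as the mask token
>>> fill_mask("Climate change is a major driver of [MASK] loss.")
````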

2. Token Classification - Named Entity Recognition

````python
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification

>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")

>>> model = AutoModelForTokenClassification.from_pretrained("NoYo25/BiodivBERT")
````
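
Assuming the loaded checkpoint carries a fine-tuned NER head (the label names depend on the fine-tuning dataset), inference could look like this sketch:

````python
>>> from transformers import pipeline

>>> # aggregation_strategy="simple" merges subword pieces into whole-entity spans
>>> ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
>>> ner("The brown trout (Salmo trutta) inhabits freshwater ecosystems across Europe.")
````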

3. Sequence Classification - Relation Extraction

````python
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")

>>> model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT")
````
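
Relation extraction is cast as sequence classification over a sentence mentioning both entities. A sketch, assuming a fine-tuned relation-classification head (the label set depends on the fine-tuning dataset, and the example sentence is ours):

````python
>>> from transformers import pipeline

>>> re_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
>>> re_classifier("Reduced plant diversity decreases soil microbial biomass.")
````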

## Training data

* BiodivBERT is pre-trained on abstracts and full text from biodiversity domain-related publications.
* We used both the Elsevier and Springer APIs to crawl this data.
* We covered publications from 1990 to 2020.
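
For orientation, here is a minimal sketch of a masked-language-model pre-training setup using the hyperparameters listed in the metadata above. `corpus.txt` is a placeholder for the crawled corpus (not bundled with the model); the actual training scripts are in the GitHub repo:

````python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# start from the BERT base cased checkpoint, as described above
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# placeholder: one document or paragraph per line
dataset = load_dataset("text", data_files={"train": "corpus.txt"})

def tokenize(batch):
    # MAX_LEN = 512
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# MLM_PROP = 0.15: mask 15% of tokens for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="biodivbert-pretraining",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized["train"],
)
trainer.train()
````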

## Evaluation results

BiodivBERT outperformed ``BERT_base_cased``, ``biobert_v1.1``, and a ``BiLSTM`` baseline on the downstream tasks.