---
language:
- en
thumbnail: >-
  https://www.fusion.uni-jena.de/fusionmedia/fusionpictures/fusion-service/fusion-transp.png?height=383&width=680
tags:
- bert-base-cased
- biodiversity
- token-classification
- sequence-classification
license: apache-2.0
citation: "Abdelmageed, N., Löffler, F., & König-Ries, B. (2023). BiodivBERT: a Pre-Trained Language Model for the Biodiversity Domain."
paper: https://ceur-ws.org/Vol-3415/paper-7.pdf
metrics:
- f1
- precision
- recall
- accuracy
evaluation datasets:
- url: https://doi.org/10.5281/zenodo.6554208
- named entity recognition:
  - COPIOUS
  - QEMP
  - BiodivNER
  - LINNAEUS
  - Species800
- relation extraction:
  - GAD
  - EU-ADR
  - BiodivRE
  - BioRelEx
training_data:
- crawling-keywords:
  - biodivers
  - genetic diversity
  - omic diversity
  - phylogenetic diversity
  - soil diversity
  - population diversity
  - species diversity
  - ecosystem diversity
  - functional diversity
  - microbial diversity
- corpora:
  - (+Abs) Springer and Elsevier abstracts, 1990-2020
  - (+Abs+Full) Springer and Elsevier abstracts and open-access full publication text, 1990-2020
pre-training-hyperparams:
- MAX_LEN = 512
- MLM_PROP = 0.15
- num_train_epochs = 3
- per_device_train_batch_size = 16
- per_device_eval_batch_size = 16
- gradient_accumulation_steps = 4
---

# BiodivBERT

## Model description

* BiodivBERT is a domain-specific, cased BERT-based model for biodiversity literature.
* It uses the tokenizer from the BERT base cased model.
* BiodivBERT is pre-trained on abstracts and full text from biodiversity literature.
* BiodivBERT is fine-tuned on two downstream tasks: Named Entity Recognition and Relation Extraction in the biodiversity domain.
* Please visit our [GitHub Repo](https://github.com/fusion-jena/BiodivBERT) for more details.

## How to use

* You can use BiodivBERT via the Hugging Face `transformers` library as follows:

1. Masked Language Model

````python
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM

>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")

>>> model = AutoModelForMaskedLM.from_pretrained("NoYo25/BiodivBERT")
````
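
You can then query the masked-language-model head, e.g. via the `fill-mask` pipeline. A minimal sketch; the example sentence is ours, not from the paper:

````python
>>> from transformers import pipeline

>>> fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
>>> # cased BERT models use [MASK] as the mask token
>>> fill_mask("Climate change is a major driver of [MASK] loss.")
````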

2. Token Classification - Named Entity Recognition

````python
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification

>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")

>>> model = AutoModelForTokenClassification.from_pretrained("NoYo25/BiodivBERT")
````
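
Assuming the loaded checkpoint carries a fine-tuned NER head (the label names depend on the fine-tuning dataset), inference could look like this sketch:

````python
>>> from transformers import pipeline

>>> # aggregation_strategy="simple" merges subword pieces into whole-entity spans
>>> ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
>>> ner("The brown trout (Salmo trutta) inhabits freshwater ecosystems across Europe.")
````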

3. Sequence Classification - Relation Extraction

````python
>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")

>>> model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT")
````
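
Relation extraction is cast as sequence classification over a sentence mentioning both entities. A sketch, assuming a fine-tuned relation-classification head (the label set depends on the fine-tuning dataset, and the example sentence is ours):

````python
>>> from transformers import pipeline

>>> re_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
>>> re_classifier("Reduced plant diversity decreases soil microbial biomass.")
````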

## Training data

* BiodivBERT is pre-trained on abstracts and full text from biodiversity domain-related publications.
* We used both the Elsevier and Springer APIs to crawl this data.
* We covered publications from 1990 to 2020.
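
For orientation, here is a minimal sketch of a masked-language-model pre-training setup using the hyperparameters listed in the metadata above. `corpus.txt` is a placeholder for the crawled corpus (not bundled with the model); the actual training scripts are in the GitHub repo:

````python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# start from the BERT base cased checkpoint, as described above
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# placeholder: one document or paragraph per line
dataset = load_dataset("text", data_files={"train": "corpus.txt"})

def tokenize(batch):
    # MAX_LEN = 512
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# MLM_PROP = 0.15: mask 15% of tokens for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="biodivbert-pretraining",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized["train"],
)
trainer.train()
````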

## Evaluation results

BiodivBERT outperformed ``BERT_base_cased``, ``biobert_v1.1``, and a ``BiLSTM`` baseline on the downstream tasks.