---
task: token-classification
tags:
- biomedical
- bionlp
license: mit
base_model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
---

# bioner_tmvar3

This is a named entity recognition model fine-tuned from the [microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext](https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext) model. It predicts spans with 10 possible labels. The labels are **AcidChange, CellLine, DNAAllele, DNAMutation, Gene, OtherMutation, ProteinAllele, ProteinMutation, SNP and Species**.

The code used for training this model can be found at https://github.com/Glasgow-AI4BioMed/bioner along with links to other biomedical NER models trained on well-known biomedical corpora. The source dataset information is below.

## Example Usage

The code below loads the model and applies it to the provided text. It uses the `max` aggregation strategy to merge individual tokens into larger multi-token entities where needed.

```python
from transformers import pipeline

# Load the model as part of an NER pipeline
ner_pipeline = pipeline("token-classification",
                        model="Glasgow-AI4BioMed/bioner_tmvar3",
                        aggregation_strategy="max")

# Apply it to some text
ner_pipeline("The G1706A mutation is one of the most common variant in the BRCA1 gene.")

# Output (the base model is uncased, so "word" is lowercased):
# [ {"entity_group": "DNAMutation", "score": 0.99731, "word": "g1706a", "start": 4, "end": 10},
#   {"entity_group": "Gene", "score": 0.99795, "word": "brca1", "start": 61, "end": 66} ]
```

## Dataset Info

**Source:** The tmVar3 dataset was downloaded from: https://ftp.ncbi.nlm.nih.gov/pub/lu/tmVar3/

The dataset should be cited with: Wei, Chih-Hsuan, et al. "tmVar 3.0: an improved variant concept recognition and normalization tool." Bioinformatics 38.18 (2022): 4449-4451. DOI: [10.1093/bioinformatics/btac537](https://doi.org/10.1093/bioinformatics/btac537)

**Preprocessing:** The training/validation/test split was maintained from the original dataset.
No changes were made to the annotations. The preprocessing script for this dataset is [prepare_tmvar.py](https://github.com/Glasgow-AI4BioMed/bioner/blob/main/prepare_tmvar.py).

## Performance

The span-level performance on the test split for each label is shown in the table below. The full performance results are available in the model repo in Markdown format for viewing and JSON format for easier loading. These include the token-level performance (with individual B- and I- labels, as the token classifier uses IOB2 token labelling).

| Label | Precision | Recall | F1-score | Support |
| --- | --- | --- | --- | --- |
| AcidChange | 0.562 | 0.818 | 0.667 | 11 |
| CellLine | 0.550 | 0.688 | 0.611 | 16 |
| DNAAllele | 0.786 | 0.733 | 0.759 | 15 |
| DNAMutation | 0.824 | 0.864 | 0.844 | 125 |
| Gene | 0.886 | 0.907 | 0.897 | 766 |
| OtherMutation | 0.410 | 0.561 | 0.474 | 57 |
| ProteinAllele | 0.857 | 0.706 | 0.774 | 17 |
| ProteinMutation | 0.904 | 0.851 | 0.877 | 121 |
| SNP | 0.815 | 1.000 | 0.898 | 22 |
| Species | 0.955 | 0.949 | 0.952 | 333 |
| macro avg | 0.755 | 0.808 | 0.775 | 1483 |
| weighted avg | 0.871 | 0.889 | 0.879 | 1483 |

## Hyperparameters

Hyperparameter tuning was done with [optuna](https://optuna.org/) and the [hyperparameter_search](https://huggingface.co/docs/transformers/en/hpo_train) functionality. 100 trials were run. Early stopping was applied during training. The best-performing model was selected using the macro F1 performance on the validation set. The selected hyperparameters are in the table below.

| Hyperparameter | Value |
|----------------|-------|
| epochs | 17.0 |
| learning_rate | 9.821730986128684e-05 |
| per_device_train_batch_size | 16 |
| weight_decay | 0.2819767106649835 |
| warmup_ratio | 0.10342300416492177 |
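
Because model selection used macro F1, it may help to see how the macro and support-weighted averages differ. The sketch below recomputes both from the per-label test F1 scores and supports reported in the Performance table. The names `PER_LABEL`, `macro_average` and `weighted_average` are illustrative and are not taken from the training code, and the numbers are test-set values, not the validation scores used for selection.

```python
# Per-label (F1-score, support) pairs copied from the Performance table.
PER_LABEL = {
    "AcidChange": (0.667, 11),
    "CellLine": (0.611, 16),
    "DNAAllele": (0.759, 15),
    "DNAMutation": (0.844, 125),
    "Gene": (0.897, 766),
    "OtherMutation": (0.474, 57),
    "ProteinAllele": (0.774, 17),
    "ProteinMutation": (0.877, 121),
    "SNP": (0.898, 22),
    "Species": (0.952, 333),
}


def macro_average(scores):
    """Unweighted mean: every label counts equally, however rare it is."""
    values = [f1 for f1, _ in scores.values()]
    return sum(values) / len(values)


def weighted_average(scores):
    """Mean weighted by support: frequent labels dominate the result."""
    total_support = sum(support for _, support in scores.values())
    return sum(f1 * support for f1, support in scores.values()) / total_support


macro = macro_average(PER_LABEL)
weighted = weighted_average(PER_LABEL)
print(f"macro F1:    {macro:.3f}")
print(f"weighted F1: {weighted:.3f}")
```

The macro average comes out lower than the weighted average here because low-support labels such as OtherMutation and CellLine count as much as Gene. Selecting on macro F1 therefore favours models that do reasonably well on rare labels rather than only on the most frequent ones.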