EMBO/bio-lm

Thomas Lemberger committed on Mar 10, 2021

Commit d66184b · 1 parent: 7af11f5

initial card

Files changed (1)

README.md ADDED (+95 -0)

	@@ -0,0 +1,95 @@

+---
+language:
+-
+-
+thumbnail:
+tags:
+-
+-
+-
+license:
+datasets:
+-
+-
+metrics:
+-
+-
+---
+# bio-lm
+## Model description
+This model is a [RoBERTa base model](https://huggingface.co/roberta-base) that was further trained with a masked language modeling task on a compendium of English scientific text from the life sciences, using the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang).
+## Intended uses & limitations
+#### How to use
+This model is intended to be fine-tuned for downstream tasks, token classification in particular.
+For a quick check of the model as-is on a fill-mask task:
+```python
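+from transformers import pipeline, RobertaTokenizerFast
+# The model reuses the `roberta-base` vocabulary, so load that tokenizer.
+tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', model_max_length=512)
+text = "Let us try this model to see if it <mask>."
+fill_mask = pipeline(
+    "fill-mask",
+    model='EMBO/bio-lm',
+    tokenizer=tokenizer
+)
+# Returns the top candidate tokens for the <mask> position with their scores.
+fill_mask(text)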
+```
+#### Limitations and bias
+This model should be fine-tuned on a specific task such as token classification.
+The model must be used with the `roberta-base` tokenizer.
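The two points above (fine-tune for token classification, pair the model with the `roberta-base` tokenizer) could be set up roughly as follows. This is a minimal sketch, not the authors' fine-tuning code: the label set is a hypothetical placeholder, and `load_for_token_classification` is a helper name introduced here for illustration.

```python
# Hypothetical label set for illustration; replace with the labels of your task.
LABELS = ["O", "B-GENE", "I-GENE"]
id2label = dict(enumerate(LABELS))
label2id = {label: i for i, label in id2label.items()}


def load_for_token_classification():
    """Load EMBO/bio-lm with a fresh token-classification head (downloads weights)."""
    # Imported lazily so the label mappings above can be used without transformers.
    from transformers import AutoModelForTokenClassification, RobertaTokenizerFast

    # The model card requires the `roberta-base` tokenizer.
    tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base", model_max_length=512)
    model = AutoModelForTokenClassification.from_pretrained(
        "EMBO/bio-lm",
        num_labels=len(LABELS),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

The classification head returned here is randomly initialized, so its predictions are only meaningful after fine-tuning on labeled data.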
+## Training data
+The model was trained with a masked language modeling task on the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang), which includes 12 million examples from abstracts and figure legends extracted from papers published in the life sciences.
+## Training procedure
+The training was run on an NVIDIA DGX Station with 4x Tesla V100 GPUs.
+Training code is available at https://github.com/source-data/soda-roberta
+- Command: `python -m lm.train /data/json/oapmc_abstracts_figs/ MLM`
+- Tokenizer vocab size: 50265
+- Training data: bio_lang/MLM
+- Training with: 12005390 examples.
+- Evaluating on: 36713 examples.
+- Epochs: 3.0
+- per_device_train_batch_size: 16
+- per_device_eval_batch_size: 16
+- learning_rate: 5e-05
+- weight_decay: 0.0
+- adam_beta1: 0.9
+- adam_beta2: 0.999
+- adam_epsilon: 1e-08
+- max_grad_norm: 1.0
+- tensorboard run: lm-MLM-2021-01-27T15-17-43.113766
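The settings in this list allow a rough estimate of the number of optimizer steps in the run. The sketch below assumes that all four GPUs were used in data-parallel mode and that no gradient accumulation was applied; neither is stated in the card, so the result is an estimate only.

```python
import math

train_examples = 12_005_390  # "Training with" in the list above
per_device_batch = 16        # per_device_train_batch_size
epochs = 3
num_gpus = 4                 # assumption: all four V100 GPUs were used
grad_accum_steps = 1         # assumption: no gradient accumulation

# Examples consumed per optimizer step across all devices.
effective_batch = per_device_batch * num_gpus * grad_accum_steps
# The last, partially filled batch of an epoch still counts as one step.
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
```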
+Evaluation on the validation set at the end of training:
+```
+{'loss': 0.8653350830078125, 'learning_rate': 6.708070119323685e-08, 'epoch': 2.995975157928406}
+{'eval_loss': 0.8192330598831177, 'eval_recall': 0.8154601116513597, 'epoch': 2.995975157928406}
+```
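The learning rate in the last training log line appears consistent with a linear decay from the initial 5e-05 to zero over the 3 epochs, with no warmup. The card does not name the schedule, so this is an inference from the logged numbers; the short check below reproduces it.

```python
initial_lr = 5e-05                 # learning_rate from the list above
total_epochs = 3.0                 # Epochs from the list above
logged_epoch = 2.995975157928406   # from the training log
logged_lr = 6.708070119323685e-08  # from the training log

# Assumed schedule: linear decay to zero without warmup.
expected_lr = initial_lr * (1 - logged_epoch / total_epochs)
relative_error = abs(expected_lr - logged_lr) / logged_lr

print(f"expected lr: {expected_lr:.6e}")
print(f"relative error vs. logged lr: {relative_error:.2e}")
```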
+## Eval results
+Evaluation on the test set:
+{'test_loss': 0.8240728974342346, 'test_recall': 0.814471959728645}
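For readers who prefer perplexity to raw loss, the reported values can be converted as sketched below. This assumes the losses are mean cross-entropy values in nats over the masked tokens, which is typical for masked language modeling but is not confirmed by the card.

```python
import math

eval_loss = 0.8192330598831177  # validation set, from the training log
test_loss = 0.8240728974342346  # test set

# Perplexity is the exponential of the mean cross-entropy (assumed to be in nats).
eval_perplexity = math.exp(eval_loss)
test_perplexity = math.exp(test_loss)

print(f"validation perplexity: {eval_perplexity:.3f}")
print(f"test perplexity: {test_perplexity:.3f}")
```

The two values are close, which suggests that the validation and test splits are of similar difficulty for the model.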
+### BibTeX entry and citation info
+```bibtex
+@inproceedings{...,
+  year={2020}
+}
+```