	---
	language: en
	license: apache-2.0
	---

	# BiomedBERT Small

	This is a `22.7M` parameter [BERT](https://arxiv.org/abs/1810.04805) encoder-only model trained on data from [PubMed](https://pubmed.ncbi.nlm.nih.gov/). The raw data was transformed using [PaperETL](https://github.com/neuml/paperetl) with the results stored as a local dataset via the [Hugging Face Datasets library](https://huggingface.co/docs/datasets/en/index).

	This model is designed to be a solid-performing small model fitting in between the [110M parameter BiomedBERT Base model](https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext) and the [tiny BiomedBERT Hash series of models](https://huggingface.co/blog/NeuML/biomedbert-hash-nano).

	## Usage

	`biomedbert-small` can be loaded using [Hugging Face Transformers](https://huggingface.co/docs/transformers/en/index) as follows.

	```python
	from transformers import AutoModel

	model = AutoModel.from_pretrained("neuml/biomedbert-small")
	```

	The model is intended to be fine-tuned further for a specific task such as text classification, entity extraction or sentence embeddings.
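As one illustration of the fine-tuning path described above, here is a minimal sketch that attaches a classification head to the encoder. It assumes the repository also ships a tokenizer loadable with `AutoTokenizer` (the `run_glue.py` command in this card relies on the same thing). The helper names `build_label_maps` and `load_classifier` are hypothetical and not part of any library.

```python
def build_label_maps(labels):
    """Return (id2label, label2id) dictionaries for a list of class names."""
    id2label = dict(enumerate(labels))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def load_classifier(labels, model_id="neuml/biomedbert-small"):
    """Load the pretrained encoder with a newly initialized classification head.

    Hypothetical helper: the head weights are random until fine-tuned.
    """
    # Imported lazily so build_label_maps can be used without Transformers installed
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    id2label, label2id = build_label_maps(labels)

    # Assumption: the model repository includes tokenizer files
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

Calling `load_classifier(["class_a", "class_b"])` (placeholder label names) returns a tokenizer and model that could then be passed to a training loop such as the Transformers `Trainer`. The classification head starts untrained, so predictions are not meaningful until the model has been fine-tuned on labeled data.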

	## Evaluation Results

	The [Medical Abstracts Text Classification Dataset](https://huggingface.co/datasets/TimSchopf/medical_abstracts) was used to evaluate the model's performance. A handful of biomedical and general-purpose models were selected for comparison.

	Metrics were generated using Hugging Face's standard [run_glue script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py) as shown below.

	```bash
	python run_glue.py \
	  --model_name_or_path neuml/biomedbert-small \
	  --dataset_name medclassify \
	  --do_train \
	  --do_eval \
	  --max_seq_length 128 \
	  --per_device_train_batch_size 32 \
	  --learning_rate 1e-4 \
	  --num_train_epochs 4 \
	  --output_dir outputs \
	  --trust_remote_code True
	```

	_Note: The original dataset was saved locally as `medclassify`, with the `condition_label` column renamed to `label` so that it works more easily with the GLUE script._

	| Model | Parameters | Accuracy | Loss |
	| ----- | ---------- | -------- | ---- |
	| [biomedbert-hash-nano](https://hf.co/neuml/biomedbert-hash-nano) | 0.969M | 0.6195 | 0.9464 |
	| [biomedbert-small](https://hf.co/neuml/biomedbert-small) | 22.7M | 0.6274 | 0.8647 |
	| [bert-base-uncased](https://hf.co/google-bert/bert-base-uncased) | 110M | 0.6118 | 0.9712 |
	| [biomedbert-base](https://hf.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext) | 110M | 0.6195 | 0.9037 |
	| [ModernBERT-base](https://hf.co/answerdotai/ModernBERT-base) | 149M | 0.5672 | 1.1079 |
	| [BioClinical-ModernBERT-base](https://hf.co/thomas-sounack/BioClinical-ModernBERT-base) | 149M | 0.5679 | 1.0915 |

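	`biomedbert-small` has the highest accuracy (0.6274) and the lowest loss (0.8647) of the models tested, ahead of models roughly 5 to 7 times its size. No model exceeds 63% accuracy, which shows how challenging this dataset is.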
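The comparison can be checked with a few lines of Python. This is a small sketch that copies the parameter counts, accuracy and loss values from the table above. The names `RESULTS`, `best_by_accuracy`, `best_by_loss` and `size_ratio` are illustrative only.

```python
# Rows copied from the evaluation table:
# (model, parameters in millions, accuracy, loss)
RESULTS = [
    ("biomedbert-hash-nano", 0.969, 0.6195, 0.9464),
    ("biomedbert-small", 22.7, 0.6274, 0.8647),
    ("bert-base-uncased", 110.0, 0.6118, 0.9712),
    ("biomedbert-base", 110.0, 0.6195, 0.9037),
    ("ModernBERT-base", 149.0, 0.5672, 1.1079),
    ("BioClinical-ModernBERT-base", 149.0, 0.5679, 1.0915),
]


def best_by_accuracy(rows):
    """Return the row with the highest accuracy."""
    return max(rows, key=lambda row: row[2])


def best_by_loss(rows):
    """Return the row with the lowest loss."""
    return min(rows, key=lambda row: row[3])


def size_ratio(rows, name, baseline="biomedbert-small"):
    """Return how many times larger `name` is than `baseline`, by parameter count."""
    params = {row[0]: row[1] for row in rows}
    return params[name] / params[baseline]


print(best_by_accuracy(RESULTS)[0])  # biomedbert-small
print(best_by_loss(RESULTS)[0])      # biomedbert-small
print(f"{size_ratio(RESULTS, 'biomedbert-base'):.1f}x")   # 4.8x
print(f"{size_ratio(RESULTS, 'ModernBERT-base'):.1f}x")   # 6.6x
```

These figures come from a single dataset and a single training configuration, so they describe this benchmark rather than general performance.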

	## More Information

	Read more about the model in [this article](https://hf.co/blog/NeuML/biomedbert-small).