# medAlpaca: Finetuned Large Language Models for Medical Question Answering

## Project Overview

MedAlpaca expands upon both [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca) and [AlpacaLoRA](https://github.com/tloen/alpaca-lora) to offer an advanced suite of large language models specifically fine-tuned for medical question-answering and dialogue applications. Our primary objective is to deliver an array of open-source language models, paving the way for seamless development of medical chatbot solutions.

These models have been trained using a variety of medical texts, encompassing resources such as medical flashcards, wikis, and dialogue datasets. For more details on the data utilized, please consult the data section.

## Getting Started

Create a new virtual environment, e.g. with conda:

```bash
conda create -n medalpaca "python>=3.9"
```

Install the required packages:

```bash
pip install -r requirements.txt
```
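Once set up, the finetuned models are typically queried with an Alpaca-style prompt. Below is a minimal sketch of such a prompt builder; it mirrors the original Stanford Alpaca template, and the exact wording medAlpaca expects is an assumption here:

```python
def build_prompt(instruction: str, user_input: str = "") -> str:
    """Assemble an Alpaca-style prompt. The template below follows the
    original Stanford Alpaca format; medAlpaca's exact template may differ."""
    if user_input:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{user_input}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )

prompt = build_prompt("What are the symptoms of diabetes?")
```

The string returned by `build_prompt` is what you would feed to the model's tokenizer for generation.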

## Training of medAlpaca

<img width="256" alt="training your alpaca" src="https://user-images.githubusercontent.com/37253540/229250535-98f28e1c-0a8e-46e7-9e61-aeb98ef115cc.png">

### Memory Requirements

We have benchmarked the required GPU memory as well as the approximate duration per epoch for finetuning LLaMA 7b on the Medical Meadow small dataset (~6000 Q/A pairs) on a single GPU:

| Model    | 8-bit training | LoRA  | fp16  | bf16  | VRAM used | Gradient ckpt | Duration/epoch |
|----------|----------------|-------|-------|-------|-----------|---------------|----------------|
| LLaMA 7b | True           | True  | True  | False | 8.9 GB    | False         | 77:30          |
| LLaMA 7b | False          | True  | True  | False | 18.8 GB   | False         | 14:30          |
| LLaMA 7b | False          | False | True  | False | OOM       | False         | -              |
| LLaMA 7b | False          | False | False | True  | 79.5 GB   | True          | 35:30          |
| LLaMA 7b | False          | False | False | False | OOM       | True          | -              |
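A back-of-the-envelope calculation explains why full-precision finetuning goes OOM while 8-bit plus LoRA fits in under 9 GB: the weights alone of a 7B-parameter model take roughly 13 GB at 16 bits, before activations, gradients, and optimizer states are even counted. This is only a rough sketch, not a substitute for the measurements above:

```python
def weight_memory_gb(n_params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone, in GiB (1 GiB = 2**30 bytes).
    Ignores activations, gradients, and optimizer states, which dominate
    during full finetuning."""
    return n_params_billion * 1e9 * bytes_per_param / 2**30

fp16_weights = weight_memory_gb(7, 2)  # 16-bit weights: ~13 GiB before any training state
int8_weights = weight_memory_gb(7, 1)  # 8-bit quantized weights: ~6.5 GiB
```

With LoRA, only small adapter matrices receive gradients and optimizer states, which is why the 8-bit + LoRA row fits in 8.9 GB.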

### Train medAlpaca based on LLaMA

If you have access to the [LLaMA](https://arxiv.org/abs/2302.13971) or [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html) weights, you can finetune the model with the following command. Just replace `<PATH_TO_LLAMA_WEIGHTS>` with the folder containing your LLaMA or Alpaca weights.

```bash
python medalpaca/train.py \
    --model <PATH_TO_LLAMA_WEIGHTS> \
    --data_path medical_meadow_small.json \
    --output_dir 'output' \
    --train_in_8bit True \
    --bf16 True \
    --tf32 False \
    --fp16 False \
    --global_batch_size 128 \
    --per_device_batch_size 8
```

By default, the script performs mixed-precision training. You can toggle 8-bit training with the `train_in_8bit` flag. While 8-bit training currently only works with `use_lora True`, you can use LoRA without 8-bit training. The script can also train other models, such as `facebook/opt-6.7b`.
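The `global_batch_size` and `per_device_batch_size` flags relate through gradient accumulation: the trainer takes several small forward/backward passes before each optimizer step so that one update effectively sees the full global batch. A sketch of that arithmetic (the exact behavior of `train.py` is an assumption here):

```python
def accumulation_steps(global_batch_size: int,
                       per_device_batch_size: int,
                       n_gpus: int = 1) -> int:
    """Gradient-accumulation steps so that one optimizer step sees
    `global_batch_size` examples in total across all devices."""
    micro_batch = per_device_batch_size * n_gpus
    assert global_batch_size % micro_batch == 0, "global batch must divide evenly"
    return global_batch_size // micro_batch

steps = accumulation_steps(128, 8)  # the command above on one GPU: 128 / 8 = 16
```

Raising `per_device_batch_size` (if VRAM allows) or adding GPUs reduces the number of accumulation steps and speeds up each epoch without changing the effective batch size.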

## Data

<img width="256" alt="Screenshot 2023-03-31 at 09 37 41" src="https://user-images.githubusercontent.com/37253540/229244284-72b00e82-0da1-4218-b08e-63864306631e.png">

To ensure your cherished llamas and alpacas are well-fed and thriving, we have diligently gathered high-quality biomedical open-source datasets and transformed them into instruction-tuning formats. We have dubbed this endeavor **Medical Meadow**. Medical Meadow currently encompasses roughly 1.5 million data points across a diverse range of tasks, including openly curated medical data transformed into Q/A pairs with OpenAI's `gpt-3.5-turbo` and a collection of established NLP tasks in the medical domain. Please note that not all data is of the same quantity and quality, and you may need to subsample the data for training your own model. We will persistently update and refine the dataset, and we welcome everyone to contribute more 'grass' to Medical Meadow!
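Concretely, "transforming into instruction-tuning formats" means recasting each record as an Alpaca-style triple of instruction, input, and output. A minimal sketch of such a conversion for a Q/A pair — the actual Medical Meadow field names and instruction text are assumptions here:

```python
import json

def to_instruction_format(question: str, answer: str) -> dict:
    """Recast a Q/A pair as an Alpaca-style training record.
    Field names follow the Alpaca convention; the exact instruction
    string used in Medical Meadow may differ."""
    return {
        "instruction": "Answer this question truthfully",
        "input": question,
        "output": answer,
    }

record = to_instruction_format(
    "What is the first-line treatment for anaphylaxis?",
    "Intramuscular epinephrine.",
)
line = json.dumps(record)  # one JSON object per training example
```

A dataset file like `medical_meadow_small.json` is then simply a list of such records.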

### Data Overview

| Name | Source | n | n included in training |
|----------------------|-------------------------------------------------------------------------|----------|-------------------------|
| Medical Flashcards | [medalpaca/medical_meadow_medical_flashcards](https://huggingface.co/datasets/medalpaca/medical_meadow_medical_flashcards) | 33955 | 33955 |
| Wikidoc | [medalpaca/medical_meadow_wikidoc](https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc) | 67704 | 10000 |
| Wikidoc Patient Information | [medalpaca/medical_meadow_wikidoc_patient_information](https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc_patient_information) | 5942 | 5942 |
| Stackexchange academia | [medalpaca/medical_meadow_stack_exchange](https://huggingface.co/datasets/medalpaca/medical_meadow_stackexchange) | 40865 | 40865 |
| Stackexchange biology | [medalpaca/medical_meadow_stack_exchange](https://huggingface.co/datasets/medalpaca/medical_meadow_stackexchange) | 27887 | 27887 |
| Stackexchange fitness | [medalpaca/medical_meadow_stack_exchange](https://huggingface.co/datasets/medalpaca/medical_meadow_stackexchange) | 9833 | 9833 |
| Stackexchange health | [medalpaca/medical_meadow_stack_exchange](https://huggingface.co/datasets/medalpaca/medical_meadow_stackexchange) | 7721 | 7721 |
| Stackexchange bioinformatics | [medalpaca/medical_meadow_stack_exchange](https://huggingface.co/datasets/medalpaca/medical_meadow_stackexchange) | 5407 | 5407 |
| USMLE Self Assessment Step 1 | [medalpaca/medical_meadow_usmle_self_assessment](https://huggingface.co/datasets/medalpaca/medical_meadow_usmle_self_assessment) | 119 | 92 (test only) |
| USMLE Self Assessment Step 2 | [medalpaca/medical_meadow_usmle_self_assessment](https://huggingface.co/datasets/medalpaca/medical_meadow_usmle_self_assessment) | 120 | 110 (test only) |
| USMLE Self Assessment Step 3 | [medalpaca/medical_meadow_usmle_self_assessment](https://huggingface.co/datasets/medalpaca/medical_meadow_usmle_self_assessment) | 135 | 122 (test only) |
| MEDIQA | [original](https://osf.io/fyg46/?view_only=), [preprocessed](https://huggingface.co/datasets/medalpaca/medical_meadow_mediqa) | 2208 | 2208 |
| CORD-19 | [original](https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge), [preprocessed](https://huggingface.co/datasets/medalpaca/medical_meadow_cord19) | 1056660 | 50000 |
| MMMLU | [original](https://github.com/hendrycks/test), [preprocessed](https://huggingface.co/datasets/medalpaca/medical_meadow_mmmlu) | 3787 | 3787 |
| Pubmed Health Advice | [original](https://aclanthology.org/D19-1473/), [preprocessed](https://huggingface.co/datasets/medalpaca/health_advice) | 10178 | 10178 |
| Pubmed Causal | [original](https://aclanthology.org/2020.coling-main.427/), [preprocessed](https://huggingface.co/datasets/medalpaca/medical_meadow_pubmed_causal) | 2446 | 2446 |
| ChatDoctor | [original](https://github.com/Kent0n-Li/ChatDoctor) | 215000 | 10000 |
| OpenAssistant | [original](https://huggingface.co/OpenAssistant) | 9209 | 9209 |

### Data description

Please refer to [DATA_DESCRIPTION.md](DATA_DESCRIPTION.md).

## Benchmarks

<img width="256" alt="benchmarks" src="https://user-images.githubusercontent.com/37253540/229249302-20ff8a88-95b4-42a3-bdd8-96a9dce9a92b.png">

We are benchmarking all models on the USMLE self-assessment, which is available at this [link](https://www.usmle.org/prepare-your-exam). Note that we removed all questions with images, as our models are not multimodal.
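The image-question filtering can be sketched as follows; the field names and the USMLE self-assessment record layout are assumptions for illustration only:

```python
def drop_image_questions(questions: list) -> list:
    """Keep only questions a text-only model can answer, i.e. those
    without an attached image (hypothetical 'image' field)."""
    return [q for q in questions if not q.get("image")]

sample = [
    {"id": 1, "question": "Based on the radiograph shown ...", "image": "fig1.png"},
    {"id": 2, "question": "Which enzyme is deficient in ...", "image": None},
]
kept = drop_image_questions(sample)  # only the text-only question remains
```

This is why the "n included in training" counts for the USMLE steps in the table above are smaller than the raw question counts.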

| **Model** | **Step1** | **Step2** | **Step3** |
|--------------------------------------------------------------------------------------------|-------------------|------------------|------------------|
| [LLaMA 7b](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) | 0.198 | 0.202 | 0.203 |
| [Alpaca 7b naive](https://github.com/tatsu-lab/stanford_alpaca) ([weights](https://huggingface.co/chavinlo/alpaca-native)) | 0.275 | 0.266 | 0.293 |
| [Alpaca 7b LoRA](https://github.com/tloen/alpaca-lora) | 0.220 | 0.138 | 0.252 |
| [MedAlpaca 7b](https://huggingface.co/medalpaca/medalpaca-7b) | 0.297 | 0.312 | 0.398 |
| [MedAlpaca 7b LoRA](https://huggingface.co/medalpaca/medalpaca-lora-7b-16bit) | 0.231 | 0.202 | 0.179 |
| [MedAlpaca 7b LoRA 8bit](https://huggingface.co/medalpaca/medalpaca-lora-7b-8bit) | 0.231 | 0.241 | 0.211 |
| [ChatDoctor](https://github.com/Kent0n-Li/ChatDoctor) (7b) | 0.187 | 0.185 | 0.148 |
| [LLaMA 13b](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) | 0.222 | 0.248 | 0.276 |
| [Alpaca 13b naive](https://huggingface.co/chavinlo/alpaca-13b) | 0.319 | 0.312 | 0.301 |
| [MedAlpaca 13b](https://huggingface.co/medalpaca/medalpaca-13b) | ***0.473*** | ***0.477*** | ***0.602*** |
| [MedAlpaca 13b LoRA](https://huggingface.co/medalpaca/medalpaca-lora-13b-16bit) | 0.250 | 0.255 | 0.255 |
| [MedAlpaca 13b LoRA 8bit](https://huggingface.co/medalpaca/medalpaca-lora-13b-8bit) | 0.189 | 0.303 | 0.289 |
| [MedAlpaca 30b](https://huggingface.co/medalpaca/medalpaca-30b) (still training) | TBA | TBA | TBA |
| [MedAlpaca 30b LoRA 8bit](https://huggingface.co/medalpaca/medalpaca-lora-30b-8bit) | 0.315 | 0.327 | 0.361 |

We are continuously working on improving the training as well as our evaluation prompts. Expect this table to change quite a bit.

[…] extensive testing or validation, and their reliability cannot be guaranteed. We kindly ask you to exercise caution when using these models, and we appreciate your understanding as we continue to explore and develop this innovative technology.

## Paper

<img width="256" alt="chat-lama" src="https://user-images.githubusercontent.com/37253540/229261366-5cce9a60-176a-471b-80fd-ba390539da72.png">

```
@article{han2023medalpaca,
  title={MedAlpaca--An Open-Source Collection of Medical Conversational AI Models and Training Data},
  author={Han, Tianyu and Adams, Lisa C and Papaioannou, Jens-Michalis and Grundmann, Paul and Oberhauser, Tom and L{\"o}ser, Alexander and Truhn, Daniel and Bressem, Keno K},
  journal={arXiv preprint arXiv:2304.08247},
  year={2023}
}
```