An Open-Source Collection of Medical Conversational AI Models and Training Data

Source: https://arxiv.org/html/2304.08247
- Tianyu Han* — Department of Radiology, University Hospital Aachen, Aachen, Germany (tianyu.han@ukaachen.de)
- Lisa C. Adams* — Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital, Munich, Germany (lisa.adams@tum.de)
- Jens-Michalis Papaioannou — Berliner Hochschule für Technik (BHT), Berlin, Germany (michalis.papaioannou@bht-berlin.de)
- Paul Grundmann — Berliner Hochschule für Technik (BHT), Berlin, Germany (pgrundmann@bht-berlin.de)
- Tom Oberhauser — Berliner Hochschule für Technik (BHT), Berlin, Germany (tom.oberhauser@bht-berlin.de)
- Alexei Figueroa — Berliner Hochschule für Technik (BHT), Berlin, Germany (alexei.figueroa@bht-berlin.de)
- Alexander Löser — Berliner Hochschule für Technik (BHT), Berlin, Germany (aloeser@bht-berlin.de)
- Daniel Truhn* — Department of Radiology, University Hospital Aachen, Aachen, Germany (dtruhn@ukaachen.de)
- Keno K. Bressem* — Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, and Department of Cardiovascular Radiology and Nuclear Medicine, German Heart Center, Technical University of Munich, School of Medicine and Health, TUM University Hospital, Munich, Germany (keno.bressem@tum.de)

*Contributed equally
Abstract
As large language models (LLMs) like OpenAI’s GPT series continue to make strides, artificial intelligence applications are emerging in an ever-expanding range of fields. In medicine, these LLMs hold considerable promise for improving medical workflows, diagnostics, patient care, and education. Yet there is an urgent need for open-source models that can be deployed on-premises to safeguard patient privacy. In our work, we present a dataset consisting of over 160,000 entries, specifically crafted to fine-tune LLMs for effective medical applications. We investigate the impact of fine-tuning publicly accessible pre-trained LLMs on these datasets, and we then compare the performance of pre-trained-only models against the fine-tuned models on the examinations that future medical doctors must pass to achieve certification.
Keywords: Natural Language Processing · Artificial Intelligence · Medicine
1 Introduction
The advent of large language models (LLMs), trained using reinforcement learning through human feedback (RLHF) and exemplified by OpenAI’s GPT series, has profoundly influenced the fields of natural language processing (NLP) and artificial intelligence (AI) research [1]. Their remarkable capacity to produce coherent, contextually apt, and intricate responses has increased their value across diverse domains. Notably, the medical field is poised to reap substantial benefits from the implementation of these models.
A salient benefit of these LLMs lies in their ability to perform tasks following instructions in natural language, thereby eliminating the necessity for users to have programming proficiency. This feature empowers medical professionals to seamlessly engage with and steer the models through diverse medical workflows.
Potential applications include aiding medical professionals in note-taking, composing discharge letters, retrieving information from extensive documents, summarizing content, and converting free-form texts into structured formats [2, 3]. Provided the model has been trained on a sufficient number of medical documents, it may possess the medical knowledge necessary to assist in consultations by supplying accurate information derived from its base texts [4]. Furthermore, the training of medical students can also benefit from these models, wherein they assume the role of a study partner, capable of quizzing students or elucidating complex subjects, provided the model demonstrates sufficient coherence and accuracy. However, the most adept LLM models are currently not openly accessible, being available exclusively through APIs that necessitate data transmission to the parent company for processing.
Considering the sensitive nature of medical data and the imperative for robust privacy safeguards, non-transparent models with unclear data management practices are ill-suited for medical applications. To tackle this challenge and avert unauthorized data transfers, it is essential to employ open-source models that enable on-site implementation, thus mitigating privacy concerns.
Addressing this demand, we present a compilation of language models specifically fine-tuned for biomedical tasks. Utilizing a blend of new and established open-source biomedical datasets, we adapt them into an instruction-following format. This structure facilitates supervised fine-tuning as the initial phase, as detailed in [1].
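The instruction-following structure can be sketched as a simple record conversion. The field names below follow the Alpaca-style format popularized by Taori et al., which the paper's fine-tuning approach builds on; the exact keys used in the released datasets are an assumption here.

```python
import json

def to_instruction_example(question: str, answer: str, context: str = "") -> dict:
    """Convert a medical Q&A pair into an Alpaca-style instruction record.

    The field names ("instruction"/"input"/"output") are an assumption based
    on the instruction-tuning format of Stanford Alpaca; the released
    datasets may differ in detail.
    """
    return {
        "instruction": question,   # the task the model should perform
        "input": context,          # optional additional context (often empty)
        "output": answer,          # the reference answer used for supervision
    }

record = to_instruction_example(
    "What is the first-line treatment for uncomplicated streptococcal pharyngitis?",
    "Penicillin is the first-line treatment.",
)
print(json.dumps(record, indent=2))
```

Records in this shape can be fed directly to standard supervised fine-tuning pipelines by concatenating the fields into a single training prompt.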
To assess the effectiveness of these models, we evaluate their performance on the United States Medical Licensing Examination (USMLE), a standardized assessment undertaken by medical students in the United States as part of their qualification process to become physicians. This evaluation offers valuable insights into the models’ competencies and prospective applications within the medical domain.
We make all models and datasets publicly available, anticipating that they will confer significant advantages to both medical and AI researchers as well as practitioners in their respective fields.
2 Materials and Methods
2.1 Datasets
In this section, we present Medical Meadow, a collection of medical tasks that we have compiled for fine-tuning and evaluating the performance of large language models in the context of medicine. Medical Meadow consists of two main categories: a collection of established medical NLP tasks reformatted into an instruction-tuning format, and a crawl of various internet resources. Each dataset focuses on different aspects of medical knowledge and practice, providing a comprehensive training and evaluation framework. See Table 1 for a detailed overview of the datasets.
Table 1: Summary of medical datasets created for this work. For information regarding other, already published data, please refer to the respective original publication.
| Use | Dataset | Source | Description | n |
| --- | --- | --- | --- | --- |
| Fine-tuning | Medical Flash Cards | Anki Flashcards | Rephrased Q&A pairs derived from the front and back sides of medical flashcards | 33,955 |
| | Stack Exchange | Academia | Q&A pairs generated from questions and their top-rated answers | 39,633 |
| | | Biology | | 7,482 |
| | | Fitness | | 3,026 |
| | | Health | | 1,428 |
| | | Bioinformatics | | 906 |
| | Wikidoc | Living Textbook | Q&A pairs generated from paragraphs, where questions were formulated from rephrased paragraph titles, and answers were extracted from paragraph text | 67,704 |
| | | Patient Information | Q&A pairs generated from paragraph headings and associated text content | 5,942 |
| Evaluation | USMLE | Step 1 | Multiple choice questions from the USMLE self-assessment with image-based questions excluded | 119 |
| | | Step 2 | | 120 |
| | | Step 3 | | 135 |
2.1.1 Dataset 1: Flash Cards Used by Medical Students
Medicine as a whole encompasses a wide range of subjects that medical students and graduates must master in order to practice effectively. This includes a profound understanding of basic medical sciences, clinical knowledge, and clinical skills. The Anki Medical Curriculum flashcards are created and updated by medical students and cover the entirety of the medical school curriculum, addressing subjects such as anatomy, physiology, pathology, pharmacology, and more. These flashcards frequently feature succinct summaries and mnemonics to aid in the learning and retention of important medical concepts. In our investigation, we leveraged flashcards as a source to create question-answer pairs for training purposes. Upon excluding cards containing images, we harnessed OpenAI’s GPT-3.5-Turbo to restructure the cards into coherent, contextually pertinent question-answer pairs. Generally, the questions and answers are concise and targeted, as the flashcards offer limited space for incorporating extensive information. See Table 3 for representative Q/A pairs.
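The preprocessing described above can be sketched as two steps: dropping cards that contain images, then packing the remaining front/back pairs into a rephrasing prompt for GPT-3.5-Turbo. The prompt wording and the card field names (`front`, `back`) are illustrative assumptions, not the authors' exact pipeline.

```python
# Sketch of the flashcard preprocessing: cards containing images are dropped,
# and the remaining front/back pairs become rephrasing prompts. Anki embeds
# images as HTML <img> tags inside the card text.

def has_image(card: dict) -> bool:
    return "<img" in card["front"] or "<img" in card["back"]

def build_rephrasing_prompt(card: dict) -> str:
    # Hypothetical prompt; the paper's exact wording is not published here.
    return (
        "Rewrite the following flashcard as a coherent question-answer pair.\n"
        f"Front: {card['front']}\n"
        f"Back: {card['back']}"
    )

cards = [
    {"front": "MOA of metformin?", "back": "Inhibits hepatic gluconeogenesis."},
    {"front": "Identify the structure: <img src='heart.png'>", "back": "Left atrium."},
]
text_cards = [c for c in cards if not has_image(c)]
prompts = [build_rephrasing_prompt(c) for c in text_cards]
```

Each prompt would then be sent to the chat model, and its response stored as a question-answer pair in the dataset.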
2.1.2 Dataset 2: Stack Exchange Medical Sciences
The Stack Exchange dataset consists of 52,475 question-answer pairs obtained from five Stack Exchange forums related to biomedical sciences and related fields:
- 1. Academia: This forum offers insights into research methodologies, scientific publication processes, and career paths within the scientific community. While not directly affiliated with medicine, considering the volume of medical research, it is likely that medical professionals will also consult models pertaining to this subject matter.
- 2. Bioinformatics: As an interdisciplinary field combining biology, computer science, and data analysis, the Bioinformatics forum offers valuable information on the techniques and tools used for analyzing complex biological data, which is increasingly important in modern medical research.
- 3. Biology: Biology covers topics such as genetics, physiology, and molecular biology, which are all relevant to basic medical research. By including this forum, we aim to add core concepts of life sciences to the training data.
- 4. Fitness: This forum addresses the practical aspects of maintaining and improving physical health, including exercise routines, nutrition, and injury prevention. By incorporating the Fitness forum, we introduce models to health-related information that might be directly applicable to patient care and lifestyle recommendations.
- 5. Health: The Health forum covers a broad range of topics related to personal health, disease prevention, and medical treatments, which could be directly transferable to medical care.
To maintain a high level of answer quality, we collected data exclusively from responses that received a minimum of five up-votes within the forum discussions and paired them with their corresponding questions. See Table 4 for representative Q/A pairs.
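The quality filter described above can be sketched as follows. The record layout is a hypothetical simplification of a Stack Exchange data dump; only answers with at least five up-votes are kept and paired with their questions.

```python
# Minimal sketch of the up-vote filter: keep answers with score >= 5 and
# pair each with its question. Field names are illustrative assumptions.

MIN_SCORE = 5

def build_qa_pairs(questions, answers):
    """Pair each sufficiently up-voted answer with its question."""
    by_id = {q["id"]: q for q in questions}
    pairs = []
    for a in answers:
        if a["score"] >= MIN_SCORE and a["question_id"] in by_id:
            pairs.append({
                "question": by_id[a["question_id"]]["title"],
                "answer": a["body"],
            })
    return pairs

questions = [{"id": 1, "title": "Is creatine supplementation safe?"}]
answers = [
    {"question_id": 1, "score": 12, "body": "Current evidence suggests..."},
    {"question_id": 1, "score": 2, "body": "No idea, probably."},
]
pairs = build_qa_pairs(questions, answers)  # only the 12-vote answer survives
```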
2.1.3 Dataset 3: Wikidoc
We incorporated medical question-answer pairs extracted from WikiDoc, a collaborative platform for medical professionals to share and contribute up-to-date medical knowledge. The platform has two main sub-sites, the "Living Textbook" and "Patient Information". The "Living Textbook" contains chapters for various medical specialties, which we crawled. We then used GPT-3.5-Turbo to rephrase each paragraph heading into a question and used the paragraph text as the answer. Patient Information is structured differently, in that each section subheading is already a question, making rephrasing unnecessary. See Table 5 for representative Q/A pairs.
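The two conversion paths can be sketched as below. For the Living Textbook the paper used GPT-3.5-Turbo to turn paragraph titles into questions; a trivial template stands in for that model call here, so the template wording is an assumption.

```python
# Sketch of the two Wikidoc conversion paths: Living Textbook titles are
# rephrased into questions (placeholder template below), while Patient
# Information subheadings are already questions and pass through unchanged.

def textbook_to_qa(title: str, paragraph: str) -> dict:
    # Placeholder for the model-based rephrasing step used in the paper.
    question = f"What is known about {title.lower()}?"
    return {"question": question, "answer": paragraph}

def patient_info_to_qa(heading: str, text: str) -> dict:
    # Subheadings are already phrased as questions, so no rephrasing is needed.
    return {"question": heading, "answer": text}

qa = textbook_to_qa("Aortic stenosis", "Aortic stenosis is a narrowing of...")
```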
2.1.4 Dataset 4: Medical NLP Benchmarks
We additionally use data from open NLP datasets and benchmarks, including:
- 1. The COVID-19 Open Research Dataset Challenge (CORD-19), consisting of more than one million scholarly articles [5]
- 2. Benchmark data from Measuring Massive Multitask Language Understanding [6, 7]
- 3. Training data from the MedQA benchmark, a question answering dataset consisting of medical exam questions [8]
- 4. Training data from the Pubmed Causal Benchmark [9]
- 5. Conversational data from medical forums as presented in [10]
- 6. The OpenAssistant dataset, a crowd-sourced conversational dataset especially targeted toward training models with RLHF
2.2 Model Training
Our models are built upon the LLaMA (Large Language Model Meta AI) foundation models [15]. LLaMA represents a cutting-edge large language model released by Meta, demonstrating their commitment to open science. It is available in various sizes, including 7 billion, 13 billion, 33 billion, and 65 billion parameters. In this study, we fine-tuned the 7 and 13 billion parameter LLaMA variants, adhering to the approach delineated by Taori et al. [11].
We trained each model for five epochs, employing a learning rate of 2e-5 for the 7B model and 1e-5 for the 13B model, using a cosine learning rate scheduler. Gradient accumulation facilitated training with an effective batch size of 256. Given that this training impacts all model parameters, the hardware requirements are substantial. Consequently, we explored alternative training procedures.
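The scheduling and batching arithmetic above can be illustrated directly: a cosine learning-rate decay over the run, and gradient accumulation yielding an effective batch size of 256. The per-device micro-batch size and accumulation steps below are example values, not the paper's.

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Cosine decay from base_lr at step 0 down to 0 at total_steps (no warmup)."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

# Gradient accumulation: sum gradients over several micro-batches before each
# optimizer step, so the effective batch size is their product.
per_device_batch = 4          # assumed micro-batch that fits in GPU memory
accumulation_steps = 64       # assumed number of accumulated micro-batches
effective_batch = per_device_batch * accumulation_steps  # 256, as in the paper

lr_start = cosine_lr(0, 1000, 2e-5)    # full learning rate at step 0
lr_mid = cosine_lr(500, 1000, 2e-5)    # half the base rate at the midpoint
lr_end = cosine_lr(1000, 1000, 2e-5)   # decayed to 0 at the final step
```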
First, we implemented Low-Rank Adaptation (LoRA) for weight updates to adapt the pre-trained language models to our specific tasks. LoRA is a method that involves freezing the pre-trained model weights and incorporating trainable rank decomposition matrices into each layer of the Transformer architecture [12]. This approach substantially diminishes the number of trainable parameters and GPU memory requirements for downstream tasks, making it more efficient compared to full fine-tuning and significantly reducing training time.
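The parameter savings from LoRA follow from simple arithmetic: a frozen weight matrix W of shape (d_in, d_out) is augmented with a trainable low-rank update B @ A, so only r * (d_in + d_out) parameters are trained instead of d_in * d_out. The dimensions below are a toy illustration, not LLaMA's exact layer shapes.

```python
# Toy illustration of LoRA's parameter reduction: instead of updating the
# full d_in x d_out matrix, only the rank-r factors A (r x d_in) and
# B (d_out x r) are trained, and the layer computes x @ (W + B @ A).

d_in, d_out, r = 4096, 4096, 8   # typical hidden size, small LoRA rank

full_params = d_in * d_out            # parameters touched by full fine-tuning
lora_params = r * (d_in + d_out)      # trainable parameters with LoRA
reduction = full_params / lora_params # how many times fewer trainable params
```

With these example shapes, LoRA trains roughly 256 times fewer parameters per adapted layer, which is what makes fine-tuning feasible on modest hardware.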
To further decrease memory and compute demands, we employed 8-bit matrix multiplication for the feed-forward and attention projection layers, along with an 8-bit optimizer. When combined with LoRA, this strategy further reduces the memory needed for training [13, 14]. All models trained with LoRA underwent three epochs of training at a learning rate of 2e-5.
2.3 Evaluation Procedure
To evaluate the performance of the fine-tuned language models, we devised an assessment methodology centered on their zero-shot performance across the United States Medical Licensing Examination (USMLE) Step 1, Step 2, and Step 3 self-assessment datasets. We excluded all questions containing images, as our primary interest lies in the models’ language capabilities, and they lack visual abilities. We instructed the models to present answers in the format "Option: Answer" (e.g., "A: Penicillin"). If a model’s output did not adhere to this format, it was prompted again, up to five times, until a response in the desired format was generated. If the model still failed to provide a response in the desired format, the last response was retained.
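The retry loop described above can be sketched as a small parser plus a bounded re-prompting routine. `ask_model` is a stand-in for the actual model call; the regex and option range A-E are assumptions based on the stated "Option: Answer" format.

```python
import re

# Responses must match "Option: Answer", e.g. "A: Penicillin".
ANSWER_RE = re.compile(r"^\s*([A-E]):\s*(.+)$")

def parse_answer(response: str):
    """Return (option, answer_text) if the response matches the format, else None."""
    m = ANSWER_RE.match(response.strip())
    return (m.group(1), m.group(2)) if m else None

def query_with_retries(ask_model, max_attempts: int = 5):
    """Re-prompt up to max_attempts times; keep the last response regardless."""
    response = ""
    for _ in range(max_attempts):
        response = ask_model()
        if parse_answer(response) is not None:
            break
    return response

# Simulated model that answers correctly only on the second attempt:
replies = iter(["I think penicillin is right.", "A: Penicillin"])
final = query_with_retries(lambda: next(replies))
```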
Interestingly, most of the fine-tuned models typically produced answers in the correct format after the first prompt, while only the base LLaMA models required multiple prompts. We conducted separate evaluations for each model, measuring their accuracy on the USMLE Step 1, Step 2, and Step 3 datasets individually. This approach allowed us to gain a comprehensive understanding of the models’ performance across the various stages of the medical licensing examination.
3 Results
Our findings on the USMLE test set are displayed in Table 2. Fine-tuned LLMs consistently surpassed the performance of their pre-trained-only counterparts. It is worth noting that while LoRA and 8-bit fine-tuning expedited the training process, employing these methods resulted in reduced accuracy.
Table 2: Zero-shot performance on the USMLE self-assessment
4 Discussion and conclusion
In this study, we introduced a novel, high-quality collection of medical text data specifically designed for training instruction-following, medical large language models (LLMs). This dataset serves as a comprehensive resource for enhancing LLM performance in the medical domain, laying the groundwork for potential integration into medical education and practice.
Using our medical text data, we fine-tuned several open-source LLM variants, adopting parameter-efficient tuning methodologies to address limited computing resources [16]. This approach is vital, as full fine-tuning of language model parameters is often unfeasible for most academic institutions. Our study demonstrates the viability of parameter-efficient fine-tuning.
We evaluated LLM performance using the United States Medical Licensing Examination (USMLE) for Steps 1, 2, and 3, which assess medical knowledge at various complexity levels. As expected, performance improved with larger pre-trained models. Applying approximation techniques, such as 8-bit precision and LoRA, during fine-tuning yielded less optimal results. However, due to considerable computational costs, we did not conduct extensive hyperparameter optimization and fine-tuning; thus, it may be possible to achieve performance comparable to vanilla-trained models through more thorough hyperparameter optimization, which we leave for future research.
The availability of additional medical datasets will likely enhance the applicability and performance of these models. Potential applications include extracting structured medical information from unstructured text, supporting medical students’ education through question-answering interactions that reinforce their knowledge and clarify lecture uncertainties, and assisting patients in understanding their health, thereby improving communication between doctors and patients, who often find medical language challenging.
Nevertheless, implementing LLMs for these application scenarios presents challenges and concerns. Ensuring data privacy and compliance with ethical standards is critical when handling sensitive patient data; these concerns can be addressed by deploying models locally within secure hospital networks. Moreover, models must be thoroughly evaluated and safeguarded against potential biases and inaccuracies to prevent unintended consequences in medical decision-making.
A significant limitation is LLMs’ tendency to confabulate or generate text that appears plausible but is factually incorrect [17]. This issue is especially concerning in the medical domain, where disseminating incorrect information can have serious implications for patient care and safety. Guaranteeing the accuracy and reliability of generated information is therefore essential, necessitating rigorous evaluation and continuous monitoring to mitigate confabulation risks and the potential harm it may cause in medical settings.
In conclusion, our work substantially contributes to the field of LLMs in medicine by providing a novel, high-quality medical dataset for research and application purposes. Further, we successfully fine-tuned and evaluated various LLMs, demonstrating that their medical domain performance increases with pre-trained model size and high-quality data availability. This progress paves the way for further exploration and development of LLMs in medicine, with potential implications for medical education, patient care, and healthcare communication.
5 Acknowledgements
References
- [1] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., "Training language models to follow instructions with human feedback," Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
- [2] L. C. Adams, D. Truhn, F. Busch, A. Kader, S. M. Niehues, M. R. Makowski, and K. K. Bressem, "Leveraging GPT-4 for post hoc transformation of free-text radiology reports into structured reporting: A multilingual feasibility study," Radiology, p. 230725, 2023. PMID: 37014240.
- [3] M. Sallam, "The utility of ChatGPT as an example of large language models in healthcare education, research and practice: Systematic review on the future perspectives and potential limitations," medRxiv, 2023.
- [4] P. Lee, S. Bubeck, and J. Petro, "Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine," New England Journal of Medicine, vol. 388, no. 13, pp. 1233–1239, 2023.
- [5] AI2, CZI, MSR, Georgetown, NIH, and The White House, "COVID-19 Open Research Dataset Challenge (CORD-19)," 2019. Kaggle Challenge.
- [6] D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt, "Aligning AI with shared human values," Proceedings of the International Conference on Learning Representations (ICLR), 2021.
- [7] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, "Measuring massive multitask language understanding," Proceedings of the International Conference on Learning Representations (ICLR), 2021.
- [8] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits, "What disease does this patient have? A large-scale open domain question answering dataset from medical exams," arXiv preprint arXiv:2009.13081, 2020.
- [9] B. Yu, Y. Li, and J. Wang, "Detecting causal language use in science findings," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), (Hong Kong, China), Association for Computational Linguistics, Nov. 2019.
- [10] L. Yunxiang, L. Zihan, Z. Kai, D. Ruilong, and Z. You, "ChatDoctor: A medical chat model fine-tuned on LLaMA model using medical domain knowledge," arXiv preprint arXiv:2303.14070, 2023.
- [11] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, "Stanford Alpaca: An instruction-following LLaMA model." https://github.com/tatsu-lab/stanford_alpaca, 2023.
- [12] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-rank adaptation of large language models," arXiv preprint arXiv:2106.09685, 2021.
- [13] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, "LLM.int8(): 8-bit matrix multiplication for transformers at scale," arXiv preprint arXiv:2208.07339, 2022.
- [14] T. Dettmers, M. Lewis, S. Shleifer, and L. Zettlemoyer, "8-bit optimizers via block-wise quantization," International Conference on Learning Representations (ICLR), 2022.
- [15] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., "LLaMA: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
- [16] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, and S. Paul, "PEFT: State-of-the-art parameter-efficient fine-tuning methods." https://github.com/huggingface/peft, 2022.
- [17] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
6 Appendix
Table 3: Representative question from the medical flashcards dataset.
Table 4: Representative question from the Stack Exchange dataset.
Table 5: Representative question from the Wikidoc Living Textbook and Patient Information.