## Model Details

## 1. Introduction

A large gap in healthcare communication and health literacy exists between physicians and patients. I often run into the issue of reading long doctor’s notes uploaded to my patient portal that I cannot understand. This is a frustrating experience as a patient because the notes are not only long, but also full of technical jargon surrounding the diagnosis, which is overwhelming. I often turn to doctors in my family to translate doctor’s notes and figure out whether and what the next steps are. Instead of having a middleman translate the notes, I’m hoping that an LLM can take doctor’s notes as input and summarize them into short, simple notes. I think current LLMs do need training for this, as it is a niche topic, and I want to ensure that accuracy and key details are preserved when providing simple summaries to users/patients. Current LLMs may brush over key details if they haven’t been trained specifically on clinical/doctor’s notes, or on a large enough dataset to understand the context and style of writing. In my own experience, LLMs are great at summarizing but can lack specificity or leave out information at times. As noted in a Medium post by data scientist Sahin Ahmed, LLMs in general, as well as those that implement RAG systems, are not without their disadvantages. Ahmed describes one such failure point as “context limitation,” which happens when many documents are passed to the LLM, forcing the system to “consolidate them to fit the LLM’s input limits, which may lead to truncation or selective prioritization, potentially leaving out crucial information” (Ahmed, 2024). In this medical use case, it is extremely important to maintain accuracy for the patient so that key details are not brushed over and the model’s summarized output can be relied on for next steps. To ensure this accuracy, I think developing an LLM dedicated to this use case, trained specifically on doctor’s notes and summaries, is key to avoiding noise from unrelated training data.

## 2. Data

After looking into training data generation, I noticed that it is tough to find pairs of long doctor’s notes and patient summaries. Because of this, I performed synthetic data generation to produce summaries from existing doctor’s notes. I found a Hugging Face dataset of 30,000 doctor’s notes (PMC-Patients), which I subset to 1,000 rows. As a note, I used google/gemma-3-4b-it as my model for both data generation and training. For data generation, I prompted the model with the system and user prompts included in the "Prompt Format" section below. I did not use a random seed; I used a 1,000-row subset of the PMC-Patients dataset. I have included the full dataset in this repo for users to explore other splits, as well as the train, validation, and test split data used for this model. I looped through each doctor’s note and generated a summary based on the given prompt instructions. After generating the summaries, I saved the note/summary pairs to a .csv file for later use in training. I employed an 80/10/10 train/validation/test split to ensure adequate training and evaluation: the model was trained on 800 note + summary pairs and then validated/tested on a total of 200 note + summary pairs.
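The pipeline above (generate a summary per note, save the pairs to a .csv file, then split 80/10/10) can be sketched as follows. `generate_summary` here is a hypothetical stand-in for the google/gemma-3-4b-it call, so the split and CSV-saving logic can run on its own:

```python
import csv

def generate_summary(note: str) -> str:
    # Hypothetical stand-in for prompting google/gemma-3-4b-it with the
    # system/user prompts from the "Prompt Format" section.
    return "summary of: " + note[:40]

def build_splits(notes, train_frac=0.8, val_frac=0.1):
    # Pair each note with a generated summary, then apply the 80/10/10 split.
    pairs = [(note, generate_summary(note)) for note in notes]
    n_train = int(len(pairs) * train_frac)
    n_val = int(len(pairs) * val_frac)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_val],
            pairs[n_train + n_val:])

# Stands in for the 1,000-row subset of PMC-Patients.
notes = [f"doctor note {i}" for i in range(1000)]
train, val, test = build_splits(notes)

# Save the note/summary pairs for training, mirroring the .csv step above.
with open("note_summary_pairs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Doctor's Note", "Summary"])
    writer.writerows(train + val + test)
```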

## 3. Methodology

Based on previous experimentation with LoRA, I think the method does well at increasing the model’s ability to perform medical reasoning through finetuning, given the accuracy gains I saw on a medical training task. With LoRA, the model is able to actually change how it reasons, rather than just being reminded that the task at hand is medical (which is what prompt tuning does). Since LoRA updates a subset of weight matrices using low-rank adaptation, modifying the attention projections, the model can better learn reasoning patterns and answer more complex questions. To limit catastrophic forgetting and possibly increase overall accuracy, the number of training epochs was set to 3. Given these factors, for a complex medical dataset/reasoning task I chose LoRA as the finetuning method.

After trying 3 hyperparameter combinations (low-, medium-, and high-capacity LoRA), the medium- and high-capacity configurations performed very similarly in terms of validation and training loss. It didn’t make sense to add more parameters with high-capacity LoRA, so I went forward with medium-capacity LoRA as it provided essentially the same performance. Medium-capacity LoRA used r = 32, alpha = 64, and dropout of 15%. During training, the number of epochs was 3 and the learning rate was 1e-5. The auto_find_batch_size parameter was not used; instead, per_device_train_batch_size and per_device_eval_batch_size were both set to 1. With evaluation every 200 steps, validation and training loss were computed every 200 of the 800 steps per epoch (one step per training example, so 2,400 steps total across all 3 epochs).
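The medium-capacity setup described above maps onto the Hugging Face peft/transformers APIs roughly as follows. This is a sketch, not the exact training script: `output_dir` and `target_modules` are assumptions, and the `evaluation_strategy` argument is named `eval_strategy` in newer transformers releases.

```python
# Sketch of the medium-capacity LoRA configuration (r=32, alpha=64,
# dropout=15%) and the training arguments described above.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=32,                 # medium-capacity rank
    lora_alpha=64,        # scaling factor
    lora_dropout=0.15,    # 15% dropout
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)

training_args = TrainingArguments(
    output_dir="doctor-note-summarizer",  # assumed
    num_train_epochs=3,
    learning_rate=1e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    evaluation_strategy="steps",          # eval/train loss every 200 steps
    eval_steps=200,
    logging_steps=200,
)
```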

## 4. Evaluation

| Model | MMLU Philosophy | medqa_4options | xsum | test split |
|------------------------------------|-----------------|----------------|------|------------|
| mistralai/Mistral-7B-Instruct-v0.2 | 0.77 | 0.37 | 0.77 | 0.85 |
To benchmark the model, general, medical-reasoning, and summarization-specific benchmarks were used. For the general benchmark, Massive Multitask Language Understanding (MMLU) Philosophy (Caballar & Stryker, 2025) was chosen, and for the summarization-specific benchmark, Extreme Summarization (XSum) was used to evaluate the model’s ability to generate effective summaries/abstracts from long inputs that may be unstructured or full of technical language (Narayan et al., 2018). My benchmark plan was as follows:

- medqa_4options (domain/task-specific): assesses the model’s ability to perform medical reasoning on complex multiple-choice questions and to understand medical information.
- MMLU Philosophy: assesses the model’s reasoning capability and breadth of knowledge across philosophy (Chugani, 2025) with multiple-choice questions.
- XSum: assesses the model’s ability to generate concise, accurate one-sentence summaries of long inputs without simply extracting text from the input (Narayan et al., 2018). This benchmark focuses on BBC News articles.

I chose Qwen/Qwen2.5-3B-Instruct and mistralai/Mistral-7B-Instruct-v0.2 as the comparison models since they are similar in size to the google/gemma-3-4b-it baseline model I used. These models can also perform summarization and handle long-context inputs, which is related to my training task. I initially considered these models, along with the baseline, when deciding which performed best with few-shot prompting, and went with gemma-3 since it provided the most succinct outputs. Looking at the table above, the doctor-note-summarization model performed best on XSum and the test split compared to the baseline and comparison models, as expected. However, it had slightly lower accuracy on medqa_4options than the Qwen model and slightly lower accuracy on MMLU Philosophy than both comparison models. This could be because the model is specialized in summarizing doctor’s notes and learning clinical summarization patterns: it is not focused on diagnosing a patient, which would help more with reasoning, and its medical-note specialization makes it perform a little worse on MMLU Philosophy.

## 5. Usage and Intended Uses

This model is meant to be used to summarize doctor's notes. More specifically, it was trained to condense long doctor's notes into 3-5 sentence patient-friendly summaries. It could be used for other long medical text of similar length, but the model is most familiar with the language of a note describing a patient's diagnosis, health concerns, next steps, and health progression. Overall, the model is meant for when a patient receives a long visit note that is overwhelming and filled with medical jargon. By passing that note as input, the patient gets back a quick synopsis that omits most of the jargon and keeps the key details. This can easily be used to understand the condition and next steps for yourself or a family member.
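A minimal usage sketch is below, assuming the standard Hugging Face transformers chat API and that the base model's chat template accepts a system turn; the prompt strings follow the "Prompt Format" section. `summarize` is defined but not called here, since it downloads the multi-billion-parameter model.

```python
# Build the two-part prompt (system + user) and, hypothetically, run the model.
SYSTEM_PROMPT = (
    "Imagine you are a useful medical assistant that is trying to summarize "
    "doctor notes that were taken during patient visits into patient friendly "
    "summaries that are 3-5 sentences long.\n"
    "The goal is just to summarize the given doctor's note and output a 3-5 "
    "sentence summary that captures key details of the note without too much "
    "medical jargon."
)

def build_messages(note: str) -> list:
    # System prompt sets the behavior; user prompt carries the note.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",
         "content": "Now provide a 3-5 sentence summary for the doctor's note "
                    "written for a patient's understanding. Doctor's note: " + note},
    ]

def summarize(note: str, model_id: str = "google/gemma-3-4b-it") -> str:
    # Heavyweight part kept inside a function so the sketch is cheap to import.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer.apply_chat_template(
        build_messages(note), add_generation_prompt=True, return_tensors="pt"
    )
    output = model.generate(inputs, max_new_tokens=200)
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

msgs = build_messages("70-year-old woman presented with fever and chills.")
```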
## 6. Prompt Format
The prompt format is two-fold, as both a system and a user prompt are used for this model. This way, the model is given clear instructions on how to act and word the output (the system prompt) and is also guided on the length and purpose of the output (the user prompt). The format used to train the model and to generate summaries for the test data during evaluation is shown below as an example.

```
System Prompt: Imagine you are a useful medical assistant that is trying to summarize doctor notes that were taken during patient visits into patient friendly summaries that are 3-5 sentences long.
The goal is just to summarize the given doctor's note and output a 3-5 sentence summary that captures key details of the note without too much medical jargon.

User Prompt: Now provide a 3-5 sentence summary for the doctor's note written for a patient's understanding. Doctor's note: {row["Doctor's Note"]}

Example Doctor's Note for Prompt:
```
## 7. Expected Output Format
The expected output format is a 3-5 sentence summary that uses patient-friendly, plain language for ease of understanding. The output keeps key information from the doctor's note so that critical details about the patient's health are still disclosed, but in an interpretable form.

Below is an example of the expected output format for a summary:

```
A 70-year-old woman was admitted to the hospital because she had a sudden fever, chills, and a skin infection on her arm. She had a history of breast cancer and had experienced similar skin infections several times before. Blood tests showed that the infection was caused by a bacteria from the S. mitis group, which had caused problems in the past. She was treated with antibiotics, and after a few weeks, the infection cleared up. To prevent future infections, she was referred to specialists for her lymphoedema and to the dentist.
```
## 8. Limitations
Some limitations of this model include:
3. For any niche conditions/diagnoses that weren't covered in the training data, there is a risk of hallucination as the model may not be specialized enough to accurately output a summary.
4. Since the model is specifically trained on doctor's-note and summary pairs, it may not perform as well on reasoning tasks or general non-STEM tasks, as it is specialized for medical-note summarization.
## Citations
**APA:**

Ahmed, S. (2024, October 29). The common failure points of LLM RAG systems and how to overcome them. Medium. https://medium.com/@sahin.samia/the-common-failure-points-of-llm-rag-systems-and-how-to-overcome-them-926d9090a88f

Caballar, R., & Stryker, C. (2025, July 22). What are LLM benchmarks? IBM. https://www.ibm.com/think/topics/llm-benchmarks

Chugani, V. (2025, July 21). How to understand MMLU scores: The ‘SAT test’ for AI models. Statology. https://www.statology.org/how-to-understand-mmlu-scores-the-sat-test-for-ai-models/

Narayan, S., Cohen, S. B., & Lapata, M. (2018). Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. ACL Anthology. https://aclanthology.org/D18-1206/