hamsinimk committed · verified
Commit 03875b1 · Parent(s): ea7b956

Update README.md

Files changed (1): README.md (+29 −61)
README.md CHANGED
@@ -16,48 +16,20 @@ This model takes in doctor's notes as inputs and summarizes them into patient-fr
  ## Model Details


- ## Introduction

- Healthcare communication and health literacy is a large gap that exists in the healthcare industry between physicians and patients. I often run into the issue of reading long doctor’s notes that are uploaded to my patient portal which I cannot understand. This is definitely a frustrating experience as a patient because the notes are not only long, but also include a lot of technical jargon surrounding the diagnosis which is overwhelming. I often look to doctors in my family to translate doctor’s notes and understand if / what the next steps are. Instead of having a middle man translate the notes, I’m hoping that the LLM can take doctor’s notes as input and summarize them into short, simple notes. I think current LLMs do need training in this as it is a niche topic and I would want to ensure that accuracy and key details are preserved when providing simple summaries to users/patients. Current LLMs may brush over key details if they haven’t been trained specifically in clinical/doctor’s notes or a large enough dataset to understand the context and style of writing. In my own experience, LLMs are great at summarizing, but can lack specificity or leave out information at times. As mentioned in Medium post by Sahil Ahmed (Data Scientist), LLMs, in general, as well as ones that implement RAG systems are not without their disadvantages. Ahmed notes one such failure point as “context limitation” which happens when many documents are passed through the LLM model which forces the system to “consolidate them to fit the LLM’s input limits, which may lead to truncation or selective prioritization, potentially leaving out crucial information” (Sahin Ahmed, 2024). In this medical use case, it is extremely important to maintain the accuracy for the patient such that key details are not brushed over so the model’s summarized output can be relied on for next steps. To ensure this accuracy, I think developing a LLM that is dedicated to this use case and has been trained specifically on doctor’s notes and summaries is key to avoid noise from other unrelated training data as well.

- ## Data

- After looking more into the training data generation, I noticed that it is tough to find long doctor notes and patient summary pairs. Due to this, I performed synthetic data generation to generate the summaries from existing doctor’s notes. I found a huggingface dataset of 30,000 doctor’s notes (PMCpatientsdata) which I then subset to 1000 rows. As a note, I used google/gemma-3-4b-it as my model for data generation and training. For data generation, I prompted the model as follows:
- System Prompt: Imagine you are a useful medical assistant that is trying to summarize doctor notes that were taken during patient visits into patient friendly summaries that are 3-5 sentences long. The goal is just to summarize the given doctor's note and output a 3-5 sentence summary that captures key details of the note without too much medical jargon."
- User Prompt: Now provide a 3–5 sentence summary for the doctor's note written for a patient's understanding. Doctor's note:{row["Doctor's Note"]}
- I used a subset of the PMC-patients dataset (1000 rows) and set up a for loop to loop through each doctor’s note and generate a summary based on given prompt instructions. After generating the summaries, I saved the doctor’s notes and summary pairs to a .csv file to use later for training purposes. I employed 80/10/10 split for the training-validation-test data to ensure adequate training and evaluation. Essentially, I trained the model on 800 doctor’s notes + summary pairs and then validated / tested on a total of 200 notes + summary pairs.

- ## Methodology

  Based on previous experimentation with LoRA, the method does decently well at increasing the model's ability to perform medical reasoning through finetuning, given that accuracy increased on a medical training task. With LoRA, the model can actually change how it reasons, rather than just being reminded that the task at hand is medical (which is what prompt tuning does). Since LoRA updates a subset of weight matrices using low-rank adaptation and modifies the attention layers, the model can better learn the reasoning patterns needed to answer more complex questions. To prevent catastrophic forgetting and possibly increase overall accuracy, the number of training epochs was increased to 3. Based on these factors, for a complex medical dataset / reasoning task I would choose LoRA as the appropriate finetuning method. After trying 3 different hyperparameter combinations (low-, medium-, and high-capacity LoRA), the medium- and high-capacity configurations performed very similarly in terms of validation and training loss. It didn't make sense to add more parameters with high-capacity LoRA, so I went forward with medium-capacity LoRA, as it provided essentially the same performance. Medium-capacity LoRA used r = 32, alpha = 64, and 15% dropout. During training, the number of epochs was set to 3, the learning rate to 1e-5 (0.00001), and evaluation ran every 200 steps. The auto_find_batch_size parameter was not used; instead, per_device_train_batch_size and per_device_eval_batch_size were set to 1. With evaluation every 200 steps, validation and training loss are calculated every 200 steps up to 800 steps (the training-data size) per epoch, i.e. 2400 steps total across all 3 epochs.

- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ## Evaluation

  | Model | MMLU Philosophy | medqa_4options | xsum | test split |
  |------------------------------------|-----------------|----------------|------|------------|
@@ -67,8 +39,18 @@ This is the model card of a 🤗 transformers model that has been pushed on the
  | mistralai/Mistral-7B-Instruct-v0.2 | 0.77 | 0.37 | 0.77 | 0.85 |
  | | | | | |

- ## Usage and Intended Uses

  This model is meant to be used to summarize doctor's notes. More specifically, it was trained to condense long doctor's notes into 3-5 sentence patient-friendly summaries. It could be used for other long medical text of similar length; however, the model is most familiar with the language of a note describing a patient's diagnosis, health concerns, next steps, and health progression. Overall, the model is meant to be used when a patient receives a long visit note from the doctor that is overwhelming and filled with medical jargon. By passing that note as input, the patient gets back a quick synopsis that omits most of the medical jargon and keeps the key details. This can easily be used to understand the condition and next steps for yourself or a family member.
 
@@ -85,14 +67,14 @@ model = AutoModelForCausalLM.from_pretrained(
  ```


- ## Prompt Format

  The prompt format is two-fold, as both a system and a user prompt are used. This way, the model is given clear instructions on how to word the output (the system prompt) and is also guided on the length and purpose of the output (the user prompt). The format used to train the model and to generate summaries for the test data during evaluation is shown below as an example.

  ```
- The system prompt was: Imagine you are a useful medical assistant that is trying to summarize doctor notes that were taken during patient visits into patient friendly summaries that are 3-5 sentences long.\nThe goal is just to summarize the given doctor's note and output a 3-5 sentence summary that captures key details of the note without too much medical jargon.
- The user prompt was: Now provide a 3–5 sentence summary for the doctor's note written for a patient's understanding.
  Example Doctor's Note for Prompt:
 
@@ -101,7 +83,7 @@ A 70-year-old woman presented in November 2017 to the Emergency Department at Sk
  ```


- ## Expected Output Format

  The expected output format is a 3-5 sentence summary that uses patient-friendly, layman language for ease of understanding. The output keeps key information from the doctor's note so that critical details regarding the patient's health are still disclosed, but in an interpretable form.
 
@@ -111,7 +93,7 @@ Below is an example of the expected output format for a summary:
  A 70-year-old woman was admitted to the hospital because she had a sudden fever, chills, and a skin infection on her arm. She had a history of breast cancer and had experienced similar skin infections several times before. Blood tests showed that the infection was caused by a bacteria from the S. mitis group, which had caused problems in the past. She was treated with antibiotics, and after a few weeks, the infection cleared up. To prevent future infections, she was referred to specialists for her lymphoedema and to the dentist.
  ```

- ## Limitations

  Some limitations of this model include:

 
@@ -121,34 +103,20 @@ Some limitations of this model include:

  3. For any niche conditions/diagnoses that weren't covered in the training data, there is a risk of hallucination, as the model may not be specialized enough to output an accurate summary.

- 4. Since the model just focuses on summarization, ensure that the doctor's notes are redacted of any highly personal information. The model itself will not store this data, but to ensure privacy in the output summary, it is important to take out names or other personal information in the input. The model was trained on anonymous doctor's notes.

- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]

  **APA:**

- [More Information Needed]
-
- ## Glossary [optional]

- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

- [More Information Needed]
-
- ## More Information [optional]

- [More Information Needed]

- ## Model Card Authors [optional]

  [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed]
 
  ## Model Details


+ ## 1. Introduction

+ Healthcare communication and health literacy form a large gap between physicians and patients in the healthcare industry. I often run into long doctor’s notes uploaded to my patient portal that I cannot understand. This is a frustrating experience as a patient: the notes are not only long, but also full of technical jargon surrounding the diagnosis, which is overwhelming. I often ask doctors in my family to translate doctor’s notes and explain whether there are next steps and what they are. Instead of having a middleman translate the notes, I’m hoping an LLM can take doctor’s notes as input and summarize them into short, simple notes. Current LLMs do need training for this, as it is a niche topic, and I want to ensure that accuracy and key details are preserved when providing simple summaries to users/patients. Current LLMs may brush over key details if they haven’t been trained specifically on clinical/doctor’s notes, or on a large enough dataset to understand the context and style of writing. In my own experience, LLMs are great at summarizing but can lack specificity or leave out information at times. As noted in a Medium post by Sahin Ahmed (Data Scientist), LLMs in general, as well as ones that implement RAG systems, are not without disadvantages. Ahmed describes one failure point, “context limitation”, which happens when many documents are passed to the LLM, forcing the system to “consolidate them to fit the LLM’s input limits, which may lead to truncation or selective prioritization, potentially leaving out crucial information” (Sahin Ahmed, 2024). In this medical use case, it is extremely important to maintain accuracy for the patient, so that key details are not brushed over and the model’s summarized output can be relied on for next steps. To ensure this accuracy, I think developing an LLM dedicated to this use case, trained specifically on doctor’s notes and summaries, is key to avoiding noise from unrelated training data.

+ ## 2. Data

+ After looking into training-data generation, I noticed that it is tough to find pairs of long doctor's notes and patient summaries. Because of this, I generated the summaries synthetically from existing doctor’s notes. I found a Hugging Face dataset of 30,000 doctor’s notes (PMC-Patients), which I subset to 1000 rows. As a note, I used google/gemma-3-4b-it as my model for both data generation and training. For data generation, I prompted the model with the system and user prompts included in the "Prompt Format" section below. I did not use a random seed; I simply took a subset of the PMC-Patients dataset (1000 rows). This repo includes the full dataset, so users can explore other splits, as well as the train, validation, and test splits used for this model. I set up a for loop over each doctor’s note to generate a summary from the given prompt instructions. After generating the summaries, I saved the note and summary pairs to a .csv file for later use in training. I employed an 80/10/10 train/validation/test split to ensure adequate training and evaluation: the model was trained on 800 doctor's-note + summary pairs and validated/tested on the remaining 200 pairs.
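The generation loop and 80/10/10 split described above can be sketched as follows. This is a hedged sketch, not the original script: the column names, file name, and the `generate_fn` callback (which would wrap a call to google/gemma-3-4b-it) are assumptions.

```python
# Sketch of the synthetic-summary generation and ordered 80/10/10 split.
import csv

SYSTEM_PROMPT = (
    "Imagine you are a useful medical assistant that is trying to summarize "
    "doctor notes that were taken during patient visits into patient friendly "
    "summaries that are 3-5 sentences long. The goal is just to summarize the "
    "given doctor's note and output a 3-5 sentence summary that captures key "
    "details of the note without too much medical jargon."
)

def user_prompt(note: str) -> str:
    return (
        "Now provide a 3-5 sentence summary for the doctor's note written "
        f"for a patient's understanding. Doctor's note: {note}"
    )

def generate_pairs(notes, generate_fn):
    """Loop over the notes, asking the model (via generate_fn) for one summary each."""
    return [(note, generate_fn(SYSTEM_PROMPT, user_prompt(note))) for note in notes]

def split_80_10_10(rows):
    """80/10/10 train/validation/test split; rows kept in order (no seed, per the card)."""
    n_train, n_val = int(0.8 * len(rows)), int(0.1 * len(rows))
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

def save_pairs(path, pairs):
    """Write (note, summary) pairs to a .csv for later training."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Doctor's Note", "Summary"])
        writer.writerows(pairs)
```

With 1000 rows, `split_80_10_10` yields the 800/100/100 counts described above.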
 
 
 

+ ## 3. Methodology

  Based on previous experimentation with LoRA, the method does decently well at increasing the model's ability to perform medical reasoning through finetuning, given that accuracy increased on a medical training task. With LoRA, the model can actually change how it reasons, rather than just being reminded that the task at hand is medical (which is what prompt tuning does). Since LoRA updates a subset of weight matrices using low-rank adaptation and modifies the attention layers, the model can better learn the reasoning patterns needed to answer more complex questions. To prevent catastrophic forgetting and possibly increase overall accuracy, the number of training epochs was increased to 3. Based on these factors, for a complex medical dataset / reasoning task I would choose LoRA as the appropriate finetuning method. After trying 3 different hyperparameter combinations (low-, medium-, and high-capacity LoRA), the medium- and high-capacity configurations performed very similarly in terms of validation and training loss. It didn't make sense to add more parameters with high-capacity LoRA, so I went forward with medium-capacity LoRA, as it provided essentially the same performance. Medium-capacity LoRA used r = 32, alpha = 64, and 15% dropout. During training, the number of epochs was set to 3, the learning rate to 1e-5 (0.00001), and evaluation ran every 200 steps. The auto_find_batch_size parameter was not used; instead, per_device_train_batch_size and per_device_eval_batch_size were set to 1. With evaluation every 200 steps, validation and training loss are calculated every 200 steps up to 800 steps (the training-data size) per epoch, i.e. 2400 steps total across all 3 epochs.
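The settings listed above can be written down as a configuration sketch. This is only a sketch: the output directory and the (unstated) target modules are assumptions, and argument names can vary across peft/transformers versions.

```python
# Medium-capacity LoRA setup from the card: r=32, alpha=64, 15% dropout,
# 3 epochs, lr 1e-5, batch size 1, evaluation every 200 steps.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=32,                 # low-rank dimension (medium capacity)
    lora_alpha=64,        # scaling factor
    lora_dropout=0.15,    # 15% dropout
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="doctor-note-summarizer",  # assumed name, not from the card
    num_train_epochs=3,
    learning_rate=1e-5,
    per_device_train_batch_size=1,        # auto_find_batch_size not used
    per_device_eval_batch_size=1,
    eval_strategy="steps",                # "evaluation_strategy" in older versions
    eval_steps=200,                       # loss computed every 200 of 800 steps/epoch
)
```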

+ ## 4. Evaluation

  | Model | MMLU Philosophy | medqa_4options | xsum | test split |
  |------------------------------------|-----------------|----------------|------|------------|

  | mistralai/Mistral-7B-Instruct-v0.2 | 0.77 | 0.37 | 0.77 | 0.85 |
  | | | | | |
 

+ To benchmark the model, general, medical-reasoning, and summarization-specific benchmarks were used. For the general benchmark, Massive Multitask Language Understanding (MMLU) Philosophy (Caballar & Stryker, 2025) was chosen, and for the summarization-specific benchmark, Extreme Summarization (XSum) was used to evaluate the model’s ability to generate effective summaries/abstracts from long inputs that may be unstructured or contain a lot of technical language (Narayan et al., 2018). My benchmark plan was as follows:
+
+ - medqa_4options (domain / task specific): assess the model’s ability to perform medical reasoning on complex multiple-choice questions and to understand medical information.
+
+ - MMLU Philosophy: assess the model’s reasoning capability and breadth of knowledge across philosophy (Chugani, 2025) with multiple-choice questions.
+
+ - XSum: assess the model’s ability to generate concise, accurate one-sentence summaries of long inputs without simply extracting text from the input (Narayan et al., 2018). This benchmark focuses on BBC News articles.
+
+ I chose Qwen/Qwen2.5-3B-Instruct and mistralai/Mistral-7B-Instruct-v0.2 as the comparison models since they are similar in size to the google/gemma-3-4b-it baseline model I used. These models can also perform summarization and handle long-context inputs, which is related to my training task. I initially considered them alongside the baseline when deciding which performed best with few-shot prompting, and went with gemma-3 since it provided the most succinct outputs. Looking at the table above, the doctor-note-summarization model performed best on XSum and on the test split compared to the baseline and comparison models, as expected. However, it had slightly lower accuracy on medqa_4options than the Qwen model and slightly lower accuracy on MMLU Philosophy than both comparison models. This is likely because the model is specialized in summarizing doctor’s notes and learning clinical summarization patterns: it is not focused on diagnosing a patient, which would help more with reasoning, and its medical-note focus makes it a little weaker on MMLU Philosophy.

+ ## 5. Usage and Intended Uses

  This model is meant to be used to summarize doctor's notes. More specifically, it was trained to condense long doctor's notes into 3-5 sentence patient-friendly summaries. It could be used for other long medical text of similar length; however, the model is most familiar with the language of a note describing a patient's diagnosis, health concerns, next steps, and health progression. Overall, the model is meant to be used when a patient receives a long visit note from the doctor that is overwhelming and filled with medical jargon. By passing that note as input, the patient gets back a quick synopsis that omits most of the medical jargon and keeps the key details. This can easily be used to understand the condition and next steps for yourself or a family member.

  ```

+ ## 6. Prompt Format

  The prompt format is two-fold, as both a system and a user prompt are used. This way, the model is given clear instructions on how to word the output (the system prompt) and is also guided on the length and purpose of the output (the user prompt). The format used to train the model and to generate summaries for the test data during evaluation is shown below as an example.

  ```
+ System Prompt: Imagine you are a useful medical assistant that is trying to summarize doctor notes that were taken during patient visits into patient friendly summaries that are 3-5 sentences long.\nThe goal is just to summarize the given doctor's note and output a 3-5 sentence summary that captures key details of the note without too much medical jargon.

+ User Prompt: Now provide a 3–5 sentence summary for the doctor's note written for a patient's understanding.

  Example Doctor's Note for Prompt:

  ```
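For inference, the two prompts above can be assembled into a chat-formatted input along the following lines. This is a sketch: the generation settings are assumptions, and the system text is folded into the user turn because some Gemma chat templates reject a separate system role.

```python
# Sketch of assembling the card's prompts for a single note at inference time.
SYSTEM_PROMPT = (
    "Imagine you are a useful medical assistant that is trying to summarize "
    "doctor notes that were taken during patient visits into patient friendly "
    "summaries that are 3-5 sentences long.\nThe goal is just to summarize the "
    "given doctor's note and output a 3-5 sentence summary that captures key "
    "details of the note without too much medical jargon."
)
USER_PROMPT = ("Now provide a 3-5 sentence summary for the doctor's note "
               "written for a patient's understanding.")

def build_messages(note: str) -> list:
    """Combine the fixed system/user prompts with one doctor's note.

    The system text is merged into the user turn, since some Gemma chat
    templates do not accept a separate system role."""
    return [{"role": "user",
             "content": f"{SYSTEM_PROMPT}\n\n{USER_PROMPT}\nDoctor's note: {note}"}]

def summarize(note, model, tokenizer, max_new_tokens=200):
    """Run one note through a loaded model/tokenizer (hypothetical settings)."""
    inputs = tokenizer.apply_chat_template(
        build_messages(note), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
```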

+ ## 7. Expected Output Format

  The expected output format is a 3-5 sentence summary that uses patient-friendly, layman language for ease of understanding. The output keeps key information from the doctor's note so that critical details regarding the patient's health are still disclosed, but in an interpretable form.

  A 70-year-old woman was admitted to the hospital because she had a sudden fever, chills, and a skin infection on her arm. She had a history of breast cancer and had experienced similar skin infections several times before. Blood tests showed that the infection was caused by a bacteria from the S. mitis group, which had caused problems in the past. She was treated with antibiotics, and after a few weeks, the infection cleared up. To prevent future infections, she was referred to specialists for her lymphoedema and to the dentist.
  ```

+ ## 8. Limitations

  Some limitations of this model include:

  3. For any niche conditions/diagnoses that weren't covered in the training data, there is a risk of hallucination, as the model may not be specialized enough to output an accurate summary.

+ 4. Since the model is specifically trained on doctor's-note and summary pairs, it may not perform as well on reasoning tasks or general non-STEM tasks, as it is niche to medical note summarization.

+ ## Citations

  **APA:**

+ Caballar, R., & Stryker, C. (2025, July 22). What are LLM benchmarks? IBM. https://www.ibm.com/think/topics/llm-benchmarks

+ Chugani, V. (2025, July 21). How to understand MMLU scores: The ‘SAT test’ for AI models. Statology. https://www.statology.org/how-to-understand-mmlu-scores-the-sat-test-for-ai-models/

+ Narayan, S., Cohen, S. B., & Lapata, M. (2018). Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. ACL Anthology. https://aclanthology.org/D18-1206/

+ Ahmed, S. (2024, October 29). The common failure points of LLM RAG systems and how to overcome them. Medium. https://medium.com/@sahin.samia/the-common-failure-points-of-llm-rag-systems-and-how-to-overcome-them-926d9090a88f