Update README.md
README.md
This model takes in doctor's notes as inputs and summarizes them into patient-friendly summaries.
## Introduction
Healthcare communication and health literacy form a large gap between physicians and patients. I often run into the issue of reading long doctor's notes uploaded to my patient portal that I cannot understand. This is a frustrating experience as a patient because the notes are not only long but also full of technical jargon surrounding the diagnosis, which is overwhelming. I often turn to the doctors in my family to translate the notes and work out whether there are next steps and what they are. Instead of relying on an intermediary, I am hoping an LLM can take doctor's notes as input and summarize them into short, simple notes.

I think current LLMs do need training for this, as it is a niche topic, and I want to ensure that accuracy and key details are preserved when providing simple summaries to users/patients. LLMs that have not been trained specifically on clinical/doctor's notes, or on a large enough dataset to learn their context and style of writing, may brush over key details. In my own experience, LLMs are great at summarizing but can lack specificity or leave out information. As data scientist Sahin Ahmed notes in a Medium post, LLMs in general, as well as those that implement RAG systems, are not without their disadvantages. One failure point he identifies is "context limitation": when many documents are passed to the LLM, the system must "consolidate them to fit the LLM's input limits, which may lead to truncation or selective prioritization, potentially leaving out crucial information" (Sahin Ahmed, 2024).

In this medical use case, it is extremely important to maintain accuracy for the patient, such that key details are not brushed over and the model's summarized output can be relied on for next steps. To ensure this accuracy, I think the key is an LLM dedicated to this use case and trained specifically on doctor's notes and summaries, which also avoids noise from unrelated training data.
## Data
After looking into generating the training data, I noticed that it is tough to find pairs of long doctor's notes and patient summaries. Because of this, I generated the summaries synthetically from existing doctor's notes. I found a Hugging Face dataset of 30,000 doctor's notes (PMC-Patients), which I subset to 1,000 rows. As a note, I used google/gemma-3-4b-it as my model for both data generation and training. For data generation, I prompted the model as follows:
**System prompt:** Imagine you are a useful medical assistant that is trying to summarize doctor notes that were taken during patient visits into patient-friendly summaries that are 3-5 sentences long. The goal is just to summarize the given doctor's note and output a 3-5 sentence summary that captures key details of the note without too much medical jargon.
**User prompt:** Now provide a 3-5 sentence summary for the doctor's note written for a patient's understanding. Doctor's note: {row["Doctor's Note"]}
I set up a for loop over the 1,000-row PMC-Patients subset, generating a summary for each doctor's note according to the prompt instructions above, and then saved the note-summary pairs to a .csv file for training. I used an 80/10/10 train/validation/test split to ensure adequate training and evaluation: the model was trained on 800 note-summary pairs and validated/tested on the remaining 200. A minimal sketch of this pipeline is shown below.
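The sketch below assumes the transformers chat-style text-generation pipeline; the file names, generation settings, and the sklearn split call are illustrative assumptions. Only the prompts, the 1,000-row subset, the "Doctor's Note" column, and the 80/10/10 split come from this card.

```python
# Sketch only: file names and generation settings are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import pipeline

SYSTEM_PROMPT = (
    "Imagine you are a useful medical assistant that is trying to summarize "
    "doctor notes that were taken during patient visits into patient friendly "
    "summaries that are 3-5 sentences long. The goal is just to summarize the "
    "given doctor's note and output a 3-5 sentence summary that captures key "
    "details of the note without too much medical jargon."
)

generator = pipeline("text-generation", model="google/gemma-3-4b-it")

# Hypothetical CSV holding the 1,000-row PMC-Patients subset.
notes = pd.read_csv("pmc_patients_subset.csv").head(1000)

pairs = []
for _, row in notes.iterrows():
    note = row["Doctor's Note"]
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": "Now provide a 3-5 sentence summary for the doctor's "
                       "note written for a patient's understanding. "
                       f"Doctor's note: {note}",
        },
    ]
    out = generator(messages, max_new_tokens=256)
    # With chat-style input, the pipeline returns the full message list;
    # the last message is the assistant's generated summary.
    summary = out[0]["generated_text"][-1]["content"]
    pairs.append({"Doctor's Note": note, "Summary": summary})

df = pd.DataFrame(pairs)
df.to_csv("notes_summary_pairs.csv", index=False)

# 80/10/10 train/validation/test split: 800/100/100 pairs.
train_df, holdout = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(holdout, test_size=0.5, random_state=42)
```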
## Methodology
Based on previous experimentation with LoRA, the method does decently well at increasing the model's ability to perform medical reasoning through fine-tuning, judging by the accuracy gains it produced on a medical training task. With LoRA, the model can actually change how it reasons, rather than merely being conditioned on the fact that the task at hand is medical (which is what prompt tuning amounts to). Because LoRA updates a subset of weight matrices through low-rank adaptation, modifying the attention projections, the model can better internalize reasoning patterns and answer more complex questions. To prevent catastrophic forgetting and possibly increase overall accuracy, the number of training epochs was increased to 3. Given these factors, for a complex medical dataset and reasoning task, I chose LoRA as the appropriate fine-tuning method.

After trying three hyperparameter combinations (low-, medium-, and high-capacity LoRA), the medium- and high-capacity configurations performed very similarly in terms of validation and training loss. It did not make sense to add more parameters with high-capacity LoRA, so I went forward with medium-capacity LoRA, which provided essentially the same performance. The medium-capacity configuration used r = 32, alpha = 64, and 15% dropout. During training, the number of epochs was set to 3, the learning rate to 1e-5 (0.00001), and evaluation to every 200 steps. The auto_find_batch_size parameter was not used; instead, per_device_train_batch_size and per_device_eval_batch_size were both set to 1. As a note, evaluating every 200 steps means that validation and training loss are calculated every 200 of the 800 steps per epoch (one step per training example at batch size 1), for 2,400 steps in total across all 3 epochs. A sketch of this configuration is shown below.
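A minimal sketch of the medium-capacity configuration with PEFT and the transformers Trainer API. The target modules, output path, and model class are assumptions not stated in this card; only the hyperparameters named above are taken from it.

```python
# Sketch only: everything except r, alpha, dropout, epochs, learning rate,
# eval interval, and batch sizes is an assumption.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# gemma-3-4b-it is a multimodal checkpoint; depending on the transformers
# version, a vision-language class such as Gemma3ForConditionalGeneration
# may be needed instead.
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")

lora_config = LoraConfig(
    r=32,                    # medium-capacity rank
    lora_alpha=64,
    lora_dropout=0.15,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="gemma3-note-summarizer",  # hypothetical
    num_train_epochs=3,
    learning_rate=1e-5,
    per_device_train_batch_size=1,   # auto_find_batch_size deliberately unused
    per_device_eval_batch_size=1,
    eval_strategy="steps",
    eval_steps=200,                  # loss computed every 200 of 2,400 total steps
)
```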
### Model Description
## Evaluation
## Usage and Intended Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

[More Information Needed]
## Prompt Format
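This card does not yet pin down an inference prompt; presumably it mirrors the chat format used for data generation (see the Data section): a system instruction requesting a 3-5 sentence patient-friendly summary, then a user turn ending with the doctor's note. A hypothetical example:

```python
# Hypothetical prompt, mirroring the data-generation format.
SYSTEM_PROMPT = "..."  # the full system prompt given in the Data section

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": "Now provide a 3-5 sentence summary for the doctor's "
                   "note written for a patient's understanding. "
                   "Doctor's note: <note text>",
    },
]
```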
## Expected Output Format
Use the code below to get started with the model.
[More Information Needed]
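Until an official snippet is added, here is a minimal sketch of loading and calling the model, assuming the LoRA adapter is published under a hypothetical repo id:

```python
# Sketch only: the adapter repo id below is hypothetical.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")
model = PeftModel.from_pretrained(base, "your-username/gemma3-note-summarizer")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")

SYSTEM_PROMPT = "..."  # the full system prompt given in the Data section
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Now provide a 3-5 sentence summary for the "
                                "doctor's note written for a patient's "
                                "understanding. Doctor's note: <note text>"},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
# Expected output: a plain-text, jargon-free summary, 3-5 sentences long.
```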
## Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
## Citation [optional]