hamsinimk committed on

Commit 8443cb8 · verified · 1 Parent(s): 571739c

Update README.md

Files changed (1):
  1. README.md +1 -0

README.md CHANGED
@@ -49,6 +49,7 @@ XSum: Assess model’s ability to generate concise and accurate summaries (1 sen
49
 
50
  I chose Qwen/Qwen2.5-3B-Instruct and mistralai/Mistral-7B-Instruct-v0.2 as the comparison models since they are similar in size to the google/gemma-3-4b-it baseline model I used. Furthermore, these models can also perform summarization and handle long-context inputs, both of which are relevant to my training task. I initially considered these models alongside the baseline when deciding which performed best with few-shot prompting, and I went with gemma-3 since it produced the most succinct outputs. As the table above shows, the doctor-note-summarization model performed best on XSum and the test-split data compared to the baseline and comparison models, as expected. However, it had slightly lower accuracy on medqa_4options than the Qwen model and slightly lower accuracy on MMLU Philosophy than both comparison models. These gaps are likely because the model is specialized in summarizing doctors' notes and in clinical summarization patterns: it is not focused on diagnosing patients, which would help more with reasoning, and its medical-note focus makes it a niche model that performs slightly worse on a general benchmark like MMLU Philosophy.
51
 
52
+ As a note, I used BERTScore as the evaluation metric for XSum and the test split, while I used accuracy (via the lm-eval benchmarks) for MMLU Philosophy and medqa_4options.
53
 
54
  ## 5. Usage and Intended Uses
55
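
For context on the metric mentioned in the added line: BERTScore matches each token in the candidate summary to its most similar token in the reference via embedding cosine similarity, then combines those matches into precision, recall, and F1. Below is a minimal pure-Python sketch of that idea; the toy 2-D vectors stand in for real contextual BERT embeddings, and the actual scores in the table would come from the `bert-score` package, not this code.

```python
# Toy sketch of BERTScore's core idea (not the real library): greedy
# cosine-similarity matching between candidate and reference token
# embeddings yields precision, recall, and F1. Real BERTScore uses
# contextual BERT embeddings; made-up 2-D vectors are used here.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def bertscore_f1(cand_emb, ref_emb):
    # Precision: each candidate token matched to its best reference token.
    p = sum(max(cosine(c, r) for r in ref_emb) for c in cand_emb) / len(cand_emb)
    # Recall: each reference token matched to its best candidate token.
    r = sum(max(cosine(c, r) for c in cand_emb) for r in ref_emb) / len(ref_emb)
    return 2 * p * r / (p + r)

# Identical embeddings give a perfect score of 1.0.
emb = [[1.0, 0.0], [0.0, 1.0]]
print(round(bertscore_f1(emb, emb), 4))  # prints 1.0
```

Because it is embedding-based, BERTScore rewards paraphrases that exact-match metrics like ROUGE would penalize, which is why it suits abstractive summarization tasks like XSum.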