---
license: llama3.1
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
tags:
- medical
---

<div align="center">
<h1>
Disentangling Reasoning and Knowledge in Medical Large Language Models
</h1>
</div>

## Introduction

<div align="center">
<img src="overall_workflow.jpg" width="90%" alt="overall_workflow" />
</div>

Medical reasoning in large language models (LLMs) aims to replicate clinicians' cognitive processes when interpreting patient data and making diagnostic decisions. However, evaluating true reasoning capabilities remains challenging, as widely used benchmarks such as MedQA-USMLE, MedMCQA, and PubMedQA often conflate questions requiring medical reasoning with those solvable through factual recall. We address this limitation by systematically disentangling reasoning-heavy from knowledge-heavy questions across 11 biomedical QA benchmarks, using a PubMedBERT-based classifier that achieves human-level performance (81%). Our analysis reveals that only 32.8% of benchmark questions involve complex reasoning; the majority focus on factual understanding. Using this stratified dataset, we evaluate recent biomedical reasoning models (HuatuoGPT-o1, MedReason, m1) alongside general-domain models (DeepSeek-R1, o4-mini, Qwen3) and observe a consistent gap between knowledge and reasoning performance (for example, m1 scores 60.5% vs. 47.1%, respectively).

To assess robustness, we conduct adversarial evaluations in which models are prefilled with incorrect answers before being asked to reconsider. Biomedical models degrade substantially in this setting (e.g., MedReason drops from 44.4% to 29.3%), while RL-trained and larger general-domain models are more resilient. Based on these insights, we train BioMed-R1-8B using supervised fine-tuning and reinforcement learning on reasoning-heavy examples. While it achieves the strongest overall and adversarial performance among similarly sized models, there remains ample room for improvement. Incorporating additional reasoning-rich data sources, such as clinical case reports, and training on adversarial or backtracking scenarios, with reinforcement learning to encourage self-correction, may further enhance robustness and reliability.

<div align="center">
<img src="reasoning_vs_knowledge.png" width="90%" alt="reason_vs_knowledge" />
</div>

BioMed-R1 can be used just like `Llama-3.1-8B-Instruct`. You can deploy it with tools like [vLLM](https://github.com/vllm-project/vllm) or [SGLang](https://github.com/sgl-project/sglang), or run direct inference:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer (use the full Hugging Face repo id if needed).
model = AutoModelForCausalLM.from_pretrained(
    "BioMed-R1", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("BioMed-R1")

input_text = "Does vagus nerve contribute to the development of steatohepatitis and obesity in phosphatidylethanolamine N-methyltransferase deficient mice?"
messages = [{"role": "user", "content": input_text}]

# Render the chat template as text, then tokenize and generate.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## 🙏🏼 Acknowledgement

We gratefully acknowledge the contributions of [HuatuoGPT-o1](https://github.com/FreedomIntelligence/HuatuoGPT-o1), [MedReason](https://github.com/UCSC-VLAA/MedReason), and [M1](https://github.com/UCSC-VLAA/m1).
We also thank the developers of the outstanding tools [Curator](https://github.com/bespokelabsai/curator), [TRL](https://github.com/huggingface/trl), [vLLM](https://github.com/vllm-project/vllm), and [SGLang](https://github.com/sgl-project/sglang), which made this work possible.

## 📖 Citation

```bibtex
@article{thapa2025disentangling,
  title={Disentangling Reasoning and Knowledge in Medical Large Language Models},
  author={Thapa, Rahul and Wu, Qingyang and Wu, Kevin and Zhang, Harrison and Zhang, Angela and Wu, Eric and Ye, Haotian and Bedi, Suhana and Aresh, Nevin and Boen, Joseph and Reddy, Shriya and Athiwaratkun, Ben and Song, Shuaiwen Leon and Zou, James},
  journal={arXiv preprint arXiv:2505.11462},
  year={2025},
  url={https://arxiv.org/abs/2505.11462}
}
```