zou-lab/BioMed-R1-8B

@@ -22,8 +22,7 @@ tags:
 </div>
-Medical reasoning in large language models (LLMs) aims to replicate clinicians' cognitive processes when interpreting patient data and making diagnostic decisions. However, evaluating true reasoning capabilities remains challenging, as widely used benchmarks-such as MedQA-USMLE, MedMCQA, and PubMedQA-often conflate questions requiring medical reasoning with those solvable through factual recall. We address this limitation by systematically disentangling reasoning-heavy from knowledge-heavy questions across 11 biomedical QA benchmarks using a PubMedBERT-based classifier that achieves human-level performance (81\%). Our analysis reveals that only 32.8\% of benchmark questions involve complex reasoning, with the majority focused on factual understanding. Using this stratified dataset, we evaluate recent biomedical reasoning models (HuatuoGPT-o1, MedReason, m1) alongside general-domain models (DeepSeek-R1, o4-mini, Qwen3) and observe a consistent performance gap between knowledge and reasoning—for example, m1 scores 60.5\% vs. 47.1\%, respectively. To assess robustness, we conduct adversarial evaluations where models are prefilled with incorrect answers before being asked to reconsider. Biomedical models show substantial degradation in this setting (e.g., MedReason drops from 44.4\% to 29.3\%), while RL-trained and larger general-domain models are more resilient. Based on these insights, we train BioMed-R1-8B using supervised fine-tuning and reinforcement learning on reasoning-heavy examples. While it achieves the strongest overall and adversarial performance among similarly sized models, there remains ample room for improvement. Incorporating additional reasoning-rich data sources, such as clinical case reports, and training on adversarial or backtracking scenarios—with reinforcement learning to encourage self-correction—may further enhance robustness and reliability.
 <div align=center>
 <img src="reasoning_vs_knowledge.png"  width = "90%" alt="reason_vs_knowledge" align=center/>

 </div>
+Medical reasoning in large language models aims to replicate clinicians' cognitive processes when interpreting patient data and making diagnostic decisions. However, widely used benchmarks, such as MedQA-USMLE, MedMCQA, and PubMedQA, mix questions that require multi-step reasoning with those answerable through factual recall, complicating reasoning evaluation. To address this, we develop a PubMedBERT-based classifier (81\% agreement with expert annotations) to disentangle reasoning-heavy from knowledge-heavy questions across 11 biomedical QA benchmarks, revealing that only 32.8\% require complex reasoning. Using this stratification, we evaluate biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3), and consistently observe lower performance on reasoning-heavy than on knowledge-heavy questions (e.g., HuatuoGPT-o1: 56.9\% on knowledge vs. 44.8\% on reasoning). To assess robustness, we conduct adversarial evaluations where models are prefilled with incorrect answers before being asked to reconsider. Biomedical models show substantial degradation in this setting (e.g., MedReason drops from 50.4\% to 24.4\%), while RL-trained and larger general-domain models are more resilient. Performance declines more on reasoning-heavy questions, highlighting the brittleness of current medical reasoning capabilities. Based on these insights, we train BioMed-R1 models using supervised fine-tuning and reinforcement learning on reasoning-heavy and adversarial examples, encouraging self-correction and backtracking. Our models achieve the strongest overall and adversarial performance among similarly sized biomedical LLMs, yet ample room for improvement remains. Incorporating additional reasoning-rich data sources, such as clinical case reports, and developing training strategies that promote reasoning under uncertainty may further enhance robustness and diagnostic reliability.
 <div align=center>
 <img src="reasoning_vs_knowledge.png"  width = "90%" alt="reason_vs_knowledge" align=center/>