1120JJ committed
Commit a1d4c77 · verified · 1 parent: e2a9570

Update README.md

Files changed (1): README.md (+4 −2)
README.md CHANGED
@@ -1,5 +1,5 @@
 ---
-license: apache-2.0
+license: mit
 base_model:
 - Qwen/Qwen2.5-32B-Instruct
 ---
@@ -8,6 +8,8 @@ this model is related to following work:
 ## MedResearcher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework
 
 [![arXiv](https://img.shields.io/badge/arxiv-2508.14880-blue)](https://arxiv.org/abs/2508.14880)
+[![github](https://img.shields.io/badge/github-MedResearcher-orange)](https://github.com/AQ-MedAI/MedResearcher-R1)
+[![license](https://img.shields.io/badge/license-mit-white)](https://github.com/AQ-MedAI/MedResearcher-R1/blob/main/LICENSE)
 
 ### author list
 > Ailing Yu, Lan Yao, Jingnan Liu, Zhe Chen, Jiajun Yin, Yuan Wang, Xinhao Liao, Zhiling Ye, Ji Li, Yun Yue, Hansong Xiao, Hualei Zhou, Chunxiao Guo, Peng Wei, Jinjie Gu
@@ -16,7 +18,7 @@ this model is related to following work:
 > Recent developments in Large Language Model (LLM)-based agents have shown impressive capabilities spanning multiple domains, exemplified by deep research systems that demonstrate superior performance on complex information-seeking and synthesis tasks. While general-purpose deep research agents have shown impressive capabilities, they struggle significantly with medical domain challenges—the MedBrowseComp benchmark reveals even GPT-o3 deep research, the leading proprietary deep research system, achieves only 25.5% accuracy on complex medical queries. The key limitations are: (1) insufficient dense medical knowledge for clinical reasoning, and (2) lack of medical-specific retrieval tools. We present a medical deep research agent that addresses these challenges through two core innovations. First, we develop a novel data synthesis framework using medical knowledge graphs, extracting longest chains from subgraphs around rare medical entities to generate complex multi-hop QA pairs. Second, we integrate a custom-built private medical retrieval engine alongside general-purpose tools, enabling accurate medical information synthesis. Our approach generates 2,100 diverse trajectories across 12 medical specialties, each averaging 4.2 tool interactions. Through a two-stage training paradigm combining supervised fine-tuning and online reinforcement learning with composite rewards, our open-source 32B model achieves competitive performance on general benchmarks (GAIA: 53.4, xBench: 54), comparable to GPT-4o-mini, while outperforming significantly larger proprietary models. More importantly, we establish new state-of-the-art on MedBrowseComp with 27.5% accuracy, surpassing leading closed-source deep research systems including O3 deepresearch, substantially advancing medical deep research capabilities. Our work demonstrates that strategic domain-specific innovations in architecture, tool design, and training data construction can enable smaller open-source models to outperform much larger proprietary systems in specialized domains. Code and datasets will be released to facilitate further research.
 
 ## Run Evaluation
-
+> If you would like to use our model for inference and evaluation, please refer to our GitHub repo [![github](https://img.shields.io/badge/github-MedResearcher-orange)](https://github.com/AQ-MedAI/MedResearcher-R1). We provide complete evaluation tools and code in the EvaluationPipeline so that you can verify performance on common benchmarks (such as gaia-103-text) or on your own datasets.
 
 
 ## ✍️Citation
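The abstract mentions extracting longest chains from knowledge-graph subgraphs around rare medical entities to seed multi-hop QA pairs. The idea can be sketched as below; this is a hypothetical illustration, not the authors' released code, and the `longest_chain` helper plus the example triples (Erdheim-Chester disease, BRAF, vemurafenib) are invented for the demo:

```python
# Hypothetical sketch: given a small knowledge-graph subgraph around a rare
# entity, find the longest simple relation chain starting from that entity.
# Each hop along the chain can then become one step of a multi-hop QA pair.
from collections import defaultdict

def longest_chain(edges, start):
    """Depth-first search for the longest simple path from `start`.

    `edges` is a list of (head, relation, tail) triples; the returned chain
    alternates entities and relations: [e0, r0, e1, r1, e2, ...].
    """
    graph = defaultdict(list)
    for head, rel, tail in edges:
        graph[head].append((rel, tail))

    best = [start]
    def dfs(node, path, visited):
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        for rel, tail in graph[node]:
            if tail not in visited:
                dfs(tail, path + [rel, tail], visited | {tail})
    dfs(start, [start], {start})
    return best

# Toy subgraph around a rare disease entity (illustrative triples only).
edges = [
    ("Erdheim-Chester disease", "associated_gene", "BRAF"),
    ("BRAF", "targeted_by", "vemurafenib"),
    ("vemurafenib", "approved_for", "melanoma"),
    ("Erdheim-Chester disease", "symptom", "bone pain"),
]
chain = longest_chain(edges, "Erdheim-Chester disease")
print(" -> ".join(chain))
```

A question writer would then phrase the chain end-to-end ("Which indication is treated by the drug targeting the gene associated with ...?") so that answering requires every intermediate hop.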