Update README.md
README.md CHANGED
@@ -13,13 +13,13 @@ this model is related to following work:
[![license](https://img.shields.io/badge/license-mit-white)](https://github.com/AQ-MedAI/MedResearcher-R1/blob/main/LICENSE)

### Author List

Ailing Yu, Lan Yao, Jingnan Liu, Zhe Chen, Jiajun Yin, Yuan Wang, Xinhao Liao, Zhiling Ye, Ji Li, Yun Yue, Hansong Xiao, Hualei Zhou, Chunxiao Guo, Peng Wei, Jinjie Gu

### Abstract

Recent developments in Large Language Model (LLM)-based agents have shown impressive capabilities spanning multiple domains, exemplified by deep research systems that demonstrate superior performance on complex information-seeking and synthesis tasks. Yet general-purpose deep research agents struggle significantly with medical domain challenges: the MedBrowseComp benchmark reveals that even GPT-o3 deep research, the leading proprietary deep research system, achieves only 25.5% accuracy on complex medical queries. The key limitations are (1) insufficient dense medical knowledge for clinical reasoning, and (2) the lack of medical-specific retrieval tools. We present a medical deep research agent that addresses these challenges through two core innovations. First, we develop a novel data synthesis framework using medical knowledge graphs: we extract the longest chains from subgraphs around rare medical entities and use them to generate complex multi-hop QA pairs. Second, we integrate a custom-built private medical retrieval engine alongside general-purpose tools, enabling accurate medical information synthesis. Our approach generates 2,100 diverse trajectories across 12 medical specialties, each averaging 4.2 tool interactions. Through a two-stage training paradigm combining supervised fine-tuning and online reinforcement learning with composite rewards, our open-source 32B model achieves competitive performance on general benchmarks (GAIA: 53.4, xBench: 54), comparable to GPT-4o-mini, while outperforming significantly larger proprietary models. More importantly, we establish a new state of the art on MedBrowseComp with 27.5% accuracy, surpassing leading closed-source deep research systems including o3 deep research, substantially advancing medical deep research capabilities. Our work demonstrates that strategic domain-specific innovations in architecture, tool design, and training data construction can enable smaller open-source models to outperform much larger proprietary systems in specialized domains. Code and datasets will be released to facilitate further research.
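
To make the chain-extraction idea above concrete, here is a minimal, hypothetical sketch (not the released pipeline) of pulling the longest relation chain out of a toy knowledge-graph neighborhood with `networkx`; the entities, relations, and the `longest_chain` helper are illustrative assumptions only:

```python
# Illustrative sketch: find the longest simple path starting from a
# rare entity in a small knowledge subgraph, then read off the relation
# chain that a multi-hop QA pair could be verbalized from.
import networkx as nx

# Toy subgraph around a rare entity (edges and relations are examples).
kg = nx.DiGraph()
kg.add_edge("Fabry disease", "GLA gene", relation="caused_by_mutation_in")
kg.add_edge("GLA gene", "alpha-galactosidase A", relation="encodes")
kg.add_edge("alpha-galactosidase A", "globotriaosylceramide", relation="degrades")
kg.add_edge("globotriaosylceramide", "renal failure", relation="accumulation_leads_to")

def longest_chain(graph: nx.DiGraph, source: str) -> list[str]:
    """Return the longest simple path in `graph` that starts at `source`."""
    best = [source]
    for target in graph.nodes:
        if target == source:
            continue
        for path in nx.all_simple_paths(graph, source, target):
            if len(path) > len(best):
                best = path
    return best

chain = longest_chain(kg, "Fabry disease")
relations = [kg.edges[u, v]["relation"] for u, v in zip(chain, chain[1:])]

# The final node is the answer; the composed relations form the
# multi-hop question, e.g. "Mutations in the gene causing Fabry disease
# affect an enzyme whose substrate accumulation leads to what condition?"
print(chain)      # ['Fabry disease', ..., 'renal failure']
print(relations)
```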
## Run Evaluation

If you would like to use our model for inference and evaluation, please refer to our GitHub repo: [AQ-MedAI/MedResearcher-R1](https://github.com/AQ-MedAI/MedResearcher-R1). We provide complete evaluation tools and code in the EvaluationPipeline so that you can verify performance on common leaderboards (such as gaia-103-text) or on your own datasets.
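
For quick local inference outside the EvaluationPipeline, a minimal Hugging Face `transformers` sketch might look like the following; note that the model id `AQ-MedAI/MedResearcher-R1-32B` is an assumption based on the repo name, so verify the exact identifier on the Hub:

```python
# Minimal local-inference sketch (not the official evaluation code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AQ-MedAI/MedResearcher-R1-32B"  # assumed id; check the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Which gene is mutated in Fabry disease?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0, inputs.shape[-1]:], skip_special_tokens=True))
```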
## ✍️Citation