unicornftk committed on
Commit
621894b
·
verified ·
1 Parent(s): bd27df8

Update README.md

  1. README.md +110 -3
README.md CHANGED
@@ -1,3 +1,110 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ language: en
4
+ pipeline_tag: text-generation
5
+ tags:
6
+ - conversational
7
+ - text-generation
8
+ - medical
9
+ - diagnosis
10
+ - agent
11
+ - reinforcement-learning
12
+ base_model: Qwen3-8B
13
+ datasets:
14
+ - HealthBench
15
+ - MAQuE
16
+ - MedQA
17
+ - MMLU
18
+ paper: 2510.04284
19
+ model_name: Doctor-R1
20
+ ---
21
+
22
+ # Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning
23
+
24
+ **Doctor-R1** is an AI doctor agent trained to conduct strategic, multi-turn patient inquiries to guide its diagnostic decision-making. Unlike traditional models that excel at static medical QA, Doctor-R1 is designed to master the complete, dynamic consultation process, unifying the two core skills of a human physician: communication and decision-making.
25
+
26
+ This model is an 8B-parameter agent built on **Qwen3-8B** and fine-tuned with a novel **Experiential Agentic Reinforcement Learning** framework.
27
+
28
+ ## ✨ Key Features
29
+
30
+ * **Unified Clinical Skills:** The first agent framework to holistically integrate two core clinical skills, **strategic patient inquiry** and **accurate medical decision-making**, within a single model.
31
+ * **Experiential Reinforcement Learning:** A novel closed-loop framework where the agent learns and improves from an accumulating repository of its own high-quality experiences.
32
+ * **Dual-Competency Reward System:** A sophisticated two-tiered reward architecture that separately optimizes for both conversational quality (soft skills) and diagnostic accuracy (hard skills), featuring a "safety-first" veto system.
33
+ * **State-of-the-Art Performance:** Outperforms leading open-source models, including ones with 32B or more parameters, on the dynamic benchmarks HealthBench and MAQuE while using only 8B parameters.
34
+
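The dual-competency reward with its "safety-first" veto can be pictured with a small sketch. This is only an illustration under stated assumptions, not the reward used to train Doctor-R1: the `TurnScores` fields, the equal weighting, and the veto penalty of `-1.0` are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class TurnScores:
    """Hypothetical per-response scores, each assumed to lie in [0, 1]."""
    soft: float             # conversational quality (soft skills)
    hard: float             # diagnostic accuracy (hard skills)
    safety_violation: bool  # True if the response is judged unsafe


# Assumed constants for illustration only; the README does not state them.
SOFT_WEIGHT = 0.5
VETO_PENALTY = -1.0


def combined_reward(scores: TurnScores,
                    soft_weight: float = SOFT_WEIGHT,
                    veto_penalty: float = VETO_PENALTY) -> float:
    """Blend soft and hard scores, unless the safety veto fires."""
    # Safety-first: an unsafe response gets the penalty regardless of
    # how well it scores on the other two dimensions.
    if scores.safety_violation:
        return veto_penalty
    return soft_weight * scores.soft + (1.0 - soft_weight) * scores.hard


safe = combined_reward(TurnScores(soft=0.8, hard=0.6, safety_violation=False))
unsafe = combined_reward(TurnScores(soft=1.0, hard=1.0, safety_violation=True))
print(round(safe, 2), unsafe)  # prints: 0.7 -1.0
```

Checking the veto before blending means a high communication score can never offset an unsafe answer, which is one plausible reading of "safety-first".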
35
+ ## 🏆 Leaderboards
36
+
37
+ Doctor-R1 achieves the best results among open-source models and, on HealthBench, surpasses proprietary models such as Grok-4 and GPT-4.1 in average score. It leads on the dynamic benchmarks (HealthBench, MAQuE) while retaining strong foundational knowledge on static QA tasks (MedQA, MMLU).
38
+
39
+ | Benchmark | Key Metric | Doctor-R1 | Best Open-Source (>=32B) |
40
+ | :----------------- | :--------- | :-------: | :----------------------: |
41
+ | **HealthBench** | Avg. Score | **36.29** | 33.16 |
42
+ | **MAQuE** | Accuracy | **60.00** | 57.00 |
43
+ | **MedQA** | Accuracy | **83.50** | 81.50 |
44
+ | **MMLU (Medical)** | Accuracy | **85.00** | 84.00 |
45
+
46
+ The detailed breakdown for **HealthBench Main (Dynamic Consultation)** is shown below:
47
+
48
+ | Model | Avg. Score | Accuracy | Comm. Quality | Context Aware. |
49
+ | :------------------------ | :--------: | :-------: | :-----------: | :------------: |
50
+ | **GPT-o3** (Proprietary) | 38.91 | 40.31 | 64.78 | 48.09 |
51
+ | **Doctor-R1 (8B)** | **36.29** | **37.84** | **64.15** | **49.24** |
52
+ | Baichuan-M2-32B | 33.16 | 33.95 | 58.01 | 46.80 |
53
+ | Grok-4 (Proprietary) | 33.03 | 37.95 | 61.35 | 45.62 |
54
+ | GPT-4.1 (Proprietary) | 31.18 | 34.78 | 60.65 | 44.81 |
55
+ | UltraMedical-8B | 22.19 | 25.50 | 57.40 | 40.26 |
56
+ | **Base Model (Qwen3-8B)** | 25.13 | 28.57 | 49.35 | 43.00 |
57
+
58
+
59
+
60
+ ## 👥 Human Evaluation
61
+
62
+ To validate that our quantitative results align with user experience, we conducted a pairwise human preference evaluation against other leading models. The results show a decisive preference for Doctor-R1, especially in patient-centric metrics.
63
+
64
+ ![](assets/human.png)
65
+
66
+
67
+
68
+ ## 🔬 Ablation Studies
69
+
70
+ Our ablation studies validate the critical contributions of our framework's key components.
71
+
72
+ ***Impact of Experience Retrieval Mechanism.*** The results show that our full retrieval mechanism with reward and novelty filtering provides a significant performance boost over both a no-experience baseline and standard similarity-based retrieval, especially in communication skills.
73
+
74
+ <p align="center">
75
+ <img src="assets/radar_exp.jpg" style="width:60%;" />
76
+ </p>
77
+
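To make the difference between plain similarity-based retrieval and retrieval with reward and novelty filtering concrete, here is a minimal sketch. It is not the authors' implementation: the `Experience` record, the thresholds `min_reward` and `max_redundancy`, and the order of the filtering steps are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Experience:
    """Hypothetical stored experience: text, embedding, and its reward."""
    text: str
    embedding: Sequence[float]
    reward: float


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: Sequence[float], repo: List[Experience], k: int = 2,
             min_reward: float = 0.5,
             max_redundancy: float = 0.95) -> List[Experience]:
    """Return up to k experiences that are relevant, high-reward, and novel."""
    if k <= 0:
        return []
    # Reward filter: keep only experiences that scored well (assumed threshold).
    candidates = [e for e in repo if e.reward >= min_reward]
    # Similarity ranking: most relevant to the current query first.
    candidates.sort(key=lambda e: cosine(query, e.embedding), reverse=True)
    selected: List[Experience] = []
    for exp in candidates:
        # Novelty filter: skip near-duplicates of already selected items.
        if all(cosine(exp.embedding, s.embedding) < max_redundancy
               for s in selected):
            selected.append(exp)
        if len(selected) == k:
            break
    return selected


repo = [
    Experience("A", [1.0, 0.0], reward=0.9),
    Experience("B", [0.99, 0.01], reward=0.8),  # near-duplicate of A
    Experience("C", [0.7, 0.7], reward=0.7),
    Experience("D", [1.0, 0.1], reward=0.1),    # relevant but low reward
]
print([e.text for e in retrieve([1.0, 0.0], repo, k=2)])  # prints: ['A', 'C']
```

Pure similarity ranking would return A and D here (or A and B once D is excluded); the two extra filters swap in C, a less similar but high-reward and non-redundant experience.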
78
+ ***Impact of Patient Agent Scaling.*** We observe a strong, positive correlation between the number of simulated patient interactions during training and the agent's final performance. This validates that our agentic framework effectively learns and improves from a large volume of diverse experiences.
79
+
80
+ ![](assets/patient_scaling.png)
81
+
82
+
83
+
84
+
85
+ ## 📜 Citation
86
+
87
+ If you find our work useful in your research, please consider citing our paper:
88
+
89
+ ```bibtex
90
+ @misc{lai2025doctorr1masteringclinicalinquiry,
91
+ title={Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning},
92
+ author={Yunghwei Lai and Kaiming Liu and Ziyue Wang and Weizhi Ma and Yang Liu},
93
+ year={2025},
94
+ eprint={2510.04284},
95
+ archivePrefix={arXiv},
96
+ primaryClass={cs.AI},
97
+ url={https://arxiv.org/abs/2510.04284},
98
+ }
99
+
100
+ ```
101
+
102
+
103
+
104
+ ## 💬 Contact & Questions
105
+
106
+ For collaborations or inquiries, please contact [**laiyunghwei@gmail.com**](mailto:laiyunghwei@gmail.com). You’re also welcome to open an issue or join the discussion in this repository; we value your insights and contributions to **Doctor-R1**.
107
+
108
+ Stay tuned and join our community as we push the boundaries of intelligent healthcare. Together, let’s make medical AI safer, smarter, and more human. 🤝
109
+
110
+