File size: 7,159 Bytes
1fc2057 64d5f02 1fc2057 13e5d98 75f65e9 13e5d98 182fb0c 13e5d98 7964f8c 13e5d98 5e2ea5e 13e5d98 5e2ea5e 13e5d98 b4a5413 13e5d98 64d5f02 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 | ---
license: cc-by-nc-sa-4.0
language:
- en
tags:
- medical
- reinforcement-learning
- multimodal
- vision-language
- qwen3-vl
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
- Qwen/Qwen3-VL-30B-A3B-Instruct
---
# MediX-R1: Open-Ended Medical Reinforcement Learning
<p align="center">
<img src="assets/logo_white_no_bg.png" alt="MediX-R1" width="200">
</p>
<p align="center">
<img src="https://i.imgur.com/waxVImv.png" alt="MediX-R1">
</p>
[Sahal Shaji Mullappilly](https://scholar.google.com/citations?user=LJWxVpUAAAAJ&hl=en)\*, [Mohammed Irfan K](https://scholar.google.com/citations?user=GJp0keYAAAAJ&hl=en)\*, [Omair Mohamed](https://scholar.google.com), [Mohamed Zidan](https://scholar.google.com), [Fahad Khan](https://sites.google.com/view/fahadkhans/home), [Salman Khan](https://salman-h-khan.github.io/), [Rao Muhammad Anwer](https://scholar.google.com/citations?hl=en&authuser=1&user=_KlvMVoAAAAJ), and [Hisham Cholakkal](https://scholar.google.com/citations?hl=en&user=bZ3YBRcAAAAJ)
\**Equally contributing first authors*
#### **Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE**
[](https://medix.cvmbzuai.com)
[](https://arxiv.org/pdf/2602.23363)
[](https://huggingface.co/collections/MBZUAI/medix-r1)
[](https://medix.cvmbzuai.com/leaderboard)
---
## Overview
MediX-R1 is an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats. MediX-R1 fine-tunes vision-language backbones with Group-Based RL and a composite reward tailored for medical reasoning: an LLM-based accuracy reward, a medical embedding-based semantic reward, and lightweight format and modality rewards that enforce interpretable reasoning.
Despite using only ~50K instruction examples, MediX-R1 achieves excellent results across standard medical LLM and VLM benchmarks, outperforming strong open-source baselines.
**Highlights:**
- Our **8B** model achieves an overall average of **68.8%**, outperforming the much larger 27B MedGemma (68.4%).
- Our **30B** model achieves the best overall score of **73.6%**, demonstrating the effectiveness of our composite reward design.
---
## Contributions
- We introduce an **open-ended RL framework** for medical MLLMs that produces clinically grounded, free-form answers beyond MCQ formats.
- We design a **composite reward** combining LLM-based accuracy, embedding-based semantic similarity, format adherence, and modality recognition, providing stable and informative feedback where traditional verifiable or MCQ-only rewards fall short.
- We propose a **unified evaluation framework** for both text-only and image+text tasks using a Reference-based LLM-as-judge, capturing semantic correctness, reasoning, and contextual alignment.
- Despite using only **~50K** instruction examples, MediX-R1 achieves state-of-the-art results across diverse medical LLM and VLM benchmarks, with particularly large gains on open-ended clinical tasks.
---
## Architecture
<p align="center">
<img src="assets/medix-r1_arch.png" alt="MediX-R1 Architecture" width="100%">
</p>
---
## Composite Reward Design
MediX-R1 uses a multi-signal reward combining LLM-based accuracy, embedding-based semantic similarity, format adherence, and modality recognition. This stabilizes training and prevents reward hacking compared to single-signal approaches.
<p align="center">
<img src="assets/reward_design_graph.png" alt="Reward Design" width="60%">
</p>
---
## Qualitative Examples
<p align="center">
<img src="assets/microscopy_qualitative.png" alt="Microscopy Example" width="85%">
<img src="assets/xray_qualitative.png" alt="X-ray Example" width="85%">
</p>
---
## Training
We provide training configs for all model sizes using GRPO and DAPO algorithms. The training pipeline uses a vLLM-based reward server for LLM-as-judge scoring during RL training.
```bash
cd training
pip install -e .
bash vllm_serve.sh # Step 1: Start the reward server
bash run_train.sh # Step 2: Launch RL training
bash merge_model.sh # Step 3: Merge FSDP checkpoints
```
Training data: [MBZUAI/medix-rl-data](https://huggingface.co/datasets/MBZUAI/medix-rl-data) (~51K train, ~2.5K test samples)
See [`training/README.md`](https://github.com/mbzuai-oryx/MediX-R1/blob/main/training/README.md) for detailed setup, configuration options, and per-model scripts.
## Evaluation
We propose a unified evaluation framework for both text-only (LLM) and image+text (VLM) tasks using a Reference-based LLM-as-judge across 17 medical benchmarks.
```bash
cd eval
pip install uv && uv pip install -r requirements.txt
bash eval.sh # Run all phases: generate, evaluate, score
```
Supports self-hosted judge models via vLLM or [OpenRouter](https://openrouter.ai/) as a remote alternative. Results can be submitted to the [MediX Leaderboard](https://medix.cvmbzuai.com/leaderboard).
See [`eval/README.md`](https://github.com/mbzuai-oryx/MediX-R1/blob/main/eval/README.md) for task selection, CLI reference, and MMMU-Medical evaluation.
---
## Model Zoo
| Model | HuggingFace |
|-------|-------------|
| MediX-R1-2B | [MBZUAI/MediX-R1-2B](https://huggingface.co/MBZUAI/MediX-R1-2B) |
| MediX-R1-8B | [MBZUAI/MediX-R1-8B](https://huggingface.co/MBZUAI/MediX-R1-8B) |
| MediX-R1-30B | [MBZUAI/MediX-R1-30B](https://huggingface.co/MBZUAI/MediX-R1-30B) |
---
## Citation
If you use MediX-R1 in your research, please cite our work as follows:
```bibtex
@misc{mullappilly2026medixr1openendedmedical,
title={MediX-R1: Open Ended Medical Reinforcement Learning},
author={Sahal Shaji Mullappilly and Mohammed Irfan Kurpath and Omair Mohamed and Mohamed Zidan and Fahad Khan and Salman Khan and Rao Anwer and Hisham Cholakkal},
year={2026},
eprint={2602.23363},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.23363},
}
```
---
## License
This project is released for **research purposes only** under [*CC-BY-NC-SA 4.0*](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.en) License. It is not intended for clinical or commercial use.
Users are urged to employ MediX-R1 responsibly, especially when applying its outputs in real-world medical scenarios. It is imperative to verify the model's advice with qualified healthcare professionals and not rely on it for medical diagnoses or treatment decisions.
---
## Acknowledgements
We are thankful to [EasyR1](https://github.com/hiyouga/EasyR1) (a fork of [veRL](https://github.com/volcengine/verl)) for their open-source RL training framework.
This work was partially supported with *NVIDIA Academic Grant 2025* and *MBZUAI-IITD* Research Collaboration Seed Grant.
We are grateful to [MBZUAI](https://mbzuai.ac.ae/) for compute and support. |