When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains
Abstract
Reinforcement learning improves medical vision-language models mainly by sharpening the output distribution when the model already has sufficient reasoning support, while supervised fine-tuning expands that support and is what makes reinforcement learning effective.
Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs), yet it remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we probe visual perception by benchmarking VLM vision towers against vision-only baselines, quantify reasoning support and sampling efficiency via Accuracy@1 (Acc@1) versus Pass@K, and evaluate when RL closes the support gap and how gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support and makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.
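The Acc@1 versus Pass@K probe described in the abstract can be made concrete. Below is a minimal, hypothetical Python sketch (not the paper's released code) that estimates both metrics from per-question correctness counts over n sampled responses, using the standard unbiased combinatorial Pass@K estimator; the variable names and example counts are illustrative assumptions.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate given n samples with c correct.

    Computes 1 - C(n-c, k) / C(n, k) in a numerically stable
    product form: 1 - prod_{i=n-c+1}^{n} (1 - k/i).
    """
    if n - c < k:
        return 1.0  # fewer than k wrong samples: some draw of k must contain a correct one
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical data: number of correct answers among n sampled
# rollouts for each question (values are illustrative only).
n_samples = 16
correct_counts = [3, 0, 16, 7]

# Acc@1 = expected accuracy of a single sample (equivalently Pass@1).
acc_at_1 = np.mean([c / n_samples for c in correct_counts])
pass_at_8 = np.mean([pass_at_k(n_samples, c, 8) for c in correct_counts])
print(f"Acc@1 = {acc_at_1:.3f}, Pass@8 = {pass_at_8:.3f}")
```

In the paper's framing, a large gap between Pass@K and Acc@1 indicates that support exists but the output distribution is unsharpened, which is the regime where RL is reported to help most.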
Community
MedBridgeRL
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images (2026)
- What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis (2026)
- Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training (2026)
- Text Before Vision: Staged Knowledge Injection Matters for Agentic RLVR in Ultra-High-Resolution Remote Sensing Understanding (2026)
- MediX-R1: Open Ended Medical Reinforcement Learning (2026)
- Learning Self-Correction in Vision-Language Models via Rollout Augmentation (2026)
- On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs (2026)