---
base_model:
  - Qwen/Qwen2-VL-2B-Instruct
datasets:
  - tanhuajie2001/Reason-RFT-CoT-Dataset
language:
  - en
license: apache-2.0
metrics:
  - accuracy
pipeline_tag: image-text-to-text
library_name: transformers
---

πŸ€— Reason-RFT CoT Dataset

This repository hosts model checkpoints from our project "Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning".

This model is described in the paper Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models.

  β­οΈ Project   β”‚   πŸŒŽ Github   β”‚   πŸ”₯ Dataset   β”‚   πŸ“‘ ArXiv   β”‚   πŸ’¬ WeChat

  πŸ€– RoboBrain: We aim to explore the Reason-RFT paradigm to enhance RoboBrain's embodied reasoning capabilities.

♣️ Model List

| Tasks | Reason-RFT-Zero-2B | Reason-RFT-Zero-7B | Reason-RFT-2B | Reason-RFT-7B |
|---|---|---|---|---|
| Visual Counting | πŸ€—VC-GRPO-Zero-2B | πŸ€—VC-GRPO-Zero-7B | πŸ€—VC-GRPO-2B | πŸ€—VC-GRPO-7B |
| Structure Perception | πŸ€—SP-GRPO-Zero-2B | πŸ€—SP-GRPO-Zero-7B | πŸ€—SP-GRPO-2B | πŸ€—SP-GRPO-7B |
| Spatial Transformation | πŸ€—ST-GRPO-Zero-2B | πŸ€—ST-GRPO-Zero-7B | πŸ€—ST-GRPO-2B | πŸ€—ST-GRPO-7B |
| Embodied Tasks | πŸ€– Stay Tuned | πŸ€– Stay Tuned | πŸ€– Stay Tuned | πŸ€– Stay Tuned |

πŸ”₯ Overview

Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods improve VLM reasoning via Chain-of-Thought (CoT) supervised fine-tuning, using meticulously annotated training data. However, this training paradigm can lead to overfitting and cognitive rigidity, restricting the model's ability to transfer visual reasoning skills across domains and limiting its real-world applicability. To address these limitations, we propose Reason-RFT, a novel two-phase reinforcement fine-tuning framework for visual reasoning: (1) Supervised Fine-Tuning (SFT) with curated Chain-of-Thought (CoT) data activates the reasoning potential of Vision-Language Models (VLMs); (2) Group Relative Policy Optimization (GRPO)-based reinforcement learning generates multiple reasoning-response pairs per query, significantly enhancing generalization in visual reasoning tasks. To evaluate Reason-RFT's visual reasoning capabilities, we reconstructed a comprehensive dataset spanning visual counting, structure perception, and spatial transformation, serving as a benchmark to systematically assess visual cognition, geometric understanding, and spatial generalization.

Experimental results demonstrate Reason-RFT's three key advantages: (1) Performance enhancement: it achieves state-of-the-art results across multiple tasks, outperforming most mainstream open-source and proprietary models; (2) Generalization superiority: it consistently maintains robust performance across diverse tasks and domains, outperforming alternative training paradigms; (3) Data efficiency: it excels in few-shot learning scenarios and surpasses full-dataset SFT baselines. Reason-RFT introduces a novel paradigm in visual reasoning, significantly advancing multimodal research.
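The core idea behind stage (2) can be illustrated with a short sketch of GRPO's group-relative advantage: several responses are sampled per query, and each response's reward is normalized against the mean and standard deviation of its own sampling group. This is a simplified illustration only; the function name and the 0/1 correctness reward are placeholders, not the project's actual reward functions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled response's reward against its group's statistics."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four responses sampled for one query, scored by a hypothetical
# 0/1 correctness reward; correct responses receive a positive advantage.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline comes from the group statistics themselves, GRPO needs no separately learned value network, which keeps the RL stage comparatively lightweight.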

πŸ—žοΈ News

  • 2025-09-18: πŸ”₯πŸ”₯πŸ”₯ Reason-RFT was accepted to NeurIPS 2025! See you in Mexico City and San Diego, USA!
  • 2025-06-06: πŸ€– We're excited to announce the release of our more powerful RoboBrain 2.0, built with Reason-RFT.
  • 2025-04-13: ✨ We released our model zoo on Hugging Face.
  • 2025-04-04: πŸ€— We released our datasets on Hugging Face for General Visual Reasoning Tasks.
  • 2025-04-02: πŸ”₯ We released code and scripts for training/evaluation on General Visual Reasoning Tasks.
  • 2025-03-29: 🌍 We released the repository and roadmap for Reason-RFT.
  • 2025-03-26: πŸ“‘ We released the initial arXiv paper for Reason-RFT.

⭐️ Sample Usage

To get started with Reason-RFT, please follow these steps for setting up the environment and training:

πŸ› οΈ Setup

```bash
# Clone the repo.
git clone https://github.com/tanhuajie/Reason-RFT.git
cd Reason-RFT

# Build the conda env for stage_rl.
conda create -n reasonrft_rl python=3.10
conda activate reasonrft_rl
pip install -r requirements_rl.txt

# Build the conda env for stage_sft.
conda create -n reasonrft_sft python=3.10
conda activate reasonrft_sft
pip install -r requirements_sft.txt
```

♣️ Dataset Preparation

  • SFT training: update the dataset paths defined in `./train/stage_sft/dataset_info.json`.
  • RL training: update the dataset paths defined in `./scripts/train/reason_rft/stage_rl/xxx.bash` and `./scripts/train/reason_rft_zero/xxx.bash`.
  • Evaluation: update the dataset paths defined in `./eval/eval_by_vllm_for_open_source.py`.

πŸ“š Training Example

```bash
# Reason-RFT, Task1 (Visual Counting), Qwen2-VL-2B, STAGE1 + STAGE2
bash scripts/train/reason_rft/stage_sft/resume_finetune_qwen2vl_2b_task1_stage1_sft.sh
bash scripts/train/reason_rft/stage_rl/resume_finetune_qwen2vl_2b_task1_stage2_rl.sh
```

Note: please update the dataset, pre-trained model, and image paths in the scripts above.
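After training, checkpoints can be loaded for inference with πŸ€— Transformers. Below is a minimal sketch using the base model ID from this card's metadata; to use a released Reason-RFT checkpoint, swap in its repo ID from the model list above. The image URL and question are placeholders.

```python
import requests
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Replace with a Reason-RFT checkpoint repo ID from the model list.
model_id = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Build a single-image chat prompt.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "How many objects are in the image? Think step by step."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Placeholder image URL; substitute your own input.
image = Image.open(requests.get("https://example.com/sample.png", stream=True).raw)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Generate the reasoning-augmented answer and strip the prompt tokens.
output_ids = model.generate(**inputs, max_new_tokens=512)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```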

πŸ“‘ Citation

If you find this project useful, please consider citing us:

```bibtex
@article{tan2025reason,
  title={Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning},
  author={Tan, Huajie and Ji, Yuheng and Hao, Xiaoshuai and Lin, Minglan and Wang, Pengwei and Wang, Zhongyuan and Zhang, Shanghang},
  journal={arXiv preprint arXiv:2503.20752},
  year={2025}
}
```