---
base_model:
- Qwen/Qwen2-VL-2B-Instruct
datasets:
- tanhuajie2001/Reason-RFT-CoT-Dataset
language:
- en
license: apache-2.0
metrics:
- accuracy
pipeline_tag: image-text-to-text
library_name: transformers
---
# Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models

This repository contains the official model checkpoints for the paper [Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models](https://arxiv.org/abs/2503.20752).
Project • Github • Dataset • Paper • ArXiv • WeChat

🤗 **RoboBrain**: aiming to explore the Reason-RFT paradigm to enhance RoboBrain's embodied reasoning capabilities.
## Model Zoo
| Tasks | Reason-RFT-Zero-2B | Reason-RFT-Zero-7B | Reason-RFT-2B | Reason-RFT-7B |
|---|---|---|---|---|
| Visual Counting | 🤗 VC-GRPO-Zero-2B | 🤗 VC-GRPO-Zero-7B | 🤗 VC-GRPO-2B | 🤗 VC-GRPO-7B |
| Structure Perception | 🤗 SP-GRPO-Zero-2B | 🤗 SP-GRPO-Zero-7B | 🤗 SP-GRPO-2B | 🤗 SP-GRPO-7B |
| Spatial Transformation | 🤗 ST-GRPO-Zero-2B | 🤗 ST-GRPO-Zero-7B | 🤗 ST-GRPO-2B | 🤗 ST-GRPO-7B |
| Embodied Tasks | 🤗 Stay Tuned | 🤗 Stay Tuned | 🤗 Stay Tuned | 🤗 Stay Tuned |
## 🔥 Overview
Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods improve VLM reasoning via Chain-of-Thought (CoT) supervised fine-tuning, using meticulously annotated training data. However, this training paradigm may lead to overfitting and cognitive rigidity, restricting the model's ability to transfer visual reasoning skills across domains and limiting its real-world applicability.

To address these limitations, we propose **Reason-RFT**, a novel reinforcement fine-tuning framework that significantly enhances generalization in visual reasoning tasks. Reason-RFT introduces a two-phase training framework: (1) Supervised Fine-Tuning (SFT) with curated Chain-of-Thought (CoT) data activates the reasoning potential of Vision-Language Models (VLMs); (2) Group Relative Policy Optimization (GRPO)-based reinforcement learning then samples multiple reasoning-response pairs per query and optimizes the model against group-relative rewards, further strengthening generalization.

To evaluate Reason-RFT's visual reasoning capabilities, we reconstructed a comprehensive dataset spanning visual counting, structure perception, and spatial transformation, serving as a benchmark to systematically assess visual cognition, geometric understanding, and spatial generalization.

Experimental results demonstrate Reason-RFT's three key advantages:

1. **Performance Enhancement**: achieves state-of-the-art results across multiple tasks, outperforming most mainstream open-source and proprietary models.
2. **Generalization Superiority**: consistently maintains robust performance across diverse tasks and domains, outperforming alternative training paradigms.
3. **Data Efficiency**: excels in few-shot learning scenarios and surpasses full-dataset SFT baselines.

Reason-RFT introduces a novel paradigm in visual reasoning, significantly advancing multimodal research.
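The group-relative scoring at the heart of GRPO can be sketched as follows. This is a schematic illustration, not the project's training code: instead of a learned critic, each of the G responses sampled for one prompt is scored against its own group's statistics, and the reward function here (exact-match on the final answer) is a hypothetical stand-in.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group: a_i = (r_i - mean) / std.

    GRPO compares each sampled response to the mean reward of its group,
    so no separate value/critic model is needed.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: 4 sampled answers to one visual-counting prompt, each rewarded
# 1.0 if the final answer matches the ground truth and 0.0 otherwise
# (a hypothetical reward, for illustration only).
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))  # [1.0, -1.0, -1.0, 1.0]
```

Responses with above-average rewards receive positive advantages and are reinforced; below-average ones are suppressed, which is what pushes the policy toward reasoning chains that reach correct answers.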
## 🗞️ News
- **2025-04-12**: Released our models on Hugging Face for general visual reasoning tasks.
- **2025-04-04**: Released our datasets on Hugging Face for general visual reasoning tasks.
- **2025-04-02**: Released code and scripts for training/evaluation on general visual reasoning tasks.
- **2025-03-29**: Released the repository and roadmap for Reason-RFT.
- **2025-03-26**: Released our initial ArXiv paper of Reason-RFT.
## ⚙️ Quick Start: Inference
For full details on usage, please refer to the Reason-RFT GitHub repository.
These checkpoints are `transformers`-compatible models built on Qwen2-VL (see `library_name` and `base_model` above), so they can be loaded directly with `AutoProcessor` and `Qwen2VLForConditionalGeneration`:

```python
# git clone https://github.com/tanhuajie/Reason-RFT
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Replace with the actual model ID from the Model Zoo table above,
# e.g. "tanhuajie2001/Reason-RFT-Visual-Counting-Qwen2-VL-2B"
model_id = "tanhuajie2001/Reason-RFT-Visual-Counting-Qwen2-VL-2B"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
model.eval()

# Replace with an actual image path.
image_path = "./path/to/your/image.png"
image = Image.open(image_path).convert("RGB")
question = "What is the count of blue objects in this image?"  # example for Visual Counting

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    },
]
text_input = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text_input], images=[image], return_tensors="pt").to(model.device)

with torch.inference_mode():
    generated_ids = model.generate(**inputs, max_new_tokens=512)

# Strip the prompt tokens before decoding so only the new response remains.
response = processor.batch_decode(
    generated_ids[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]
print(f"Assistant: {response}")
```
## 📚 Citation

If you find this project useful, please consider citing us:
```bibtex
@article{tan2025reason,
  title={Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning},
  author={Tan, Huajie and Ji, Yuheng and Hao, Xiaoshuai and Lin, Minglan and Wang, Pengwei and Wang, Zhongyuan and Zhang, Shanghang},
  journal={arXiv preprint arXiv:2503.20752},
  year={2025}
}
```