tanhuajie2001
/

Reason-RFT-Zero-Spatial-Transformation-Qwen2-VL-7B

@@ -1,22 +1,24 @@
 ---
-license: apache-2.0
-language:
-- en
 datasets:
 - tanhuajie2001/Reason-RFT-CoT-Dataset
 metrics:
 - accuracy
-base_model:
-- Qwen/Qwen2-VL-2B-Instruct
 ---
 <div align="center">
 <img src="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/logo.png" width="500"/>
 </div>
-# 🤗 Reason-RFT CoT Dateset
-*The model checkpoints in our project "Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning"*.
 <p align="center">
     </a>&nbsp&nbsp⭐️ <a href="https://tanhuajie.github.io/ReasonRFT/">Project</a></a>&nbsp&nbsp │ &nbsp&nbsp🌎 <a href="https://github.com/tanhuajie/Reason-RFT">Github</a>&nbsp&nbsp │ &nbsp&nbsp🔥 <a href="https://huggingface.co/datasets/tanhuajie2001/Reason-RFT-CoT-Dataset">Dataset</a>&nbsp&nbsp │ &nbsp&nbsp📑 <a href="https://arxiv.org/abs/2503.20752">ArXiv</a>&nbsp&nbsp │ &nbsp&nbsp💬 <a href="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/wechat.png">WeChat</a>
@@ -32,8 +34,8 @@ base_model:
 |------------------------|---------------------------|---------------------|---------------------------|---------------------------|
 | Visual Counting        | [🤗VC-GRPO-Zero-2B](https://huggingface.co/tanhuajie2001/Reason-RFT-Zero-Visual-Counting-Qwen2-VL-2B) | [🤗VC-GRPO-Zero-7B](https://huggingface.co/tanhuajie2001/Reason-RFT-Zero-Visual-Counting-Qwen2-VL-7B) | [🤗VC-GRPO-2B](https://huggingface.co/tanhuajie2001/Reason-RFT-Visual-Counting-Qwen2-VL-2B) | [🤗VC-GRPO-7B](https://huggingface.co/tanhuajie2001/Reason-RFT-Visual-Counting-Qwen2-VL-7B) |
 | Structure Perception   | [🤗SP-GRPO-Zero-2B](https://huggingface.co/tanhuajie2001/Reason-RFT-Zero-Structure-Perception-Qwen2-VL-2B) | [🤗SP-GRPO-Zero-7B](https://huggingface.co/tanhuajie2001/Reason-RFT-Zero-Structure-Perception-Qwen2-VL-7B) | [🤗SP-GRPO-2B](https://huggingface.co/tanhuajie2001/Reason-RFT-Structure-Perception-Qwen2-VL-2B) | [🤗SP-GRPO-7B](https://huggingface.co/tanhuajie2001/Reason-RFT-Structure-Perception-Qwen2-VL-7B) |
-| Spatial Transformation | [🤗ST-GRPO-Zero-2B](https://huggingface.co/tanhuajie2001/Reason-RFT-Zero-Spatial-Transformation-Qwen2-VL-2B) | [🤗ST-GRPO-Zero-7B](https://huggingface.co/tanhuajie2001/Reason-RFT-Zero-Spatial-Transformation-Qwen2-VL-7B) | [🤗ST-GRPO-2B](https://huggingface.co/tanhuajie2001/Reason-RFT-Spatial-Transformation-Qwen2-VL-2B) | [🤗ST-GRPO-7B](https://huggingface.co/tanhuajie2001/Reason-RFT-Spatial-Transformation-Qwen2-VL-7B) |
-| ***Embodied Tasks***   | 🤖 *Stay Turned*   | 🤖 *Stay Turned*   | 🤖 *Stay Turned*   | 🤖 *Stay Turned*  |
 ## 🔥 Overview
@@ -43,9 +45,9 @@ However, this training paradigm may lead to overfitting and cognitive rigidity,
 To address these limitations, we propose **Reason-RFT**, a novel reinforcement fine-tuning framework that significantly enhances generalization capabilities in visual reasoning tasks.
 **Reason-RFT** introduces a two-phase training framework for visual reasoning: (1) Supervised Fine-Tuning (SFT) with curated Chain-of-Thought (CoT) data activates the reasoning potential of Vision-Language Models (VLMs), followed by (2) Group Relative Policy Optimization (GRPO)-based reinforcement learning that generates multiple reasoning-response pairs, significantly enhancing generalization in visual reasoning tasks.
 To evaluate **Reason-RFT**'s visual reasoning capabilities, we reconstructed a comprehensive dataset spanning visual counting, structure perception, and spatial transformation, serving as a benchmark to systematically assess visual cognition, geometric understanding, and spatial generalization.
-Experimental results demonstrate Reasoning-RFT's three key advantages: **(1) Performance Enhancement**: achieving state-of-the-art results across multiple tasks, outperforming most mainstream open-source and proprietary models;
-**(2) Generalization Superiority**: consistently maintaining robust performance across diverse tasks and domains, outperforming alternative training paradigms;
-**(3) Data Efficiency**: excelling in few-shot learning scenarios while surpassing full-dataset SFT baselines;
 **Reason-RFT** introduces a novel paradigm in visual reasoning, significantly advancing multimodal research.
 <div align="center">

 ---
+base_model:
+- Qwen/Qwen2-VL-2B-Instruct
 datasets:
 - tanhuajie2001/Reason-RFT-CoT-Dataset
+language:
+- en
+license: apache-2.0
 metrics:
 - accuracy
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
 <div align="center">
 <img src="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/logo.png" width="500"/>
 </div>
+# Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models
+This repository contains the model checkpoints for **Reason-RFT**, a reinforcement fine-tuning framework presented in the paper [Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models](https://huggingface.co/papers/2503.20752).
 <p align="center">
     </a>&nbsp&nbsp⭐️ <a href="https://tanhuajie.github.io/ReasonRFT/">Project</a></a>&nbsp&nbsp │ &nbsp&nbsp🌎 <a href="https://github.com/tanhuajie/Reason-RFT">Github</a>&nbsp&nbsp │ &nbsp&nbsp🔥 <a href="https://huggingface.co/datasets/tanhuajie2001/Reason-RFT-CoT-Dataset">Dataset</a>&nbsp&nbsp │ &nbsp&nbsp📑 <a href="https://arxiv.org/abs/2503.20752">ArXiv</a>&nbsp&nbsp │ &nbsp&nbsp💬 <a href="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/wechat.png">WeChat</a>
 |------------------------|---------------------------|---------------------|---------------------------|---------------------------|
 | Visual Counting        | [🤗VC-GRPO-Zero-2B](https://huggingface.co/tanhuajie2001/Reason-RFT-Zero-Visual-Counting-Qwen2-VL-2B) | [🤗VC-GRPO-Zero-7B](https://huggingface.co/tanhuajie2001/Reason-RFT-Zero-Visual-Counting-Qwen2-VL-7B) | [🤗VC-GRPO-2B](https://huggingface.co/tanhuajie2001/Reason-RFT-Visual-Counting-Qwen2-VL-2B) | [🤗VC-GRPO-7B](https://huggingface.co/tanhuajie2001/Reason-RFT-Visual-Counting-Qwen2-VL-7B) |
 | Structure Perception   | [🤗SP-GRPO-Zero-2B](https://huggingface.co/tanhuajie2001/Reason-RFT-Zero-Structure-Perception-Qwen2-VL-2B) | [🤗SP-GRPO-Zero-7B](https://huggingface.co/tanhuajie2001/Reason-RFT-Zero-Structure-Perception-Qwen2-VL-7B) | [🤗SP-GRPO-2B](https://huggingface.co/tanhuajie2001/Reason-RFT-Structure-Perception-Qwen2-VL-2B) | [🤗SP-GRPO-7B](https://huggingface.co/tanhuajie2001/Reason-RFT-Structure-Perception-Qwen2-VL-7B) |
+| Spatial Transformation | [🤗ST-GRPO-Zero-2B](https://huggingface.co/tanhuajie2001/Reason-RFT-Zero-Spatial-Transformation-Qwen2-VL-2B) | [🤗ST-GRPO-Zero-7B](https://huggingface.co/tanhuajie2001/Reason-RFT-Zero-Spatial-Transformation-Qwen2-VL-7B) | [🤗ST-GRPO-2B](https://huggingface.co/tanhuajie2001/Reason-RFT-Spatial-Transformation-Qwen2-VL-2B) | [🤗ST-GRPO-7B](https://huggingface.co/tanhuajie2001/Reason-RFT-Spatial-Transformation-Qwen2-VL-7B) |
+| ***Embodied Tasks***   | 🤖 *Stay Tuned*   | 🤖 *Stay Tuned*   | 🤖 *Stay Tuned*   | 🤖 *Stay Tuned*  |
 ## 🔥 Overview
 To address these limitations, we propose **Reason-RFT**, a novel reinforcement fine-tuning framework that significantly enhances generalization capabilities in visual reasoning tasks.
 **Reason-RFT** introduces a two-phase training framework for visual reasoning: (1) Supervised Fine-Tuning (SFT) with curated Chain-of-Thought (CoT) data activates the reasoning potential of Vision-Language Models (VLMs), followed by (2) Group Relative Policy Optimization (GRPO)-based reinforcement learning that generates multiple reasoning-response pairs, significantly enhancing generalization in visual reasoning tasks.
 To evaluate **Reason-RFT**'s visual reasoning capabilities, we reconstructed a comprehensive dataset spanning visual counting, structure perception, and spatial transformation, serving as a benchmark to systematically assess visual cognition, geometric understanding, and spatial generalization.
+Experimental results demonstrate Reason-RFT's three key advantages: **(1) Performance Enhancement**: achieving state-of-the-art results across multiple tasks, outperforming most mainstream open-source and proprietary models;
+**(2) Generalization Superiority**: consistently maintaining robust performance across diverse tasks and domains, outperforming alternative training paradigms;
+**(3) Data Efficiency**: excelling in few-shot learning scenarios while surpassing full-dataset SFT baselines.
 **Reason-RFT** introduces a novel paradigm in visual reasoning, significantly advancing multimodal research.
 <div align="center">