Improve model card for Reason-RFT: Add metadata, update title, news, and usage

#1
by nielsr HF Staff - opened
Files changed (1) hide show
  1. README.md +100 -13
README.md CHANGED
@@ -1,22 +1,28 @@
1
  ---
2
- license: apache-2.0
3
- language:
4
- - en
5
  datasets:
6
  - tanhuajie2001/Reason-RFT-CoT-Dataset
 
 
 
7
  metrics:
8
  - accuracy
9
- base_model:
10
- - Qwen/Qwen2-VL-2B-Instruct
 
 
11
  ---
12
 
 
 
 
 
13
  <div align="center">
14
  <img src="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/logo.png" width="500"/>
15
  </div>
16
 
17
- # 🤗 Reason-RFT CoT Dateset
18
- *The model checkpoints in our project "Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning"*.
19
-
20
 
21
  <p align="center">
22
  </a>&nbsp&nbsp⭐️ <a href="https://tanhuajie.github.io/ReasonRFT/">Project</a></a>&nbsp&nbsp │ &nbsp&nbsp🌎 <a href="https://github.com/tanhuajie/Reason-RFT">Github</a>&nbsp&nbsp │ &nbsp&nbsp🔥 <a href="https://huggingface.co/datasets/tanhuajie2001/Reason-RFT-CoT-Dataset">Dataset</a>&nbsp&nbsp │ &nbsp&nbsp📑 <a href="https://arxiv.org/abs/2503.20752">ArXiv</a>&nbsp&nbsp │ &nbsp&nbsp💬 <a href="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/wechat.png">WeChat</a>
@@ -45,7 +51,7 @@ To address these limitations, we propose **Reason-RFT**, a novel reinforcement f
45
  To evaluate **Reason-RFT**'s visual reasoning capabilities, we reconstructed a comprehensive dataset spanning visual counting, structure perception, and spatial transformation, serving as a benchmark to systematically assess visual cognition, geometric understanding, and spatial generalization.
46
  Experimental results demonstrate Reasoning-RFT's three key advantages: **(1) Performance Enhancement**: achieving state-of-the-art results across multiple tasks, outperforming most mainstream open-source and proprietary models;
47
  **(2) Generalization Superiority**: consistently maintaining robust performance across diverse tasks and domains, outperforming alternative training paradigms;
48
- **(3) Data Efficiency**: excelling in few-shot learning scenarios while surpassing full-dataset SFT baselines;
49
  **Reason-RFT** introduces a novel paradigm in visual reasoning, significantly advancing multimodal research.
50
 
51
  <div align="center">
@@ -54,16 +60,83 @@ Experimental results demonstrate Reasoning-RFT's three key advantages: **(1) Per
54
 
55
  ## 🗞️ News
56
 
 
 
 
57
  - **`2025-04-12`**: ⭐️ We released our [Models](https://huggingface.co/tanhuajie2001/Reason-RFT-Spatial-Transformation-Qwen2-VL-2B) to huggingface for [General Visual Reasoning Tasks](#GeneralVisualTasks).
58
  - **`2025-04-04`**: 🤗 We released our [datasets](https://huggingface.co/datasets/tanhuajie2001/Reason-RFT-CoT-Dataset/) to huggingface for [General Visual Reasoning Tasks](#GeneralVisualTasks).
59
  - **`2025-04-02`**: 🔥 We released codes and scripts for training/evaluation on [General Visual Reasoning Tasks](#GeneralVisualTasks).
60
  - **`2025-03-29`**: 🌍 We released the [repository](https://github.com/tanhuajie/Reason-RFT/) and [roadmap](#RoadMap) for **Reason-RFT**.
61
  - **`2025-03-26`**: 📑 We released our initial [ArXiv paper](https://arxiv.org/abs/2503.20752/) of **Reason-RFT**.
62
 
63
-
64
- ## ⭐️ Usage
65
-
66
- *Please refer to [Reason-RFT](https://github.com/tanhuajie/Reason-RFT) for more details.*
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
67
 
68
  ## 📑 Citation
69
  If you find this project useful, welcome to cite us.
@@ -74,4 +147,18 @@ If you find this project useful, welcome to cite us.
74
  journal={arXiv preprint arXiv:2503.20752},
75
  year={2025}
76
  }
 
 
 
 
 
 
 
 
 
 
 
 
 
 
77
  ```
 
1
  ---
2
+ base_model:
3
+ - Qwen/Qwen2-VL-2B-Instruct
 
4
  datasets:
5
  - tanhuajie2001/Reason-RFT-CoT-Dataset
6
+ language:
7
+ - en
8
+ license: apache-2.0
9
  metrics:
10
  - accuracy
11
+ pipeline_tag: image-text-to-text
12
+ library_name: transformers
13
+ tags:
14
+ - visual-reasoning
15
  ---
16
 
17
+ # Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models
18
+
19
+ The model was presented in the paper [Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models](https://huggingface.co/papers/2503.20752).
20
+
21
  <div align="center">
22
  <img src="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/logo.png" width="500"/>
23
  </div>
24
 
25
+ *The model checkpoints in our project "Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning"*
 
 
26
 
27
  <p align="center">
28
  </a>&nbsp&nbsp⭐️ <a href="https://tanhuajie.github.io/ReasonRFT/">Project</a></a>&nbsp&nbsp │ &nbsp&nbsp🌎 <a href="https://github.com/tanhuajie/Reason-RFT">Github</a>&nbsp&nbsp │ &nbsp&nbsp🔥 <a href="https://huggingface.co/datasets/tanhuajie2001/Reason-RFT-CoT-Dataset">Dataset</a>&nbsp&nbsp │ &nbsp&nbsp📑 <a href="https://arxiv.org/abs/2503.20752">ArXiv</a>&nbsp&nbsp │ &nbsp&nbsp💬 <a href="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/wechat.png">WeChat</a>
 
51
  To evaluate **Reason-RFT**'s visual reasoning capabilities, we reconstructed a comprehensive dataset spanning visual counting, structure perception, and spatial transformation, serving as a benchmark to systematically assess visual cognition, geometric understanding, and spatial generalization.
52
  Experimental results demonstrate Reasoning-RFT's three key advantages: **(1) Performance Enhancement**: achieving state-of-the-art results across multiple tasks, outperforming most mainstream open-source and proprietary models;
53
  **(2) Generalization Superiority**: consistently maintaining robust performance across diverse tasks and domains, outperforming alternative training paradigms;
54
+ **(3) Data Efficiency**: excelling in few-shot learning scenarios and surpassing full-dataset SFT baselines;
55
  **Reason-RFT** introduces a novel paradigm in visual reasoning, significantly advancing multimodal research.
56
 
57
  <div align="center">
 
60
 
61
  ## 🗞️ News
62
 
63
+ - **`2025-09-18`**: 🔥🔥🔥 **Reason-RFT** gets accepted to NeurIPS 2025! See you in Mexico City and San Diego, USA!
64
+ - **`2025-06-06`**: 🤖 We're excited to announce the release of our more powerful [RoboBrain 2.0](https://github.com/FlagOpen/RoboBrain2.0) using Reason-RFT.
65
+ - **`2025-04-13`**: ✨ We released our [model zoo](https://github.com/tanhuajie/Reason-RFT?tab=readme-ov-file#--model-zoo) to huggingface.
66
  - **`2025-04-12`**: ⭐️ We released our [Models](https://huggingface.co/tanhuajie2001/Reason-RFT-Spatial-Transformation-Qwen2-VL-2B) to huggingface for [General Visual Reasoning Tasks](#GeneralVisualTasks).
67
  - **`2025-04-04`**: 🤗 We released our [datasets](https://huggingface.co/datasets/tanhuajie2001/Reason-RFT-CoT-Dataset/) to huggingface for [General Visual Reasoning Tasks](#GeneralVisualTasks).
68
  - **`2025-04-02`**: 🔥 We released codes and scripts for training/evaluation on [General Visual Reasoning Tasks](#GeneralVisualTasks).
69
  - **`2025-03-29`**: 🌍 We released the [repository](https://github.com/tanhuajie/Reason-RFT/) and [roadmap](#RoadMap) for **Reason-RFT**.
70
  - **`2025-03-26`**: 📑 We released our initial [ArXiv paper](https://arxiv.org/abs/2503.20752/) of **Reason-RFT**.
71
 
72
+ ## ⭐️ Sample Usage
73
+
74
+ The following code snippet demonstrates how to perform quick inference with the model. For more details, you could refer to [Github](https://github.com/tanhuajie/Reason-RFT).
75
+
76
+ ```python
77
+ # git clone https://github.com/tanhuajie/Reason-RFT
78
+ import numpy as np
79
+ import torch
80
+ from longvu.builder import load_pretrained_model
81
+ from longvu.constants import (
82
+ DEFAULT_IMAGE_TOKEN,
83
+ IMAGE_TOKEN_INDEX,
84
+ )
85
+ from longvu.conversation import conv_templates, SeparatorStyle
86
+ from longvu.mm_datautils import (
87
+ KeywordsStoppingCriteria,
88
+ process_images,
89
+ tokenizer_image_token,
90
+ )
91
+ from decord import cpu, VideoReader
92
+
93
+ # This is an example, replace with the correct path to your downloaded checkpoint
94
+ tokenizer, model, image_processor, context_len = load_pretrained_model(
95
+ "./checkpoints/longvu_qwen", None, "cambrian_qwen",
96
+ )
97
+
98
+ model.eval()
99
+ video_path = "./examples/video1.mp4" # Replace with your image/video path
100
+ qs = "Describe this video in detail" # Replace with your query
101
+
102
+ # For image input, replace this section with image loading and processing
103
+ vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
104
+ fps = float(vr.get_avg_fps())
105
+ frame_indices = np.array([i for i in range(0, len(vr), round(fps),)])
106
+ video = []
107
+ for frame_index in frame_indices:
108
+ img = vr[frame_index].asnumpy()
109
+ video.append(img)
110
+ video = np.stack(video)
111
+ image_sizes = [video[0].shape[:2]]
112
+ video = process_images(video, image_processor, model.config)
113
+ video = [item.unsqueeze(0) for item in video]
114
+
115
+ qs = DEFAULT_IMAGE_TOKEN + "
116
+ " + qs
117
+ conv = conv_templates["qwen"].copy()
118
+ conv.append_message(conv.roles[0], qs)
119
+ conv.append_message(conv.roles[1], None)
120
+ prompt = conv.get_prompt()
121
+
122
+ input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
123
+ stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
124
+ keywords = [stop_str]
125
+ stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
126
+ with torch.inference_mode():
127
+ output_ids = model.generate(
128
+ input_ids,
129
+ images=video,
130
+ image_sizes=image_sizes,
131
+ do_sample=False,
132
+ temperature=0.2,
133
+ max_new_tokens=128,
134
+ use_cache=True,
135
+ stopping_criteria=[stopping_criteria],
136
+ )
137
+ pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
138
+ print(f'Generated response: {pred}')
139
+ ```
140
 
141
  ## 📑 Citation
142
  If you find this project useful, welcome to cite us.
 
147
  journal={arXiv preprint arXiv:2503.20752},
148
  year={2025}
149
  }
150
+
151
+ @article{team2025robobrain,
152
+ title={Robobrain 2.0 technical report},
153
+ author={Team, BAAI RoboBrain and Cao, Mingyu and Tan, Huajie and Ji, Yuheng and Lin, Minglan and Li, Zhiyu and Cao, Zhou and Wang, Pengwei and Zhou, Enshen and Han, Yi and others},
154
+ journal={arXiv preprint arXiv:2507.02029},
155
+ year={2025}
156
+ }
157
+
158
+ @article{ji2025robobrain,
159
+ title={RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete},
160
+ author={Ji, Yuheng and Tan, Huajie and Shi, Jiayu and Hao, Xiaoshuai and Zhang, Yuan and Zhang, Hengyuan and Wang, Pengwei and Zhao, Mengdi and Mu, Yao and An, Pengju and others},
161
+ journal={arXiv preprint arXiv:2502.21257},
162
+ year={2025}
163
+ }
164
  ```