Improve model card for Reason-RFT models with pipeline tag, library name, and usage example

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +100 -13
README.md CHANGED
@@ -1,32 +1,34 @@
1
  ---
2
- license: apache-2.0
3
- language:
4
- - en
5
  datasets:
6
  - tanhuajie2001/Reason-RFT-CoT-Dataset
 
 
 
7
  metrics:
8
  - accuracy
9
- base_model:
10
- - Qwen/Qwen2-VL-2B-Instruct
11
  ---
12
 
13
  <div align="center">
14
  <img src="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/logo.png" width="500"/>
15
  </div>
16
 
17
- # 🤗 Reason-RFT CoT Dateset
18
- *The model checkpoints in our project "Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning"*.
19
 
 
20
 
21
  <p align="center">
22
- </a>&nbsp&nbsp⭐️ <a href="https://tanhuajie.github.io/ReasonRFT/">Project</a></a>&nbsp&nbsp │ &nbsp&nbsp🌎 <a href="https://github.com/tanhuajie/Reason-RFT">Github</a>&nbsp&nbsp │ &nbsp&nbsp🔥 <a href="https://huggingface.co/datasets/tanhuajie2001/Reason-RFT-CoT-Dataset">Dataset</a>&nbsp&nbsp │ &nbsp&nbsp📑 <a href="https://arxiv.org/abs/2503.20752">ArXiv</a>&nbsp&nbsp │ &nbsp&nbsp💬 <a href="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/wechat.png">WeChat</a>
23
  </p>
24
 
25
  <p align="center">
26
  🤖 <a href="https://github.com/FlagOpen/RoboBrain/">RoboBrain</a>: Aims to explore the Reason-RFT paradigm to enhance RoboBrain's embodied reasoning capabilities.
27
  </p>
28
 
29
- ## ♣️ Model List
30
 
31
  | Tasks | Reason-RFT-Zero-2B | Reason-RFT-Zero-7B | Reason-RFT-2B | Reason-RFT-7B |
32
  |------------------------|---------------------------|---------------------|---------------------------|---------------------------|
@@ -45,7 +47,7 @@ To address these limitations, we propose **Reason-RFT**, a novel reinforcement f
45
  To evaluate **Reason-RFT**'s visual reasoning capabilities, we reconstructed a comprehensive dataset spanning visual counting, structure perception, and spatial transformation, serving as a benchmark to systematically assess visual cognition, geometric understanding, and spatial generalization.
46
  Experimental results demonstrate Reason-RFT's three key advantages: **(1) Performance Enhancement**: achieving state-of-the-art results across multiple tasks, outperforming most mainstream open-source and proprietary models;
47
  **(2) Generalization Superiority**: consistently maintaining robust performance across diverse tasks and domains, outperforming alternative training paradigms;
48
- **(3) Data Efficiency**: excelling in few-shot learning scenarios while surpassing full-dataset SFT baselines;
49
  **Reason-RFT** introduces a novel paradigm in visual reasoning, significantly advancing multimodal research.
50
 
51
  <div align="center">
@@ -61,9 +63,94 @@ Experimental results demonstrate Reasoning-RFT's three key advantages: **(1) Per
61
  - **`2025-03-26`**: 📑 We released our initial [ArXiv paper](https://arxiv.org/abs/2503.20752/) of **Reason-RFT**.
62
 
63
 
64
- ## ⭐️ Usage
65
-
66
- *Please refer to [Reason-RFT](https://github.com/tanhuajie/Reason-RFT) for more details.*
67
 
68
  ## 📑 Citation
69
  If you find this project useful, please consider citing us.
 
1
  ---
2
+ base_model:
3
+ - Qwen/Qwen2-VL-2B-Instruct
 
4
  datasets:
5
  - tanhuajie2001/Reason-RFT-CoT-Dataset
6
+ language:
7
+ - en
8
+ license: apache-2.0
9
  metrics:
10
  - accuracy
11
+ pipeline_tag: image-text-to-text
12
+ library_name: transformers
13
  ---
14
 
15
  <div align="center">
16
  <img src="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/logo.png" width="500"/>
17
  </div>
18
 
19
+ # Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models
 
20
 
21
+ This repository contains the official model checkpoints for the paper [Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models](https://huggingface.co/papers/2503.20752).
22
 
23
  <p align="center">
24
+ ⭐️ <a href="https://tanhuajie.github.io/ReasonRFT/">Project</a>&nbsp;&nbsp; │ &nbsp;&nbsp;🌎 <a href="https://github.com/tanhuajie/Reason-RFT">Github</a>&nbsp;&nbsp; │ &nbsp;&nbsp;🔥 <a href="https://huggingface.co/datasets/tanhuajie2001/Reason-RFT-CoT-Dataset">Dataset</a>&nbsp;&nbsp; │ &nbsp;&nbsp;📄 <a href="https://huggingface.co/papers/2503.20752">Paper</a>&nbsp;&nbsp; │ &nbsp;&nbsp;📑 <a href="https://arxiv.org/abs/2503.20752">ArXiv</a>&nbsp;&nbsp; │ &nbsp;&nbsp;💬 <a href="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/wechat.png">WeChat</a>
25
  </p>
26
 
27
  <p align="center">
28
  🤖 <a href="https://github.com/FlagOpen/RoboBrain/">RoboBrain</a>: Aims to explore the Reason-RFT paradigm to enhance RoboBrain's embodied reasoning capabilities.
29
  </p>
30
 
31
+ ## Model Zoo
32
 
33
  | Tasks | Reason-RFT-Zero-2B | Reason-RFT-Zero-7B | Reason-RFT-2B | Reason-RFT-7B |
34
  |------------------------|---------------------------|---------------------|---------------------------|---------------------------|
 
47
  To evaluate **Reason-RFT**'s visual reasoning capabilities, we reconstructed a comprehensive dataset spanning visual counting, structure perception, and spatial transformation, serving as a benchmark to systematically assess visual cognition, geometric understanding, and spatial generalization.
48
  Experimental results demonstrate Reason-RFT's three key advantages: **(1) Performance Enhancement**: achieving state-of-the-art results across multiple tasks, outperforming most mainstream open-source and proprietary models;
49
  **(2) Generalization Superiority**: consistently maintaining robust performance across diverse tasks and domains, outperforming alternative training paradigms;
50
+ **(3) Data Efficiency**: excelling in few-shot learning scenarios and surpassing full-dataset SFT baselines.
51
  **Reason-RFT** introduces a novel paradigm in visual reasoning, significantly advancing multimodal research.
52
 
53
  <div align="center">
 
63
  - **`2025-03-26`**: 📑 We released our initial [ArXiv paper](https://arxiv.org/abs/2503.20752/) of **Reason-RFT**.
64
 
65
 
66
+ ## ⭐️ Quick Start Inference
67
+
68
+ For full details on usage, please refer to the [Reason-RFT GitHub repository](https://github.com/tanhuajie/Reason-RFT).
69
+
70
+ ```python
+ # Minimal inference sketch using Hugging Face Transformers. It assumes the
+ # checkpoints are standard Qwen2-VL fine-tunes (as indicated by the base model
+ # and library_name above); refer to the GitHub repository for the official
+ # training and evaluation scripts.
+ import torch
+ from PIL import Image
+ from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
+
+ # Replace with the checkpoint you want from the Model Zoo table above,
+ # e.g. "tanhuajie2001/Reason-RFT-Visual-Counting-Qwen2-VL-2B".
+ model_id = "tanhuajie2001/Reason-RFT-Visual-Counting-Qwen2-VL-2B"
+
+ model = Qwen2VLForConditionalGeneration.from_pretrained(
+     model_id, torch_dtype=torch.bfloat16, device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ # Replace with an actual image path.
+ image = Image.open("./path/to/your/image.png").convert("RGB")
+ question = "What is the count of blue objects in this image?"  # Visual Counting example
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image"},
+             {"type": "text", "text": question},
+         ],
+     }
+ ]
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
+
+ with torch.inference_mode():
+     generated_ids = model.generate(**inputs, max_new_tokens=512)
+
+ # Decode only the newly generated tokens.
+ response = processor.batch_decode(
+     generated_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
+ )[0]
+ print(f"Assistant: {response}")
+ ```
154
 
155
  ## 📑 Citation
156
  If you find this project useful, please consider citing us.