Improve model card for Reason-RFT models with pipeline tag, library name, and usage example

by nielsr HF Staff - opened Oct 7, 2025

base: refs/heads/main ← from: refs/pr/1

README.md +100 -13

@@ -1,32 +1,34 @@
 ---
-license: apache-2.0
-language:
-- en
 datasets:
 - tanhuajie2001/Reason-RFT-CoT-Dataset
 metrics:
 - accuracy
-base_model:
-- Qwen/Qwen2-VL-2B-Instruct
 ---
 <div align="center">
 <img src="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/logo.png" width="500"/>
 </div>
-# 🤗 Reason-RFT CoT Dateset
-*The model checkpoints in our project "Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning"*.
 <p align="center">
-    </a>&nbsp&nbsp⭐️ <a href="https://tanhuajie.github.io/ReasonRFT/">Project</a></a>&nbsp&nbsp │ &nbsp&nbsp🌎 <a href="https://github.com/tanhuajie/Reason-RFT">Github</a>&nbsp&nbsp │ &nbsp&nbsp🔥 <a href="https://huggingface.co/datasets/tanhuajie2001/Reason-RFT-CoT-Dataset">Dataset</a>&nbsp&nbsp │ &nbsp&nbsp📑 <a href="https://arxiv.org/abs/2503.20752">ArXiv</a>&nbsp&nbsp │ &nbsp&nbsp💬 <a href="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/wechat.png">WeChat</a>
 </p>
 <p align="center">
 </a>&nbsp&nbsp🤖 <a href="https://github.com/FlagOpen/RoboBrain/">RoboBrain</a>: Aim to Explore ReasonRFT Paradigm to Enhance RoboBrain's Embodied Reasoning Capabilities.
 </p>
-## ♣️ Model List
 | Tasks                  | Reason-RFT-Zero-2B        | Reason-RFT-Zero-7B       | Reason-RFT-2B        | Reason-RFT-7B             |
 |------------------------|---------------------------|---------------------|---------------------------|---------------------------|
@@ -45,7 +47,7 @@ To address these limitations, we propose **Reason-RFT**, a novel reinforcement f
 To evaluate **Reason-RFT**'s visual reasoning capabilities, we reconstructed a comprehensive dataset spanning visual counting, structure perception, and spatial transformation, serving as a benchmark to systematically assess visual cognition, geometric understanding, and spatial generalization.
 Experimental results demonstrate Reasoning-RFT's three key advantages: **(1) Performance Enhancement**: achieving state-of-the-art results across multiple tasks, outperforming most mainstream open-source and proprietary models;
 **(2) Generalization Superiority**: consistently maintaining robust performance across diverse tasks and domains, outperforming alternative training paradigms;
-**(3) Data Efficiency**: excelling in few-shot learning scenarios while surpassing full-dataset SFT baselines;
 **Reason-RFT** introduces a novel paradigm in visual reasoning, significantly advancing multimodal research.
 <div align="center">
@@ -61,9 +63,94 @@ Experimental results demonstrate Reasoning-RFT's three key advantages: **(1) Per
 - **`2025-03-26`**: 📑 We released our initial [ArXiv paper](https://arxiv.org/abs/2503.20752/) of **Reason-RFT**.
-## ⭐️ Usage
-*Please refer to [Reason-RFT](https://github.com/tanhuajie/Reason-RFT) for more details.*
 ## 📑 Citation
 If you find this project useful, welcome to cite us.

 ---
+base_model:
+- Qwen/Qwen2-VL-2B-Instruct
 datasets:
 - tanhuajie2001/Reason-RFT-CoT-Dataset
+language:
+- en
+license: apache-2.0
 metrics:
 - accuracy
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
 <div align="center">
 <img src="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/logo.png" width="500"/>
 </div>
+# Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models
+This repository contains the official model checkpoints for the project "Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models", presented in the paper [Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models](https://huggingface.co/papers/2503.20752).
 <p align="center">
+    </a>&nbsp&nbsp⭐️ <a href="https://tanhuajie.github.io/ReasonRFT/">Project</a></a>&nbsp&nbsp │ &nbsp&nbsp🌎 <a href="https://github.com/tanhuajie/Reason-RFT">Github</a>&nbsp&nbsp │ &nbsp&nbsp🔥 <a href="https://huggingface.co/datasets/tanhuajie2001/Reason-RFT-CoT-Dataset">Dataset</a>&nbsp&nbsp │ &nbsp&nbsp📄 <a href="https://huggingface.co/papers/2503.20752">Paper</a>&nbsp&nbsp │ &nbsp&nbsp📑 <a href="https://arxiv.org/abs/2503.20752">ArXiv</a>&nbsp&nbsp │ &nbsp&nbsp💬 <a href="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/wechat.png">WeChat</a>
 </p>
 <p align="center">
 </a>&nbsp&nbsp🤖 <a href="https://github.com/FlagOpen/RoboBrain/">RoboBrain</a>: Aim to Explore ReasonRFT Paradigm to Enhance RoboBrain's Embodied Reasoning Capabilities.
 </p>
+## Model Zoo
 | Tasks                  | Reason-RFT-Zero-2B        | Reason-RFT-Zero-7B       | Reason-RFT-2B        | Reason-RFT-7B             |
 |------------------------|---------------------------|---------------------|---------------------------|---------------------------|
 To evaluate **Reason-RFT**'s visual reasoning capabilities, we reconstructed a comprehensive dataset spanning visual counting, structure perception, and spatial transformation, serving as a benchmark to systematically assess visual cognition, geometric understanding, and spatial generalization.
 Experimental results demonstrate Reasoning-RFT's three key advantages: **(1) Performance Enhancement**: achieving state-of-the-art results across multiple tasks, outperforming most mainstream open-source and proprietary models;
 **(2) Generalization Superiority**: consistently maintaining robust performance across diverse tasks and domains, outperforming alternative training paradigms;
+**(3) Data Efficiency**: excelling in few-shot learning scenarios and surpassing full-dataset SFT baselines;
 **Reason-RFT** introduces a novel paradigm in visual reasoning, significantly advancing multimodal research.
 <div align="center">
 - **`2025-03-26`**: 📑 We released our initial [ArXiv paper](https://arxiv.org/abs/2503.20752/) of **Reason-RFT**.
+## ⭐️ Quick Start Inference
+For full details on usage, please refer to the [Reason-RFT GitHub repository](https://github.com/tanhuajie/Reason-RFT).
+```python
+# git clone https://github.com/tanhuajie/Reason-RFT
+import numpy as np
+import torch
+from longvu.builder import load_pretrained_model # Note: This import seems to be from a different project (LongVU),
+                                                # please verify if it's the correct way to load this model.
+                                                # For transformers compatibility, typically you'd use AutoModel/AutoProcessor
+                                                # as indicated by this model's config.json and tokenizer_config.json.
+from longvu.constants import (
+    DEFAULT_IMAGE_TOKEN,
+    IMAGE_TOKEN_INDEX,
+)
+from longvu.conversation import conv_templates, SeparatorStyle
+from longvu.mm_datautils import (
+    KeywordsStoppingCriteria,
+    process_images,
+    tokenizer_image_token,
+)
+from decord import cpu, VideoReader
+from PIL import Image  # required by Image.open() further down
+# Example loading for Reason-RFT, assuming it can be loaded directly as a transformers model or via a similar builder
+# Replace with the actual model ID from the table above, e.g., "tanhuajie2001/Reason-RFT-Visual-Counting-Qwen2-VL-2B"
+# For direct transformers loading (if supported, which is indicated by file info):
+# from transformers import AutoProcessor, AutoTokenizer, Qwen2VLForConditionalGeneration
+# model_id = "tanhuajie2001/Reason-RFT-Visual-Counting-Qwen2-VL-2B"
+# model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)
+# tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+# processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+tokenizer, model, image_processor, context_len = load_pretrained_model(
+    "./checkpoints/longvu_qwen", None, "cambrian_qwen", # These paths/names might need adjustment for Reason-RFT
+)
+model.eval()
+# Ensure to replace with an actual image path
+image_path = "./path/to/your/image.png"
+qs = "What is the count of blue objects in this image?" # Example question for Visual Counting
+# For a full Hugging Face Transformers compatible example, you would typically do:
+# from PIL import Image
+# image = Image.open(image_path).convert('RGB')
+# messages = [
+#     {"role": "user", "content": [{"type": "image", "image": image}, {"type": "text", "text": qs}]},
+# ]
+# text_input = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+# inputs = processor(text=text_input, images=image, return_tensors="pt").to(model.device)
+# generated_ids = model.generate(**inputs, max_new_tokens=512)
+# response = processor.batch_decode(generated_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
+# print(f"Assistant: {response}")
+# Original usage from the GitHub repository:
+image = Image.open(image_path).convert('RGB')
+image_sizes = [image.size]
+image_tensor = image_processor(images=image, return_tensors="pt").pixel_values
+image_tensor = [image_tensor.to(model.device, dtype=torch.bfloat16)] # Or appropriate dtype
+qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
+conv = conv_templates["qwen"].copy()
+conv.append_message(conv.roles[0], qs)
+conv.append_message(conv.roles[1], None)
+prompt = conv.get_prompt()
+input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
+stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
+keywords = [stop_str]
+stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
+with torch.inference_mode():
+    output_ids = model.generate(
+        input_ids,
+        images=image_tensor,
+        image_sizes=image_sizes,
+        do_sample=False,  # greedy decoding, so no temperature is passed
+        max_new_tokens=128,
+        use_cache=True,
+        stopping_criteria=[stopping_criteria],
+    )
+pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
+print(f'Assistant: {pred}')
+```
 ## 📑 Citation
 If you find this project useful, welcome to cite us.
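
The card metadata in this diff lists `Qwen/Qwen2-VL-2B-Instruct` as the base model and sets `library_name: transformers`, so the commented-out Transformers path in the snippet is probably closer to how these checkpoints load than the LongVU builder. The following is a minimal sketch of that path, not the authors' confirmed usage. It assumes the checkpoints keep the Qwen2-VL architecture, that a `transformers` release with `Qwen2VLForConditionalGeneration` is installed along with `accelerate` (for `device_map="auto"`), and that `model_id` is taken from the model table. The helper names `build_messages`, `strip_prompt`, and `run_inference` are hypothetical and introduced only for illustration.

```python
from typing import Any


def build_messages(image_path: str, question: str) -> list[dict[str, Any]]:
    """Build a single-turn chat in the message format used by Qwen2-VL processors."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]


def strip_prompt(generated_ids: list[list[int]], prompt_lens: list[int]) -> list[list[int]]:
    """Drop the echoed prompt tokens so only newly generated tokens are decoded."""
    return [ids[n:] for ids, n in zip(generated_ids, prompt_lens)]


def run_inference(model_id: str, image_path: str, question: str, max_new_tokens: int = 512) -> str:
    """Hypothetical end-to-end helper; assumes a Qwen2-VL-compatible checkpoint."""
    # Heavy imports stay local so the two helpers above work without them.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    messages = build_messages(image_path, question)
    # Render the chat template to text; the image itself is passed separately.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

    with torch.inference_mode():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    # generate() returns prompt + completion, so cut the prompt off before decoding.
    prompt_lens = [inputs["input_ids"].shape[1]] * output_ids.shape[0]
    new_tokens = strip_prompt(output_ids.tolist(), prompt_lens)
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0].strip()
```

A call would look like `run_inference(model_id, "./path/to/your/image.png", "What is the count of blue objects in this image?")`, with `model_id` replaced by one of the repository IDs from the model table. Greedy decoding is used here to match `do_sample=False` in the snippet above.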