Instructions to use WaltonFuture/Qwen2.5VL-3b-RLCS with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use WaltonFuture/Qwen2.5VL-3b-RLCS with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="WaltonFuture/Qwen2.5VL-3b-RLCS")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("WaltonFuture/Qwen2.5VL-3b-RLCS")
model = AutoModelForImageTextToText.from_pretrained("WaltonFuture/Qwen2.5VL-3b-RLCS")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use WaltonFuture/Qwen2.5VL-3b-RLCS with vLLM:
Install from pip and serve the model:
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "WaltonFuture/Qwen2.5VL-3b-RLCS"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "WaltonFuture/Qwen2.5VL-3b-RLCS",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```
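Once the server is running, it can also be called from Python through the same OpenAI-compatible API. The sketch below uses the `openai` client (`pip install openai`); the model name and image URL simply mirror the curl example above, and the placeholder API key is arbitrary since the local server does not require authentication by default.

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (OpenAI-compatible API).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="WaltonFuture/Qwen2.5VL-3b-RLCS",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```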
Use Docker

```bash
docker model run hf.co/WaltonFuture/Qwen2.5VL-3b-RLCS
```
- SGLang
How to use WaltonFuture/Qwen2.5VL-3b-RLCS with SGLang:
Install from pip and serve the model:
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "WaltonFuture/Qwen2.5VL-3b-RLCS" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "WaltonFuture/Qwen2.5VL-3b-RLCS",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```
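The same OpenAI-compatible endpoint can be called from Python. The sketch below uses `requests` and sends a local image as a base64 data URL instead of a remote URL; the file path is a placeholder, and base64 data URLs are commonly accepted by OpenAI-compatible vision endpoints, but treat that as an assumption and fall back to a plain image URL if needed.

```python
import base64
import requests

# Placeholder path; replace with a real image on disk.
with open("path/to/your/image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "WaltonFuture/Qwen2.5VL-3b-RLCS",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
}

# SGLang server started as above, listening on port 30000.
resp = requests.post("http://localhost:30000/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```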
Use Docker images

```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "WaltonFuture/Qwen2.5VL-3b-RLCS" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "WaltonFuture/Qwen2.5VL-3b-RLCS",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

- Docker Model Runner
How to use WaltonFuture/Qwen2.5VL-3b-RLCS with Docker Model Runner:
```bash
docker model run hf.co/WaltonFuture/Qwen2.5VL-3b-RLCS
```
Improve model card with detailed description, usage, and additional info
This PR significantly enhances the model card by incorporating a detailed introduction, key highlights, and a practical Python code snippet for sample usage. It also includes comprehensive sections for data access, related model access, acknowledgments, and citation, all extracted from the original GitHub README. This update makes the model card more informative and user-friendly for researchers and developers on the Hugging Face Hub.
README.md
CHANGED
```diff
@@ -4,10 +4,129 @@ base_model:
 datasets:
 - WaltonFuture/Multimodal-Cold-Start
 - WaltonFuture/Multimodal-RL-Data
+library_name: transformers
 license: apache-2.0
 pipeline_tag: image-text-to-text
-library_name: transformers
 ---
 
-
-
```

The remainder of the diff adds the following model card body:

# Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

* 🐙 **GitHub Repo:** [waltonfuture/RL-with-Cold-Start](https://github.com/waltonfuture/RL-with-Cold-Start)
* 📜 **Paper (arXiv):** [Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start (arXiv:2505.22334)](https://arxiv.org/abs/2505.22334)

## Introduction

This model is presented in the paper "Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start". We present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities.

Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3%→73.4% on MathVista, 62.9%→70.4% on We-Math) and our 3B model achieving performance competitive with several 7B models.

<div align=center>
<img src="https://huggingface.co/WaltonFuture/Qwen2.5VL-3b-RLCS/resolve/main/model_comparison.png" width="80%" alt="Model Comparison" align=center/>
</div>

### ✨ Key Highlights

* **Two-Stage Approach:** Combines Supervised Fine-Tuning (SFT) as a "cold start" for structured chain-of-thought reasoning with Reinforcement Learning (RL) via GRPO for further refinement (a toy sketch of GRPO's group-relative advantage follows this list).
* **Enhanced Multimodal Reasoning:** Consistently outperforms both SFT-only and RL-only methods on challenging multimodal reasoning benchmarks.
* **State-of-the-Art Performance:** Achieves SOTA performance among open-source MLLMs at both 3B and 7B scales.
* **Significant Improvements:** The 7B model shows substantial gains (e.g., 73.4% on MathVista, 70.4% on We-Math) over base models, while the 3B model is competitive with several 7B models.
* **Practical Guidance:** Provides practical insights for developing advanced multimodal reasoning models.
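
As a rough illustration of the RL stage, the snippet below sketches the group-relative advantage at the heart of GRPO: several responses are sampled per prompt, and each response's reward is normalized against the group's mean and standard deviation. This is an illustrative sketch only (the reward function, sampling, and clipped policy-gradient update are omitted); it is not the training code used for this model.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Compute GRPO-style advantages for one prompt.

    rewards: shape (G,), one scalar reward per sampled response in the group.
    Returns advantages of the same shape, normalized within the group.
    """
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)

# Toy example: 4 sampled responses with binary correctness rewards.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards))  # positive for correct, negative for incorrect

```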
## Sample Usage

You can easily load and use this model with the Hugging Face `transformers` library. Ensure you have `transformers` and `Pillow` installed.

```bash
pip install transformers Pillow
```

Below is an example demonstrating how to perform multimodal inference:

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import torch

# Load the model and processor.
# Replace "WaltonFuture/Qwen2.5VL-3b-RLCS" with "WaltonFuture/Qwen2.5VL-7b-RLCS" for the 7B model.
model_id = "WaltonFuture/Qwen2.5VL-3b-RLCS"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Example image (replace with your image path or a PIL Image object).
# Make sure to provide a valid image path. For example, download an image locally:
# import requests
# from io import BytesIO
# image_url = "https://www.ilusionviajera.com/wp-content/uploads/2021/04/paris-eiffel-tower-in-spring.jpg"
# response = requests.get(image_url)
# image = Image.open(BytesIO(response.content)).convert("RGB")
image_path = "path/to/your/image.jpg"  # Replace with your image path
image = Image.open(image_path).convert("RGB")

# Prepare the chat messages in the required multimodal format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail and answer any questions about it. For example, what is the main subject?"},
        ],
    }
]

# Apply the model's chat template to format the input.
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Process the text and image together; keep pixel_values alongside input_ids.
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)

# Generate the response.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens to a human-readable response.
response = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)[0]

print(response)
```

## Data Access

Our two-stage datasets are now available on Hugging Face:

| Stage | Data |
| :--- | :--- |
| Cold Start | [Multimodal-Cold-Start](https://huggingface.co/datasets/WaltonFuture/Multimodal-Cold-Start) |
| RL | [Multimodal-RL-Data](https://huggingface.co/datasets/WaltonFuture/Multimodal-RL-Data) |
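
Both datasets can be pulled with the `datasets` library. The snippet below is a minimal sketch; split names and field layout are not verified here and should be checked on the dataset pages.

```python
from datasets import load_dataset

# Cold-start SFT data and RL data (check each dataset card for splits and fields).
cold_start = load_dataset("WaltonFuture/Multimodal-Cold-Start")
rl_data = load_dataset("WaltonFuture/Multimodal-RL-Data")

print(cold_start)
print(rl_data)
```
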
## Model Access

Our models are now available on Hugging Face:

| Backbone | Our model |
| :--- | :--- |
| Qwen2.5-VL-7b | [Qwen2.5VL-7b-RL-with-Cold-Start](https://huggingface.co/WaltonFuture/Qwen2.5VL-7b-RLCS) |
| Qwen2.5-VL-3b | [Qwen2.5VL-3b-RL-with-Cold-Start](https://huggingface.co/WaltonFuture/Qwen2.5VL-3b-RLCS) |

## Acknowledgment

Our models are built upon the amazing [Qwen2.5-VL](https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5) family.
We thank [EasyR1](https://github.com/hiyouga/EasyR1) and [ms-swift](https://github.com/modelscope/ms-swift) for their training code.

## Citation

If our work has been helpful to you, please consider citing it:

```bibtex
@article{wei2025advancing,
  title={Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start},
  author={Wei, Lai and Li, Yuting and Zheng, Kaipeng and Wang, Chen and Wang, Yue and Kong, Linghe and Sun, Lichao and Huang, Weiran},
  journal={arXiv preprint arXiv:2505.22334},
  year={2025}
}
```