---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
---

# Shuffle-R1-Qwen-3B

This is the model checkpoint of Shuffle-R1-Qwen-3B, trained from [**Qwen2.5-VL-3B-Instruct**](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct).

## Model Performance

| Model | MathVerse | MathVision | MathVista (mini) | WeMath (loose) | HallusionBench | ChartQA | Avg. |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Qwen2.5-VL-3B | 34.8 | 21.9 | 58.4 | 51.7 | 59.8 | 73.1 | 49.9 |
| Qwen2.5-VL-7B | 42.6 | 25.8 | 67.4 | 63.5 | 65.2 | 79.8 | 57.4 |
| Shuffle-R1-3B | 44.2 | 26.8 | 70.4 | 66.5 | 69.2 | 79.9 | 59.5 |
| Shuffle-R1-7B | 53.9 | 30.0 | 77.0 | 72.3 | 71.0 | 84.1 | 64.7 |

All models are evaluated with a CoT prompt.

## Inference

### Using *Transformers*

The process is the same as for [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL). Note that it is best to add a "thinking prompt" at the beginning of the user query, as in the example below.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "path/to/your/checkpoint"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(model_path)

# Thinking prompt: makes the model reason inside <think> tags and put
# the final answer in \boxed{}.
system_prompt = """
You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \\boxed{}.
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image"},
            {"type": "text", "text": system_prompt + "YOUR TEXT QUERY HERE"},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Use a generous budget: the <think> reasoning trace can be long.
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
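
Since the final answer is always wrapped in `\boxed{}`, you may want to parse it out of the decoded text. Below is a minimal sketch; the `extract_boxed_answer` helper and its regex are our own illustration, not part of the released tooling, and assume the boxed answer contains no nested braces:

```python
import re

def extract_boxed_answer(text: str):
    """Return the content of the last \\boxed{...} in `text`, or None.

    Hypothetical helper for illustration; assumes no nested braces
    inside the boxed answer.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

print(extract_boxed_answer(output_text[0]))
```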

### Using *vLLM*

Our model also supports inference using [**vLLM**](https://github.com/vllm-project/vllm).

Please refer to our [**Official Repo**](https://github.com/xiaomi-research/shuffle-r1) for detailed instructions.
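
For orientation only, here is a minimal sketch of offline inference with vLLM's `LLM` API. The checkpoint path, image path, and sampling settings are placeholders, and the exact vLLM version and flags may differ from what the official repo uses, so treat this as an assumption-laden starting point rather than the reference setup:

```python
from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "path/to/your/checkpoint"  # placeholder

# One image per prompt; raise the limit for multi-image queries.
llm = LLM(model=model_path, limit_mm_per_prompt={"image": 1})
processor = AutoProcessor.from_pretrained(model_path)

# Same thinking prompt as in the Transformers example above.
system_prompt = (
    "You FIRST think about the reasoning process as an internal monologue "
    "and then provide the final answer. The reasoning process MUST BE "
    "enclosed within <think> </think> tags. The final answer MUST BE put "
    "in \\boxed{}."
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image"},  # placeholder
            {"type": "text", "text": system_prompt + "YOUR TEXT QUERY HERE"},
        ],
    }
]

prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, _ = process_vision_info(messages)

outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": image_inputs}}],
    SamplingParams(temperature=0.0, max_tokens=1024),
)
print(outputs[0].outputs[0].text)
```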

## Citation

If you find our work useful for your research, please consider citing:

```bibtex
@misc{zhu2025shuffler1,
      title={Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle},
      author={Linghao Zhu and Yiran Guan and Dingkang Liang and Jianzhong Ju and Zhenbo Luo and Bin Qin and Jian Luan and Yuliang Liu and Xiang Bai},
      year={2025},
      eprint={2508.05612},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.05612},
}
```