---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
---

# Shuffle-R1-Qwen-3B

This is the model checkpoint of Shuffle-R1-Qwen-3B. It is trained from [**Qwen2.5-VL-3B**](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct).

## Model Performance

| Model | MathVerse | MathVision | MathVista (mini) | WeMath (loose) | HallusionBench | ChartQA | Avg. |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Qwen2.5-VL-3B | 34.8 | 21.9 | 58.4 | 51.7 | 59.8 | 73.1 | 49.9 |
| Qwen2.5-VL-7B | 42.6 | 25.8 | 67.4 | 63.5 | 65.2 | 79.8 | 57.4 |
| Shuffle-R1-3B | 44.2 | 26.8 | 70.4 | 66.5 | 69.2 | 79.9 | 59.5 |
| Shuffle-R1-7B | 53.9 | 30.0 | 77.0 | 72.3 | 71.0 | 84.1 | 64.7 |

All models are evaluated with a CoT prompt.

## Inference

### Using *Transformers*

The process is the same as for [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL). Note that it is better to prepend a "thinking prompt" to the beginning of the user query, as shown below.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "path/to/your/checkpoint"

# Load the model in bfloat16; FlashAttention 2 is optional but recommended.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path)

# Thinking prompt prepended to the user query.
system_prompt = """
You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \\boxed{}.
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image"},
            {"type": "text", "text": system_prompt + "YOUR TEXT QUERY HERE"},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Increase max_new_tokens if the reasoning chain gets truncated.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

### Using *vLLM*

Our model also supports inference with [**vLLM**](https://github.com/vllm-project/vllm). Please refer to our [**official repo**](https://github.com/xiaomi-research/shuffle-r1) for detailed instructions; a minimal sketch is also provided at the end of this card.

## Citation

If you find our work useful for your research, please consider citing:

```bibtex
@misc{zhu2025shuffler1,
      title={Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle},
      author={Linghao Zhu and Yiran Guan and Dingkang Liang and Jianzhong Ju and Zhenbo Luo and Bin Qin and Jian Luan and Yuliang Liu and Xiang Bai},
      year={2025},
      eprint={2508.05612},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.05612},
}
```
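
## Appendix: Minimal *vLLM* Example

For quick reference, below is a minimal vLLM sketch, not the official recipe (see the official repo for the maintained instructions). It assumes a recent vLLM build with Qwen2.5-VL support, a single input image per prompt, and the same thinking prompt as in the *Transformers* example; `model_path` and the image path are placeholders.

```python
# A minimal sketch, assuming a recent vLLM build with Qwen2.5-VL support.
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

model_path = "path/to/your/checkpoint"  # placeholder

# Cap the number of images per prompt at one for this single-image example.
llm = LLM(model=model_path, limit_mm_per_prompt={"image": 1})
processor = AutoProcessor.from_pretrained(model_path)

# Same thinking prompt as in the Transformers example above.
system_prompt = (
    "You FIRST think about the reasoning process as an internal monologue "
    "and then provide the final answer. The reasoning process MUST BE "
    "enclosed within <think> </think> tags. The final answer MUST BE put "
    "in \\boxed{}. "
)

# Build the chat prompt; the bare image item inserts the vision placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": system_prompt + "YOUR TEXT QUERY HERE"},
        ],
    }
]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image = Image.open("path/to/your/image").convert("RGB")  # placeholder path
sampling_params = SamplingParams(temperature=0.0, max_tokens=2048)

# Pass the rendered prompt together with the raw image.
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```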