---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- mm math reasoning
datasets:
- open-r1/OpenR1-Math-220k
metrics:
- accuracy
---

# TBAC-VLR1-3B

## Overview

TBAC-VLR1-3B is a multimodal language model fine-tuned by the **Tencent PCG Basic Algorithm Center**. Starting from Qwen2.5-VL-3B-Instruct, TBAC-VLR1-3B-SFT is trained with supervised fine-tuning (SFT) on 40k samples filtered from OpenR1-Math-220k. TBAC-VLR1-3B is then trained with GRPO (Group Relative Policy Optimization), adopting the Clip-Higher strategy from DAPO. It achieves **state-of-the-art** results on several multimodal reasoning benchmarks among models of the same size, with the highest Average and MathVision scores and a tied-best MathVerse score in the table below.

## Performance

| Model | **Average** | **MathVista** | **MathVision** | **MathVerse** | **DynaMath** | **LogicVista** |
| :-------------------: | :---------: | :-----------: | :------------: | :-----------: | :-----------: | :----------: |
| Qwen2-VL-2B | 22.4 | 48.0 | 16.1 | 17.5 | 3.8 | 26.6 |
| InternVL2.5-2B | 23.8 | 51.1 | 14.0 | 22.3 | 4.4 | 27.3 |
| InternVL3-2B | 31.5 | 57.6 | 20.2 | 24.5 | 14.8 | 40.3 |
| Qwen2.5-VL-3B | 33.6 | 61.2 | 21.9 | 31.2 | 13.2 | 40.3 |
| VLM-R1-3B-Math-0305 | 34.1 | 62.7 | 21.9 | 32.2 | 13.0 | 40.5 |
| Taichu-VLR-3B | 34.3 | 64.9 | 23.1 | 32.1 | 12.6 | 38.7 |
| VLAA-Thinker-Qwen2.5VL-3B | 35.7 | 61.0 | 24.4 | 36.4 | 18.2 | 38.5 |
| TBAC-VLR1-3B-preview | 36.3 | 64.8 | 25.0 | 33.2 | 17.7 | 40.8 |
| TBAC-VLR1-3B-SFT | 35.3 | 57.0 | 27.4 | 41.1 | 15.0 | 36.1 |
| TBAC-VLR1-3B | **36.7** | 57.5 | 28.7 | 41.1 | 16.1 | 40.0 |

The Average column is the mean of the five benchmark scores. The results for our models are self-reported and were obtained by running the evaluations offline on each benchmark.
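
For readers unfamiliar with the training recipe named in the Overview, the following is a minimal sketch of the two ideas it mentions: GRPO's group-relative advantage and DAPO's Clip-Higher (an asymmetric clip range with a larger upper bound). It is an illustration under stated assumptions, not the training code used for this model, which this card does not publish. The function names, the example rewards, the use of population standard deviation, and the clip values (0.2 and 0.28 are the defaults reported for DAPO, not confirmed settings for TBAC-VLR1-3B) are all assumptions.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward against its own group.

    `rewards` holds one scalar reward per response sampled for the SAME prompt.
    Population std is an assumption; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clip_higher_objective(logp_new, logp_old, advantage,
                          eps_low=0.2, eps_high=0.28):
    """Per-token clipped surrogate with a decoupled (asymmetric) clip range.

    With eps_high > eps_low, the probability ratio of a positively rewarded
    token may grow further before clipping than under a symmetric range.
    The clip values here are illustrative assumptions.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


# Hypothetical rewards for four sampled answers to one prompt (1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A token whose probability rose by 1.5x under the new policy, positive advantage:
# the surrogate is capped at (1 + eps_high) * advantage.
value = clip_higher_objective(math.log(1.5), 0.0, advantage=1.0)
print(advantages, value)
```

In an actual training loop the objective would be averaged over tokens and responses and maximized with respect to the policy parameters; the sketch only shows how a single term is computed.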

## Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "TencentBAC/TBAC-VLR1-3B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("TencentBAC/TBAC-VLR1-3B")

# Placeholders: replace with your own image and question.
image_path = "path/to/image.png"
query = "Your question about the image."

messages = [
    {
        "role": "system",
        "content": "You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within tags. The final answer MUST BE put in \\boxed{}."
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_path,
            },
            {"type": "text", "text": query},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to the device the model was placed on by device_map="auto".
inputs = inputs.to(model.device)

# Inference: generate the output.
# Increase max_new_tokens if the reasoning is cut off before the final answer.
generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Strip the prompt tokens so that only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

## Citation

If you find our model useful in your research, please consider giving it a ❤️ and citing it. Thanks!

```
@misc{Ou2025TBACVLR1,
  title = {TBAC-VLR1-3B},
  author = {Ou, Linyu and Xu, Junzhe and Yin, Yuyang},
  year = {2025},
  url = {https://huggingface.co/TencentBAC/TBAC-VLR1-3B},
}
```

---

**About**

Created by the Tencent PCG Basic Algorithm Center. All rights reserved.