PhoenixZ
/

MM-HELIX-7B-Thinking

+---
+license: cc-by-nc-4.0
+library_name: transformers
+pipeline_tag: video-text-to-text
+tags:
+- multimodal
+- video
+- reasoning
+- qwen
+---
+# MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization
+This repository hosts the **MM-HELIX-7B-Thinking** model, a multimodal large language model (MLLM) designed to enhance long-chain reflective reasoning. It was introduced in the paper [MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization](https://huggingface.co/papers/2510.08540).
+The model is based on the Qwen2.5-VL-7B architecture and leverages a novel training strategy called Adaptive Hybrid Policy Optimization (AHPO) to achieve significant improvements in complex, iterative reasoning tasks.
+*   **Project Homepage**: [https://mm-helix.github.io/](https://mm-helix.github.io/)
+*   **Code Repository**: [https://github.com/PhoenixZ810/MM-HELIX](https://github.com/PhoenixZ810/MM-HELIX)
+*   **MM-HELIX-7B-Thinking on Hugging Face**: [https://huggingface.co/PhoenixZ/MM-HELIX-7B-Thinking](https://huggingface.co/PhoenixZ/MM-HELIX-7B-Thinking)
+*   **MM-HELIX Benchmark Dataset**: [https://huggingface.co/datasets/tianhao2k/MM-HELIX](https://huggingface.co/datasets/tianhao2k/MM-HELIX)
+*   **MM-HELIX-100K Dataset**: [https://huggingface.co/datasets/mjuicem/MM-HELIX-100K](https://huggingface.co/datasets/mjuicem/MM-HELIX-100K)
+<p align="center">
+  <img width="100%" src="https://github.com/PhoenixZ810/MM-HELIX/raw/main/images/Teaser09241052.png">
+</p>
+## Abstract
+While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting 1,260 samples of 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6\% accuracy improvement on MM-HELIX benchmark and demonstrates strong generalization with a +5.7\% average performance gain on general mathematic and logic tasks. Our work demonstrate that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.
+## Introduction
+While Multimodal Large Language Models (MLLMs) have shown proficiency in tasks like mathematics and logic, their ability for **long-chain reflective reasoning**—a key element for solving complex, real-world problems—is not fully developed. This type of reasoning requires iterative thinking and backtracking, which current models often lack.
+**MM-HELIX** is a comprehensive platform designed to **evaluate** and **enhance** this crucial capability in MLLMs. It consists of:
+*   **A Challenging Benchmark:** A new benchmark, MM-HELIX, featuring 1,260 instances across 42 difficult tasks that demand reflective reasoning. Our findings show that existing MLLMs struggle significantly on this benchmark.
+*   **A High-Quality Dataset:** To address the performance gap, we created MM-HELIX-100K, a dataset with 100,000 high-quality, reflective reasoning instruction-tuning samples, generated through our innovative **Step-Elicited Response Generation (SERG)** pipeline.
+*   **An Advanced Training Method:** We introduce **Adaptive Hybrid Policy Optimization (AHPO)**, a novel training strategy that combines offline supervision with online optimization. This method effectively teaches the model to learn from expert data and explore solutions independently, overcoming issues like sparse rewards and catastrophic forgetting that are common in standard Reinforcement Learning.
+Our model, based on Qwen2.5-VL-7B, shows a **+18.6%** improvement in accuracy on the MM-HELIX benchmark and a **+5.7%** average gain on general math and logic tasks, demonstrating that reflective reasoning can be effectively learned and generalized.
+## Adaptive Hybrid Policy Optimization (AHPO)
+Standard training methods often fall short in complex reasoning tasks. Supervised Fine-Tuning (SFT) can lead to catastrophic forgetting of general capabilities, while on-policy Reinforcement Learning (RL) is inefficient with sparse rewards.
+To solve these issues, we developed **Adaptive Hybrid Policy Optimization (AHPO)**, a novel training algorithm that unifies off-policy supervision and on-policy exploration.
+<p align="center">
+  <img width="80%" src="https://github.com/PhoenixZ810/MM-HELIX/raw/main/images/AHPO.png">
+</p>
+AHPO's adaptive mechanism dynamically adjusts the influence of expert data based on the model's performance. When the model struggles (sparse rewards), it relies more on expert guidance. As it improves, it is encouraged to explore and find new solutions on its own. This approach fosters robust and generalizable reasoning skills.
+<p align-center>
+  <img src="https://github.com/PhoenixZ810/MM-HELIX/raw/main/images/results_1.png">
+</p>
+## MM-HELIX Benchmark
+<p align="center">
+  <img width="100%" src="https://github.com/PhoenixZ810/MM-HELIX/raw/main/images/main_table.png">
+  <em><p align="center">The 42 tasks in the MM-HELIX benchmark.</p></em>
+</p>
+The **MM-HELIX benchmark** is designed to test the limits of multimodal long-chain reflective reasoning in MLLMs.
+*   **Diverse and Challenging Tasks:** The benchmark includes 1,260 high-quality samples from 42 unique tasks divided into four categories: **algorithms, graphs, puzzles, and games**.
+*   **Controlled Difficulty:** Tasks are generated procedurally with five levels of difficulty, from Level 1 (very easy) to Level 5 (very hard), allowing for a detailed analysis of model performance at different complexities.
+*   **Automated and Objective Evaluation:** Our framework includes an **Instance Generator**, a deterministic **Solver**, and an automated **Verifier**. The Verifier validates the correctness of model-generated solutions, enabling objective and scalable evaluation, and also serves as a reward oracle in a reinforcement learning setup.
+## MM-HELIX-100K Dataset: High-Quality Multimodal Reflective CoT
+To train models for complex reasoning, a large-scale, high-quality dataset is essential. We introduce **MM-HELIX-100K**, a dataset of 100,000 instruction-tuning instances with detailed, reflective reasoning paths.
+This dataset was created using our **Step-Elicited Response Generation (SERG)** pipeline, which efficiently generates high-quality Chain-of-Thought (CoT) trajectories.
+The SERG pipeline works as follows:
+1.  A rule-based CoT constructor first generates a skeletal reasoning path.
+2.  This path is then refined by a powerful language model (Qwen3-235B) to create a more natural, human-like reasoning process that includes reflective steps.
+3.  Finally, each generated trajectory is validated by our automated verifier to ensure its correctness and quality.
+<p align="center">
+  <img width="50%" src="https://github.com/PhoenixZ810/MM-HELIX/raw/main/images/cot.png">
+  <em><p align="center">The Step-Elicited Response Generation (SERG) pipeline.</p></em>
+</p>
+## MM-HELIX Leaderboard
+Our comprehensive evaluation of 23 leading MLLMs on the MM-HELIX benchmark reveals significant limitations in their reflective reasoning abilities. Even top proprietary models struggle to surpass a 50% accuracy threshold, and a notable performance gap exists between multimodal and text-only inputs.
+<p align="center">
+  <img width="100%" src="https://github.com/PhoenixZ810/MM-HELIX/raw/main/images/main_table.png">
+  <em><p align="center"><b>Table 1:</b> Evaluation results on MM-HELIX across multimodal and text-only settings.</p></em>
+</p>
+## Training Performance
+When applying AHPO to the Qwen2.5-VL-7B model, we observed remarkable improvements. Our final model, **MM-HELIX-7B-Thinking**, not only achieves a **+18.6%** absolute improvement on the MM-HELIX benchmark but also demonstrates strong generalization with a **+5.7%** average gain on general math and logic benchmarks.
+<p align="center">
+  <img width="100%" src="https://github.com/PhoenixZ810/MM-HELIX/raw/main/images/results_1.png">
+  <em><p align="center"><b>Table 2:</b> Comparison of AHPO and other training strategies.</p></em>
+</p>
+For detailed results and rankings, please refer to our interactive leaderboard on the project page.
+## Citation
+If you find our work useful, please consider citing our paper:
+```bibtex
+@article{zhao2025mmhelix,
+  title={MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization},
+  author={Zhao, Xiangyu and Lin, Junming and Liang, Tianhao and Zhou, Yifan and Chai, Wenhao and Gu, Yuzhe and Wang, Weiyun and Chen, Kai and Luo, Gen and Zhang, Wenwei and Yan, Junchi and Yang, Hua and Duan, Haodong and Yang, Xue},
+  journal={arXiv preprint arXiv:2510.08540},
+  year={2025}
+}
+```