---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
license: apache-2.0
metrics:
- mae
- accuracy
pipeline_tag: video-text-to-text
---

# PRIMO R1: Process Reasoning Induced Monitoring
|
|
This repository contains the model weights for PRIMO R1, introduced in the paper [From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation](https://huggingface.co/papers/2603.15600).
|
|
## Model Description
|
|
PRIMO R1 is a 7B-parameter framework designed to transform video Multimodal Large Language Models (MLLMs) from passive "Observers" into active "Critics" for long-horizon robotic manipulation. While traditional models focus on recognizing ongoing events, PRIMO R1 evaluates the current state of a task relative to its final goal.
|
|
The model is fine-tuned from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) using outcome-based Reinforcement Learning to elicit explicit Chain-of-Thought (CoT) generation for progress estimation. Its architecture incorporates a structured temporal input that anchors video sequences between the initial and current state images.
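
The structured temporal input can be sketched as a Qwen2.5-VL-style chat message that places the video frames between the initial and current state images. This is a minimal illustration: the prompt wording, file names, and exact message layout below are assumptions, not the authors' released prompt format.

```python
# Sketch of the structured temporal input: the video sequence is anchored
# between the initial and current state images. Prompt text and layout are
# illustrative assumptions, not the released format.

def build_progress_query(initial_image, frame_paths, current_image):
    """Assemble a Qwen2.5-VL-style chat message anchoring the video
    frames between the initial and current state images."""
    content = [
        {"type": "text", "text": "Initial state:"},
        {"type": "image", "image": initial_image},
        {"type": "text", "text": "Execution so far:"},
        {"type": "video", "video": frame_paths},
        {"type": "text", "text": "Current state:"},
        {"type": "image", "image": current_image},
        {"type": "text",
         "text": "Reason step by step about task progress toward the "
                 "final goal, then output a progress estimate in [0, 1]."},
    ]
    return [{"role": "user", "content": content}]


messages = build_progress_query(
    "initial.png", ["f0.png", "f1.png", "f2.png"], "current.png"
)
```

The resulting `messages` list follows the standard Qwen2.5-VL chat schema, so it could then be tokenized with `AutoProcessor.apply_chat_template` from `transformers` and fed to the model for generation, as in the usual Qwen2.5-VL inference pipeline.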
|
|
## Key Features
|
|
- **RL-Induced Reasoning**: Uses outcome-based RL to incentivize the generation of thought processes that evaluate state progress.
- **State-of-the-Art Performance**: Reduces mean absolute error by 50% relative to specialized reasoning baselines, outperforming much larger general MLLMs.
- **Strong Generalization**: Exhibits zero-shot performance on failure detection tasks, achieving 67.0% accuracy on the RoboFail benchmark and surpassing closed-source models such as OpenAI o1.
- **Structured Temporal Input**: Explicitly anchors the video sequence between initial and current state images to provide clear goal-oriented context.
|
|
## Citation
|
|
If you find our work helpful for your research, please consider citing:
|
|
```bibtex
@misc{liu2026passiveobserveractivecritic,
      title={From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation},
      author={Yibin Liu and Yaxing Lyu and Daqi Gao and Zhixuan Liang and Weiliang Tang and Shilong Mu and Xiaokang Yang and Yao Mu},
      year={2026},
      eprint={2603.15600},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2603.15600},
}
```