arxiv:2603.15600

From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Published on Mar 16

Submitted by Zhixuan Liang on Mar 18

Shanghai Jiao Tong University

Authors: Daqi Gao, Zhixuan Liang

Abstract

Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video multimodal large language models (MLLMs), trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial and current state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model reduces the mean absolute error by 50% relative to specialized reasoning baselines and shows significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks. We establish state-of-the-art performance on the RoboFail benchmark with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%.
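The "50% reduction in mean absolute error" figure can be read as a relative drop in the average gap between predicted and ground-truth task progress. The sketch below shows that arithmetic only. It is not the authors' evaluation code; the function names, the 0 to 100 progress scale, and all numbers are illustrative assumptions rather than values from the paper.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Average absolute gap between predicted and ground-truth progress."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Fractional drop in error relative to a baseline (0.5 means a 50% reduction)."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Made-up progress values on an assumed 0-100 scale, NOT results from the paper.
target = [20.0, 50.0, 80.0]
baseline_pred = [40.0, 30.0, 100.0]  # hypothetical baseline, off by 20 on each clip
model_pred = [30.0, 40.0, 90.0]      # hypothetical model, off by 10 on each clip

baseline_mae = mean_absolute_error(baseline_pred, target)
model_mae = mean_absolute_error(model_pred, target)
reduction = relative_reduction(baseline_mae, model_mae)
```

With these toy inputs the baseline error is 20.0, the model error is 10.0, and the relative reduction is 0.5, which is the sense in which a model can halve a baseline's mean absolute error.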

Community

LeonOverload

1 day ago

This paper proposes PRIMO R1, a model that leverages Reinforcement Learning to elicit the zero-shot reasoning capabilities of Video MLLMs, enabling them to estimate task progress and identify robot execution errors without the need for external reference videos.

Liang-ZX

Paper author and submitter, about 14 hours ago (edited about 14 hours ago)

PRIMO R1 reduces mean absolute error in progress estimation by 50% and achieves 67.0% accuracy on RoboFail, surpassing closed-source models including OpenAI o1, with model weights already open-sourced on Hugging Face.
