DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
Abstract
DeltaRubric introduces a two-step multimodal preference evaluation approach using a single MLLM, where a Disagreement Planner generates instance-specific verification checklists and a Checklist Verifier executes these checks to produce grounded judgments, improving reward modeling reliability.
Aligning Multimodal Large Language Models (MLLMs) requires reliable reward models, yet existing single-step evaluators can suffer from lazy judging, exploiting language priors instead of performing fine-grained visual verification. While rubric-based evaluation mitigates these biases in text-only settings, extending it to multimodal tasks is bottlenecked by the complexity of visual reasoning. The critical differences between responses often depend on instance-specific visual details, so robust evaluation requires dynamically synthesizing rubrics that isolate spatial and factual discrepancies. To address this, we introduce DeltaRubric, an approach that reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. DeltaRubric operates in two steps: acting first as a Disagreement Planner, the model generates a neutral, instance-specific verification checklist. Transitioning into a Checklist Verifier, it then executes these self-generated checks against the image and question to produce the final grounded judgment. We formulate DeltaRubric as a multi-role reinforcement learning problem, jointly optimizing planning and verification capabilities. Validated on Qwen3-VL 4B and 8B Instruct models, DeltaRubric achieves solid empirical gains. For instance, on VL-RewardBench it improves base-model overall accuracy by +22.6 (4B) and +18.8 (8B) points, substantially outperforming standard no-rubric baselines. The results demonstrate that decomposing evaluation into structured, verifiable steps leads to more reliable and generalizable multimodal reward modeling.
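The two-step plan-and-execute loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: all function names, prompts, and the deterministic `call_mllm` stub (standing in for a single MLLM serving both roles) are assumptions made for the sketch.

```python
def call_mllm(prompt, image=None):
    """Stand-in for a single MLLM call serving both roles.
    A deterministic stub here so the sketch runs end to end."""
    if "generate a verification checklist" in prompt:
        # Disagreement Planner role: emit neutral, instance-specific checks.
        return ("1. Does the response correctly identify the objects in the image?\n"
                "2. Does the response state the correct spatial relation?")
    # Checklist Verifier role: toy pass/fail on one checklist item.
    return "pass" if "left of" in prompt else "fail"

def plan_checklist(image, question, resp_a, resp_b):
    """Step 1 (Disagreement Planner): synthesize a checklist that
    isolates the factual and spatial disagreements between responses."""
    prompt = (f"Question: {question}\nResponse A: {resp_a}\nResponse B: {resp_b}\n"
              "Without judging either response, generate a verification checklist "
              "isolating their factual and spatial disagreements.")
    raw = call_mllm(prompt, image)
    return [line.split(". ", 1)[1] for line in raw.splitlines()]

def verify(image, question, response, checklist):
    """Step 2 (Checklist Verifier): execute each self-generated check
    against the image and question; return the number of passed items."""
    passed = 0
    for item in checklist:
        verdict = call_mllm(f"Check: {item}\nResponse: {response}\n"
                            f"Question: {question}", image)
        passed += verdict == "pass"
    return passed

def judge(image, question, resp_a, resp_b):
    """Full DeltaRubric-style judgment: plan once, verify each response,
    and prefer the response passing more checks."""
    checklist = plan_checklist(image, question, resp_a, resp_b)
    score_a = verify(image, question, resp_a, checklist)
    score_b = verify(image, question, resp_b, checklist)
    return "A" if score_a >= score_b else "B"
```

Because the checklist is generated before either response is scored, the verifier's judgment is anchored to concrete, instance-specific checks rather than a holistic single-step comparison, which is the mechanism the paper credits for reducing lazy judging.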