DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
Abstract
DeltaRubric introduces a two-step multimodal preference evaluation approach using a single MLLM, where a Disagreement Planner generates instance-specific verification checklists and a Checklist Verifier executes these checks to produce grounded judgments, improving reward modeling reliability.
Aligning Multimodal Large Language Models (MLLMs) requires reliable reward models, yet existing single-step evaluators can suffer from lazy judging, exploiting language priors instead of performing fine-grained visual verification. While rubric-based evaluation mitigates these biases in text-only settings, extending it to multimodal tasks is bottlenecked by the complexity of visual reasoning. The critical differences between responses often depend on instance-specific visual details, so robust evaluation requires dynamically synthesizing rubrics that isolate spatial and factual discrepancies. To address this, we introduce DeltaRubric, an approach that reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. DeltaRubric operates in two steps: acting first as a Disagreement Planner, the model generates a neutral, instance-specific verification checklist. Transitioning into a Checklist Verifier, it then executes these self-generated checks against the image and question to produce the final grounded judgment. We formulate DeltaRubric as a multi-role reinforcement learning problem, jointly optimizing planning and verification capabilities. Validated on Qwen3-VL 4B and 8B Instruct models, DeltaRubric achieves solid empirical gains. For instance, on VL-RewardBench it improves base-model overall accuracy by +22.6 (4B) and +18.8 (8B) points, substantially outperforming standard no-rubric baselines. The results demonstrate that decomposing evaluation into structured, verifiable steps leads to more reliable and generalizable multimodal reward modeling.
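The two-step plan-and-execute loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: all function names, prompts, and the deterministic `call_mllm` stub (standing in for a single MLLM serving both roles) are assumptions made for the sketch.

```python
def call_mllm(prompt, image=None):
    """Stand-in for a single MLLM call serving both roles.
    A deterministic stub here so the sketch runs end to end."""
    if "generate a verification checklist" in prompt:
        # Disagreement Planner role: emit neutral, instance-specific checks.
        return ("1. Does the response correctly identify the objects in the image?\n"
                "2. Does the response state the correct spatial relation?")
    # Checklist Verifier role: toy pass/fail on one checklist item.
    return "pass" if "left of" in prompt else "fail"

def plan_checklist(image, question, resp_a, resp_b):
    """Step 1 (Disagreement Planner): synthesize a checklist that
    isolates the factual and spatial disagreements between responses."""
    prompt = (f"Question: {question}\nResponse A: {resp_a}\nResponse B: {resp_b}\n"
              "Without judging either response, generate a verification checklist "
              "isolating their factual and spatial disagreements.")
    raw = call_mllm(prompt, image)
    return [line.split(". ", 1)[1] for line in raw.splitlines()]

def verify(image, question, response, checklist):
    """Step 2 (Checklist Verifier): execute each self-generated check
    against the image and question; return the number of passed items."""
    passed = 0
    for item in checklist:
        verdict = call_mllm(f"Check: {item}\nResponse: {response}\n"
                            f"Question: {question}", image)
        passed += verdict == "pass"
    return passed

def judge(image, question, resp_a, resp_b):
    """Full DeltaRubric-style judgment: plan once, verify each response,
    and prefer the response passing more checks."""
    checklist = plan_checklist(image, question, resp_a, resp_b)
    score_a = verify(image, question, resp_a, checklist)
    score_b = verify(image, question, resp_b, checklist)
    return "A" if score_a >= score_b else "B"
```

Because the checklist is generated before either response is scored, the verifier's judgment is anchored to concrete, instance-specific checks rather than a holistic single-step comparison, which is the mechanism the paper credits for reducing lazy judging.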