Paper: Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image (arXiv:2512.16899)
This dataset contains the evaluation results of mj1-checkpoint on the MMRB2 (Multimodal RewardBench 2) benchmark.
The MMRB2 benchmark evaluates reward models on multimodal understanding and interleaved generation across 4 task types:
| Task | Description | Count |
|---|---|---|
| Text-to-Image (image) | Evaluating image generation quality | 1000 |
| Image Editing (edit) | Evaluating edit instruction adherence | 1000 |
| Interleaved | Mixed text+image generation | 1000 |
| Visual Reasoning (reasoning) | Step-by-step reasoning quality | 1000 |
Total pairs evaluated: 4000
The results are provided in the following files:

- `all_results.jsonl` - All evaluation results in JSONL format (for HuggingFace datasets)
- `all_results.json` - All evaluation results in JSON format
- `task_*.json` - Individual task results (original format)

Each record contains:

- `pair_id`: Unique identifier for the pair
- `task_type`: Task type (image/edit/interleaved/reasoning)
- `model_name`: Model used for evaluation
- `forward_judgement`: Judgement when the responses are shown in the order A, B
- `reverse_judgement`: Judgement when the responses are shown in the order B, A
- `forward_metadata`: Full model output and reasoning (JSON string)
- `reverse_metadata`: Full model output and reasoning (JSON string)
- `forward_reasoning`: Extracted reasoning (if available)
- `reverse_reasoning`: Extracted reasoning (if available)
- `forward_confidence`: Model confidence score (if available)
- `reverse_confidence`: Model confidence score (if available)

from datasets import load_dataset
# Load from local files
dataset = load_dataset('json', data_files='all_results.jsonl')
# Or load a specific task
dataset = load_dataset('json', data_files='task_image_*.json')
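Once the records are loaded, a few small helpers can make them easier to inspect. The following is a minimal sketch using only the standard library, not part of the MMRB2 tooling. The helper names (`read_records`, `count_by_task`, `decode_metadata`, `is_position_consistent`) are hypothetical. The consistency check additionally assumes that judgements are stored as positional labels such as "A" and "B", which this card does not specify, so adjust the labels to match the actual values in the files.

```python
import json
from collections import Counter


def read_records(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def count_by_task(records):
    """Count how many records belong to each task_type."""
    return Counter(r["task_type"] for r in records)


def decode_metadata(record, key="forward_metadata"):
    """Parse a *_metadata field, which is stored as a JSON string.

    Returns None when the field is missing or empty.
    """
    raw = record.get(key)
    return json.loads(raw) if raw else None


def is_position_consistent(record, first="A", second="B"):
    """Check whether the same underlying response wins in both orders.

    ASSUMPTION: judgements are positional labels ("A"/"B" by default).
    Because the reverse pass swaps the display order, a consistent
    judge flips its label: (A, B) or (B, A).
    """
    fwd = record["forward_judgement"]
    rev = record["reverse_judgement"]
    return (fwd, rev) in {(first, second), (second, first)}


# Example usage (assumes all_results.jsonl is in the working directory):
# records = read_records("all_results.jsonl")
# print(count_by_task(records))
# rate = sum(map(is_position_consistent, records)) / len(records)
```

If the judgement labels follow the assumed convention, the share of position-consistent records gives a rough indication of how sensitive the judge is to response order.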
If you use these results, please cite the MMRB2 benchmark:
@article{mmrb2,
title={Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image},
author={Meta FAIR},
journal={arXiv preprint arXiv:2512.16899},
year={2025}
}
This evaluation dataset is released under CC BY-NC 4.0 (following the MMRB2 benchmark license).