mj1-checkpoint MMRB2 Evaluation Results

This dataset contains the evaluation results of mj1-checkpoint on the MMRB2 (Multimodal RewardBench 2) benchmark.

Dataset Description

The MMRB2 benchmark evaluates reward models on multimodal understanding and interleaved generation across 4 task types:

Task	Description	Count
Text-to-Image (image)	Evaluating image generation quality	1000
Image Editing (edit)	Evaluating edit instruction adherence	1000
Interleaved (interleaved)	Mixed text+image generation	1000
Visual Reasoning (reasoning)	Step-by-step reasoning quality	1000

Total pairs evaluated: 4000

Files

all_results.jsonl - All evaluation results in JSONL format (for HuggingFace datasets)
all_results.json - All evaluation results in JSON format
task_*.json - Individual task results (original format)

Schema

Each record contains:

pair_id: Unique identifier for the pair
task_type: Task type (image/edit/interleaved/reasoning)
model_name: Model used for evaluation
forward_judgement: Judgement when responses shown as A, B
reverse_judgement: Judgement when responses shown as B, A
forward_metadata: Full model output and reasoning (JSON string)
reverse_metadata: Full model output and reasoning (JSON string)
forward_reasoning: Extracted reasoning (if available)
reverse_reasoning: Extracted reasoning (if available)
forward_confidence: Model confidence score (if available)
reverse_confidence: Model confidence score (if available)
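
As a rough sketch of how these fields might be used together, the snippet below decodes a metadata field and checks whether the forward and reverse passes agree on the same underlying response. The helper names (decode_metadata, is_position_consistent) are hypothetical, and the label values "A" and "B" are an assumption: the schema above does not state how judgements are encoded, nor whether a label follows the display position or the response itself, so the labels_follow_position flag covers both readings.

```python
import json
from typing import Any, Optional

# Assumption: judgements are encoded as the strings "A" and "B".
# The schema does not confirm this; adjust to the values found in the data.
SWAP = {"A": "B", "B": "A"}


def decode_metadata(raw: Optional[str]) -> dict[str, Any]:
    """Parse a *_metadata field, which the schema describes as a JSON string."""
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    # Wrap non-object payloads so callers always receive a dict.
    return parsed if isinstance(parsed, dict) else {"value": parsed}


def is_position_consistent(
    record: dict[str, Any], labels_follow_position: bool = True
) -> Optional[bool]:
    """Return True when both passes prefer the same underlying response.

    labels_follow_position=True assumes "A" means "the response shown first",
    so the reverse label must be swapped before comparing. With False, the
    labels are assumed to stay attached to the responses and are compared
    directly. Returns None if either judgement is missing or unrecognised.
    """
    fwd = record.get("forward_judgement")
    rev = record.get("reverse_judgement")
    if fwd not in SWAP or rev not in SWAP:
        return None
    if labels_follow_position:
        return SWAP[rev] == fwd
    return rev == fwd
```

Returning None for missing or unexpected labels keeps such records out of any agreement rate instead of silently counting them as disagreements.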

Usage

from datasets import load_dataset

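# Load all results from a local file
# (with no split specified, this returns a DatasetDict with a single 'train' split)
dataset = load_dataset('json', data_files='all_results.jsonl')

# Or load a specific task
image_dataset = load_dataset('json', data_files='task_image_*.json')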
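
If the datasets library is not available, the JSONL file can also be read with the standard library alone. The following is a minimal sketch, assuming one JSON object per line with the task_type field described in the schema; parse_jsonl and count_by_task are hypothetical helper names, not part of this dataset's tooling.

```python
import json
from collections import Counter
from typing import Any, Iterable


def parse_jsonl(lines: Iterable[str]) -> list[dict[str, Any]]:
    """Parse an iterable of JSONL lines into records, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def count_by_task(records: Iterable[dict[str, Any]]) -> Counter:
    """Count records per task_type (image/edit/interleaved/reasoning)."""
    return Counter(record.get("task_type") for record in records)


# Example usage against the local file (left commented out so the sketch
# does not require the file to be present):
# with open("all_results.jsonl", encoding="utf-8") as fh:
#     records = parse_jsonl(fh)
# print(count_by_task(records))
```

Comparing the resulting counts with the per-task totals listed in the task table is a quick sanity check that the file loaded completely.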

Citation

If you use these results, please cite the MMRB2 benchmark:

@article{mmrb2,
  title={Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image},
  author={Meta FAIR},
  journal={arXiv preprint arXiv:2512.16899},
  year={2025}
}

License

This evaluation dataset is released under CC BY-NC 4.0 (following the MMRB2 benchmark license).
