mj1-checkpoint MMRB2 Evaluation Results

This dataset contains the evaluation results of mj1-checkpoint on the MMRB2 (Multimodal RewardBench 2) benchmark.

Dataset Description

The MMRB2 benchmark evaluates reward models on multimodal understanding and interleaved generation across 4 task types:

Task                          Description                             Count
Text-to-Image (image)         Evaluating image generation quality     1000
Image Editing (edit)          Evaluating edit instruction adherence   1000
Interleaved (interleaved)     Mixed text + image generation           1000
Visual Reasoning (reasoning)  Step-by-step reasoning quality          1000

Total pairs evaluated: 4000

Files

  • all_results.jsonl - All evaluation results in JSONL format (for HuggingFace datasets)
  • all_results.json - All evaluation results in JSON format
  • task_*.json - Individual task results (original format)

Schema

Each record contains:

  • pair_id: Unique identifier for the pair
  • task_type: Task type (image/edit/interleaved/reasoning)
  • model_name: Model used for evaluation
  • forward_judgement: Judgement when the responses are presented in the original order (A, B)
  • reverse_judgement: Judgement when the responses are presented in the swapped order (B, A)
  • forward_metadata: Full model output and reasoning (JSON string)
  • reverse_metadata: Full model output and reasoning (JSON string)
  • forward_reasoning: Extracted reasoning (if available)
  • reverse_reasoning: Extracted reasoning (if available)
  • forward_confidence: Model confidence score (if available)
  • reverse_confidence: Model confidence score (if available)

Usage

from datasets import load_dataset

# Load from local files
dataset = load_dataset('json', data_files='all_results.jsonl')

# Or load specific task
dataset = load_dataset('json', data_files='task_image_*.json')
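Because each pair is judged in both presentation orders, the results can be used to check a judge's position consistency. Assuming judgements are recorded relative to presentation order (so forward "A" and reverse "B" refer to the same underlying response — an assumption about the label convention, not confirmed by the schema), a sketch of the check:

```python
def is_consistent(forward_judgement, reverse_judgement):
    """A judge is position-consistent when it prefers the same underlying
    response in both orderings: forward 'A' should pair with reverse 'B'."""
    swap = {"A": "B", "B": "A"}
    return swap.get(forward_judgement) == reverse_judgement

# Toy records (not from the dataset) to illustrate the computation
records = [
    {"forward_judgement": "A", "reverse_judgement": "B"},  # consistent
    {"forward_judgement": "A", "reverse_judgement": "A"},  # position-biased
]
rate = sum(is_consistent(r["forward_judgement"], r["reverse_judgement"])
           for r in records) / len(records)
print(f"consistency rate: {rate:.2f}")
```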

Citation

If you use these results, please cite the MMRB2 benchmark:

@article{mmrb2,
  title={Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image},
  author={Meta FAIR},
  journal={arXiv preprint arXiv:2512.16899},
  year={2025}
}

License

This evaluation dataset is released under CC BY-NC 4.0 (following the MMRB2 benchmark license).
