license: mit
task_categories:
- video-classification
- reinforcement-learning
- robotics
language:
- en
tags:
- games
- maze
- sokoban
- 3d-navigation
- multimodal
- video
- planning
size_categories:
- 10K<n<100K
VR-Bench: A Multimodal Video Reasoning Benchmark
Dataset Description
VR-Bench is a multimodal dataset of video demonstrations of gameplay across five game types: mazes, irregular mazes, 3D mazes, Sokoban puzzles, and trap fields. It is designed for training AI models on visual reasoning, planning, and sequential decision-making tasks.
Dataset Structure
The dataset is organized into three main directories:
- train_data/: Training data with subdirectories for each game type and difficulty level
- test_data/: Test data with the same structure as the training data
- test_data_merge/: Merged test data organized by game type (without difficulty separation)
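As a rough illustration of walking this layout, the sketch below lists every game-type and difficulty directory in a split. It assumes the nesting is `<split>/<game_type>/<difficulty>/`, which the description above suggests but does not spell out; the function name `list_sample_dirs` is hypothetical.

```python
from pathlib import Path


def list_sample_dirs(root, split="train_data"):
    """Yield (game_type, difficulty, path) for each leaf directory of a split.

    Assumption: directories are nested as <split>/<game_type>/<difficulty>/.
    """
    split_dir = Path(root) / split
    for game_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        for level_dir in sorted(p for p in game_dir.iterdir() if p.is_dir()):
            yield game_dir.name, level_dir.name, level_dir
```

For test_data_merge/, which has no difficulty level, only the outer loop would apply.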
Game Types
- Maze: Classic 2D maze navigation
- Irregular Maze: Non-standard maze layouts
- Maze3D: Three-dimensional maze navigation
- Sokoban: Box-pushing puzzle game
- Trapfield: Navigation with obstacles and traps
Difficulty Levels
Each game type has three difficulty levels:
- easy: Simple layouts with shorter solution paths
- medium: Moderate complexity
- hard: Complex layouts requiring advanced planning
File Format
Each data sample consists of:
- Video file (.mp4): Demonstration of gameplay
- Image file (.png): Initial state screenshot
- JSON file (.json): Game state metadata including:
  - Grid layout and dimensions
  - Entity positions (player, goal, boxes)
  - Bounding box information
  - Render parameters
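One way to collect the three files of each sample is to group them by filename stem. This is a minimal sketch that assumes the video, image, and JSON of one sample share the same stem (for example `0001.mp4`, `0001.png`, `0001.json`); that naming scheme is not confirmed here, and `group_samples` is a hypothetical helper.

```python
from collections import defaultdict
from pathlib import Path

EXPECTED_SUFFIXES = {".mp4", ".png", ".json"}


def group_samples(directory):
    """Map each file stem to a {suffix: path} dict, keeping complete triples only.

    Assumption: all files belonging to one sample share the same stem.
    """
    by_stem = defaultdict(dict)
    for path in Path(directory).iterdir():
        if path.suffix in EXPECTED_SUFFIXES:
            by_stem[path.stem][path.suffix] = path
    return {
        stem: files
        for stem, files in sorted(by_stem.items())
        if set(files) == EXPECTED_SUFFIXES
    }
```

Stems that lack one of the three files are dropped, which makes incomplete downloads easy to spot by comparing counts.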
JSON Structure
{
  "version": "1.0",
  "game_type": "maze",
  "entities": {
    "player": {
      "pixel_pos": {"x": 165, "y": 45},
      "bbox": {"x": 150, "y": 30, "width": 30, "height": 30},
      "grid_pos": {"row": 1, "col": 5}
    },
    "goal": {
      "pixel_pos": {"x": 105, "y": 165},
      "bbox": {"x": 90, "y": 150, "width": 30, "height": 30},
      "grid_pos": {"row": 5, "col": 3}
    }
  },
  "grid": {
    "data": [[1,1,1,...], [1,0,0,...], ...],
    "height": 7,
    "width": 7
  },
  "render": {
    "cell_size": 30,
    "image_width": 210,
    "image_height": 210
  }
}
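The values in this example are consistent with a simple grid-to-pixel mapping: `pixel_pos` is the center of the cell at `grid_pos`, and `bbox` is that cell's square, both scaled by `cell_size`. This relationship is inferred from the single example above rather than documented, so the sketch below treats it as an assumption and checks it against the example values. The helper names are hypothetical.

```python
def cell_center(row, col, cell_size):
    """Pixel center of a grid cell (assumed mapping, inferred from the example)."""
    return {
        "x": col * cell_size + cell_size // 2,
        "y": row * cell_size + cell_size // 2,
    }


def cell_bbox(row, col, cell_size):
    """Pixel bounding box of a grid cell (assumed mapping)."""
    return {
        "x": col * cell_size,
        "y": row * cell_size,
        "width": cell_size,
        "height": cell_size,
    }


def entity_matches_grid(entity, cell_size):
    """Return True if an entity's pixel_pos and bbox agree with its grid_pos."""
    pos = entity["grid_pos"]
    return (
        entity["pixel_pos"] == cell_center(pos["row"], pos["col"], cell_size)
        and entity["bbox"] == cell_bbox(pos["row"], pos["col"], cell_size)
    )


# Entities copied from the example JSON above.
player = {
    "pixel_pos": {"x": 165, "y": 45},
    "bbox": {"x": 150, "y": 30, "width": 30, "height": 30},
    "grid_pos": {"row": 1, "col": 5},
}
goal = {
    "pixel_pos": {"x": 105, "y": 165},
    "bbox": {"x": 90, "y": 150, "width": 30, "height": 30},
    "grid_pos": {"row": 5, "col": 3},
}
```

Running `entity_matches_grid` over a whole split is a cheap sanity check before relying on this mapping for other game types.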
Metadata CSV
Each subdirectory contains a metadata.csv file with columns:
- video: Video filename
- prompt: Associated text prompt (currently empty)
- input_image: Initial state image filename
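A minimal sketch of reading one of these index files with the standard library follows. It assumes the CSV has a header row with exactly the column names listed above and that the filenames are relative to the directory containing metadata.csv; `read_metadata` is a hypothetical helper.

```python
import csv
from pathlib import Path


def read_metadata(csv_path):
    """Read metadata.csv and attach full paths for the video and initial image.

    Assumptions: a header row with 'video', 'prompt', 'input_image', and
    filenames relative to the CSV's own directory.
    """
    csv_path = Path(csv_path)
    with csv_path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    for row in rows:
        row["video_path"] = csv_path.parent / row["video"]
        row["image_path"] = csv_path.parent / row["input_image"]
    return rows
```

Because the prompt column is currently empty, `row["prompt"]` comes back as an empty string.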
Usage
This dataset can be used for:
- Visual Planning: Learning to plan sequences of actions from visual input
- Multimodal Learning: Combining video, image, and structured data
- Reinforcement Learning: Training agents on game environments
- Video Understanding: Learning temporal patterns in sequential decision-making
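For the planning use case, a reference path over the JSON grid can serve as a baseline to compare model output against. The sketch below is a plain breadth-first search, not the method used to generate the demonstrations. It assumes that `1` marks a wall and `0` a free cell in `grid.data` (suggested by the example, not confirmed) and that moves are 4-directional. It applies to the maze-style games only; Sokoban would also need box positions in the search state.

```python
from collections import deque


def shortest_path(grid, start, goal, wall=1):
    """Breadth-first search over a 2D grid; returns a list of (row, col) or None.

    Assumptions: cells equal to `wall` are blocked, moves are up/down/left/right.
    """
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != wall and (nr, nc) not in parents:
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


# Hypothetical toy grid for illustration, not taken from the dataset.
toy_grid = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
toy_path = shortest_path(toy_grid, (1, 1), (3, 1))
```

With real samples, `start` and `goal` would come from the `grid_pos` of the player and goal entities.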
Dataset Statistics
- Game Types: 5 (Maze, Irregular Maze, Maze3D, Sokoban, Trapfield)
- Difficulty Levels: 3 per game type
- Data Splits: Training and test sets
- File Types: Video (.mp4), Images (.png), Metadata (.json), Index (.csv)
Citation
If you use this dataset in your research, please cite:
@dataset{vr_bench_2025,
  title={VR-Bench: A Multimodal Video Reasoning Benchmark},
  author={[Author Name]},
  year={2025},
  url={https://huggingface.co/datasets/[username]/VR-Bench}
}
License
This dataset is released under the MIT License.