license: mit
task_categories:
  - video-classification
  - reinforcement-learning
  - robotics
language:
  - en
tags:
  - games
  - maze
  - sokoban
  - 3d-navigation
  - multimodal
  - video
  - planning
size_categories:
  - 10K<n<100K

VR-Bench: A Multimodal Video Reasoning Benchmark

Dataset Description

This multimodal dataset contains video demonstrations of gameplay across five game types: 2D mazes, irregular mazes, 3D mazes, Sokoban puzzles, and trap fields. It is designed for training and evaluating AI models on visual reasoning, planning, and sequential decision-making tasks.

Dataset Structure

The dataset is organized into three main directories:

train_data/: Training data with subdirectories for each game type and difficulty level
test_data/: Test data with the same structure as training data
test_data_merge/: Merged test data organized by game type (without difficulty separation)
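
Because the exact subdirectory names are not listed here, one way to discover the available subsets is to walk a split directory and collect every folder that holds a metadata.csv file. The following is a minimal sketch, assuming the dataset has already been downloaded locally; the helper name `find_subsets` and the local path `VR-Bench/train_data` are hypothetical.

```python
from pathlib import Path


def find_subsets(split_dir):
    """Return sorted folder paths (relative to split_dir) that contain a metadata.csv."""
    split_dir = Path(split_dir)
    if not split_dir.is_dir():
        return []
    return sorted(
        csv_path.parent.relative_to(split_dir).as_posix()
        for csv_path in split_dir.rglob("metadata.csv")
    )


if __name__ == "__main__":
    # Hypothetical local download location; adjust to where the data lives.
    for subset in find_subsets("VR-Bench/train_data"):
        print(subset)
```

The same call should work for test_data/ and test_data_merge/, since the function does not depend on how many directory levels sit between the split root and the metadata file.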

Game Types

Maze: Classic 2D maze navigation
Irregular Maze: Non-standard maze layouts
Maze3D: Three-dimensional maze navigation
Sokoban: Box-pushing puzzle game
Trapfield: Navigation with obstacles and traps

Difficulty Levels

Each game type has three difficulty levels:

easy: Simple layouts with shorter solution paths
medium: Moderate complexity
hard: Complex layouts requiring advanced planning

File Format

Each data sample consists of:

Video file (.mp4): Demonstration of gameplay
Image file (.png): Initial state screenshot
JSON file (.json): Game state metadata including:
- Grid layout and dimensions
- Entity positions (player, goal, boxes)
- Bounding box information
- Render parameters
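
To collect the three files of each sample from one directory, a sketch like the following may help. It rests on an assumption that this README does not confirm: that the video, image, and JSON files of a sample share the same filename stem. The helper name `group_samples` is hypothetical. If the stems differ in practice, use the metadata.csv file to pair videos with images instead.

```python
from collections import defaultdict
from pathlib import Path

REQUIRED_SUFFIXES = {".mp4", ".png", ".json"}


def group_samples(directory):
    """Map each file stem to {suffix: path}, keeping only stems with all three files.

    Assumption (not confirmed by the README): the video, image, and JSON
    of one sample share a filename stem.
    """
    by_stem = defaultdict(dict)
    for path in Path(directory).iterdir():
        suffix = path.suffix.lower()
        if suffix in REQUIRED_SUFFIXES:
            by_stem[path.stem][suffix] = path
    return {
        stem: files
        for stem, files in sorted(by_stem.items())
        if set(files) == REQUIRED_SUFFIXES
    }
```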

JSON Structure

{
  "version": "1.0",
  "game_type": "maze",
  "entities": {
    "player": {
      "pixel_pos": {"x": 165, "y": 45},
      "bbox": {"x": 150, "y": 30, "width": 30, "height": 30},
      "grid_pos": {"row": 1, "col": 5}
    },
    "goal": {
      "pixel_pos": {"x": 105, "y": 165},
      "bbox": {"x": 90, "y": 150, "width": 30, "height": 30},
      "grid_pos": {"row": 5, "col": 3}
    }
  },
  "grid": {
    "data": [[1,1,1,...], [1,0,0,...], ...],
    "height": 7,
    "width": 7
  },
  "render": {
    "cell_size": 30,
    "image_width": 210,
    "image_height": 210
  }
}
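
In the sample above, the pixel fields follow from the grid position and the cell size: the bounding box starts at (col * cell_size, row * cell_size) and pixel_pos is the center of that cell. The sketch below checks this relation for each entity. It is inferred from the sample values only, so it may not hold for every game type (Maze3D in particular), and the function names are hypothetical. Entities that are not a single object with a grid_pos field are skipped.

```python
def cell_to_pixels(row, col, cell_size):
    """Return (pixel_pos, bbox) for a grid cell, following the sample's convention."""
    x, y = col * cell_size, row * cell_size
    pixel_pos = {"x": x + cell_size // 2, "y": y + cell_size // 2}
    bbox = {"x": x, "y": y, "width": cell_size, "height": cell_size}
    return pixel_pos, bbox


def check_sample(meta):
    """Return names of entities whose pixel data disagree with their grid_pos."""
    cell = meta["render"]["cell_size"]
    mismatched = []
    for name, ent in meta["entities"].items():
        # Skip anything that is not shaped like the player/goal entries above.
        if not isinstance(ent, dict) or "grid_pos" not in ent:
            continue
        pos, bbox = cell_to_pixels(ent["grid_pos"]["row"], ent["grid_pos"]["col"], cell)
        if ent.get("pixel_pos") != pos or ent.get("bbox") != bbox:
            mismatched.append(name)
    return mismatched


# Values copied from the sample JSON above (grid data omitted).
sample = {
    "entities": {
        "player": {
            "pixel_pos": {"x": 165, "y": 45},
            "bbox": {"x": 150, "y": 30, "width": 30, "height": 30},
            "grid_pos": {"row": 1, "col": 5},
        },
        "goal": {
            "pixel_pos": {"x": 105, "y": 165},
            "bbox": {"x": 90, "y": 150, "width": 30, "height": 30},
            "grid_pos": {"row": 5, "col": 3},
        },
    },
    "render": {"cell_size": 30},
}
print(check_sample(sample))  # → []
```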

Metadata CSV

Each subdirectory contains a metadata.csv file with columns:

video: Video filename
prompt: Associated text prompt (currently empty)
input_image: Initial state image filename
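
Reading the index can be sketched with Python's standard csv module. This example assumes that metadata.csv has a header row with exactly the column names listed above and that the filenames are relative to the subdirectory holding the CSV; neither detail is confirmed here. The helper name `read_metadata` is hypothetical.

```python
import csv
from pathlib import Path


def read_metadata(subset_dir):
    """Yield one dict per metadata.csv row, with paths resolved against subset_dir."""
    subset_dir = Path(subset_dir)
    with open(subset_dir / "metadata.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield {
                "video": subset_dir / row["video"],
                # The prompt column is currently empty; normalize it to "".
                "prompt": row.get("prompt") or "",
                "input_image": subset_dir / row["input_image"],
            }
```

Each yielded record pairs a video with its initial-state image, so it can feed directly into a data loader that decodes both files.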

Usage

This dataset can be used for:

Visual Planning: Learning to plan sequences of actions from visual input
Multimodal Learning: Combining video, image, and structured data
Reinforcement Learning: Training agents on game environments
Video Understanding: Learning temporal patterns in sequential decision-making

Dataset Statistics

Game Types: 5 (Maze, Irregular Maze, Maze3D, Sokoban, Trapfield)
Difficulty Levels: 3 per game type
Data Splits: Training and test sets
File Types: Video (.mp4), Images (.png), Metadata (.json), Index (.csv)

Citation

If you use this dataset in your research, please cite:

@dataset{vr_bench_2025,
  title={VR-Bench: A Multimodal Video Reasoning Benchmark},
  author={[Author Name]},
  year={2025},
  url={https://huggingface.co/datasets/[username]/VR-Bench}
}

License

This dataset is released under the MIT License.