metadata
license: mit
task_categories:
  - video-classification
  - reinforcement-learning
  - robotics
language:
  - en
tags:
  - games
  - maze
  - sokoban
  - 3d-navigation
  - multimodal
  - video
  - planning
size_categories:
  - 10K<n<100K

VR-Bench: A Multimodal Video Reasoning Benchmark

Dataset Description

VR-Bench is a multimodal dataset of video demonstrations of gameplay across five game types: mazes, irregular mazes, 3D mazes, Sokoban puzzles, and trap fields. It is designed for training AI models on visual reasoning, planning, and sequential decision-making tasks.

Dataset Structure

The dataset is organized into three main directories:

  • train_data/: Training data with subdirectories for each game type and difficulty level
  • test_data/: Test data with the same structure as training data
  • test_data_merge/: Merged test data organized by game type (without difficulty separation)
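One way to discover which subsets are present, without hard-coding directory names, is to search each split for its metadata.csv index file. The following is a minimal sketch, assuming the dataset has already been downloaded to a local folder; the `root` argument and the `find_subsets` helper are hypothetical names, and the exact nesting of game-type and difficulty folders is not assumed.

```python
from pathlib import Path

# Split directory names as listed in this README.
SPLITS = ("train_data", "test_data", "test_data_merge")


def find_subsets(root):
    """Map each split to the sorted subdirectories that contain a metadata.csv.

    `root` is a hypothetical local path to a downloaded copy of the dataset.
    Missing split directories map to an empty list.
    """
    root = Path(root)
    subsets = {}
    for split in SPLITS:
        split_dir = root / split
        if not split_dir.is_dir():
            subsets[split] = []
            continue
        # Each subdirectory of interest is identified by its index file.
        subsets[split] = sorted(
            p.parent.relative_to(split_dir).as_posix()
            for p in split_dir.rglob("metadata.csv")
        )
    return subsets
```

Calling `find_subsets("path/to/dataset")` returns a dictionary keyed by split name, which can then be iterated to load each subset's index.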

Game Types

  1. Maze: Classic 2D maze navigation
  2. Irregular Maze: Non-standard maze layouts
  3. Maze3D: Three-dimensional maze navigation
  4. Sokoban: Box-pushing puzzle game
  5. Trapfield: Navigation with obstacles and traps

Difficulty Levels

Each game type has three difficulty levels:

  • easy: Simple layouts with shorter solution paths
  • medium: Moderate complexity
  • hard: Complex layouts requiring advanced planning

File Format

Each data sample consists of:

  • Video file (.mp4): Demonstration of gameplay
  • Image file (.png): Initial state screenshot
  • JSON file (.json): Game state metadata including:
    • Grid layout and dimensions
    • Entity positions (player, goal, boxes)
    • Bounding box information
    • Render parameters

JSON Structure

{
  "version": "1.0",
  "game_type": "maze",
  "entities": {
    "player": {
      "pixel_pos": {"x": 165, "y": 45},
      "bbox": {"x": 150, "y": 30, "width": 30, "height": 30},
      "grid_pos": {"row": 1, "col": 5}
    },
    "goal": {
      "pixel_pos": {"x": 105, "y": 165},
      "bbox": {"x": 90, "y": 150, "width": 30, "height": 30},
      "grid_pos": {"row": 5, "col": 3}
    }
  },
  "grid": {
    "data": [[1,1,1,...], [1,0,0,...], ...],
    "height": 7,
    "width": 7
  },
  "render": {
    "cell_size": 30,
    "image_width": 210,
    "image_height": 210
  }
}
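In the example above, the pixel positions and bounding boxes are consistent with a simple grid-to-pixel mapping: a cell's bounding box starts at (col * cell_size, row * cell_size) and its pixel position is the center of that box. This relationship is inferred from this single example rather than documented, so treat the following Python sketch as a sanity check under that assumption; the helper names are hypothetical.

```python
def cell_bbox(row, col, cell_size):
    """Bounding box of a grid cell, assuming a top-left origin (an inference)."""
    return {"x": col * cell_size, "y": row * cell_size,
            "width": cell_size, "height": cell_size}


def cell_center(row, col, cell_size):
    """Pixel center of a grid cell under the same assumed mapping."""
    half = cell_size // 2
    return {"x": col * cell_size + half, "y": row * cell_size + half}


def entity_is_consistent(entity, cell_size):
    """Check that an entity's pixel_pos and bbox agree with its grid_pos."""
    row, col = entity["grid_pos"]["row"], entity["grid_pos"]["col"]
    return (entity["pixel_pos"] == cell_center(row, col, cell_size)
            and entity["bbox"] == cell_bbox(row, col, cell_size))


# Entity values copied from the JSON example above.
player = {
    "pixel_pos": {"x": 165, "y": 45},
    "bbox": {"x": 150, "y": 30, "width": 30, "height": 30},
    "grid_pos": {"row": 1, "col": 5},
}
goal = {
    "pixel_pos": {"x": 105, "y": 165},
    "bbox": {"x": 90, "y": 150, "width": 30, "height": 30},
    "grid_pos": {"row": 5, "col": 3},
}

print(entity_is_consistent(player, 30), entity_is_consistent(goal, 30))
# prints: True True
```

The same example also satisfies image_width = width * cell_size (210 = 7 * 30). Other game types, such as Maze3D, may not follow this mapping.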

Metadata CSV

Each subdirectory contains a metadata.csv file with columns:

  • video: Video filename
  • prompt: Associated text prompt (currently empty)
  • input_image: Initial state image filename
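The index can be read with Python's standard csv module. This is a minimal sketch that assumes the file has a header row naming the three columns above; the `load_index` helper and the sample filenames are hypothetical. Empty prompts are returned as None.

```python
import csv
import io


def load_index(fileobj):
    """Read a metadata.csv-style index into a list of dicts.

    Assumes a header row with the columns video, prompt, and input_image.
    """
    rows = []
    for row in csv.DictReader(fileobj):
        rows.append({
            "video": row["video"],
            "prompt": row["prompt"] or None,  # prompts are currently empty
            "input_image": row["input_image"],
        })
    return rows


# Hypothetical sample content, for illustration only.
sample = "video,prompt,input_image\nsample_0001.mp4,,sample_0001.png\n"
print(load_index(io.StringIO(sample)))
```

For a file on disk, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `load_index`.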

Usage

This dataset can be used for:

  • Visual Planning: Learning to plan sequences of actions from visual input
  • Multimodal Learning: Combining video, image, and structured data
  • Reinforcement Learning: Training agents on game environments
  • Video Understanding: Learning temporal patterns in sequential decision-making

Dataset Statistics

  • Game Types: 5 (Maze, Irregular Maze, Maze3D, Sokoban, Trapfield)
  • Difficulty Levels: 3 per game type
  • Data Splits: Training and test sets, plus a merged copy of the test set organized by game type
  • File Types: Video (.mp4), Images (.png), Metadata (.json), Index (.csv)

Citation

If you use this dataset in your research, please cite:

@dataset{vr_bench_2025,
  title={VR-Bench: A Multimodal Video Reasoning Benchmark},
  author={[Author Name]},
  year={2025},
  url={https://huggingface.co/datasets/[username]/VR-Bench}
}

License

This dataset is released under the MIT License.