Update README.md
README.md
CHANGED
@@ -21,110 +21,8 @@ pipeline_tag: image-to-video

 # VR-Bench: A Multimodal Video Reasoning Benchmark

-##
+## 🧠 Models

-
-
-
-
-The dataset is organized into three main directories:
-
-- `train_data/`: Training data with subdirectories for each game type and difficulty level
-- `test_data/`: Test data with the same structure as training data
-- `test_data_merge/`: Merged test data organized by game type (without difficulty separation)
-
-### Game Types
-
-1. **Maze**: Classic 2D maze navigation
-2. **Irregular Maze**: Non-standard maze layouts
-3. **Maze3D**: Three-dimensional maze navigation
-4. **Sokoban**: Box-pushing puzzle game
-5. **Trapfield**: Navigation with obstacles and traps
-
-### Difficulty Levels
-
-Each game type has three difficulty levels:
-- `easy`: Simple layouts with shorter solution paths
-- `medium`: Moderate complexity
-- `hard`: Complex layouts requiring advanced planning
-
-## File Format
-
-Each data sample consists of:
-- **Video file** (`.mp4`): Demonstration of gameplay
-- **Image file** (`.png`): Initial state screenshot
-- **JSON file** (`.json`): Game state metadata including:
-  - Grid layout and dimensions
-  - Entity positions (player, goal, boxes)
-  - Bounding box information
-  - Render parameters
-
-### JSON Structure
-
-```json
-{
-  "version": "1.0",
-  "game_type": "maze",
-  "entities": {
-    "player": {
-      "pixel_pos": {"x": 165, "y": 45},
-      "bbox": {"x": 150, "y": 30, "width": 30, "height": 30},
-      "grid_pos": {"row": 1, "col": 5}
-    },
-    "goal": {
-      "pixel_pos": {"x": 105, "y": 165},
-      "bbox": {"x": 90, "y": 150, "width": 30, "height": 30},
-      "grid_pos": {"row": 5, "col": 3}
-    }
-  },
-  "grid": {
-    "data": [[1,1,1,...], [1,0,0,...], ...],
-    "height": 7,
-    "width": 7
-  },
-  "render": {
-    "cell_size": 30,
-    "image_width": 210,
-    "image_height": 210
-  }
-}
-```
-
-### Metadata CSV
-
-Each subdirectory contains a `metadata.csv` file with columns:
-- `video`: Video filename
-- `prompt`: Associated text prompt (currently empty)
-- `input_image`: Initial state image filename
-
-## Usage
-
-This dataset can be used for:
-- **Visual Planning**: Learning to plan sequences of actions from visual input
-- **Multimodal Learning**: Combining video, image, and structured data
-- **Reinforcement Learning**: Training agents on game environments
-- **Video Understanding**: Learning temporal patterns in sequential decision-making
-
-## Dataset Statistics
-
-- **Total Games**: 5 game types
-- **Difficulty Levels**: 3 per game type
-- **Data Splits**: Training and test sets
-- **File Types**: Video (.mp4), Images (.png), Metadata (.json), Index (.csv)
-
-## Citation
-
-If you use this dataset in your research, please cite:
-
-```bibtex
-@dataset{vr_bench_2025,
-  title={VR-Bench: A Multimodal Video Reasoning Benchmark},
-  author={[Author Name]},
-  year={2025},
-  url={https://huggingface.co/datasets/[username]/VR-Bench}
-}
-```
-
-## License
-
-This dataset is released under the MIT License.
+| Models | Download Links | Description |
+|---------------------------|----------------|-----------------------------------------------------------------------------|
+| MiniVeo3-Reasoner-Maze-5B | 🤗 [HuggingFace](https://huggingface.co/你的模型地址) | Fine-tuned LoRA for [Maze](https://example.com/maze-docs) tasks (3×3 to 6×6 sizes) from the base model [Wan2.2-TI2V-5B](https://huggingface.co/你的base模型地址) |
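
For readers who still rely on the JSON structure documented in the removed README section, the sample values there suggest a simple relationship between the fields: `pixel_pos` appears to be the centre of the cell named by `grid_pos`, and `bbox` appears to be that cell's square of side `cell_size`. The sketch below checks this against the README's own sample. The relationship is inferred from that single example rather than stated by the authors, and the helper names (`grid_to_pixel`, `grid_to_bbox`, `check_entities`) are hypothetical.

```python
import json

# Sample metadata copied from the README's JSON example. The "grid" field is
# left out because the README elides its contents.
SAMPLE = json.loads("""
{
  "version": "1.0",
  "game_type": "maze",
  "entities": {
    "player": {
      "pixel_pos": {"x": 165, "y": 45},
      "bbox": {"x": 150, "y": 30, "width": 30, "height": 30},
      "grid_pos": {"row": 1, "col": 5}
    },
    "goal": {
      "pixel_pos": {"x": 105, "y": 165},
      "bbox": {"x": 90, "y": 150, "width": 30, "height": 30},
      "grid_pos": {"row": 5, "col": 3}
    }
  },
  "render": {"cell_size": 30, "image_width": 210, "image_height": 210}
}
""")


def grid_to_pixel(row, col, cell_size):
    """Assumed mapping: pixel_pos is the centre of the grid cell."""
    half = cell_size // 2
    return {"x": col * cell_size + half, "y": row * cell_size + half}


def grid_to_bbox(row, col, cell_size):
    """Assumed mapping: bbox covers exactly one grid cell."""
    return {"x": col * cell_size, "y": row * cell_size,
            "width": cell_size, "height": cell_size}


def check_entities(meta):
    """Return the names of entities whose pixel_pos or bbox disagree with grid_pos."""
    cell = meta["render"]["cell_size"]
    mismatched = []
    for name, entity in meta["entities"].items():
        row = entity["grid_pos"]["row"]
        col = entity["grid_pos"]["col"]
        if (entity["pixel_pos"] != grid_to_pixel(row, col, cell)
                or entity["bbox"] != grid_to_bbox(row, col, cell)):
            mismatched.append(name)
    return mismatched


if __name__ == "__main__":
    print(check_entities(SAMPLE))
```

An empty list means every entity in the file follows the assumed mapping. Other game types such as Maze3D may not follow it, so a non-empty result is a prompt to inspect the file rather than proof of corrupt data.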