nielsr HF Staff committed
Commit 9635bf5 · verified · 1 Parent(s): 8b7bc0d

Add comprehensive model card for MIND-V


This PR adds a comprehensive model card for the MIND-V model, enhancing its discoverability and usefulness on the Hugging Face Hub.

The updates include:
- Adding the `pipeline_tag: robotics` for better categorization.
- Specifying the `license: apache-2.0`.
- Linking to the official paper: [MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment](https://huggingface.co/papers/2512.06628).
- Providing a direct link to the GitHub repository for code and further details.
- Including a concise model description.
- Adding visual demonstrations (GIFs and a pipeline diagram).
- Integrating a ready-to-use sample inference code snippet from the GitHub repository.
- Adding the BibTeX citation and acknowledgments.

Please review these additions.

Files changed (1)
  1. README.md +128 -0
README.md ADDED
---
license: apache-2.0
pipeline_tag: robotics
---

# MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment

[![arXiv](https://img.shields.io/badge/arXiv-2512.06628-b31b1b.svg)](https://huggingface.co/papers/2512.06628)
[![Model](https://img.shields.io/badge/%F0%9F%A4%97_Model-MIND--V-FF6C37)](https://huggingface.co/Richard-ZZZZZ/MIND-V)

This repository contains the official implementation of **MIND-V**, a hierarchical framework that synthesizes physically plausible and logically coherent videos of long-horizon robotic manipulation. It addresses the scarcity of diverse, long-horizon robotic manipulation data by bridging high-level reasoning with pixel-level synthesis: a Semantic Reasoning Hub (SRH) handles task planning, a Behavioral Semantic Bridge (BSB) translates instructions into domain-invariant representations, and a Motor Video Generator (MVG) performs conditional video rendering. Staged Visual Future Rollouts and a GRPO reinforcement-learning post-training phase provide physical alignment.

For more details, please refer to the [paper](https://huggingface.co/papers/2512.06628) and the [GitHub repository](https://github.com/Richard-Zhang-AI/MIND-V).
### Comprehensive comparison of MIND-V against SOTA models for long-horizon robotic video generation

<img src="https://huggingface.co/Richard-ZZZZZ/MIND-V/resolve/main/assets/rada.png" width="88%"/>

<br>

### Long-Horizon Manipulation Demos

<div align="center">
<img src="https://huggingface.co/Richard-ZZZZZ/MIND-V/resolve/main/assets/long1.gif" width="48%" style="margin:0; padding:0; border:none;"/>
<img src="https://huggingface.co/Richard-ZZZZZ/MIND-V/resolve/main/assets/long2.gif" width="48%" style="margin:0; padding:0; border:none;"/>
</div>

<br>

### Overview of our hierarchical framework for long-horizon robotic manipulation video generation

<img src="https://huggingface.co/Richard-ZZZZZ/MIND-V/resolve/main/assets/pipeline.png" width="100%"/>

<div align="center">
Beginning in the cognitive core, the <b>Semantic Reasoning Hub (SRH)</b> decomposes a high-level instruction into atomic sub-tasks and plans a detailed trajectory for each. These plans are then encapsulated into our novel <b>Behavioral Semantic Bridge (BSB)</b>, a structured, domain-invariant intermediate representation that serves as a precise blueprint for the <b>Motor Video Generator (MVG)</b>. The MVG, a conditional diffusion model, renders photorealistic videos that strictly adhere to the kinematic constraints defined in the BSB. At inference time, <b>Staged Visual Future Rollouts</b> provide a "propose-verify-refine" loop for self-correction, ensuring local optimality at each stage to mitigate error accumulation.
</div>

<br>
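The propose-verify-refine loop described above can be sketched as plain Python control flow. All class and function names below are illustrative stand-ins, not the actual MIND-V API; the sketch only mirrors the SRH → BSB → MVG → rollout-verification dataflow from the figure caption.

```python
from dataclasses import dataclass, field

@dataclass
class SubTaskPlan:
    """Illustrative stand-in for one atomic sub-task with its planned trajectory."""
    description: str
    trajectory: list = field(default_factory=list)

def generate_long_horizon_video(instruction, srh, bsb_encoder, mvg, verifier, max_retries=3):
    """Hypothetical control flow: the SRH decomposes the instruction, the BSB encodes
    each sub-task into a domain-invariant blueprint, the MVG renders it, and a staged
    rollout verifier either accepts the clip or asks for a refined proposal."""
    clips = []
    for sub_task in srh.decompose(instruction):        # high-level reasoning (SRH)
        blueprint = bsb_encoder.encode(sub_task)       # Behavioral Semantic Bridge
        for _ in range(max_retries):
            clip = mvg.render(blueprint)               # conditional diffusion rendering
            if verifier.accept(clip, blueprint):       # staged visual future rollout
                break                                  # locally optimal: move to next stage
            blueprint = verifier.refine(blueprint, clip)  # propose-verify-refine
        clips.append(clip)
    return clips
```

The per-stage retry loop is where error accumulation is mitigated: each sub-task's clip is verified before the next stage begins.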
## ⚙️ Quick Start

### 1. Setup

Our environment setup is compatible with CogVideoX; you can follow their configuration to complete it.

```bash
conda create -n mindv python=3.10
conda activate mindv
pip install -r requirements.txt
bash setup_MIND-V_env.sh
```
Download the models with [download_models.sh](https://github.com/Richard-Zhang-AI/MIND-V/blob/main/download_models.sh) and place them under the base root. The checkpoints should be organized as follows:

```
ckpts
├── CogVideoX-Fun-V1.5-5b-InP   (pretrained model base)
├── MIND-V                      (fine-tuned transformer)
├── sam2                        (segmentation model)
├── vjepa2                      (world model)
└── affordance-r1               (semantic reasoning model)
```
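A small helper (our own, not part of the repository) can confirm the layout above before launching inference; the directory names are taken directly from the tree:

```python
from pathlib import Path

# Expected sub-directories under ckpts/, per the checkpoint tree above
EXPECTED = [
    "CogVideoX-Fun-V1.5-5b-InP",
    "MIND-V",
    "sam2",
    "vjepa2",
    "affordance-r1",
]

def missing_checkpoints(base="ckpts"):
    """Return the names of expected checkpoint directories that are absent."""
    root = Path(base)
    return [name for name in EXPECTED if not (root / name).is_dir()]

if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoint directories:", ", ".join(missing))
    else:
        print("All checkpoint directories found.")
```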
62
+
63
+ **Required:** Configure your own Gemini API key. The project uses Google Gemini (via service account) for visual captioning. Create a Google Cloud project and enable the Gemini API.
64
+ ```
65
+ Create a service account β†’ Create Key β†’ JSON
66
+ Save the downloaded JSON as vlm_api/captioner.json
67
+ ```
68
+ Example content (replace with your own values):
```json
{
  "type": "service_account",
  "project_id": "your-project-id",
  "private_key_id": "your-key-id",
  "private_key": "-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY_HERE\n-----END PRIVATE KEY-----\n",
  "client_email": "xxx@your-project.iam.gserviceaccount.com",
  "client_id": "your-client-id",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/xxx%40your-project.iam.gserviceaccount.com",
  "universe_domain": "googleapis.com"
}
```
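A quick sanity check on the saved credentials file can save a failed run later. This snippet is our own addition (not from the repository); it verifies that `vlm_api/captioner.json` parses as JSON and contains the core fields a Google service-account key carries. Note that the private key must be a single JSON string with `\n` escapes, never raw line breaks.

```python
import json

# Core fields present in a Google service-account key file
REQUIRED_FIELDS = {
    "type", "project_id", "private_key_id", "private_key",
    "client_email", "token_uri",
}

def validate_service_account(path="vlm_api/captioner.json"):
    """Raise ValueError if the credentials file is malformed or incomplete."""
    with open(path) as f:
        creds = json.load(f)  # fails loudly on invalid JSON (e.g. raw newlines in strings)
    missing = REQUIRED_FIELDS - creds.keys()
    if missing:
        raise ValueError(f"captioner.json is missing fields: {sorted(missing)}")
    if creds.get("type") != "service_account":
        raise ValueError("'type' must be 'service_account'")
    return creds
```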
### 2. Long-Horizon Video Generation

```bash
python long_horizon_video_pipeline.py \
  --image "demos/long_video/bridge1_s1.png" \
  --instruction "First put the towel into the metal pot, then put the spoon into the metal pot" \
  --output "output/long_horizon" \
  --num_inference_steps 20 \
  --transition_frames 5 \
  --seed 42
```
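To drive the script from Python, e.g. to sweep over seeds, a thin wrapper over the same CLI can be used. The flags mirror the command above; the wrapper itself is our own sketch, not part of the repository.

```python
import subprocess

def build_command(image, instruction, output,
                  num_inference_steps=20, transition_frames=5, seed=42):
    """Assemble the long_horizon_video_pipeline.py invocation shown above."""
    return [
        "python", "long_horizon_video_pipeline.py",
        "--image", image,
        "--instruction", instruction,
        "--output", output,
        "--num_inference_steps", str(num_inference_steps),
        "--transition_frames", str(transition_frames),
        "--seed", str(seed),
    ]

def run(image, instruction, output, **kwargs):
    """Run the pipeline and raise CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(image, instruction, output, **kwargs), check=True)
```

For example, `run("demos/long_video/bridge1_s1.png", "...", "output/seed7", seed=7)` reproduces the command above with a different seed.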
## 🔗 Citation

If you find this work helpful, please consider citing:

```bibtex
@misc{zhang2025mindvhierarchicalvideogeneration,
  title={MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment},
  author={Ruicheng Zhang and Mingyang Zhang and Jun Zhou and Zhangrui Guo and Xiaofan Liu and Zunnan Xu and Zhizhou Zhong and Puxin Yan and Haocheng Luo and Xiu Li},
  year={2025},
  eprint={2512.06628},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2512.06628},
}
```

### Acknowledgments

We sincerely thank the **RoboMaster** team for their pioneering work in robotic video generation. Our implementation builds upon and extends the excellent codebase at:

**https://github.com/KlingTeam/RoboMaster/tree/main**

### Additional References

- **CogVideoX**: https://github.com/THUDM/CogVideo
- **V-JEPA2**: https://github.com/facebookresearch/vjepa2
- **SAM2**: https://github.com/facebookresearch/segment-anything-2
- **Affordance-R1**: https://github.com/hq-King/Affordance-R1