Add comprehensive model card for MIND-V
This PR adds a comprehensive model card for the MIND-V model, enhancing its discoverability and usefulness on the Hugging Face Hub.
The updates include:
- Adding the `pipeline_tag: robotics` for better categorization.
- Specifying the `license: apache-2.0`.
- Linking to the official paper: [MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment](https://huggingface.co/papers/2512.06628).
- Providing a direct link to the GitHub repository for code and further details.
- Including a concise model description.
- Adding visual demonstrations (GIFs and a pipeline diagram).
- Integrating a ready-to-use sample inference code snippet from the GitHub repository.
- Adding the BibTeX citation and acknowledgments.
Please review these additions.

---
license: apache-2.0
pipeline_tag: robotics
---

# MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment

[Paper](https://huggingface.co/papers/2512.06628) | [Model](https://huggingface.co/Richard-ZZZZZ/MIND-V)

This repository contains the official implementation of **MIND-V**, a hierarchical framework designed to synthesize physically plausible and logically coherent videos of long-horizon robotic manipulation. It addresses the scarcity of diverse, long-horizon robotic manipulation data by bridging high-level reasoning with pixel-level synthesis. MIND-V leverages a Semantic Reasoning Hub (SRH) for task planning, a Behavioral Semantic Bridge (BSB) for translating instructions into domain-invariant representations, and a Motor Video Generator (MVG) for conditional video rendering. It also employs Staged Visual Future Rollouts and a GRPO reinforcement learning post-training phase for physical alignment.

For more details, please refer to the [paper](https://huggingface.co/papers/2512.06628) and the [GitHub repository](https://github.com/Richard-Zhang-AI/MIND-V).

### Comprehensive comparison of MIND-V against SOTA models for long-horizon robotic video generation

<img src="https://huggingface.co/Richard-ZZZZZ/MIND-V/resolve/main/assets/rada.png" width="88%"/>

<br>

### Long-Horizon Manipulation Demos

<div align="center">
<img src="https://huggingface.co/Richard-ZZZZZ/MIND-V/resolve/main/assets/long1.gif" width="48%" style="margin:0; padding:0; border:none;"/>
<img src="https://huggingface.co/Richard-ZZZZZ/MIND-V/resolve/main/assets/long2.gif" width="48%" style="margin:0; padding:0; border:none;"/>
</div>

<br>

### Overview of our hierarchical framework for long-horizon robotic manipulation video generation

<img src="https://huggingface.co/Richard-ZZZZZ/MIND-V/resolve/main/assets/pipeline.png" width="100%"/>

<div align="center">
Beginning in the cognitive core, the <b>Semantic Reasoning Hub (SRH)</b> decomposes a high-level instruction into atomic sub-tasks and plans a detailed trajectory for each. These plans are then encapsulated into our novel <b>Behavioral Semantic Bridge (BSB)</b>, a structured, domain-invariant intermediate representation that serves as a precise blueprint for the <b>Motor Video Generator (MVG)</b>. The MVG, a conditional diffusion model, renders photorealistic videos that strictly adhere to the kinematic constraints defined in the BSB. At inference time, <b>Staged Visual Future Rollouts</b> provide a "propose-verify-refine" loop for self-correction, ensuring local optimality at each stage to mitigate error accumulation.
</div>

<br>

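To make the staged control flow concrete, here is a minimal Python sketch of the loop described above. It is illustrative only: `srh`, `mvg`, and the plausibility `score` are hypothetical stand-ins for the repository's actual planner, video generator, and verifier, whose real APIs live in the GitHub code.

```python
# Illustrative sketch of MIND-V's hierarchical loop; all collaborator
# objects are hypothetical stand-ins, not the repository's real API.
from typing import Any, Callable, List

def generate_long_horizon_video(
    image: Any,                 # initial observation (first frame)
    instruction: str,           # high-level task instruction
    srh: Any,                   # Semantic Reasoning Hub: decomposition/planning
    mvg: Any,                   # Motor Video Generator: conditional diffusion
    score: Callable[[List[Any]], float],  # physical-plausibility critic
    num_candidates: int = 4,
) -> List[Any]:
    frames: List[Any] = []
    current_obs = image
    for sub_task in srh.decompose(instruction):
        # BSB: a domain-invariant blueprint for this sub-task's trajectory.
        bsb = srh.plan_trajectory(sub_task, current_obs)
        # Staged Visual Future Rollouts: propose several clips, verify them
        # with the critic, and keep the best before committing to this stage.
        candidates = [mvg.render(current_obs, bsb) for _ in range(num_candidates)]
        best_clip = max(candidates, key=score)
        frames.extend(best_clip)
        current_obs = best_clip[-1]  # last frame conditions the next stage
    return frames
```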
## ⚙️ Quick Start

### 1. Setup

Our environment setup is compatible with CogVideoX. You can follow their configuration to complete the setup.

```bash
conda create -n mindv python=3.10
conda activate mindv
pip install -r requirements.txt
bash setup_MIND-V_env.sh
```

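After installation, a quick check confirms the environment sees PyTorch and a GPU (this assumes `requirements.txt` pulls in PyTorch, as CogVideoX's setup does):

```python
# Optional sanity check; assumes requirements.txt installs PyTorch.
import torch

print(torch.__version__, torch.cuda.is_available())
```
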
Download the models with [download_models.sh](https://github.com/Richard-Zhang-AI/MIND-V/blob/main/download_models.sh) and place them under the repository root. The checkpoints should be organized as follows:

```
├── ckpts
│   ├── CogVideoX-Fun-V1.5-5b-InP   (pretrained model base)
│   ├── MIND-V                      (fine-tuned transformer)
│   ├── sam2                        (segmentation model)
│   ├── vjepa2                      (world models)
│   └── affordance-r1               (semantic reasoning model)
```

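If you prefer Python to the shell script for the MIND-V weights themselves, `huggingface_hub` can fetch them; this is a sketch that assumes the Hub repo `Richard-ZZZZZ/MIND-V` mirrors the `ckpts/MIND-V` layout above (the remaining checkpoints still come from `download_models.sh`):

```python
# Alternative download for the fine-tuned MIND-V transformer only;
# assumes the Hub repo matches the expected ckpts/MIND-V layout.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Richard-ZZZZZ/MIND-V", local_dir="ckpts/MIND-V")
```
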
**Required:** Configure your own Gemini API key. The project uses Google Gemini (via a service account) for visual captioning. Create a Google Cloud project and enable the Gemini API, then:

```
Create a service account → Create Key → JSON
Save the downloaded JSON as vlm_api/captioner.json
```

Example content (replace with your own values):

```json
{
  "type": "service_account",
  "project_id": "your-project-id",
  "private_key_id": "your-key-id",
  "private_key": "-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY_HERE\n-----END PRIVATE KEY-----\n",
  "client_email": "xxx@your-project.iam.gserviceaccount.com",
  "client_id": "your-client-id",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/xxx%40your-project.iam.gserviceaccount.com",
  "universe_domain": "googleapis.com"
}
```

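As a quick sanity check that the credentials file is in place, something like the following should load it (this uses the `google-auth` package, an assumption about the environment rather than the repository's own captioner code):

```python
# Hypothetical check that vlm_api/captioner.json parses as a valid
# service-account key; requires `pip install google-auth`.
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file(
    "vlm_api/captioner.json",
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
print("Loaded service account:", creds.service_account_email)
```
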
### 2. Long-Horizon Video Generation

```bash
python long_horizon_video_pipeline.py \
    --image "demos/long_video/bridge1_s1.png" \
    --instruction "First put the towel into the metal pot, then put the spoon into the metal pot" \
    --output "output/long_horizon" \
    --num_inference_steps 20 \
    --transition_frames 5 \
    --seed 42
```

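For sweeps over several random seeds, the same CLI can be driven from Python; this batch wrapper is a convenience sketch using only the flags shown above (the script itself comes from the GitHub repository):

```python
# Batch driver for the CLI above: run the same scene with several seeds.
import subprocess

for seed in (42, 43, 44):
    subprocess.run(
        [
            "python", "long_horizon_video_pipeline.py",
            "--image", "demos/long_video/bridge1_s1.png",
            "--instruction", "First put the towel into the metal pot, "
                             "then put the spoon into the metal pot",
            "--output", f"output/long_horizon/seed_{seed}",
            "--num_inference_steps", "20",
            "--transition_frames", "5",
            "--seed", str(seed),
        ],
        check=True,  # raise if a run fails
    )
```
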
## 📖 Citation

If you find this work helpful, please consider citing:

```bibtex
@misc{zhang2025mindvhierarchicalvideogeneration,
      title={MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment},
      author={Ruicheng Zhang and Mingyang Zhang and Jun Zhou and Zhangrui Guo and Xiaofan Liu and Zunnan Xu and Zhizhou Zhong and Puxin Yan and Haocheng Luo and Xiu Li},
      year={2025},
      eprint={2512.06628},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.06628},
}
```

### Acknowledgments

We sincerely thank the **RoboMaster** team for their pioneering work in robotic video generation. Our implementation builds upon and extends their excellent codebase:

**https://github.com/KlingTeam/RoboMaster/tree/main**

### Additional References

- **CogVideoX**: https://github.com/THUDM/CogVideo
- **V-JEPA2**: https://github.com/facebookresearch/vjepa2
- **SAM2**: https://github.com/facebookresearch/segment-anything-2
- **Affordance-R1**: https://github.com/hq-King/Affordance-R1