Add comprehensive model card for MIND-V
This PR adds a comprehensive model card for the MIND-V model, enhancing its discoverability and usefulness on the Hugging Face Hub.
The updates include:
- Adding the `pipeline_tag: robotics` for better categorization.
- Specifying the `license: apache-2.0`.
- Linking to the official paper: [MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment](https://huggingface.co/papers/2512.06628).
- Providing a direct link to the GitHub repository for code and further details.
- Including a concise model description.
- Adding visual demonstrations (GIFs and a pipeline diagram).
- Integrating a ready-to-use sample inference code snippet from the GitHub repository.
- Adding the BibTeX citation and acknowledgments.
Please review these additions.

---
license: apache-2.0
pipeline_tag: robotics
---

# MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment

[Paper](https://huggingface.co/papers/2512.06628) | [Model](https://huggingface.co/Richard-ZZZZZ/MIND-V)

This repository contains the official implementation of **MIND-V**, a hierarchical framework designed to synthesize physically plausible and logically coherent videos of long-horizon robotic manipulation. It addresses the scarcity of diverse, long-horizon robotic manipulation data by bridging high-level reasoning with pixel-level synthesis. MIND-V leverages a Semantic Reasoning Hub (SRH) for task planning, a Behavioral Semantic Bridge (BSB) for translating instructions into domain-invariant representations, and a Motor Video Generator (MVG) for conditional video rendering. It also employs Staged Visual Future Rollouts and a GRPO reinforcement learning post-training phase for physical alignment.

For more details, please refer to the [paper](https://huggingface.co/papers/2512.06628) and the [GitHub repository](https://github.com/Richard-Zhang-AI/MIND-V).

### Comprehensive comparison of MIND-V against SOTA models for long-horizon robotic video generation

<img src="https://huggingface.co/Richard-ZZZZZ/MIND-V/resolve/main/assets/rada.png" width="88%"/>

<br>

### Long-Horizon Manipulation Demos

<div align="center">
<img src="https://huggingface.co/Richard-ZZZZZ/MIND-V/resolve/main/assets/long1.gif" width="48%" style="margin:0; padding:0; border:none;"/>
<img src="https://huggingface.co/Richard-ZZZZZ/MIND-V/resolve/main/assets/long2.gif" width="48%" style="margin:0; padding:0; border:none;"/>
</div>

<br>

### Overview of our hierarchical framework for long-horizon robotic manipulation video generation

<img src="https://huggingface.co/Richard-ZZZZZ/MIND-V/resolve/main/assets/pipeline.png" width="100%"/>

<div align="center">
Beginning in the cognitive core, the <b>Semantic Reasoning Hub (SRH)</b> decomposes a high-level instruction into atomic sub-tasks and plans a detailed trajectory for each. These plans are then encapsulated into our novel <b>Behavioral Semantic Bridge (BSB)</b>, a structured, domain-invariant intermediate representation that serves as a precise blueprint for the <b>Motor Video Generator (MVG)</b>. The MVG, a conditional diffusion model, renders photorealistic videos that strictly adhere to the kinematic constraints defined in the BSB. At inference time, <b>Staged Visual Future Rollouts</b> provide a "propose-verify-refine" loop for self-correction, ensuring local optimality at each stage to mitigate error accumulation.
</div>

<br>

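To make the staged control flow concrete, here is a minimal Python sketch of the loop described above. It is illustrative only: `srh`, `mvg`, and the plausibility `score` are hypothetical stand-ins for the repository's actual planner, video generator, and verifier, whose real APIs live in the GitHub code.

```python
# Illustrative sketch of MIND-V's hierarchical loop; all collaborator
# objects are hypothetical stand-ins, not the repository's real API.
from typing import Any, Callable, List

def generate_long_horizon_video(
    image: Any,                 # initial observation (first frame)
    instruction: str,           # high-level task instruction
    srh: Any,                   # Semantic Reasoning Hub: decomposition/planning
    mvg: Any,                   # Motor Video Generator: conditional diffusion
    score: Callable[[List[Any]], float],  # physical-plausibility critic
    num_candidates: int = 4,
) -> List[Any]:
    frames: List[Any] = []
    current_obs = image
    for sub_task in srh.decompose(instruction):
        # BSB: a domain-invariant blueprint for this sub-task's trajectory.
        bsb = srh.plan_trajectory(sub_task, current_obs)
        # Staged Visual Future Rollouts: propose several clips, verify them
        # with the critic, and keep the best before committing to this stage.
        candidates = [mvg.render(current_obs, bsb) for _ in range(num_candidates)]
        best_clip = max(candidates, key=score)
        frames.extend(best_clip)
        current_obs = best_clip[-1]  # last frame conditions the next stage
    return frames
```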
## ⚙️ Quick Start

### 1. Setup

Our environment setup is compatible with CogVideoX. You can follow their configuration to complete the setup.

```bash
conda create -n mindv python=3.10
conda activate mindv
pip install -r requirements.txt
bash setup_MIND-V_env.sh
```

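After installation, a quick check confirms the environment sees PyTorch and a GPU (this assumes `requirements.txt` pulls in PyTorch, as CogVideoX's setup does):

```python
# Optional sanity check; assumes requirements.txt installs PyTorch.
import torch

print(torch.__version__, torch.cuda.is_available())
```
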
Download the models with [download_models.sh](https://github.com/Richard-Zhang-AI/MIND-V/blob/main/download_models.sh) and place them under the repository root. The checkpoints should be organized as follows:

```
├── ckpts
│   ├── CogVideoX-Fun-V1.5-5b-InP   (pretrained model base)
│   ├── MIND-V                      (fine-tuned transformer)
│   ├── sam2                        (segmentation model)
│   ├── vjepa2                      (world models)
│   └── affordance-r1               (semantic reasoning model)
```

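If you prefer Python to the shell script for the MIND-V weights themselves, `huggingface_hub` can fetch them; this is a sketch that assumes the Hub repo `Richard-ZZZZZ/MIND-V` mirrors the `ckpts/MIND-V` layout above (the remaining checkpoints still come from `download_models.sh`):

```python
# Alternative download for the fine-tuned MIND-V transformer only;
# assumes the Hub repo matches the expected ckpts/MIND-V layout.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Richard-ZZZZZ/MIND-V", local_dir="ckpts/MIND-V")
```
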
**Required:** Configure your own Gemini API key. The project uses Google Gemini (via a service account) for visual captioning. Create a Google Cloud project and enable the Gemini API, then:

```
Create a service account → Create Key → JSON
Save the downloaded JSON as vlm_api/captioner.json
```

Example content (replace with your own values):

```json
{
  "type": "service_account",
  "project_id": "your-project-id",
  "private_key_id": "your-key-id",
  "private_key": "-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY_HERE\n-----END PRIVATE KEY-----\n",
  "client_email": "xxx@your-project.iam.gserviceaccount.com",
  "client_id": "your-client-id",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/xxx%40your-project.iam.gserviceaccount.com",
  "universe_domain": "googleapis.com"
}
```

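As a quick sanity check that the credentials file is in place, something like the following should load it (this uses the `google-auth` package, an assumption about the environment rather than the repository's own captioner code):

```python
# Hypothetical check that vlm_api/captioner.json parses as a valid
# service-account key; requires `pip install google-auth`.
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file(
    "vlm_api/captioner.json",
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
print("Loaded service account:", creds.service_account_email)
```
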
### 2. Long-Horizon Video Generation

```bash
python long_horizon_video_pipeline.py \
    --image "demos/long_video/bridge1_s1.png" \
    --instruction "First put the towel into the metal pot, then put the spoon into the metal pot" \
    --output "output/long_horizon" \
    --num_inference_steps 20 \
    --transition_frames 5 \
    --seed 42
```

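For sweeps over several random seeds, the same CLI can be driven from Python; this batch wrapper is a convenience sketch using only the flags shown above (the script itself comes from the GitHub repository):

```python
# Batch driver for the CLI above: run the same scene with several seeds.
import subprocess

for seed in (42, 43, 44):
    subprocess.run(
        [
            "python", "long_horizon_video_pipeline.py",
            "--image", "demos/long_video/bridge1_s1.png",
            "--instruction", "First put the towel into the metal pot, "
                             "then put the spoon into the metal pot",
            "--output", f"output/long_horizon/seed_{seed}",
            "--num_inference_steps", "20",
            "--transition_frames", "5",
            "--seed", str(seed),
        ],
        check=True,  # raise if a run fails
    )
```
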
## 📖 Citation

If you find this work helpful, please consider citing:

```bibtex
@misc{zhang2025mindvhierarchicalvideogeneration,
      title={MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment},
      author={Ruicheng Zhang and Mingyang Zhang and Jun Zhou and Zhangrui Guo and Xiaofan Liu and Zunnan Xu and Zhizhou Zhong and Puxin Yan and Haocheng Luo and Xiu Li},
      year={2025},
      eprint={2512.06628},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.06628},
}
```

### Acknowledgments

We sincerely thank the **RoboMaster** team for their pioneering work in robotic video generation. Our implementation builds upon and extends their excellent codebase:

**https://github.com/KlingTeam/RoboMaster/tree/main**

### Additional References

- **CogVideoX**: https://github.com/THUDM/CogVideo
- **V-JEPA2**: https://github.com/facebookresearch/vjepa2
- **SAM2**: https://github.com/facebookresearch/segment-anything-2
- **Affordance-R1**: https://github.com/hq-King/Affordance-R1