Improve model card for MM-ACT: Add metadata, links, and setup instructions
#1 opened by nielsr (HF Staff)

README.md CHANGED

---
license: mit
pipeline_tag: robotics
library_name: transformers
---

# MM-ACT: Learn from Multimodal Parallel Generation to Act

[Paper](https://arxiv.org/abs/2512.00975)
[Model](https://huggingface.co/hhyhrhy/MM-ACT-Model)
[Dataset](https://huggingface.co/datasets/hhyhrhy/MM-ACT-data)

<br>

<div align="center">
<img src="https://github.com/HHYHRHY/MM-ACT/raw/main/assets/MM-ACT.png" width="80%" alt="MM-ACT Arch"/>
</div>

<br>

This repository contains **MM-ACT**, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in a shared token space and performs generation across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and employs a one-step parallel decoding strategy for action generation to improve efficiency.

The model was presented in the paper [MM-ACT: Learn from Multimodal Parallel Generation to Act](https://huggingface.co/papers/2512.00975).

Code: https://github.com/HHYHRHY/MM-ACT

## Usage

For detailed usage, including training and deployment scripts, please refer to the official [GitHub repository](https://github.com/HHYHRHY/MM-ACT).
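
The inference and training entry points live in that repository; as a quick, hedged starting point, the checkpoint referenced by this card can be fetched locally with `huggingface_hub` before running the repo's scripts. This is a minimal sketch, assuming the `hhyhrhy/MM-ACT-Model` repo id from the link above, not the official loading code:

```python
# Minimal sketch: fetch the MM-ACT checkpoint referenced by this card.
# The repo id comes from the model link above; how the weights are loaded
# and run is defined by the scripts in the GitHub repository.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(repo_id="hhyhrhy/MM-ACT-Model")
print(f"MM-ACT weights downloaded to: {ckpt_dir}")
```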

### 1. Clone Repo and Environment Setup

```bash
git clone https://github.com/HHYHRHY/MM-ACT.git
cd MM-ACT

# Create environment
conda create -n mmact python=3.13
conda activate mmact

# Install requirements
pip install -r requirement.txt
```

### 2. Dataset Preparation

- **LIBERO**

  We use the LIBERO datasets from [Huggingface_LeRobot](https://huggingface.co/lerobot) and load robot data in the LeRobot dataset format.
  Please download [LIBERO-Object](https://huggingface.co/datasets/lerobot/libero_object_image),
  [LIBERO-Spatial](https://huggingface.co/datasets/lerobot/libero_spatial_image), [LIBERO-Goal](https://huggingface.co/datasets/lerobot/libero_goal_image) and
  [LIBERO-10](https://huggingface.co/datasets/lerobot/libero_10_image). For LIBERO-10, we also provide our task-planning datasets in [LIBERO-10-task](https://huggingface.co/datasets/hhyhrhy/MM-ACT-data/tree/main/LIBERO).

- **RoboTwin**

  For RoboTwin, we use a dataset-sampling pipeline that includes task-planning generation. You can download our [datasets](https://huggingface.co/datasets/hhyhrhy/MM-ACT-data/tree/main/RoboTwin)
  or collect your own data with our pipeline in [Robotwin_subtask](https://github.com/RoboTwin-Platform/RoboTwin/tree/Subtask_info). This branch updates the original RoboTwin data-collection pipeline to support our subtask text annotations; the collection usage is identical to the main branch. Please report any bugs or questions about the text annotations in MM-ACT's issue tracker. A hedged download sketch for both data sources is shown below.
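
Both sources above are hosted on the Hugging Face Hub, so one hedged way to fetch them locally (in addition to the LeRobot tooling) is `huggingface_hub.snapshot_download`. The repo ids below come from the links in the list; the `RoboTwin/*` filter is an assumption based on the folder name shown in the dataset link:

```python
# Minimal sketch: download one LIBERO split and the RoboTwin portion of
# MM-ACT-data. Repo ids come from the links above; the "RoboTwin/*"
# pattern assumes the folder layout implied by the dataset link.
from huggingface_hub import snapshot_download

libero_dir = snapshot_download(
    repo_id="lerobot/libero_object_image",
    repo_type="dataset",
)

robotwin_dir = snapshot_download(
    repo_id="hhyhrhy/MM-ACT-data",
    repo_type="dataset",
    allow_patterns=["RoboTwin/*"],  # assumed subfolder name
)
print(libero_dir, robotwin_dir)
```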

### 3. Model Weight Preparation

Download the base model weights from MMaDA: [MMaDA-8B-Base](https://huggingface.co/Gen-Verse/MMaDA-8B-Base) and expand the original model's action codebook (we use 2048):

```bash
python model_utils/resize_model_vocab.py --model ${origin_model_path} --out ${output_model_path} --num_new ${action_codebook_size}
```
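
The repository's `model_utils/resize_model_vocab.py` is the authoritative implementation; conceptually, the step grows the token embedding table by `action_codebook_size` new ids so action tokens get their own slots. Below is a minimal sketch of that idea using plain `transformers`, assuming the base model loads via `AutoModel` with `trust_remote_code` and that the local paths are placeholders:

```python
# Minimal sketch of what the vocab-expansion step does conceptually.
# This is NOT the repository's script: the model class, trust_remote_code
# flag, and output path are assumptions for illustration only.
from transformers import AutoModel, AutoTokenizer

origin_model_path = "Gen-Verse/MMaDA-8B-Base"   # base weights from MMaDA
output_model_path = "./mmada-8b-base-action"    # hypothetical save location
action_codebook_size = 2048                     # codebook size used by MM-ACT

tokenizer = AutoTokenizer.from_pretrained(origin_model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(origin_model_path, trust_remote_code=True)

# Add rows for the action codebook to the embedding (and tied output) layer,
# then save the resized model alongside the unchanged tokenizer.
model.resize_token_embeddings(model.config.vocab_size + action_codebook_size)
model.save_pretrained(output_model_path)
tokenizer.save_pretrained(output_model_path)
```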

## 🎥 Real-world Experiments (Video Demo)

https://private-user-images.githubusercontent.com/91517920/520774696-02a3bf40-f1ae-4f52-9562-a3fc2e9a1477.mp4

## Acknowledgments

This work is based on [MMaDA](https://github.com/Gen-Verse/MMaDA), [RoboTwin](https://github.com/robotwin-Platform/RoboTwin), [LIBERO](https://github.com/Lifelong-Robot-Learning/LIBERO), [LeRobot](https://github.com/huggingface/lerobot), and [OpenVLA](https://github.com/openvla/openvla.git). Thanks to these great works.