license: mit
pipeline_tag: robotics
library_name: transformers
MM-ACT: Learn from Multimodal Parallel Generation to Act
This repository contains MM-ACT, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in a shared token space and performs generation across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and employs a one-step parallel decoding strategy for action generation to improve efficiency.
The model was presented in the paper MM-ACT: Learn from Multimodal Parallel Generation to Act.
Code: https://github.com/HHYHRHY/MM-ACT
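To make the two decoding strategies named above more concrete, here is a toy sketch of the general idea. It is not the MM-ACT implementation: `toy_predict` is a hypothetical stand-in for the model that returns random tokens with random confidences, and the linear unmasking schedule is an assumption chosen for simplicity. Re-mask decoding fills a masked sequence over several steps, keeping only the most confident predictions each time, while one-step decoding accepts every prediction from a single pass.

```python
import random

MASK = -1  # sentinel for a position that has not been decoded yet


def toy_predict(tokens, vocab_size, rng):
    """Hypothetical stand-in for the model: propose a (token, confidence)
    pair for every position. A real model would condition on `tokens`."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def remask_parallel_decode(length, vocab_size, steps, seed=0):
    """Fill a fully masked sequence over several steps. Each step predicts
    all positions in parallel, commits the most confident masked positions,
    and leaves the rest masked for the next step."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        proposals = toy_predict(tokens, vocab_size, rng)
        # Assumed linear schedule: after step k, length * k / steps
        # positions should be decoded in total.
        target_filled = (length * step) // steps
        n_commit = target_filled - (length - len(masked))
        # Rank the still-masked positions by confidence, highest first.
        ranked = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = proposals[i][0]
    return tokens


def one_step_parallel_decode(length, vocab_size, seed=0):
    """Decode every position in a single parallel pass, with no re-masking."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    return [tok for tok, _ in toy_predict(tokens, vocab_size, rng)]


if __name__ == "__main__":
    print(remask_parallel_decode(length=16, vocab_size=100, steps=4))
    print(one_step_parallel_decode(length=8, vocab_size=2048))
```

The single-pass variant calls the predictor once instead of `steps` times, which is the efficiency argument for using it on action tokens.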
Usage
For detailed usage, including training and deployment scripts, please refer to the official GitHub repository.
1. Clone Repo and Environment Setup
git clone https://github.com/HHYHRHY/MM-ACT.git
cd MM-ACT
# Create environment
conda create -n mmact python=3.13
conda activate mmact
# Install requirements
pip install -r requirement.txt
2. Dataset Preparation
LIBERO
We use the LIBERO datasets from Huggingface_LeRobot and load robot data with LeRobot datasets. Please download LIBERO-Object, LIBERO-Spatial, LIBERO-Goal, and LIBERO-10. For LIBERO-10, we also provide our task planning datasets in LIBERO-10-task.
RoboTwin
For the RoboTwin datasets, we use a dataset sampling pipeline that includes task planning generation. You can download our datasets or collect your own with our pipeline in Robotwin_subtask. This branch updates the original RoboTwin data collection pipeline to support our subtask text annotations. Its collection usage is identical to that of the main branch. Please report any bugs or questions about the text annotations in the MM-ACT issue tracker.
3. Model Weight Preparation
Download the base model weights (MMaDA-8B-Base) from MMaDA, then expand the original model's vocabulary with an action codebook (we use a codebook size of 2048):
python model_utils/resize_model_vocab.py --model ${origin_model_path} --out ${output_model_path} --num_new ${action_codebook_size}
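As a rough illustration of what adding new tokens to a model involves, the sketch below appends rows to a toy embedding table. It does not reproduce `model_utils/resize_model_vocab.py`: the function names, the small Gaussian initialization, and the assumption that action tokens are placed directly after the original vocabulary are all hypothetical and chosen only for this example.

```python
import random


def expand_embedding_table(table, num_new, seed=0, scale=0.02):
    """Return a new table with `num_new` rows appended.

    Existing rows are copied unchanged. New rows get small random values
    (an assumed initialization, not necessarily what MM-ACT uses)."""
    if num_new < 0:
        raise ValueError("num_new must be non-negative")
    if not table:
        raise ValueError("table must contain at least one row")
    rng = random.Random(seed)
    dim = len(table[0])
    new_rows = [[rng.gauss(0.0, scale) for _ in range(dim)]
                for _ in range(num_new)]
    return [row[:] for row in table] + new_rows


def action_token_id(action_index, original_vocab_size, num_new):
    """Map an action codebook index to a token id, assuming the new
    tokens are appended right after the original vocabulary."""
    if not 0 <= action_index < num_new:
        raise ValueError("action_index is outside the action codebook")
    return original_vocab_size + action_index


if __name__ == "__main__":
    # Toy sizes for readability; the README uses 2048 new action tokens.
    base = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    expanded = expand_embedding_table(base, num_new=4)
    print(len(expanded))
    print(action_token_id(0, original_vocab_size=len(base), num_new=4))
```

Under these assumptions, the original tokens keep their ids and embeddings, so the base model's behavior on text and image tokens is unaffected until the new rows are trained.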
🎥 Real-world Experiments (Video Demo)
Acknowledgments
This work builds on MMaDA, RoboTwin, LIBERO, LeRobot, and OpenVLA. We thank the authors of these great projects.