---
license: mit
pipeline_tag: robotics
library_name: transformers
---

# MM-ACT: Learn from Multimodal Parallel Generation to Act

[![arXiv](https://img.shields.io/badge/arXiv-Paper-red.svg)](https://arxiv.org/abs/2512.00975)
[![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97-Model-yellow)](https://huggingface.co/hhyhrhy/MM-ACT-Model)
[![Hugging Face Datasets](https://img.shields.io/badge/%F0%9F%A4%97-Dataset-blue)](https://huggingface.co/datasets/hhyhrhy/MM-ACT-data)

<br>

<div align="center">
  <img src="https://github.com/HHYHRHY/MM-ACT/raw/main/assets/MM-ACT.png" width="80%" alt="MM-ACT Arch"/>
</div>

<br>

This repository contains **MM-ACT**, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in a shared token space and generates across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and a one-step parallel decoding strategy for action generation to improve efficiency.

The model was presented in the paper [MM-ACT: Learn from Multimodal Parallel Generation to Act](https://huggingface.co/papers/2512.00975).

Code: https://github.com/HHYHRHY/MM-ACT

## Usage

For detailed usage, including training and deployment scripts, please refer to the official [GitHub repository](https://github.com/HHYHRHY/MM-ACT).

### 1. Clone Repo and Environment Setup

```bash
git clone https://github.com/HHYHRHY/MM-ACT.git
cd MM-ACT

# Create environment
conda create -n mmact python=3.13
conda activate mmact

# Install requirements
pip install -r requirement.txt
```

### 2. Dataset Preparation

-   **LIBERO**

    We use the LIBERO datasets from [Huggingface_LeRobot](https://huggingface.co/lerobot) and load robot data through LeRobot datasets.
    Please download [LIBERO-Object](https://huggingface.co/datasets/lerobot/libero_object_image),
    [LIBERO-Spatial](https://huggingface.co/datasets/lerobot/libero_spatial_image), [LIBERO-Goal](https://huggingface.co/datasets/lerobot/libero_goal_image), and
    [LIBERO-10](https://huggingface.co/datasets/lerobot/libero_10_image). For LIBERO-10, we also provide our task planning datasets in [LIBERO-10-task](https://huggingface.co/datasets/hhyhrhy/MM-ACT-data/tree/main/LIBERO).

-   **RoboTwin**

    For RoboTwin datasets, we use a dataset sampling pipeline that includes task planning generation. You can download our [datasets](https://huggingface.co/datasets/hhyhrhy/MM-ACT-data/tree/main/RoboTwin)
    or collect your own with our pipeline in [Robotwin_subtask](https://github.com/RoboTwin-Platform/RoboTwin/tree/Subtask_info). This branch updates the original RoboTwin data collection pipeline to support our subtask text annotations; collection usage is identical to the main branch. Please report any bugs or questions about the text annotations in MM-ACT's issues.
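
The downloads above can be scripted. The following is a minimal sketch, not part of the official repository: it assumes the `huggingface-cli` tool from `huggingface_hub` is installed (newer releases also ship it as `hf`), and the `DATA_ROOT` directory layout and the `run` helper are hypothetical choices for illustration. The repository IDs are taken from the links above. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```bash
# Hypothetical target directory; change it to wherever you keep datasets.
DATA_ROOT="${DATA_ROOT:-./data}"

# Helper (illustrative): print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# LIBERO suites hosted under the LeRobot organization (IDs from the links above).
LIBERO_REPOS="lerobot/libero_object_image lerobot/libero_spatial_image lerobot/libero_goal_image lerobot/libero_10_image"

for repo in $LIBERO_REPOS; do
  # ${repo##*/} strips the organization prefix, e.g. libero_10_image
  run huggingface-cli download "$repo" --repo-type dataset \
    --local-dir "$DATA_ROOT/libero/${repo##*/}"
done

# MM-ACT data: only the RoboTwin folder of the dataset repository.
run huggingface-cli download hhyhrhy/MM-ACT-data --repo-type dataset \
  --include "RoboTwin/*" --local-dir "$DATA_ROOT/mm-act-data"
```

To fetch the LIBERO-10 task planning data instead, the same last command with `--include "LIBERO/*"` should work, assuming the folder names match the links above.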

### 3. Model Weight Preparation

Download the base model weights from MMaDA, [MMaDA-8B-Base](https://huggingface.co/Gen-Verse/MMaDA-8B-Base), and expand the original model's action codebook (we use a codebook size of 2048):

```bash
python model_utils/resize_model_vocab.py --model ${origin_model_path} --out ${output_model_path} --num_new ${action_codebook_size}
```
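
As a concrete illustration, the placeholders in the command above might be filled in as follows. This is a sketch under stated assumptions: the two checkpoint paths and the `run` helper are hypothetical, `huggingface-cli` is assumed to be installed, and only the script name, its flags, and the value 2048 come from this README. The script prints the commands by default; set `DRY_RUN=0` to execute them from the repository root.

```bash
# Hypothetical local paths; adjust to your setup.
origin_model_path="./checkpoints/MMaDA-8B-Base"
output_model_path="./checkpoints/MMaDA-8B-Base-action2048"
action_codebook_size=2048   # the size reported above

# Helper (illustrative): print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Fetch the base weights (repository ID from the link above).
run huggingface-cli download Gen-Verse/MMaDA-8B-Base --local-dir "$origin_model_path"

# Expand the action codebook and write the resized model.
run python model_utils/resize_model_vocab.py \
  --model "$origin_model_path" \
  --out "$output_model_path" \
  --num_new "$action_codebook_size"
```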

## Acknowledgments

This work is based on [MMaDA](https://github.com/Gen-Verse/MMaDA), [RoboTwin](https://github.com/robotwin-Platform/RoboTwin),
[LIBERO](https://github.com/Lifelong-Robot-Learning/LIBERO), [LeRobot](https://github.com/huggingface/lerobot), and [OpenVLA](https://github.com/openvla/openvla.git). Thanks to the authors of these great works.