---
license: cc-by-nc-sa-4.0
pipeline_tag: text-to-3d
---
ReMoMask: Retrieval-Augmented Masked Motion Generation
This is the official repository for the paper ReMoMask: Retrieval-Augmented Masked Motion Generation.
- 📚 Paper
- 🌐 Project Page
- 💻 Code
https://github.com/user-attachments/assets/3f29c0c5-abb8-4fd1-893c-48ac82b79532
Abstract
Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions. However, current approaches face dual challenges: generative models (e.g., diffusion models) suffer from limited diversity, error accumulation, and physical implausibility, while Retrieval-Augmented Generation (RAG) methods exhibit diffusion inertia, partial-mode collapse, and asynchronous artifacts. To address these limitations, we propose ReMoMask, a unified framework integrating three key innovations: 1) A Bidirectional Momentum Text-Motion Model decouples the negative sample scale from the batch size via momentum queues, substantially improving cross-modal retrieval precision; 2) A Semantic Spatio-temporal Attention mechanism enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; 3) RAG-Classifier-Free Guidance incorporates a small amount of unconditional generation to enhance generalization. Built upon MoMask's RVQ-VAE, ReMoMask efficiently generates temporally coherent motions in minimal steps. Extensive experiments on standard benchmarks demonstrate the state-of-the-art performance of ReMoMask, achieving a 3.88% and 10.97% improvement in FID scores on HumanML3D and KIT-ML, respectively, compared to the previous SOTA method RAG-T2M.
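The RAG-Classifier-Free Guidance component described above follows the standard classifier-free guidance pattern of blending conditional and unconditional model outputs. The sketch below illustrates only that generic blending step; the function name, guidance scale, and the scalar-logit simplification are our assumptions and not the paper's actual implementation:

```python
def cfg_blend(cond_logits: float, uncond_logits: float, scale: float = 4.0) -> float:
    """Classifier-free guidance blend (illustrative sketch).

    Extrapolates from the unconditional prediction toward the
    conditional one; scale=1.0 recovers the purely conditional
    output, while larger scales strengthen the text conditioning.
    """
    return uncond_logits + scale * (cond_logits - uncond_logits)


# Example: with scale 1.0 the blend equals the conditional logit.
print(cfg_blend(2.0, 1.0, scale=1.0))  # 2.0
print(cfg_blend(2.0, 1.0, scale=3.0))  # 4.0
```

In practice the same formula is applied elementwise to token logits or score tensors rather than scalars.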
Framework
An overview of the ReMoMask framework:
Sample Usage
To run a local demo for motion generation, you can use the provided demo.py script from the GitHub repository.
First, ensure you have the environment set up as described in the GitHub repository's Prerequisite section.
Then, run the demo with a text prompt:
python demo.py --gpu_id 0 --ext exp1 --text_prompt "A person is walking on a circle." --checkpoints_dir logs --dataset_name humanml3d --mtrans_name pretrain_mtrans --rtrans_name pretrain_rtrans
# After training, replace pretrain_mtrans and pretrain_rtrans with the names of your own trained models.
Optional flags:
- `--repeat_times`: number of replications for generation (default: 1).
- `--motion_length`: number of poses to generate.
The output will be saved in ./outputs/exp1/.
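For example, the optional flags can be combined with the command above to generate several variations of a fixed length (the prompt and flag values here are illustrative):

```shell
# Generate 3 variations, each 196 poses long (values are illustrative).
python demo.py --gpu_id 0 --ext exp1 \
  --text_prompt "A person is walking on a circle." \
  --checkpoints_dir logs --dataset_name humanml3d \
  --mtrans_name pretrain_mtrans --rtrans_name pretrain_rtrans \
  --repeat_times 3 --motion_length 196
```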
Citation
If you find our work helpful or inspiring, please feel free to cite it.
@article{li2025remomask,
title={ReMoMask: Retrieval-Augmented Masked Motion Generation},
author={Li, Zhengdao and Wang, Siheng and Zhang, Zeyu and Tang, Hao},
journal={arXiv preprint arXiv:2508.02605},
year={2025}
}
