---
license: cc-by-nc-sa-4.0
pipeline_tag: text-to-3d
---

# ReMoMask: Retrieval-Augmented Masked Motion Generation

This is the official repository for the paper ReMoMask: Retrieval-Augmented Masked Motion Generation.

https://github.com/user-attachments/assets/3f29c0c5-abb8-4fd1-893c-48ac82b79532

## Abstract

Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions. However, current approaches face dual challenges: generative models (e.g., diffusion models) suffer from limited diversity, error accumulation, and physical implausibility, while Retrieval-Augmented Generation (RAG) methods exhibit diffusion inertia, partial-mode collapse, and asynchronous artifacts. To address these limitations, we propose ReMoMask, a unified framework integrating three key innovations: 1) a Bidirectional Momentum Text-Motion Model that decouples the negative-sample scale from the batch size via momentum queues, substantially improving cross-modal retrieval precision; 2) a Semantic Spatio-temporal Attention mechanism that enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; 3) RAG-Classifier-Free Guidance, which incorporates minor unconditional generation to enhance generalization. Built upon MoMask's RVQ-VAE, ReMoMask efficiently generates temporally coherent motions in minimal steps. Extensive experiments on standard benchmarks demonstrate the state-of-the-art performance of ReMoMask, achieving 3.88% and 10.97% improvements in FID scores on HumanML3D and KIT-ML, respectively, compared to the previous SOTA method RAG-T2M.
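The momentum-queue idea above (decoupling the pool of contrastive negatives from the batch size) can be sketched as follows. This is a minimal, MoCo-style illustration only; the class and function names are hypothetical and not taken from the ReMoMask codebase:

```python
import numpy as np

class MomentumQueue:
    """Fixed-size FIFO queue of embeddings. The number of negatives seen by
    the contrastive loss equals the queue size, not the batch size."""

    def __init__(self, dim: int, size: int):
        self.queue = np.zeros((size, dim), dtype=np.float32)
        self.ptr = 0
        self.size = size

    def enqueue(self, batch: np.ndarray) -> None:
        # Overwrite the oldest entries, wrapping around the buffer.
        idx = (self.ptr + np.arange(batch.shape[0])) % self.size
        self.queue[idx] = batch
        self.ptr = (self.ptr + batch.shape[0]) % self.size

def contrastive_logits(query, positive, queue: MomentumQueue, temperature=0.07):
    """One positive logit per query plus one logit per queued negative."""
    l_pos = (query * positive).sum(axis=-1, keepdims=True)   # (B, 1)
    l_neg = query @ queue.queue.T                            # (B, queue size)
    return np.concatenate([l_pos, l_neg], axis=-1) / temperature
```

With a queue of, say, 65k motion embeddings, each training step contrasts a query against far more negatives than a single batch could hold, which is the property the paper attributes to its momentum queues.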

## Framework

An overview of the ReMoMask framework:


## Sample Usage

To run a local demo for motion generation, use the `demo.py` script provided in the GitHub repository.

First, set up the environment as described in the Prerequisite section of the GitHub repository.

Then, run the demo with a text prompt:

```shell
python demo.py --gpu_id 0 --ext exp1 --text_prompt "A person is walking on a circle." \
    --checkpoints_dir logs --dataset_name humanml3d \
    --mtrans_name pretrain_mtrans --rtrans_name pretrain_rtrans
# After training your own models, replace pretrain_mtrans and pretrain_rtrans
# with the names of your trained mtrans and rtrans checkpoints.
```

Useful options:

- `--repeat_times`: number of generations per prompt (default: 1).
- `--motion_length`: number of poses to generate.

The output will be saved in `./outputs/exp1/`.

## Citation

If you find our work helpful or inspiring, please consider citing it:

```bibtex
@article{li2025remomask,
  title={ReMoMask: Retrieval-Augmented Masked Motion Generation},
  author={Li, Zhengdao and Wang, Siheng and Zhang, Zeyu and Tang, Hao},
  journal={arXiv preprint arXiv:2508.02605},
  year={2025}
}
```