lycnight committed on
Commit
205d043
·
verified ·
1 Parent(s): 5ba590c

Update README.md

Files changed (1)
  1. README.md +331 -3
README.md CHANGED
@@ -1,5 +1,333 @@
1
  ---
2
- license: apache-2.0
3
- ---
4
 
5
- ReMoMask
1
+ # ReMoMask: Retrieval-Augmented Masked Motion Generation<br>
2
+
3
+ This is the official repository for the paper:
4
+ > **ReMoMask: Retrieval-Augmented Masked Motion Generation**
5
+ >
6
+ > Zhengdao Li\*, Siheng Wang\*, [Zeyu Zhang](https://steve-zeyu-zhang.github.io/)\*<sup>†</sup>, and [Hao Tang](https://ha0tang.github.io/)<sup>#</sup>
7
+ >
8
+ > \*Equal contribution. <sup>†</sup>Project lead. <sup>#</sup>Corresponding author.
9
+ >
10
+ > ### [Paper](https://arxiv.org/abs/2508.02605) | [Website](https://aigeeksgroup.github.io/ReMoMask) | [Model](https://huggingface.co/lycnight/ReMoMask) | [HF Paper](https://huggingface.co/papers/2508.02605)
11
+
12
+
13
+ <video>
14
+
15
+ # ✏️ Citation
16
+
17
+ ```
18
+ @article{li2025remomask,
19
+ title={ReMoMask: Retrieval-Augmented Masked Motion Generation},
20
+ author={Li, Zhengdao and Wang, Siheng and Zhang, Zeyu and Tang, Hao},
21
+ journal={arXiv preprint arXiv:2508.02605},
22
+ year={2025}
23
+ }
24
+ ```
25
+
26
  ---
 
 
27
 
28
+ # 👋 Introduction
29
+ Retrieval-Augmented Text-to-Motion (RAG-T2M) models have demonstrated superior performance over conventional T2M approaches, particularly in handling uncommon and complex textual descriptions by leveraging external motion knowledge. Despite these gains, existing RAG-T2M models remain limited by two closely related factors: (1) coarse-grained text-motion retrieval that overlooks the hierarchical structure of human motion, and (2) underexplored mechanisms for effectively fusing retrieved information into the generative process. In this work, we present **ReMoMask**, a structure-aware RAG framework for text-to-motion generation that addresses these limitations. To improve retrieval, we propose **Hierarchical Bidirectional Momentum** (HBM) Contrastive Learning, which employs dual contrastive objectives to jointly align global motion semantics and fine-grained part-level motion features with text. To address the fusion gap, we first conduct a systematic study on motion representations and information fusion strategies in RAG-T2M, revealing that a 2D motion representation combined with cross-attention-based fusion yields superior performance. Based on these findings, we design **Semantic Spatial-Temporal Attention** (SSTA), a motion-tailored fusion module that more effectively integrates retrieved motion knowledge into the generative backbone. Extensive experiments on HumanML3D, KIT-ML, and SnapMoGen demonstrate that ReMoMask consistently outperforms prior methods on both text-motion retrieval and text-to-motion generation benchmarks.
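As a rough illustration of the bidirectional text-motion alignment described above, the sketch below computes a generic symmetric InfoNCE loss over paired text and motion embeddings in plain Python. It is a textbook formulation, not the HBM objective from the paper: the momentum encoders and the part-level hierarchy are omitted, and the function names and the `temperature` value are assumptions made for this example.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(text_emb, motion_emb, temperature=0.1):
    """Average of the text-to-motion and motion-to-text InfoNCE losses.

    text_emb[i] and motion_emb[i] are assumed to be a matching pair;
    every other combination in the batch is treated as a negative.
    """
    text = [_normalize(v) for v in text_emb]
    motion = [_normalize(v) for v in motion_emb]
    # Cosine-similarity logits: rows index texts, columns index motions.
    logits = [
        [sum(a * b for a, b in zip(t, m)) / temperature for m in motion]
        for t in text
    ]
    logits_t = [list(col) for col in zip(*logits)]  # motion-to-text view
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))
```

The loss is small when each text embedding is closest to its own motion embedding and large when pairs are mismatched, which is the property a retriever is trained to exploit.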
30
+
31
+
32
+
33
+ ![framework](./assets/framework.png)
34
+
35
+ ## TODO List
36
+
37
+ - [x] Upload our paper to arXiv and build project pages.
38
+ - [x] Upload the code.
39
+ - [x] Release TMR model.
40
+ - [x] Release T2M model.
41
+
42
+ # 🤗 Prerequisite
43
+ <details>
44
+ <summary>details</summary>
45
+
46
+ ## Environment
47
+ ```bash
48
+ conda create -n remomask python=3.10
49
+ conda activate remomask
50
+ pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
51
+ pip install -r requirements.txt
52
+ ```
53
+ We tested our environment on both A800 and H20 GPUs.
54
+
55
+ ## Dependencies
56
+ ### 1. pretrained models
57
+ Download the models from [HuggingFace](https://huggingface.co/lycnight/ReMoMask) and arrange them as follows:
58
+
59
+ ```
60
+ remomask_models.zip
61
+ ├── checkpoints/ # Evaluation Models and GloVe
62
+ ├── Part_TMR/
63
+ │ └── checkpoints/ # RAG pretrained checkpoints
64
+ ├── logs/ # T2M pretrained checkpoints
65
+ ├── database/ # RAG database
66
+ └── ViT-B-32.pt # CLIP model
67
+ ```
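Before running the demo or evaluation, it can help to confirm that the extracted files sit where the tree above places them. The following is a minimal sketch, assuming the archive was extracted into the repository root; the `EXPECTED` list simply mirrors the tree, and `missing_paths` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path

# Entries copied from the directory tree above; adjust if your layout differs.
EXPECTED = [
    "checkpoints",
    "Part_TMR/checkpoints",
    "logs",
    "database",
    "ViT-B-32.pt",
]


def missing_paths(root, expected=EXPECTED):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing entries:", ", ".join(missing))
    else:
        print("All expected entries were found.")
```

Run it from the repository root; any entry it reports still needs to be downloaded or moved.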
68
+
69
+ ### 2. Prepare training dataset
70
+ Follow the instructions in [HumanML3D](https://github.com/EricGuo5513/HumanML3D.git), then place the resulting dataset in `./dataset/HumanML3D`.
71
+ </details>
72
+
73
+ # 🚀 Demo
74
+ <details>
75
+ <summary>details</summary>
76
+
77
+ ```bash
78
+ python demo.py \
79
+ --gpu_id 0 \
80
+ --ext exp_demo \
81
+ --text_prompt "A person is playing the drum set." \
82
+ --checkpoints_dir logs \
83
+ --dataset_name humanml3d \
84
+ --mtrans_name pretrain_mtrans \
85
+ --rtrans_name pretrain_rtrans
86
+ # replace pretrain_mtrans and pretrain_rtrans with your own mtrans and rtrans names once training is done
87
+ ```
88
+ Additional options:
89
+ * `--repeat_times`: number of replications for generation, default `1`.
90
+ * `--motion_length`: specify the number of poses for generation.
91
+
92
+ Output is written to `./outputs/`.
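To generate motions for several prompts in one go, the demo command above can be assembled programmatically. This is a sketch only: it uses the flags listed in this README, `build_demo_command` is a hypothetical helper, and the second prompt is made up for illustration.

```python
import shlex
import sys


def build_demo_command(text_prompt, ext="exp_demo", gpu_id=0,
                       repeat_times=None, motion_length=None):
    """Build the argv list for demo.py using the flags shown in this README."""
    cmd = [
        sys.executable, "demo.py",
        "--gpu_id", str(gpu_id),
        "--ext", ext,
        "--text_prompt", text_prompt,
        "--checkpoints_dir", "logs",
        "--dataset_name", "humanml3d",
        "--mtrans_name", "pretrain_mtrans",
        "--rtrans_name", "pretrain_rtrans",
    ]
    # Optional flags are appended only when explicitly requested.
    if repeat_times is not None:
        cmd += ["--repeat_times", str(repeat_times)]
    if motion_length is not None:
        cmd += ["--motion_length", str(motion_length)]
    return cmd


if __name__ == "__main__":
    prompts = [
        "A person is playing the drum set.",
        "A person walks forward.",  # hypothetical extra prompt
    ]
    for i, prompt in enumerate(prompts):
        cmd = build_demo_command(prompt, ext=f"exp_demo_{i}")
        # Printed as a dry run; pass `cmd` to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems with prompts that contain spaces or punctuation.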
93
+ </details>
94
+
95
+
96
+ # 🛠️ Train your own models
97
+ <details>
98
+ <summary>details</summary>
99
+
100
+ ## Stage1: train a Motion Retriever
101
+ ```bash
102
+ python Part_TMR/scripts/train.py \
103
+ device=cuda:0 \
104
+ train=train \
105
+ dataset.train_split_filename=train.txt \
106
+ exp_name=exp \
107
+ train.optimizer.motion_lr=1.0e-05 \
108
+ train.optimizer.text_lr=1.0e-05 \
109
+ train.optimizer.head_lr=1.0e-05
110
+ # change the exp_name to your rag name
111
+ ```
112
+ Then build a RAG database for training the T2M model:
113
+ ```bash
114
+ python build_rag_database.py \
115
+ --config-name=config \
116
+ device=cuda:0 \
117
+ train=train \
118
+ dataset.train_split_filename=train.txt \
119
+ exp_name=exp_for_mtrans
120
+ ```
121
+ This produces the `./database` directory.
122
+
123
+
124
+ ## Stage2: train a Retrieval Augmented Mask Model
125
+
126
+ ### train a 2D RVQ-VAE Quantizer
127
+ ```bash
128
+ bash run_rvq.sh \
129
+ vq \
130
+ 0 \
131
+ humanml3d \
132
+ --batch_size 256 \
133
+ --num_quantizers 6 \
134
+ --max_epoch 50 \
135
+ --quantize_dropout_prob 0.2 \
136
+ --gamma 0.1 \
137
+ --code_dim2d 1024 \
138
+ --nb_code2d 256
139
+ # vq means the save dir
140
+ # 0 means gpu_0
141
+ # humanml3d means dataset
142
+ # change vq to your own vq name
143
+ ```
144
+
145
+ ### train a 2D Retrieval-Augmented Mask Transformer
146
+ ```bash
147
+ bash run_mtrans.sh \
148
+ mtrans \
149
+ 1 \
150
+ 0 \
151
+ 11247 \
152
+ humanml3d \
153
+ --vq_name pretrain_vq \
154
+ --batch_size 64 \
155
+ --max_epoch 2000 \
156
+ --attnj \
157
+ --attnt \
158
+ --latent_dim 512 \
159
+ --n_heads 8 \
160
+ --train_split train.txt \
161
+ --val_split val.txt
162
+ # 1 means using one gpu
163
+ # 0 means using gpu_0
164
+ # 11247 means ddp master port
165
+ # change the mtrans to your mtrans name
166
+ ```
167
+
168
+
169
+ ### train a 2D Retrieval-Augmented Residual Transformer
170
+ ```bash
171
+ bash run_rtrans.sh \
172
+ rtrans \
173
+ 2 \
174
+ humanml3d \
175
+ --batch_size 64 \
176
+ --vq_name pretrain_vq \
177
+ --cond_drop_prob 0.01 \
178
+ --share_weight \
179
+ --max_epoch 2000 \
180
+ --attnj \
181
+ --attnt
182
+ # here, 2 means cuda:0,1
183
+ # --vq_name: the vq model you want to use
184
+ # change rtrans to your rtrans name
185
+ ```
186
+
187
+ </details>
188
+
189
+
190
+
191
+ # 💪 Evaluation
192
+ <details>
193
+ <summary>details</summary>
194
+
195
+ ## Evaluate the RAG
196
+ ```bash
197
+ python Part_TMR/scripts/test.py \
198
+ device=cuda:0 \
199
+ train=train \
200
+ exp_name=exp_pretrain
201
+ # change exp_pretrain to your rag model
202
+ ```
203
+
204
+
205
+ ## Evaluate the T2M
206
+
207
+ ### 1. Evaluate the 2D RVQ-VAE Quantizer
208
+ ```bash
209
+ python eval_vq.py \
210
+ --gpu_id 0 \
211
+ --name pretrain_vq \
212
+ --dataset_name humanml3d \
213
+ --ext eval \
214
+ --which_epoch net_best_fid.tar
215
+ # change pretrain_vq to your vq
216
+ ```
217
+
218
+ ### 2. Evaluate the 2D Retrieval-Augmented Masked Transformer
219
+ ```bash
220
+ python eval_mask.py \
221
+ --dataset_name humanml3d \
222
+ --mtrans_name pretrain_mtrans \
223
+ --gpu_id 0 \
224
+ --cond_scale 4 \
225
+ --time_steps 10 \
226
+ --ext eval \
227
+ --repeat_times 1 \
228
+ --which_epoch net_best_fid.tar
229
+ # change pretrain_mtrans to your mtrans
230
+ ```
231
+
232
+
233
+ ### 3. Evaluate the 2D Residual Transformer
234
+ HumanML3D:
235
+ ```bash
236
+ python eval_res.py \
237
+ --gpu_id 0 \
238
+ --dataset_name humanml3d \
239
+ --mtrans_name pretrain_mtrans \
240
+ --rtrans_name pretrain_rtrans \
241
+ --cond_scale 4 \
242
+ --time_steps 10 \
243
+ --ext eval \
244
+ --which_ckpt net_best_fid.tar \
245
+ --which_epoch fid \
246
+ --traverse_res
247
+ # change pretrain_mtrans and pretrain_rtrans to your mtrans and rtrans
248
+ ```
249
+ </details>
250
+
251
+
252
+
253
+ # 🤖 Visualization
254
+ <details>
255
+ <summary>details</summary>
256
+
257
+ ## 1. download and set up blender
258
+ <details>
259
+ <summary>details</summary>
260
+ Download Blender from the [Blender 2.93 LTS page](https://www.blender.org/download/lts/2-93/). Please install exactly this version; for our paper, we used `blender-2.93.18-linux-x64`.
261
+ >
262
+ ### a. unzip it:
263
+ ```bash
264
+ tar -xvf blender-2.93.18-linux-x64.tar.xz
265
+ ```
266
+
267
+ ### b. check that Blender is installed correctly:
268
+ ```bash
269
+ cd blender-2.93.18-linux-x64
270
+ ./blender --background --version
271
+ ```
272
+ you should see: `Blender 2.93.18 (hash cb886axxxx built 2023-05-22 23:33:27)`
273
+ ```bash
274
+ ./blender --background --python-expr "import sys; import os; print('\nThe version of python is ' + sys.version.split(' ')[0])"
275
+ ```
276
+ you should see: `The version of python is 3.9.2`
277
+
278
+ ### c. get the blender-python path
279
+ ```bash
280
+ ./blender --background --python-expr "import sys; import os; print('\nThe path to the installation of python is\n' + sys.executable)"
281
+ ```
282
+ you should see: `The path to the installation of python is /xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9`
283
+
284
+ ### d. install pip for blender-python
285
+ ```bash
286
+ /xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m ensurepip --upgrade
287
+ /xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install --upgrade pip
288
+ ```
289
+
290
+ ### e. prepare env for blender-python
291
+ ```bash
292
+ /xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install numpy==2.0.2
293
+ /xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install matplotlib==3.9.4
294
+ /xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install hydra-core==1.3.2
295
+ /xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install hydra_colorlog==1.2.0
296
+ /xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install moviepy==1.0.3
297
+ /xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install shortuuid==1.0.13
298
+ /xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install natsort==8.4.0
299
+ /xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install pytest-shutil==1.8.1
300
+ /xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install tqdm==4.67.1
301
+ /xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install tqdm==1.17.0
302
+ ```
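Steps d and e above repeat the same interpreter path on every line. As a convenience, the commands can be generated from the Blender install directory. The sketch below assumes the bundled interpreter lives at `<blender_root>/2.93/python/bin/python3.9`, as in the paths above; the helper names are hypothetical, and `PINS` lists only a subset of the pinned packages for illustration.

```python
from pathlib import Path


def blender_python(blender_root, blender_series="2.93", python_name="python3.9"):
    """Path to Blender's bundled Python, following the layout shown above."""
    return Path(blender_root) / blender_series / "python" / "bin" / python_name


def pip_install_commands(blender_root, pins):
    """One `python -m pip install name==version` argv list per pinned package."""
    py = str(blender_python(blender_root))
    return [
        [py, "-m", "pip", "install", f"{name}=={version}"]
        for name, version in pins.items()
    ]


# Illustrative subset of the pins from step e; extend it with the rest as needed.
PINS = {
    "numpy": "2.0.2",
    "matplotlib": "3.9.4",
    "hydra-core": "1.3.2",
    "moviepy": "1.0.3",
}

if __name__ == "__main__":
    # "/xxx" is the same placeholder prefix used in the commands above.
    for cmd in pip_install_commands("/xxx/blender-2.93.18-linux-x64", PINS):
        print(" ".join(cmd))
```

The script only prints the commands; run them in a shell, or pass each list to `subprocess.run`, once the path matches your installation.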
303
+ </details>
304
+
305
+
306
+ ## 2. calculate SMPL mesh:
307
+ ```bash
308
+ python -m fit --dir new_test_npy --save_folder new_temp_npy --cuda cuda:0
309
+ ```
310
+
311
+ ## 3. render to video or sequence
312
+ ```bash
313
+ /xxx/blender-2.93.18-linux-x64/blender --background --python render.py -- --cfg=./configs/render_mld.yaml --dir=test_npy --mode=video --joint_type=HumanML3D
314
+ ```
315
+ - `--mode=video`: render to mp4 video
316
+ - `--mode=sequence`: render to a PNG image, called a sequence.
317
+
318
+ </details>
319
+
320
+ # 👍 Acknowledgements
321
+ We sincerely thank the authors of the following open-source works, on which our code is based:
322
+
323
+ [MoMask](https://github.com/EricGuo5513/momask-codes),
324
+ [MoGenTS](https://github.com/weihaosky/mogents),
325
+ [ReMoDiffuse](https://github.com/mingyuan-zhang/ReMoDiffuse),
326
+ [MDM](https://github.com/GuyTevet/motion-diffusion-model),
327
+ [TMR](https://github.com/Mathux/TMR),
328
+ [ReMoGPT](https://ojs.aaai.org/index.php/AAAI/article/view/33044)
329
+
330
+ ## 🔒 License
331
+ This code is distributed under a [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en) license.
332
+
333
+ Note that our code depends on other libraries, including CLIP, SMPL, SMPL-X, PyTorch3D, and uses datasets that each have their own respective licenses that must also be followed.