Commit e4b8dd8 (verified) by lycnight, parent 7e2e0a8: Update README.md

# ReMoMask: Retrieval-Augmented Masked Motion Generation

This is the official repository for the paper:
> **ReMoMask: Retrieval-Augmented Masked Motion Generation**
>
> Zhengdao Li\*, Siheng Wang\*, [Zeyu Zhang](https://steve-zeyu-zhang.github.io/)\*<sup>†</sup>, and [Hao Tang](https://ha0tang.github.io/)<sup>#</sup>
>
> \*Equal contribution. <sup>†</sup>Project lead. <sup>#</sup>Corresponding author.
>
> ### [Paper](https://arxiv.org/abs/2508.02605) | [Website](https://aigeeksgroup.github.io/ReMoMask) | [Model](https://huggingface.co/lycnight/ReMoMask) | [HF Paper](https://huggingface.co/papers/2508.02605)

# ✏️ Citation

```bibtex
@article{li2025remomask,
  title={ReMoMask: Retrieval-Augmented Masked Motion Generation},
  author={Li, Zhengdao and Wang, Siheng and Zhang, Zeyu and Tang, Hao},
  journal={arXiv preprint arXiv:2508.02605},
  year={2025}
}
```

---

# 👋 Introduction

Retrieval-Augmented Text-to-Motion (RAG-T2M) models have demonstrated superior performance over conventional T2M approaches, particularly in handling uncommon and complex textual descriptions by leveraging external motion knowledge. Despite these gains, existing RAG-T2M models remain limited by two closely related factors: (1) coarse-grained text-motion retrieval that overlooks the hierarchical structure of human motion, and (2) underexplored mechanisms for effectively fusing retrieved information into the generative process. In this work, we present **ReMoMask**, a structure-aware RAG framework for text-to-motion generation that addresses these limitations. To improve retrieval, we propose **Hierarchical Bidirectional Momentum** (HBM) Contrastive Learning, which employs dual contrastive objectives to jointly align global motion semantics and fine-grained part-level motion features with text. To address the fusion gap, we first conduct a systematic study on motion representations and information fusion strategies in RAG-T2M, revealing that a 2D motion representation combined with cross-attention-based fusion yields superior performance. Based on these findings, we design **Semantic Spatial-Temporal Attention** (SSTA), a motion-tailored fusion module that more effectively integrates retrieved motion knowledge into the generative backbone. Extensive experiments on HumanML3D, KIT-ML, and SnapMoGen demonstrate that ReMoMask consistently outperforms prior methods on both text-motion retrieval and text-to-motion generation benchmarks.

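The HBM objective aligns text and motion embeddings contrastively in both directions. As a rough illustration of two of its ingredients, here is a minimal numpy sketch of a bidirectional InfoNCE loss plus a momentum-encoder EMA update. All shapes and names are illustrative; this is not the paper's implementation, which additionally models the part-level hierarchy.

```python
import numpy as np

def info_nce(text_emb, motion_emb, temperature=0.07):
    """Bidirectional InfoNCE: matched (text, motion) rows are positives,
    all other rows in the batch act as negatives."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    logits = t @ m.T / temperature                        # (N, N) similarity matrix

    def nll_diag(l):                                      # -log softmax of the diagonal
        l = l - l.max(axis=1, keepdims=True)              # shift rows for stability
        return -(np.diag(l) - np.log(np.exp(l).sum(axis=1))).mean()

    # average of text->motion and motion->text directions
    return 0.5 * (nll_diag(logits) + nll_diag(logits.T))

def momentum_update(key_w, query_w, m=0.999):
    """EMA update of the momentum (key) encoder toward the query encoder."""
    return m * key_w + (1.0 - m) * query_w

rng = np.random.default_rng(0)
loss = info_nce(rng.standard_normal((8, 32)), rng.standard_normal((8, 32)))
```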
## TODO List

- [x] Upload our paper to arXiv and build project pages.
- [x] Upload the code.
- [x] Release the TMR model.
- [x] Release the T2M model.

# 🤗 Prerequisites
<details>
<summary>details</summary>

## Environment
```bash
conda create -n remomask python=3.10
conda activate remomask
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
We tested this environment on both NVIDIA A800 and H20 GPUs.

## Dependencies
### 1. Pretrained models
Download the models from [HuggingFace](https://huggingface.co/lycnight/ReMoMask) and place them as follows:

```
remomask_models.zip
├── checkpoints/        # evaluation models and GloVe embeddings
├── Part_TMR/
│   └── checkpoints/    # RAG pretrained checkpoints
├── logs/               # T2M pretrained checkpoints
├── database/           # RAG database
└── ViT-B-32.pt         # CLIP model
```

### 2. Prepare the training dataset
Follow the instructions in [HumanML3D](https://github.com/EricGuo5513/HumanML3D.git), then place the resulting dataset at `./dataset/HumanML3D`.
</details>

# 🚀 Demo
<details>
<summary>details</summary>

```bash
python demo.py \
    --gpu_id 0 \
    --ext exp_demo \
    --text_prompt "A person is playing the drum set." \
    --checkpoints_dir logs \
    --dataset_name humanml3d \
    --mtrans_name pretrain_mtrans \
    --rtrans_name pretrain_rtrans
# after training your own models, replace pretrain_mtrans and pretrain_rtrans with your mtrans and rtrans names
```

Optional arguments:
* `--repeat_times`: number of generations per prompt, default `1`.
* `--motion_length`: number of poses to generate.

The output will be written to `./outputs/`.
</details>


# 🛠️ Train your own models
<details>
<summary>details</summary>

## Stage 1: train a Motion Retriever
```bash
python Part_TMR/scripts/train.py \
    device=cuda:0 \
    train=train \
    dataset.train_split_filename=train.txt \
    exp_name=exp \
    train.optimizer.motion_lr=1.0e-05 \
    train.optimizer.text_lr=1.0e-05 \
    train.optimizer.head_lr=1.0e-05
# change exp_name to your RAG experiment name
```
Then build a RAG database for training the T2M model:
```bash
python build_rag_database.py \
    --config-name=config \
    device=cuda:0 \
    train=train \
    dataset.train_split_filename=train.txt \
    exp_name=exp_for_mtrans
```
This produces `./database`.
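Conceptually, the database stores precomputed motion embeddings; at query time the retriever scores a text embedding against all of them and keeps the top-k matches. A minimal sketch of that lookup, with illustrative names and shapes rather than the repository's actual database format:

```python
import numpy as np

def build_database(motion_embeddings):
    """L2-normalise motion embeddings so that a dot product at query time
    equals cosine similarity (an illustrative stand-in for ./database)."""
    return motion_embeddings / np.linalg.norm(motion_embeddings, axis=1, keepdims=True)

def retrieve(db, text_embedding, k=3):
    """Return indices of the k database motions most similar to the query."""
    q = text_embedding / np.linalg.norm(text_embedding)
    scores = db @ q                        # cosine similarity to every entry
    return np.argsort(-scores)[:k]         # best k, highest similarity first

rng = np.random.default_rng(0)
db = build_database(rng.standard_normal((100, 64)))
top = retrieve(db, db[42] * 5.0, k=3)      # a query aligned with entry 42
# top[0] == 42: the aligned entry is the best match
```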


## Stage 2: train a Retrieval-Augmented Masked Model

### Train a 2D RVQ-VAE Quantizer
```bash
bash run_rvq.sh \
    vq \
    0 \
    humanml3d \
    --batch_size 256 \
    --num_quantizers 6 \
    --max_epoch 50 \
    --quantize_dropout_prob 0.2 \
    --gamma 0.1 \
    --code_dim2d 1024 \
    --nb_code2d 256
# vq is the save directory name
# 0 selects gpu_0
# humanml3d selects the dataset
# change vq to your own vq name
```
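For intuition: residual vector quantization encodes each latent with a stack of codebooks, each quantizing whatever residual the previous layers left over; `--num_quantizers 6` sets the depth of that stack and `--quantize_dropout_prob` randomly truncates it during training. A toy numpy sketch with hand-made codebooks (not the trained RVQ-VAE):

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Quantize x with a stack of codebooks, each layer encoding the
    residual left by the previous layers (as in an RVQ-VAE)."""
    residual = x.astype(float).copy()
    recon = np.zeros_like(residual)
    codes = []
    for cb in codebooks:                               # cb: (nb_code, dim)
        dists = ((residual[:, None, :] - cb[None]) ** 2).sum(-1)
        idx = dists.argmin(1)                          # nearest code per vector
        codes.append(idx)
        recon += cb[idx]
        residual -= cb[idx]                            # pass the residual onward
    return codes, recon

# tiny hand-made example: the first codebook captures the coarse shape,
# the second captures the remaining residual exactly
x = np.array([[0.9, -0.4]])
cb0 = np.array([[1.0, 0.0], [0.0, -0.5], [0.0, 0.0]])
cb1 = np.array([[-0.1, -0.4], [0.0, 0.0]])
codes, recon = residual_quantize(x, [cb0, cb1])
# recon reproduces x exactly here, since cb1 contains the exact residual
```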

### Train a 2D Retrieval-Augmented Masked Transformer
```bash
bash run_mtrans.sh \
    mtrans \
    1 \
    0 \
    11247 \
    humanml3d \
    --vq_name pretrain_vq \
    --batch_size 64 \
    --max_epoch 2000 \
    --attnj \
    --attnt \
    --latent_dim 512 \
    --n_heads 8 \
    --train_split train.txt \
    --val_split val.txt
# 1 means use one GPU
# 0 means use gpu_0
# 11247 is the DDP master port
# change mtrans to your own mtrans name
```
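As noted in the introduction, retrieved motion features are fused into the generative backbone with cross-attention (the SSTA module). The generic core of such a fusion step, stripped of SSTA's semantic and spatial-temporal structure and using identity projections and made-up shapes, looks roughly like this:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)       # stabilise before exp
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Single-head cross-attention with identity projections: each query
    token takes a similarity-weighted average of the context tokens."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])  # (T_q, T_ctx)
    return softmax(scores, axis=-1) @ context                  # (T_q, d)

rng = np.random.default_rng(0)
motion_tokens = rng.standard_normal((49, 512))   # tokens being generated
retrieved = rng.standard_normal((196, 512))      # features of retrieved motions
fused = cross_attention(motion_tokens, retrieved)
# fused.shape == (49, 512): one fused vector per motion token
```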


### Train a 2D Retrieval-Augmented Residual Transformer
```bash
bash run_rtrans.sh \
    rtrans \
    2 \
    humanml3d \
    --batch_size 64 \
    --vq_name pretrain_vq \
    --cond_drop_prob 0.01 \
    --share_weight \
    --max_epoch 2000 \
    --attnj \
    --attnt
# here, 2 means cuda:0,1
# --vq_name: the vq model to use
# change rtrans to your own rtrans name
```

</details>



# 💪 Evaluation
<details>
<summary>details</summary>

## Evaluate the RAG
```bash
python Part_TMR/scripts/test.py \
    device=cuda:0 \
    train=train \
    exp_name=exp_pretrain
# change exp_pretrain to your RAG model name
```


## Evaluate the T2M

### 1. Evaluate the 2D RVQ-VAE Quantizer
```bash
python eval_vq.py \
    --gpu_id 0 \
    --name pretrain_vq \
    --dataset_name humanml3d \
    --ext eval \
    --which_epoch net_best_fid.tar
# change pretrain_vq to your vq name
```

### 2. Evaluate the 2D Retrieval-Augmented Masked Transformer
```bash
python eval_mask.py \
    --dataset_name humanml3d \
    --mtrans_name pretrain_mtrans \
    --gpu_id 0 \
    --cond_scale 4 \
    --time_steps 10 \
    --ext eval \
    --repeat_times 1 \
    --which_epoch net_best_fid.tar
# change pretrain_mtrans to your mtrans name
```
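The `--time_steps` and `--cond_scale` flags correspond to iterative masked decoding and classifier-free guidance. A toy numpy sketch of that generate-then-remask loop; the `toy_model` stand-in and the cosine-schedule details are illustrative, not the repository's exact procedure:

```python
import numpy as np

MASK = -1  # sentinel value for a masked token slot

def cfg_logits(cond_logits, uncond_logits, cond_scale):
    """Classifier-free guidance: extrapolate away from the unconditional logits."""
    return uncond_logits + cond_scale * (cond_logits - uncond_logits)

def masked_decode(model, seq_len, time_steps=10, cond_scale=4.0):
    """Start fully masked; each step fills every masked slot, then re-masks
    the lowest-confidence fresh predictions on a cosine schedule."""
    tokens = np.full(seq_len, MASK)
    for step in range(1, time_steps + 1):
        logits = cfg_logits(model(tokens, cond=True),
                            model(tokens, cond=False), cond_scale)
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        conf = np.where(tokens == MASK, probs.max(-1), np.inf)  # committed = safe
        tokens = np.where(tokens == MASK, probs.argmax(-1), tokens)
        n_mask = int(seq_len * np.cos(np.pi / 2 * step / time_steps))
        if n_mask > 0:
            tokens[np.argsort(conf)[:n_mask]] = MASK  # re-mask least confident
    return tokens

def toy_model(tokens, cond):
    """Hypothetical stand-in for the mask transformer: fixed random logits."""
    rng = np.random.default_rng(1 if cond else 2)
    return rng.standard_normal((tokens.shape[0], 8))

out = masked_decode(toy_model, seq_len=12, cond_scale=4.0)
# all 12 slots end up holding a committed token id in [0, 8)
```

The schedule masks fewer tokens each iteration, so after `time_steps` steps every position holds a committed token.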


### 3. Evaluate the 2D Residual Transformer
HumanML3D:
```bash
python eval_res.py \
    --gpu_id 0 \
    --dataset_name humanml3d \
    --mtrans_name pretrain_mtrans \
    --rtrans_name pretrain_rtrans \
    --cond_scale 4 \
    --time_steps 10 \
    --ext eval \
    --which_ckpt net_best_fid.tar \
    --which_epoch fid \
    --traverse_res
# change pretrain_mtrans and pretrain_rtrans to your mtrans and rtrans names
```
</details>



# 🤖 Visualization
<details>
<summary>details</summary>

## 1. Download and set up Blender
<details>
<summary>details</summary>

Download Blender 2.93 LTS from the [official page](https://www.blender.org/download/lts/2-93/). Please install exactly this version; for our paper we use `blender-2.93.18-linux-x64`.

### a. Unzip it:
```bash
tar -xvf blender-2.93.18-linux-x64.tar.xz
```

### b. Check that Blender is installed correctly:
```bash
cd blender-2.93.18-linux-x64
./blender --background --version
```
You should see: `Blender 2.93.18 (hash cb886axxxx built 2023-05-22 23:33:27)`
```bash
./blender --background --python-expr "import sys; import os; print('\nThe version of python is ' + sys.version.split(' ')[0])"
```
You should see: `The version of python is 3.9.2`

### c. Get the Blender-Python path
```bash
./blender --background --python-expr "import sys; import os; print('\nThe path to the installation of python is\n' + sys.executable)"
```
You should see: `The path to the installation of python is /xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9`

### d. Install pip for Blender-Python
```bash
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m ensurepip --upgrade
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install --upgrade pip
```

### e. Prepare the environment for Blender-Python
```bash
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install numpy==2.0.2
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install matplotlib==3.9.4
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install hydra-core==1.3.2
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install hydra_colorlog==1.2.0
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install moviepy==1.0.3
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install shortuuid==1.0.13
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install natsort==8.4.0
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install pytest-shutil==1.8.1
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install tqdm==4.67.1
```
</details>


## 2. Calculate the SMPL mesh:
```bash
python -m fit --dir new_test_npy --save_folder new_temp_npy --cuda cuda:0
```

## 3. Render to video or image sequence
```bash
/xxx/blender-2.93.18-linux-x64/blender --background --python render.py -- --cfg=./configs/render_mld.yaml --dir=test_npy --mode=video --joint_type=HumanML3D
```
- `--mode=video`: render to an mp4 video
- `--mode=sequence`: render to a single png image, called a sequence

</details>

# 👍 Acknowledgements
We sincerely thank the authors of the open-source works our code builds on:

[MoMask](https://github.com/EricGuo5513/momask-codes),
[MoGenTS](https://github.com/weihaosky/mogents),
[ReMoDiffuse](https://github.com/mingyuan-zhang/ReMoDiffuse),
[MDM](https://github.com/GuyTevet/motion-diffusion-model),
[TMR](https://github.com/Mathux/TMR),
[ReMoGPT](https://ojs.aaai.org/index.php/AAAI/article/view/33044)

## 🔒 License
This code is distributed under a [CC BY-NC-SA 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en).

Note that our code depends on other libraries, including CLIP, SMPL, SMPL-X, and PyTorch3D, and uses datasets that each have their own licenses that must also be followed.