MoMask: Generative Masked Modeling of 3D Human Motions
Text-to-motion baseline integrated into the hftrainer Model Zoo. Our
reproduction is fully self-contained and independent of ref_repo: the
RVQ-VAE tokenizer, the masked generative transformer, the residual transformer
and the length estimator are all vendored into
hftrainer.models.motion.momask._momask, preserving numerical parity with the
released HumanML3D checkpoints. The CLIP ViT-B/32 text encoder is reloaded by
name only for legacy lightweight artifacts; new hftrainer artifacts include
clip.safetensors.
| Task | Text-to-Motion (T2M) |
| Bundle / Pipeline | MoMaskBundle / MoMaskPipeline |
| Processed HF artifact | ZeyuLing/hftrainer-momask-humanml3d |
| Motion representation | HumanML3D-263 (263-dim, 20 fps, 22 joints) |
| Tokenizer | RVQ-VAE, 6 residual quantizers, codebook 512×512 |
| Generator | MaskTransformer (masked iterative decoding) + ResidualTransformer |
| Text encoder | CLIP ViT-B/32 (frozen) |
| Paper | MoMask: Generative Masked Modeling of 3D Human Motions, Guo et al., CVPR 2024, arXiv:2312.00063 |
| Original code | https://github.com/EricGuo5513/momask-codes |
Weights
Self-contained hftrainer artifact (diffusers-style from_pretrained):
| Artifact | Location | Contents | Status |
|---|---|---|---|
| MoMask HumanML3D | ZeyuLing/hftrainer-momask-humanml3d | vq.safetensors + t2m_trans.safetensors + res_trans.safetensors + length_est.safetensors + clip.safetensors + momask_config.json + Mean.npy / Std.npy | public Hub artifact |
| local mirror | checkpoints/momask/humanml3d | same layout (produced by convert_momask_checkpoint.py, see below) | optional local cache |
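Before loading a local mirror, it can help to confirm that the directory matches the layout in the table. The sketch below is illustrative and not part of hftrainer: `check_artifact_dir` is a hypothetical helper, and the required/optional split is an assumption based on the notes that the length estimator is optional and that legacy lightweight artifacts ship without clip.safetensors.

```python
from pathlib import Path

# File names taken from the artifact table above.
REQUIRED_FILES = [
    "vq.safetensors",
    "t2m_trans.safetensors",
    "res_trans.safetensors",
    "momask_config.json",
    "Mean.npy",
    "Std.npy",
]
# Assumption: these two may be absent (optional length estimator,
# legacy artifacts without a bundled CLIP).
OPTIONAL_FILES = ["length_est.safetensors", "clip.safetensors"]


def check_artifact_dir(root):
    """Return (missing_required, missing_optional) for an artifact directory."""
    root = Path(root)
    missing_required = [f for f in REQUIRED_FILES if not (root / f).is_file()]
    missing_optional = [f for f in OPTIONAL_FILES if not (root / f).is_file()]
    return missing_required, missing_optional


if __name__ == "__main__":
    required, optional = check_artifact_dir("checkpoints/momask/humanml3d")
    print("missing required:", required)
    print("missing optional:", optional)
```

An empty first list means every file the table names as core content is present; entries in the second list only indicate a legacy or reduced artifact.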
Use the published artifact directly from the Hub:
from hftrainer.pipelines.momask import MoMaskPipeline
pipe = MoMaskPipeline.from_pretrained(
    "ZeyuLing/hftrainer-momask-humanml3d",
    device="cuda",
)
motions = pipe.infer_t2m(
    ["a person walks forward then sits down"],
    [120],  # target length in frames @ 20 fps
)  # list of (T, 263)
The artifact is produced from the released upstream .tar checkpoints with
scripts/eval/convert_momask_checkpoint.py (--verify asserts bit-identical
generation after the round-trip):
python3 scripts/eval/convert_momask_checkpoint.py \
    --weights_root ref_repo/Momask/weights \
    --out_dir checkpoints/momask/humanml3d \
    --verify
Use it:
from hftrainer.pipelines.momask import MoMaskPipeline
pipe = MoMaskPipeline.from_pretrained("checkpoints/momask/humanml3d", device="cuda")
# fixed length (frames @ 20 fps):
motions = pipe.infer_t2m(["a person walks forward then sits down"], [120])
# or let the length estimator pick the length:
motions = pipe.infer_t2m(["a person walks forward then sits down"]) # list of (T, 263)
You can also drive it directly from the released weights, no conversion needed:
bundle = MoMaskBundle(weights_root="ref_repo/Momask/weights")
Motion representation
HumanML3D-263, the standard redundant T2M feature (Guo et al.), 20 fps, 22-joint SMPL skeleton. Per frame (263 dims):
| Slice | Dim | Meaning |
|---|---|---|
| root_rot_vel | 1 | root angular velocity (about Y) |
| root_lin_vel | 2 | root linear velocity (XZ plane) |
| root_y | 1 | root height |
| ric_data | 63 | local joint positions (21×3) |
| rot_data | 126 | local joint rotations (21×6, continuous 6D) |
| local_vel | 66 | local joint velocities (22×3) |
| foot_contact | 4 | binary foot-contact labels |
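The table can be turned into index ranges over the feature axis. This is a minimal sketch that assumes the slices are stored contiguously in the order listed; `HML263_LAYOUT` and `hml263_slices` are hypothetical names used only for illustration.

```python
# (name, width) pairs in the order given by the table above.
HML263_LAYOUT = [
    ("root_rot_vel", 1),
    ("root_lin_vel", 2),
    ("root_y", 1),
    ("ric_data", 63),
    ("rot_data", 126),
    ("local_vel", 66),
    ("foot_contact", 4),
]


def hml263_slices(layout=HML263_LAYOUT):
    """Map each slice name to a Python slice over the feature axis."""
    slices, start = {}, 0
    for name, width in layout:
        slices[name] = slice(start, start + width)
        start += width
    return slices


SLICES = hml263_slices()

# The widths add up to the full feature size.
assert SLICES["foot_contact"].stop == 263

frame = [0.0] * 263                           # one dummy frame
foot_contact = frame[SLICES["foot_contact"]]  # the last 4 values
```

The same slices apply unchanged to the last axis of a (T, 263) array.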
The RVQ-VAE tokenizes this with unit_length = 4 (one token per 4 frames), so a
196-frame motion maps to 49 tokens × 6 quantizers.
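The frame-to-token arithmetic can be checked with a few lines. This sketch assumes that trailing frames which do not fill a whole token are dropped; `token_shape` and `duration_seconds` are hypothetical helper names, not hftrainer functions.

```python
UNIT_LENGTH = 4      # frames per token, from the text above
NUM_QUANTIZERS = 6   # residual quantizer layers
FPS = 20             # HumanML3D frame rate


def token_shape(num_frames, unit_length=UNIT_LENGTH, num_quantizers=NUM_QUANTIZERS):
    """Return (tokens per layer, quantizer layers) for a motion of num_frames.

    Assumption: frames beyond the last full token are dropped.
    """
    return num_frames // unit_length, num_quantizers


def duration_seconds(num_frames, fps=FPS):
    """Length of a motion in seconds at the given frame rate."""
    return num_frames / fps


print(token_shape(196))        # (49, 6)
print(duration_seconds(120))   # 6.0
```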
Generation
Three vendored stages (parity with scripts/eval/momask_infer_h3d_test.py):
- MaskTransformer: confidence-based masked iterative decoding of the
  base (q=0) token map, with classifier-free guidance cond_scale=4 over
  time_steps=10 iterations, a cosine mask schedule, topkr=0.9 and
  temperature=1.0.
- ResidualTransformer: autoregressively predicts quantizers q=1..5
  conditioned on the lower layers (cond_scale=5, temperature=1.0).
- RVQVAE.forward_decoder: de-quantizes the (T, 6) tokens and decodes them to
  the 263-dim feature, which is then de-normalised with the training
  Mean / Std.
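To make the cosine mask schedule of the first stage concrete, the sketch below computes how many base-layer tokens stay masked at each decoding iteration. It follows the generic MaskGIT-style recipe and is not the vendored code: the function names are hypothetical, and both the linear spacing of the progress variable and the floor of one masked token are assumptions.

```python
import math


def cosine_mask_ratio(t):
    """Fraction of tokens kept masked at progress t in [0, 1]."""
    return math.cos(t * math.pi / 2)


def masked_counts(num_tokens, time_steps=10):
    """Number of tokens masked at the start of each decoding iteration."""
    counts = []
    for i in range(time_steps):
        # Assumption: progress is spaced linearly from 0 to 1 inclusive.
        t = i / (time_steps - 1) if time_steps > 1 else 0.0
        n = round(cosine_mask_ratio(t) * num_tokens)
        counts.append(max(1, n))  # assumption: at least one token stays masked
    return counts


# 49 tokens corresponds to a 196-frame motion.
print(masked_counts(49, time_steps=10))
```

Decoding starts with every token masked and re-masks progressively fewer low-confidence tokens, so most of the sequence is committed in the later iterations.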
Evaluation
Generation follows the official HumanML3D protocol (standard test split,
native 263-dim @ 20 fps, first caption), and the outputs are scored with the
persisted HumanML263Evaluator. Reproduce with:
# 1) generate
python3 scripts/eval/momask_t2m_h3d263.py \
    --model_path checkpoints/momask/humanml3d \
    --out_dir outputs/evaluation/momask_h3d263_official/momask_263
# 2) score with the HumanML3D-263 evaluator
python3 scripts/eval/verify_evaluators.py --which hml263 \
    --hml263-pred outputs/evaluation/momask_h3d263_official/momask_263
HumanML3D-263 evaluator (native space)
| Metric | hftrainer | MoMask paper |
|---|---|---|
| FID ↓ | 0.097 | 0.045 |
| R-Precision Top-1 / 2 / 3 ↑ | 0.516 / 0.709 / 0.804 | 0.521 / 0.713 / 0.807 |
| MM-Dist ↓ | 2.990 | 2.958 |
| Diversity → | 9.460 | 9.620 |
(20 repeats, n = 3970; GT/real reference under the same evaluator: R-Prec 0.513 / 0.711 / 0.807, MM-Dist 2.932, Diversity 9.453.)
R-Precision, MM-Dist and Diversity are close to the paper's values (within 0.005, 0.032 and 0.16 respectively), which indicates that generation is faithfully reproduced. The remaining FID gap (0.097 vs 0.045) is attributed to a data-processing / population difference in the evaluation set (e.g. no sub-clip predictions, test-split composition) rather than to generation quality: the decode path is verified parity-equal to the released MoMask inference (momask_infer_h3d_test.py).
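The per-metric gaps can be recomputed from the table. The snippet only restates the reported numbers; the short keys (R@1 and so on) are labels chosen here for brevity.

```python
# Values copied from the HumanML3D-263 table above.
hftrainer = {"FID": 0.097, "R@1": 0.516, "R@2": 0.709, "R@3": 0.804,
             "MM-Dist": 2.990, "Diversity": 9.460}
paper = {"FID": 0.045, "R@1": 0.521, "R@2": 0.713, "R@3": 0.807,
         "MM-Dist": 2.958, "Diversity": 9.620}

# Signed gap: hftrainer minus paper, rounded to the table's precision.
gaps = {name: round(hftrainer[name] - paper[name], 3) for name in hftrainer}

for name, gap in gaps.items():
    print(f"{name:>9}: {gap:+.3f}")
```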
MotionStreamer-272 evaluator (SMPL retarget path)
For cross-model comparison with the MotionStreamer / HYMotion-M2M evaluator,
native HumanML3D-263 predictions are retargeted through the validated MDM-style
chain: HML263 -> SMPL motion_135 (IK refine-80, 20 -> 30 fps) ->
MotionStreamer-272 -> MotionStreamer272Evaluator.
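The 20 -> 30 fps step changes the number of frames per clip. The sketch below shows one generic way to resample a sequence by linear interpolation. It is not the repository's retargeting code: `resample_linear` is a hypothetical helper, the output length rule is an assumption, and plain linear interpolation is only reasonable for position-like channels, not for rotation parameters.

```python
def resample_linear(frames, fps_in=20, fps_out=30):
    """Linearly interpolate a list of per-frame feature lists to a new rate."""
    n_in = len(frames)
    if n_in == 0:
        return []
    # Assumption: the clip duration is preserved as n_in / fps_in seconds.
    n_out = n_in * fps_out // fps_in
    out = []
    for j in range(n_out):
        # Position of output frame j on the input frame axis, clamped to the end.
        pos = min(j * fps_in / fps_out, n_in - 1)
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


# A 120-frame clip (6 s at 20 fps) becomes 180 frames at 30 fps.
clip = [[float(i)] for i in range(120)]
print(len(resample_linear(clip)))  # 180
```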
| Metric | hftrainer | MS-272 GT/Real |
|---|---|---|
| FID ↓ | 114.869 | 0.000 |
| R-Precision Top-1 ↑ | 0.485 | 0.706 |
| R-Precision Top-2 ↑ | 0.650 | 0.857 |
| R-Precision Top-3 ↑ | 0.731 | 0.911 |
| MM-Dist ↓ | 19.411 | 15.007 |
| Diversity → | 25.427 | 27.281 |
Run details: n_repeats = 20, n_samples_used = 7392,
skipped_no_pred = 0, outputs under
outputs/evaluation/ms272_from263/momask_272, metrics in
outputs/evaluation/ms272_from263/metrics_momask.json.
Implementation notes
- Vendored, ref_repo-independent: hftrainer/models/motion/momask/_momask/
  holds the RVQ-VAE (vq/), the masked / residual transformers
  (mask_transformer/) and the masked iterative decoding entry point
  (inference.py). Imports are package-relative; training-only code paths are
  not exercised.
- Sub-modules: vq_model / t2m_transformer / res_transformer /
  length_estimator (the last is optional, load_length_estimator=False).
- CLIP: frozen ViT-B/32 lives inside the two transformers and is stored once
  as clip.safetensors in new artifacts. MoMaskBundle.from_pretrained passes
  that file path into both transformers; legacy lightweight artifacts still
  fall back to clip_version.
- Normalization travels with the checkpoint: Mean.npy / Std.npy are the
  RVQ-VAE training stats, embedded in the artifact.
- Guidance: classifier-free, base cond_scale=4, residual cond_scale=5.
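As a reminder of what the guidance scales do, here is a minimal sketch of the standard classifier-free guidance combination on toy logits. `cfg_logits` is a hypothetical name, and the formula is the common formulation, assumed rather than copied from the vendored transformers.

```python
def cfg_logits(cond, uncond, cond_scale):
    """Classifier-free guidance: extrapolate from uncond toward cond."""
    return [u + cond_scale * (c - u) for c, u in zip(cond, uncond)]


cond = [2.0, 0.5, -1.0]    # toy logits with the text prompt
uncond = [1.0, 0.5, 0.0]   # toy logits with the prompt dropped

base = cfg_logits(cond, uncond, cond_scale=4)      # base-layer setting
residual = cfg_logits(cond, uncond, cond_scale=5)  # residual-layer setting
print(base)      # [5.0, 0.5, -4.0]
print(residual)  # [6.0, 0.5, -5.0]
```

With cond_scale=1 the result equals the conditional logits; larger values push the prediction further in the direction the text prompt favours.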