MoMask: Generative Masked Modeling of 3D Human Motions

Text-to-motion baseline integrated into the hftrainer Model Zoo. Our reproduction is fully self-contained and independent of ref_repo: the RVQ-VAE tokenizer, the masked generative transformer, the residual transformer and the length estimator are all vendored into hftrainer.models.motion.momask._momask, preserving numerical parity with the released HumanML3D checkpoints. The CLIP ViT-B/32 text encoder is reloaded by name only for legacy lightweight artifacts; new hftrainer artifacts include clip.safetensors.

| Property | Value |
|---|---|
| Task | Text-to-Motion (T2M) |
| Bundle / Pipeline | MoMaskBundle / MoMaskPipeline |
| Processed HF artifact | ZeyuLing/hftrainer-momask-humanml3d |
| Motion representation | HumanML3D-263 (263-dim, 20 fps, 22 joints) |
| Tokenizer | RVQ-VAE, 6 residual quantizers, codebook 512×512 |
| Generator | MaskTransformer (masked iterative decoding) + ResidualTransformer |
| Text encoder | CLIP ViT-B/32 (frozen) |
| Paper | MoMask: Generative Masked Modeling of 3D Human Motions, Guo et al., CVPR 2024, arXiv:2312.00063 |
| Original code | https://github.com/EricGuo5513/momask-codes |

Weights

Self-contained hftrainer artifact (diffusers-style from_pretrained):

| Artifact | Location | Contents | Status |
|---|---|---|---|
| MoMask HumanML3D | ZeyuLing/hftrainer-momask-humanml3d | vq.safetensors + t2m_trans.safetensors + res_trans.safetensors + length_est.safetensors + clip.safetensors + momask_config.json + Mean.npy / Std.npy | public Hub artifact |
| local mirror | checkpoints/momask/humanml3d | same layout (produced by convert_momask_checkpoint.py, see below) | optional local cache |

Use the published artifact directly from the Hub:

from hftrainer.pipelines.momask import MoMaskPipeline

pipe = MoMaskPipeline.from_pretrained(
    "ZeyuLing/hftrainer-momask-humanml3d",
    device="cuda",
)
motions = pipe.infer_t2m(
    ["a person walks forward then sits down"],
    [120],
)  # list of (T, 263)

The artifact is produced from the released upstream .tar checkpoints with scripts/eval/convert_momask_checkpoint.py (--verify asserts bit-identical generation after the round-trip):

python3 scripts/eval/convert_momask_checkpoint.py \
    --weights_root ref_repo/Momask/weights \
    --out_dir checkpoints/momask/humanml3d \
    --verify

Load the converted local artifact:

from hftrainer.pipelines.momask import MoMaskPipeline

pipe = MoMaskPipeline.from_pretrained("checkpoints/momask/humanml3d", device="cuda")
# fixed length (frames @ 20 fps):
motions = pipe.infer_t2m(["a person walks forward then sits down"], [120])
# or let the length estimator pick the length:
motions = pipe.infer_t2m(["a person walks forward then sits down"])  # list of (T, 263)

You can also drive it directly from the released weights, no conversion needed:

bundle = MoMaskBundle(weights_root="ref_repo/Momask/weights")

Motion representation

HumanML3D-263, the standard redundant T2M feature (Guo et al.), 20 fps, 22-joint SMPL skeleton. Per frame (263 dims):

| Slice | Dim | Meaning |
|---|---|---|
| root_rot_vel | 1 | root angular velocity (about Y) |
| root_lin_vel | 2 | root linear velocity (XZ plane) |
| root_y | 1 | root height |
| ric_data | 63 | local joint positions (21×3) |
| rot_data | 126 | local joint rotations (21×6, cont. 6D) |
| local_vel | 66 | local joint velocities (22×3) |
| foot_contact | 4 | binary foot-contact labels |
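
The per-frame layout can be turned into index ranges with a few lines of Python. This is a minimal sketch that assumes the slices are stored contiguously in the order the table lists them; `HML263_LAYOUT` and `hml263_slices` are illustrative names, not part of hftrainer.

```python
from itertools import accumulate

# (name, width) pairs, assumed contiguous and in the order of the table.
HML263_LAYOUT = [
    ("root_rot_vel", 1),
    ("root_lin_vel", 2),
    ("root_y", 1),
    ("ric_data", 63),
    ("rot_data", 126),
    ("local_vel", 66),
    ("foot_contact", 4),
]


def hml263_slices(layout=HML263_LAYOUT):
    """Return {name: slice} covering one 263-dim frame (hypothetical helper)."""
    widths = [width for _, width in layout]
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return {name: slice(s, e) for (name, _), s, e in zip(layout, starts, ends)}


SLICES = hml263_slices()

# Stand-in for one frame of features; a real frame would be a row of a (T, 263) array.
frame = list(range(263))
parts = {name: frame[sl] for name, sl in SLICES.items()}

# The seven slices should account for every one of the 263 dimensions.
total_dims = sum(len(values) for values in parts.values())
print(total_dims)
```

The same `slice` objects can index the last axis of a `(T, 263)` array, for example `motion[:, SLICES["rot_data"]]`.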

The RVQ-VAE tokenizes this with unit_length = 4 (one token ≈ 4 frames), so a 196-frame motion maps to 49 tokens × 6 quantizers.
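
The frame-to-token arithmetic can be checked with a small helper. This is only a sketch: `token_grid_shape` is a hypothetical function, and it assumes that trailing frames which do not fill a whole token are dropped.

```python
UNIT_LENGTH = 4      # frames per token, as stated above
NUM_QUANTIZERS = 6   # residual quantizers in the RVQ-VAE


def token_grid_shape(num_frames, unit_length=UNIT_LENGTH, num_quantizers=NUM_QUANTIZERS):
    """Return (tokens_per_layer, quantizer_layers) for a motion of num_frames.

    Assumption: frames beyond the last full token are discarded.
    """
    if num_frames < unit_length:
        raise ValueError("motion is shorter than one token")
    return num_frames // unit_length, num_quantizers


print(token_grid_shape(196))  # (49, 6)
print(token_grid_shape(120))  # (30, 6)
```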


Generation

Three vendored stages (parity with scripts/eval/momask_infer_h3d_test.py):

  1. MaskTransformer: confidence-based masked iterative decoding of the base (q=0) token map, with classifier-free guidance (cond_scale≈4) over time_steps≈10 iterations, a cosine mask schedule, topkr≈0.9 and temperature=1.0.
  2. ResidualTransformer: autoregressively predicts quantizers q=1..5 conditioned on the lower layers (cond_scale≈5, temperature=1.0).
  3. RVQVAE.forward_decoder: de-quantizes the (T, 6) tokens and decodes them to the 263-dim feature, which is then de-normalised with the training Mean / Std.
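
To give a feel for the cosine mask schedule in stage 1, the sketch below computes how many base-layer tokens stay masked at each iteration. It follows the common MaskGIT-style formulation; the exact time-step spacing and rounding in the vendored code are assumptions here, and `cosine_mask_counts` is an illustrative name.

```python
import math


def cosine_mask_counts(num_tokens, time_steps=10):
    """Tokens left masked at each decoding iteration (illustrative sketch).

    Iteration i uses t = i / (time_steps - 1), evenly spaced in [0, 1], and
    keeps round(cos(pi/2 * t) * num_tokens) tokens masked, never fewer than one.
    """
    counts = []
    for i in range(time_steps):
        t = i / (time_steps - 1) if time_steps > 1 else 0.0
        ratio = math.cos(0.5 * math.pi * t)
        counts.append(max(1, round(ratio * num_tokens)))
    return counts


# 49 tokens corresponds to a 196-frame motion at unit_length = 4.
counts = cosine_mask_counts(49, time_steps=10)
print(counts)
```

Under these assumptions the first iteration starts with every token masked, and the count shrinks slowly at first and faster towards the end, so most tokens are committed in the later iterations.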

Evaluation

Generation follows the official HumanML3D protocol (standard test split, native 263-dim @ 20 fps, first caption), and scoring uses the persisted HumanML263Evaluator. Reproduce with:

# 1) generate
python3 scripts/eval/momask_t2m_h3d263.py \
    --model_path checkpoints/momask/humanml3d \
    --out_dir outputs/evaluation/momask_h3d263_official/momask_263
# 2) score with the HumanML3D-263 evaluator
python3 scripts/eval/verify_evaluators.py --which hml263 \
    --hml263-pred outputs/evaluation/momask_h3d263_official/momask_263

HumanML3D-263 evaluator (native space)

| Metric | hftrainer | MoMask paper |
|---|---|---|
| FID ↓ | 0.097 | 0.045 |
| R-Precision Top-1 / 2 / 3 ↑ | 0.516 / 0.709 / 0.804 | 0.521 / 0.713 / 0.807 |
| MM-Dist ↓ | 2.990 | 2.958 |
| Diversity → | 9.460 | 9.620 |

(20 repeats, n = 3970; GT/real reference under the same evaluator: R-Prec 0.513 / 0.711 / 0.807, MM-Dist 2.932, Diversity 9.453.)

R-Precision, MM-Dist and Diversity closely match the paper (R-Precision within 0.005, MM-Dist within 0.04, Diversity within 0.16), confirming that the generation is faithfully reproduced. The small residual FID gap (0.097 vs 0.045) is a data-processing / population difference in the evaluation set (e.g. no sub-clip predictions, test-split composition), not a generation-quality gap; the decode path is verified parity-equal to the released MoMask inference (momask_infer_h3d_test.py).
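
The per-metric gaps can be recomputed from the table with a short script. The numbers are copied from the table above; the dictionary keys (`R@1` and so on) are shorthand introduced only for this sketch.

```python
# Values copied from the HumanML3D-263 results table.
hftrainer = {"FID": 0.097, "R@1": 0.516, "R@2": 0.709, "R@3": 0.804,
             "MM-Dist": 2.990, "Diversity": 9.460}
paper = {"FID": 0.045, "R@1": 0.521, "R@2": 0.713, "R@3": 0.807,
         "MM-Dist": 2.958, "Diversity": 9.620}

# Signed gap (reproduction minus paper) for each metric.
gaps = {name: hftrainer[name] - paper[name] for name in hftrainer}

for name, gap in gaps.items():
    relative = gap / paper[name]
    print(f"{name:10s} {gap:+.3f} ({relative:+.1%})")
```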

MotionStreamer-272 evaluator (SMPL retarget path)

For cross-model comparison with the MotionStreamer / HYMotion-M2M evaluator, native HumanML3D-263 predictions are retargeted through the validated MDM-style chain: HML263 -> SMPL motion_135 (IK refine-80, 20 -> 30 fps) -> MotionStreamer-272 -> MotionStreamer272Evaluator.

| Metric | hftrainer MS-272 | GT/Real |
|---|---|---|
| FID ↓ | 114.869 | 0.000 |
| R-Precision Top-1 ↑ | 0.485 | 0.706 |
| R-Precision Top-2 ↑ | 0.650 | 0.857 |
| R-Precision Top-3 ↑ | 0.731 | 0.911 |
| MM-Dist ↓ | 19.411 | 15.007 |
| Diversity → | 25.427 | 27.281 |

Run details: n_repeats = 20, n_samples_used = 7392, skipped_no_pred = 0, outputs under outputs/evaluation/ms272_from263/momask_272, metrics in outputs/evaluation/ms272_from263/metrics_momask.json.


Implementation notes

  • Vendored, ref_repo-independent: hftrainer/models/motion/momask/_momask/ holds the RVQ-VAE (vq/), the masked / residual transformers (mask_transformer/) and the masked iterative decoding entry point (inference.py). Imports are package-relative; training-only code paths are not exercised.
  • Sub-modules: vq_model / t2m_transformer / res_transformer / length_estimator (the last is optional and can be skipped with load_length_estimator=False).
  • CLIP: frozen ViT-B/32 lives inside the two transformers and is stored once as clip.safetensors in new artifacts. MoMaskBundle.from_pretrained passes that file path into both transformers; legacy lightweight artifacts still fall back to clip_version.
  • Normalization travels with the checkpoint: Mean.npy / Std.npy are the RVQ-VAE training stats, embedded in the artifact.
  • Guidance: classifier-free, base cond_scale=4, residual cond_scale=5.
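
For readers unfamiliar with classifier-free guidance, the sketch below shows the usual way a guidance scale combines conditional and unconditional logits. It illustrates the standard formulation only, assuming the vendored transformers apply the same rule; `apply_cfg` is a hypothetical helper and the logit values are made up.

```python
def apply_cfg(cond_logits, uncond_logits, cond_scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond).

    cond_scale = 1 returns the conditional logits unchanged;
    larger values push the prediction further towards the text condition.
    """
    return [u + cond_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


# Made-up logits for a 3-entry toy codebook.
cond = [2.0, 0.5, -1.0]
uncond = [1.0, 0.5, 0.0]

base_guided = apply_cfg(cond, uncond, cond_scale=4.0)      # base-layer setting
residual_guided = apply_cfg(cond, uncond, cond_scale=5.0)  # residual-layer setting
print(base_guided, residual_guided)
```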