MLD - Motion Latent Diffusion

Text-to-motion baseline integrated into the hftrainer Model Zoo. The reproduction keeps the MLD motion VAE, latent denoiser, DDIM scheduler wiring, and SentenceT5 text wrapper in the native hftrainer runtime, so inference no longer imports the upstream repository.


Task	Text-to-Motion (T2M)
Bundle / Pipeline	`MLDBundle` / `MLDPipeline`
Motion representation	HumanML3D-263 (263-dim, 20 fps, 22 joints)
Backbone	MLD VAE + latent diffusion denoiser, default 50 DDIM steps
Text encoder	SentenceT5-Large (`sentence-transformers/sentence-t5-large`, frozen)
Paper	Executing Your Commands via Motion Diffusion in Latent Space, Chen et al., CVPR 2023
Original code	https://github.com/ChenFengYe/motion-latent-diffusion

Weights

Current hftrainer artifact:

Artifact	Location	Contents	Status
MLD HumanML3D	`ZeyuLing/hftrainer-mld-humanml3d` / `checkpoints/mld/humanml3d`	`vae.safetensors` + `denoiser.safetensors` + `mld_config.json` + `Mean.npy` / `Std.npy`	hftrainer artifact

Load through the same `from_pretrained` interface as the other reproduced baselines:

from hftrainer.pipelines.mld import MLDPipeline

pipe = MLDPipeline.from_pretrained(
    "ZeyuLing/hftrainer-mld-humanml3d",
    device="cuda",
)
motions = pipe.infer_t2m(
    ["a person walks forward then sits down"],
    [120],
    num_inference_steps=50,
)

Package the artifact from the upstream Lightning checkpoint:

python3 scripts/eval/convert_mld_checkpoint.py \
    --model_ckpt ref_repo/MotionLCM/experiments_t2m/mld_humanml/mld_humanml_v1.ckpt \
    --out_dir checkpoints/mld/humanml3d

The frozen SentenceT5-Large encoder is resolved by name rather than duplicated inside the artifact. For fully offline use, snapshot the text encoder into the local Hugging Face cache before calling from_pretrained.
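A minimal sketch of the offline preparation step, assuming the SentenceT5 wrapper resolves the encoder through the standard Hugging Face cache. `snapshot_download` and the `HF_HUB_OFFLINE` environment variable come from `huggingface_hub`; the helper names `snapshot_text_encoder` and `enable_offline_mode` are hypothetical and not part of hftrainer.

```python
import os

# Text-encoder repo id as named in this card.
TEXT_ENCODER_ID = "sentence-transformers/sentence-t5-large"


def snapshot_text_encoder(repo_id: str = TEXT_ENCODER_ID) -> str:
    """Download the encoder into the local Hugging Face cache and return its path."""
    # Imported lazily so this module loads even where huggingface_hub is absent.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)


def enable_offline_mode() -> None:
    """Ask Hugging Face libraries to resolve models from the local cache only."""
    # Set this before huggingface_hub / transformers are imported in the process.
    os.environ["HF_HUB_OFFLINE"] = "1"
```

Call `snapshot_text_encoder()` once on a machine with network access, then call `enable_offline_mode()` at the top of the offline inference script, before importing the pipeline.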

Motion Representation

HumanML3D-263, the standard redundant T2M feature (Guo et al.), 20 fps, 22-joint SMPL skeleton. Per frame (263 dims):

Slice	Dim	Meaning
`root_rot_vel`	1	root angular velocity (about Y)
`root_lin_vel`	2	root linear velocity (XZ plane)
`root_y`	1	root height
`ric_data`	63	local joint positions (21x3)
`rot_data`	126	local joint rotations (21x6, continuous 6D)
`local_vel`	66	local joint velocities (22x3)
`foot_contact`	4	binary foot-contact labels

MLD samples in latent space and decodes directly back to HumanML3D-263. Convert to SMPL or MotionStreamer-272 only when a cross-representation evaluator needs that space.
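The per-frame layout in the table can be turned into index ranges as follows. This is a sketch that assumes the slices are stored contiguously in the order listed; `build_slices` and `split_hml263` are illustrative helpers, not hftrainer functions.

```python
# (name, width) pairs in the order of the table; contiguous layout is an assumption.
HML263_FIELDS = [
    ("root_rot_vel", 1),
    ("root_lin_vel", 2),
    ("root_y", 1),
    ("ric_data", 63),
    ("rot_data", 126),
    ("local_vel", 66),
    ("foot_contact", 4),
]


def build_slices(fields):
    """Turn (name, width) pairs into {name: slice} plus the total width."""
    slices, start = {}, 0
    for name, width in fields:
        slices[name] = slice(start, start + width)
        start += width
    return slices, start


HML263_SLICES, HML263_DIM = build_slices(HML263_FIELDS)


def split_hml263(motion):
    """Split a (..., 263) NumPy array or torch tensor into named feature groups."""
    if motion.shape[-1] != HML263_DIM:
        raise ValueError(f"expected last dim {HML263_DIM}, got {motion.shape[-1]}")
    return {name: motion[..., sl] for name, sl in HML263_SLICES.items()}
```

For a motion of shape `(T, 263)`, `split_hml263(motion)["rot_data"]` has shape `(T, 126)` and can be reshaped to `(T, 21, 6)`.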

Evaluation

Generation follows the shared HumanML3D official-test protocol used by the leaderboard: the 4042 official test ids, the corrected selected captions under outputs/evaluation/t2m/humanml3d_official_test/captions/gt_motionclip_selected_20260622/, the native 263-dim representation at 20 fps, and exactly one prediction per test id.

python3 scripts/eval/mld_t2m_h3d263.py \
    --anno_file outputs/evaluation/t2m/humanml3d_official_test/captions/gt_motionclip_selected_20260622/test_hml3d_official272_gtlen_motionclip_selected_caption.json \
    --anno_data_dir . \
    --model_path checkpoints/mld/humanml3d \
    --num_inference_steps 50 \
    --out_dir outputs/evaluation/t2m/humanml3d_official_test/hml263/mld
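Because the protocol expects exactly one prediction per test id, a quick coverage check after generation can catch missing or duplicated outputs before the evaluators run. The sketch below works on plain id lists; how ids are recovered from the files in the output directory depends on the script's naming scheme, which is not specified here, and `check_coverage` is a hypothetical helper.

```python
from collections import Counter


def check_coverage(test_ids, pred_ids):
    """Compare expected test ids with the ids that actually have a prediction."""
    expected = set(test_ids)
    counts = Counter(pred_ids)
    return {
        # Test ids with no prediction at all.
        "missing": sorted(expected - set(counts)),
        # Predictions whose id is not part of the test split.
        "unexpected": sorted(set(counts) - expected),
        # Ids that were predicted more than once.
        "duplicated": sorted(i for i, n in counts.items() if n > 1),
    }
```

A clean run over the official test split would return three empty lists, with 4042 distinct ids in `pred_ids`.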

The full reproduction pipeline writes the canonical outputs:

Representation	Canonical path
HML263	`outputs/evaluation/t2m/humanml3d_official_test/hml263/mld`
SMPL motion_135	`outputs/evaluation/t2m/humanml3d_official_test/motion135/mld`
MotionStreamer-272	`outputs/evaluation/t2m/humanml3d_official_test/ms272/mld`

Run the Taiji wrapper for full generation, conversion, and evaluators:

python3 scripts/submit/submit_mld_standard_pipeline_taiji.py \
    --gpu V100 \
    --num-gpus 8 \
    --elastic

Report current metrics from the generated evaluator JSONs under outputs/evaluation/t2m/humanml3d_official_test/_runs/<run>/metrics/. For HumanML3D-263 semantic metrics, the evaluator texts_dir must match the captions used for generation. The selected-caption official-test run is scored with outputs/evaluation/t2m/humanml3d_official_test/captions/gt_motionclip_selected_20260622/texts; scoring these outputs against the older CondMDI text files produces mismatched R-Precision / MM-Dist.
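To gather the evaluator outputs for reporting, something like the following can be used. It is a sketch that assumes each evaluator writes one `*.json` file directly under the metrics directory; the JSON schema is not assumed, and `load_metric_files` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_metric_files(metrics_dir):
    """Return {file stem: parsed JSON} for every *.json directly under metrics_dir."""
    metrics_dir = Path(metrics_dir)
    return {
        path.stem: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(metrics_dir.glob("*.json"))
    }
```

Pass the concrete `_runs/<run>/metrics/` directory of the run being reported; the function returns an empty dict if no JSON files are present.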

Current HumanML3D official-test metrics (4042 generated motions, selected caption protocol):

Evaluator	R@1	R@2	R@3	FID	MM-Dist	Diversity
HumanML3D-263 (selected captions)	0.5176	0.7161	0.8159	0.2969	2.9498	9.6283
MotionStreamer-272 (HML roundtrip GT)	0.5660	0.7326	0.8095	39.7437	19.3374	24.9017
MotionCLIP-135 no-L2 (HML roundtrip GT)	0.3831	0.5380	0.6319	134.6484	42.4679	22.9470

Physical diagnostics on SMPL motion_135: Slide 4.2199, Float 16.7402, Jitter 3.2692, Dynamic 20.1758.

Implementation Notes

hftrainer-native runtime: hftrainer.models.motion.mld wraps the shared native MLD VAE / denoiser / SentenceT5 components and does not import ref_repo at inference time.
Scheduler: MLD uses diffusers.DDIMScheduler with 50 inference steps by default (eta=0.0, steps_offset=1), matching the official inference config.
Classifier-free guidance: the denoiser has no LCM time_cond_proj, so guidance uses the standard unconditional/conditional two-pass batch.
Normalization travels with the checkpoint: Mean.npy / Std.npy are embedded in the artifact to avoid evaluator drift caused by mismatched HumanML3D statistics.
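The guidance and normalization notes above can be sketched with two small functions. This assumes the usual classifier-free guidance mix of the unconditional and conditional denoiser outputs, and the usual HumanML3D convention that features are normalized as `(x - mean) / std`. Both function names are illustrative, and the guidance scale in the usage note is an arbitrary example value, not the checkpoint's configured one.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance mix of the two denoiser outputs.

    Works on floats, NumPy arrays, or torch tensors of matching shape. In a
    two-pass batch, the unconditional and conditional inputs are stacked into
    one batch and the denoiser output is split back into these two halves.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


def denormalize(features, mean, std):
    """Map normalized HumanML3D-263 features back to raw feature space.

    Assumes the forward normalization was (x - mean) / std, with mean and std
    loaded from the Mean.npy / Std.npy files shipped in the artifact.
    """
    return features * std + mean
```

With `guidance_scale=1.0` the mix reduces to the conditional prediction, and larger values push the sample further toward the text condition.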
