MDM — Human Motion Diffusion Model

Text-to-motion baseline integrated into the hftrainer Model Zoo. Our reproduction is fully self-contained and independent of ref_repo: the network, the Gaussian-diffusion schedule, the classifier-free-guidance sampler and the collate function are all vendored into `hftrainer.models.motion.mdm._mdm`. Its outputs are verified to be bit-identical to those of the released checkpoint (max-abs-diff = 0.0 for the same seed and input).


Task	Text-to-Motion (T2M)
Bundle / Pipeline	`MDMBundle` / `MDMPipeline`
Processed HF artifact	`ZeyuLing/hftrainer-mdm-humanml3d`
Motion representation	HumanML3D-263 (263-dim, 20 fps, 22 joints)
Text encoder	CLIP ViT-B/32 (frozen)
Paper	Human Motion Diffusion Model, Tevet et al., ICLR 2023 — arXiv:2209.14916
Original code	https://github.com/GuyTevet/motion-diffusion-model

Weights

Current hftrainer artifact (diffusers-style from_pretrained):

Artifact	Location	Contents	Status
MDM HumanML3D	`ZeyuLing/hftrainer-mdm-humanml3d`	`model.safetensors` + `mdm_config.json` + `Mean.npy` / `Std.npy`	public Hub artifact; complete CLIP packaging pending
local mirror	`checkpoints/mdm/humanml_trans_enc_512`	same layout	optional local cache

Use directly from the Hub:

```python
from hftrainer.pipelines.mdm import MDMPipeline

pipe = MDMPipeline.from_pretrained("ZeyuLing/hftrainer-mdm-humanml3d", device="cuda")
# One prompt, one target length in frames (120 frames = 6 s at 20 fps).
motions = pipe.infer_t2m(["a person walks forward then sits down"], [120])  # list of (T, 263)
```
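The second argument to `infer_t2m` is given in frames, one entry per prompt, so 120 corresponds to 6 s at the 20 fps listed above. A small helper such as the one below can convert durations in seconds. This is only a sketch: `seconds_to_frames` is a hypothetical name, and the 196-frame cap is an assumption based on the usual maximum clip length in HumanML3D preprocessing, not a limit documented for this pipeline.

```python
FPS = 20           # HumanML3D-263 frame rate, as listed in the summary table
MAX_FRAMES = 196   # assumption: usual HumanML3D maximum clip length


def seconds_to_frames(seconds: float, fps: int = FPS, max_frames: int = MAX_FRAMES) -> int:
    """Convert a duration in seconds to a frame count, clamped to [1, max_frames]."""
    frames = round(seconds * fps)
    return max(1, min(frames, max_frames))


prompts = ["a person walks forward then sits down", "a person jumps twice"]
lengths = [seconds_to_frames(s) for s in (6.0, 3.5)]  # [120, 70]

# With the pipeline loaded as shown above, the call would then be:
# motions = pipe.infer_t2m(prompts, lengths)
```

Keeping the conversion in one place avoids mixing seconds and frames when several prompts with different durations are batched together.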

Or download to disk first:

```bash
huggingface-cli download ZeyuLing/hftrainer-mdm-humanml3d \
    --local-dir checkpoints/mdm/humanml_trans_enc_512
```

The artifact is produced from a raw upstream .pt with scripts/eval/convert_mdm_checkpoint.py (--verify asserts bit-identical generation after the round-trip).
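A bit-identical check of this kind reduces to requiring a maximum absolute difference of exactly 0.0 between two outputs generated from the same seed and input. The sketch below illustrates the idea on plain nested lists. `max_abs_diff` is a hypothetical helper written for illustration; it is not the code in `convert_mdm_checkpoint.py`, whose internals are not shown here.

```python
def max_abs_diff(a, b) -> float:
    """Largest |a - b| over two equally shaped nested sequences of numbers."""
    if isinstance(a, (int, float)):
        return abs(a - b)
    if len(a) != len(b):
        raise ValueError("shape mismatch")
    # Recurse into each pair of sub-sequences; an empty sequence contributes 0.0.
    return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)


reference = [[0.25, -1.5], [3.0, 4.0]]   # e.g. output of the raw checkpoint
roundtrip = [[0.25, -1.5], [3.0, 4.0]]   # e.g. output after conversion and reload

# Bit-identical means the difference is exactly zero, not merely small.
assert max_abs_diff(reference, roundtrip) == 0.0
```

On PyTorch tensors the same quantity is `(a - b).abs().max().item()`.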

Complete text-encoder packaging is still pending for the current public MDM artifact: the model weights reload through MDMPipeline.from_pretrained, but CLIP ViT-B/32 is currently resolved by name rather than stored inside the repo.

Motion representation

HumanML3D-263, the standard redundant T2M feature (Guo et al.), 20 fps, 22-joint SMPL skeleton. Per frame (263 dims):

Slice	Dim	Meaning
`root_rot_vel`	1	root angular velocity (about Y)
`root_lin_vel`	2	root linear velocity (XZ plane)
`root_y`	1	root height
`ric_data`	63	local joint positions (21×3)
`rot_data`	126	local joint rotations (21×6, cont. 6D)
`local_vel`	66	local joint velocities (22×3)
`foot_contact`	4	binary foot-contact labels

Convert to/from other spaces with hftrainer.motion.representation.convert (e.g. hml263_to_joints, hml263_to_motion135, hml263_to_motion272).
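The slice table above can be turned into index ranges with a few lines of code. The sketch below assumes the slices are concatenated in the order listed, which is the usual HumanML3D layout; `hml263_slices` and `split_frame` are hypothetical helpers for illustration and are not part of `hftrainer.motion.representation.convert`.

```python
# (name, width) pairs in the order given in the table above.
HML263_LAYOUT = [
    ("root_rot_vel", 1),
    ("root_lin_vel", 2),
    ("root_y", 1),
    ("ric_data", 63),
    ("rot_data", 126),
    ("local_vel", 66),
    ("foot_contact", 4),
]


def hml263_slices():
    """Map each slice name to its index range within a 263-dim frame."""
    slices, start = {}, 0
    for name, dim in HML263_LAYOUT:
        slices[name] = slice(start, start + dim)
        start += dim
    assert start == 263, "layout widths must sum to 263"
    return slices


def split_frame(frame):
    """Split one 263-dim frame (any 1-D sequence) into named parts."""
    if len(frame) != 263:
        raise ValueError(f"expected 263 features, got {len(frame)}")
    return {name: frame[sl] for name, sl in hml263_slices().items()}


parts = split_frame(list(range(263)))
# parts["foot_contact"] holds the last four features: [259, 260, 261, 262]
```

The same slice objects can index the last axis of a `(T, 263)` array, for example `motion[:, hml263_slices()["root_y"]]` with NumPy.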

Evaluation

Motions are generated under the official HumanML3D protocol (standard test split, native 263-dim @ 20 fps, first caption) and scored with the two persisted hftrainer evaluators. Reproduce with:

```bash
# 1) generate (8-GPU sharded)
bash scripts/eval/_run_mdm_h3d263_shards.sh
# 2) score with the HumanML3D-263 evaluator
python3 scripts/eval/verify_evaluators.py --which hml263 \
    --hml263-pred outputs/evaluation/mdm_h3d263_official/mdm_263
```

HumanML3D-263 evaluator (native space, n=3970)

Metric	hftrainer	MDM paper	Note
FID ↓	0.509	0.544	✅ reproduced (within noise)
Diversity →	9.563	9.559	✅ matches
R-Precision Top-1 / 2 / 3 ↑	0.420 / 0.605 / 0.711	— / — / 0.611	our evaluator scores higher than the paper's (GT Top-3 0.816 vs. 0.797)
MM-Dist ↓	3.681	5.566	different evaluator embedding scale
GT(real) R-Prec / Div	0.518 / 0.720 / 0.816, 9.499	0.797 (T3), 9.503	✅ GT row consistent

FID and Diversity match the paper. We attribute the R-Precision and MM-Dist differences to the calibration of our persisted evaluator rather than to the model: the GT row shifts in the same direction (Top-3 0.816 vs. 0.797 in the paper).
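For readers unfamiliar with the metric, R-Precision Top-k asks how often the matching motion is among the k nearest motions to a caption in the evaluator's embedding space. The sketch below computes it from a precomputed text-to-motion distance matrix. It is a generic illustration, not the hftrainer evaluator: `r_precision` is a hypothetical name, and the batching used by the HumanML3D protocol (commonly pools of 32 candidates per caption) is left to the caller.

```python
def r_precision(dist, top_k=3):
    """Top-1..top_k retrieval accuracy.

    dist[i][j] is the distance between text i and motion j;
    the correct match for text i is motion i.
    """
    hits = [0] * top_k
    for i, row in enumerate(dist):
        # Motion indices sorted from nearest to farthest (ties broken by index).
        order = sorted(range(len(row)), key=lambda j: row[j])
        rank = order.index(i)
        # A match at rank r counts for every Top-k with k > r.
        for k in range(rank, top_k):
            hits[k] += 1
    return [h / len(dist) for h in hits]


dist = [
    [0.1, 0.9, 0.5],  # text 0: its own motion is nearest (rank 0)
    [0.2, 0.3, 0.1],  # text 1: its own motion is farthest (rank 2)
    [0.9, 0.1, 0.5],  # text 2: its own motion is second nearest (rank 1)
]
top1, top2, top3 = r_precision(dist)
```

Because the score depends on the learned embedding space, two evaluators trained separately can report different absolute values for the same motions, which is why the GT row is included as a reference.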

MotionStreamer-272 evaluator (cross-representation, n=7392)

MDM is a 263-dim model; scoring it on the MS-272 evaluator requires a 263 → 272 conversion, which shifts the distribution. These numbers are not a fair native comparison — they quantify the conversion gap, not MDM quality.

Metric	MDM→272	MS-272 GT(real)
FID ↓	121.35	0.0
R-Precision Top-1 / 2 / 3 ↑	0.379 / 0.529 / 0.610	0.706 / 0.857 / 0.911
MM-Dist ↓	20.96	15.01
Diversity →	25.48	27.36

The GT(real) row reproduces the MotionStreamer paper exactly (R@1 0.706, Div 27.36, MM 15.01), confirming the evaluator is correct; the large MDM FID reflects the 263 → 272 representation mismatch rather than model quality.

Implementation notes

Vendored, ref_repo-independent: hftrainer/models/mdm/_mdm/ holds the network (network.py), diffusion (diffusion/), CFG sampler and collate. Training-only deps are stubbed (inference-only).
Normalization travels with the checkpoint: Mean.npy / Std.npy are the HumanML3D training stats (not the evaluator stats) and are embedded in the artifact, eliminating the recurring "wrong Mean/Std → forward drift" bug.
Guidance: classifier-free, default scale 2.5.
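The last two notes correspond to two small formulas. Classifier-free guidance combines a conditional and an unconditional model output as `uncond + scale * (cond - uncond)`, as in the MDM paper, and de-normalization maps a generated frame back to feature space as `x * std + mean`. The sketch below shows both on plain lists. The function names are hypothetical and the vendored sampler may organize this differently; only the default scale of 2.5 is taken from the notes above.

```python
def cfg_combine(cond, uncond, scale=2.5):
    """Classifier-free guidance: move from the unconditional output
    toward (and past, if scale > 1) the text-conditioned output."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def denormalize(x, mean, std):
    """Undo per-feature standardization using the stored training stats."""
    return [xi * s + m for xi, m, s in zip(x, mean, std)]


# scale = 1 recovers the conditional output; scale = 0 the unconditional one.
guided = cfg_combine(cond=[1.0, 2.0], uncond=[0.0, 2.0])   # [2.5, 2.0]
frame = denormalize(guided, mean=[1.0, 1.0], std=[2.0, 2.0])  # [6.0, 5.0]
```

Using evaluator statistics instead of the training `Mean.npy` / `Std.npy` in the second step would shift every feature, which is the failure mode the embedded stats are meant to prevent.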
