	---
	license: cc0-1.0
	tags:
	- text-to-motion
	- skeletal-animation
	- qtmesheditor
	- experimental
	---

	# QtMeshEditor Text-to-Motion (experimental, #411)

	A small, experimental text-to-motion model, trained from scratch, for
	[QtMeshEditor](https://github.com/fernandotonon/QtMeshEditor). Given a text
	prompt (an action keyword), it generates a 60-frame clip at 30 fps (2 s) for
	a 22-joint canonical skeleton in WORLD frame, which QtMeshEditor retargets
	onto an arbitrary humanoid rig.

	> The model QtMeshEditor actually downloads at runtime lives in the shared
	> [`fernandotonon/QtMeshEditor-models`](https://huggingface.co/fernandotonon/QtMeshEditor-models)
	> repo under `motion/`. This repo is the standalone model card + mirror.

	## Status: experimental

	The shipped default in QtMeshEditor is the deterministic **template-clip
	retarget** (a curated library of 47 real CMU mocap clips across 15 actions,
	with per-action variety), and that is the quality bar. This model is opt-in
	(`--model` / GUI checkbox / MCP `model:true`) and **falls back to the
	template** automatically when the model is unavailable or the prompt is out
	of vocabulary. It produces coherent, upright motion with per-generate
	variety, but is stylistically gentler and less crisp than the real-mocap
	templates.
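
The fallback rule described above can be sketched as a small decision function. This is only an illustration in Python: the application itself does not run Python, and `choose_generator` and its parameters are hypothetical names, not part of QtMeshEditor.

```python
def choose_generator(prompt: str, vocab: list[str],
                     use_model: bool, model_available: bool) -> str:
    """Return "model" or "template" for a prompt (hypothetical helper).

    The model path is taken only when the user opted in, the model file
    is available, and the action keyword is in the model vocabulary.
    Every other case falls back to the template-clip retarget.
    """
    keyword = prompt.strip().lower()
    if use_model and model_available and keyword in vocab:
        return "model"
    return "template"
```

For example, `choose_generator("swim", ["walk", "run"], True, True)` returns `"template"` because `swim` is out of vocabulary.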

	## Training data — permissive only

	Trained from scratch on clean, dynamic, single-action windows mined from
	the CMU MoCap database (commercial-OK). AMASS / HumanML3D / KIT-ML were
	excluded (non-commercial). Windows are 2 s long at 30 fps (60 frames),
	selected for motion energy, snapped to a calm near-neutral start frame,
	and mirror-augmented.
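
A minimal sketch of energy-based window selection, assuming joint positions are available as a `[frames, joints, 3]` array. The energy definition (mean squared per-frame joint displacement), the stride, and the function names are assumptions for illustration; the actual mining logic lives in `scripts/prep-t2m-v4.py` and may differ. The start-frame snapping and mirror augmentation are not shown.

```python
import numpy as np

FPS = 30
WINDOW = 2 * FPS  # 2 s at 30 fps = 60 frames


def motion_energy(window: np.ndarray) -> float:
    """Mean squared per-frame joint displacement of a [T, J, 3] window."""
    velocity = np.diff(window, axis=0)  # [T-1, J, 3]
    return float(np.mean(np.sum(velocity ** 2, axis=-1)))


def best_window(positions: np.ndarray, stride: int = 15) -> tuple[int, float]:
    """Return (start_frame, energy) of the most energetic WINDOW-frame slice."""
    starts = range(0, positions.shape[0] - WINDOW + 1, stride)
    scored = [(s, motion_energy(positions[s:s + WINDOW])) for s in starts]
    return max(scored, key=lambda item: item[1])
```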

	## Architecture (v4)

	- 6D-rotation representation (Zhou et al. 2019), correctly column-packed.
	- Cross-attention transformer decoder with an absolute per-frame pose head
	(self-attention models temporal coherence; no error-accumulating cumsum).
	- CVAE latent with z=0 supervision + aggregate-posterior matching.
	- Per-sample velocity/acceleration matching in both 6D and true rotation
	(geodesic) space; derived-local supervision (the quantity the retarget
	renders); 1-2-1 output smoothing baked into the ONNX graph.
	- ~7.6M params, exports to ONNX (one forward pass).
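
The 6D representation and the 1-2-1 smoothing listed above can be sketched in NumPy. This is an illustrative sketch, not the training code: the function names are hypothetical, the 6-vector order (first column, then second column) is one reading of "column-packed", and the replicated-edge padding in `smooth_121` is an assumption.

```python
import numpy as np


def rotmat_to_6d(R: np.ndarray) -> np.ndarray:
    """Pack the first two COLUMNS of a 3x3 rotation matrix into a 6-vector."""
    return np.concatenate([R[:, 0], R[:, 1]])


def sixd_to_rotmat(d6: np.ndarray) -> np.ndarray:
    """Rebuild a rotation matrix from a 6-vector via Gram-Schmidt."""
    a1, a2 = d6[:3], d6[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2 = a2 - np.dot(b1, a2) * b1          # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)
    b3 = np.cross(b1, b2)                  # completes a right-handed basis
    return np.stack([b1, b2, b3], axis=1)  # basis vectors become columns


def smooth_121(motion: np.ndarray) -> np.ndarray:
    """Apply a [1, 2, 1] / 4 kernel along the time axis of a [T, C] array.

    The first and last frames are replicated as padding (an assumption).
    """
    padded = np.concatenate([motion[:1], motion, motion[-1:]], axis=0)
    return 0.25 * padded[:-2] + 0.5 * padded[1:-1] + 0.25 * padded[2:]
```

Packing rows instead of columns yields the 6D code of the transposed (inverse) rotation, which is the kind of mistake the "correctly column-packed" note guards against.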

	## I/O contract

	```
	input  "tokens"  float32 [1, V]     one-hot over the fixed action vocab (see t2m-vocab.json)
	input  "seed"    float32 [1, Z]     latent noise (host samples ~N(0,0.5) and does best-of-N)
	output "motion"  float32 [1, T, C]  C = 22*10, per-joint [tx,ty,tz, qx,qy,qz,qw, sx,sy,sz]
	```

	`t2m-vocab.json` ships the `{vocab, Z, T, C, J, joints, fps, frame}` values
	the host needs. `frame: "world"` marks the WORLD-frame convention (the
	retarget takes a world delta), and `fps: 30` gives the frame rate.
	Vocabulary: walk, run, jump, dance, march, kick, punch, wave, climb, sit,
	throw, boxing, idle.
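
A hedged sketch of how a host might build the two inputs and unpack the output, using NumPy only. The helper names are hypothetical, the vocabulary order shown here is assumed (the real order and the value of `Z` should be read from `t2m-vocab.json`), and treating 0.5 as a standard deviation is one reading of `~N(0,0.5)`. Best-of-N selection is not shown because the card does not state its scoring criterion.

```python
import numpy as np

# Assumed order; the authoritative list is the "vocab" field of t2m-vocab.json.
VOCAB = ["walk", "run", "jump", "dance", "march", "kick", "punch",
         "wave", "climb", "sit", "throw", "boxing", "idle"]


def make_inputs(action, vocab, z_dim, rng, sigma=0.5):
    """Build the "tokens" and "seed" inputs as float32 arrays."""
    tokens = np.zeros((1, len(vocab)), dtype=np.float32)
    tokens[0, vocab.index(action)] = 1.0  # ValueError if out of vocabulary
    seed = (sigma * rng.standard_normal((1, z_dim))).astype(np.float32)
    return {"tokens": tokens, "seed": seed}


def unpack_motion(motion, joints=22):
    """Split a [1, T, joints*10] output into translation, quaternion, scale."""
    clip = motion[0].reshape(motion.shape[1], joints, 10)
    translation = clip[..., 0:3]   # tx, ty, tz
    quaternion = clip[..., 3:7]    # qx, qy, qz, qw
    scale = clip[..., 7:10]        # sx, sy, sz
    return translation, quaternion, scale
```

The dictionary returned by `make_inputs` uses the input names from the contract, so it can be passed as the feed to whichever ONNX runtime the host uses.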

	## Reproducing

	Use `scripts/prep-t2m-v4.py` and `scripts/train-t2m-onnx-v4.py` from the
	QtMeshEditor repo. Both are one-time, offline dev tools; the app never runs
	Python.