| --- |
| license: cc0-1.0 |
| tags: |
| - text-to-motion |
| - skeletal-animation |
| - qtmesheditor |
| - experimental |
| --- |
| |
| # QtMeshEditor Text-to-Motion (experimental, #411) |
|
|
| A small, **experimental** from-scratch text-to-motion model for
| [QtMeshEditor](https://github.com/fernandotonon/QtMeshEditor). Given a text
| prompt (an action keyword), it generates a 60-frame (2 s at 30 fps), 22-joint
| **canonical WORLD-frame** skeletal clip that QtMeshEditor retargets onto an
| arbitrary humanoid rig.
|
|
| > The model QtMeshEditor actually downloads at runtime lives in the shared |
| > [`fernandotonon/QtMeshEditor-models`](https://huggingface.co/fernandotonon/QtMeshEditor-models) |
| > repo under `motion/`. This repo is the standalone model card + mirror. |
|
|
| ## Status: experimental |
|
|
| The shipped **default** in QtMeshEditor is the deterministic **template-clip
| retarget**: a curated library of 47 real CMU mocap clips across 15 actions,
| with per-action variety. That path is the quality bar. This model is opt-in
| (`--model` / GUI checkbox / MCP `model:true`) and **falls back to the
| template** automatically when the model is unavailable or the prompt is out
| of vocabulary. It produces coherent, upright motion that varies from one
| generation to the next, but it is stylistically gentler and less crisp than
| the real-mocap templates.
|
|
| ## Training data — permissive only |
|
|
| Trained from scratch on **clean, dynamic, single-action windows** mined from
| the **CMU MoCap** database (commercial use permitted). AMASS / HumanML3D /
| KIT-ML were **excluded** (non-commercial). Each window is 2 s at 30 fps
| (60 frames), selected for motion energy, and snapped to a calm, near-neutral
| start frame. The windows are also mirror-augmented.
|
|
| ## Architecture (v4) |
|
|
| - **6D-rotation** representation (Zhou et al. 2019), correctly column-packed. |
| - Cross-attention transformer decoder with an **absolute** per-frame pose head |
| (self-attention models temporal coherence; no error-accumulating cumsum). |
| - CVAE latent with z=0 supervision + aggregate-posterior matching. |
| - Per-sample velocity/acceleration matching in both 6D and true rotation |
| (geodesic) space; derived-local supervision (the quantity the retarget |
| renders); 1-2-1 output smoothing baked into the ONNX graph. |
| - ~7.6M params, exports to ONNX (one forward pass). |
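
Two of the bullets above can be illustrated in NumPy. This is a minimal sketch, not the training code: `rot6d_to_matrix` follows the Gram-Schmidt construction from Zhou et al. 2019 and assumes the six values are the first two *columns* of the rotation matrix, stored one after the other. `smooth_121` applies a 1-2-1 kernel along the time axis and assumes the edge frames are replicated. Both function names, the packing order, and the edge handling are assumptions made for illustration.

```python
import numpy as np


def rot6d_to_matrix(d6):
    """Map 6D rotation vectors (..., 6) to rotation matrices (..., 3, 3).

    Assumption: d6 = [a1, a2], where a1 and a2 are the first two columns
    of the rotation matrix (column-packed).
    """
    d6 = np.asarray(d6, dtype=np.float64)
    a1, a2 = d6[..., 0:3], d6[..., 3:6]
    # Normalise the first column.
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    # Remove the b1 component from a2, then normalise (Gram-Schmidt).
    a2_perp = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2_perp / np.linalg.norm(a2_perp, axis=-1, keepdims=True)
    # The third column completes a right-handed orthonormal basis.
    b3 = np.cross(b1, b2)
    # Stacking on the last axis makes b1, b2, b3 the matrix columns.
    return np.stack([b1, b2, b3], axis=-1)


def smooth_121(x):
    """Apply a [0.25, 0.5, 0.25] filter along axis 0 (time).

    Assumption: the first and last frames are replicated as padding,
    so the output has the same number of frames as the input.
    """
    x = np.asarray(x, dtype=np.float64)
    padded = np.concatenate([x[:1], x, x[-1:]], axis=0)
    return 0.25 * padded[:-2] + 0.5 * padded[1:-1] + 0.25 * padded[2:]
```

Because the second column is re-orthogonalised against the first, any non-degenerate 6D vector maps to a valid rotation, which is what makes the representation convenient as a regression target.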
|
|
| ## I/O contract |
|
|
| ``` |
| input "tokens" float32 [1, V] one-hot over the fixed action vocab (see t2m-vocab.json) |
| input "seed" float32 [1, Z] latent noise (host samples ~N(0,0.5) and does best-of-N) |
| output "motion" float32 [1, T, C] C = 22*10 per-joint [tx,ty,tz, qx,qy,qz,qw, sx,sy,sz] |
| ``` |
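
The contract above can be exercised without the model itself. The following is a hedged NumPy sketch of how a host might build the two inputs and split the output. The helper names `make_inputs` and `split_motion` are hypothetical. It also assumes that the `0.5` in `N(0,0.5)` is a standard deviation and that the `C` axis is laid out joint-major (22 consecutive blocks of 10 values); neither is stated explicitly by the contract.

```python
import numpy as np

J, CH = 22, 10  # joints, and values per joint: 3 translation + 4 quat + 3 scale


def make_inputs(vocab, action, z_dim, rng, sigma=0.5):
    """Build the "tokens" [1, V] and "seed" [1, Z] float32 inputs.

    Assumption: sigma is the standard deviation of the latent noise.
    Raises ValueError if the action is not in the vocabulary.
    """
    tokens = np.zeros((1, len(vocab)), dtype=np.float32)
    tokens[0, vocab.index(action)] = 1.0
    seed = (sigma * rng.standard_normal((1, z_dim))).astype(np.float32)
    return tokens, seed


def split_motion(motion):
    """Split "motion" [1, T, C] into per-joint translation, quaternion, scale.

    Assumption: C is joint-major, each joint holding
    [tx, ty, tz, qx, qy, qz, qw, sx, sy, sz].
    """
    if motion.ndim != 3 or motion.shape[0] != 1 or motion.shape[2] != J * CH:
        raise ValueError(f"expected [1, T, {J * CH}], got {motion.shape}")
    per_joint = motion.reshape(motion.shape[1], J, CH)
    translation = per_joint[..., 0:3]  # [T, J, 3]
    quaternion = per_joint[..., 3:7]   # [T, J, 4], ordered x, y, z, w
    scale = per_joint[..., 7:10]       # [T, J, 3]
    return translation, quaternion, scale
```

With ONNX Runtime's Python API, the two arrays would be passed along the lines of `session.run(["motion"], {"tokens": tokens, "seed": seed})`. The application itself does not run Python, so this only illustrates the tensor shapes.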
|
|
| `t2m-vocab.json` ships the `{vocab, Z, T, C, J, joints, fps, frame}` metadata
| the host needs. `frame: "world"` marks the WORLD-frame convention (the
| retarget takes a world delta), and `fps: 30` gives the frame rate.
| Vocabulary: walk, run, jump, dance, march, kick, punch, wave, climb, sit,
| throw, boxing, idle.
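
As a rough illustration of how a host could consume this metadata, the sketch below parses the JSON, checks it against the documented shapes, and decides between the model and the template fallback. It assumes `vocab` is a list of strings and `joints` a list of `J` names. The functions `load_meta` and `choose_backend` and the sample values in `example_meta_text` (including `Z` and the joint names) are hypothetical placeholders, not the shipped file.

```python
import json

REQUIRED_KEYS = ("vocab", "Z", "T", "C", "J", "joints", "fps", "frame")


def load_meta(text):
    """Parse t2m-vocab.json content and check it against the I/O contract."""
    meta = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in meta]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if meta["C"] != meta["J"] * 10:
        raise ValueError("C must equal J * 10 (10 values per joint)")
    if len(meta["joints"]) != meta["J"]:
        raise ValueError("joints must list exactly J names")
    if meta["frame"] != "world":
        raise ValueError("this sketch only handles the WORLD-frame convention")
    return meta


def choose_backend(meta, action, model_available=True):
    """Return "model" for in-vocabulary actions, otherwise "template"."""
    if model_available and action in meta["vocab"]:
        return "model"
    return "template"


def example_meta_text():
    """Build a stand-in for the real file; Z and joint names are made up."""
    return json.dumps({
        "vocab": ["walk", "run", "jump", "dance", "march", "kick", "punch",
                  "wave", "climb", "sit", "throw", "boxing", "idle"],
        "Z": 32,  # hypothetical latent size
        "T": 60,
        "C": 220,
        "J": 22,
        "joints": [f"joint{i}" for i in range(22)],  # placeholder names
        "fps": 30,
        "frame": "world",
    })
```

For example, `choose_backend(load_meta(example_meta_text()), "swim")` returns `"template"`, mirroring the automatic fallback for prompts outside the model vocabulary.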
|
|
| ## Reproducing |
|
|
| The data-prep and training scripts are `scripts/prep-t2m-v4.py` and
| `scripts/train-t2m-onnx-v4.py` in the QtMeshEditor repo. They are one-time,
| offline dev tools; the app never runs Python.
|
|