---
license: apache-2.0
language:
- en
- zh
tags:
- motion-generation
- vision-language
- robotics
- qwen
- dual-stream
datasets:
- MotionVLA-Dataset
---

# MotionVLA

**MotionVLA** is an end-to-end vision-language-action model for humanoid motion generation. It combines a **Qwen3.5** autoregressive backbone (conditioned on a scene image and a text instruction) with **DSFT (Dual-Stream Frequency-domain Tokenizer)**, which decouples low-frequency pose semantics from high-frequency physical dynamics.

## Repository Contents

This Hugging Face repository contains:

| Path | Description |
|------|-------------|
| `tokenizer/` | DSFT tokenizer checkpoints |
| `tokenizer/base/` | Base stream BPE tokenizer (4096 vocab, 201-dim DCT) |
| `tokenizer/phys/` | Phys stream BPE tokenizer (4096 vocab, 75-dim DCT) |
| `dataset/` | Dataset index files (motion_path → relative paths) |

**Motion data files** (`.pt`) and **images** are stored in the companion dataset repo: `[your-hf-username]/MotionVLA-Dataset`

## Tokenizer Design

The DSFT tokenizer decomposes a 276-dim ViMoGen motion sequence into two streams:

```
276-dim motion (T frames)
    ↓ split by dimension
Base (201-dim): body_pose_6d + joints + root_orient + root_trans → low-freq semantic
Phys (75-dim):  joints_vel + root_vel + root_trans_vel → high-freq dynamics
    ↓ DCT along time axis, keep top K coefficients
    ↓ BPE encoding
Base tokens: ~477/sequence (K=5, vocab=4096)
Phys tokens: ~40/sequence (K=15, vocab=4096)
```

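The split and DCT-truncation steps in the diagram can be sketched with NumPy alone. This is a minimal illustration, not the DSFT implementation. It assumes that the first 201 columns form the Base stream and the last 75 the Phys stream, and it reads "top K" as the K lowest-frequency coefficients. Quantization and BPE encoding are left out. The names `dct_matrix`, `split_and_truncate`, and `reconstruct` are hypothetical.

```python
import numpy as np

BASE_DIM, PHYS_DIM = 201, 75  # stream widths from the diagram above


def dct_matrix(num_frames: int) -> np.ndarray:
    """Return the orthonormal DCT-II matrix of shape (num_frames, num_frames)."""
    n = np.arange(num_frames)
    k = n[:, None]
    mat = np.sqrt(2.0 / num_frames) * np.cos(np.pi * (n[None, :] + 0.5) * k / num_frames)
    mat[0] /= np.sqrt(2.0)  # rescale the DC row so that mat @ mat.T == I
    return mat


def split_and_truncate(motion: np.ndarray, k_base: int = 5, k_phys: int = 15):
    """Split (T, 276) motion into two streams and keep K DCT coefficients each.

    ASSUMPTION: columns [0, 201) are Base and [201, 276) are Phys, and "top K"
    means the K lowest-frequency coefficients along the time axis.
    """
    if motion.ndim != 2 or motion.shape[1] != BASE_DIM + PHYS_DIM:
        raise ValueError("expected motion of shape (T, 276)")
    dct = dct_matrix(motion.shape[0])
    base_coef = (dct @ motion[:, :BASE_DIM])[:k_base]
    phys_coef = (dct @ motion[:, BASE_DIM:])[:k_phys]
    return base_coef, phys_coef


def reconstruct(coef: np.ndarray, num_frames: int) -> np.ndarray:
    """Invert the truncated DCT; dropped coefficients are treated as zero."""
    dct = dct_matrix(num_frames)
    return dct[: coef.shape[0]].T @ coef


# Synthetic stand-in for a real motion clip (60 frames).
motion = np.random.default_rng(0).standard_normal((60, 276))
base_coef, phys_coef = split_and_truncate(motion)
base_smooth = reconstruct(base_coef, 60)
print(base_coef.shape, phys_coef.shape, base_smooth.shape)
# (5, 201) (15, 75) (60, 201)
```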
Each motion sample is laid out as a unified autoregressive sequence:

```
[ M_BOS, b_1, ..., b_N, M_SEP, p_1, ..., p_M, M_EOS ]
```

where `b_i` are Base tokens and `p_j` are Phys tokens. A phase-aware logit mask
enforces the order `BASE → SEP → PHYS → EOS` at inference, so semantic pose
structure is generated before high-frequency physical dynamics.

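One way to picture the phase-aware mask is as a small state machine over the tokens generated so far. The sketch below is an assumption about how such a mask could work, not the model's actual implementation. It uses the ID ranges listed under Token Vocabulary, assumes the model emits only a motion sequence, and treats `VOCAB_SIZE` as ending right after `MOTION_EOS`. The function names are hypothetical.

```python
import numpy as np

BASE_START, BASE_END = 248320, 252415  # Base motion token IDs (inclusive)
PHYS_START, PHYS_END = 252416, 256511  # Phys motion token IDs (inclusive)
M_BOS, M_SEP, M_EOS = 256512, 256513, 256514
VOCAB_SIZE = M_EOS + 1  # ASSUMPTION: no IDs exist beyond MOTION_EOS


def allowed_mask(generated):
    """Boolean mask over the vocabulary for the next token.

    Phases: before BOS -> only BOS; BASE -> Base tokens or SEP;
    PHYS -> Phys tokens or EOS; after EOS -> nothing is allowed.
    """
    mask = np.zeros(VOCAB_SIZE, dtype=bool)
    if M_BOS not in generated:
        mask[M_BOS] = True
    elif M_EOS in generated:
        pass  # sequence is complete
    elif M_SEP not in generated:
        mask[BASE_START:BASE_END + 1] = True
        mask[M_SEP] = True
    else:
        mask[PHYS_START:PHYS_END + 1] = True
        mask[M_EOS] = True
    return mask


def apply_phase_mask(logits, generated):
    """Set the logits of disallowed tokens to -inf."""
    return np.where(allowed_mask(generated), logits, -np.inf)


logits = np.zeros(VOCAB_SIZE)
in_base = apply_phase_mask(logits, [M_BOS])
in_phys = apply_phase_mask(logits, [M_BOS, BASE_START, M_SEP])
print(int(np.isfinite(in_base).sum()), int(np.isfinite(in_phys).sum()))
# 4097 4097
```

In each phase exactly 4097 IDs remain selectable: the 4096 tokens of the active stream plus the single delimiter that ends the phase.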
## Token Vocabulary

The Qwen3.5 backbone vocabulary is extended with motion tokens (used in the
ms-swift training pipeline):

| Token type | ID range | Count |
|------------|----------|-------|
| Base motion tokens | 248320–252415 | 4096 |
| Phys motion tokens | 252416–256511 | 4096 |
| MOTION_BOS | 256512 | 1 |
| MOTION_SEP | 256513 | 1 |
| MOTION_EOS | 256514 | 1 |

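To make the table concrete, the sketch below maps per-stream BPE IDs into the extended vocabulary and assembles the unified sequence. It assumes that a local ID `i` in `[0, 4096)` maps to `range start + i`; the training pipeline may use a different mapping. The helpers `to_backbone_sequence` and `from_backbone_sequence` are hypothetical.

```python
BASE_OFFSET = 248320  # first Base motion token ID
PHYS_OFFSET = 252416  # first Phys motion token ID
MOTION_BOS, MOTION_SEP, MOTION_EOS = 256512, 256513, 256514
STREAM_VOCAB = 4096   # BPE vocabulary size of each stream


def to_backbone_sequence(base_tokens, phys_tokens):
    """Build [M_BOS, b_1..b_N, M_SEP, p_1..p_M, M_EOS] in backbone IDs.

    ASSUMPTION: local ID i maps to offset + i within each stream's range.
    """
    for tok in list(base_tokens) + list(phys_tokens):
        if not 0 <= tok < STREAM_VOCAB:
            raise ValueError(f"local token ID out of range: {tok}")
    return (
        [MOTION_BOS]
        + [BASE_OFFSET + tok for tok in base_tokens]
        + [MOTION_SEP]
        + [PHYS_OFFSET + tok for tok in phys_tokens]
        + [MOTION_EOS]
    )


def from_backbone_sequence(sequence):
    """Invert to_backbone_sequence, returning (base_tokens, phys_tokens)."""
    sep = sequence.index(MOTION_SEP)
    base = [tok - BASE_OFFSET for tok in sequence[1:sep]]
    phys = [tok - PHYS_OFFSET for tok in sequence[sep + 1:-1]]
    return base, phys


sequence = to_backbone_sequence([0, 4095], [7])
print(sequence)
# [256512, 248320, 252415, 256513, 252423, 256514]
```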
## Usage

```python
from tokenizer.ds_fast_tokenizer import DSFTTokenizer
import numpy as np

# Load the tokenizer (point this at the directory holding the DSFT checkpoints)
tok = DSFTTokenizer.load("tokenizer/checkpoints")

# Encode 276-dim motion
motion = np.load("motion.npy")  # shape: (T, 276)
result = tok.encode(motion)
# result["base_tokens"]: list of int (BPE IDs for base stream)
# result["phys_tokens"]: list of int (BPE IDs for phys stream)
# result["T"]: number of frames

# Decode back to the two streams
base_recon, phys_recon = tok.decode(
    result["base_tokens"], result["phys_tokens"], result["T"]
)
# base_recon: (T, 201), phys_recon: (T, 75)
```

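Because `decode` returns the two streams separately, a small helper can reassemble them and measure reconstruction error. The sketch below assumes the Base columns come first in the 276-dim layout, which this card does not state. `merge_streams` and `rmse` are hypothetical helpers, and random arrays stand in for real tokenizer output.

```python
import numpy as np


def merge_streams(base_recon: np.ndarray, phys_recon: np.ndarray) -> np.ndarray:
    """Concatenate (T, 201) and (T, 75) streams into a (T, 276) array.

    ASSUMPTION: Base columns precede Phys columns in the original layout.
    """
    if base_recon.shape[0] != phys_recon.shape[0]:
        raise ValueError("streams must cover the same number of frames")
    return np.concatenate([base_recon, phys_recon], axis=1)


def rmse(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """Root-mean-square error between two arrays of the same shape."""
    return float(np.sqrt(np.mean((original - reconstructed) ** 2)))


# Stand-in arrays; in practice these come from tok.decode(...).
rng = np.random.default_rng(0)
base_demo = rng.standard_normal((60, 201))
phys_demo = rng.standard_normal((60, 75))
merged = merge_streams(base_demo, phys_demo)
print(merged.shape, rmse(merged, merged))
# (60, 276) 0.0
```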
## Code

Training code and model architecture: [GitHub](https://github.com/AIGeeksGroup/MotionVLA)

## Citation

```bibtex
@article{motionvla2026,
  title={MotionVLA: Vision-Language-Action Model for Humanoid Motion},
  author={Zhang, Nonghai and Zhai, Siyu and Zhang, Zeyu and Tang, Hao},
  year={2026}
}
```