---
license: apache-2.0
language:
- en
- zh
tags:
- motion-generation
- vision-language
- robotics
- qwen
- dual-stream
datasets:
- MotionVLA-Dataset
---
# MotionVLA
**MotionVLA** is an end-to-end vision-language-action model for humanoid motion generation. It combines a **Qwen3.5** autoregressive backbone (conditioned on a scene image and a text instruction) with **DSFT (Dual-Stream Frequency-domain Tokenizer)**, which decouples low-frequency pose semantics from high-frequency physical dynamics.
## Repository Contents
This HuggingFace repository contains:
| Path | Description |
|------|-------------|
| `tokenizer/` | DSFT tokenizer checkpoints |
| `tokenizer/base/` | Base stream BPE tokenizer (4096 vocab, 201-dim DCT) |
| `tokenizer/phys/` | Phys stream BPE tokenizer (4096 vocab, 75-dim DCT) |
| `dataset/` | Dataset index files (motion_path → relative paths) |
**Motion data files** (`.pt`) and **images** are stored in the companion dataset repo: `[your-hf-username]/MotionVLA-Dataset`
## Tokenizer Design
The DSFT tokenizer decomposes 276-dim ViMoGen motion into two streams:
```
276-dim motion (T frames)
    ↓ split by dimension
Base (201-dim): body_pose_6d + joints + root_orient + root_trans  → low-freq semantic
Phys  (75-dim): joints_vel + root_vel + root_trans_vel            → high-freq dynamics
    ↓ DCT along time axis, keep top K coefficients
    ↓ BPE encoding
Base tokens: ~477/sequence  (K=5,  vocab=4096)
Phys tokens: ~40/sequence   (K=15, vocab=4096)
```
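The split-and-DCT stage above can be sketched in plain NumPy. This is a minimal illustration, not the DSFT implementation. It assumes that the first 201 columns form the Base stream and the last 75 the Phys stream, and it reads "keep top K coefficients" as keeping the K lowest-frequency DCT rows. `dct_matrix`, `split_and_compress`, and `reconstruct` are hypothetical helper names, and BPE encoding is omitted.

```python
import numpy as np

BASE_DIM, PHYS_DIM = 201, 75  # stream widths from the diagram above


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix of shape (n, n)."""
    k = np.arange(n)[:, None]  # frequency index
    t = np.arange(n)[None, :]  # time index
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    mat[0] /= np.sqrt(2.0)  # rescale the DC row so that mat @ mat.T == I
    return mat


def split_and_compress(motion: np.ndarray, k_base: int = 5, k_phys: int = 15):
    """Split (T, 276) motion into two streams and keep low-frequency DCT rows."""
    if motion.ndim != 2 or motion.shape[1] != BASE_DIM + PHYS_DIM:
        raise ValueError("expected motion of shape (T, 276)")
    # ASSUMPTION: Base occupies the first 201 columns, Phys the last 75.
    base, phys = motion[:, :BASE_DIM], motion[:, BASE_DIM:]
    d = dct_matrix(motion.shape[0])  # DCT is applied along the time axis
    # ASSUMPTION: "top K" means the K lowest-frequency coefficients.
    return (d @ base)[:k_base], (d @ phys)[:k_phys]


def reconstruct(coeffs: np.ndarray, num_frames: int) -> np.ndarray:
    """Zero-pad truncated coefficients and invert the orthonormal DCT."""
    full = np.zeros((num_frames, coeffs.shape[1]))
    full[: coeffs.shape[0]] = coeffs
    return dct_matrix(num_frames).T @ full
```

Under these assumptions, a larger K for the Phys stream retains more high-frequency content per dimension, which matches the stated intent of keeping fast dynamics in that stream.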
Each motion sample is laid out as a unified autoregressive sequence:
```
[ M_BOS, b_1, ..., b_N, M_SEP, p_1, ..., p_M, M_EOS ]
```
where `b_i` are Base tokens and `p_j` are Phys tokens. A phase-aware logit mask
enforces the order `BASE → SEP → PHYS → EOS` at inference, so semantic pose
structure is generated before high-frequency physical dynamics.
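One way such a phase-aware mask could work is sketched below. It uses a toy 11-token vocabulary rather than the real ID ranges. The names `allowed_ids` and `mask_logits`, and the rule that each stream must contain at least one token, are assumptions made for illustration and are not taken from the MotionVLA code.

```python
import numpy as np

# Toy ID layout for illustration only (hypothetical, not the real vocabulary).
BASE_IDS = range(0, 4)
PHYS_IDS = range(4, 8)
M_BOS, M_SEP, M_EOS = 8, 9, 10
VOCAB_SIZE = 11


def allowed_ids(prefix):
    """Token IDs permitted after `prefix` under BASE -> SEP -> PHYS -> EOS."""
    if not prefix:
        return {M_BOS}  # a motion sequence must open with M_BOS
    if prefix[-1] == M_EOS:
        return set()  # sequence is finished
    if M_SEP in prefix:  # PHYS phase
        allowed = set(PHYS_IDS)
        if prefix[-1] != M_SEP:  # ASSUMPTION: at least one Phys token
            allowed.add(M_EOS)
        return allowed
    allowed = set(BASE_IDS)  # BASE phase
    if prefix[-1] != M_BOS:  # ASSUMPTION: at least one Base token
        allowed.add(M_SEP)
    return allowed


def mask_logits(logits, prefix):
    """Set logits of tokens that are invalid in the current phase to -inf."""
    masked = np.full(logits.shape, -np.inf, dtype=float)
    idx = np.array(sorted(allowed_ids(prefix)), dtype=int)
    masked[idx] = logits[idx]
    return masked
```

Applying `mask_logits` before sampling at every step means that Phys tokens cannot appear before `M_SEP` and Base tokens cannot appear after it, whatever the raw logits are.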
## Token Vocabulary
The Qwen3.5 backbone vocabulary is extended with motion tokens (used in the
ms-swift training pipeline):
| Token type | ID range | Count |
|------------|----------|-------|
| Base motion tokens | 248320 – 252415 | 4096 |
| Phys motion tokens | 252416 – 256511 | 4096 |
| MOTION_BOS | 256512 | 1 |
| MOTION_SEP | 256513 | 1 |
| MOTION_EOS | 256514 | 1 |
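The table suggests a simple offset mapping between stream-local BPE IDs (0 to 4095) and backbone vocabulary IDs. The sketch below assumes that mapping is linear, meaning local ID `i` maps to `offset + i`; the table does not state this explicitly. `to_backbone_ids` and `from_backbone_ids` are hypothetical helper names.

```python
BASE_OFFSET = 248320  # first Base motion token ID (from the table)
PHYS_OFFSET = 252416  # first Phys motion token ID (from the table)
STREAM_VOCAB = 4096   # tokens per stream
MOTION_BOS, MOTION_SEP, MOTION_EOS = 256512, 256513, 256514


def to_backbone_ids(base_tokens, phys_tokens):
    """Build [M_BOS, b_1..b_N, M_SEP, p_1..p_M, M_EOS] in backbone IDs."""
    for t in (*base_tokens, *phys_tokens):
        if not 0 <= t < STREAM_VOCAB:
            raise ValueError(f"stream-local token ID out of range: {t}")
    # ASSUMPTION: local ID i maps linearly to offset + i.
    return [
        MOTION_BOS,
        *(BASE_OFFSET + t for t in base_tokens),
        MOTION_SEP,
        *(PHYS_OFFSET + t for t in phys_tokens),
        MOTION_EOS,
    ]


def from_backbone_ids(ids):
    """Invert to_backbone_ids, returning (base_tokens, phys_tokens)."""
    if not ids or ids[0] != MOTION_BOS or ids[-1] != MOTION_EOS:
        raise ValueError("sequence must start with M_BOS and end with M_EOS")
    sep = ids.index(MOTION_SEP)
    base = [i - BASE_OFFSET for i in ids[1:sep]]
    phys = [i - PHYS_OFFSET for i in ids[sep + 1:-1]]
    return base, phys
```

For example, `to_backbone_ids([0, 4095], [0, 4095])` produces the first and last ID of each range in the table, wrapped in the three special tokens.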
## Usage
```python
import numpy as np

from tokenizer.ds_fast_tokenizer import DSFTTokenizer

# Load tokenizer
tok = DSFTTokenizer.load("tokenizer/checkpoints")

# Encode 276-dim motion
motion = np.load("motion.npy")  # shape: (T, 276)
result = tok.encode(motion)
# result["base_tokens"]: list of int (BPE IDs for base stream)
# result["phys_tokens"]: list of int (BPE IDs for phys stream)
# result["T"]: number of frames

# Decode back
base_recon, phys_recon = tok.decode(
    result["base_tokens"], result["phys_tokens"], result["T"]
)
# base_recon: (T, 201), phys_recon: (T, 75)
```
## Code
Training code and model architecture: [GitHub](https://github.com/AIGeeksGroup/MotionVLA)
## Citation
```bibtex
@article{motionvla2026,
title={MotionVLA: Vision-Language-Action Model for Humanoid Motion},
author={Zhang, Nonghai and Zhai, Siyu and Zhang, Zeyu and Tang, Hao},
year={2026}
}
```