---
license: apache-2.0
language:
- en
- zh
tags:
- motion-generation
- vision-language
- robotics
- qwen
- dual-stream
datasets:
- MotionVLA-Dataset
---

# MotionVLA

**MotionVLA** is an end-to-end vision-language-action model for humanoid motion generation. It combines a **Qwen3.5** autoregressive backbone (conditioned on a scene image and a text instruction) with **DSFT (Dual-Stream Frequency-domain Tokenizer)**, which decouples low-frequency pose semantics from high-frequency physical dynamics.

## Repository Contents

This Hugging Face repository contains:

| Path | Description |
|------|-------------|
| `tokenizer/` | DSFT tokenizer checkpoints |
| `tokenizer/base/` | Base stream BPE tokenizer (4096 vocab, 201-dim DCT) |
| `tokenizer/phys/` | Phys stream BPE tokenizer (4096 vocab, 75-dim DCT) |
| `dataset/` | Dataset index files (motion_path → relative paths) |

**Motion data files** (`.pt`) and **images** are stored in the companion dataset repository: `[your-hf-username]/MotionVLA-Dataset`.

## Tokenizer Design

The DSFT tokenizer splits 276-dim ViMoGen motion along the feature dimension into two streams, each with its own DCT truncation and BPE vocabulary:

```
276-dim motion (T frames)
    ↓ split by dimension
Base (201-dim): body_pose_6d + joints + root_orient + root_trans   ← low-freq semantic
Phys  (75-dim): joints_vel + root_vel + root_trans_vel             ← high-freq dynamics
    ↓ DCT along time axis, keep top K coefficients
    ↓ BPE encoding
Base tokens: ~477/sequence  (K=5,  vocab=4096)
Phys tokens: ~40/sequence   (K=15, vocab=4096)
```
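The split and DCT steps above can be illustrated with a minimal NumPy sketch. It is not the released DSFT implementation. It assumes that the Base stream occupies the first 201 feature columns and the Phys stream the last 75, and that "top K" means the K lowest-frequency DCT-II coefficients. Quantization and BPE encoding are omitted, and all function names are hypothetical.

```python
import numpy as np

BASE_DIM, PHYS_DIM = 201, 75  # stream widths from the diagram above


def dct_matrix(num_frames: int) -> np.ndarray:
    """Orthonormal DCT-II basis of shape (num_frames, num_frames)."""
    n = np.arange(num_frames)
    k = n[:, None]
    basis = np.sqrt(2.0 / num_frames) * np.cos(np.pi * (n + 0.5) * k / num_frames)
    basis[0] /= np.sqrt(2.0)  # rescale the DC row so the basis is orthonormal
    return basis


def truncated_dct(stream: np.ndarray, keep: int) -> np.ndarray:
    """DCT along the time axis, keeping the `keep` lowest-frequency rows."""
    return dct_matrix(stream.shape[0])[:keep] @ stream


def inverse_truncated_dct(coeffs: np.ndarray, num_frames: int) -> np.ndarray:
    """Reconstruct a (num_frames, D) signal from truncated coefficients."""
    keep = coeffs.shape[0]
    return dct_matrix(num_frames)[:keep].T @ coeffs


def split_streams(motion: np.ndarray):
    # Assumption: Base is the first 201 columns, Phys the remaining 75.
    return motion[:, :BASE_DIM], motion[:, BASE_DIM:]


# Random stand-in for a (T, 276) motion clip with T = 60 frames.
motion = np.random.default_rng(0).normal(size=(60, BASE_DIM + PHYS_DIM))
base, phys = split_streams(motion)
base_coeffs = truncated_dct(base, keep=5)    # shape (5, 201)
phys_coeffs = truncated_dct(phys, keep=15)   # shape (15, 75)
base_smooth = inverse_truncated_dct(base_coeffs, num_frames=60)  # (60, 201)
```

Keeping only a few low-frequency rows acts as a temporal low-pass filter, which is why a small K suits the slowly varying Base stream while the Phys stream keeps more coefficients.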

Each motion sample is laid out as a unified autoregressive sequence:

```
[ M_BOS, b_1, ..., b_N, M_SEP, p_1, ..., p_M, M_EOS ]
```

where `b_i` are Base tokens and `p_j` are Phys tokens. A phase-aware logit mask
enforces the order `BASE → SEP → PHYS → EOS` at inference, so semantic pose
structure is generated before high-frequency physical dynamics.
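One way to picture the phase-aware mask is as a small state machine over the motion tokens generated so far. The sketch below is an assumption about how such a mask could work, not the model's actual implementation. The ID constants are copied from the Token Vocabulary table in this card; `allowed_next_tokens`, `apply_mask`, the `VOCAB_SIZE` value, and the rule that each stream must contain at least one token are hypothetical.

```python
import numpy as np

BASE_START, BASE_END = 248320, 252415   # inclusive Base token range
PHYS_START, PHYS_END = 252416, 256511   # inclusive Phys token range
MOTION_BOS, MOTION_SEP, MOTION_EOS = 256512, 256513, 256514
VOCAB_SIZE = MOTION_EOS + 1  # assumption: no padding beyond the last motion token


def allowed_next_tokens(generated):
    """Boolean mask over the vocabulary for the next motion token.

    `generated` holds the motion tokens emitted so far, starting at MOTION_BOS.
    """
    mask = np.zeros(VOCAB_SIZE, dtype=bool)
    if not generated:
        mask[MOTION_BOS] = True                  # sequence must open with BOS
    elif MOTION_EOS in generated:
        pass                                     # finished: nothing is allowed
    elif MOTION_SEP in generated:
        mask[PHYS_START:PHYS_END + 1] = True     # PHYS phase
        if generated[-1] != MOTION_SEP:
            mask[MOTION_EOS] = True              # EOS after at least one Phys token
    else:
        mask[BASE_START:BASE_END + 1] = True     # BASE phase
        if generated[-1] != MOTION_BOS:
            mask[MOTION_SEP] = True              # SEP after at least one Base token
    return mask


def apply_mask(logits, generated):
    """Set the logits of disallowed tokens to -inf before sampling."""
    return np.where(allowed_next_tokens(generated), logits, -np.inf)
```

Because disallowed logits become `-inf`, they receive zero probability after softmax, so sampling can never leave the `BASE → SEP → PHYS → EOS` order.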

## Token Vocabulary

The Qwen3.5 backbone vocabulary is extended with motion tokens (used in the
ms-swift training pipeline):

| Token type | ID range | Count |
|------------|----------|-------|
| Base motion tokens | 248320 – 252415 | 4096 |
| Phys motion tokens | 252416 – 256511 | 4096 |
| MOTION_BOS | 256512 | 1 |
| MOTION_SEP | 256513 | 1 |
| MOTION_EOS | 256514 | 1 |
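
Given these ranges, a per-stream BPE ID in `[0, 4095]` can be mapped into the backbone vocabulary by adding the stream's starting offset. The sketch below assumes the mapping is a plain offset, which the table suggests but does not state; `to_backbone_sequence` and `from_backbone_sequence` are hypothetical helpers, not part of the released code.

```python
BASE_OFFSET = 248320   # first Base motion token ID
PHYS_OFFSET = 252416   # first Phys motion token ID
STREAM_VOCAB = 4096    # BPE vocabulary size of each stream
MOTION_BOS, MOTION_SEP, MOTION_EOS = 256512, 256513, 256514


def to_backbone_sequence(base_tokens, phys_tokens):
    """Build [M_BOS, b_1..b_N, M_SEP, p_1..p_M, M_EOS] in backbone IDs."""
    for tok in list(base_tokens) + list(phys_tokens):
        if not 0 <= tok < STREAM_VOCAB:
            raise ValueError(f"stream token {tok} is outside [0, {STREAM_VOCAB})")
    return (
        [MOTION_BOS]
        + [BASE_OFFSET + t for t in base_tokens]
        + [MOTION_SEP]
        + [PHYS_OFFSET + t for t in phys_tokens]
        + [MOTION_EOS]
    )


def from_backbone_sequence(sequence):
    """Recover per-stream BPE IDs from a complete motion sequence."""
    sep = sequence.index(MOTION_SEP)
    eos = sequence.index(MOTION_EOS)
    base_tokens = [t - BASE_OFFSET for t in sequence[1:sep]]
    phys_tokens = [t - PHYS_OFFSET for t in sequence[sep + 1:eos]]
    return base_tokens, phys_tokens


seq = to_backbone_sequence([0, 4095], [7])
print(seq)  # [256512, 248320, 252415, 256513, 252423, 256514]
```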

## Usage

```python
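import numpy as np

# DSFTTokenizer is assumed to be importable from the MotionVLA code base
from tokenizer.ds_fast_tokenizer import DSFTTokenizer

# Load the tokenizer; adjust the path to wherever the checkpoints were downloaded
tok = DSFTTokenizer.load("tokenizer/checkpoints")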

# Encode 276-dim motion
motion = np.load("motion.npy")  # shape: (T, 276)
result = tok.encode(motion)
# result["base_tokens"]: list of int (BPE IDs for base stream)
# result["phys_tokens"]: list of int (BPE IDs for phys stream)
# result["T"]: number of frames

# Decode back
base_recon, phys_recon = tok.decode(
    result["base_tokens"], result["phys_tokens"], result["T"])
# base_recon: (T, 201), phys_recon: (T, 75)
```

## Code

Training code and model architecture: [GitHub](https://github.com/AIGeeksGroup/MotionVLA)

## Citation

```bibtex
@article{motionvla2026,
  title={MotionVLA: Vision-Language-Action Model for Humanoid Motion},
  author={Zhang, Nonghai and Zhai, Siyu and Zhang, Zeyu and Tang, Hao},
  year={2026}
}
```