---
license: apache-2.0
language:
- en
- zh
tags:
- motion-generation
- vision-language
- robotics
- qwen
- dual-stream
datasets:
- MotionVLA-Dataset
---
# MotionVLA
**MotionVLA** is an end-to-end vision-language-action model for humanoid motion generation. It combines a **Qwen3.5** autoregressive backbone (conditioned on a scene image and a text instruction) with **DSFT (Dual-Stream Frequency-domain Tokenizer)**, which decouples low-frequency pose semantics from high-frequency physical dynamics.
## Repository Contents
This HuggingFace repository contains:
| Path | Description |
|------|-------------|
| `tokenizer/` | DSFT tokenizer checkpoints |
| `tokenizer/base/` | Base stream BPE tokenizer (4096 vocab, 201-dim DCT) |
| `tokenizer/phys/` | Phys stream BPE tokenizer (4096 vocab, 75-dim DCT) |
| `dataset/` | Dataset index files (motion_path → relative paths) |
**Motion data files** (`.pt`) and **images** are stored in the companion dataset repo: `[your-hf-username]/MotionVLA-Dataset`
## Tokenizer Design
The DSFT tokenizer decomposes 276-dim ViMoGen motion into two streams:
```
276-dim motion (T frames)
↓ split by dimension
Base (201-dim): body_pose_6d + joints + root_orient + root_trans ← low-freq semantic
Phys (75-dim): joints_vel + root_vel + root_trans_vel ← high-freq dynamics
↓ DCT along time axis, keep top K coefficients
↓ BPE encoding
Base tokens: ~477/sequence (K=5, vocab=4096)
Phys tokens: ~40/sequence (K=15, vocab=4096)
```
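The split-and-truncate step in the diagram can be sketched in plain NumPy. This is a minimal illustration, not the DSFT implementation: it assumes the Base channels occupy the first 201 columns and the Phys channels the last 75, and it reads "keep top K coefficients" as keeping the K lowest-frequency DCT-II coefficients. The helper names (`dct_matrix`, `truncated_dct`, `inverse_truncated_dct`, `split_streams`) are hypothetical, and the BPE stage is omitted.

```python
import numpy as np

BASE_DIM, PHYS_DIM = 201, 75  # stream widths from the diagram above


def dct_matrix(T: int) -> np.ndarray:
    """Orthonormal DCT-II basis of shape (T, T); row k holds frequency k."""
    n = np.arange(T)
    basis = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * T))
    basis *= np.sqrt(2.0 / T)
    basis[0] /= np.sqrt(2.0)  # extra scaling of the DC row makes the basis orthonormal
    return basis


def truncated_dct(x: np.ndarray, K: int) -> np.ndarray:
    """DCT along the time axis of x (T, D), keeping the K lowest frequencies."""
    return dct_matrix(x.shape[0])[:K] @ x  # shape (K, D)


def inverse_truncated_dct(coeffs: np.ndarray, T: int) -> np.ndarray:
    """Reconstruct (T, D) from K retained coefficients; dropped ones count as zero."""
    K = coeffs.shape[0]
    return dct_matrix(T)[:K].T @ coeffs


def split_streams(motion: np.ndarray):
    """Assumed channel layout: Base columns first, Phys columns last."""
    assert motion.shape[1] == BASE_DIM + PHYS_DIM
    return motion[:, :BASE_DIM], motion[:, BASE_DIM:]


# Toy input: a smooth signal repeated across all 276 channels.
T = 60
t = np.linspace(0.0, 1.0, T)[:, None]
motion = np.cos(np.pi * t) * np.ones((1, BASE_DIM + PHYS_DIM))

base, phys = split_streams(motion)
base_coeffs = truncated_dct(base, K=5)    # shape (5, 201)
phys_coeffs = truncated_dct(phys, K=15)   # shape (15, 75)
base_recon = inverse_truncated_dct(base_coeffs, T)  # shape (60, 201)
```

Because the basis is orthonormal, truncation is an orthogonal projection, so the reconstruction error can only shrink or stay equal as K grows.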
Each motion sample is laid out as a unified autoregressive sequence:
```
[ M_BOS, b_1, ..., b_N, M_SEP, p_1, ..., p_M, M_EOS ]
```
where `b_i` are Base tokens and `p_j` are Phys tokens. A phase-aware logit mask
enforces the order `BASE → SEP → PHYS → EOS` at inference, so semantic pose
structure is generated before high-frequency physical dynamics.
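One way such a phase-aware mask could work is sketched below, using the ID ranges from the Token Vocabulary table in this README. The state machine (`next_phase`), the helper names, and the assumption that the motion tokens sit at the very end of the vocabulary are illustrative guesses, not the released implementation.

```python
import numpy as np

# ID ranges copied from the Token Vocabulary table in this README.
BASE_START, BASE_END = 248320, 252415   # inclusive
PHYS_START, PHYS_END = 252416, 256511   # inclusive
MOTION_BOS, MOTION_SEP, MOTION_EOS = 256512, 256513, 256514
VOCAB_SIZE = MOTION_EOS + 1  # assumption: motion tokens end the vocabulary


def allowed_mask(phase: str) -> np.ndarray:
    """Boolean mask over the vocabulary: True where a token may be sampled."""
    mask = np.zeros(VOCAB_SIZE, dtype=bool)
    if phase == "BASE":
        mask[BASE_START:BASE_END + 1] = True
        mask[MOTION_SEP] = True
    elif phase == "PHYS":
        mask[PHYS_START:PHYS_END + 1] = True
        mask[MOTION_EOS] = True
    return mask  # any other phase (e.g. "DONE") allows nothing


def mask_logits(logits: np.ndarray, phase: str) -> np.ndarray:
    """Set disallowed logits to -inf so they receive zero probability."""
    return np.where(allowed_mask(phase), logits, -np.inf)


def next_phase(phase: str, token: int) -> str:
    """Advance the hypothetical phase state machine after emitting `token`."""
    if phase == "BASE" and token == MOTION_SEP:
        return "PHYS"
    if phase == "PHYS" and token == MOTION_EOS:
        return "DONE"
    return phase


# Toy logits standing in for the backbone: unmasked, EOS would always win.
logits = np.zeros(VOCAB_SIZE)
logits[MOTION_EOS] = 5.0
logits[MOTION_SEP] = 1.0

first = int(np.argmax(mask_logits(logits, "BASE")))   # EOS is masked, so SEP wins
phase = next_phase("BASE", first)
second = int(np.argmax(mask_logits(logits, phase)))   # EOS is now allowed
phase = next_phase(phase, second)
```

In this toy run the mask blocks the premature `MOTION_EOS` during the Base phase and only admits it once `MOTION_SEP` has been emitted.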
## Token Vocabulary
The Qwen3.5 backbone vocabulary is extended with motion tokens (used in the
ms-swift training pipeline):
| Token type | ID range | Count |
|------------|----------|-------|
| Base motion tokens | 248320 – 252415 | 4096 |
| Phys motion tokens | 252416 – 256511 | 4096 |
| MOTION_BOS | 256512 | 1 |
| MOTION_SEP | 256513 | 1 |
| MOTION_EOS | 256514 | 1 |
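The table implies a simple offset mapping between each stream's local BPE IDs and the extended backbone vocabulary. The sketch below assumes local IDs run from 0 to 4095 and are shifted by the start of each range; the actual mapping in the training pipeline may differ, and `pack_sequence` / `unpack_sequence` are hypothetical helpers.

```python
BASE_OFFSET = 248320   # first Base motion token ID
PHYS_OFFSET = 252416   # first Phys motion token ID
STREAM_VOCAB = 4096    # size of each stream's BPE vocabulary
MOTION_BOS, MOTION_SEP, MOTION_EOS = 256512, 256513, 256514


def pack_sequence(base_tokens, phys_tokens):
    """Shift local BPE IDs into backbone IDs and frame them with special tokens."""
    for t in list(base_tokens) + list(phys_tokens):
        if not 0 <= t < STREAM_VOCAB:
            raise ValueError(f"local token ID {t} outside [0, {STREAM_VOCAB})")
    return (
        [MOTION_BOS]
        + [BASE_OFFSET + t for t in base_tokens]
        + [MOTION_SEP]
        + [PHYS_OFFSET + t for t in phys_tokens]
        + [MOTION_EOS]
    )


def unpack_sequence(ids):
    """Inverse of pack_sequence: recover the local Base and Phys BPE IDs."""
    if ids[0] != MOTION_BOS or ids[-1] != MOTION_EOS:
        raise ValueError("sequence must start with MOTION_BOS and end with MOTION_EOS")
    sep = ids.index(MOTION_SEP)
    base = [i - BASE_OFFSET for i in ids[1:sep]]
    phys = [i - PHYS_OFFSET for i in ids[sep + 1:-1]]
    return base, phys


ids = pack_sequence([0, 17, 4095], [3, 4095])
base_ids, phys_ids = unpack_sequence(ids)
```

Keeping the two streams in disjoint ID ranges means a token's stream can be read off its ID alone, which is what makes range-based masking at inference straightforward.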
## Usage
```python
from tokenizer.ds_fast_tokenizer import DSFTTokenizer
import numpy as np
# Load tokenizer
tok = DSFTTokenizer.load("tokenizer/checkpoints")
# Encode 276-dim motion
motion = np.load("motion.npy") # shape: (T, 276)
result = tok.encode(motion)
# result["base_tokens"]: list of int (BPE IDs for base stream)
# result["phys_tokens"]: list of int (BPE IDs for phys stream)
# result["T"]: number of frames
# Decode back
base_recon, phys_recon = tok.decode(
    result["base_tokens"], result["phys_tokens"], result["T"]
)
# base_recon: (T, 201), phys_recon: (T, 75)
```
## Code
Training code and model architecture: [GitHub](https://github.com/AIGeeksGroup/MotionVLA)
## Citation
```bibtex
@article{motionvla2026,
title={MotionVLA: Vision-Language-Action Model for Humanoid Motion},
author={Zhang, Nonghai and Zhai, Siyu and Zhang, Zeyu and Tang, Hao},
year={2026}
}
```