	---
	license: apache-2.0
	language:
	- en
	- zh
	tags:
	- motion-generation
	- vision-language
	- robotics
	- qwen
	- dual-stream
	datasets:
	- MotionVLA-Dataset
	---

	# MotionVLA

	MotionVLA is an end-to-end vision-language-action model for humanoid motion generation. It combines a Qwen3.5 autoregressive backbone (conditioned on a scene image and a text instruction) with DSFT (Dual-Stream Frequency-domain Tokenizer), which decouples low-frequency pose semantics from high-frequency physical dynamics.

	## Repository Contents

	This HuggingFace repository contains:

	| Path | Description |
	|------|-------------|
	| `tokenizer/` | DSFT tokenizer checkpoints |
	| `tokenizer/base/` | Base stream BPE tokenizer (4096 vocab, 201-dim DCT) |
	| `tokenizer/phys/` | Phys stream BPE tokenizer (4096 vocab, 75-dim DCT) |
	| `dataset/` | Dataset index files (motion_path → relative paths) |

	Motion data files (`.pt`) and images are stored in the companion dataset repo: `[your-hf-username]/MotionVLA-Dataset`

	## Tokenizer Design

	The DSFT tokenizer decomposes 276-dim ViMoGen motion into two streams:

	```
	276-dim motion (T frames)
	↓ split by dimension
	Base (201-dim): body_pose_6d + joints + root_orient + root_trans ← low-freq semantic
	Phys (75-dim): joints_vel + root_vel + root_trans_vel ← high-freq dynamics
	↓ DCT along time axis, keep top K coefficients
	↓ BPE encoding
	Base tokens: ~477/sequence (K=5, vocab=4096)
	Phys tokens: ~40/sequence (K=15, vocab=4096)
	```
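
The split and DCT steps in the diagram can be illustrated with a short NumPy sketch. It is not the DSFT implementation: the channel order (first 201 columns as Base, last 75 as Phys), the orthonormal DCT-II, and reading "top K" as the K lowest-frequency coefficients are all assumptions, and the helper names `dct2_matrix`, `split_and_compress`, and `reconstruct` are hypothetical. Quantization and BPE encoding are omitted.

```python
import numpy as np

BASE_DIM, PHYS_DIM = 201, 75  # stream widths from the diagram above


def dct2_matrix(T: int) -> np.ndarray:
    """Orthonormal DCT-II basis of shape (T, T); row k is frequency k."""
    n = np.arange(T)
    k = n[:, None]
    basis = np.cos(np.pi * (2 * n[None, :] + 1) * k / (2 * T))
    basis[0] *= np.sqrt(1.0 / T)
    basis[1:] *= np.sqrt(2.0 / T)
    return basis


def split_and_compress(motion: np.ndarray, k_base: int = 5, k_phys: int = 15):
    """Split a (T, 276) motion and keep the first K DCT coefficients per channel."""
    assert motion.shape[1] == BASE_DIM + PHYS_DIM
    # Assumed channel order: Base columns first, Phys columns last.
    base, phys = motion[:, :BASE_DIM], motion[:, BASE_DIM:]
    D = dct2_matrix(motion.shape[0])
    # DCT along the time axis, then truncate to the K lowest frequencies.
    return (D @ base)[:k_base], (D @ phys)[:k_phys]  # (K, 201), (K, 75)


def reconstruct(coeffs: np.ndarray, T: int) -> np.ndarray:
    """Invert the truncated DCT, treating the discarded frequencies as zero."""
    D = dct2_matrix(T)
    return D[: coeffs.shape[0]].T @ coeffs  # (T, channels)
```

Because the basis is orthonormal, the inverse transform is just the transpose, so truncating to K coefficients acts as a low-pass filter on each channel.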

	Each motion sample is laid out as a unified autoregressive sequence:

	```
	[ M_BOS, b_1, ..., b_N, M_SEP, p_1, ..., p_M, M_EOS ]
	```

	where `b_i` are Base tokens and `p_j` are Phys tokens. A phase-aware logit mask
	enforces the order `BASE → SEP → PHYS → EOS` at inference, so semantic pose
	structure is generated before high-frequency physical dynamics.
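
As a rough illustration of the phase-aware mask, the sketch below allows only the token types that are legal in the current phase. It uses the ID ranges listed in the Token Vocabulary table; the function names `allowed_mask` and `apply_mask`, the vocabulary size, and the choice to allow `M_SEP` directly after `M_BOS` are assumptions, not details taken from the training code.

```python
import numpy as np

# ID ranges copied from the Token Vocabulary table.
BASE_LO, BASE_HI = 248320, 252415
PHYS_LO, PHYS_HI = 252416, 256511
M_BOS, M_SEP, M_EOS = 256512, 256513, 256514
VOCAB_SIZE = M_EOS + 1  # assumption: no token IDs above MOTION_EOS


def allowed_mask(generated: list[int]) -> np.ndarray:
    """Boolean mask over the vocabulary for the next motion token."""
    mask = np.zeros(VOCAB_SIZE, dtype=bool)
    if not generated:
        mask[M_BOS] = True  # a motion segment must start with M_BOS
    elif generated[-1] == M_EOS:
        pass  # sequence finished: nothing further is allowed
    elif M_SEP in generated:
        # PHYS phase: Phys tokens, or M_EOS to stop.
        mask[PHYS_LO : PHYS_HI + 1] = True
        mask[M_EOS] = True
    else:
        # BASE phase: Base tokens, or M_SEP to switch streams.
        mask[BASE_LO : BASE_HI + 1] = True
        mask[M_SEP] = True
    return mask


def apply_mask(logits: np.ndarray, generated: list[int]) -> np.ndarray:
    """Set the logits of disallowed tokens to -inf before sampling."""
    return np.where(allowed_mask(generated), logits, -np.inf)
```

In a real decoding loop the same mask would be applied to the backbone's logits at every step, with `generated` holding only the tokens emitted since the start of the motion segment.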

	## Token Vocabulary

	The Qwen3.5 backbone vocabulary is extended with motion tokens (used in the
	ms-swift training pipeline):

	| Token type | ID range | Count |
	|------------|----------|-------|
	| Base motion tokens | 248320 – 252415 | 4096 |
	| Phys motion tokens | 252416 – 256511 | 4096 |
	| MOTION_BOS | 256512 | 1 |
	| MOTION_SEP | 256513 | 1 |
	| MOTION_EOS | 256514 | 1 |
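
The table suggests a simple offset mapping between per-stream BPE IDs and backbone vocabulary IDs. The sketch below assumes that each stream's local IDs run from 0 to 4095 and are shifted by the first ID of the corresponding range; that mapping and the helper names `to_backbone_ids` and `from_backbone_ids` are illustrative assumptions, not confirmed details of the ms-swift pipeline.

```python
BASE_OFFSET = 248320  # first Base motion token ID
PHYS_OFFSET = 252416  # first Phys motion token ID
STREAM_VOCAB = 4096   # BPE vocabulary size of each stream
MOTION_BOS, MOTION_SEP, MOTION_EOS = 256512, 256513, 256514


def to_backbone_ids(base_tokens, phys_tokens):
    """Build [M_BOS, b_1..b_N, M_SEP, p_1..p_M, M_EOS] in backbone IDs."""
    for t in list(base_tokens) + list(phys_tokens):
        if not 0 <= t < STREAM_VOCAB:
            raise ValueError(f"local token ID out of range: {t}")
    return (
        [MOTION_BOS]
        + [BASE_OFFSET + t for t in base_tokens]
        + [MOTION_SEP]
        + [PHYS_OFFSET + t for t in phys_tokens]
        + [MOTION_EOS]
    )


def from_backbone_ids(ids):
    """Recover the local Base and Phys token lists from a full motion sequence."""
    sep = ids.index(MOTION_SEP)
    base = [t - BASE_OFFSET for t in ids[1:sep]]
    phys = [t - PHYS_OFFSET for t in ids[sep + 1 : -1]]
    return base, phys
```

For example, `to_backbone_ids([0, 4095], [7])` yields `[256512, 248320, 252415, 256513, 252423, 256514]` under these assumptions.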

	## Usage

	```python
	# Assumes the MotionVLA code (see the Code section) is importable.
	import numpy as np

	from tokenizer.ds_fast_tokenizer import DSFTTokenizer

	# Load the tokenizer checkpoints (adjust the path to your local copy)
	tok = DSFTTokenizer.load("tokenizer/checkpoints")

	# Encode a 276-dim motion sequence
	motion = np.load("motion.npy")  # shape: (T, 276)
	result = tok.encode(motion)
	# result["base_tokens"]: list of int (BPE IDs for the Base stream)
	# result["phys_tokens"]: list of int (BPE IDs for the Phys stream)
	# result["T"]: number of frames

	# Decode back to the two streams
	base_recon, phys_recon = tok.decode(
	    result["base_tokens"], result["phys_tokens"], result["T"]
	)
	# base_recon: (T, 201), phys_recon: (T, 75)
	```

	## Code

	Training code and model architecture: [GitHub](https://github.com/AIGeeksGroup/MotionVLA)

	## Citation

	```bibtex
	@article{motionvla2026,
	  title={MotionVLA: Vision-Language-Action Model for Humanoid Motion},
	  author={Zhang, Nonghai and Zhai, Siyu and Zhang, Zeyu and Tang, Hao},
	  year={2026}
	}
	```