VLA Tokenizer — Qwen3
Extended Qwen3 tokenizer for the FineVideo-VLA multimodal dataset. Adds all VLA tokens (video, 3D pose, SNAC audio) on top of the Qwen3 base tokenizer.
Vocab size: 257,897 (151,669 Qwen3 base + 106,228 VLA tokens)
Why Qwen3? The Qwen3 family has strong multilingual and reasoning abilities and is widely supported across the open-source ecosystem (transformers, vLLM, llama.cpp). This tokenizer makes it possible to train a Qwen3-based VLA model, either from scratch or through continued pretraining, with full VLA token support.
Token categories added on top of Qwen3
| Category | Format | Count | Notes |
|---|---|---|---|
| Seed2 visual | `<seed2_N>` (N: 0–8191) | 8,192 | Semantic keyframe tokens, 1 FPS |
| Cosmos spatial | `<cosmos_N>` (N: 0–63999) | 64,000 | Spatial video tokens, every 8 frames |
| AVC-LM H.264 | `<avclm_N>` (N: 0–8191) | 8,192 | H.264 BPE tokens, every 8 frames |
| Agent legacy | `<agent_N>` (N: 0–255) | 256 | Legacy opaque agent tokens |
| FPS prefix | `<fps_N>` (N: 1–60) | 60 | Frame rate marker per chunk |
| Joint position | `<{joint}_x/y/z_N>` (N: 0–255) | 13,056 | Quantized xyz, maps [-2m, +2m] |
| Joint time | `<{joint}_t_N>` (N: 0–7) | 136 | Frame index within 8-frame window |
| Modality wrappers | `<seed2>`, `</agent>`, etc. | 46 | Open/close tags + joint wrappers |
| SNAC Level 0 | `<snac_128266>` – `<snac_132361>` | 4,096 | 12.5 Hz coarse audio |
| SNAC Level 1 even | `<snac_132362>` – `<snac_136457>` | 4,096 | 25 Hz fine audio (even frames) |
| SNAC Level 1 odd | `<snac_144650>` – `<snac_148745>` | 4,096 | 25 Hz fine audio (odd frames) |
| SNAC wrappers | `<snac>`, `</snac>` | 2 | Block delimiters |
Total VLA tokens added: 106,228
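As a quick sanity check, the per-category counts in the table can be summed to confirm both totals. This is only arithmetic on the numbers stated above; the dictionary keys are illustrative labels, not names used by the tokenizer.

```python
# Per-category counts, copied from the table above.
# The keys are illustrative labels only.
VLA_COUNTS = {
    "seed2": 8192,
    "cosmos": 64000,
    "avclm": 8192,
    "agent": 256,
    "fps": 60,
    "joint_xyz": 17 * 3 * 256,  # 17 joints x 3 axes x 256 bins = 13,056
    "joint_t": 17 * 8,          # 17 joints x 8 frame indices = 136
    "wrappers": 12 + 2 * 17,    # 12 modality tags + open/close per joint = 46
    "snac_codes": 3 * 4096,     # three SNAC ranges of 4,096 codes each
    "snac_wrappers": 2,
}
QWEN3_BASE = 151_669

vla_total = sum(VLA_COUNTS.values())
print(vla_total)               # 106228
print(QWEN3_BASE + vla_total)  # 257897
```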
17 Named Joints (H36M skeleton)
pelvis · r_hip · r_knee · r_ankle · l_hip · l_knee · l_ankle ·
spine · thorax · nose · head_top · l_shoulder · l_elbow · l_wrist ·
r_shoulder · r_elbow · r_wrist
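A minimal sketch of how a joint coordinate could be turned into pose tokens. The table only states that positions are quantized into 256 bins over [-2m, +2m]; the uniform binning and clamping used here are assumptions, and `quantize` and `joint_tokens` are hypothetical helper names, not part of the dataset tooling.

```python
JOINTS = ["pelvis", "r_hip", "r_knee", "r_ankle", "l_hip", "l_knee", "l_ankle",
          "spine", "thorax", "nose", "head_top", "l_shoulder", "l_elbow", "l_wrist",
          "r_shoulder", "r_elbow", "r_wrist"]

NUM_BINS = 256          # N ranges over 0-255
RANGE_M = (-2.0, 2.0)   # metres, as stated in the table


def quantize(value_m: float) -> int:
    """Map a coordinate in metres to a bin index (assumed: uniform bins, clamped)."""
    lo, hi = RANGE_M
    frac = (value_m - lo) / (hi - lo)
    return min(NUM_BINS - 1, max(0, int(frac * NUM_BINS)))


def joint_tokens(joint: str, t: int, x: float, y: float, z: float) -> list[str]:
    """Build the time + xyz tokens for one joint at frame index t (0-7)."""
    if joint not in JOINTS:
        raise ValueError(f"unknown joint: {joint}")
    if not 0 <= t < 8:
        raise ValueError("frame index must be in 0-7")
    return [
        f"<{joint}_t_{t}>",
        f"<{joint}_x_{quantize(x)}>",
        f"<{joint}_y_{quantize(y)}>",
        f"<{joint}_z_{quantize(z)}>",
    ]


print(joint_tokens("pelvis", 0, 0.0, 0.0, 0.0))
# ['<pelvis_t_0>', '<pelvis_x_128>', '<pelvis_y_128>', '<pelvis_z_128>']
```

Under this assumed binning, a coordinate of 0 m lands in bin 128. The exact rounding rule used to build the dataset may differ.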
Usage
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-qwen3")
print(len(tok))  # 257897

# All VLA tokens are atomic
print(tok.encode("<seed2_1137>", add_special_tokens=False))    # single ID
print(tok.encode("<pelvis_x_128>", add_special_tokens=False))  # single ID
print(tok.encode("<fps_30>", add_special_tokens=False))        # single ID
print(tok.encode("<snac_128266>", add_special_tokens=False))   # single ID
```
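To spot-check atomicity on more than a handful of tokens, a small helper along these lines could be used. It is a sketch: `sample_vla_tokens` and `non_atomic` are hypothetical names, and the helper only assumes an object with an `encode(text, add_special_tokens=False)` method such as the tokenizer loaded above.

```python
import random


def sample_vla_tokens(n_per_family: int = 5, seed: int = 0) -> list[str]:
    """Draw a few random tokens from each numeric family in the table."""
    rng = random.Random(seed)
    families = {
        "seed2": range(8192),
        "cosmos": range(64000),
        "avclm": range(8192),
        "agent": range(256),
        "fps": range(1, 61),
        "snac": range(128266, 128266 + 4096),  # SNAC Level 0 range
    }
    return [
        f"<{name}_{rng.choice(ids)}>"
        for name, ids in families.items()
        for _ in range(n_per_family)
    ]


def non_atomic(tok, tokens: list[str]) -> list[str]:
    """Return the tokens that do NOT encode to exactly one ID."""
    return [
        t for t in tokens
        if len(tok.encode(t, add_special_tokens=False)) != 1
    ]
```

With `tok` loaded as in the usage snippet, `non_atomic(tok, sample_vla_tokens())` should return an empty list if every sampled token is atomic.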
How it was created
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")  # or any Qwen3 variant

# The 17 H36M joints listed above
JOINTS = ["pelvis", "r_hip", "r_knee", "r_ankle", "l_hip", "l_knee", "l_ankle",
          "spine", "thorax", "nose", "head_top", "l_shoulder", "l_elbow", "l_wrist",
          "r_shoulder", "r_elbow", "r_wrist"]

# Build all 106,228 VLA tokens
vla_tokens = []

# Modality wrappers
vla_tokens += ["<seed2>", "</seed2>", "<cosmos>", "</cosmos>",
               "<avc_lm>", "</avc_lm>", "<agent>", "</agent>",
               "<start_cosmo>", "</start_cosmo>", "<start_avclm>", "</start_avclm>"]

# Joint wrappers
for joint in JOINTS:
    vla_tokens += [f"<{joint}>", f"</{joint}>"]

# Video tokens
vla_tokens += [f"<agent_{i}>" for i in range(256)]
vla_tokens += [f"<avclm_{i}>" for i in range(8192)]
vla_tokens += [f"<seed2_{i}>" for i in range(8192)]
vla_tokens += [f"<cosmos_{i}>" for i in range(64000)]

# Pose tokens
vla_tokens += [f"<fps_{i}>" for i in range(1, 61)]
for joint in JOINTS:
    vla_tokens += [f"<{joint}_x_{n}>" for n in range(256)]
    vla_tokens += [f"<{joint}_y_{n}>" for n in range(256)]
    vla_tokens += [f"<{joint}_z_{n}>" for n in range(256)]
    vla_tokens += [f"<{joint}_t_{n}>" for n in range(8)]

# SNAC tokens
vla_tokens += ["<snac>", "</snac>"]
vla_tokens += [f"<snac_{i + 128266}>" for i in range(4096)]  # Level 0
vla_tokens += [f"<snac_{i + 132362}>" for i in range(4096)]  # Level 1 even
vla_tokens += [f"<snac_{i + 144650}>" for i in range(4096)]  # Level 1 odd

tok.add_tokens(vla_tokens, special_tokens=True)  # all atomic
tok.save_pretrained("tokenizer-vla-qwen3")
# vocab size: 257,897
```
Full script: tools/build_tokenizers.py in the finevideo-vla repo.
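The SNAC token names embed a fixed offset per stream, so converting between a codebook index (0 to 4095) and a token name is plain arithmetic. The sketch below uses the three offsets from the build script; the stream labels (`level0`, `level1_even`, `level1_odd`) and the function names are hypothetical.

```python
# Offsets taken from the build script above; stream labels are illustrative.
SNAC_OFFSETS = {
    "level0": 128266,
    "level1_even": 132362,
    "level1_odd": 144650,
}
CODEBOOK_SIZE = 4096


def snac_token(stream: str, code: int) -> str:
    """Token name for a codebook index in the given stream."""
    if stream not in SNAC_OFFSETS:
        raise KeyError(f"unknown stream: {stream}")
    if not 0 <= code < CODEBOOK_SIZE:
        raise ValueError("code must be in 0-4095")
    return f"<snac_{SNAC_OFFSETS[stream] + code}>"


def snac_decode(token: str) -> tuple[str, int]:
    """Recover (stream, codebook index) from a token such as '<snac_131580>'."""
    n = int(token[len("<snac_"):-1])
    for stream, offset in SNAC_OFFSETS.items():
        if offset <= n < offset + CODEBOOK_SIZE:
            return stream, n - offset
    raise ValueError(f"{token} is outside all SNAC ranges")


print(snac_token("level0", 0))       # <snac_128266>
print(snac_decode("<snac_147244>"))  # ('level1_odd', 2594)
```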
Interleaved token sequence format
```
USER: <activity description> [Speech: ...] ASSISTANT:
<seed2_6750> <seed2_680> ...                       # semantic keyframes 1 FPS
<cosmos_63127> <cosmos_42647> ... </cosmos>        # spatial video every 8 frames
<avc_lm> <avclm_263> <avclm_107> ... </avc_lm>
<agent>
<fps_30>
<pelvis> <pelvis_t_0> <pelvis_x_128> <pelvis_y_128> <pelvis_z_128> ... </pelvis>
... 17 joints ...
</agent>
<snac> <snac_131580> <snac_134777> <snac_147244> ... </snac>
```
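For inspecting generated sequences, a small parser like the following can split a string into tokens, separate wrappers from value tokens, and check that wrappers nest properly. It is a sketch based only on the example above: the regular expression, `classify`, and `check_nesting` are assumptions, not part of the dataset tooling.

```python
import re
from collections import Counter

# Matches <name>, </name>, and <name_123>; free text such as
# "<activity description>" contains a space and is not matched.
TOKEN_RE = re.compile(r"</?[a-z0-9_]+>")


def classify(token: str) -> tuple[str, str]:
    """Return (kind, name), where kind is 'open', 'close', or 'value'."""
    closing = token.startswith("</")
    body = token[2:-1] if closing else token[1:-1]
    m = re.fullmatch(r"(.+)_(\d+)", body)
    if m and not closing:
        return "value", m.group(1)  # e.g. <pelvis_x_128> -> 'pelvis_x'
    return ("close" if closing else "open"), body


def check_nesting(tokens: list[str]) -> bool:
    """True if every open wrapper is closed in the right order."""
    stack = []
    for tok in tokens:
        kind, name = classify(tok)
        if kind == "open":
            stack.append(name)
        elif kind == "close":
            if not stack or stack.pop() != name:
                return False
    return not stack


sequence = (
    "<avc_lm> <avclm_263> <avclm_107> </avc_lm> "
    "<agent> <fps_30> <pelvis> <pelvis_t_0> <pelvis_x_128> "
    "<pelvis_y_128> <pelvis_z_128> </pelvis> </agent> "
    "<snac> <snac_131580> <snac_134777> <snac_147244> </snac>"
)
tokens = TOKEN_RE.findall(sequence)
value_counts = Counter(name for kind, name in map(classify, tokens) if kind == "value")

print(len(tokens))                                   # 18
print(value_counts["avclm"], value_counts["snac"])   # 2 3
print(check_nesting(tokens))                         # True
```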
Related
| Resource | Link |
|---|---|
| GPT-NeoX v2 tokenizer (+ SNAC) | tokenizer-vla-adaptive-v2 |
| Original GPT-NeoX v1 tokenizer | tokenizer-vla-adaptive |
| VLA model (trained with v1) | vla-1.7b-pab-spline-adaptive |
| FineVideo-VLA dataset | FineVideo-Phase7-Flattened |