VLA Tokenizer — Qwen3

Extended Qwen3 tokenizer for the FineVideo-VLA multimodal dataset. Adds all VLA tokens (video, 3D pose, SNAC audio) on top of the Qwen3 base tokenizer.

Vocab size: 257,897 (151,669 Qwen3 base + 106,228 VLA tokens)

Why Qwen3? The Qwen3 family has strong multilingual and reasoning abilities and broad ecosystem support (transformers, vLLM, llama.cpp). This tokenizer makes it possible to train a Qwen3-based VLA model from scratch, or to continue pretraining an existing Qwen3 model, with full VLA token support.


Token categories added on top of Qwen3

| Category | Format | Count | Notes |
|---|---|---|---|
| Seed2 visual | <seed2_N> (N: 0–8191) | 8,192 | Semantic keyframe tokens, 1 FPS |
| Cosmos spatial | <cosmos_N> (N: 0–63999) | 64,000 | Spatial video tokens, every 8 frames |
| AVC-LM H.264 | <avclm_N> (N: 0–8191) | 8,192 | H.264 BPE tokens, every 8 frames |
| Agent legacy | <agent_N> (N: 0–255) | 256 | Legacy opaque agent tokens |
| FPS prefix | <fps_N> (N: 1–60) | 60 | Frame rate marker per chunk |
| Joint position | <{joint}_x/y/z_N> (N: 0–255) | 13,056 | Quantized xyz, maps [-2 m, +2 m] |
| Joint time | <{joint}_t_N> (N: 0–7) | 136 | Frame index within 8-frame window |
| Modality wrappers | <seed2>, </agent>, etc. | 46 | Open/close tags + joint wrappers |
| SNAC Level 0 | <snac_128266> – <snac_132361> | 4,096 | 12.5 Hz coarse audio |
| SNAC Level 1 even | <snac_132362> – <snac_136457> | 4,096 | 25 Hz fine audio (even frames) |
| SNAC Level 1 odd | <snac_144650> – <snac_148745> | 4,096 | 25 Hz fine audio (odd frames) |
| SNAC wrappers | <snac>, </snac> | 2 | Block delimiters |

Total VLA tokens added: 106,228
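
As a quick sanity check, the per-category counts in the table above can be summed to confirm both totals. This is only an arithmetic sketch; the dictionary keys are illustrative labels, not identifiers from the build script.

```python
# Per-category counts, copied from the table above.
# The dictionary keys are illustrative labels only.
N_JOINTS = 17

vla_counts = {
    "seed2": 8192,
    "cosmos": 64000,
    "avclm": 8192,
    "agent": 256,
    "fps": 60,
    "joint_position": N_JOINTS * 3 * 256,    # x, y, z per joint
    "joint_time": N_JOINTS * 8,              # frame index 0..7 per joint
    "modality_wrappers": 12 + 2 * N_JOINTS,  # 12 modality tags + joint open/close tags
    "snac_codes": 3 * 4096,                  # level 0, level 1 even, level 1 odd
    "snac_wrappers": 2,
}

QWEN3_BASE = 151_669
vla_total = sum(vla_counts.values())

print(vla_total)               # 106228
print(QWEN3_BASE + vla_total)  # 257897
```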


17 Named Joints (H36M skeleton)

pelvis · r_hip · r_knee · r_ankle · l_hip · l_knee · l_ankle · spine · thorax · nose · head_top · l_shoulder · l_elbow · l_wrist · r_shoulder · r_elbow · r_wrist
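
The token table gives the coordinate range ([-2 m, +2 m]) and the number of levels (256) for joint positions, but not the exact binning rule. The sketch below assumes uniform bins with clipping at the range limits. `quantize`, `dequantize`, and `pose_tokens` are hypothetical helper names, and the actual dataset pipeline may round differently.

```python
import math

LO, HI, LEVELS = -2.0, 2.0, 256  # metres; range and level count from the token table
BIN = (HI - LO) / LEVELS         # 0.015625 m per bin under the uniform-bin assumption


def quantize(value_m: float) -> int:
    """Map a coordinate in metres to a bin index in 0..255 (assumed uniform bins)."""
    index = math.floor((value_m - LO) / BIN)
    return min(LEVELS - 1, max(0, index))  # clip out-of-range values


def dequantize(index: int) -> float:
    """Return the centre of a bin, in metres."""
    return LO + (index + 0.5) * BIN


def pose_tokens(joint: str, t: int, xyz: tuple[float, float, float]) -> list[str]:
    """Format one joint sample as time, x, y, z tokens."""
    x, y, z = (quantize(v) for v in xyz)
    return [f"<{joint}_t_{t}>", f"<{joint}_x_{x}>", f"<{joint}_y_{y}>", f"<{joint}_z_{z}>"]


print(pose_tokens("pelvis", 0, (0.0, 0.0, 0.0)))
# ['<pelvis_t_0>', '<pelvis_x_128>', '<pelvis_y_128>', '<pelvis_z_128>']
```

Under this assumption each bin spans about 1.6 cm, so the round-trip error for in-range values is at most half a bin (about 0.8 cm).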


Usage

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-qwen3")
print(len(tok))  # 257897

# All VLA tokens are atomic
print(tok.encode("<seed2_1137>",   add_special_tokens=False))  # single ID
print(tok.encode("<pelvis_x_128>", add_special_tokens=False))  # single ID
print(tok.encode("<fps_30>",       add_special_tokens=False))  # single ID
print(tok.encode("<snac_128266>",  add_special_tokens=False))  # single ID
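
To check atomicity across more than a few hand-picked tokens, a small helper can collect every token that does not encode to exactly one ID. This is a sketch: `non_atomic_tokens` is a hypothetical helper name, and it relies only on the `encode(..., add_special_tokens=False)` call used above.

```python
def non_atomic_tokens(tokenizer, tokens):
    """Return the tokens that do not encode to exactly one ID."""
    return [
        t for t in tokens
        if len(tokenizer.encode(t, add_special_tokens=False)) != 1
    ]


# A sample spanning several categories; extend as needed.
sample = (
    [f"<seed2_{i}>" for i in (0, 8191)]
    + [f"<cosmos_{i}>" for i in (0, 63999)]
    + [f"<fps_{i}>" for i in (1, 60)]
    + ["<pelvis_x_0>", "<r_wrist_t_7>", "<snac>", "</snac>"]
)

# With `tok` loaded as above, an empty list means every sampled token is atomic:
# print(non_atomic_tokens(tok, sample))
```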

How it was created

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")  # or any Qwen3 variant

# Build all 106,228 VLA tokens
vla_tokens = []

# Modality wrappers
vla_tokens += ["<seed2>", "</seed2>", "<cosmos>", "</cosmos>",
                "<avc_lm>", "</avc_lm>", "<agent>", "</agent>",
                "<start_cosmo>", "</start_cosmo>", "<start_avclm>", "</start_avclm>"]
# Joint names (H36M skeleton), shared by the wrapper and pose-token loops
JOINTS = ["pelvis", "r_hip", "r_knee", "r_ankle", "l_hip", "l_knee", "l_ankle",
          "spine", "thorax", "nose", "head_top", "l_shoulder", "l_elbow", "l_wrist",
          "r_shoulder", "r_elbow", "r_wrist"]
# Joint wrappers
for joint in JOINTS:
    vla_tokens += [f"<{joint}>", f"</{joint}>"]
# Video tokens
vla_tokens += [f"<agent_{i}>" for i in range(256)]
vla_tokens += [f"<avclm_{i}>" for i in range(8192)]
vla_tokens += [f"<seed2_{i}>" for i in range(8192)]
vla_tokens += [f"<cosmos_{i}>" for i in range(64000)]
# Pose tokens
vla_tokens += [f"<fps_{i}>" for i in range(1, 61)]
for joint in JOINTS:  # 17 joints
    vla_tokens += [f"<{joint}_x_{n}>" for n in range(256)]
    vla_tokens += [f"<{joint}_y_{n}>" for n in range(256)]
    vla_tokens += [f"<{joint}_z_{n}>" for n in range(256)]
    vla_tokens += [f"<{joint}_t_{n}>" for n in range(8)]
# SNAC tokens
vla_tokens += ["<snac>", "</snac>"]
vla_tokens += [f"<snac_{i + 128266}>" for i in range(4096)]
vla_tokens += [f"<snac_{i + 132362}>" for i in range(4096)]
vla_tokens += [f"<snac_{i + 144650}>" for i in range(4096)]

tok.add_tokens(vla_tokens, special_tokens=True)  # all atomic
tok.save_pretrained("tokenizer-vla-qwen3")
# vocab size: 257,897

Full script: tools/build_tokenizers.py in the finevideo-vla repo.


Interleaved token sequence format

USER: <activity description> [Speech: ...]  ASSISTANT:
<seed2_6750> <seed2_680> ...                    # semantic keyframes 1 FPS
<cosmos> <cosmos_63127> <cosmos_42647> ... </cosmos>  # spatial video every 8 frames
<avc_lm> <avclm_263> <avclm_107> ... </avc_lm>
<agent>
  <fps_30>
  <pelvis> <pelvis_t_0> <pelvis_x_128> <pelvis_y_128> <pelvis_z_128> ... </pelvis>
  ... 17 joints ...
</agent>
<snac> <snac_131580> <snac_134777> <snac_147244> ... </snac>
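
The three SNAC tokens in the example fall into different ID ranges from the token table, one per SNAC category. A minimal sketch of recovering the category and the position within its range from a token string follows. `SNAC_RANGES` and `parse_snac` are illustrative names, and treating the offset within each range as the codebook index is an assumption based on the 4,096-entry ranges.

```python
import re

# (label, first ID, last ID), taken from the token table
SNAC_RANGES = [
    ("level0", 128266, 132361),
    ("level1_even", 132362, 136457),
    ("level1_odd", 144650, 148745),
]

_SNAC_RE = re.compile(r"<snac_(\d+)>")


def parse_snac(token: str) -> tuple[str, int]:
    """Return (label, offset within that label's ID range) for a SNAC code token."""
    match = _SNAC_RE.fullmatch(token)
    if match is None:
        raise ValueError(f"not a SNAC code token: {token!r}")
    snac_id = int(match.group(1))
    for label, first, last in SNAC_RANGES:
        if first <= snac_id <= last:
            return label, snac_id - first
    raise ValueError(f"SNAC ID {snac_id} is outside the ranges in the token table")


for tok_str in ["<snac_131580>", "<snac_134777>", "<snac_147244>"]:
    print(tok_str, parse_snac(tok_str))
# <snac_131580> ('level0', 3314)
# <snac_134777> ('level1_even', 2415)
# <snac_147244> ('level1_odd', 2594)
```

IDs between 136458 and 144649 are not part of this tokenizer, so the helper raises for them rather than guessing a level.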

Related

| Resource | Link |
|---|---|
| GPT-NeoX v2 tokenizer (+ SNAC) | tokenizer-vla-adaptive-v2 |
| Original GPT-NeoX v1 tokenizer | tokenizer-vla-adaptive |
| VLA model (trained with v1) | vla-1.7b-pab-spline-adaptive |
| FineVideo-VLA dataset | FineVideo-Phase7-Flattened |