VLA Tokenizer — Qwen3

Extended Qwen3 tokenizer for the FineVideo-VLA multimodal dataset. Adds all VLA tokens (video, 3D pose, SNAC audio) on top of the Qwen3 base tokenizer.

Vocab size: 257,897 (151,669 Qwen3 base + 106,228 VLA tokens)

Why Qwen3? The Qwen3 family has strong multilingual and reasoning abilities and is natively supported across the HuggingFace ecosystem and common inference stacks (vLLM, llama.cpp, transformers). This tokenizer lets you train a Qwen3-based VLA model from scratch, or continue pretraining an existing Qwen3 checkpoint, with full VLA token support.

Token categories added on top of Qwen3

| Category | Format | Count | Notes |
|---|---|---|---|
| Seed2 visual | `<seed2_N>` (N: 0–8191) | 8,192 | Semantic keyframe tokens, 1 FPS |
| Cosmos spatial | `<cosmos_N>` (N: 0–63999) | 64,000 | Spatial video tokens, every 8 frames |
| AVC-LM H.264 | `<avclm_N>` (N: 0–8191) | 8,192 | H.264 BPE tokens, every 8 frames |
| Agent legacy | `<agent_N>` (N: 0–255) | 256 | Legacy opaque agent tokens |
| FPS prefix | `<fps_N>` (N: 1–60) | 60 | Frame rate marker per chunk |
| Joint position | `<{joint}_x/y/z_N>` (N: 0–255) | 13,056 | Quantized xyz, maps [-2 m, +2 m] |
| Joint time | `<{joint}_t_N>` (N: 0–7) | 136 | Frame index within 8-frame window |
| Modality wrappers | `<seed2>`, `</agent>`, etc. | 46 | Open/close tags + joint wrappers |
| SNAC Level 0 | `<snac_128266>` – `<snac_132361>` | 4,096 | 12.5 Hz coarse audio |
| SNAC Level 1 even | `<snac_132362>` – `<snac_136457>` | 4,096 | 25 Hz fine audio (even frames) |
| SNAC Level 1 odd | `<snac_144650>` – `<snac_148745>` | 4,096 | 25 Hz fine audio (odd frames) |
| SNAC wrappers | `<snac>`, `</snac>` | 2 | Block delimiters |

Total VLA tokens added: 106,228
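
The totals can be re-derived from the per-category counts. The following is a minimal sketch in plain Python. It recomputes the joint-dependent rows from the 17 joints rather than copying them. The dictionary keys are illustrative labels, not names from the build script.

```python
NUM_JOINTS = 17        # H36M skeleton
QWEN3_BASE = 151_669   # base tokenizer size stated above

vla_counts = {
    "seed2": 8_192,
    "cosmos": 64_000,
    "avclm": 8_192,
    "agent": 256,
    "fps": 60,                                  # <fps_1> .. <fps_60>
    "joint_position": NUM_JOINTS * 3 * 256,     # x, y, z with 256 bins each
    "joint_time": NUM_JOINTS * 8,               # 8 frame indices per joint
    "modality_wrappers": 12 + 2 * NUM_JOINTS,   # 12 modality tags + open/close per joint
    "snac_codes": 3 * 4_096,                    # Level 0, Level 1 even, Level 1 odd
    "snac_wrappers": 2,
}

vla_total = sum(vla_counts.values())
vocab_size = QWEN3_BASE + vla_total
print(vla_total, vocab_size)  # 106228 257897
```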

17 Named Joints (H36M skeleton)

pelvis · r_hip · r_knee · r_ankle · l_hip · l_knee · l_ankle · spine · thorax · nose · head_top · l_shoulder · l_elbow · l_wrist · r_shoulder · r_elbow · r_wrist
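
The table gives the joint position range ([-2m, +2m]) and the bin count (256), but not the exact quantization rule. The sketch below assumes uniform bins with out-of-range values clamped. The function names `quantize`, `dequantize`, and `joint_token` are hypothetical and may not match the dataset's actual encoder.

```python
import math

LO, HI, BINS = -2.0, 2.0, 256  # range in metres and bin count from the table

def quantize(value_m: float) -> int:
    """Map a coordinate in metres to a bin index in 0..255 (assumed uniform bins)."""
    clamped = min(max(value_m, LO), HI)
    idx = math.floor((clamped - LO) / (HI - LO) * BINS)
    return min(idx, BINS - 1)  # HI itself falls into the last bin

def dequantize(idx: int) -> float:
    """Return the centre of bin idx in metres (inverse under the same assumption)."""
    return LO + (idx + 0.5) * (HI - LO) / BINS

def joint_token(joint: str, axis: str, value_m: float) -> str:
    """Format one position token, e.g. joint='pelvis', axis='x'."""
    return f"<{joint}_{axis}_{quantize(value_m)}>"

print(joint_token("pelvis", "x", 0.0))  # <pelvis_x_128>
```

Under this assumption each bin is 4 m / 256 ≈ 1.56 cm wide, so the round-trip error is at most about 0.78 cm per axis.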

Usage

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-qwen3")
print(len(tok))  # 257897

# All VLA tokens are atomic
print(tok.encode("<seed2_1137>",   add_special_tokens=False))  # single ID
print(tok.encode("<pelvis_x_128>", add_special_tokens=False))  # single ID
print(tok.encode("<fps_30>",       add_special_tokens=False))  # single ID
print(tok.encode("<snac_128266>",  add_special_tokens=False))  # single ID

How it was created

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")  # or any Qwen3 variant

# Build all 106,228 VLA tokens
vla_tokens = []

# Modality wrappers
vla_tokens += ["<seed2>", "</seed2>", "<cosmos>", "</cosmos>",
                "<avc_lm>", "</avc_lm>", "<agent>", "</agent>",
                "<start_cosmo>", "</start_cosmo>", "<start_avclm>", "</start_avclm>"]
# Joint wrappers (17 H36M joints, reused below for the pose tokens)
JOINTS = ["pelvis", "r_hip", "r_knee", "r_ankle", "l_hip", "l_knee", "l_ankle",
          "spine", "thorax", "nose", "head_top", "l_shoulder", "l_elbow", "l_wrist",
          "r_shoulder", "r_elbow", "r_wrist"]
for joint in JOINTS:
    vla_tokens += [f"<{joint}>", f"</{joint}>"]
# Video tokens
vla_tokens += [f"<agent_{i}>" for i in range(256)]
vla_tokens += [f"<avclm_{i}>" for i in range(8192)]
vla_tokens += [f"<seed2_{i}>" for i in range(8192)]
vla_tokens += [f"<cosmos_{i}>" for i in range(64000)]
# Pose tokens
vla_tokens += [f"<fps_{i}>" for i in range(1, 61)]
for joint in JOINTS:
    vla_tokens += [f"<{joint}_x_{n}>" for n in range(256)]  # quantized x
    vla_tokens += [f"<{joint}_y_{n}>" for n in range(256)]  # quantized y
    vla_tokens += [f"<{joint}_z_{n}>" for n in range(256)]  # quantized z
    vla_tokens += [f"<{joint}_t_{n}>" for n in range(8)]    # frame index in 8-frame window
# SNAC tokens
vla_tokens += ["<snac>", "</snac>"]
vla_tokens += [f"<snac_{i + 128266}>" for i in range(4096)]
vla_tokens += [f"<snac_{i + 132362}>" for i in range(4096)]
vla_tokens += [f"<snac_{i + 144650}>" for i in range(4096)]

tok.add_tokens(vla_tokens, special_tokens=True)  # all atomic
tok.save_pretrained("tokenizer-vla-qwen3")
# vocab size: 257,897

Full script: tools/build_tokenizers.py in the finevideo-vla repo.
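
SNAC token names carry an absolute offset: each of the three groups is a contiguous block of 4,096 codes starting at the base values used in the build script. The sketch below converts between a (group, code) pair and the token string. The group labels and the helper names `snac_token` and `snac_decode` are hypothetical and are not part of the published script.

```python
SNAC_CODEBOOK_SIZE = 4096

# First index of each SNAC group, as used in the build script.
SNAC_BASE = {
    "level0": 128266,
    "level1_even": 132362,
    "level1_odd": 144650,
}

def snac_token(group: str, code: int) -> str:
    """Return the token string for a codebook entry (0..4095) in the given group."""
    if not 0 <= code < SNAC_CODEBOOK_SIZE:
        raise ValueError(f"code must be in 0..{SNAC_CODEBOOK_SIZE - 1}, got {code}")
    return f"<snac_{SNAC_BASE[group] + code}>"

def snac_decode(token: str) -> tuple[str, int]:
    """Recover (group, code) from a token string such as '<snac_131580>'."""
    index = int(token[len("<snac_"):-1])
    for group, base in SNAC_BASE.items():
        if base <= index < base + SNAC_CODEBOOK_SIZE:
            return group, index - base
    raise ValueError(f"{token} is outside the three SNAC ranges")

print(snac_token("level0", 0))        # <snac_128266>
print(snac_decode("<snac_147244>"))   # ('level1_odd', 2594)
```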

Interleaved token sequence format

USER: <activity description> [Speech: ...]  ASSISTANT:
<seed2_6750> <seed2_680> ...                    # semantic keyframes 1 FPS
<cosmos> <cosmos_63127> <cosmos_42647> ... </cosmos>   # spatial video every 8 frames
<avc_lm> <avclm_263> <avclm_107> ... </avc_lm>
<agent>
  <fps_30>
  <pelvis> <pelvis_t_0> <pelvis_x_128> <pelvis_y_128> <pelvis_z_128> ... </pelvis>
  ... 17 joints ...
</agent>
<snac> <snac_131580> <snac_134777> <snac_147244> ... </snac>
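
To inspect a sequence in this format without loading the tokenizer, the tokens can be split with a regular expression and grouped by family. This is a rough sketch: the `sample` string is a shortened, hand-written sequence in the style shown above, and the names `TOKEN_RE`, `INDEXED_RE`, and `family` are hypothetical.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"</?[A-Za-z0-9_]+>")         # any <tag> or </tag>
INDEXED_RE = re.compile(r"<([a-z0-9_]+?)_(\d+)>")   # tags that end in _<number>
INDEXED_FAMILIES = {"seed2", "cosmos", "avclm", "agent", "fps", "snac"}

def family(token: str) -> str:
    """Classify a token as a wrapper, one of the indexed families, or a pose token."""
    match = INDEXED_RE.fullmatch(token)
    if match is None:
        return "wrapper"            # e.g. <agent>, </snac>, <pelvis>
    name = match.group(1)           # e.g. "seed2" or "pelvis_x"
    return name if name in INDEXED_FAMILIES else "pose"

sample = (
    "<seed2_6750> <seed2_680> "
    "<cosmos> <cosmos_63127> <cosmos_42647> </cosmos> "
    "<avc_lm> <avclm_263> <avclm_107> </avc_lm> "
    "<agent> <fps_30> <pelvis> <pelvis_t_0> <pelvis_x_128> "
    "<pelvis_y_128> <pelvis_z_128> </pelvis> </agent> "
    "<snac> <snac_131580> <snac_134777> <snac_147244> </snac>"
)

tokens = TOKEN_RE.findall(sample)
counts = Counter(family(t) for t in tokens)
print(len(tokens), dict(counts))
```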

| Resource | Link |
|---|---|
| GPT-NeoX v2 tokenizer (+ SNAC) | tokenizer-vla-adaptive-v2 |
| Original GPT-NeoX v1 tokenizer | tokenizer-vla-adaptive |
| VLA model (trained with v1) | vla-1.7b-pab-spline-adaptive |
| FineVideo-VLA dataset | FineVideo-Phase7-Flattened |
