VLA Tokenizer — Adaptive v2 (GPT-NeoX-20b + SNAC)

Extended GPT-NeoX-20b tokenizer for the FineVideo-VLA multimodal dataset. Adds 3D human pose tokens, video tokens, and SNAC audio tokens on top of the EleutherAI/gpt-neox-20b base.

Vocab size: 156,505 (50,277 base + 93,938 VLA + 12,290 SNAC)

v1 → v2 change: Added 12,290 SNAC audio tokens (<snac>, </snac>, and 12,288 <snac_N> tokens) for the SNAC listen format used in MixtureVitae-Omni and FineVideo-VLA audio tokenization. All existing v1 token IDs are unchanged.

Token categories

Category	Format	Count	Notes
Seed2 visual	`<seed2_N>` (N: 0–8191)	8,192	Semantic keyframe tokens, 1 FPS
Cosmos spatial	`<cosmos_N>` (N: 0–63999)	64,000	Spatial video tokens, every 8 frames
AVC-LM H.264	`<avclm_N>` (N: 0–8191)	8,192	H.264 BPE tokens, every 8 frames
Agent legacy	`<agent_N>` (N: 0–255)	256	Legacy opaque agent tokens
FPS prefix	`<fps_N>` (N: 1–60)	60	Frame rate marker per chunk
Joint position	`<{joint}_x/y/z_N>` (N: 0–255)	13,056	Quantized x, y, z coordinates; 256 bins spanning [-2 m, +2 m]
Joint time	`<{joint}_t_N>` (N: 0–7)	136	Frame index within 8-frame window
Modality wrappers	`<seed2>`, `</agent>`, etc.	46	Open/close tags + joint wrappers
SNAC Level 0	`<snac_128266>` – `<snac_132361>`	4,096	12.5 Hz coarse audio
SNAC Level 1 even	`<snac_132362>` – `<snac_136457>`	4,096	25 Hz fine audio (even frames)
SNAC Level 1 odd	`<snac_144650>` – `<snac_148745>`	4,096	25 Hz fine audio (odd frames)
SNAC wrappers	`<snac>`, `</snac>`	2	Block delimiters

Total new tokens: 106,228 (93,938 VLA + 12,290 SNAC)
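
The totals can be re-derived from the per-category counts in the table. The sketch below is only an arithmetic check. The dictionary names are illustrative, and the decomposition of the joint counts into joints × axes × bins is inferred from the table rather than stated in it.

```python
# Per-category counts copied from the table above.
VLA_COUNTS = {
    "seed2": 8192,
    "cosmos": 64000,
    "avclm": 8192,
    "agent_legacy": 256,
    "fps": 60,
    "joint_position": 17 * 3 * 256,  # inferred: 17 joints x 3 axes x 256 bins
    "joint_time": 17 * 8,            # inferred: 17 joints x 8 frame indices
    "wrappers": 46,
}
SNAC_COUNTS = {
    "level0": 4096,
    "level1_even": 4096,
    "level1_odd": 4096,
    "wrappers": 2,
}
BASE_VOCAB = 50277  # GPT-NeoX-20b base size as stated above

vla_total = sum(VLA_COUNTS.values())
snac_total = sum(SNAC_COUNTS.values())
total_vocab = BASE_VOCAB + vla_total + snac_total

print(vla_total, snac_total, total_vocab)  # 93938 12290 156505
```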

17 Named Joints (H36M skeleton)

pelvis · r_hip · r_knee · r_ankle · l_hip · l_knee · l_ankle · spine · thorax · nose · head_top · l_shoulder · l_elbow · l_wrist · r_shoulder · r_elbow · r_wrist
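
As a rough illustration of how a joint sample could be turned into the joint position and joint time tokens listed in the table, the sketch below assumes uniform binning of [-2 m, +2 m] into 256 bins with clamping at the edges. The exact quantizer used by the dataset is not given here, so `quantize` and `joint_tokens` are hypothetical helpers, not the project's code.

```python
import math

JOINTS = [
    "pelvis", "r_hip", "r_knee", "r_ankle", "l_hip", "l_knee", "l_ankle",
    "spine", "thorax", "nose", "head_top", "l_shoulder", "l_elbow",
    "l_wrist", "r_shoulder", "r_elbow", "r_wrist",
]

def quantize(value_m, lo=-2.0, hi=2.0, bins=256):
    """Map a coordinate in metres to a bin index in [0, bins - 1].

    Assumption: uniform bins over [lo, hi], out-of-range values clamped.
    """
    idx = math.floor((value_m - lo) / (hi - lo) * bins)
    return max(0, min(bins - 1, idx))

def joint_tokens(joint, t, x, y, z):
    """Build the time token and the x/y/z tokens for one joint sample."""
    if joint not in JOINTS:
        raise ValueError(f"unknown joint: {joint}")
    if not 0 <= t <= 7:
        raise ValueError("frame index must be in 0..7")
    return [
        f"<{joint}_t_{t}>",
        f"<{joint}_x_{quantize(x)}>",
        f"<{joint}_y_{quantize(y)}>",
        f"<{joint}_z_{quantize(z)}>",
    ]

print(joint_tokens("pelvis", 0, 0.0, 0.0, 0.0))
```

Under this assumed binning, a coordinate of 0 m falls in bin 128 and anything at or beyond +2 m is clamped to bin 255.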

Token format in context

Each 8-frame chunk in the interleaved sequence looks like:

<cosmos> <cosmos_63127> <cosmos_42647> ... </cosmos>
<avc_lm> <avclm_263> <avclm_107> ... </avc_lm>
<agent>
  <fps_30>
  <pelvis> <pelvis_t_0> <pelvis_x_128> <pelvis_y_128> <pelvis_z_128>
           <pelvis_t_7> <pelvis_x_129> <pelvis_y_128> <pelvis_z_128> </pelvis>
  <r_hip>  <r_hip_t_0>  <r_hip_x_115>  ...  </r_hip>
  ... 17 joints total ...
</agent>
<snac> <snac_131580> <snac_134777> <snac_147244>
       <snac_131267> <snac_135192> <snac_148152>
       <snac_128995> <snac_133704> <snac_145875> </snac>

SNAC listen format: each 12.5 Hz base frame emits 3 tokens (L0 + L1_even + L1_odd), giving 37.5 tokens/sec, or about 9–10 tokens per 8-frame chunk at 30 FPS (8/30 s ≈ 0.27 s of audio).
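
The interleaving can be sketched as follows. The offsets are the ones in the token table. The assumptions are that raw SNAC codebook indices (0–4095) map to token names by adding the level offset, and that level-1 frames 2i and 2i+1 are the "even" and "odd" frames paired with level-0 frame i. `snac_listen_tokens` is a hypothetical helper name.

```python
L0_OFFSET = 128266       # SNAC Level 0
L1_EVEN_OFFSET = 132362  # SNAC Level 1, even frames
L1_ODD_OFFSET = 144650   # SNAC Level 1, odd frames
CODEBOOK_SIZE = 4096

def snac_listen_tokens(l0_codes, l1_codes):
    """Interleave codes as L0, L1_even, L1_odd per base frame, wrapped in <snac> tags.

    Assumption: l1_codes holds two level-1 codes for every level-0 code.
    """
    if len(l1_codes) != 2 * len(l0_codes):
        raise ValueError("expected two level-1 codes per level-0 code")
    out = ["<snac>"]
    for i, c0 in enumerate(l0_codes):
        triple = (
            (c0, L0_OFFSET),
            (l1_codes[2 * i], L1_EVEN_OFFSET),
            (l1_codes[2 * i + 1], L1_ODD_OFFSET),
        )
        for code, offset in triple:
            if not 0 <= code < CODEBOOK_SIZE:
                raise ValueError(f"code out of range: {code}")
            out.append(f"<snac_{code + offset}>")
    out.append("</snac>")
    return out

# Token-rate arithmetic from the text: 3 tokens per 12.5 Hz base frame.
tokens_per_second = 3 * 12.5
tokens_per_chunk = tokens_per_second * 8 / 30  # 8 video frames at 30 FPS

print(snac_listen_tokens([3314], [2415, 2594]))
print(tokens_per_second, tokens_per_chunk)
```

Because tokens come in whole triples, a single chunk holds 9 or 12 tokens in practice while the average works out to 10, which is consistent with the 9-token block in the example.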

Usage

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-adaptive-v2")
print(len(tok))  # 156505

# All VLA and SNAC tokens are single atomic tokens
print(tok.encode("<seed2_1137>",   add_special_tokens=False))  # [59908]
print(tok.encode("<pelvis_x_128>", add_special_tokens=False))  # [131151]
print(tok.encode("<fps_30>",       add_special_tokens=False))  # [130992]
print(tok.encode("<snac_128266>",  add_special_tokens=False))  # single ID
print(tok.encode("<snac_132362>",  add_special_tokens=False))  # single ID
print(tok.encode("<snac_144650>",  add_special_tokens=False))  # single ID
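
To spot-check atomicity across token families rather than one token at a time, a small helper along these lines may be useful. `find_non_atomic` and `PROBES` are illustrative names; the helper only relies on `encode(text, add_special_tokens=False)`, which is the call already used above.

```python
def find_non_atomic(tok, tokens):
    """Return the tokens that do not encode to exactly one ID."""
    return [
        t for t in tokens
        if len(tok.encode(t, add_special_tokens=False)) != 1
    ]

# One probe per token family, taken from the ranges in the token table.
PROBES = [
    "<seed2_0>", "<cosmos_63999>", "<avclm_8191>", "<agent_255>",
    "<fps_60>", "<r_wrist_z_255>", "<head_top_t_7>",
    "<snac>", "</snac>", "<snac_128266>", "<snac_148745>",
]
```

With the tokenizer loaded as above, `find_non_atomic(tok, PROBES)` would be expected to return an empty list if every probe is a single token.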

How it was created

from transformers import AutoTokenizer

# Start from existing v1 tokenizer (144,215 vocab)
tok = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-adaptive")

snac_tokens = ["<snac>", "</snac>"]
snac_tokens += [f"<snac_{i + 128266}>" for i in range(4096)]  # L0
snac_tokens += [f"<snac_{i + 132362}>" for i in range(4096)]  # L1 even
snac_tokens += [f"<snac_{i + 144650}>" for i in range(4096)]  # L1 odd

tok.add_tokens(snac_tokens, special_tokens=True)  # all atomic
tok.save_pretrained("tokenizer-vla-adaptive-v2")
# vocab size: 156,505

Script: tools/build_tokenizers.py in the finevideo-vla repo.
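
One way to confirm that appending the SNAC tokens left every v1 ID untouched is to compare the two vocabularies. This is a sketch with a hypothetical helper name, `changed_ids`; it works on the token-to-ID dictionaries that `get_vocab()` returns for each tokenizer.

```python
def changed_ids(old_vocab, new_vocab):
    """Return {token: (old_id, new_id)} for tokens whose ID changed.

    A token missing from new_vocab is reported with new_id None.
    """
    return {
        token: (old_id, new_vocab.get(token))
        for token, old_id in old_vocab.items()
        if new_vocab.get(token) != old_id
    }

# Toy vocabularies standing in for v1 and v2.
v1_vocab = {"hello": 0, "<fps_30>": 1}
v2_vocab = {"hello": 0, "<fps_30>": 1, "<snac>": 2, "</snac>": 3}
print(changed_ids(v1_vocab, v2_vocab))  # {}
```

For the real tokenizers, the call would be `changed_ids(v1.get_vocab(), v2.get_vocab())`, where `v1` and `v2` are the loaded v1 and v2 tokenizers; an empty result means all v1 IDs are preserved.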

Resource	Link
v1 tokenizer (no SNAC)	tokenizer-vla-adaptive
Qwen3-based version	tokenizer-vla-qwen3
VLA model trained with v1	vla-1.7b-pab-spline-adaptive
FineVideo-VLA dataset	FineVideo-Phase7-Flattened
