VLA Tokenizer — Adaptive v2 (GPT-NeoX-20b + SNAC)

Extended GPT-NeoX-20b tokenizer for the FineVideo-VLA multimodal dataset. Adds 3D human pose tokens, video tokens, and SNAC audio tokens on top of the EleutherAI/gpt-neox-20b base.

Vocab size: 156,505 (50,277 base + 93,938 VLA + 12,290 SNAC)

v1 → v2 change: Added 12,290 SNAC audio tokens (<snac>, </snac>, and 12,288 <snac_N> tokens) for the SNAC listen format used in MixtureVitae-Omni and FineVideo-VLA audio tokenization. All existing v1 token IDs are unchanged.


Token categories

| Category | Format | Count | Notes |
|---|---|---|---|
| Seed2 visual | <seed2_N> (N: 0–8191) | 8,192 | Semantic keyframe tokens, 1 FPS |
| Cosmos spatial | <cosmos_N> (N: 0–63999) | 64,000 | Spatial video tokens, every 8 frames |
| AVC-LM H.264 | <avclm_N> (N: 0–8191) | 8,192 | H.264 BPE tokens, every 8 frames |
| Agent legacy | <agent_N> (N: 0–255) | 256 | Legacy opaque agent tokens |
| FPS prefix | <fps_N> (N: 1–60) | 60 | Frame rate marker per chunk |
| Joint position | <{joint}_x/y/z_N> (N: 0–255) | 13,056 | Quantized xyz, maps [-2 m, +2 m] |
| Joint time | <{joint}_t_N> (N: 0–7) | 136 | Frame index within 8-frame window |
| Modality wrappers | <seed2>, </agent>, etc. | 46 | Open/close tags + joint wrappers |
| SNAC Level 0 | <snac_128266> – <snac_132361> | 4,096 | 12.5 Hz coarse audio |
| SNAC Level 1 even | <snac_132362> – <snac_136457> | 4,096 | 25 Hz fine audio (even frames) |
| SNAC Level 1 odd | <snac_144650> – <snac_148745> | 4,096 | 25 Hz fine audio (odd frames) |
| SNAC wrappers | <snac>, </snac> | 2 | Block delimiters |

Total new tokens: 106,228 (93,938 VLA + 12,290 SNAC)
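
As a quick sanity check, the totals above can be re-derived from the per-category counts. The sketch below copies the counts from the token table. The factorizations in the comments (17 joints x 3 axes x 256 bins, 17 joints x 8 frame indices) are inferred from the stated ranges and are not taken from the build script.

```python
# Per-category counts copied from the token table.
NUM_JOINTS = 17  # H36M skeleton

vla_counts = {
    "seed2": 8192,
    "cosmos": 64000,
    "avclm": 8192,
    "agent_legacy": 256,
    "fps": 60,
    "joint_position": NUM_JOINTS * 3 * 256,  # inferred: joints x (x, y, z) x bins
    "joint_time": NUM_JOINTS * 8,            # inferred: joints x frame indices 0..7
    "wrappers": 46,
}
snac_counts = {
    "level0": 4096,
    "level1_even": 4096,
    "level1_odd": 4096,
    "wrappers": 2,
}

BASE_VOCAB = 50277  # EleutherAI/gpt-neox-20b base size as stated in this card

vla_total = sum(vla_counts.values())
snac_total = sum(snac_counts.values())
new_total = vla_total + snac_total
vocab_size = BASE_VOCAB + new_total

print(vla_total, snac_total, new_total, vocab_size)
```

The four printed values should match the figures quoted in this card: 93,938 VLA tokens, 12,290 SNAC tokens, 106,228 new tokens, and a 156,505-entry vocabulary.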


17 Named Joints (H36M skeleton)

pelvis · r_hip · r_knee · r_ankle · l_hip · l_knee · l_ankle · spine · thorax · nose · head_top · l_shoulder · l_elbow · l_wrist · r_shoulder · r_elbow · r_wrist
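
The joint tokens combine one of the 17 joint names with a time index (0–7) or a quantized coordinate (0–255 over [-2 m, +2 m]). A minimal sketch of building and parsing such tokens follows. The uniform 256-bin mapping with clipping and the bin-center decoding are assumptions made for illustration; the exact quantizer used for the dataset is not specified here. The helper names (`quantize`, `dequantize`, `joint_tokens`, `parse_joint_token`) are hypothetical.

```python
import re

JOINTS = [
    "pelvis", "r_hip", "r_knee", "r_ankle", "l_hip", "l_knee", "l_ankle",
    "spine", "thorax", "nose", "head_top", "l_shoulder", "l_elbow",
    "l_wrist", "r_shoulder", "r_elbow", "r_wrist",
]
LO, HI, BINS = -2.0, 2.0, 256  # range in meters and bin count from the token table


def quantize(value_m: float) -> int:
    """Map meters to a bin index in [0, 255]. ASSUMPTION: uniform bins, clipped."""
    clipped = min(max(value_m, LO), HI)
    return min(BINS - 1, int((clipped - LO) / (HI - LO) * BINS))


def dequantize(index: int) -> float:
    """Return the bin center in meters. ASSUMPTION: center-of-bin decoding."""
    return LO + (index + 0.5) * (HI - LO) / BINS


def joint_tokens(joint: str, t: int, xyz_m) -> list:
    """Build the time token plus x/y/z position tokens for one joint sample."""
    if joint not in JOINTS:
        raise ValueError(f"unknown joint: {joint}")
    if not 0 <= t <= 7:
        raise ValueError("frame index must be in 0..7")
    x, y, z = (quantize(v) for v in xyz_m)
    return [f"<{joint}_t_{t}>", f"<{joint}_x_{x}>", f"<{joint}_y_{y}>", f"<{joint}_z_{z}>"]


TOKEN_RE = re.compile(r"<([a-z_]+?)_([xyzt])_(\d+)>")


def parse_joint_token(token: str):
    """Split a joint token into (joint, field, value); None if it is not one."""
    m = TOKEN_RE.fullmatch(token)
    if m is None or m.group(1) not in JOINTS:
        return None
    return m.group(1), m.group(2), int(m.group(3))


print(joint_tokens("pelvis", 0, (0.0, 0.0, 0.0)))
print(parse_joint_token("<r_hip_x_115>"))
```

Under this assumed mapping each bin is 4 m / 256 ≈ 1.6 cm wide, and a coordinate of 0 m falls in bin 128.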


Token format in context

Each 8-frame chunk in the interleaved sequence looks like:

<cosmos> <cosmos_63127> <cosmos_42647> ... </cosmos>
<avc_lm> <avclm_263> <avclm_107> ... </avc_lm>
<agent>
  <fps_30>
  <pelvis> <pelvis_t_0> <pelvis_x_128> <pelvis_y_128> <pelvis_z_128>
           <pelvis_t_7> <pelvis_x_129> <pelvis_y_128> <pelvis_z_128> </pelvis>
  <r_hip>  <r_hip_t_0>  <r_hip_x_115>  ...  </r_hip>
  ... 17 joints total ...
</agent>
<snac> <snac_131580> <snac_134777> <snac_147244>
       <snac_131267> <snac_135192> <snac_148152>
       <snac_128995> <snac_133704> <snac_145875> </snac>

SNAC listen format: 3 tokens per 12.5 Hz base frame (L0 + L1_even + L1_odd), i.e. 37.5 tokens/sec, or ~9–10 tokens per 8-frame chunk at 30 FPS.
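
The interleaving and the rate arithmetic can be sketched as follows. The three offsets come from the token table. Two things are assumptions for illustration: that a raw codebook index `c` (0–4095) maps to `<snac_{offset + c}>`, and that level 1 supplies two codes per level-0 code, ordered even then odd. `snac_listen_tokens` is a hypothetical helper, not part of the dataset tooling.

```python
# Start of each SNAC token range, from the token table.
L0_OFFSET, L1_EVEN_OFFSET, L1_ODD_OFFSET = 128266, 132362, 144650
CODEBOOK_SIZE = 4096
L0_RATE_HZ = 12.5  # base (coarse) frame rate


def snac_listen_tokens(l0, l1):
    """Emit L0, L1_even, L1_odd per base frame, wrapped in <snac> ... </snac>.

    l0: level-0 codes, one per base frame.
    l1: level-1 codes, two per base frame (ASSUMPTION: even first, then odd).
    """
    if len(l1) != 2 * len(l0):
        raise ValueError("expected two level-1 codes per level-0 code")
    tokens = ["<snac>"]
    for i, c0 in enumerate(l0):
        even, odd = l1[2 * i], l1[2 * i + 1]
        for code in (c0, even, odd):
            if not 0 <= code < CODEBOOK_SIZE:
                raise ValueError(f"code out of range: {code}")
        tokens += [
            f"<snac_{L0_OFFSET + c0}>",
            f"<snac_{L1_EVEN_OFFSET + even}>",
            f"<snac_{L1_ODD_OFFSET + odd}>",
        ]
    tokens.append("</snac>")
    return tokens


# Rate arithmetic: 3 tokens per base frame at 12.5 Hz.
tokens_per_second = 3 * L0_RATE_HZ
chunk_seconds = 8 / 30  # one 8-frame chunk at 30 FPS
avg_tokens_per_chunk = tokens_per_second * chunk_seconds

print(tokens_per_second, round(avg_tokens_per_chunk, 2))
print(snac_listen_tokens([3314], [2415, 2594]))
```

At 30 FPS an 8-frame chunk spans 8/30 s, or about 3.3 base frames, so the average is 10 tokens per chunk. The example chunk above holds 3 base frames (9 tokens).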


Usage

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-adaptive-v2")
print(len(tok))  # 156505

# All VLA and SNAC tokens are single atomic tokens
print(tok.encode("<seed2_1137>",   add_special_tokens=False))  # [59908]
print(tok.encode("<pelvis_x_128>", add_special_tokens=False))  # [131151]
print(tok.encode("<fps_30>",       add_special_tokens=False))  # [130992]
print(tok.encode("<snac_128266>",  add_special_tokens=False))  # single ID
print(tok.encode("<snac_132362>",  add_special_tokens=False))  # single ID
print(tok.encode("<snac_144650>",  add_special_tokens=False))  # single ID
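
To check atomicity over more than a handful of tokens, a small helper can test the first and last token of every range. This is a sketch: `non_atomic_tokens` and `boundary_tokens` are hypothetical names, and the only tokenizer method used is `encode(..., add_special_tokens=False)`, as in the example above.

```python
def non_atomic_tokens(tokenizer, tokens):
    """Return the tokens that do not encode to exactly one ID."""
    bad = []
    for token in tokens:
        ids = tokenizer.encode(token, add_special_tokens=False)
        if len(ids) != 1:
            bad.append(token)
    return bad


def boundary_tokens():
    """First and last token of each range listed in the token table."""
    return [
        "<seed2_0>", "<seed2_8191>",
        "<cosmos_0>", "<cosmos_63999>",
        "<avclm_0>", "<avclm_8191>",
        "<agent_0>", "<agent_255>",
        "<fps_1>", "<fps_60>",
        "<pelvis_x_0>", "<r_wrist_z_255>",
        "<pelvis_t_0>", "<r_wrist_t_7>",
        "<snac_128266>", "<snac_132361>",
        "<snac_132362>", "<snac_136457>",
        "<snac_144650>", "<snac_148745>",
        "<snac>", "</snac>",
    ]


# Usage (downloads the tokenizer, so it needs network access):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-adaptive-v2")
#   print(non_atomic_tokens(tok, boundary_tokens()))  # an empty list means all are atomic
```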

How it was created

from transformers import AutoTokenizer

# Start from existing v1 tokenizer (144,215 vocab)
tok = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-adaptive")

snac_tokens = ["<snac>", "</snac>"]
snac_tokens += [f"<snac_{i + 128266}>" for i in range(4096)]  # L0
snac_tokens += [f"<snac_{i + 132362}>" for i in range(4096)]  # L1 even
snac_tokens += [f"<snac_{i + 144650}>" for i in range(4096)]  # L1 odd

tok.add_tokens(snac_tokens, special_tokens=True)  # all atomic
tok.save_pretrained("tokenizer-vla-adaptive-v2")
# vocab size: 156,505

Script: tools/build_tokenizers.py in the finevideo-vla repo.
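
The claim that all v1 token IDs are unchanged can be checked by comparing the two vocabularies. The sketch below works on plain token-to-ID dictionaries, such as those returned by a tokenizer's `get_vocab()`. The helper names are hypothetical and this check is not necessarily part of the build script.

```python
def changed_ids(old_vocab: dict, new_vocab: dict) -> list:
    """Return tokens whose ID changed or which disappeared in the new vocabulary."""
    return sorted(
        token for token, old_id in old_vocab.items()
        if new_vocab.get(token) != old_id
    )


def added_tokens(old_vocab: dict, new_vocab: dict) -> list:
    """Return tokens present only in the new vocabulary, ordered by ID."""
    extra = [token for token in new_vocab if token not in old_vocab]
    return sorted(extra, key=new_vocab.__getitem__)


# Usage (downloads both tokenizers, so it needs network access):
#   from transformers import AutoTokenizer
#   v1 = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-adaptive").get_vocab()
#   v2 = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-adaptive-v2").get_vocab()
#   print(changed_ids(v1, v2))        # expected to be empty if v1 IDs are preserved
#   print(len(added_tokens(v1, v2)))  # expected to equal the number of SNAC tokens
```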


Related

| Resource | Link |
|---|---|
| v1 tokenizer (no SNAC) | tokenizer-vla-adaptive |
| Qwen3-based version | tokenizer-vla-qwen3 |
| VLA model trained with v1 | vla-1.7b-pab-spline-adaptive |
| FineVideo-VLA dataset | FineVideo-Phase7-Flattened |