VLA Tokenizer — Adaptive v2 (GPT-NeoX-20b + SNAC)
Extended GPT-NeoX-20b tokenizer for the FineVideo-VLA multimodal dataset. Adds 3D human pose tokens, video tokens, and SNAC audio tokens on top of the EleutherAI/gpt-neox-20b base.
Vocab size: 156,505 (50,277 base + 93,938 VLA + 12,290 SNAC)
v1 → v2 change: Added 12,290 SNAC audio tokens (<snac>, </snac>, and 12,288 <snac_N> tokens) for the SNAC listen format used in MixtureVitae-Omni and FineVideo-VLA audio tokenization. All existing v1 token IDs are unchanged.
Token categories
| Category | Format | Count | Notes |
|---|---|---|---|
| Seed2 visual | <seed2_N> (N: 0–8191) | 8,192 | Semantic keyframe tokens, 1 FPS |
| Cosmos spatial | <cosmos_N> (N: 0–63999) | 64,000 | Spatial video tokens, every 8 frames |
| AVC-LM H.264 | <avclm_N> (N: 0–8191) | 8,192 | H.264 BPE tokens, every 8 frames |
| Agent legacy | <agent_N> (N: 0–255) | 256 | Legacy opaque agent tokens |
| FPS prefix | <fps_N> (N: 1–60) | 60 | Frame rate marker per chunk |
| Joint position | <{joint}_x/y/z_N> (N: 0–255) | 13,056 | Quantized xyz, maps [-2m, +2m] |
| Joint time | <{joint}_t_N> (N: 0–7) | 136 | Frame index within 8-frame window |
| Modality wrappers | <seed2>, </agent>, etc. | 46 | Open/close tags + joint wrappers |
| SNAC Level 0 | <snac_128266> – <snac_132361> | 4,096 | 12.5 Hz coarse audio |
| SNAC Level 1 even | <snac_132362> – <snac_136457> | 4,096 | 25 Hz fine audio (even frames) |
| SNAC Level 1 odd | <snac_144650> – <snac_148745> | 4,096 | 25 Hz fine audio (odd frames) |
| SNAC wrappers | <snac>, </snac> | 2 | Block delimiters |
Total new tokens: 106,228 (93,938 VLA + 12,290 SNAC)
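The totals can be re-derived from the per-category counts in the table. The sketch below only redoes that arithmetic. The dictionary names are illustrative, and the 17 × 3 × 256 and 17 × 8 factorizations are inferred from the 17-joint skeleton and the N ranges in the table, not taken from the build script.

```python
# Per-category counts copied from the table above.
BASE_VOCAB = 50_277  # base vocabulary size as stated in this README

VLA_COUNTS = {
    "seed2": 8_192,
    "cosmos": 64_000,
    "avclm": 8_192,
    "agent": 256,
    "fps": 60,
    "joint_position": 17 * 3 * 256,  # 17 joints x 3 axes x 256 bins = 13,056
    "joint_time": 17 * 8,            # 17 joints x 8 frame indices = 136
    "wrappers": 46,
}

SNAC_COUNTS = {
    "level0": 4_096,
    "level1_even": 4_096,
    "level1_odd": 4_096,
    "wrappers": 2,
}

vla_total = sum(VLA_COUNTS.values())
snac_total = sum(SNAC_COUNTS.values())
new_total = vla_total + snac_total
vocab_size = BASE_VOCAB + new_total

print(vla_total, snac_total, new_total, vocab_size)
# 93938 12290 106228 156505
```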
17 Named Joints (H36M skeleton)
pelvis · r_hip · r_knee · r_ankle · l_hip · l_knee · l_ankle ·
spine · thorax · nose · head_top · l_shoulder · l_elbow · l_wrist ·
r_shoulder · r_elbow · r_wrist
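As a minimal sketch of how joint tokens could be named and how a coordinate in metres might map to a bin index: the README states only that positions are quantized to 0–255 over [-2m, +2m], so the uniform floor-binning below is an assumption, and `quantize_position` and `position_token` are hypothetical helper names rather than functions from the finevideo-vla repo.

```python
JOINTS = [
    "pelvis", "r_hip", "r_knee", "r_ankle", "l_hip", "l_knee", "l_ankle",
    "spine", "thorax", "nose", "head_top", "l_shoulder", "l_elbow", "l_wrist",
    "r_shoulder", "r_elbow", "r_wrist",
]
AXES = ["x", "y", "z"]


def quantize_position(meters, lo=-2.0, hi=2.0, bins=256):
    """Map a coordinate in metres to a bin index in [0, bins - 1].

    ASSUMPTION: uniform bins over [lo, hi] with out-of-range values clamped.
    The actual quantizer used for the dataset may differ.
    """
    clamped = min(max(meters, lo), hi)
    idx = int((clamped - lo) / (hi - lo) * bins)
    return min(idx, bins - 1)  # the upper edge falls into the last bin


def position_token(joint, axis, meters):
    """Build a token string such as <pelvis_x_128>."""
    return f"<{joint}_{axis}_{quantize_position(meters)}>"


# Enumerate every joint position and joint time token name.
position_tokens = [f"<{j}_{a}_{n}>" for j in JOINTS for a in AXES for n in range(256)]
time_tokens = [f"<{j}_t_{n}>" for j in JOINTS for n in range(8)]

print(len(JOINTS), len(position_tokens), len(time_tokens))  # 17 13056 136
print(position_token("pelvis", "x", 0.0))                   # <pelvis_x_128>
```

Under this assumed binning a coordinate of 0 m lands in bin 128, the middle of the range.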
Token format in context
Each 8-frame chunk in the interleaved sequence looks like:
<cosmos> <cosmos_63127> <cosmos_42647> ... </cosmos>
<avc_lm> <avclm_263> <avclm_107> ... </avc_lm>
<agent>
<fps_30>
<pelvis> <pelvis_t_0> <pelvis_x_128> <pelvis_y_128> <pelvis_z_128>
<pelvis_t_7> <pelvis_x_129> <pelvis_y_128> <pelvis_z_128> </pelvis>
<r_hip> <r_hip_t_0> <r_hip_x_115> ... </r_hip>
... 17 joints total ...
</agent>
<snac> <snac_131580> <snac_134777> <snac_147244>
<snac_131267> <snac_135192> <snac_148152>
<snac_128995> <snac_133704> <snac_145875> </snac>
SNAC listen format: 3 tokens per base frame (L0 + L1_even + L1_odd). At 12.5 base frames per second that is 37.5 tokens/sec, or ~9–10 tokens per 8-frame chunk at 30 FPS.
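The token rate and the level of each SNAC token follow from the ID ranges in the table. The sketch below is illustrative only: `SNAC_RANGES` and `snac_level` are hypothetical names, and the ranges are copied from this README rather than read from the tokenizer.

```python
# Inclusive numeric ranges of the N in <snac_N>, from the token categories table.
SNAC_RANGES = {
    "L0": (128266, 132361),
    "L1_even": (132362, 136457),
    "L1_odd": (144650, 148745),
}


def snac_level(token):
    """Return which SNAC level a token like <snac_131580> belongs to."""
    n = int(token[len("<snac_"):-1])
    for level, (lo, hi) in SNAC_RANGES.items():
        if lo <= n <= hi:
            return level
    raise ValueError(f"{token} is outside the known SNAC ranges")


# The first base frame from the example block above.
frame = ["<snac_131580>", "<snac_134777>", "<snac_147244>"]
print([snac_level(t) for t in frame])  # ['L0', 'L1_even', 'L1_odd']

# Rate arithmetic: 12.5 base frames/sec, 3 tokens per base frame.
tokens_per_second = 12.5 * 3             # 37.5
chunk_seconds = 8 / 30                   # one 8-frame video chunk at 30 FPS
tokens_per_chunk = tokens_per_second * chunk_seconds
print(tokens_per_second, round(tokens_per_chunk, 2))
```

The average comes to about 10 tokens per chunk. The example block above carries 9, which is three whole base frames.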
Usage
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-adaptive-v2")
print(len(tok)) # 156505
# All VLA and SNAC tokens are single atomic tokens
print(tok.encode("<seed2_1137>", add_special_tokens=False)) # [59908]
print(tok.encode("<pelvis_x_128>", add_special_tokens=False)) # [131151]
print(tok.encode("<fps_30>", add_special_tokens=False)) # [130992]
print(tok.encode("<snac_128266>", add_special_tokens=False)) # single ID
print(tok.encode("<snac_132362>", add_special_tokens=False)) # single ID
print(tok.encode("<snac_144650>", add_special_tokens=False)) # single ID
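To check atomicity for many tokens at once, a small helper can collect every token that does not encode to exactly one ID. This is a sketch: `non_atomic_tokens` is a hypothetical name, and it assumes only that the tokenizer exposes `encode(text, add_special_tokens=False)` as used above.

```python
def non_atomic_tokens(tok, tokens):
    """Return the tokens that do not encode to exactly one ID."""
    return [
        t for t in tokens
        if len(tok.encode(t, add_special_tokens=False)) != 1
    ]


# Usage with the tokenizer loaded above (needs network access to the Hub):
# candidates = ["<snac>", "</snac>", "<fps_30>", "<pelvis_t_0>"]
# print(non_atomic_tokens(tok, candidates))  # an empty list means all are atomic
```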
How it was created
from transformers import AutoTokenizer
# Start from existing v1 tokenizer (144,215 vocab)
tok = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-adaptive")
snac_tokens = ["<snac>", "</snac>"]
snac_tokens += [f"<snac_{i + 128266}>" for i in range(4096)] # L0
snac_tokens += [f"<snac_{i + 132362}>" for i in range(4096)] # L1 even
snac_tokens += [f"<snac_{i + 144650}>" for i in range(4096)] # L1 odd
tok.add_tokens(snac_tokens, special_tokens=True) # all atomic
tok.save_pretrained("tokenizer-vla-adaptive-v2")
# vocab size: 156,505
Script: tools/build_tokenizers.py in the finevideo-vla repo.
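One way to confirm that existing v1 token IDs are unchanged is to compare the two vocabularies directly. The helper below is a sketch and not part of the build script. `changed_ids` is a hypothetical name, and it works on plain token-to-ID dictionaries such as those returned by a Hugging Face tokenizer's `get_vocab()`.

```python
def changed_ids(old_vocab, new_vocab):
    """Return {token: (old_id, new_id)} for tokens whose ID moved or vanished."""
    return {
        token: (old_id, new_vocab.get(token))
        for token, old_id in old_vocab.items()
        if new_vocab.get(token) != old_id
    }


# Toy vocabularies standing in for v1 and v2.
toy_v1 = {"hello": 0, "<fps_30>": 1}
toy_v2 = {"hello": 0, "<fps_30>": 1, "<snac>": 2, "</snac>": 3}
print(changed_ids(toy_v1, toy_v2))  # {}

# With the real tokenizers (needs network access to the Hub):
# v1 = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-adaptive")
# v2 = AutoTokenizer.from_pretrained("EmpathicRobotics/tokenizer-vla-adaptive-v2")
# assert not changed_ids(v1.get_vocab(), v2.get_vocab())
```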
Related
| Resource | Link |
|---|---|
| v1 tokenizer (no SNAC) | tokenizer-vla-adaptive |
| Qwen3-based version | tokenizer-vla-qwen3 |
| VLA model trained with v1 | vla-1.7b-pab-spline-adaptive |
| FineVideo-VLA dataset | FineVideo-Phase7-Flattened |