X-Tokenizer
A residual vector-quantization tokenizer for robot manipulation actions, trained jointly on 18 robot embodiments. It turns a continuous action sequence into a short sequence of discrete tokens, and decodes the tokens back to actions.
- 18 canonical robot embodiments with per-embodiment conditioning
- compression ratio 4 (e.g. 32 action frames -> 8 latent steps)
- codebook: 2048 codes × 4 residual quantizers per latent step
- 26-dim flat action layout (bimanual EEF + chassis + lift + head)
- dynamic sequence length: any chunk_size ∈ [8, 64] at inference, no reload
- two token layouts available at encode/decode time (time_major / quantizer_major) so the same checkpoint can feed both step-by-step and depth-first autoregressive consumers
Install
git clone <repo>
cd X-Tokenizer
pip install -e .
# Only needed if you plan to compute statistics from your own data.
pip install -e ".[stats]"
Core deps (PyTorch, vector-quantize-pytorch, scipy, pyyaml, tqdm) install
automatically. tdigest is lazy-loaded and only required to build
statistics.
Download xtokenizer.pth from the release page and place it where the
examples expect it (the examples default to ./xtokenizer.pth).
Two ways to use this package
| You have | Do this |
|---|---|
| A statistics file (my_statistics.json) for your robot | Case A: go to "Encode / decode your actions" below |
| Only raw absolute actions, no statistics | Case B: run "Compute statistics" once, then go back to Case A |
Both cases assume your data already follows the action-dict format described in the next section.
Step 1: Shape your data into an episode dict
Every step in this package consumes one or more episodes, where each
episode is a Python dict mapping a stripped key name to a [T, dim] array
of absolute actions:
| Key (no _relative) | Shape | Meaning | Unit |
|---|---|---|---|
| follow_left_ee_cartesian_pos | [T, 3] | Left end-effector position | meters |
| follow_left_ee_rotation_6D | [T, 6] | Left EE 6D rotation (rows 1-2 of R) | unitless |
| follow_left_gripper | [T, 1] | Left gripper opening | [0, 1] |
| follow_right_ee_cartesian_pos | [T, 3] | Right EE position | meters |
| follow_right_ee_rotation_6D | [T, 6] | Right EE 6D rotation | unitless |
| follow_right_gripper | [T, 1] | Right gripper opening | [0, 1] |
| velocity_decomposed | [T, 3] | Chassis (vx, vy, omega) | m/s, rad/s |
| height | [T, 1] | Lift height | meters |
| head_actions | [T, 2] | Head (pitch, yaw) | radians |
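Before feeding data into the package it can help to check that an episode dict really has the shapes listed in the table. The following is a minimal NumPy-only sketch; `check_episode` and `DOF_DIMS` are hypothetical names introduced here for illustration and are not part of xtokenizer.

```python
import numpy as np

# Per-key widths copied from the table above (hypothetical constant name).
DOF_DIMS = {
    "follow_left_ee_cartesian_pos": 3,
    "follow_left_ee_rotation_6D": 6,
    "follow_left_gripper": 1,
    "follow_right_ee_cartesian_pos": 3,
    "follow_right_ee_rotation_6D": 6,
    "follow_right_gripper": 1,
    "velocity_decomposed": 3,
    "height": 1,
    "head_actions": 2,
}


def check_episode(episode):
    """Hypothetical helper: verify every array is [T, dim] and return T."""
    lengths = set()
    for key, arr in episode.items():
        if key not in DOF_DIMS:
            raise KeyError(f"unexpected key: {key}")
        arr = np.asarray(arr)
        if arr.ndim != 2 or arr.shape[1] != DOF_DIMS[key]:
            raise ValueError(
                f"{key}: expected [T, {DOF_DIMS[key]}], got {arr.shape}"
            )
        lengths.add(arr.shape[0])
    if len(lengths) != 1:
        raise ValueError(f"inconsistent episode lengths: {sorted(lengths)}")
    return lengths.pop()


# Zero-filled placeholder episode (shape check only; zeros are not valid rotations).
episode = {k: np.zeros((50, d), dtype=np.float32) for k, d in DOF_DIMS.items()}
num_frames = check_episode(episode)
total_dim = sum(DOF_DIMS.values())  # widths of all nine keys add up to 26
```

Robots that lack some segments can simply pass a dict with fewer keys; the check only inspects the keys that are present.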
Helpers for common raw formats
from xtokenizer.data import (
    split_by_dof_dims, stack_by_dof_dims,
    euler_to_6d, quat_to_6d,
)
# Raw [T, 26] matrix -> dict (used when your data is one big array).
episode = split_by_dof_dims(action_matrix, tok.get_dof_layout())
# dict -> [T, 26] matrix (inverse helper).
matrix = stack_by_dof_dims(episode, tok.get_dof_layout())
# Convert rotations into the 6D representation expected by the model.
r6 = euler_to_6d(euler_xyz) # [T, 3] roll-pitch-yaw -> [T, 6]
r6 = quat_to_6d(quat_xyzw) # [T, 4] -> [T, 6]
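For readers who want to see what a quaternion-to-6D conversion involves, here is a NumPy-only sketch. It assumes (x, y, z, w) ordering and takes the first two rows of the rotation matrix, as the table above describes; `quat_xyzw_to_6d_rows` is a hypothetical name, and the packaged quat_to_6d may order or normalize its output differently.

```python
import numpy as np


def quat_xyzw_to_6d_rows(quat):
    """Sketch: [T, 4] quaternions (x, y, z, w) -> [T, 6], rows 1-2 of R."""
    q = np.asarray(quat, dtype=np.float64)
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)  # guard against drift
    x, y, z, w = q[:, 0], q[:, 1], q[:, 2], q[:, 3]
    # First and second rows of the standard quaternion rotation matrix.
    row1 = np.stack(
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        axis=-1,
    )
    row2 = np.stack(
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        axis=-1,
    )
    return np.concatenate([row1, row2], axis=-1)


half = np.sqrt(0.5)
quats = np.array([
    [0.0, 0.0, 0.0, 1.0],    # identity rotation
    [0.0, 0.0, half, half],  # 90 degrees about the z axis
])
r6_rows = quat_xyzw_to_6d_rows(quats)
```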
When your robot lacks some segments
If your robot has no head, no chassis, etc., just omit those keys from the dict. For training-time statistics the missing dimensions are tracked as NaN and skipped automatically. At encode/decode time, build a DOF mask so the model zeros those dimensions on decode:
from xtokenizer.data import make_dof_mask_from_present_keys, make_dof_mask_from_nan
# Option 1: by enumerating the keys your robot provides.
dof_mask = make_dof_mask_from_present_keys(
    tok.get_dof_layout(),
    present_keys=["follow_left_ee_cartesian_pos",
                  "follow_left_ee_rotation_6D",
                  "follow_left_gripper"],
    T=chunk_length,
)
# Option 2: derive from NaN entries in an existing [T, 26] action array.
dof_mask = make_dof_mask_from_nan(action_matrix)
If you have all 26 dimensions, you can leave dof_mask=None (the default).
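To make the idea of a DOF mask concrete, the sketch below builds one by hand in NumPy. It assumes a boolean [T, 26] array where True marks a dimension the robot provides; the dtype and polarity of the real helpers are not confirmed here, and `present_key_mask` / `nan_mask` are hypothetical stand-ins.

```python
import numpy as np

# (key, width) pairs in layout order, copied from the episode-dict table.
LAYOUT = [
    ("follow_left_ee_cartesian_pos", 3),
    ("follow_left_ee_rotation_6D", 6),
    ("follow_left_gripper", 1),
    ("follow_right_ee_cartesian_pos", 3),
    ("follow_right_ee_rotation_6D", 6),
    ("follow_right_gripper", 1),
    ("velocity_decomposed", 3),
    ("height", 1),
    ("head_actions", 2),
]


def present_key_mask(layout, present_keys, num_frames):
    """Hypothetical: True for columns that belong to a present key."""
    unknown = set(present_keys) - {key for key, _ in layout}
    if unknown:
        raise KeyError(f"unknown keys: {sorted(unknown)}")
    width = sum(dim for _, dim in layout)
    mask = np.zeros((num_frames, width), dtype=bool)
    start = 0
    for key, dim in layout:
        if key in present_keys:
            mask[:, start:start + dim] = True
        start += dim
    return mask


def nan_mask(action_matrix):
    """Hypothetical: True wherever the action matrix holds a real value."""
    return ~np.isnan(np.asarray(action_matrix))


left_arm = ["follow_left_ee_cartesian_pos",
            "follow_left_ee_rotation_6D",
            "follow_left_gripper"]
mask = present_key_mask(LAYOUT, left_arm, num_frames=32)

# Same robot expressed as a [T, 26] matrix with NaN in the missing columns.
actions = np.full((32, 26), np.nan)
actions[:, :10] = 0.0
mask_from_nan = nan_mask(actions)
```

For a left-arm-only robot both routes mark columns 0:10 as present and everything else as absent.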
Sample episodes for trying things out
Two synthetic episodes live in examples/data/episode_001.npz and
episode_002.npz. Each .npz already stores arrays under the stripped key
names, so:
import numpy as np
with np.load("examples/data/episode_001.npz") as f:
    episode = {k: f[k] for k in f.files}
gives you a dict that matches the table above. Regenerate them any time
with python examples/data/_generate_demo_data.py.
Case B: compute statistics from your data
Run this once per dataset. It writes ./my_statistics.json containing
the percentile blocks the normalizer needs.
from xtokenizer import XTokenizer
from xtokenizer.data import save_statistics_json
from xtokenizer.tools.statistics_builder import StatisticsAccumulator
tok = XTokenizer.from_pretrained("./xtokenizer.pth", device="cpu")
acc = StatisticsAccumulator(
    dof_dims=tok.get_dof_layout(),
    predict_action_keys=tok.get_predict_action_keys(),
    obs_state_keys=tok.get_obs_state_keys(),
    chunk_size=32,  # must match training chunk length (32 for the released ckpt)
    frame_interval=1,
)
for episode in your_episode_iterator():
    acc.add_episode(episode)
save_statistics_json({"MyRobot": acc.export()}, "./my_statistics.json")
A complete runnable version, including loading episodes from .npz files,
is in examples/02_compute_statistics.py.
CLI alternative for batched workflows
If your episodes are stored as .npz files, the bundled CLI parallelises
the same computation:
xtokenizer-compute-statistics \
    --episodes-dir ./my_npz_episodes \
    --loader npz \
    --dataset-type MyRobot \
    --checkpoint ./xtokenizer.pth \
    --output ./my_statistics.json \
    --chunk-size 32 \
    --num-workers 8
Each .npz file must contain arrays named after the stripped action keys
above. To plug your own loader, pass
--loader your_module:your_callable_returning_episode_dict.
--chunk-size 32 must match the value used at training time, otherwise
the delta_001 magnitudes drift and normalization clips heavily.
Case A: encode / decode your actions
Once you have ./my_statistics.json, the model is ready. The full pipeline
is six explicit steps:
absolute actions [T, D]
| Step A: compute delta vs. an observation frame (SO(3) for 6D rotation)
v
delta [T, D]
| Step B: normalize delta and obs using the statistics
v
normalized_delta -> Step C: encode -> indices (your tokens)
-> Step D: decode -> normalized_delta_recon
| Step E: unnormalize
v
delta_recon
| Step F: compose onto obs frame
v
absolute actions recon
Steps B and E are where the statistics file is consumed. Steps A,
C, D, F do not need statistics. The explicit version is in
examples/01_encode_decode.py; once you have read it, the same six steps
collapse into the high-level API:
from xtokenizer import XTokenizer
tok = XTokenizer.from_pretrained("./xtokenizer.pth",
                                 statistics_path="./my_statistics.json")
indices = tok.encode_from_absolute(absolute_actions,      # [B, T, 26]
                                   obs_state_absolute,    # [B, 1, 26]
                                   robot_type="MyRobot")
actions_recon = tok.decode_to_absolute(indices,
                                       target_length=T,
                                       obs_state_absolute=obs_state_absolute,
                                       robot_type="MyRobot")
The released checkpoint accepts any T ∈ [8, 64] per call; you don't have
to reload between different chunk sizes.
Token layout: time_major vs. quantizer_major
For every T action frames the model emits T' = T / 4 latents, each of
which is quantized by Q = 4 residual codebooks, so a chunk produces a
total of T' × Q discrete tokens. All encode / decode /
reconstruct / encode_from_absolute / decode_to_absolute methods
take a token_order keyword that controls how those tokens are laid out
in the returned tensor:
| token_order | indices.shape | flatten order (indices.reshape(B, -1)) |
|---|---|---|
| "time_major" (default, backward-compatible) | [B, T', Q] | [t0_q0, t0_q1, ..., t0_q3, t1_q0, ..., tN_q3] |
| "quantizer_major" | [B, Q, T'] | [q0_t0, q0_t1, ..., q0_tN, q1_t0, ..., q3_tN] |
Pick quantizer_major if your downstream model wants to emit the whole
first-residual stream before the second one (e.g. a "draft then refine"
generation order); pick time_major if it autoregresses time-step by
time-step. Both layouts are exactly equivalent: only the ordering of
the same underlying tokens differs.
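The relationship between the two layouts can be checked without the model. The NumPy sketch below labels each token as 10*t + q for a 32-frame chunk (T' = 32 / 4 = 8, Q = 4, so 32 tokens per sample) and shows that quantizer_major is the time_major tensor with its last two axes swapped. The variable names are illustrative only.

```python
import numpy as np

B, T, ratio, Q = 2, 32, 4, 4
T_lat = T // ratio  # 8 latent steps for a 32-frame chunk

# Label every token with 10 * t + q so the flatten order is easy to read.
labels = 10 * np.arange(T_lat)[:, None] + np.arange(Q)[None, :]  # [T', Q]
time_major = np.broadcast_to(labels, (B, T_lat, Q)).copy()       # [B, T', Q]

# quantizer_major holds the same tokens with the last two axes swapped.
quantizer_major = time_major.transpose(0, 2, 1)                  # [B, Q, T']

flat_tm = time_major.reshape(B, -1)       # t0_q0, t0_q1, t0_q2, t0_q3, t1_q0, ...
flat_qm = quantizer_major.reshape(B, -1)  # q0_t0, q0_t1, ..., q0_t7, q1_t0, ...

# Going back: un-flatten with the layout you encoded with, then swap axes.
restored = flat_qm.reshape(B, Q, T_lat).transpose(0, 2, 1)
```

Un-flattening with the wrong layout silently scrambles the tokens, which is why token_order has to match between encode and decode.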
# encode side: shape changes with token_order
out = tok.encode(normalized_delta, obs_state=normalized_obs,
                 token_order="quantizer_major")
indices = out["indices"]                      # [B, Q, T']  e.g. [2, 4, 8]
flat = indices.reshape(indices.shape[0], -1)  # [B, Q*T']   ready for an LLM
# decode side: accepts 3D ([B, Q, T']) or flat ([B, Q*T']) as long as
# token_order matches what you encoded with.
recon = tok.decode(flat, target_length=T, obs_state=normalized_obs,
                   token_order="quantizer_major")
# The high-level absolute API takes the same flag.
idx = tok.encode_from_absolute(absolute_actions, obs_state_absolute,
                               robot_type="MyRobot",
                               token_order="quantizer_major")
recon_abs = tok.decode_to_absolute(idx, target_length=T,
                                   obs_state_absolute=obs_state_absolute,
                                   robot_type="MyRobot",
                                   token_order="quantizer_major")
Within a single latent step, q0..q3 are residual codes (q1 quantizes what q0 left over, etc.). The RVQ is unchanged by the token layout, but downstream generators should still emit q0 -> q1 -> q2 -> q3 per latent so they stay on the training distribution.
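To illustrate why the residual codes have a natural order, here is a toy residual quantizer in NumPy. It is a sketch with small random codebooks (16 codes, 8 dimensions), not the released model's 2048-code codebooks; `rvq_encode` and `rvq_decode` are hypothetical names. Each codebook contains a zero vector so that a stage can never make the error worse.

```python
import numpy as np

rng = np.random.default_rng(0)
num_quantizers, codebook_size, dim = 4, 16, 8

# Toy codebooks: each later stage is scaled down to model finer corrections.
codebooks = rng.normal(size=(num_quantizers, codebook_size, dim))
codebooks *= 0.5 ** np.arange(num_quantizers)[:, None, None]
codebooks[:, 0, :] = 0.0  # zero code: choosing it leaves the residual unchanged


def rvq_encode(x, books):
    """Greedy residual quantization: stage k quantizes what stage k-1 left over."""
    residual = np.array(x, dtype=np.float64)
    codes, errors = [], []
    for book in books:
        idx = int(np.argmin(np.linalg.norm(book - residual, axis=1)))
        codes.append(idx)
        residual = residual - book[idx]
        errors.append(float(np.linalg.norm(residual)))
    return codes, errors


def rvq_decode(codes, books):
    """Reconstruction is the sum of the selected code from every stage."""
    return sum(book[idx] for book, idx in zip(books, codes))


x = rng.normal(size=dim)
codes, errors = rvq_encode(x, codebooks)  # one index per quantizer: q0..q3
x_hat = rvq_decode(codes, codebooks)
```

Because q1 is chosen conditional on the residual left by q0 (and so on), a generator that emits the codes in a different per-latent order would be producing them under conditions the quantizer never saw.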
Examples in this repo
| File | Case | What it does |
|---|---|---|
| examples/01_encode_decode.py | A | Loads stats + a demo episode, walks the six-step pipeline explicitly, then verifies the high-level one-liner matches numerically. |
| examples/02_compute_statistics.py | B | Reads the two demo .npz episodes and writes ./my_statistics.json. |
| examples/data/episode_*.npz | - | Two synthetic episodes used by examples 01 and 02. |
| examples/data/_generate_demo_data.py | - | Regenerate the demo episodes. |
A typical first-time session is python examples/02_compute_statistics.py
followed by python examples/01_encode_decode.py.
Concept reference
Action layout
The exact dof_dims mapping ships inside the checkpoint (data_spec
field). tok.get_dof_layout() returns it as an ordered dict matching the
table above. Slices 0:3, 3:9, 9:10, 10:13, 13:19, 19:20, 20:23, 23:24, 24:26 correspond to the nine keys in order.
Delta vs. absolute action
The tokenizer only ever sees deltas relative to an observation frame.
Cartesian / scalar segments use plain subtraction; the 6D rotation
segments use the SO(3) composition R_delta = R_abs @ R_state.T; head
angles wrap to (-pi, pi]. Helpers live in xtokenizer.data.action_layout
(compute_delta_action, compute_absolute_action).
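The three delta rules can be written out in a few lines of NumPy. This is a sketch of the math stated above, not the packaged compute_delta_action / compute_absolute_action; `rot_z` and `wrap_angle` are hypothetical helpers, and the compose step is simply the algebraic inverse of each delta rule.

```python
import numpy as np


def rot_z(theta):
    """Rotation matrix for a rotation of theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def wrap_angle(angle):
    """Wrap radians into the half-open interval (-pi, pi]."""
    a = np.asarray(angle, dtype=np.float64)
    return -((np.pi - a) % (2 * np.pi) - np.pi)


# Cartesian / scalar segments: plain subtraction, inverted by addition.
pos_state = np.array([0.40, 0.10, 0.25])
pos_abs = np.array([0.42, 0.08, 0.30])
pos_delta = pos_abs - pos_state
pos_recon = pos_delta + pos_state

# Rotation segments: R_delta = R_abs @ R_state.T, inverted by R_delta @ R_state.
R_state, R_abs = rot_z(0.3), rot_z(1.0)
R_delta = R_abs @ R_state.T
R_recon = R_delta @ R_state

# Head angles: subtract, then wrap so a jump across +/-pi stays small.
head_state, head_abs = 3.0, -3.0
head_delta = wrap_angle(head_abs - head_state)    # about 0.283 rad, not -6.0
head_recon = wrap_angle(head_state + head_delta)  # back to -3.0
```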
Normalization & statistics
Deltas are min-max normalized to roughly [-1, 1] using the 0.1% / 99.9%
quantiles (q001 / delta_001). Training statistics are not shipped;
compute your own (Case B). When statistics are loaded, the high-level API
will normalize both the delta and the obs frame; missing keys in your
statistics fall back to (min=0, delta=1) (i.e. no normalization), so it
is worth running validate_statistics once after building.
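As a rough illustration of quantile-based min-max scaling, the sketch below maps synthetic deltas onto roughly [-1, 1] using their 0.1% and 99.9% quantiles. The formula is a generic textbook choice and an assumption: the package's exact convention (how q001 and delta_001 are combined, and whether values are clipped) is not confirmed here, and the documented (min=0, delta=1) fallback suggests the internal form may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
deltas = rng.normal(scale=0.02, size=(10_000, 3))  # synthetic position deltas

# Per-dimension 0.1% and 99.9% quantiles, robust to rare outliers.
q_lo = np.quantile(deltas, 0.001, axis=0)
q_hi = np.quantile(deltas, 0.999, axis=0)
span = np.maximum(q_hi - q_lo, 1e-8)  # avoid division by zero on constant dims


def normalize(x):
    """Assumed mapping: q_lo -> -1 and q_hi -> +1."""
    return 2.0 * (x - q_lo) / span - 1.0


def unnormalize(y):
    """Exact inverse of normalize."""
    return (y + 1.0) / 2.0 * span + q_lo


normalized = normalize(deltas)
fraction_inside = float(np.mean(np.abs(normalized) <= 1.0))
roundtrip = unnormalize(normalized)
```

Only the outermost 0.2% of samples per dimension fall outside [-1, 1], which is why the range is described as "roughly" [-1, 1].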
Robot types
The model has 18 canonical embodiment slots:
0 Unknown
1 X2Arm (default when robot_type=None)
2 Franka
3 UR5
4 Piper
5 Viperx
6 ARX5
7 WidowX
8 GoogleRobot
9 AgiBot
10 UMI
11 Realman
12 Ark
13 R1Lite
14 AlphaBot
15 MMK2
16 Leju
17 A2D
Pass the canonical name as a string (robot_type="Franka"). Names absent
from the registry resolve to id 0; passing robot_type=None uses id 1.
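The lookup rule just described fits in a few lines. The sketch below is a plain-Python restatement using the ids from the list above; `resolve_robot_type` is a hypothetical name, and exact, case-sensitive matching is an assumption about how the real registry compares names.

```python
# Canonical names in id order, copied from the list above.
CANONICAL_ROBOT_TYPES = [
    "Unknown", "X2Arm", "Franka", "UR5", "Piper", "Viperx",
    "ARX5", "WidowX", "GoogleRobot", "AgiBot", "UMI", "Realman",
    "Ark", "R1Lite", "AlphaBot", "MMK2", "Leju", "A2D",
]
NAME_TO_ID = {name: i for i, name in enumerate(CANONICAL_ROBOT_TYPES)}


def resolve_robot_type(robot_type):
    """Hypothetical resolver: None -> X2Arm (1), unregistered names -> Unknown (0)."""
    if robot_type is None:
        return NAME_TO_ID["X2Arm"]
    return NAME_TO_ID.get(robot_type, NAME_TO_ID["Unknown"])


franka_id = resolve_robot_type("Franka")
default_id = resolve_robot_type(None)
custom_id = resolve_robot_type("MyRobot")
```

Under this rule a custom name such as "MyRobot" resolves to the Unknown slot.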
Dropout robustness
During training, obs_state and robot_type were each independently
dropped with probability 0.2. At inference all four combinations are
legal: both provided, only obs_state, only robot_type, or neither.
Reconstruction quality is typically highest when both are passed.
API cheatsheet
XTokenizer.from_pretrained(model_path, statistics_path=None,
                           robot_types_path=None, device=None)
# Low-level: caller supplies normalized deltas
tok.encode(normalized_delta, obs_state=None, dof_mask=None,
           padding_mask=None, robot_type=None,
           token_order="time_major")   # or "quantizer_major"
tok.decode(indices, target_length, obs_state=None, dof_mask=None,
           padding_mask=None, robot_type=None,
           token_order="time_major")   # accepts [B,T',Q] / [B,Q,T'] / flat
tok.reconstruct(normalized_delta, ..., token_order="time_major")
# High-level: absolute actions in/out (requires statistics_path)
tok.encode_from_absolute(absolute_actions, obs_state_absolute, robot_type,
                         dof_mask=None, padding_mask=None,
                         token_order="time_major")
tok.decode_to_absolute(indices, target_length, obs_state_absolute, robot_type,
                       dof_mask=None, token_order="time_major")
# Statistics access (used internally by the high-level API; also handy
# when you split the pipeline yourself).
tok.get_normalizer(stats_key) -> (predict_min, predict_delta, obs_min, obs_delta)
# NumPy variants of every method above: <method>_numpy
# Introspection
tok.get_action_dim()
tok.get_vocab_size()
tok.get_num_quantizers()
tok.get_compression_ratio()
tok.get_max_chunk_length()
tok.get_dof_layout() # ordered {key: dim}
tok.get_predict_action_keys()
tok.get_obs_state_keys()
tok.get_canonical_robot_types() # name -> id
Data utilities re-exported under xtokenizer.data:
ActionLayout, compute_delta_action, compute_absolute_action,
split_by_dof_dims, stack_by_dof_dims,
make_dof_mask_from_nan, make_dof_mask_from_present_keys,
normalize_action, unnormalize_action, build_normalizer_from_keys,
euler_to_6d, quat_to_6d, so3_6d_to_matrix, matrix_to_so3_6d,
compute_delta_6d_batch, compose_6d_batch,
load_statistics_json, save_statistics_json, validate_statistics
Citation
If you find X-Tokenizer useful, please cite:
@misc{kang2026xtokenizermultimodalactiontokenizer,
title={X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining},
author={Xirui Kang and Yanpei Shi and Lucy Liang and Roy Gan and Dongxiu Liu and Pushi Zhang and Danpeng Chen and Xiaoyi Qin and Yinan Zheng and Jinliang Zheng and Hao Wang and Xianyuan Zhan and Hang Su},
year={2026},
eprint={2606.14752},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2606.14752},
}
License
Apache 2.0; see LICENSE.