X-Tokenizer

A residual vector-quantization tokenizer for robot manipulation actions, trained jointly on 18 robot embodiments. It turns a continuous action sequence into a short sequence of discrete tokens, and decodes the tokens back to actions.

18 canonical robot embodiments with per-embodiment conditioning
compression ratio 4 (e.g. 32 action frames -> 8 latent steps)
codebook: 2048 codes × 4 residual quantizers per latent step
26-dim flat action layout (bimanual EEF + chassis + lift + head)
dynamic sequence length: any chunk_size ∈ [8, 64] at inference, no reload
two token layouts available at encode/decode time (time_major / quantizer_major) so the same checkpoint can feed both step-by-step and depth-first autoregressive consumers
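As a quick check of the numbers in the list above, the following sketch works out how many tokens one chunk produces and how many bits they carry. The helper name `token_budget` is made up for illustration and is not part of the package. It assumes the chunk length is a multiple of the compression ratio; how the model treats other lengths is not covered here.

```python
COMPRESSION_RATIO = 4   # action frames per latent step
NUM_QUANTIZERS = 4      # residual quantizers per latent step
CODEBOOK_SIZE = 2048    # codes per quantizer


def token_budget(chunk_size: int) -> dict:
    """Latent steps, token count and bit count for one chunk (illustrative helper)."""
    if not 8 <= chunk_size <= 64:
        raise ValueError("chunk_size must lie in [8, 64]")
    if chunk_size % COMPRESSION_RATIO != 0:
        raise ValueError("this sketch only handles multiples of the compression ratio")
    latent_steps = chunk_size // COMPRESSION_RATIO
    tokens = latent_steps * NUM_QUANTIZERS
    # Bits needed to address one code: 11 for a 2048-entry codebook.
    bits_per_token = (CODEBOOK_SIZE - 1).bit_length()
    return {"latent_steps": latent_steps, "tokens": tokens, "bits": tokens * bits_per_token}


budget = token_budget(32)
print(budget)
```

For the 32-frame example in the list, this gives 8 latent steps and 32 tokens of 11 bits each.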

Install

git clone <repo>
cd X-Tokenizer
pip install -e .

# Only needed if you plan to compute statistics from your own data.
pip install -e ".[stats]"

Core deps (PyTorch, vector-quantize-pytorch, scipy, pyyaml, tqdm) install automatically. tdigest is lazy-loaded and only required to build statistics.

Download xtokenizer.pth from the release page and place it where the examples expect it (the examples default to ./xtokenizer.pth).

Two ways to use this package

You have	Do this
A statistics file (`my_statistics.json`) for your robot	Case A — go to "Encode / decode your actions" below
Only raw absolute actions, no statistics	Case B — run "Compute statistics" once, then go back to Case A

Both cases assume your data already follows the action-dict format described in the next section.

Step 1: Shape your data into an episode dict

Every step in this package consumes one or more episodes. Each episode is a Python dict that maps a stripped key name (the action key without `_relative`) to a [T, dim] array of absolute actions, where T is the number of frames:

Key (no `_relative`)	Shape	Meaning	Unit
`follow_left_ee_cartesian_pos`	`[T, 3]`	Left end-effector position	meters
`follow_left_ee_rotation_6D`	`[T, 6]`	Left EE 6D rotation (rows 1–2 of `R`)	unitless
`follow_left_gripper`	`[T, 1]`	Left gripper opening	`[0, 1]`
`follow_right_ee_cartesian_pos`	`[T, 3]`	Right EE position	meters
`follow_right_ee_rotation_6D`	`[T, 6]`	Right EE 6D rotation	unitless
`follow_right_gripper`	`[T, 1]`	Right gripper opening	`[0, 1]`
`velocity_decomposed`	`[T, 3]`	Chassis (vx, vy, omega)	m/s, rad/s
`height`	`[T, 1]`	Lift height	meters
`head_actions`	`[T, 2]`	Head (pitch, yaw)	radians
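A small shape check can catch layout mistakes before any statistics or encoding run. The sketch below builds a placeholder episode from the table and validates it. `check_episode` and `EPISODE_DIMS` are hypothetical names introduced here, not package functions, and the all-zero arrays are only shape placeholders (an all-zero 6D vector is not a valid rotation).

```python
import numpy as np

# Key -> per-frame dimension, copied from the table above.
EPISODE_DIMS = {
    "follow_left_ee_cartesian_pos": 3,
    "follow_left_ee_rotation_6D": 6,
    "follow_left_gripper": 1,
    "follow_right_ee_cartesian_pos": 3,
    "follow_right_ee_rotation_6D": 6,
    "follow_right_gripper": 1,
    "velocity_decomposed": 3,
    "height": 1,
    "head_actions": 2,
}


def check_episode(episode: dict) -> int:
    """Validate key names and [T, dim] shapes; return the shared length T."""
    lengths = set()
    for key, array in episode.items():
        if key not in EPISODE_DIMS:
            raise KeyError(f"unexpected key: {key}")
        array = np.asarray(array)
        if array.ndim != 2 or array.shape[1] != EPISODE_DIMS[key]:
            raise ValueError(f"{key}: expected [T, {EPISODE_DIMS[key]}], got {array.shape}")
        lengths.add(array.shape[0])
    if len(lengths) != 1:
        raise ValueError(f"keys disagree on T (or episode is empty): {sorted(lengths)}")
    return lengths.pop()


# A 50-frame placeholder episode containing every key.
episode = {key: np.zeros((50, dim), dtype=np.float32) for key, dim in EPISODE_DIMS.items()}
T = check_episode(episode)
```

Keys may be omitted for robots that lack a segment, so the check only validates the keys that are present.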

Helpers for common raw formats

from xtokenizer.data import (
    split_by_dof_dims, stack_by_dof_dims,
    euler_to_6d, quat_to_6d,
)

# `tok` is an XTokenizer instance (see XTokenizer.from_pretrained below).
# Raw [T, 26] matrix -> dict (used when your data is one big array).
episode = split_by_dof_dims(action_matrix, tok.get_dof_layout())

# dict -> [T, 26] matrix (inverse helper).
matrix = stack_by_dof_dims(episode, tok.get_dof_layout())

# Convert rotations into the 6D representation expected by the model.
r6 = euler_to_6d(euler_xyz)     # [T, 3] roll-pitch-yaw -> [T, 6]
r6 = quat_to_6d(quat_xyzw)      # [T, 4] -> [T, 6]

When your robot lacks some segments

If your robot has no head, no chassis, etc., just omit those keys from the dict. For training-time statistics the missing dimensions are tracked as NaN and skipped automatically. At encode/decode time, build a DOF mask so the model zeros those dimensions on decode:

from xtokenizer.data import make_dof_mask_from_present_keys, make_dof_mask_from_nan

# Option 1: by enumerating the keys your robot provides.
dof_mask = make_dof_mask_from_present_keys(
    tok.get_dof_layout(),
    present_keys=["follow_left_ee_cartesian_pos",
                  "follow_left_ee_rotation_6D",
                  "follow_left_gripper"],
    T=chunk_length,
)

# Option 2: derive from NaN entries in an existing [T, 26] action array.
dof_mask = make_dof_mask_from_nan(action_matrix)

If you have all 26 dimensions, you can leave dof_mask=None (the default).
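To make the idea of a DOF mask concrete, here is a NumPy stand-in for the two options above. It is not the package's implementation: the function names are hypothetical, and it assumes the mask is a boolean [T, 26] array with True marking dimensions the robot provides. The real helpers may use a different dtype or convention.

```python
import numpy as np

# Ordered {key: dim} layout; in real code this would come from tok.get_dof_layout().
DOF_LAYOUT = {
    "follow_left_ee_cartesian_pos": 3,
    "follow_left_ee_rotation_6D": 6,
    "follow_left_gripper": 1,
    "follow_right_ee_cartesian_pos": 3,
    "follow_right_ee_rotation_6D": 6,
    "follow_right_gripper": 1,
    "velocity_decomposed": 3,
    "height": 1,
    "head_actions": 2,
}


def mask_from_present_keys(layout, present_keys, T):
    """Boolean [T, D] mask, True where the robot provides the dimension (assumed convention)."""
    unknown = set(present_keys) - set(layout)
    if unknown:
        raise KeyError(f"keys not in layout: {sorted(unknown)}")
    per_dim = np.concatenate(
        [np.full(dim, key in present_keys, dtype=bool) for key, dim in layout.items()]
    )
    return np.tile(per_dim, (T, 1))


def mask_from_nan(action_matrix):
    """Boolean mask, True wherever the action array holds a real value (elementwise)."""
    return ~np.isnan(action_matrix)


# A single left arm: 3 + 6 + 1 = 10 active dimensions out of 26.
left_arm_mask = mask_from_present_keys(
    DOF_LAYOUT,
    ["follow_left_ee_cartesian_pos", "follow_left_ee_rotation_6D", "follow_left_gripper"],
    T=16,
)

# The same robot expressed as a [T, 26] array with NaN in the unused columns.
actions = np.zeros((16, 26))
actions[:, 10:] = np.nan
nan_mask = mask_from_nan(actions)
```

Both routes produce the same mask for this robot, which is why either option works when the missing segments are constant across the chunk.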

Sample episodes for trying things out

Two synthetic episodes live in examples/data/episode_001.npz and episode_002.npz. Each .npz already stores arrays under the stripped key names, so:

import numpy as np
with np.load("examples/data/episode_001.npz") as f:
    episode = {k: f[k] for k in f.files}

gives you a dict that matches the table above. Regenerate them any time with python examples/data/_generate_demo_data.py.

Case B: compute statistics from your data

Run this once per dataset. It writes ./my_statistics.json containing the percentile blocks the normalizer needs.

from xtokenizer import XTokenizer
from xtokenizer.data import save_statistics_json
from xtokenizer.tools.statistics_builder import StatisticsAccumulator

tok = XTokenizer.from_pretrained("./xtokenizer.pth", device="cpu")

acc = StatisticsAccumulator(
    dof_dims=tok.get_dof_layout(),
    predict_action_keys=tok.get_predict_action_keys(),
    obs_state_keys=tok.get_obs_state_keys(),
    chunk_size=32,        # must match training chunk length (32 for the released ckpt)
    frame_interval=1,
)
for episode in your_episode_iterator():
    acc.add_episode(episode)

save_statistics_json({"MyRobot": acc.export()}, "./my_statistics.json")

A complete runnable version, including loading episodes from .npz files, is in examples/02_compute_statistics.py.

CLI alternative for batched workflows

If your episodes are stored as .npz files, the bundled CLI parallelises the same computation:

xtokenizer-compute-statistics \
  --episodes-dir ./my_npz_episodes \
  --loader npz \
  --dataset-type MyRobot \
  --checkpoint ./xtokenizer.pth \
  --output ./my_statistics.json \
  --chunk-size 32 \
  --num-workers 8

Each .npz file must contain arrays named after the stripped action keys above. To plug your own loader, pass --loader your_module:your_callable_returning_episode_dict.

--chunk-size must match the chunk length used at training time (32 for the released checkpoint); otherwise the delta_001 magnitudes drift and normalization clips heavily.
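To see why the chunk size affects the statistics, consider how far an action can move away from its observation frame within one chunk. The toy calculation below uses a single coordinate moving at constant speed. It assumes the observation is the frame just before each chunk, which is a simplification chosen for this sketch and may differ from how the package picks the observation frame.

```python
import numpy as np

# Placeholder trajectory: one coordinate moving at a constant 1 cm per frame.
trajectory = 0.01 * np.arange(200, dtype=np.float64)


def max_abs_delta(trajectory, chunk_size):
    """Largest |action - observation| over all chunks of the given length.

    The observation is taken to be the frame just before each chunk
    (an assumption of this sketch).
    """
    worst = 0.0
    for start in range(0, len(trajectory) - chunk_size):
        obs = trajectory[start]
        chunk = trajectory[start + 1 : start + 1 + chunk_size]
        worst = max(worst, float(np.max(np.abs(chunk - obs))))
    return worst


range_16 = max_abs_delta(trajectory, 16)
range_32 = max_abs_delta(trajectory, 32)
print(range_16, range_32)
```

Doubling the chunk length doubles the delta range here, so statistics built with one chunk size would mis-scale deltas produced with another.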

Case A: encode / decode your actions

Once you have ./my_statistics.json, the model is ready. The full pipeline is six explicit steps:

absolute actions [T, D]
        |  Step A: compute delta vs. an observation frame (SO(3) for 6D rotation)
        v
delta [T, D]
        |  Step B: normalize delta and obs using the statistics
        v
normalized_delta -> Step C: encode -> indices  (your tokens)
                                   -> Step D: decode -> normalized_delta_recon
                                                       |  Step E: unnormalize
                                                       v
                                                     delta_recon
                                                       |  Step F: compose onto obs frame
                                                       v
                                                  absolute actions recon

Steps B and E are where the statistics file is consumed. Steps A, C, D, F do not need statistics. The explicit version is in examples/01_encode_decode.py; once you have read it, the same six steps collapse into the high-level API:

from xtokenizer import XTokenizer

tok = XTokenizer.from_pretrained("./xtokenizer.pth",
                                 statistics_path="./my_statistics.json")

T = absolute_actions.shape[1]  # number of action frames in this chunk

indices = tok.encode_from_absolute(absolute_actions,  # [B, T, 26]
                                   obs_state_absolute,  # [B, 1, 26]
                                   robot_type="MyRobot")

actions_recon = tok.decode_to_absolute(indices,
                                       target_length=T,
                                       obs_state_absolute=obs_state_absolute,
                                       robot_type="MyRobot")

The released checkpoint accepts any T ∈ [8, 64] per call; you don't have to reload between different chunk sizes.

Token layout: `time_major` vs. `quantizer_major`

For every T action frames the model emits T' = T / 4 latents, each of which is quantized by Q = 4 residual codebooks, so a chunk produces a total of T' × Q discrete tokens. All encode / decode / reconstruct / encode_from_absolute / decode_to_absolute methods take a token_order keyword that controls how those tokens are laid out in the returned tensor:

`token_order`	`indices.shape`	flatten order (`indices.reshape(B, -1)`)
`"time_major"` (default, backward-compatible)	`[B, T', Q]`	`[t0_q0, t0_q1, ..., t0_q3, t1_q0, ..., tN_q3]`
`"quantizer_major"`	`[B, Q, T']`	`[q0_t0, q0_t1, ..., q0_tN, q1_t0, ..., q3_tN]`

Pick quantizer_major if your downstream model wants to emit the whole first-residual stream before the second one (e.g. a "draft then refine" generation order); pick time_major if it autoregresses time-step by time-step. Both layouts are exactly equivalent — only the ordering of the same underlying tokens differs.

# encode side: shape changes with token_order
out = tok.encode(normalized_delta, obs_state=normalized_obs,
                 token_order="quantizer_major")
indices = out["indices"]                        # [B, Q, T'] e.g. [2, 4, 8]
flat = indices.reshape(indices.shape[0], -1)    # [B, Q*T']  ready for an LLM

# decode side: accepts 3D ([B, Q, T']) or flat ([B, Q*T']) as long as
# token_order matches what you encoded with.
recon = tok.decode(flat, target_length=T, obs_state=normalized_obs,
                   token_order="quantizer_major")

# The high-level absolute API takes the same flag.
idx = tok.encode_from_absolute(absolute_actions, obs_state_absolute,
                               robot_type="MyRobot",
                               token_order="quantizer_major")
recon_abs = tok.decode_to_absolute(idx, target_length=T,
                                   obs_state_absolute=obs_state_absolute,
                                   robot_type="MyRobot",
                                   token_order="quantizer_major")
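The relationship between the two layouts can be checked without the model. The NumPy sketch below labels each token with its (time step, quantizer) position and compares the flatten orders. It assumes the layouts are plain axis permutations of each other, as the shapes in the table suggest, and does not call the package.

```python
import numpy as np

B, T_latent, Q = 2, 8, 4  # batch, latent steps (T'), residual quantizers

# Label each token as 10 * t + q so its position is readable after flattening.
t_idx, q_idx = np.meshgrid(np.arange(T_latent), np.arange(Q), indexing="ij")
time_major = np.broadcast_to(10 * t_idx + q_idx, (B, T_latent, Q))   # [B, T', Q]

# quantizer_major holds the same tokens with the last two axes swapped.
quantizer_major = time_major.transpose(0, 2, 1)                      # [B, Q, T']

flat_tm = time_major.reshape(B, -1)        # t0_q0, t0_q1, t0_q2, t0_q3, t1_q0, ...
flat_qm = quantizer_major.reshape(B, -1)   # q0_t0, q0_t1, ..., q0_t7, q1_t0, ...

# A flat quantizer_major sequence can be folded back into time_major.
restored = flat_qm.reshape(B, Q, T_latent).transpose(0, 2, 1)

print(flat_tm[0, :5])   # labels 0, 1, 2, 3, 10
print(flat_qm[0, :9])   # labels 0, 10, 20, ..., 70, 1
```

Converting between layouts is a reshape plus a transpose, so a consumer trained on one layout can still read tokens stored in the other.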

Within a single latent step, q0..q3 are residual codes (q1 quantizes what q0 left over, etc.). The RVQ is unchanged by the token layout, but downstream generators should still emit q0 -> q1 -> q2 -> q3 per latent so they stay on the training distribution.
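The residual structure described above can be illustrated with a toy residual quantizer. This is not the released model's quantizer: the codebooks are random, tiny (16 codes instead of 2048), and include a zero code so that each stage can only keep or reduce the error. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, K, D = 4, 16, 3   # quantizers, codes per codebook, latent dimension (toy sizes)

# Random toy codebooks with shrinking scale. Code 0 is the zero vector, so a
# stage can always choose "no correction" and the error never grows.
codebooks = rng.normal(size=(Q, K, D)) * (0.5 ** np.arange(Q))[:, None, None]
codebooks[:, 0, :] = 0.0


def rvq_encode(latent):
    """Greedy residual quantization: stage q picks the code nearest to the current residual."""
    residual = latent.copy()
    indices, errors = [], []
    for q in range(Q):
        distances = np.linalg.norm(codebooks[q] - residual, axis=1)
        k = int(np.argmin(distances))
        indices.append(k)
        residual = residual - codebooks[q, k]   # what is left for the next stage
        errors.append(float(np.linalg.norm(residual)))
    return indices, errors


def rvq_decode(indices):
    """Sum the selected codes, one per stage."""
    return sum(codebooks[q, k] for q, k in enumerate(indices))


latent = rng.normal(size=D)
indices, errors = rvq_encode(latent)
recon = rvq_decode(indices)
print(indices, errors)
```

Because stage q only makes sense given the choices of stages 0 to q-1, a generator that emits the codes of one latent step out of order would be decoding codes against residuals they were never fitted to.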

Examples in this repo

File	Case	What it does
`examples/01_encode_decode.py`	A	Loads stats + a demo episode, walks the six-step pipeline explicitly, then verifies the high-level one-liner matches numerically.
`examples/02_compute_statistics.py`	B	Reads the two demo `.npz` episodes and writes `./my_statistics.json`.
`examples/data/episode_*.npz`	—	Two synthetic episodes used by examples 01 and 02.
`examples/data/_generate_demo_data.py`	—	Regenerate the demo episodes.

A typical first-time session is python examples/02_compute_statistics.py followed by python examples/01_encode_decode.py.

Concept reference

Action layout

The exact dof_dims mapping ships inside the checkpoint (data_spec field). tok.get_dof_layout() returns it as an ordered dict matching the table above. Slices 0:3, 3:9, 9:10, 10:13, 13:19, 19:20, 20:23, 23:24, 24:26 correspond to the nine keys in order.
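The slice boundaries listed above follow directly from the ordered key dimensions. The sketch below recomputes them. The layout dict is copied from the table earlier in this document rather than read from a checkpoint, and `layout_slices` is a hypothetical helper name.

```python
# Ordered {key: dim}, copied from the episode table; a real run would use tok.get_dof_layout().
DOF_LAYOUT = {
    "follow_left_ee_cartesian_pos": 3,
    "follow_left_ee_rotation_6D": 6,
    "follow_left_gripper": 1,
    "follow_right_ee_cartesian_pos": 3,
    "follow_right_ee_rotation_6D": 6,
    "follow_right_gripper": 1,
    "velocity_decomposed": 3,
    "height": 1,
    "head_actions": 2,
}


def layout_slices(layout):
    """Map each key to its column slice in the flat action vector."""
    slices, start = {}, 0
    for key, dim in layout.items():
        slices[key] = slice(start, start + dim)
        start += dim
    return slices


slices = layout_slices(DOF_LAYOUT)
for key, s in slices.items():
    print(f"{s.start:2d}:{s.stop:<2d}  {key}")
```

Indexing a [T, 26] array with `matrix[:, slices[key]]` then yields the [T, dim] block for that key.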

Delta vs. absolute action

The tokenizer only ever sees deltas relative to an observation frame. Cartesian / scalar segments use plain subtraction; the 6D rotation segments use the SO(3) composition R_delta = R_abs @ R_state.T; head angles wrap to (-pi, pi]. Helpers live in xtokenizer.data.action_layout (compute_delta_action, compute_absolute_action).
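The three delta rules can be verified on small inputs. The sketch below works on 3x3 rotation matrices rather than on the 6D vectors, leaving the conversion to the package's helpers, and the function names are illustrative only. It shows that each delta is inverted exactly by the matching compose step.

```python
import numpy as np


def rot_z(angle):
    """Rotation matrix about the z axis (used only to build test inputs)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def wrap_angle(angle):
    """Wrap an angle in radians to the half-open interval (-pi, pi]."""
    return -((-angle + np.pi) % (2.0 * np.pi) - np.pi)


# Position: plain subtraction, inverted by addition.
pos_abs, pos_state = np.array([0.5, 0.1, 0.3]), np.array([0.4, 0.1, 0.2])
pos_delta = pos_abs - pos_state
pos_recon = pos_delta + pos_state

# Rotation: R_delta = R_abs @ R_state.T, inverted by R_abs = R_delta @ R_state.
R_abs, R_state = rot_z(0.9), rot_z(0.2)
R_delta = R_abs @ R_state.T
R_recon = R_delta @ R_state

# Head angle: the raw difference 3.0 - (-3.0) = 6.0 rad wraps to 6.0 - 2*pi rad.
head_delta = wrap_angle(3.0 - (-3.0))
```

The rotation delta for two z-axis rotations of 0.9 and 0.2 rad is itself a z-axis rotation of 0.7 rad, which matches the intuition of "how much further the end-effector has turned".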

Normalization & statistics

Deltas are min-max normalized to roughly [-1, 1] using the 0.1% / 99.9% quantiles (q001 / delta_001). Training statistics are not shipped; compute your own (Case B). When statistics are loaded, the high-level API normalizes both the delta and the obs frame. Keys missing from your statistics fall back to (min=0, delta=1), i.e. no normalization, so it is worth running validate_statistics once after building.
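For readers unfamiliar with quantile-based min-max scaling, here is the textbook form in NumPy. It is a sketch under assumptions: the exact formula the package applies, and how it maps onto the stored q001 / delta_001 values, are not shown in this document, so the variable names and the scaling below are illustrative only. The input data is random placeholder data.

```python
import numpy as np

rng = np.random.default_rng(0)
deltas = rng.normal(scale=0.05, size=(10_000, 3))   # placeholder position deltas, meters

# Robust range from the 0.1% and 99.9% quantiles, per dimension.
q_low = np.quantile(deltas, 0.001, axis=0)
q_high = np.quantile(deltas, 0.999, axis=0)
span = np.maximum(q_high - q_low, 1e-8)             # guard against constant dimensions


def normalize(x):
    """Map [q_low, q_high] linearly onto [-1, 1]; outliers land slightly outside."""
    return 2.0 * (x - q_low) / span - 1.0


def unnormalize(y):
    """Exact inverse of normalize."""
    return (y + 1.0) / 2.0 * span + q_low


normalized = normalize(deltas)
restored = unnormalize(normalized)
inside = float(np.mean(np.abs(normalized) <= 1.0))  # fraction of values within [-1, 1]
```

Using quantiles instead of the true minimum and maximum keeps a handful of outlier frames from stretching the range, at the cost of a small fraction of values falling just outside [-1, 1].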

Robot types

The model has 18 canonical embodiment slots:

0  Unknown
1  X2Arm        (default when robot_type=None)
2  Franka
3  UR5
4  Piper
5  Viperx
6  ARX5
7  WidowX
8  GoogleRobot
9  AgiBot
10 UMI
11 Realman
12 Ark
13 R1Lite
14 AlphaBot
15 MMK2
16 Leju
17 A2D

Pass the canonical name as a string (robot_type="Franka"). Names absent from the registry resolve to id 0; passing robot_type=None uses id 1.
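The lookup rules in the previous paragraph can be written out as a few lines of Python. This is a stand-in built from the list above, not the package's registry code; `resolve_robot_type` is a hypothetical name, and the sketch assumes exact, case-sensitive name matching, which the document does not specify.

```python
CANONICAL_ROBOT_TYPES = [
    "Unknown", "X2Arm", "Franka", "UR5", "Piper", "Viperx",
    "ARX5", "WidowX", "GoogleRobot", "AgiBot", "UMI", "Realman",
    "Ark", "R1Lite", "AlphaBot", "MMK2", "Leju", "A2D",
]
ROBOT_TYPE_TO_ID = {name: i for i, name in enumerate(CANONICAL_ROBOT_TYPES)}


def resolve_robot_type(robot_type):
    """Return the embodiment id: None -> X2Arm (1), unregistered name -> Unknown (0)."""
    if robot_type is None:
        return ROBOT_TYPE_TO_ID["X2Arm"]
    return ROBOT_TYPE_TO_ID.get(robot_type, ROBOT_TYPE_TO_ID["Unknown"])


for name in ["Franka", None, "MyRobot"]:
    print(name, "->", resolve_robot_type(name))
```

Note that a custom name such as "MyRobot" lands in the Unknown slot under these rules.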

Dropout robustness

During training, obs_state and robot_type were each dropped independently with probability 0.2, so at inference all four combinations are legal: both provided, only obs_state, only robot_type, or neither. Reconstruction quality is typically highest when both are passed.
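Since the two inputs were dropped independently, the share of training samples that saw each combination follows from simple multiplication. The short calculation below assumes nothing beyond the stated dropout probability of 0.2 for each input.

```python
from itertools import product

P_DROP = 0.2  # per-input dropout probability stated above

# Probability that a training sample saw each (obs_state kept, robot_type kept) combination.
combos = {}
for obs_kept, robot_kept in product([True, False], repeat=2):
    p_obs = 1 - P_DROP if obs_kept else P_DROP
    p_robot = 1 - P_DROP if robot_kept else P_DROP
    combos[(obs_kept, robot_kept)] = p_obs * p_robot

for (obs_kept, robot_kept), p in combos.items():
    print(f"obs_state={'kept' if obs_kept else 'dropped'}, "
          f"robot_type={'kept' if robot_kept else 'dropped'}: {p:.2f}")
```

Roughly 64% of samples had both inputs, 16% had only one of them (for each of the two cases), and 4% had neither, which is consistent with quality being highest when both are supplied.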

API cheatsheet

XTokenizer.from_pretrained(model_path, statistics_path=None,
                           robot_types_path=None, device=None)

# Low-level: caller supplies normalized deltas
tok.encode(normalized_delta, obs_state=None, dof_mask=None,
           padding_mask=None, robot_type=None,
           token_order="time_major")          # or "quantizer_major"
tok.decode(indices, target_length, obs_state=None, dof_mask=None,
           padding_mask=None, robot_type=None,
           token_order="time_major")          # accepts [B,T',Q] / [B,Q,T'] / flat
tok.reconstruct(normalized_delta, ..., token_order="time_major")

# High-level: absolute actions in/out (requires statistics_path)
tok.encode_from_absolute(absolute_actions, obs_state_absolute, robot_type,
                         dof_mask=None, padding_mask=None,
                         token_order="time_major")
tok.decode_to_absolute(indices, target_length, obs_state_absolute, robot_type,
                       dof_mask=None, token_order="time_major")

# Statistics access (used internally by the high-level API; also handy
# when you split the pipeline yourself).
tok.get_normalizer(stats_key) -> (predict_min, predict_delta, obs_min, obs_delta)

# NumPy variants of every method above: <method>_numpy

# Introspection
tok.get_action_dim()
tok.get_vocab_size()
tok.get_num_quantizers()
tok.get_compression_ratio()
tok.get_max_chunk_length()
tok.get_dof_layout()              # ordered {key: dim}
tok.get_predict_action_keys()
tok.get_obs_state_keys()
tok.get_canonical_robot_types()   # name -> id

Data utilities re-exported under xtokenizer.data:

ActionLayout, compute_delta_action, compute_absolute_action,
split_by_dof_dims, stack_by_dof_dims,
make_dof_mask_from_nan, make_dof_mask_from_present_keys,
normalize_action, unnormalize_action, build_normalizer_from_keys,
euler_to_6d, quat_to_6d, so3_6d_to_matrix, matrix_to_so3_6d,
compute_delta_6d_batch, compose_6d_batch,
load_statistics_json, save_statistics_json, validate_statistics

Citation

If you find X-Tokenizer useful, please cite:

@misc{kang2026xtokenizermultimodalactiontokenizer,
      title={X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining},
      author={Xirui Kang and Yanpei Shi and Lucy Liang and Roy Gan and Dongxiu Liu and Pushi Zhang and Danpeng Chen and Xiaoyi Qin and Yinan Zheng and Jinliang Zheng and Hao Wang and Xianyuan Zhan and Hang Su},
      year={2026},
      eprint={2606.14752},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.14752},
}

License

Apache 2.0 — see LICENSE.
