---
license: apache-2.0
pipeline_tag: robotics
tags:
- Robotics
- Vision
---

# X-Tokenizer

Project Page · GitHub · Paper · Install · Usage


A residual vector-quantization tokenizer for robot manipulation actions, trained jointly on 18 robot embodiments. It turns a continuous action sequence into a short sequence of discrete tokens, and decodes the tokens back to actions.

- 18 canonical robot embodiments with per-embodiment conditioning
- compression ratio 4 (e.g. 32 action frames -> 8 latent steps)
- codebook: 2048 codes × 4 residual quantizers per latent step
- 26-dim flat action layout (bimanual EEF + chassis + lift + head)
- dynamic sequence length: any `chunk_size ∈ [8, 64]` at inference, no reload
- two token layouts available at encode/decode time (`time_major` / `quantizer_major`) so the same checkpoint can feed both step-by-step and depth-first autoregressive consumers

## Install

```bash
git clone <repository-url>
cd X-Tokenizer
pip install -e .

# Only needed if you plan to compute statistics from your own data.
pip install -e ".[stats]"
```

Core deps (PyTorch, vector-quantize-pytorch, scipy, pyyaml, tqdm) install automatically. `tdigest` is lazy-loaded and only required to build statistics.

Download `xtokenizer.pth` from the release page and place it where the examples expect it (the examples default to `./xtokenizer.pth`).

---

## Two ways to use this package

| You have | Do this |
| --- | --- |
| A statistics file (`my_statistics.json`) for your robot | **Case A**: go to "Encode / decode your actions" below |
| Only raw absolute actions, no statistics | **Case B**: run "Compute statistics" once, then go back to Case A |

Both cases assume your data already follows the action-dict format described in the next section.
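As a rough illustration of the sizes quoted in the feature list at the top (compression ratio 4, 4 residual quantizers, 2048 codes), the sketch below works out how many tokens one chunk produces. `token_budget` is a hypothetical helper, not part of the package, and it assumes the chunk length is a multiple of 4; how the model handles other lengths is not covered here.

```python
import math

# Figures taken from the feature list of this README.
COMPRESSION_RATIO = 4   # action frames per latent step
NUM_QUANTIZERS = 4      # residual codebooks per latent step
CODEBOOK_SIZE = 2048    # codes per codebook


def token_budget(num_frames):
    """Return (latent_steps, num_tokens, bits) for one chunk.

    Hypothetical helper: assumes num_frames is a multiple of the
    compression ratio, which is a simplification made for this sketch.
    """
    if num_frames % COMPRESSION_RATIO != 0:
        raise ValueError("this sketch only handles multiples of 4")
    latent_steps = num_frames // COMPRESSION_RATIO
    num_tokens = latent_steps * NUM_QUANTIZERS
    bits = num_tokens * math.log2(CODEBOOK_SIZE)  # 11 bits per token
    return latent_steps, num_tokens, bits


print(token_budget(32))  # (8, 32, 352.0)
```

So a 32-frame chunk of 26-dim actions becomes 32 tokens, each drawn from a 2048-entry codebook.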
---

## Step 1: Shape your data into an episode dict

Every step in this package consumes one or more **episodes**, where each episode is a Python dict mapping a stripped key name to a `[T, dim]` array of absolute actions:

| Key (no `_relative`) | Shape | Meaning | Unit |
| --- | --- | --- | --- |
| `follow_left_ee_cartesian_pos` | `[T, 3]` | Left end-effector position | meters |
| `follow_left_ee_rotation_6D` | `[T, 6]` | Left EE 6D rotation (rows 1–2 of `R`) | unitless |
| `follow_left_gripper` | `[T, 1]` | Left gripper opening | `[0, 1]` |
| `follow_right_ee_cartesian_pos` | `[T, 3]` | Right EE position | meters |
| `follow_right_ee_rotation_6D` | `[T, 6]` | Right EE 6D rotation | unitless |
| `follow_right_gripper` | `[T, 1]` | Right gripper opening | `[0, 1]` |
| `velocity_decomposed` | `[T, 3]` | Chassis (vx, vy, omega) | m/s, rad/s |
| `height` | `[T, 1]` | Lift height | meters |
| `head_actions` | `[T, 2]` | Head (pitch, yaw) | radians |

### Helpers for common raw formats

```python
from xtokenizer.data import (
    split_by_dof_dims, stack_by_dof_dims,
    euler_to_6d, quat_to_6d,
)

# `tok` is a loaded XTokenizer (see `XTokenizer.from_pretrained`).

# Raw [T, 26] matrix -> dict (used when your data is one big array).
episode = split_by_dof_dims(action_matrix, tok.get_dof_layout())

# dict -> [T, 26] matrix (inverse helper).
matrix = stack_by_dof_dims(episode, tok.get_dof_layout())

# Convert rotations into the 6D representation expected by the model.
r6 = euler_to_6d(euler_xyz)   # [T, 3] roll-pitch-yaw -> [T, 6]
r6 = quat_to_6d(quat_xyzw)    # [T, 4] -> [T, 6]
```

### When your robot lacks some segments

If your robot has no head, no chassis, etc., just **omit those keys from the dict**. For training-time statistics the missing dimensions are tracked as NaN and skipped automatically. At encode/decode time, build a DOF mask so the model zeros those dimensions on decode:

```python
from xtokenizer.data import make_dof_mask_from_present_keys, make_dof_mask_from_nan

# Option 1: by enumerating the keys your robot provides.
dof_mask = make_dof_mask_from_present_keys(
    tok.get_dof_layout(),
    present_keys=["follow_left_ee_cartesian_pos",
                  "follow_left_ee_rotation_6D",
                  "follow_left_gripper"],
    T=chunk_length,
)

# Option 2: derive from NaN entries in an existing [T, 26] action array.
dof_mask = make_dof_mask_from_nan(action_matrix)
```

If you have all 26 dimensions, you can leave `dof_mask=None` (the default).

### Sample episodes for trying things out

Two synthetic episodes live in `examples/data/episode_001.npz` and `episode_002.npz`. Each `.npz` already stores arrays under the stripped key names, so:

```python
import numpy as np

with np.load("examples/data/episode_001.npz") as f:
    episode = {k: f[k] for k in f.files}
```

gives you a dict that matches the table above. Regenerate them any time with `python examples/data/_generate_demo_data.py`.

---

## Case B: compute statistics from your data

Run this **once** per dataset. It writes `./my_statistics.json` containing the percentile blocks the normalizer needs.

```python
from xtokenizer import XTokenizer
from xtokenizer.data import save_statistics_json
from xtokenizer.tools.statistics_builder import StatisticsAccumulator

tok = XTokenizer.from_pretrained("./xtokenizer.pth", device="cpu")

acc = StatisticsAccumulator(
    dof_dims=tok.get_dof_layout(),
    predict_action_keys=tok.get_predict_action_keys(),
    obs_state_keys=tok.get_obs_state_keys(),
    chunk_size=32,      # must match training chunk length (32 for the released ckpt)
    frame_interval=1,
)

for episode in your_episode_iterator():
    acc.add_episode(episode)

save_statistics_json({"MyRobot": acc.export()}, "./my_statistics.json")
```

A complete runnable version, including loading episodes from `.npz` files, is in `examples/02_compute_statistics.py`.
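The loop above leaves `your_episode_iterator()` to the reader. One possible shape for it, assuming a directory of `.npz` files that follow the episode-dict table, is sketched below. `iter_npz_episodes` and `EXPECTED_DIMS` are hypothetical names for this sketch, not part of the package, and the strict shape checks are a design choice of the sketch rather than something the library is known to enforce.

```python
from pathlib import Path

import numpy as np

# Per-key dimensions copied from the episode-dict table in Step 1.
EXPECTED_DIMS = {
    "follow_left_ee_cartesian_pos": 3,
    "follow_left_ee_rotation_6D": 6,
    "follow_left_gripper": 1,
    "follow_right_ee_cartesian_pos": 3,
    "follow_right_ee_rotation_6D": 6,
    "follow_right_gripper": 1,
    "velocity_decomposed": 3,
    "height": 1,
    "head_actions": 2,
}


def iter_npz_episodes(episodes_dir):
    """Yield one episode dict per .npz file, in sorted filename order.

    Hypothetical helper: keys missing from a file are simply absent from
    the yielded dict, matching the "omit those keys" convention.
    """
    for path in sorted(Path(episodes_dir).glob("*.npz")):
        with np.load(path) as f:
            episode = {k: f[k] for k in f.files}
        lengths = set()
        for key, arr in episode.items():
            if key not in EXPECTED_DIMS:
                raise KeyError(f"{path.name}: unexpected key {key!r}")
            if arr.ndim != 2 or arr.shape[1] != EXPECTED_DIMS[key]:
                raise ValueError(
                    f"{path.name}: {key} has shape {arr.shape}, "
                    f"expected [T, {EXPECTED_DIMS[key]}]"
                )
            lengths.add(arr.shape[0])
        if len(lengths) > 1:
            raise ValueError(f"{path.name}: keys disagree on T: {sorted(lengths)}")
        yield episode
```

With this in place, `for episode in iter_npz_episodes("./my_npz_episodes"):` would take the place of the placeholder iterator in the loop above.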
### CLI alternative for batched workflows

If your episodes are stored as `.npz` files, the bundled CLI parallelises the same computation:

```bash
xtokenizer-compute-statistics \
    --episodes-dir ./my_npz_episodes \
    --loader npz \
    --dataset-type MyRobot \
    --checkpoint ./xtokenizer.pth \
    --output ./my_statistics.json \
    --chunk-size 32 \
    --num-workers 8
```

Each `.npz` file must contain arrays named after the stripped action keys above. To plug your own loader, pass `--loader your_module:your_callable_returning_episode_dict`.

`--chunk-size 32` must match the value used at training time, otherwise the `delta_001` magnitudes drift and normalization clips heavily.

---

## Case A: encode / decode your actions

Once you have `./my_statistics.json`, the model is ready. The full pipeline is six explicit steps:

```
absolute actions [T, D]
    |  Step A: compute delta vs. an observation frame (SO(3) for 6D rotation)
    v
delta [T, D]
    |  Step B: normalize delta and obs using the statistics
    v
normalized_delta
    -> Step C: encode -> indices (your tokens)
    -> Step D: decode -> normalized_delta_recon
    |  Step E: unnormalize
    v
delta_recon
    |  Step F: compose onto obs frame
    v
absolute actions recon
```

Steps **B** and **E** are where the statistics file is consumed. Steps A, C, D, F do not need statistics.

The explicit version is in `examples/01_encode_decode.py`; once you have read it, the same six steps collapse into the high-level API:

```python
from xtokenizer import XTokenizer

tok = XTokenizer.from_pretrained("./xtokenizer.pth",
                                 statistics_path="./my_statistics.json")

indices = tok.encode_from_absolute(absolute_actions,     # [B, T, 26]
                                   obs_state_absolute,   # [B, 1, 26]
                                   robot_type="MyRobot")
actions_recon = tok.decode_to_absolute(indices, target_length=T,
                                       obs_state_absolute=obs_state_absolute,
                                       robot_type="MyRobot")
```

The released checkpoint accepts any `T ∈ [8, 64]` per call; you don't have to reload between different chunk sizes.

### Token layout: `time_major` vs. `quantizer_major`

For every `T` action frames the model emits `T' = T / 4` latents, each of which is quantized by `Q = 4` residual codebooks, so a chunk produces a total of `T' × Q` discrete tokens. All `encode` / `decode` / `reconstruct` / `encode_from_absolute` / `decode_to_absolute` methods take a `token_order` keyword that controls how those tokens are laid out in the returned tensor:

| `token_order` | `indices.shape` | flatten order (`indices.reshape(B, -1)`) |
| --- | --- | --- |
| `"time_major"` (default, backward-compatible) | `[B, T', Q]` | `[t0_q0, t0_q1, ..., t0_q3, t1_q0, ..., tN_q3]` |
| `"quantizer_major"` | `[B, Q, T']` | `[q0_t0, q0_t1, ..., q0_tN, q1_t0, ..., q3_tN]` |

Pick `quantizer_major` if your downstream model wants to emit the whole first-residual stream before the second one (e.g. a "draft then refine" generation order); pick `time_major` if it autoregresses time-step by time-step. Both layouts are exactly equivalent: only the ordering of the same underlying tokens differs.

```python
# encode side: shape changes with token_order
out = tok.encode(normalized_delta, obs_state=normalized_obs,
                 token_order="quantizer_major")
indices = out["indices"]                       # [B, Q, T'] e.g. [2, 4, 8]
flat = indices.reshape(indices.shape[0], -1)   # [B, Q*T'] ready for an LLM

# decode side: accepts 3D ([B, Q, T']) or flat ([B, Q*T']) as long as
# token_order matches what you encoded with.
recon = tok.decode(flat, target_length=T, obs_state=normalized_obs,
                   token_order="quantizer_major")

# The high-level absolute API takes the same flag.
idx = tok.encode_from_absolute(absolute_actions, obs_state_absolute,
                               robot_type="MyRobot",
                               token_order="quantizer_major")
recon_abs = tok.decode_to_absolute(idx, target_length=T,
                                   obs_state_absolute=obs_state_absolute,
                                   robot_type="MyRobot",
                                   token_order="quantizer_major")
```

> Within a single latent step, `q0..q3` are **residual** codes (`q1`
> quantizes what `q0` left over, etc.). The RVQ is unchanged by the
> token layout, but downstream generators should still emit
> `q0 -> q1 -> q2 -> q3` per latent so they stay on the training
> distribution.

---

## Examples in this repo

| File | Case | What it does |
| --- | --- | --- |
| `examples/01_encode_decode.py` | A | Loads stats + a demo episode, walks the six-step pipeline explicitly, then verifies the high-level one-liner matches numerically. |
| `examples/02_compute_statistics.py` | B | Reads the two demo `.npz` episodes and writes `./my_statistics.json`. |
| `examples/data/episode_*.npz` | n/a | Two synthetic episodes used by examples 01 and 02. |
| `examples/data/_generate_demo_data.py` | n/a | Regenerate the demo episodes. |

A typical first-time session is `python examples/02_compute_statistics.py` followed by `python examples/01_encode_decode.py`.

---

## Concept reference

### Action layout

The exact `dof_dims` mapping ships inside the checkpoint (`data_spec` field). `tok.get_dof_layout()` returns it as an ordered dict matching the table above. Slices `0:3, 3:9, 9:10, 10:13, 13:19, 19:20, 20:23, 23:24, 24:26` correspond to the nine keys in order.

### Delta vs. absolute action

The tokenizer only ever sees deltas relative to an observation frame. Cartesian / scalar segments use plain subtraction; the 6D rotation segments use the SO(3) composition `R_delta = R_abs @ R_state.T`; head angles wrap to `(-pi, pi]`. Helpers live in `xtokenizer.data.action_layout` (`compute_delta_action`, `compute_absolute_action`).
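The rotation and angle rules above can be sketched in plain NumPy. This is an illustration of the formulas only, not the package's implementation: the function names are hypothetical, and the sketch assumes the 6D vector holds the first two rows of `R` (as the episode-dict table states) and that the inverse of `R_delta = R_abs @ R_state.T` is `R_abs = R_delta @ R_state`. The shipped helpers (`compute_delta_6d_batch`, `compose_6d_batch`) may differ in detail.

```python
import numpy as np


def sixd_to_matrix(r6):
    """[..., 6] -> [..., 3, 3], reading the 6 numbers as rows 1-2 of R."""
    a1, a2 = r6[..., :3], r6[..., 3:]
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    a2 = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1  # Gram-Schmidt
    b2 = a2 / np.linalg.norm(a2, axis=-1, keepdims=True)
    b3 = np.cross(b1, b2)  # third row completes a right-handed frame
    return np.stack([b1, b2, b3], axis=-2)


def matrix_to_sixd(R):
    """[..., 3, 3] -> [..., 6], keeping the first two rows."""
    return R[..., :2, :].reshape(*R.shape[:-2], 6)


def delta_6d(r6_abs, r6_state):
    """R_delta = R_abs @ R_state.T, expressed in 6D."""
    R_state_T = np.swapaxes(sixd_to_matrix(r6_state), -1, -2)
    return matrix_to_sixd(sixd_to_matrix(r6_abs) @ R_state_T)


def compose_6d(r6_delta, r6_state):
    """Inverse of delta_6d: R_abs = R_delta @ R_state."""
    return matrix_to_sixd(sixd_to_matrix(r6_delta) @ sixd_to_matrix(r6_state))


def wrap_angle(x):
    """Wrap angles in radians to the half-open interval (-pi, pi]."""
    return -((-np.asarray(x) + np.pi) % (2 * np.pi) - np.pi)


def rot_z(theta):
    """Rotation about the z axis, used only to build test inputs."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


state = matrix_to_sixd(rot_z(0.3))
absolute = matrix_to_sixd(rot_z(0.5))
delta = delta_6d(absolute, state)
print(np.allclose(delta, matrix_to_sixd(rot_z(0.2))))      # True
print(np.allclose(compose_6d(delta, state), absolute))     # True
```

The round trip at the end mirrors Steps A and F of the pipeline: taking the delta against an observation frame and composing it back recovers the absolute rotation.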
### Normalization & statistics

Deltas are min-max normalized to roughly `[-1, 1]` using the 0.1% / 99.9% quantiles (`q001` / `delta_001`). Training statistics are not shipped; compute your own (Case B). When statistics are loaded, the high-level API will normalize both the delta and the obs frame; missing keys in your statistics fall back to `(min=0, delta=1)` (i.e. no normalization), so it is worth running `validate_statistics` once after building.

### Robot types

The model has 18 canonical embodiment slots:

```
 0 Unknown
 1 X2Arm (default when robot_type=None)
 2 Franka
 3 UR5
 4 Piper
 5 Viperx
 6 ARX5
 7 WidowX
 8 GoogleRobot
 9 AgiBot
10 UMI
11 Realman
12 Ark
13 R1Lite
14 AlphaBot
15 MMK2
16 Leju
17 A2D
```

Pass the canonical name as a string (`robot_type="Franka"`). Names absent from the registry resolve to id 0; passing `robot_type=None` uses id 1.

### Dropout robustness

During training, `obs_state` and `robot_type` were each independently dropped with probability 0.2. At inference all four combinations are legal: both provided (best precision), only `obs_state`, only `robot_type`, or neither. Reconstruction quality is typically highest when both are passed.
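The name-to-id rule from "Robot types" can be written as a plain lookup, which is useful for checking that a dataset label will not silently land in the `Unknown` slot. This is a sketch only: `resolve_robot_type` is a hypothetical function, and exact, case-sensitive matching is an assumption. With a loaded tokenizer, `tok.get_canonical_robot_types()` returns the real name-to-id mapping.

```python
# Canonical slots, in id order, copied from the list in "Robot types".
CANONICAL_ROBOT_TYPES = [
    "Unknown", "X2Arm", "Franka", "UR5", "Piper", "Viperx",
    "ARX5", "WidowX", "GoogleRobot", "AgiBot", "UMI", "Realman",
    "Ark", "R1Lite", "AlphaBot", "MMK2", "Leju", "A2D",
]
NAME_TO_ID = {name: i for i, name in enumerate(CANONICAL_ROBOT_TYPES)}


def resolve_robot_type(robot_type):
    """Map a robot_type argument to an embodiment id (hypothetical helper).

    None       -> 1 (X2Arm, the documented default)
    known name -> its slot id
    other name -> 0 (Unknown)
    Exact string matching is an assumption of this sketch.
    """
    if robot_type is None:
        return NAME_TO_ID["X2Arm"]
    return NAME_TO_ID.get(robot_type, NAME_TO_ID["Unknown"])


print(resolve_robot_type("Franka"))   # 2
print(resolve_robot_type(None))       # 1
print(resolve_robot_type("MyRobot"))  # 0
```

Under this reading, a custom statistics key such as `"MyRobot"` is conditioned on the `Unknown` embodiment.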
---

## API cheatsheet

```python
XTokenizer.from_pretrained(model_path, statistics_path=None,
                           robot_types_path=None, device=None)

# Low-level: caller supplies normalized deltas
tok.encode(normalized_delta, obs_state=None, dof_mask=None,
           padding_mask=None, robot_type=None,
           token_order="time_major")   # or "quantizer_major"
tok.decode(indices, target_length, obs_state=None, dof_mask=None,
           padding_mask=None, robot_type=None,
           token_order="time_major")   # accepts [B,T',Q] / [B,Q,T'] / flat
tok.reconstruct(normalized_delta, ..., token_order="time_major")

# High-level: absolute actions in/out (requires statistics_path)
tok.encode_from_absolute(absolute_actions, obs_state_absolute, robot_type,
                         dof_mask=None, padding_mask=None,
                         token_order="time_major")
tok.decode_to_absolute(indices, target_length, obs_state_absolute, robot_type,
                       dof_mask=None, token_order="time_major")

# Statistics access (used internally by the high-level API; also handy
# when you split the pipeline yourself).
tok.get_normalizer(stats_key) -> (predict_min, predict_delta, obs_min, obs_delta)

# NumPy variants of every method above: <method>_numpy

# Introspection
tok.get_action_dim()
tok.get_vocab_size()
tok.get_num_quantizers()
tok.get_compression_ratio()
tok.get_max_chunk_length()
tok.get_dof_layout()              # ordered {key: dim}
tok.get_predict_action_keys()
tok.get_obs_state_keys()
tok.get_canonical_robot_types()   # name -> id
```

Data utilities re-exported under `xtokenizer.data`:

```
ActionLayout, compute_delta_action, compute_absolute_action,
split_by_dof_dims, stack_by_dof_dims,
make_dof_mask_from_nan, make_dof_mask_from_present_keys,
normalize_action, unnormalize_action, build_normalizer_from_keys,
euler_to_6d, quat_to_6d, so3_6d_to_matrix, matrix_to_so3_6d,
compute_delta_6d_batch, compose_6d_batch,
load_statistics_json, save_statistics_json, validate_statistics,
```

---

## Citation

If you find X-Tokenizer useful, please cite:

```bibtex
@misc{kang2026xtokenizermultimodalactiontokenizer,
  title={X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining},
  author={Xirui Kang and Yanpei Shi and Lucy Liang and Roy Gan and Dongxiu Liu and Pushi Zhang and Danpeng Chen and Xiaoyi Qin and Yinan Zheng and Jinliang Zheng and Hao Wang and Xianyuan Zhan and Hang Su},
  year={2026},
  eprint={2606.14752},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.14752},
}
```

---

## License

Apache 2.0. See [LICENSE](LICENSE).