---
license: apache-2.0
pipeline_tag: robotics
tags:
- Robotics
- Vision
---
# X-Tokenizer

<p align="center">
  <a href="https://x-square-robot.github.io/X-Tokenizer_projectPage/"><strong>Project Page</strong></a>
  ·
  <a href="https://github.com/X-Square-Robot/X-Tokenizer"><strong>GitHub</strong></a>
  ·
  <a href="https://arxiv.org/pdf/2606.14752"><strong>Paper</strong></a>
  ·
  <a href="#install">Install</a>
  ·
  <a href="#two-ways-to-use-this-package">Usage</a>
</p>

<p align="center">
  <a href="https://x-square-robot.github.io/X-Tokenizer_projectPage/">
    <img alt="Project Page" src="https://img.shields.io/badge/Project%20Page-X--Tokenizer-2f80ed?style=for-the-badge">
  </a>
  <a href="https://github.com/X-Square-Robot/X-Tokenizer">
    <img alt="GitHub Repo" src="https://img.shields.io/badge/GitHub-Repository-black?style=for-the-badge&logo=github">
  </a>
  <a href="https://arxiv.org/pdf/2606.14752">
    <img alt="Paper" src="https://img.shields.io/badge/Paper-arXiv-b31b1b?style=for-the-badge">
  </a>
</p>

A residual vector-quantization tokenizer for robot manipulation actions,
trained jointly on 18 robot embodiments. It turns a continuous action
sequence into a short sequence of discrete tokens, and decodes the tokens
back to actions.

- 18 canonical robot embodiments with per-embodiment conditioning
- compression ratio 4 (e.g. 32 action frames -> 8 latent steps)
- codebook: 2048 codes × 4 residual quantizers per latent step
- 26-dim flat action layout (bimanual EEF + chassis + lift + head)
- dynamic sequence length: any `chunk_size ∈ [8, 64]` at inference, no reload
- two token layouts available at encode/decode time (`time_major` /
  `quantizer_major`) so the same checkpoint can feed both step-by-step and
  depth-first autoregressive consumers

## Install

```bash
git clone <repo>
cd X-Tokenizer
pip install -e .

# Only needed if you plan to compute statistics from your own data.
pip install -e ".[stats]"
```

Core deps (PyTorch, vector-quantize-pytorch, scipy, pyyaml, tqdm) install
automatically. `tdigest` is lazy-loaded and only required to build
statistics.

Download `xtokenizer.pth` from the release page and place it where the
examples expect it (the examples default to `./xtokenizer.pth`).

---

## Two ways to use this package

| You have | Do this |
| --- | --- |
| A statistics file (`my_statistics.json`) for your robot | **Case A**: go to "Case A: encode / decode your actions" below |
| Only raw absolute actions, no statistics | **Case B**: run "Case B: compute statistics from your data" once, then continue with Case A |

Both cases assume your data already follows the action-dict format described
in the next section.

---

## Step 1: Shape your data into an episode dict

Every step in this package consumes one or more **episodes**. Each episode
is a Python dict mapping a key name (with any `_relative` suffix stripped)
to a `[T, dim]` array of absolute actions:

| Key (no `_relative`) | Shape | Meaning | Unit |
| --- | --- | --- | --- |
| `follow_left_ee_cartesian_pos` | `[T, 3]` | Left end-effector position | meters |
| `follow_left_ee_rotation_6D` | `[T, 6]` | Left EE 6D rotation (rows 1–2 of `R`) | unitless |
| `follow_left_gripper` | `[T, 1]` | Left gripper opening | `[0, 1]` |
| `follow_right_ee_cartesian_pos` | `[T, 3]` | Right EE position | meters |
| `follow_right_ee_rotation_6D` | `[T, 6]` | Right EE 6D rotation | unitless |
| `follow_right_gripper` | `[T, 1]` | Right gripper opening | `[0, 1]` |
| `velocity_decomposed` | `[T, 3]` | Chassis (vx, vy, omega) | m/s, rad/s |
| `height` | `[T, 1]` | Lift height | meters |
| `head_actions` | `[T, 2]` | Head (pitch, yaw) | radians |

### Helpers for common raw formats

```python
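from xtokenizer import XTokenizer
from xtokenizer.data import (
    split_by_dof_dims, stack_by_dof_dims,
    euler_to_6d, quat_to_6d,
)

# `tok` supplies the DOF layout; load it once from the released checkpoint.
tok = XTokenizer.from_pretrained("./xtokenizer.pth", device="cpu")

# Raw [T, 26] matrix -> dict (used when your data is one big array).
episode = split_by_dof_dims(action_matrix, tok.get_dof_layout())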

# dict -> [T, 26] matrix (inverse helper).
matrix = stack_by_dof_dims(episode, tok.get_dof_layout())

# Convert rotations into the 6D representation expected by the model.
r6 = euler_to_6d(euler_xyz)     # [T, 3] roll-pitch-yaw -> [T, 6]
r6 = quat_to_6d(quat_xyzw)      # [T, 4] -> [T, 6]
```
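
To make the 6D representation concrete, here is a minimal NumPy sketch of a quaternion-to-6D conversion. It assumes the "rows 1–2 of `R`" convention from the table and `(x, y, z, w)` quaternion order; `quat_xyzw_to_matrix` and `quat_xyzw_to_6d_sketch` are hypothetical stand-ins, not the package's `quat_to_6d`, whose internals may differ.

```python
import numpy as np


def quat_xyzw_to_matrix(quat):
    """[T, 4] quaternions in (x, y, z, w) order -> [T, 3, 3] rotation matrices."""
    q = np.asarray(quat, dtype=np.float64)
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)  # tolerate small drift
    x, y, z, w = q[:, 0], q[:, 1], q[:, 2], q[:, 3]
    R = np.empty((q.shape[0], 3, 3))
    R[:, 0, 0] = 1 - 2 * (y * y + z * z)
    R[:, 0, 1] = 2 * (x * y - z * w)
    R[:, 0, 2] = 2 * (x * z + y * w)
    R[:, 1, 0] = 2 * (x * y + z * w)
    R[:, 1, 1] = 1 - 2 * (x * x + z * z)
    R[:, 1, 2] = 2 * (y * z - x * w)
    R[:, 2, 0] = 2 * (x * z - y * w)
    R[:, 2, 1] = 2 * (y * z + x * w)
    R[:, 2, 2] = 1 - 2 * (x * x + y * y)
    return R


def quat_xyzw_to_6d_sketch(quat):
    """Assumed convention: concatenate rows 1 and 2 of R into a [T, 6] array."""
    R = quat_xyzw_to_matrix(quat)
    return R[:, :2, :].reshape(-1, 6)


identity = np.array([[0.0, 0.0, 0.0, 1.0]])
yaw_90 = np.array([[0.0, 0.0, np.sqrt(0.5), np.sqrt(0.5)]])  # 90 degrees about z
r6_identity = quat_xyzw_to_6d_sketch(identity)
r6_yaw = quat_xyzw_to_6d_sketch(yaw_90)
```

The third row of `R` is dropped because it can be recovered as the cross product of the first two, which is what makes six numbers sufficient.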

### When your robot lacks some segments

If your robot has no head, no chassis, etc., just **omit those keys from the
dict**. For training-time statistics the missing dimensions are tracked as
NaN and skipped automatically. At encode/decode time, build a DOF mask so
the model zeros those dimensions on decode:

```python
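from xtokenizer.data import make_dof_mask_from_present_keys, make_dof_mask_from_nan

# Assumes `tok` is an XTokenizer loaded with `from_pretrained` and
# `chunk_length` is the number of frames T in the chunk being encoded.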

# Option 1: by enumerating the keys your robot provides.
dof_mask = make_dof_mask_from_present_keys(
    tok.get_dof_layout(),
    present_keys=["follow_left_ee_cartesian_pos",
                  "follow_left_ee_rotation_6D",
                  "follow_left_gripper"],
    T=chunk_length,
)

# Option 2: derive from NaN entries in an existing [T, 26] action array.
dof_mask = make_dof_mask_from_nan(action_matrix)
```

If you have all 26 dimensions, you can leave `dof_mask=None` (the default).
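
The two masking strategies can be sketched in plain NumPy as follows. This is an illustration only: it assumes a boolean `[T, 26]` mask in which `True` marks a dimension the robot provides, which may not match the shape or polarity the package uses. `mask_from_present_keys` and `mask_from_nan` are hypothetical names, not the package helpers.

```python
import numpy as np

# Ordered key -> width layout, copied from the episode table.
DOF_DIMS = {
    "follow_left_ee_cartesian_pos": 3,
    "follow_left_ee_rotation_6D": 6,
    "follow_left_gripper": 1,
    "follow_right_ee_cartesian_pos": 3,
    "follow_right_ee_rotation_6D": 6,
    "follow_right_gripper": 1,
    "velocity_decomposed": 3,
    "height": 1,
    "head_actions": 2,
}


def mask_from_present_keys(dof_dims, present_keys, T):
    """True where a dimension belongs to a provided key (assumed polarity)."""
    present = set(present_keys)
    unknown = present - set(dof_dims)
    if unknown:
        raise KeyError(f"keys not in layout: {sorted(unknown)}")
    row = np.concatenate(
        [np.full(dim, key in present, dtype=bool) for key, dim in dof_dims.items()]
    )
    return np.tile(row, (T, 1))  # repeat the per-frame row for all T frames


def mask_from_nan(action_matrix):
    """True where the [T, D] action array holds a non-NaN value."""
    return ~np.isnan(np.asarray(action_matrix))


left_arm = [
    "follow_left_ee_cartesian_pos",
    "follow_left_ee_rotation_6D",
    "follow_left_gripper",
]
mask_a = mask_from_present_keys(DOF_DIMS, left_arm, T=16)

# The same robot expressed as a [T, 26] array with NaN in the missing columns.
actions = np.full((16, 26), np.nan)
actions[:, :10] = 0.0
mask_b = mask_from_nan(actions)
```

For a left-arm-only robot both routes give the same mask, so the choice mostly depends on whether your data is stored as a dict or as one flat array.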

### Sample episodes for trying things out

Two synthetic episodes live in `examples/data/episode_001.npz` and
`episode_002.npz`. Each `.npz` already stores arrays under the stripped key
names, so:

```python
import numpy as np
with np.load("examples/data/episode_001.npz") as f:
    episode = {k: f[k] for k in f.files}
```

gives you a dict that matches the table above. Regenerate them any time
with `python examples/data/_generate_demo_data.py`.

---

## Case B: compute statistics from your data

Run this **once** per dataset. It writes `./my_statistics.json` containing
the percentile blocks the normalizer needs.

```python
from xtokenizer import XTokenizer
from xtokenizer.data import save_statistics_json
from xtokenizer.tools.statistics_builder import StatisticsAccumulator

tok = XTokenizer.from_pretrained("./xtokenizer.pth", device="cpu")

acc = StatisticsAccumulator(
    dof_dims=tok.get_dof_layout(),
    predict_action_keys=tok.get_predict_action_keys(),
    obs_state_keys=tok.get_obs_state_keys(),
    chunk_size=32,        # must match training chunk length (32 for the released ckpt)
    frame_interval=1,
)
for episode in your_episode_iterator():
    acc.add_episode(episode)

save_statistics_json({"MyRobot": acc.export()}, "./my_statistics.json")
```

A complete runnable version, including loading episodes from `.npz` files,
is in `examples/02_compute_statistics.py`.

### CLI alternative for batched workflows

If your episodes are stored as `.npz` files, the bundled CLI parallelises
the same computation:

```bash
xtokenizer-compute-statistics \
  --episodes-dir ./my_npz_episodes \
  --loader npz \
  --dataset-type MyRobot \
  --checkpoint ./xtokenizer.pth \
  --output ./my_statistics.json \
  --chunk-size 32 \
  --num-workers 8
```

Each `.npz` file must contain arrays named after the stripped action keys
above. To plug in your own loader, pass
`--loader your_module:your_callable_returning_episode_dict`.

`--chunk-size 32` must match the value used at training time, otherwise
the `delta_001` magnitudes drift and normalization clips heavily.
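
A toy example may help show why the chunk length matters. The sketch below measures the 0.1% to 99.9% spread of chunk-relative deltas on a synthetic 1-D trajectory. It assumes each delta is taken relative to the first frame of its chunk, which is a simplification of how the package selects the observation frame; `chunk_delta_range` is a hypothetical helper.

```python
import numpy as np


def chunk_delta_range(trajectory, chunk_size):
    """Spread between the 0.1% and 99.9% quantiles of chunk-relative deltas."""
    traj = np.asarray(trajectory, dtype=np.float64)
    deltas = []
    for start in range(0, len(traj) - chunk_size + 1):
        chunk = traj[start:start + chunk_size]
        deltas.append(chunk - chunk[0])  # assumed observation frame: chunk start
    deltas = np.concatenate(deltas)
    lo, hi = np.quantile(deltas, [0.001, 0.999])
    return hi - lo


# Toy trajectory: constant velocity of 1 cm per frame.
trajectory = 0.01 * np.arange(500)
range_32 = chunk_delta_range(trajectory, chunk_size=32)
range_64 = chunk_delta_range(trajectory, chunk_size=64)
```

With constant velocity, doubling the chunk length roughly doubles the spread, so statistics built with the wrong chunk size would rescale every delta the model sees.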

---

## Case A: encode / decode your actions

Once you have `./my_statistics.json`, the model is ready. The full pipeline
is six explicit steps:

```
absolute actions [T, D]
        |  Step A: compute delta vs. an observation frame (SO(3) for 6D rotation)
        v
delta [T, D]
        |  Step B: normalize delta and obs using the statistics
        v
normalized_delta -> Step C: encode -> indices  (your tokens)
                                   -> Step D: decode -> normalized_delta_recon
                                                       |  Step E: unnormalize
                                                       v
                                                     delta_recon
                                                       |  Step F: compose onto obs frame
                                                       v
                                                  absolute actions recon
```

Steps **B** and **E** are where the statistics file is consumed. Steps A,
C, D, F do not need statistics. The explicit version is in
`examples/01_encode_decode.py`; once you have read it, the same six steps
collapse into the high-level API:

```python
from xtokenizer import XTokenizer

tok = XTokenizer.from_pretrained("./xtokenizer.pth",
                                 statistics_path="./my_statistics.json")

indices = tok.encode_from_absolute(absolute_actions,  # [B, T, 26]
                                   obs_state_absolute,  # [B, 1, 26]
                                   robot_type="MyRobot")

T = absolute_actions.shape[1]                         # frames per chunk
actions_recon = tok.decode_to_absolute(indices,
                                       target_length=T,
                                       obs_state_absolute=obs_state_absolute,
                                       robot_type="MyRobot")
```

The released checkpoint accepts any `T ∈ [8, 64]` per call; you don't have
to reload between different chunk sizes.

### Token layout: `time_major` vs. `quantizer_major`

For every `T` action frames the model emits `T' = T / 4` latents, each of
which is quantized by `Q = 4` residual codebooks, so a chunk produces a
total of `T' × Q` discrete tokens. All `encode` / `decode` /
`reconstruct` / `encode_from_absolute` / `decode_to_absolute` methods
take a `token_order` keyword that controls how those tokens are laid out
in the returned tensor:

| `token_order`        | `indices.shape`     | flatten order (`indices.reshape(B, -1)`)              |
| -------------------- | ------------------- | ------------------------------------------------------ |
| `"time_major"` (default, backward-compatible) | `[B, T', Q]` | `[t0_q0, t0_q1, ..., t0_q3, t1_q0, ..., tN_q3]`        |
| `"quantizer_major"`  | `[B, Q, T']`        | `[q0_t0, q0_t1, ..., q0_tN, q1_t0, ..., q3_tN]`        |

Pick `quantizer_major` if your downstream model wants to emit the whole
first-residual stream before the second one (e.g. a "draft then refine"
generation order); pick `time_major` if it autoregresses time-step by
time-step. Both layouts carry exactly the same tokens; only their
ordering differs.

```python
# encode side: shape changes with token_order
out = tok.encode(normalized_delta, obs_state=normalized_obs,
                 token_order="quantizer_major")
indices = out["indices"]                        # [B, Q, T'] e.g. [2, 4, 8]
flat = indices.reshape(indices.shape[0], -1)    # [B, Q*T']  ready for an LLM

# decode side: accepts 3D ([B, Q, T']) or flat ([B, Q*T']) as long as
# token_order matches what you encoded with.
recon = tok.decode(flat, target_length=T, obs_state=normalized_obs,
                   token_order="quantizer_major")

# The high-level absolute API takes the same flag.
idx = tok.encode_from_absolute(absolute_actions, obs_state_absolute,
                               robot_type="MyRobot",
                               token_order="quantizer_major")
recon_abs = tok.decode_to_absolute(idx, target_length=T,
                                   obs_state_absolute=obs_state_absolute,
                                   robot_type="MyRobot",
                                   token_order="quantizer_major")
```
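
The relation between the two layouts can be sketched without the tokenizer at all. The NumPy snippet below assumes, as the table suggests, that `quantizer_major` is a pure transpose of the `time_major` indices; the random ids are stand-ins for real encoder output.

```python
import numpy as np

B, T, Q, RATIO = 2, 32, 4, 4
T_latent = T // RATIO  # 8 latent steps for 32 action frames

# Stand-in token ids; a real encoder would return codebook indices in [0, 2048).
rng = np.random.default_rng(0)
time_major = rng.integers(0, 2048, size=(B, T_latent, Q))  # [B, T', Q]

# Swapping the last two axes yields the quantizer-major view of the same tokens.
quantizer_major = time_major.transpose(0, 2, 1)            # [B, Q, T']

flat_tm = time_major.reshape(B, -1)       # t0_q0, t0_q1, ..., tN_q3
flat_qm = quantizer_major.reshape(B, -1)  # q0_t0, q0_t1, ..., q3_tN

# Undo the flattening and the transpose to recover the time-major layout.
restored = flat_qm.reshape(B, Q, T_latent).transpose(0, 2, 1)
```

Both flattened sequences have `T' × Q = 32` tokens per sample here, so the layout choice changes the generation order but not the sequence length.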

> Within a single latent step, `q0..q3` are **residual** codes (`q1`
> quantizes what `q0` left over, etc.). The RVQ is unchanged by the
> token layout, but downstream generators should still emit
> `q0 -> q1 -> q2 -> q3` per latent so they stay on the training
> distribution.
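
For readers new to residual vector quantization, the following toy NumPy sketch shows the idea for a single latent vector. It uses small random codebooks and greedy nearest-neighbour lookup; it is not the released model's quantizer (the package depends on `vector-quantize-pytorch`), and `rvq_encode` / `rvq_decode` are hypothetical names.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Greedy residual quantization: each stage encodes the previous residual."""
    residual = np.asarray(x, dtype=np.float64).copy()
    indices = []
    for codebook in codebooks:  # each codebook has shape [K, D]
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        residual = residual - codebook[idx]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Sum the selected code vectors, one per quantizer stage."""
    return sum(codebook[idx] for codebook, idx in zip(codebooks, indices))


rng = np.random.default_rng(0)
D, K, Q = 8, 16, 4  # toy sizes, far smaller than the released 2048 codes x 4
codebooks = [rng.normal(scale=0.5 ** q, size=(K, D)) for q in range(Q)]
latent = rng.normal(size=D)

indices, residual = rvq_encode(latent, codebooks)
recon = rvq_decode(indices, codebooks)
```

Because stage `q1` only sees what `q0` failed to capture, the codes of one latent step are meaningful only in that order, which is why the emission order matters for downstream generators.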

---

## Examples in this repo

| File | Case | What it does |
| --- | --- | --- |
| `examples/01_encode_decode.py` | A | Loads stats + a demo episode, walks the six-step pipeline explicitly, then verifies the high-level one-liner matches numerically. |
| `examples/02_compute_statistics.py` | B | Reads the two demo `.npz` episodes and writes `./my_statistics.json`. |
| `examples/data/episode_*.npz` | — | Two synthetic episodes used by examples 01 and 02. |
| `examples/data/_generate_demo_data.py` | — | Regenerate the demo episodes. |

A typical first-time session is `python examples/02_compute_statistics.py`
followed by `python examples/01_encode_decode.py`.

---

## Concept reference

### Action layout

The exact `dof_dims` mapping ships inside the checkpoint (`data_spec`
field). `tok.get_dof_layout()` returns it as an ordered dict matching the
table above. Slices `0:3, 3:9, 9:10, 10:13, 13:19, 19:20, 20:23, 23:24,
24:26` correspond to the nine keys in order.
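
The slice boundaries follow directly from the ordered widths. A small sketch, using the widths from the episode table (`layout_slices` is a hypothetical helper, not a package function):

```python
from collections import OrderedDict

DOF_DIMS = OrderedDict([
    ("follow_left_ee_cartesian_pos", 3),
    ("follow_left_ee_rotation_6D", 6),
    ("follow_left_gripper", 1),
    ("follow_right_ee_cartesian_pos", 3),
    ("follow_right_ee_rotation_6D", 6),
    ("follow_right_gripper", 1),
    ("velocity_decomposed", 3),
    ("height", 1),
    ("head_actions", 2),
])


def layout_slices(dof_dims):
    """Turn ordered key -> width pairs into contiguous column slices."""
    slices, start = {}, 0
    for key, dim in dof_dims.items():
        slices[key] = slice(start, start + dim)
        start += dim
    return slices, start


slices, action_dim = layout_slices(DOF_DIMS)
bounds = [(s.start, s.stop) for s in slices.values()]
```

Indexing a flat `[T, 26]` array with `slices["head_actions"]` then selects columns `24:26`.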

### Delta vs. absolute action

The tokenizer only ever sees deltas relative to an observation frame.
Cartesian / scalar segments use plain subtraction; the 6D rotation
segments use the SO(3) composition `R_delta = R_abs @ R_state.T`; head
angles wrap to `(-pi, pi]`. Helpers live in `xtokenizer.data.action_layout`
(`compute_delta_action`, `compute_absolute_action`).
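
The three delta rules can be sketched for a single frame in NumPy. This is an illustration under stated assumptions, not the package implementation: the 6D vector is taken to be rows 1 and 2 of `R`, the inverse composition is assumed to be `R_abs = R_delta @ R_state`, and `delta_sketch`, `compose_sketch`, and the other helper names are hypothetical.

```python
import numpy as np


def six_d_to_matrix(r6):
    """[6] -> [3, 3]: Gram-Schmidt on the two stored rows, third row by cross product."""
    a, b = np.asarray(r6[:3], float), np.asarray(r6[3:], float)
    r1 = a / np.linalg.norm(a)
    b = b - np.dot(r1, b) * r1
    r2 = b / np.linalg.norm(b)
    return np.stack([r1, r2, np.cross(r1, r2)])


def matrix_to_six_d(R):
    """[3, 3] -> [6]: keep rows 1 and 2 (assumed convention)."""
    return R[:2].reshape(6)


def wrap_angle(theta):
    """Wrap angles to the half-open interval (-pi, pi]."""
    return -((-theta + np.pi) % (2 * np.pi) - np.pi)


def delta_sketch(abs_pos, abs_r6, abs_head, obs_pos, obs_r6, obs_head):
    d_pos = abs_pos - obs_pos                                      # plain subtraction
    R_delta = six_d_to_matrix(abs_r6) @ six_d_to_matrix(obs_r6).T  # SO(3) delta
    d_head = wrap_angle(abs_head - obs_head)                       # wrapped angles
    return d_pos, matrix_to_six_d(R_delta), d_head


def compose_sketch(d_pos, d_r6, d_head, obs_pos, obs_r6, obs_head):
    R_abs = six_d_to_matrix(d_r6) @ six_d_to_matrix(obs_r6)
    return obs_pos + d_pos, matrix_to_six_d(R_abs), wrap_angle(obs_head + d_head)


def rot_z(deg):
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


obs_pos, abs_pos = np.array([0.1, 0.2, 0.3]), np.array([0.15, 0.2, 0.25])
obs_r6, abs_r6 = matrix_to_six_d(rot_z(30)), matrix_to_six_d(rot_z(75))
obs_head, abs_head = np.array([3.0, 0.0]), np.array([-3.0, 0.0])

d_pos, d_r6, d_head = delta_sketch(abs_pos, abs_r6, abs_head, obs_pos, obs_r6, obs_head)
rec_pos, rec_r6, rec_head = compose_sketch(d_pos, d_r6, d_head, obs_pos, obs_r6, obs_head)
```

The head example crosses the ±π boundary on purpose: the raw difference is -6 rad, while the wrapped delta is the short way around, about 0.28 rad.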

### Normalization & statistics

Deltas are min-max normalized to roughly `[-1, 1]` using statistics derived
from the 0.1% and 99.9% quantiles (stored as `q001` and `delta_001`).
Training statistics are not shipped;
compute your own (Case B). When statistics are loaded, the high-level API
will normalize both the delta and the obs frame; missing keys in your
statistics fall back to `(min=0, delta=1)` (i.e. no normalization), so it
is worth running `validate_statistics` once after building.

### Robot types

The model has 18 canonical embodiment slots:

```
0  Unknown
1  X2Arm        (default when robot_type=None)
2  Franka
3  UR5
4  Piper
5  Viperx
6  ARX5
7  WidowX
8  GoogleRobot
9  AgiBot
10 UMI
11 Realman
12 Ark
13 R1Lite
14 AlphaBot
15 MMK2
16 Leju
17 A2D
```

Pass the canonical name as a string (`robot_type="Franka"`). Names absent
from the registry resolve to id 0; passing `robot_type=None` uses id 1.
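
The lookup rule can be written out as a short sketch. The registry is copied from the list above; `resolve_robot_type` is a hypothetical function, and exact, case-sensitive name matching is an assumption.

```python
CANONICAL_ROBOT_TYPES = [
    "Unknown", "X2Arm", "Franka", "UR5", "Piper", "Viperx",
    "ARX5", "WidowX", "GoogleRobot", "AgiBot", "UMI", "Realman",
    "Ark", "R1Lite", "AlphaBot", "MMK2", "Leju", "A2D",
]
ROBOT_TYPE_TO_ID = {name: i for i, name in enumerate(CANONICAL_ROBOT_TYPES)}


def resolve_robot_type(robot_type):
    """None -> default id 1 (X2Arm); unregistered names -> id 0 (Unknown)."""
    if robot_type is None:
        return ROBOT_TYPE_TO_ID["X2Arm"]
    return ROBOT_TYPE_TO_ID.get(robot_type, ROBOT_TYPE_TO_ID["Unknown"])


franka_id = resolve_robot_type("Franka")
default_id = resolve_robot_type(None)
custom_id = resolve_robot_type("MyRobot")
```

Under this reading, a custom name such as `"MyRobot"` falls into the `Unknown` slot, so it selects statistics by name but does not select a dedicated embodiment id.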

### Dropout robustness

During training, `obs_state` and `robot_type` were each independently
dropped with probability 0.2. At inference all four combinations are
legal: both provided, only `obs_state`, only `robot_type`, or neither.
Reconstruction quality is typically highest when both are passed.

---

## API cheatsheet

```python
XTokenizer.from_pretrained(model_path, statistics_path=None,
                           robot_types_path=None, device=None)

# Low-level: caller supplies normalized deltas
tok.encode(normalized_delta, obs_state=None, dof_mask=None,
           padding_mask=None, robot_type=None,
           token_order="time_major")          # or "quantizer_major"
tok.decode(indices, target_length, obs_state=None, dof_mask=None,
           padding_mask=None, robot_type=None,
           token_order="time_major")          # accepts [B,T',Q] / [B,Q,T'] / flat
tok.reconstruct(normalized_delta, ..., token_order="time_major")

# High-level: absolute actions in/out (requires statistics_path)
tok.encode_from_absolute(absolute_actions, obs_state_absolute, robot_type,
                         dof_mask=None, padding_mask=None,
                         token_order="time_major")
tok.decode_to_absolute(indices, target_length, obs_state_absolute, robot_type,
                       dof_mask=None, token_order="time_major")

# Statistics access (used internally by the high-level API; also handy
# when you split the pipeline yourself).
tok.get_normalizer(stats_key) -> (predict_min, predict_delta, obs_min, obs_delta)

# NumPy variants of every method above: <method>_numpy

# Introspection
tok.get_action_dim()
tok.get_vocab_size()
tok.get_num_quantizers()
tok.get_compression_ratio()
tok.get_max_chunk_length()
tok.get_dof_layout()              # ordered {key: dim}
tok.get_predict_action_keys()
tok.get_obs_state_keys()
tok.get_canonical_robot_types()   # name -> id
```

Data utilities re-exported under `xtokenizer.data`:

```
ActionLayout, compute_delta_action, compute_absolute_action,
split_by_dof_dims, stack_by_dof_dims,
make_dof_mask_from_nan, make_dof_mask_from_present_keys,
normalize_action, unnormalize_action, build_normalizer_from_keys,
euler_to_6d, quat_to_6d, so3_6d_to_matrix, matrix_to_so3_6d,
compute_delta_6d_batch, compose_6d_batch,
load_statistics_json, save_statistics_json, validate_statistics,
```


---

## Citation

If you find X-Tokenizer useful, please cite:

```bibtex
@misc{kang2026xtokenizermultimodalactiontokenizer,
      title={X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining},
      author={Xirui Kang and Yanpei Shi and Lucy Liang and Roy Gan and Dongxiu Liu and Pushi Zhang and Danpeng Chen and Xiaoyi Qin and Yinan Zheng and Jinliang Zheng and Hao Wang and Xianyuan Zhan and Hang Su},
      year={2026},
      eprint={2606.14752},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.14752},
}
```

---

## License

Apache 2.0 — see [LICENSE](LICENSE).