|
|
--- |
|
|
tags: |
|
|
- RDT |
|
|
- rdt |
|
|
- tokenizer |
|
|
- action |
|
|
- discrete |
|
|
- vector-quantization |
|
|
- RDT 2 |
|
|
license: apache-2.0 |
|
|
pipeline_tag: robotics |
|
|
--- |
|
|
|
|
|
# RVQ-AT: Residual VQ Action Tokenizer for RDT 2 |
|
|
|
|
|
**RVQ-AT** is a fast, compact **Residual Vector-Quantization** (RVQ) tokenizer for robot action streams. |
|
|
It converts continuous control trajectories into short sequences of **discrete action tokens** that plug directly into autoregressive VLA models. |
|
|
|
|
|
Unlike single-codebook VQ, RVQ-AT stacks multiple small codebooks and quantizes **residuals** level-by-level (see the sketch after the list below). This yields:
|
|
|
|
|
* **Higher fidelity at the same bitrate** (lower recon MSE / higher SNR) |
|
|
* **Shorter token sequences** for the same time horizon |
|
|
* **Stable training** via commitment loss, EMA codebook updates, and dead-code revival |
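
Conceptually, each level quantizes whatever the previous levels could not explain, and decoding sums the selected code vectors across levels. Below is a minimal, illustrative sketch of that loop; it is **not** the released `MultiVQVAE` implementation, which additionally handles temporal encoding, EMA codebook updates, and dead-code revival.

```python
import torch

def rvq_encode(x, codebooks):
    # x: (B, D) latent vectors; codebooks: list of (K, D) code tables
    residual = x
    indices = []
    for codebook in codebooks:
        # Nearest code for the current residual
        idx = torch.cdist(residual, codebook).argmin(dim=-1)  # (B,)
        residual = residual - codebook[idx]                   # pass the leftover to the next level
        indices.append(idx)
    return torch.stack(indices, dim=-1)                       # (B, num_levels)

def rvq_decode(indices, codebooks):
    # Reconstruction is the sum of the selected code vectors across levels
    return sum(cb[indices[:, level]] for level, cb in enumerate(codebooks))
```

With `L` levels of `K` codes each, a vector costs `L` tokens while the number of reachable reconstructions grows like `K^L`, which is why stacking small codebooks reaches higher fidelity than a single large codebook at the same bitrate.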
|
|
|
|
|
Here, we provide: |
|
|
|
|
|
1. **RVQ-AT**: a general-purpose tokenizer trained on diverse UMI manipulation data that also generalizes well to tele-operation data.
|
|
2. **Simple APIs to fit your own tokenizer** on custom action datasets. |
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
## Using the Universal RVQ-AT Tokenizer |
|
|
|
|
|
We recommend chunking actions into ~**0.8 s windows** at fps = 30 and normalizing each action dimension to **[-1, 1]** with the [normalizer](http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt) before tokenization. Batched encode/decode is supported.
|
|
|
|
|
```python |
|
|
# Run under repository: https://github.com/thu-ml/RDT2 |
|
|
|
|
|
import torch |
|
|
import numpy as np |
|
|
from models.normalizer import LinearNormalizer |
|
|
from vqvae.models.multivqvae import MultiVQVAE |
|
|
|
|
|
# Load from the Hub (replace with your repo id once published) |
|
|
vae = MultiVQVAE.from_pretrained("outputs/vqvae_hf").cuda().eval() |
|
|
normalizer = LinearNormalizer.load( |
|
|
"<Path_to_normalizer>" # Download from: |
|
|
# http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt |
|
|
) |
|
|
|
|
|
# Load your RELATIVE action chunk |
|
|
action_chunk = torch.zeros((1, 24, 20)) |
|
|
# action_chunk shape: (B, T, action_dim) |
|
|
# - T = 24: predicts the future 0.8 s at fps = 30 → 24 frames
|
|
# - action_dim = 20: following UMI setting (both arms, right to left) |
|
|
# - [0-2]: RIGHT ARM end effector position in (x, y, z), unit: m |
|
|
# - [3-8]: RIGHT ARM end effector rotation (6D representation) |
|
|
# - [9]: RIGHT ARM gripper width, normalized to [0, 0.088] (0.088 means fully open)
|
|
# - [10-12]: LEFT ARM end effector position in (x, y, z), unit: m |
|
|
# - [13-18]: LEFT ARM end effector rotation (6D representation) |
|
|
# - [19]: LEFT ARM gripper width, normalized to [0, 0.088] (0.088 means fully open)
|
|
|
|
|
# Normalize action |
|
|
nsample = normalizer["action"].normalize(action_chunk).cuda() |
|
|
|
|
|
# Encode → tokens
|
|
# tokens: torch.LongTensor with shape (B, num_valid_action_token) |
|
|
# num_valid_action_token = 27, values in range [0, 1024) |
|
|
tokens = vae.encode(nsample) # or vae.encode(action_chunk) |
|
|
|
|
|
# Decode back to continuous actions |
|
|
recon_nsample = vae.decode(tokens) |
|
|
recon_action_chunk = normalizer["action"].unnormalize(recon_nsample) |
|
|
|
|
|
``` |
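
For longer recordings, the recommended 0.8 s chunking amounts to slicing the relative action stream into consecutive 24-frame windows and encoding them as one batch. A minimal sketch, assuming `trajectory` is a hypothetical `(N, 20)` tensor of relative actions at 30 fps and `vae`/`normalizer` are the objects loaded above:

```python
import torch

# Hypothetical trajectory of relative actions at 30 fps, shape (N, 20)
trajectory = torch.zeros((240, 20))

T = 24  # 0.8 s at 30 fps
num_chunks = trajectory.shape[0] // T

# Consecutive, non-overlapping 0.8 s windows -> (num_chunks, 24, 20)
chunks = trajectory[: num_chunks * T].reshape(num_chunks, T, 20)

# Normalize, batch-encode, and round-trip back to continuous actions
nsample = normalizer["action"].normalize(chunks).cuda()
tokens = vae.encode(nsample)                                  # (num_chunks, 27)
recon_action = normalizer["action"].unnormalize(vae.decode(tokens))
```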
|
|
|
|
|
--- |
|
|
|
|
|
## [IMPORTANT] Recommended Preprocessing |
|
|
|
|
|
Although our Residual VQ generalizes well across both hand-held gripper data and real robot data,
if you plan to fine-tune on your own dataset we recommend first verifying that your data statistics fall within the bounds of our RVQ,
and then evaluating the reconstruction error on your data before using it for your own purposes, especially fine-tuning (a minimal check is sketched below).
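
One way to run this check, reusing the `vae` and `normalizer` objects from the snippet above (the shapes below are placeholders for your own data):

```python
import torch

# your_chunks: (B, 24, 20) relative action chunks sampled from your dataset
your_chunks = torch.zeros((64, 24, 20))

nsample = normalizer["action"].normalize(your_chunks)

# 1) Statistics check: normalized values should stay roughly within [-1, 1]
out_of_range = (nsample.abs() > 1.0).float().mean().item()
print(f"fraction of out-of-range values: {out_of_range:.4f}")

# 2) Reconstruction check: encode/decode round trip in normalized space
tokens = vae.encode(nsample.cuda())
recon = vae.decode(tokens).cpu()
print(f"reconstruction MSE: {torch.mean((recon - nsample) ** 2).item():.6f}")
```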
|
|
|
|
|
--- |
|
|
|
|
|
|
|
## Safety & Intended Use |
|
|
|
|
|
RVQ-AT is a representation learning component. **Do not** deploy decoded actions directly to hardware without: |
|
|
|
|
|
* Proper sim-to-real validation, |
|
|
* Safety bounds/clamping and rate limiters, |
|
|
* Task-level monitors and e-stop fallbacks. |
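
As one illustration of the bounds/clamping point, a hypothetical pre-dispatch filter could clamp decoded end-effector positions to a workspace box and rate-limit per-frame deltas; the limits below are placeholders, not validated values:

```python
import torch

def safety_filter(action, prev_action, pos_bounds=(-0.5, 0.5), max_step=0.01):
    # Placeholder limits; tune and validate for your own hardware and task.
    action = action.clone()
    # Clamp RIGHT/LEFT end-effector positions (dims 0-2 and 10-12) to a workspace box
    for sl in (slice(0, 3), slice(10, 13)):
        action[..., sl] = action[..., sl].clamp(*pos_bounds)
    # Rate limiter: bound how far each dimension may move in a single frame
    delta = (action - prev_action).clamp(-max_step, max_step)
    return prev_action + delta
```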
|
|
|
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use RVQ-AT in your work, please cite: |
|
|
|
|
|
```bibtex |
|
|
@software{rdt2, |
|
|
title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data}, |
|
|
author={RDT Team}, |
|
|
url={https://github.com/thu-ml/RDT2}, |
|
|
month={September}, |
|
|
year={2025} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Contact |
|
|
|
|
|
* Issues & requests: open a GitHub issue (see [here](https://github.com/thu-ml/RDT2/blob/main/CONTRIBUTING.md) for guidelines) or start a Hub discussion on the model page. |
|
|
|
|
|
--- |
|
|
|
|
|
## License |
|
|
|
|
|
This repository and the released models are licensed under **Apache-2.0**. You may use, modify, and distribute them, provided you **keep a copy of the original license and notices** in your distributions and **state significant changes** when you make them.