|
|
--- |
|
|
tags: |
|
|
- RDT |
|
|
- rdt |
|
|
- tokenizer |
|
|
- action |
|
|
- discrete |
|
|
- vector-quantization |
|
|
- RDT 2 |
|
|
license: apache-2.0 |
|
|
pipeline_tag: robotics |
|
|
--- |
|
|
|
|
|
# RVQ-AT: Residual VQ Action Tokenizer for RDT 2 |
|
|
|
|
|
**RVQ-AT** is a fast, compact **Residual Vector-Quantization** (RVQ) tokenizer for robot action streams. |
|
|
It converts continuous control trajectories into short sequences of **discrete action tokens** that plug directly into autoregressive VLA models. |
|
|
|
|
|
Unlike single-codebook VQ, RVQ-AT stacks multiple small codebooks and quantizes **residuals** level-by-level (see the sketch after the list below). This yields:
|
|
|
|
|
* **Higher fidelity at the same bitrate** (lower recon MSE / higher SNR) |
|
|
* **Shorter token sequences** for the same time horizon |
|
|
* **Stable training** via commitment loss, EMA codebook updates, and dead-code revival |
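
Conceptually, each level quantizes whatever the previous levels could not explain, and decoding sums the selected code vectors across levels. Below is a minimal, illustrative sketch of that loop; it is **not** the released `MultiVQVAE` implementation, which additionally handles temporal encoding, EMA codebook updates, and dead-code revival.

```python
import torch

def rvq_encode(x, codebooks):
    # x: (B, D) latent vectors; codebooks: list of (K, D) code tables
    residual = x
    indices = []
    for codebook in codebooks:
        # Nearest code for the current residual
        idx = torch.cdist(residual, codebook).argmin(dim=-1)  # (B,)
        residual = residual - codebook[idx]                   # pass the leftover to the next level
        indices.append(idx)
    return torch.stack(indices, dim=-1)                       # (B, num_levels)

def rvq_decode(indices, codebooks):
    # Reconstruction is the sum of the selected code vectors across levels
    return sum(cb[indices[:, level]] for level, cb in enumerate(codebooks))
```

With `L` levels of `K` codes each, a vector costs `L` tokens while the number of reachable reconstructions grows like `K^L`, which is why stacking small codebooks reaches higher fidelity than a single large codebook at the same bitrate.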
|
|
|
|
|
Here, we provide: |
|
|
|
|
|
1. **RVQ-AT**: a general-purpose tokenizer trained on diverse UMI manipulation data that also generalizes well to tele-operation data.
|
|
2. **Simple APIs to fit your own tokenizer** on custom action datasets. |
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
## Using the Universal RVQ-AT Tokenizer |
|
|
|
|
|
We recommend chunking actions into ~**0.8 s windows** at fps = 30 and normalizing each action dimension to **[-1, 1]** with the [normalizer](http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt) before tokenization. Batched encode/decode is supported.
|
|
|
|
|
```python |
|
|
# Run under repository: https://github.com/thu-ml/RDT2 |
|
|
|
|
|
import torch |
|
|
import numpy as np |
|
|
from models.normalizer import LinearNormalizer |
|
|
from vqvae.models.multivqvae import MultiVQVAE |
|
|
|
|
|
# Load from the Hub (replace with your repo id once published) |
|
|
vae = MultiVQVAE.from_pretrained("outputs/vqvae_hf").cuda().eval() |
|
|
normalizer = LinearNormalizer.load( |
|
|
"<Path_to_normalizer>" # Download from: |
|
|
# http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt |
|
|
) |
|
|
|
|
|
# Load your RELATIVE action chunk |
|
|
action_chunk = torch.zeros((1, 24, 20)) |
|
|
# action_chunk shape: (B, T, action_dim) |
|
|
# - T = 24: predicts the future 0.8 s at fps = 30 → 24 frames
|
|
# - action_dim = 20: following UMI setting (both arms, right to left) |
|
|
# - [0-2]: RIGHT ARM end effector position in (x, y, z), unit: m |
|
|
# - [3-8]: RIGHT ARM end effector rotation (6D representation) |
|
|
# - [9]: RIGHT ARM gripper width, normalized to [0, 0.088] (0.088 means fully open)
|
|
# - [10-12]: LEFT ARM end effector position in (x, y, z), unit: m |
|
|
# - [13-18]: LEFT ARM end effector rotation (6D representation) |
|
|
# - [19]: LEFT ARM gripper width, normalized to [0, 0.088] (0.088 means fully open)
|
|
|
|
|
# Normalize action |
|
|
nsample = normalizer["action"].normalize(action_chunk).cuda() |
|
|
|
|
|
# Encode → tokens
|
|
# tokens: torch.LongTensor with shape (B, num_valid_action_token) |
|
|
# num_valid_action_token = 27, values in range [0, 1024) |
|
|
tokens = vae.encode(nsample) # or vae.encode(action_chunk) |
|
|
|
|
|
# Decode back to continuous actions |
|
|
recon_nsample = vae.decode(tokens) |
|
|
recon_action_chunk = normalizer["action"].unnormalize(recon_nsample) |
|
|
|
|
|
``` |
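
For longer recordings, the recommended 0.8 s chunking amounts to slicing the relative action stream into consecutive 24-frame windows and encoding them as one batch. A minimal sketch, assuming `trajectory` is a hypothetical `(N, 20)` tensor of relative actions at 30 fps and `vae`/`normalizer` are the objects loaded above:

```python
import torch

# Hypothetical trajectory of relative actions at 30 fps, shape (N, 20)
trajectory = torch.zeros((240, 20))

T = 24  # 0.8 s at 30 fps
num_chunks = trajectory.shape[0] // T

# Consecutive, non-overlapping 0.8 s windows -> (num_chunks, 24, 20)
chunks = trajectory[: num_chunks * T].reshape(num_chunks, T, 20)

# Normalize, batch-encode, and round-trip back to continuous actions
nsample = normalizer["action"].normalize(chunks).cuda()
tokens = vae.encode(nsample)                                  # (num_chunks, 27)
recon_action = normalizer["action"].unnormalize(vae.decode(tokens))
```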
|
|
|
|
|
--- |
|
|
|
|
|
## [IMPORTANT] Recommended Preprocessing |
|
|
|
|
|
Although our Residual VQ generalizes well across both hand-held gripper data and real robot data,
if you plan to fine-tune on your own dataset we recommend first verifying that your data statistics fall within the bounds of our RVQ,
and then evaluating the reconstruction error on your data before using it for your own purposes, especially fine-tuning (a minimal check is sketched below).
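
One way to run this check, reusing the `vae` and `normalizer` objects from the snippet above (the shapes below are placeholders for your own data):

```python
import torch

# your_chunks: (B, 24, 20) relative action chunks sampled from your dataset
your_chunks = torch.zeros((64, 24, 20))

nsample = normalizer["action"].normalize(your_chunks)

# 1) Statistics check: normalized values should stay roughly within [-1, 1]
out_of_range = (nsample.abs() > 1.0).float().mean().item()
print(f"fraction of out-of-range values: {out_of_range:.4f}")

# 2) Reconstruction check: encode/decode round trip in normalized space
tokens = vae.encode(nsample.cuda())
recon = vae.decode(tokens).cpu()
print(f"reconstruction MSE: {torch.mean((recon - nsample) ** 2).item():.6f}")
```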
|
|
|
|
|
--- |
|
|
|
|
|
|
|
## Safety & Intended Use |
|
|
|
|
|
RVQ-AT is a representation learning component. **Do not** deploy decoded actions directly to hardware without: |
|
|
|
|
|
* Proper sim-to-real validation, |
|
|
* Safety bounds/clamping and rate limiters, |
|
|
* Task-level monitors and e-stop fallbacks. |
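
As one illustration of the bounds/clamping point, a hypothetical pre-dispatch filter could clamp decoded end-effector positions to a workspace box and rate-limit per-frame deltas; the limits below are placeholders, not validated values:

```python
import torch

def safety_filter(action, prev_action, pos_bounds=(-0.5, 0.5), max_step=0.01):
    # Placeholder limits; tune and validate for your own hardware and task.
    action = action.clone()
    # Clamp RIGHT/LEFT end-effector positions (dims 0-2 and 10-12) to a workspace box
    for sl in (slice(0, 3), slice(10, 13)):
        action[..., sl] = action[..., sl].clamp(*pos_bounds)
    # Rate limiter: bound how far each dimension may move in a single frame
    delta = (action - prev_action).clamp(-max_step, max_step)
    return prev_action + delta
```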
|
|
|
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use RVQ-AT in your work, please cite: |
|
|
|
|
|
```bibtex |
|
|
@software{rdt2, |
|
|
title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data}, |
|
|
author={RDT Team}, |
|
|
url={https://github.com/thu-ml/RDT2}, |
|
|
month={September}, |
|
|
year={2025} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Contact |
|
|
|
|
|
* Issues & requests: open a GitHub issue (see [here](https://github.com/thu-ml/RDT2/blob/main/CONTRIBUTING.md) for guidelines) or start a Hub discussion on the model page. |
|
|
|
|
|
--- |
|
|
|
|
|
## License |
|
|
|
|
|
This repository and the released models are licensed under **Apache-2.0**. You may use, modify, and distribute them, provided you **keep a copy of the original license and notices** in your distributions and **state significant changes** when you make them.