---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: robotics
arxiv: 2602.03310
tags:
- RDT
- rdt
- RDT 2
- Vision-Language-Action
- Bimanual
- Manipulation
- Zero-shot
- UMI
---

# RDT2-VQ: Vision-Language-Action with Residual VQ Action Tokens

**RDT2-VQ** is an autoregressive Vision-Language-Action (VLA) model adapted from **[Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)** and trained on large-scale **UMI** bimanual manipulation data.

It predicts a short-horizon **relative action chunk** (24 steps, 20 dims/step) from binocular wrist-camera RGB and a natural-language instruction. Actions are discretized with a lightweight **Residual VQ (RVQ)** tokenizer, enabling robust zero-shot transfer across **unseen embodiments** for simple, open-vocabulary skills (e.g., pick, place, shake, wipe).

[**Paper**](https://huggingface.co/papers/2602.03310) - [**Home**](https://rdt-robotics.github.io/rdt2/) - [**Github**](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file) - [**Discord**](https://discord.gg/vsZS3zmf9A)

---

## Table of contents

* [Highlights](#highlights)
* [Model details](#model-details)
* [Hardware & software requirements](#hardware--software-requirements)
* [Quickstart (inference)](#quickstart-inference)
* [Intended uses & limitations](#intended-uses--limitations)
* [Troubleshooting](#troubleshooting)
* [Changelog](#changelog)
* [Citation](#citation)
* [Contact](#contact)

---

## Highlights

* **Zero-shot cross-embodiment**: Demonstrated on bimanual **UR5e** and **Franka Research 3** setups; designed to generalize further with correct hardware calibration.
* **UMI scale**: Trained on **10k+ hours** from **100+ indoor scenes** of human manipulation with the UMI gripper.
* **Residual VQ action tokenizer**: Compact, stable action codes; open-vocabulary instruction following via the Qwen2.5-VL-7B backbone.

---

## Model details

### Architecture

* **Backbone**: Qwen2.5-VL-7B-Instruct (vision-language).
* **Observation**: Two wrist-camera RGB images (right/left), 384×384, JPEG-like statistics.
* **Instruction**: Short imperative text; recommended format **“Verb + Object.”** (e.g., “Pick up the apple.”).

### Action representation (UMI bimanual, per 24-step chunk)

* 20-D per step = right (10) + left (10):
  * pos (x, y, z): 3
  * rot (6D rotation): 6
  * gripper width: 1
* Output tensor shape: **(T=24, D=20)**, relative deltas, `float32`.
* The RVQ tokenizer yields a fixed-length token sequence; see the tokenizer card for exact code lengths. A slicing sketch follows this list.
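
As a quick orientation, here is a minimal slicing sketch for a predicted chunk; the authoritative index layout is in the Quickstart comments below, and the `action_chunk` tensor here is just a stand-in:

```python
import torch

action_chunk = torch.zeros(24, 20)  # stand-in for a predicted chunk of shape (T=24, D=20)

for side, off in (("right", 0), ("left", 10)):
    pos = action_chunk[:, off + 0 : off + 3]    # (24, 3) end-effector position (m)
    rot6d = action_chunk[:, off + 3 : off + 9]  # (24, 6) rotation, 6D representation
    grip = action_chunk[:, off + 9]             # (24,)  gripper width (m)
    print(side, pos.shape, rot6d.shape, grip.shape)
```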

### Tokenizer

* **Tokenizer repo**: [`robotics-diffusion-transformer/RVQActionTokenizer`](https://huggingface.co/robotics-diffusion-transformer/RVQActionTokenizer)
* Use **float32** for the VQ model.
* Provide a **[LinearNormalizer](http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt)** for action scaling (UMI convention); a loading sketch follows this list.
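
A minimal loading sketch, mirroring the Quickstart below (`MultiVQVAE` and `LinearNormalizer` come from the RDT2 repository; the normalizer path assumes you downloaded the checkpoint linked above):

```python
import torch

from vqvae import MultiVQVAE
from models.normalizer import LinearNormalizer

# Keep the VQ model in float32.
vae = MultiVQVAE.from_pretrained("robotics-diffusion-transformer/RVQActionTokenizer").eval()
vae = vae.to(dtype=torch.float32)

# Path to the downloaded UMI LinearNormalizer checkpoint.
normalizer = LinearNormalizer.from_pretrained("umi_normalizer_wo_downsample_indentity_rot.pt")
```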

---

## Hardware & software requirements

Approximate **single-GPU** requirements (Qwen2.5-VL-7B-Instruct scale):

| Mode      |     RAM |    VRAM | Example GPU             |
| --------- | ------: | ------: | ----------------------- |
| Inference | ≥ 32 GB | ≥ 16 GB | RTX 4090                |
| LoRA FT   |       – | ≥ 32 GB | A100 40GB               |
| Full FT   |       – | ≥ 80 GB | A100 80GB / H100 / B200 |

> For **deployment on real robots**, follow your platform’s **end-effector + camera** choices and perform **hardware setup & calibration** (camera stand/pose, flange, etc.) before running closed-loop policies.

**Tested OS**: Ubuntu 24.04.

---

## Quickstart (inference)

```python
# Run inside the repository: https://github.com/thu-ml/RDT2

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

from vqvae import MultiVQVAE
from models.normalizer import LinearNormalizer
from utils import batch_predict_action

# assuming GPU 0 is used
device = "cuda:0"


processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "robotics-diffusion-transformer/RDT2-VQ",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map=device,
).eval()
vae = MultiVQVAE.from_pretrained("robotics-diffusion-transformer/RVQActionTokenizer").eval()
vae = vae.to(device=device, dtype=torch.float32)

valid_action_id_length = (
    vae.pos_id_len + vae.rot_id_len + vae.grip_id_len
)
# TODO: change to your own downloaded normalizer path
# download from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
normalizer = LinearNormalizer.from_pretrained("umi_normalizer_wo_downsample_indentity_rot.pt")

result = batch_predict_action(
    model,
    processor,
    vae,
    normalizer,
    examples=[
        {
            "obs": {
                # NOTE: following the UMI convention, camera0_rgb is the right arm, camera1_rgb the left arm
                "camera0_rgb": ...,  # RGB image as np.ndarray of shape (1, 384, 384, 3) with dtype=np.uint8
                "camera1_rgb": ...,  # RGB image as np.ndarray of shape (1, 384, 384, 3) with dtype=np.uint8
            },
            "meta": {
                "num_camera": 2
            },
        },
        ...,  # batch inference is supported, so you can pass a list of examples
    ],
    valid_action_id_length=valid_action_id_length,
    # Since the model is trained mostly on JPEG images, we suggest toggling this on for better performance.
    apply_jpeg_compression=True,
    # We suggest instructions in the format "Verb + Object." with a capitalized first letter and a trailing period.
    instruction="Pick up the apple.",
)

# get the predicted action chunk from example 0
action_chunk = result["action_pred"][0]  # torch.FloatTensor of shape (24, 20) with dtype=torch.float32
# action_chunk is (T, D) with T=24, D=20
# T=24: the chunk covers the next 0.8 s at fps=30, i.e. 24 frames
# D=20: following the UMI convention, both arms are predicted, right before left:
# - [0-2]:   RIGHT ARM end-effector position x, y, z (unit: m)
# - [3-8]:   RIGHT ARM end-effector rotation in the 6D rotation representation
# - [9]:     RIGHT ARM gripper width (unit: m)
# - [10-12]: LEFT ARM end-effector position x, y, z (unit: m)
# - [13-18]: LEFT ARM end-effector rotation in the 6D rotation representation
# - [19]:    LEFT ARM gripper width (unit: m)

# rescale gripper width from [0, 0.088] to [0, 0.1]
for robot_idx in range(2):
    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
```
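
The rotation components above use the 6D rotation representation. Below is a minimal sketch of the standard Gram-Schmidt recovery of rotation matrices from 6D vectors (Zhou et al., 2019); the row/column convention RDT2 assumes is not specified here, so verify against the repository's utilities before deploying:

```python
import torch
import torch.nn.functional as F

def rot6d_to_matrix(rot6d: torch.Tensor) -> torch.Tensor:
    """Recover (..., 3, 3) rotation matrices from (..., 6) vectors via Gram-Schmidt."""
    a1, a2 = rot6d[..., 0:3], rot6d[..., 3:6]
    b1 = F.normalize(a1, dim=-1)                                         # first basis vector
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)  # orthogonalize against b1
    b3 = torch.cross(b1, b2, dim=-1)                                     # right-handed third axis
    return torch.stack([b1, b2, b3], dim=-2)  # assumed row-vector convention; check the repo

right_rot = rot6d_to_matrix(action_chunk[:, 3:9])   # (24, 3, 3)
left_rot = rot6d_to_matrix(action_chunk[:, 13:19])  # (24, 3, 3)
```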

> For **installation and fine-tuning instructions**, please refer to the official [GitHub repository](https://github.com/thu-ml/RDT2).

---

## Intended uses & limitations

**Intended uses**

* Research in **robot manipulation** and **VLA modeling**.
* Zero-shot or few-shot deployment on bimanual systems following the repo’s **[hardware calibration](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file#1-important-hard-ware-set-up-and-calibration)** steps.

**Limitations**

* Open-world robustness depends on **calibration quality**, camera placement, and gripper specifics.
* Requires correct **normalization** and **RVQ code compatibility**.
* Safety-critical deployment requires **supervision**, interlocks, and conservative velocity/force limits.

**Safety & responsible use**

* Always test in simulation or with **hardware limits** engaged (reduced speed, gravity compensation, E-stop within reach).

---

## Troubleshooting

| Symptom                            | Likely cause   | Suggested fix                                                             |
| ---------------------------------- | -------------- | ------------------------------------------------------------------------- |
| Drifting / unstable gripper widths | Scale mismatch | Apply the **LinearNormalizer**; rescale widths (\[0, 0.088] → \[0, 0.1]). |
| Poor instruction following         | Prompt format  | Use “**Verb + Object.**” with capitalization + period.                    |
| No improvement after FT            | OOD actions    | Check RVQ bounds & reconstruction error; verify normalization.            |
| Vision brittleness                 | JPEG gap       | Enable `--image_corruption`; ensure 384×384 inputs.                       |
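
If you preprocess images yourself rather than relying on `apply_jpeg_compression=True`, a JPEG round-trip can approximate the training-time image statistics. A minimal sketch; the `quality` value is an assumption, not a documented setting:

```python
import io

import numpy as np
from PIL import Image

def jpeg_roundtrip(img: np.ndarray, quality: int = 85) -> np.ndarray:
    """Encode/decode an HxWx3 uint8 RGB image as JPEG to mimic JPEG statistics."""
    buf = io.BytesIO()
    Image.fromarray(img).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf).convert("RGB"))
```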

---

## Changelog

* **2025-09**: Initial release of **RDT2-VQ** on Hugging Face.

---

## Citation

```bibtex
@article{rdt2,
  title={RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization},
  author={Ji, Xuan and others},
  journal={arXiv preprint arXiv:2602.03310},
  year={2026}
}

@software{rdt2_code,
  title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
  author={RDT Team},
  url={https://github.com/thu-ml/RDT2},
  month={September},
  year={2025}
}
```

---
|
## Contact

* Project page: [https://rdt-robotics.github.io/rdt2/](https://rdt-robotics.github.io/rdt2/)
* Organization: [https://huggingface.co/robotics-diffusion-transformer](https://huggingface.co/robotics-diffusion-transformer)
* Discord: [https://discord.gg/vsZS3zmf9A](https://discord.gg/vsZS3zmf9A)