---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: robotics
library_name: transformers
tags:
- RDT
- rdt
- RDT 2
- Vision-Language-Action
- Bimanual
- Manipulation
- Zero-shot
- UMI
---
# RDT2-VQ: Vision-Language-Action with Residual VQ Action Tokens
**RDT2-VQ** is an autoregressive Vision-Language-Action (VLA) model adapted from **[Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)** and trained on large-scale **UMI** bimanual manipulation data.
It predicts a short-horizon **relative action chunk** (24 steps, 20 dims/step) from two wrist-camera RGB images and a natural-language instruction.
Actions are discretized with a lightweight **Residual VQ (RVQ)** tokenizer, enabling robust zero-shot transfer across **unseen embodiments** for simple, open-vocabulary skills (e.g., pick, place, shake, wipe).
[**Home**](https://rdt-robotics.github.io/rdt2/) - [**GitHub**](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file) - [**Discord**](https://discord.gg/vsZS3zmf9A)
---
## Table of contents
* [Highlights](#highlights)
* [Model details](#model-details)
* [Hardware & software requirements](#hardware--software-requirements)
* [Quickstart (inference)](#quickstart-inference)
* [Precision settings](#precision-settings)
* [Intended uses & limitations](#intended-uses--limitations)
* [Troubleshooting](#troubleshooting)
* [Changelog](#changelog)
* [Citation](#citation)
* [Contact](#contact)
---
## Highlights
* **Zero-shot cross-embodiment**: Demonstrated on bimanual **UR5e** and **Franka Research 3** setups; designed to generalize further given correct hardware calibration.
* **UMI scale**: Trained on **10k+ hours** from **100+ indoor scenes** of human manipulation with the UMI gripper.
* **Residual VQ action tokenizer**: Compact, stable action codes; open-vocabulary instruction following via Qwen2.5-VL-7B backbone.
---
## Model details
### Architecture
* **Backbone**: Qwen2.5-VL-7B-Instruct (vision-language).
* **Observation**: Two wrist-camera RGB images (right/left) at 384×384; training images have JPEG-like compression statistics.
* **Instruction**: Short imperative text, recommended format **“Verb + Object.”** (e.g., “Pick up the apple.”).
### Action representation (UMI bimanual, per 24-step chunk)
* 20-D per step = right (10) + left (10):
* pos (x,y,z): 3
* rot (6D rotation): 6 (see the decoding sketch after this list)
* gripper width: 1
* Output tensor shape: **(T=24, D=20)**, relative deltas, `float32`.
* The RVQ tokenizer yields a fixed-length token sequence; see tokenizer card for exact code lengths.
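As a concrete illustration of this layout, the sketch below splits one 20-D step and recovers a rotation matrix from its 6D rotation via Gram-Schmidt, following the standard continuity representation (Zhou et al., 2019). This is an illustrative assumption; verify the exact row/column convention against the official repo.

```python
import numpy as np

def rot6d_to_matrix(r6: np.ndarray) -> np.ndarray:
    """Recover a 3x3 rotation matrix from a 6D rotation (two 3-vectors)
    by Gram-Schmidt orthonormalization."""
    a1, a2 = r6[:3], r6[3:6]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)  # b1, b2, b3 as matrix columns

# One 20-D step with zero translation and identity rotations (placeholder values)
step = np.zeros(20, dtype=np.float32)
step[3:9] = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]    # right arm: identity rotation in 6D
step[13:19] = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]  # left arm: identity rotation in 6D

right_pos, right_rot6d, right_grip = step[0:3], step[3:9], step[9]
left_pos, left_rot6d, left_grip = step[10:13], step[13:19], step[19]
assert np.allclose(rot6d_to_matrix(right_rot6d), np.eye(3))
```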
### Tokenizer
* **Tokenizer repo**: [`robotics-diffusion-transformer/RVQActionTokenizer`](https://huggingface.co/robotics-diffusion-transformer/RVQActionTokenizer)
* Use **float32** for the VQ model.
* Provide a **[LinearNormalizer](http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt)** for action scaling (UMI convention).
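For intuition, a linear normalizer is a per-dimension affine map. The sketch below is purely conceptual: `TinyLinearNormalizer`, `normalize`, and `unnormalize` are hypothetical names, and the repo's `models.normalizer.LinearNormalizer` (loaded via `from_pretrained`, as in the quickstart) is the authoritative implementation.

```python
import torch

# Hypothetical sketch: a linear normalizer applies a per-dimension affine map
# normalized = x * scale + offset, with unnormalize as the exact inverse.
class TinyLinearNormalizer:
    def __init__(self, scale: torch.Tensor, offset: torch.Tensor):
        self.scale, self.offset = scale, offset  # each of shape (D,)

    def normalize(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.scale + self.offset

    def unnormalize(self, x: torch.Tensor) -> torch.Tensor:
        return (x - self.offset) / self.scale

# D=20 action dims, matching the UMI action layout above
norm = TinyLinearNormalizer(scale=torch.full((20,), 2.0), offset=torch.zeros(20))
chunk = torch.randn(24, 20)  # a fake (T=24, D=20) action chunk
assert torch.allclose(norm.unnormalize(norm.normalize(chunk)), chunk, atol=1e-6)
```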
---
## Hardware & software requirements
Approximate **single-GPU** requirements (Qwen2.5-VL-7B-Instruct scale):
| Mode | RAM | VRAM | Example GPU |
| --------- | ------: | ------: | ----------------------- |
| Inference | ≥ 32 GB | ≥ 16 GB | RTX 4090 |
| LoRA FT | – | ≥ 32 GB | A100 40GB |
| Full FT | – | ≥ 80 GB | A100 80GB / H100 / B200 |
> For **deployment on real robots**, follow your platform’s **end-effector + camera** choices and perform **hardware setup & calibration** (camera stand/pose, flange, etc.) before running closed-loop policies.
**Tested OS**: Ubuntu 24.04.
---
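## Precision settings
* Load the **Qwen2.5-VL-7B** backbone in **bfloat16** (with `attn_implementation="flash_attention_2"` where supported), as in the quickstart below.
* Keep the **RVQ action tokenizer** in **float32**; the quickstart casts the VQ model accordingly.
* Predicted action chunks are returned as `float32` tensors of shape **(24, 20)**.
---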
## Quickstart (inference)
```python
# Run under repository: https://github.com/thu-ml/RDT2
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from vqvae import MultiVQVAE
from models.normalizer import LinearNormalizer
from utils import batch_predict_action
# assuming GPU 0 is used
device = "cuda:0"
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "robotics-diffusion-transformer/RDT2-VQ",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map=device,
).eval()
vae = MultiVQVAE.from_pretrained("robotics-diffusion-transformer/RVQActionTokenizer").eval()
vae = vae.to(device=device, dtype=torch.float32)
valid_action_id_length = (
    vae.pos_id_len + vae.rot_id_len + vae.grip_id_len
)
# TODO: modify to your own downloaded normalizer path,
# downloaded from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
normalizer = LinearNormalizer.from_pretrained("umi_normalizer_wo_downsample_indentity_rot.pt")
result = batch_predict_action(
    model,
    processor,
    vae,
    normalizer,
    examples=[
        {
            "obs": {
                # NOTE: following the UMI setting, camera0_rgb is the right arm, camera1_rgb the left arm
                # (see the capture sketch after this block for preparing these arrays)
                "camera0_rgb": ...,  # RGB image as np.ndarray of shape (1, 384, 384, 3), dtype=np.uint8
                "camera1_rgb": ...,  # RGB image as np.ndarray of shape (1, 384, 384, 3), dtype=np.uint8
            },
            "meta": {
                "num_camera": 2
            },
        },
        ...,  # batch inference is supported, so you can pass a list of examples
    ],
    valid_action_id_length=valid_action_id_length,
    # Since the model is trained mostly on JPEG images, we suggest enabling this for better performance.
    apply_jpeg_compression=True,
    # We suggest instructions in the format "Verb + Object." with a capitalized first letter and a trailing period.
    instruction="Pick up the apple.",
)
# get the predicted action chunk for example 0
action_chunk = result["action_pred"][0]  # torch.FloatTensor of shape (24, 20), dtype=torch.float32
# action_chunk is (T, D) with T=24, D=20
# T=24: the chunk covers the next 0.8 s at 30 fps, i.e. 24 frames
# D=20: following the UMI setting, both arms are predicted, right arm first:
# - [0-2]:   RIGHT ARM end-effector position x, y, z (unit: m)
# - [3-8]:   RIGHT ARM end-effector rotation in the 6D rotation representation
# - [9]:     RIGHT ARM gripper width (unit: m)
# - [10-12]: LEFT ARM end-effector position x, y, z (unit: m)
# - [13-18]: LEFT ARM end-effector rotation in the 6D rotation representation
# - [19]:    LEFT ARM gripper width (unit: m)

# rescale gripper width from [0, 0.088] to [0, 0.1]
for robot_idx in range(2):
    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
```
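To feed real frames into the `obs` dict above, something like the following works. This is a minimal sketch assuming OpenCV capture; the `cv2.VideoCapture` device indices are placeholders, and any source of 384×384 RGB `uint8` frames is fine.

```python
import cv2
import numpy as np

def prepare_rgb(frame_bgr: np.ndarray) -> np.ndarray:
    """Resize a BGR camera frame to the (1, 384, 384, 3) RGB uint8 layout expected above."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    rgb = cv2.resize(rgb, (384, 384), interpolation=cv2.INTER_AREA)
    return rgb[None]  # add the leading axis -> (1, 384, 384, 3)

# Hypothetical capture from two wrist cameras (device indices are placeholders)
right_cam, left_cam = cv2.VideoCapture(0), cv2.VideoCapture(1)
ok_r, frame_r = right_cam.read()
ok_l, frame_l = left_cam.read()
assert ok_r and ok_l, "camera read failed"
obs = {
    "camera0_rgb": prepare_rgb(frame_r),  # right arm (UMI convention)
    "camera1_rgb": prepare_rgb(frame_l),  # left arm
}
```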
> For **installation and fine-tuning instructions**, please refer to the official [GitHub repository](https://github.com/thu-ml/RDT2).
---
## Intended uses & limitations
**Intended uses**
* Research in **robot manipulation** and **VLA modeling**.
* Zero-shot or few-shot deployment on bimanual systems following the repo’s **[hardware calibration](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file#1-important-hard-ware-set-up-and-calibration)** steps.
**Limitations**
* Open-world robustness depends on **calibration quality**, camera placement, and gripper specifics.
* Requires correct **normalization** and **RVQ code compatibility**.
* Safety-critical deployment requires **supervision**, interlocks, and conservative velocity/force limits.
**Safety & responsible use**
* Always test in simulation or with **hardware limits** engaged (reduced speed, gravity compensation, E-stop within reach).
---
## Troubleshooting
| Symptom | Likely cause | Suggested fix |
| ---------------------------------- | -------------- | ------------------------------------------------------------------- |
| Drifting / unstable gripper widths | Scale mismatch | Apply the **LinearNormalizer**; rescale widths ([0, 0.088] → [0, 0.1]). |
| Poor instruction following | Prompt format | Use “**Verb + Object.**” with capitalization + period. |
| No improvement after FT | OOD actions | Check RVQ bounds & reconstruction error; verify normalization. |
| Vision brittleness | JPEG gap | Enable `--image_corruption`; ensure 384×384 inputs. |
---
## Changelog
* **2025-09**: Initial release of **RDT2-VQ** on Hugging Face.
---
## Citation
```bibtex
@software{rdt2,
title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
author={RDT Team},
url={https://github.com/thu-ml/RDT2},
month={September},
year={2025}
}
```
---
## Contact
* Project page: [https://rdt-robotics.github.io/rdt2/](https://rdt-robotics.github.io/rdt2/)
* Organization: [https://huggingface.co/robotics-diffusion-transformer](https://huggingface.co/robotics-diffusion-transformer)
* Discord: [https://discord.gg/vsZS3zmf9A](https://discord.gg/vsZS3zmf9A)