---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: robotics
library_name: transformers
tags:
- RDT
- rdt
- RDT 2
- Vision-Language-Action
- Bimanual
- Manipulation
- Zero-shot
- UMI
---

# RDT2-VQ: Vision-Language-Action with Residual VQ Action Tokens

**RDT2-VQ** is an autoregressive Vision-Language-Action (VLA) model adapted from **[Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)** and trained on large-scale **UMI** bimanual manipulation data.
It predicts a short-horizon **relative action chunk** (24 steps, 20 dims/step) from two wrist-camera RGB images (right and left arm) and a natural-language instruction.
Actions are discretized with a lightweight **Residual VQ (RVQ)** tokenizer, enabling robust zero-shot transfer across **unseen embodiments** for simple, open-vocabulary skills (e.g., pick, place, shake, wipe).

[**Home**](https://rdt-robotics.github.io/rdt2/) - [**GitHub**](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file) - [**Discord**](https://discord.gg/vsZS3zmf9A)

---

## Table of contents

* [Highlights](#highlights)
* [Model details](#model-details)
* [Hardware & software requirements](#hardware--software-requirements)
* [Quickstart (inference)](#quickstart-inference)
* [Intended uses & limitations](#intended-uses--limitations)
* [Troubleshooting](#troubleshooting)
* [Changelog](#changelog)
* [Citation](#citation)
* [Contact](#contact)

---

## Highlights

* **Zero-shot cross-embodiment**: Demonstrated on bimanual **UR5e** and **Franka Research 3** setups; designed to generalize further with proper hardware calibration.
* **UMI scale**: Trained on **10k+ hours** of human manipulation data collected with the UMI gripper across **100+ indoor scenes**.
* **Residual VQ action tokenizer**: Compact, stable action codes; open-vocabulary instruction following via Qwen2.5-VL-7B backbone.

---

## Model details

### Architecture

* **Backbone**: Qwen2.5-VL-7B-Instruct (vision-language).
* **Observation**: Two wrist-camera RGB images (right/left), 384×384; training images are mostly JPEG-compressed (see `apply_jpeg_compression` in the Quickstart).
* **Instruction**: Short imperative text, recommended format **“Verb + Object.”** (e.g., “Pick up the apple.”).
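
The Quickstart below expects each camera image as a `(1, 384, 384, 3)` `uint8` array. As a minimal, hedged sketch (the helper name is ours, not part of the released code), one way to shape raw wrist-camera frames accordingly:

```python
# Illustrative preprocessing sketch (not from the RDT2 repo): resize an RGB frame
# to 384x384 and add the leading axis expected by the Quickstart example below.
import numpy as np
from PIL import Image

def to_obs_image(frame: np.ndarray, size: int = 384) -> np.ndarray:
    """Resize an (H, W, 3) uint8 RGB frame and return a (1, size, size, 3) array."""
    img = Image.fromarray(frame).resize((size, size))
    return np.asarray(img, dtype=np.uint8)[None]

# Stand-in frames; replace with your right/left wrist-camera captures.
dummy_frame = np.zeros((480, 640, 3), dtype=np.uint8)
obs = {
    "camera0_rgb": to_obs_image(dummy_frame),  # right arm (UMI convention)
    "camera1_rgb": to_obs_image(dummy_frame),  # left arm
}
```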

### Action representation (UMI bimanual, per 24-step chunk)

* 20-D per step = right (10) + left (10):

  * pos (x,y,z): 3
  * rot (6D rotation): 6
  * gripper width: 1
* Output tensor shape: **(T=24, D=20)**, relative deltas, `float32`.
* The RVQ tokenizer yields a fixed-length token sequence; see the tokenizer card for the exact code lengths.
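
For clarity, here is a hedged sketch (helper and field names are ours) of how one 20-D action step decomposes according to the layout above:

```python
# Illustrative only: mirrors the per-step layout described above (right arm first,
# then left arm); this helper is not part of the released code.
import numpy as np

def split_action_step(step: np.ndarray) -> dict:
    """Split a single 20-D action step into named right/left components."""
    assert step.shape == (20,)
    parts = {}
    for arm, offset in (("right", 0), ("left", 10)):
        parts[arm] = {
            "pos_xyz": step[offset : offset + 3],      # meters
            "rot_6d": step[offset + 3 : offset + 9],   # 6D rotation representation
            "gripper_width": step[offset + 9],         # meters
        }
    return parts

# Example on a dummy (T=24, D=20) chunk:
chunk = np.zeros((24, 20), dtype=np.float32)
first_step = split_action_step(chunk[0])
```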

### Tokenizer

* **Tokenizer repo**: [`robotics-diffusion-transformer/RVQActionTokenizer`](https://huggingface.co/robotics-diffusion-transformer/RVQActionTokenizer)
* Use **float32** for the VQ model.
* Provide a **[LinearNormalizer](http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt)** for action scaling (UMI convention).
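
The normalizer is a single `.pt` file. One way (our sketch, not the repo's tooling) to fetch it before running the Quickstart:

```python
# Download the UMI LinearNormalizer checkpoint linked above; the URL and filename
# are taken verbatim from this card. Adjust the destination path as needed.
import urllib.request

URL = "http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt"
urllib.request.urlretrieve(URL, "umi_normalizer_wo_downsample_indentity_rot.pt")
```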

---

## Hardware & software requirements

Approximate **single-GPU** requirements (Qwen2.5-VL-7B-Instruct scale):

| Mode      |     RAM |    VRAM | Example GPU             |
| --------- | ------: | ------: | ----------------------- |
| Inference | ≥ 32 GB | ≥ 16 GB | RTX 4090                |
| LoRA FT   |       – | ≥ 32 GB | A100 40GB               |
| Full FT   |       – | ≥ 80 GB | A100 80GB / H100 / B200 |

> For **deployment on real robots**, follow your platform’s **end-effector + camera** choices and perform **hardware setup & calibration** (camera stand/pose, flange, etc.) before running closed-loop policies.

**Tested OS**: Ubuntu 24.04.

---

## Quickstart (inference)

```python
# Run under repository: https://github.com/thu-ml/RDT2

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

from vqvae import MultiVQVAE
from models.normalizer import LinearNormalizer
from utils import batch_predict_action

# assuming GPU 0 is available
device = "cuda:0"


processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "robotics-diffusion-transformer/RDT2-VQ"
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map=device
).eval()
vae = MultiVQVAE.from_pretrained("robotics-diffusion-transformer/RVQActionTokenizer").eval()
vae = vae.to(device=device, dtype=torch.float32)

valid_action_id_length = (
    vae.pos_id_len + vae.rot_id_len + vae.grip_id_len
)
# TODO: modify to your own downloaded normalizer path
# download from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
normalizer = LinearNormalizer.from_pretrained("umi_normalizer_wo_downsample_indentity_rot.pt")

result = batch_predict_action(
    model,
    processor,
    vae,
    normalizer,
    examples=[
        {
            "obs": {
                # NOTE: following the setting of UMI, camera0_rgb for right arm, camera1_rgb for left arm
                "camera0_rgb": ..., # RGB image in np.ndarray of shape (1, 384, 384, 3) with dtype=np.uint8
                "camera1_rgb": ..., # RGB image in np.ndarray of shape (1, 384, 384, 3) with dtype=np.uint8
            },
            "meta": {
                "num_camera": 2
            }
        },
        ...,    # we support batch inference, so you can pass a list of examples
    ],
    valid_action_id_length=valid_action_id_length,
    apply_jpeg_compression=True,
    # The model was trained mostly on JPEG images, so we suggest keeping this on for better performance.
    instruction="Pick up the apple."
    # We suggest instructions in "Verb + Object." format, with a capitalized first letter and a trailing period.
)

# get the predicted action for example 0
action_chunk = result["action_pred"][0] # torch.FloatTensor of shape (24, 20) with dtype=torch.float32
# action_chunk (T, D) with T=24, D=20
#   T=24: our action_chunk predicts the future 0.8s in fps=30, i.e. 24 frames
#   D=20: following the setting of UMI, we predict the action for both arms from right to left
#   - [0-2]: RIGHT ARM end effector position in x, y, z (unit: m)
#   - [3-8]: RIGHT ARM end effector rotation in 6D rotation representation
#   - [9]: RIGHT ARM gripper width (unit: m)
#   - [10-12]: LEFT ARM end effector position in x, y, z (unit: m)
#   - [13-18]: LEFT ARM end effector rotation in 6D rotation representation
#   - [19]: LEFT ARM gripper width (unit: m)

# rescale gripper width from [0, 0.088] to [0, 0.1]
for robot_idx in range(2):
    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
```
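
The rotation components above use the 6D representation. If you need rotation matrices downstream, a common reconstruction is Gram-Schmidt orthonormalization; the sketch below assumes the widely used column-wise convention (Zhou et al.), so confirm the exact convention against the RDT2/UMI code before deploying:

```python
# Hedged sketch: recover a 3x3 rotation matrix from a 6D rotation vector via
# Gram-Schmidt. The row/column convention should be verified against the repo.
import numpy as np

def rot6d_to_matrix(rot6d: np.ndarray) -> np.ndarray:
    a1, a2 = rot6d[:3], rot6d[3:6]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)  # columns are b1, b2, b3

print(rot6d_to_matrix(np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])))  # ~ identity
# For the Quickstart output: R_right = rot6d_to_matrix(action_chunk[0, 3:9].numpy())
```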

> For **installation and fine-tuning instructions**, please refer to the official [GitHub repository](https://github.com/thu-ml/RDT2).

---


## Intended uses & limitations

**Intended uses**

* Research in **robot manipulation** and **VLA modeling**.
* Zero-shot or few-shot deployment on bimanual systems following the repo’s **[hardware calibration](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file#1-important-hard-ware-set-up-and-calibration)** steps.

**Limitations**

* Open-world robustness depends on **calibration quality**, camera placement, and gripper specifics.
* Requires correct **normalization** and **RVQ code compatibility**.
* Safety-critical deployment requires **supervision**, interlocks, and conservative velocity/force limits.

**Safety & responsible use**

* Always test in simulation or with **hardware limits** engaged (reduced speed, gravity compensation, E-stop within reach).

---

## Troubleshooting

| Symptom                            | Likely cause   | Suggested fix                                                       |
| ---------------------------------- | -------------- | ------------------------------------------------------------------- |
| Drifting / unstable gripper widths | Scale mismatch | Apply **LinearNormalizer**; rescale widths (\[0,0.088] → \[0,0.1]). |
| Poor instruction following         | Prompt format  | Use “**Verb + Object.**” with capitalization + period.              |
| No improvement after FT            | OOD actions    | Check RVQ bounds & reconstruction error; verify normalization.      |
| Vision brittleness                 | JPEG gap       | Enable `--image_corruption`; ensure 384×384 inputs.                 |

---

## Changelog

* **2025-09**: Initial release of **RDT2-VQ** on Hugging Face.

---

## Citation

```bibtex
@software{rdt2,
    title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
    author={RDT Team},
    url={https://github.com/thu-ml/RDT2},
    month={September},
    year={2025}
}
```

---

## Contact

* Project page: [https://rdt-robotics.github.io/rdt2/](https://rdt-robotics.github.io/rdt2/)
* Organization: [https://huggingface.co/robotics-diffusion-transformer](https://huggingface.co/robotics-diffusion-transformer)
* Discord: [https://discord.gg/vsZS3zmf9A](https://discord.gg/vsZS3zmf9A)