robotics-diffusion-transformer committed
Commit 14ab27b · verified · Parent(s): c1cbf46

Update README.md

Files changed (1): README.md (+212, −1)

README.md CHANGED

  - Bimanual
  - Manipulation
  - Zero-shot
  - UMI
---

# RDT2-VQ: Vision-Language-Action with Residual VQ Action Tokens

**RDT2-VQ** is an autoregressive Vision-Language-Action (VLA) model adapted from **[Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)** and trained on large-scale **UMI** bimanual manipulation data.
It predicts a short-horizon **relative action chunk** (24 steps, 20 dims/step) from binocular wrist-camera RGB and a natural-language instruction.
Actions are discretized with a lightweight **Residual VQ (RVQ)** tokenizer, enabling robust zero-shot transfer across **unseen embodiments** for simple, open-vocabulary skills (e.g., pick, place, shake, wipe).

> Home: **rdt-robotics.github.io/rdt2** • Discord: **discord.gg/vsZS3zmf9A** • License: **Apache-2.0**

---

## Table of contents

* [Highlights](#highlights)
* [Model details](#model-details)
* [Hardware & software requirements](#hardware--software-requirements)
* [Quickstart (inference)](#quickstart-inference)
* [Intended uses & limitations](#intended-uses--limitations)
* [Troubleshooting](#troubleshooting)
* [Changelog](#changelog)
* [Citation](#citation)
* [Contact](#contact)

---

## Highlights

* **Zero-shot cross-embodiment**: Demonstrated on bimanual **UR5e** and **Franka Research 3** setups; designed to generalize further with correct hardware calibration.
* **UMI scale**: Trained on **10k+ hours** of human manipulation with the UMI gripper across **100+ indoor scenes**.
* **Residual VQ action tokenizer**: Compact, stable action codes; open-vocabulary instruction following via the Qwen2.5-VL-7B backbone.

---

## Model details

### Architecture

* **Backbone**: Qwen2.5-VL-7B-Instruct (vision-language).
* **Observation**: Two wrist-camera RGB images (right/left), 384×384, JPEG-like statistics (see the preprocessing sketch below).
* **Instruction**: Short imperative text, recommended format **“Verb + Object.”** (e.g., “Pick up the apple.”).

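To make the observation format concrete, here is a minimal preprocessing sketch. The `make_obs` helper and the use of Pillow for resizing are illustrative choices, not part of the RDT2 API; it simply produces the `(1, 384, 384, 3)` uint8 arrays that the Quickstart below expects.

```python
import numpy as np
from PIL import Image


def make_obs(right_rgb: np.ndarray, left_rgb: np.ndarray) -> dict:
    """Resize raw (H, W, 3) uint8 wrist-camera frames to the 384x384 layout used by RDT2-VQ.

    Following the UMI convention, camera0_rgb is the RIGHT arm and camera1_rgb is the LEFT arm.
    """

    def prep(img: np.ndarray) -> np.ndarray:
        resized = np.asarray(Image.fromarray(img).resize((384, 384)))
        return resized[None].astype(np.uint8)  # add batch dim -> (1, 384, 384, 3)

    return {
        "obs": {
            "camera0_rgb": prep(right_rgb),  # right wrist camera
            "camera1_rgb": prep(left_rgb),   # left wrist camera
        },
        "meta": {"num_camera": 2},
    }
```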

### Action representation (UMI bimanual, per 24-step chunk)

* 20-D per step = right arm (10) + left arm (10):
  * pos (x, y, z): 3
  * rot (6D rotation representation): 6
  * gripper width: 1
* Output tensor shape: **(T=24, D=20)**, relative deltas, `float32`.
* The RVQ tokenizer yields a fixed-length token sequence; see the tokenizer card for exact code lengths. A sketch for unpacking a decoded chunk follows this list.

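The layout above can be unpacked directly from a predicted chunk. Below is an illustrative sketch; the helper names are mine, and whether the 6D rotation encodes the first two rows or the first two columns of the rotation matrix must be checked against the repo (the sketch assumes the row convention).

```python
import numpy as np


def split_arm(chunk: np.ndarray, arm: int) -> dict:
    """Slice one arm's 10-D block out of a (24, 20) action chunk (arm 0 = right, 1 = left)."""
    block = chunk[:, arm * 10:(arm + 1) * 10]
    return {
        "pos": block[:, 0:3],    # (24, 3) relative x, y, z in meters
        "rot6d": block[:, 3:9],  # (24, 6) 6D rotation representation
        "gripper": block[:, 9],  # (24,)  gripper width in meters
    }


def rot6d_to_matrix(r6: np.ndarray) -> np.ndarray:
    """Gram-Schmidt the two 3-vectors of a 6D rotation into a 3x3 rotation matrix (row convention assumed)."""
    a1, a2 = r6[:3], r6[3:6]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=0)
```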

### Tokenizer

* **Tokenizer repo**: [`robotics-diffusion-transformer/RVQActionTokenizer`](https://huggingface.co/robotics-diffusion-transformer/RVQActionTokenizer)
* Use **float32** for the VQ model.
* Provide a **[LinearNormalizer](http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt)** for action scaling (UMI convention).

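A minimal setup sketch, mirroring the Quickstart below; fetching the normalizer with `urllib` is an illustrative convenience, not part of the repo's tooling.

```python
import urllib.request

import torch
from vqvae import MultiVQVAE                    # from the RDT2 repo
from models.normalizer import LinearNormalizer  # from the RDT2 repo

# Download the UMI LinearNormalizer checkpoint once (any download method works).
NORMALIZER_URL = "http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt"
urllib.request.urlretrieve(NORMALIZER_URL, "umi_normalizer_wo_downsample_indentity_rot.pt")

# Load the RVQ action tokenizer in float32, as recommended above.
vae = MultiVQVAE.from_pretrained("robotics-diffusion-transformer/RVQActionTokenizer").eval()
vae = vae.to(device="cuda:0", dtype=torch.float32)

normalizer = LinearNormalizer.from_pretrained("umi_normalizer_wo_downsample_indentity_rot.pt")
```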

---

## Hardware & software requirements

Approximate **single-GPU** requirements (Qwen2.5-VL-7B-Instruct scale):

| Mode      |     RAM |    VRAM | Example GPU             |
| --------- | ------: | ------: | ----------------------- |
| Inference | ≥ 32 GB | ≥ 16 GB | RTX 4090                |
| LoRA FT   |       – | ≥ 32 GB | A100 40GB               |
| Full FT   |       – | ≥ 80 GB | A100 80GB / H100 / B200 |

> For **deployment on real robots**, follow your platform’s **end-effector + camera** choices and perform **hardware setup & calibration** (camera stand/pose, flange, etc.) before running closed-loop policies.

**Tested OS**: Ubuntu 24.04.

---

## Quickstart (inference)

```python
# Run inside the repository: https://github.com/thu-ml/RDT2

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

from vqvae import MultiVQVAE
from models.normalizer import LinearNormalizer
from utils import batch_predict_action

# Assuming GPU 0 is used.
device = "cuda:0"

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "robotics-diffusion-transformer/RDT2-VQ",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map=device,
).eval()
vae = MultiVQVAE.from_pretrained("robotics-diffusion-transformer/RVQActionTokenizer").eval()
vae = vae.to(device=device, dtype=torch.float32)

valid_action_id_length = (
    vae.pos_id_len + vae.rot_id_len + vae.grip_id_len
)

# TODO: change to your own downloaded normalizer path.
# Download from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
normalizer = LinearNormalizer.from_pretrained("umi_normalizer_wo_downsample_indentity_rot.pt")

result = batch_predict_action(
    model,
    processor,
    vae,
    normalizer,
    examples=[
        {
            "obs": {
                # NOTE: following the UMI setting, camera0_rgb is the right arm, camera1_rgb the left arm.
                "camera0_rgb": ...,  # RGB image as np.ndarray of shape (1, 384, 384, 3), dtype=np.uint8
                "camera1_rgb": ...,  # RGB image as np.ndarray of shape (1, 384, 384, 3), dtype=np.uint8
            },
            "meta": {
                "num_camera": 2
            },
        },
        ...,  # batch inference is supported, so you can pass a list of examples
    ],
    valid_action_id_length=valid_action_id_length,
    apply_jpeg_compression=True,
    # The model is trained mostly on JPEG images, so we suggest enabling this for better performance.
    instruction="Pick up the apple.",
    # We suggest instructions in the format "Verb + Object." with a capitalized first letter and trailing period.
)

# Get the predicted action chunk for example 0.
action_chunk = result["action_pred"][0]  # torch.FloatTensor of shape (24, 20), dtype=torch.float32
# action_chunk has shape (T, D) with T=24, D=20.
# T=24: the chunk predicts the next 0.8 s at 30 fps, i.e. 24 frames.
# D=20: following the UMI setting, actions for both arms are predicted, right arm first:
#   - [0-2]:   RIGHT ARM end-effector position x, y, z (unit: m)
#   - [3-8]:   RIGHT ARM end-effector rotation in 6D rotation representation
#   - [9]:     RIGHT ARM gripper width (unit: m)
#   - [10-12]: LEFT ARM end-effector position x, y, z (unit: m)
#   - [13-18]: LEFT ARM end-effector rotation in 6D rotation representation
#   - [19]:    LEFT ARM gripper width (unit: m)

# Rescale gripper width from [0, 0.088] to [0, 0.1].
for robot_idx in range(2):
    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
```
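
Once decoded, the chunk can be streamed to the robot at the training frame rate (30 fps, so one step every 1/30 s). The sketch below is purely illustrative: `robot.move_to_relative` is a hypothetical interface, and how relative deltas map onto your controller (and when to re-plan with a new chunk) depends on your deployment stack.

```python
import time

CONTROL_HZ = 30  # the chunk covers 24 steps = 0.8 s at 30 fps


def execute_chunk(robot, action_chunk) -> None:
    """Stream a (24, 20) relative action chunk to a hypothetical bimanual robot interface."""
    for step in action_chunk:  # step: tensor of shape (20,)
        right, left = step[:10], step[10:]
        robot.move_to_relative(right_arm=right, left_arm=left)  # hypothetical call
        time.sleep(1.0 / CONTROL_HZ)
```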

> For **installation and fine-tuning instructions**, please refer to the official [GitHub repository](https://github.com/thu-ml/RDT2).

---

## Intended uses & limitations

**Intended uses**

* Research in **robot manipulation** and **VLA modeling**.
* Zero-shot or few-shot deployment on bimanual systems following the repo’s **[hardware calibration](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file#1-important-hard-ware-set-up-and-calibration)** steps.

**Limitations**

* Open-world robustness depends on **calibration quality**, camera placement, and gripper specifics.
* Requires correct **normalization** and **RVQ code compatibility**.
* Safety-critical deployment requires **supervision**, interlocks, and conservative velocity/force limits.

**Safety & responsible use**

* Always test in simulation or with **hardware limits** engaged (reduced speed, gravity compensation, E-stop within reach).

---

## Troubleshooting

| Symptom                            | Likely cause   | Suggested fix                                                            |
| ---------------------------------- | -------------- | ------------------------------------------------------------------------ |
| Drifting / unstable gripper widths | Scale mismatch | Apply the **LinearNormalizer**; rescale widths ([0, 0.088] → [0, 0.1]). |
| Poor instruction following         | Prompt format  | Use “**Verb + Object.**” with capitalization + period.                  |
| No improvement after FT            | OOD actions    | Check RVQ bounds & reconstruction error; verify normalization.          |
| Vision brittleness                 | JPEG gap       | Enable `--image_corruption`; ensure 384×384 inputs.                     |

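Because the model was trained mostly on JPEG-compressed frames, matching that statistic at inference time helps; this is what `apply_jpeg_compression=True` does in the Quickstart. If you need to apply it yourself, here is a standalone sketch using Pillow (the quality value is an assumption, tune it to your capture pipeline):

```python
import io

import numpy as np
from PIL import Image


def jpeg_roundtrip(img: np.ndarray, quality: int = 85) -> np.ndarray:
    """Re-encode an (H, W, 3) uint8 RGB frame as JPEG to mimic training-time image statistics."""
    buf = io.BytesIO()
    Image.fromarray(img).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf).convert("RGB"))
```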

---

## Changelog

* **2025-09**: Initial release of **RDT2-VQ** on Hugging Face.

---

## Citation

```bibtex
@misc{rdt2_2025,
  title  = {RDT 2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
  author = {RDT Robotics Team},
  year   = {2025},
  url    = {https://rdt-robotics.github.io/rdt2/}
}
```

---

## Contact

* Project page: [https://rdt-robotics.github.io/rdt2/](https://rdt-robotics.github.io/rdt2/)
* Organization: [https://huggingface.co/robotics-diffusion-transformer](https://huggingface.co/robotics-diffusion-transformer)
* Discord: [https://discord.gg/vsZS3zmf9A](https://discord.gg/vsZS3zmf9A)