Add Arxiv link to metadata and improve model card

#1
opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +20 -12
README.md CHANGED
````diff
@@ -1,11 +1,12 @@
 ---
-license: apache-2.0
-language:
-- en
 base_model:
 - Qwen/Qwen2.5-VL-7B-Instruct
-pipeline_tag: robotics
+language:
+- en
 library_name: transformers
+license: apache-2.0
+pipeline_tag: robotics
+arxiv: 2602.03310
 tags:
 - RDT
 - rdt
@@ -19,11 +20,11 @@ tags:
 
 # RDT2-VQ: Vision-Language-Action with Residual VQ Action Tokens
 
-**RDT2-VQ** is an autoregressive Vision-Language-Action (VLA) model adapted from **[Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)** and trained on large-scale **UMI** bimanual manipulation data.
-It predicts a short-horizon **relative action chunk** (24 steps, 20 dims/step) from binocular wrist-camera RGB and a natural-language instruction.
-Actions are discretized with a lightweight **Residual VQ (RVQ)** tokenizer, enabling robust zero-shot transfer across **unseen embodiments** for simple, open-vocabulary skills (e.g., pick, place, shake, wipe).
+**RDT2-VQ** is an autoregressive Vision-Language-Action (VLA) model adapted from **[Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)** and trained on large-scale **UMI** bimanual manipulation data.
+
+It predicts a short-horizon **relative action chunk** (24 steps, 20 dims/step) from binocular wrist-camera RGB and a natural-language instruction. Actions are discretized with a lightweight **Residual VQ (RVQ)** tokenizer, enabling robust zero-shot transfer across **unseen embodiments** for simple, open-vocabulary skills (e.g., pick, place, shake, wipe).
 
-[**Home**](https://rdt-robotics.github.io/rdt2/) - [**Github**](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file) - [**Discord**](https://discord.gg/vsZS3zmf9A)
+[**Paper**](https://huggingface.co/papers/2602.03310) - [**Home**](https://rdt-robotics.github.io/rdt2/) - [**Github**](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file) - [**Discord**](https://discord.gg/vsZS3zmf9A)
 
 ---
 
@@ -110,7 +111,7 @@ device = "cuda:0"
 
 processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
 model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
-    "robotics-diffusion-transformer/RDT2-VQ"
+    "robotics-diffusion-transformer/RDT2-VQ",
     torch_dtype=torch.bfloat16,
     attn_implementation="flash_attention_2",
     device_map=device
@@ -142,7 +143,7 @@ result = batch_predict_action(
         }
     },
     ...,  # we support batch inference, so you can pass a list of examples
-    ]
+    ],
     valid_action_id_length=valid_action_id_length,
     apply_jpeg_compression=True,
     # Since the model is trained mostly on JPEG images, we suggest toggling this on for better performance
@@ -211,7 +212,14 @@ for robot_idx in range(2):
 ## Citation
 
 ```bibtex
-@software{rdt2,
+@article{rdt2,
+  title={RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization},
+  author={Ji, Xuan and others},
+  journal={arXiv preprint arXiv:2602.03310},
+  year={2026}
+}
+
+@software{rdt2_code,
   title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
   author={RDT Team},
   url={https://github.com/thu-ml/RDT2},
@@ -226,4 +234,4 @@ for robot_idx in range(2):
 
 * Project page: [https://rdt-robotics.github.io/rdt2/](https://rdt-robotics.github.io/rdt2/)
 * Organization: [https://huggingface.co/robotics-diffusion-transformer](https://huggingface.co/robotics-diffusion-transformer)
-* Discord: [https://discord.gg/vsZS3zmf9A](https://discord.gg/vsZS3zmf9A)
+* Discord: [https://discord.gg/vsZS3zmf9A](https://discord.gg/vsZS3zmf9A)
````
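
As context for reviewers, the card describes the model's output as a relative action chunk of 24 steps with 20 dims per step. The sketch below only illustrates that shape bookkeeping; `split_action_chunk` is a hypothetical helper for illustration, not part of the RDT2 API, and the real token decoding lives in the RDT2 repository.

```python
# Illustrative sketch only: RDT2-VQ's decoded output is described as a
# relative action chunk of 24 steps x 20 dims/step (480 values total).
CHUNK_STEPS = 24  # steps per predicted action chunk
ACTION_DIM = 20   # dims per step (both arms combined, per the card)

def split_action_chunk(flat):
    """Reshape a flat decoded action vector into per-step rows."""
    assert len(flat) == CHUNK_STEPS * ACTION_DIM, "expected 24*20=480 values"
    return [flat[i * ACTION_DIM:(i + 1) * ACTION_DIM]
            for i in range(CHUNK_STEPS)]

chunk = split_action_chunk([0.0] * 480)
print(len(chunk), len(chunk[0]))  # 24 20
```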