nielsr (HF Staff) committed · verified
Commit 49e72ca · Parent(s): 7957a57

Add arXiv link to metadata and improve model card


Hi! I'm Niels from the Hugging Face community team.

I've updated the model card to include the `arxiv` metadata tag, linking this repository to its research paper: [RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization](https://huggingface.co/papers/2602.03310).

Linking artifacts to their respective papers helps researchers and users find and cite your work more easily. I also made a small fix to the Quickstart code snippet to ensure it runs correctly and added links to the project page and paper in the header.
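The metadata change itself is plain YAML front matter at the top of `README.md`. As a sanity check that the new keys landed, the block below extracts the top-level keys with a deliberately minimal hand-rolled parser (a sketch only — a real model card should be validated with a proper YAML library; the front-matter string is copied from the diff in this commit):

```python
# Front matter as it appears in README.md after this commit.
FRONT_MATTER = """\
---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: robotics
arxiv: 2602.03310
tags:
- RDT
- rdt
---
"""

def front_matter_keys(text: str) -> dict:
    """Collect top-level `key: value` pairs between the first two --- fences."""
    inside = text.split("---")[1]  # content between the opening fences
    pairs = {}
    for line in inside.splitlines():
        # Skip list items ("- ...") and indented continuations; keep "key: value".
        if line and not line.startswith(("-", " ")) and ":" in line:
            key, _, value = line.partition(":")
            pairs[key.strip()] = value.strip()
    return pairs

meta = front_matter_keys(FRONT_MATTER)
print(meta["arxiv"], meta["license"], meta["pipeline_tag"])
```

List-valued keys such as `base_model` and `tags` come back with empty values here; this sketch only confirms the keys exist, which is all the check needs.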

Files changed (1):
  1. README.md (+20 −12)
README.md CHANGED

````diff
@@ -1,11 +1,12 @@
 ---
-license: apache-2.0
-language:
-- en
 base_model:
 - Qwen/Qwen2.5-VL-7B-Instruct
-pipeline_tag: robotics
+language:
+- en
 library_name: transformers
+license: apache-2.0
+pipeline_tag: robotics
+arxiv: 2602.03310
 tags:
 - RDT
 - rdt
@@ -19,11 +20,11 @@ tags:
 
 # RDT2-VQ: Vision-Language-Action with Residual VQ Action Tokens
 
-**RDT2-VQ** is an autoregressive Vision-Language-Action (VLA) model adapted from **[Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)** and trained on large-scale **UMI** bimanual manipulation data.
-It predicts a short-horizon **relative action chunk** (24 steps, 20 dims/step) from binocular wrist-camera RGB and a natural-language instruction.
-Actions are discretized with a lightweight **Residual VQ (RVQ)** tokenizer, enabling robust zero-shot transfer across **unseen embodiments** for simple, open-vocabulary skills (e.g., pick, place, shake, wipe).
+**RDT2-VQ** is an autoregressive Vision-Language-Action (VLA) model adapted from **[Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)** and trained on large-scale **UMI** bimanual manipulation data.
+
+It predicts a short-horizon **relative action chunk** (24 steps, 20 dims/step) from binocular wrist-camera RGB and a natural-language instruction. Actions are discretized with a lightweight **Residual VQ (RVQ)** tokenizer, enabling robust zero-shot transfer across **unseen embodiments** for simple, open-vocabulary skills (e.g., pick, place, shake, wipe).
 
-[**Home**](https://rdt-robotics.github.io/rdt2/) - [**Github**](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file) - [**Discord**](https://discord.gg/vsZS3zmf9A)
+[**Paper**](https://huggingface.co/papers/2602.03310) - [**Home**](https://rdt-robotics.github.io/rdt2/) - [**Github**](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file) - [**Discord**](https://discord.gg/vsZS3zmf9A)
 
 ---
 
@@ -110,7 +111,7 @@ device = "cuda:0"
 
 processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
 model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
-    "robotics-diffusion-transformer/RDT2-VQ"
+    "robotics-diffusion-transformer/RDT2-VQ",
     torch_dtype=torch.bfloat16,
     attn_implementation="flash_attention_2",
     device_map=device
@@ -142,7 +143,7 @@ result = batch_predict_action(
         }
     },
     ..., # we support batch inference, so you can pass a list of examples
-    ]
+    ],
     valid_action_id_length=valid_action_id_length,
     apply_jpeg_compression=True,
     # Since the model is trained mostly on JPEG images, we suggest toggling this on for better performance
@@ -211,7 +212,14 @@ for robot_idx in range(2):
 ## Citation
 
 ```bibtex
-@software{rdt2,
+@article{rdt2,
+  title={RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization},
+  author={Ji, Xuan and others},
+  journal={arXiv preprint arXiv:2602.03310},
+  year={2026}
+}
+
+@software{rdt2_code,
   title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
   author={RDT Team},
   url={https://github.com/thu-ml/RDT2},
@@ -226,4 +234,4 @@ for robot_idx in range(2):
 
 * Project page: [https://rdt-robotics.github.io/rdt2/](https://rdt-robotics.github.io/rdt2/)
 * Organization: [https://huggingface.co/robotics-diffusion-transformer](https://huggingface.co/robotics-diffusion-transformer)
-* Discord: [https://discord.gg/vsZS3zmf9A](https://discord.gg/vsZS3zmf9A)
+* Discord: [https://discord.gg/vsZS3zmf9A](https://discord.gg/vsZS3zmf9A)
````
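For context, the Quickstart fix is a single missing comma after the first positional argument to `from_pretrained`. In Python, a positional argument followed directly by a keyword argument is a hard syntax error, so the unfixed snippet could not even be parsed. A minimal sketch demonstrating this (the call `f` stands in for the real `from_pretrained`):

```python
# Without the trailing comma, the snippet fails at parse time:
broken = '''
f(
    "robotics-diffusion-transformer/RDT2-VQ"
    torch_dtype="bfloat16",
)
'''

# With the comma restored, it parses normally:
fixed = '''
f(
    "robotics-diffusion-transformer/RDT2-VQ",
    torch_dtype="bfloat16",
)
'''

def parses(src: str) -> bool:
    """Return True if the source text compiles without a SyntaxError."""
    try:
        compile(src, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False

print(parses(broken), parses(fixed))  # broken fails, fixed parses
```

Note that recent CPython versions even hint at the cause in the error message ("Perhaps you forgot a comma?"), which is exactly the bug this commit repairs.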