Add arXiv link to metadata and improve model card
Hi! I'm Niels from the Hugging Face community team.
I've updated the model card to include the `arxiv` metadata tag, linking this repository to its research paper: [RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization](https://huggingface.co/papers/2602.03310).
Linking artifacts to their respective papers helps researchers and users find and cite your work more easily. I also made a small fix to the Quickstart code snippet to ensure it runs correctly and added links to the project page and paper in the header.
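The Quickstart fix is a single missing comma after the model ID, which made the original snippet a syntax error rather than a runtime issue. A quick stdlib-only sketch (quoting the call from the diff; the surrounding names like `device` are only parsed, not executed) confirms the before/after behavior:

```python
import ast

# The Quickstart call as it appeared before the fix: the missing comma after
# the model ID string makes the call expression invalid Python.
before = '''model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "robotics-diffusion-transformer/RDT2-VQ"
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map=device
)'''

# Same call with the comma restored, as in the updated README.
after = before.replace('"robotics-diffusion-transformer/RDT2-VQ"',
                       '"robotics-diffusion-transformer/RDT2-VQ",')

def parses(src):
    """Return True if src is syntactically valid Python."""
    try:
        ast.parse(src)
        return True
    except SyntaxError:
        return False

print(parses(before))  # False
print(parses(after))   # True
```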
**README.md** (changed)

````diff
@@ -1,11 +1,12 @@
 ---
-license: apache-2.0
-language:
-- en
 base_model:
 - Qwen/Qwen2.5-VL-7B-Instruct
-
+language:
+- en
 library_name: transformers
+license: apache-2.0
+pipeline_tag: robotics
+arxiv: 2602.03310
 tags:
 - RDT
 - rdt
@@ -19,11 +20,11 @@ tags:
 
 # RDT2-VQ: Vision-Language-Action with Residual VQ Action Tokens
 
-**RDT2-VQ** is an autoregressive Vision-Language-Action (VLA) model adapted from **[Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)** and trained on large-scale **UMI** bimanual manipulation data.
-
-Actions are discretized with a lightweight **Residual VQ (RVQ)** tokenizer, enabling robust zero-shot transfer across **unseen embodiments** for simple, open-vocabulary skills (e.g., pick, place, shake, wipe).
+**RDT2-VQ** is an autoregressive Vision-Language-Action (VLA) model adapted from **[Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)** and trained on large-scale **UMI** bimanual manipulation data.
+
+It predicts a short-horizon **relative action chunk** (24 steps, 20 dims/step) from binocular wrist-camera RGB and a natural-language instruction. Actions are discretized with a lightweight **Residual VQ (RVQ)** tokenizer, enabling robust zero-shot transfer across **unseen embodiments** for simple, open-vocabulary skills (e.g., pick, place, shake, wipe).
 
-[**Home**](https://rdt-robotics.github.io/rdt2/) - [**Github**](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file) - [**Discord**](https://discord.gg/vsZS3zmf9A)
+[**Paper**](https://huggingface.co/papers/2602.03310) - [**Home**](https://rdt-robotics.github.io/rdt2/) - [**Github**](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file) - [**Discord**](https://discord.gg/vsZS3zmf9A)
 
 ---
 
@@ -110,7 +111,7 @@ device = "cuda:0"
 
 processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
 model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
-    "robotics-diffusion-transformer/RDT2-VQ"
+    "robotics-diffusion-transformer/RDT2-VQ",
     torch_dtype=torch.bfloat16,
     attn_implementation="flash_attention_2",
     device_map=device
@@ -142,7 +143,7 @@ result = batch_predict_action(
             }
         },
         ..., # we support batch inference, so you can pass a list of examples
-    ]
+    ],
     valid_action_id_length=valid_action_id_length,
     apply_jpeg_compression=True,
     # Since model is trained with mostly jpeg images, we suggest toggle this on for better formance
@@ -211,7 +212,14 @@ for robot_idx in range(2):
 ## Citation
 
 ```bibtex
-@software{rdt2,
+@article{rdt2,
+  title={RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization},
+  author={Ji, Xuan and others},
+  journal={arXiv preprint arXiv:2602.03310},
+  year={2026}
+}
+
+@software{rdt2_code,
   title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
   author={RDT Team},
   url={https://github.com/thu-ml/RDT2},
@@ -226,4 +234,4 @@ for robot_idx in range(2):
 
 * Project page: [https://rdt-robotics.github.io/rdt2/](https://rdt-robotics.github.io/rdt2/)
 * Organization: [https://huggingface.co/robotics-diffusion-transformer](https://huggingface.co/robotics-diffusion-transformer)
-* Discord: [https://discord.gg/vsZS3zmf9A](https://discord.gg/vsZS3zmf9A)
+* Discord: [https://discord.gg/vsZS3zmf9A](https://discord.gg/vsZS3zmf9A)
````