Add arXiv link to metadata and improve model card
Hi! I'm Niels from the Hugging Face community team.
I've updated the model card to include the `arxiv` metadata tag, linking this repository to its research paper: [RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization](https://huggingface.co/papers/2602.03310).
Linking artifacts to their respective papers helps researchers and users find and cite your work more easily. I also made a small fix to the Quickstart code snippet to ensure it runs correctly and added links to the project page and paper in the header.
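The Quickstart fix is a single missing comma after the model ID, which made the original snippet a syntax error rather than a runtime issue. A quick stdlib-only sketch (quoting the call from the diff; the surrounding names like `device` are only parsed, not executed) confirms the before/after behavior:

```python
import ast

# The Quickstart call as it appeared before the fix: the missing comma after
# the model ID string makes the call expression invalid Python.
before = '''model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "robotics-diffusion-transformer/RDT2-VQ"
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map=device
)'''

# Same call with the comma restored, as in the updated README.
after = before.replace('"robotics-diffusion-transformer/RDT2-VQ"',
                       '"robotics-diffusion-transformer/RDT2-VQ",')

def parses(src):
    """Return True if src is syntactically valid Python."""
    try:
        ast.parse(src)
        return True
    except SyntaxError:
        return False

print(parses(before))  # False
print(parses(after))   # True
```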
**README.md** (changed)

````diff
@@ -1,11 +1,12 @@
 ---
-license: apache-2.0
-language:
-- en
 base_model:
 - Qwen/Qwen2.5-VL-7B-Instruct
-
+language:
+- en
 library_name: transformers
+license: apache-2.0
+pipeline_tag: robotics
+arxiv: 2602.03310
 tags:
 - RDT
 - rdt
@@ -19,11 +20,11 @@ tags:
 
 # RDT2-VQ: Vision-Language-Action with Residual VQ Action Tokens
 
-**RDT2-VQ** is an autoregressive Vision-Language-Action (VLA) model adapted from **[Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)** and trained on large-scale **UMI** bimanual manipulation data.
-
-Actions are discretized with a lightweight **Residual VQ (RVQ)** tokenizer, enabling robust zero-shot transfer across **unseen embodiments** for simple, open-vocabulary skills (e.g., pick, place, shake, wipe).
+**RDT2-VQ** is an autoregressive Vision-Language-Action (VLA) model adapted from **[Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)** and trained on large-scale **UMI** bimanual manipulation data.
+
+It predicts a short-horizon **relative action chunk** (24 steps, 20 dims/step) from binocular wrist-camera RGB and a natural-language instruction. Actions are discretized with a lightweight **Residual VQ (RVQ)** tokenizer, enabling robust zero-shot transfer across **unseen embodiments** for simple, open-vocabulary skills (e.g., pick, place, shake, wipe).
 
-[**Home**](https://rdt-robotics.github.io/rdt2/) - [**Github**](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file) - [**Discord**](https://discord.gg/vsZS3zmf9A)
+[**Paper**](https://huggingface.co/papers/2602.03310) - [**Home**](https://rdt-robotics.github.io/rdt2/) - [**Github**](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file) - [**Discord**](https://discord.gg/vsZS3zmf9A)
 
 ---
 
@@ -110,7 +111,7 @@ device = "cuda:0"
 
 processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
 model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
-    "robotics-diffusion-transformer/RDT2-VQ"
+    "robotics-diffusion-transformer/RDT2-VQ",
     torch_dtype=torch.bfloat16,
     attn_implementation="flash_attention_2",
     device_map=device
@@ -142,7 +143,7 @@ result = batch_predict_action(
             }
         },
         ..., # we support batch inference, so you can pass a list of examples
-    ]
+    ],
     valid_action_id_length=valid_action_id_length,
     apply_jpeg_compression=True,
     # Since model is trained with mostly jpeg images, we suggest toggle this on for better formance
@@ -211,7 +212,14 @@ for robot_idx in range(2):
 ## Citation
 
 ```bibtex
-@software{rdt2,
+@article{rdt2,
+  title={RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization},
+  author={Ji, Xuan and others},
+  journal={arXiv preprint arXiv:2602.03310},
+  year={2026}
+}
+
+@software{rdt2_code,
   title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
   author={RDT Team},
   url={https://github.com/thu-ml/RDT2},
@@ -226,4 +234,4 @@ for robot_idx in range(2):
 
 * Project page: [https://rdt-robotics.github.io/rdt2/](https://rdt-robotics.github.io/rdt2/)
 * Organization: [https://huggingface.co/robotics-diffusion-transformer](https://huggingface.co/robotics-diffusion-transformer)
-* Discord: [https://discord.gg/vsZS3zmf9A](https://discord.gg/vsZS3zmf9A)
+* Discord: [https://discord.gg/vsZS3zmf9A](https://discord.gg/vsZS3zmf9A)
````