---
base_model:
- robotics-diffusion-transformer/rdt-1b
language:
- en
license: apache-2.0
pipeline_tag: robotics
arxiv: 2602.03310
tags:
- RDT
- rdt
- RDT 2
- Vision-Language-Action
- Bimanual
- Manipulation
- Zero-shot
- UMI
- Flowmatching
- Diffusion
- Action Expert
---

# RDT2-FM: Flow-Matching Action Expert for RDT 2

RDT2-FM builds on a vision-language backbone (RDT2-VQ) and predicts short-horizon relative action chunks through an action expert that integrates an improved RDT architecture with a flow-matching objective. By leveraging flow matching, RDT2-FM achieves lower inference latency while maintaining strong instruction following and cross-embodiment generalization on UMI-style bimanual setups. This repository provides the action expert component of RDT2-FM.

[**Paper**](https://huggingface.co/papers/2602.03310) - [**Home**](https://rdt-robotics.github.io/rdt2/) - [**Github**](https://github.com/thu-ml/RDT2) - [**Discord**](https://discord.gg/vsZS3zmf9A)

---

## Highlights

* **Low-latency control**: Flow-matching policy head (no iterative denoising) for fast closed-loop actions.
* **Zero-shot cross-embodiment**: Designed to work with any bimanual platform (e.g., **UR5e**, **Franka FR3**) after proper calibration.
* **Scales with RDT2-VQ**: Pairs with the VLM backbone (**[RDT2-VQ](https://huggingface.co/robotics-diffusion-transformer/RDT2-VQ)**) trained on **10k+ hours** and **100+ scenes** of UMI manipulation data.

---

## Quickstart (inference)

This model requires the [RDT2 repository](https://github.com/thu-ml/RDT2) for inference.

```python
import yaml
import torch
import numpy as np
from models.rdt_inferencer import RDTInferencer

# Load configuration from the official repo
with open("configs/rdt/post_train.yaml", "r") as f:
    model_config = yaml.safe_load(f)

# Initialize the inferencer
model = RDTInferencer(
    config=model_config,
    pretrained_path="robotics-diffusion-transformer/RDT2-FM",
    # Download the normalizer from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
    normalizer_path="umi_normalizer_wo_downsample_indentity_rot.pt",
    pretrained_vision_language_model_name_or_path="robotics-diffusion-transformer/RDT2-VQ",
    device="cuda:0",
    dtype=torch.bfloat16,
)

# Inference step
result = model.step(
    observations={
        'images': {
            'left_stereo': np.zeros((384, 384, 3), dtype=np.uint8),   # Placeholder: left-arm RGB
            'right_stereo': np.zeros((384, 384, 3), dtype=np.uint8),  # Placeholder: right-arm RGB
        },
        'state': np.zeros(model_config["common"]["state_dim"], dtype=np.float32),
    },
    instruction="Pick up the apple."  # Recommended format: "Verb + Object."
)

# action_chunk shape: (24, 20), dtype=np.float32
action_chunk = result.detach().cpu().numpy()

# Rescale gripper width from [0, 0.088] to [0, 0.1] for the hardware
for robot_idx in range(2):
    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
```
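The gripper rescale above indexes each arm's 10-D sub-vector directly (right arm in dims 0-9, left arm in dims 10-19, gripper at offset 9). If you need the individual components of a step, here is a minimal NumPy sketch of that layout; the helper name `split_action_step` is illustrative, not part of the RDT2 API:

```python
import numpy as np

def split_action_step(step):
    """Split one 20-D action step into per-arm components.

    Layout (see the Action Representation section below): for each arm,
    3 position deltas, a 6-D rotation, and 1 gripper width; the right
    arm occupies dims 0-9 and the left arm dims 10-19.
    """
    arms = {}
    for name, offset in (("right", 0), ("left", 10)):
        arms[name] = {
            "pos": step[offset:offset + 3],        # (x, y, z) deltas
            "rot6d": step[offset + 3:offset + 9],  # 6-D rotation
            "gripper": step[offset + 9],           # gripper width
        }
    return arms

# Example on a placeholder chunk of shape (24, 20)
chunk = np.zeros((24, 20), dtype=np.float32)
first = split_action_step(chunk[0])  # first["right"]["pos"] has shape (3,)
```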

---

## Model Details

### Architecture

* **Backbone**: Vision-language backbone such as **RDT2-VQ** (Qwen2.5-VL-7B based).
* **Action head**: **Flow-Matching (FM)** expert mapping observations + instruction → continuous actions.
* **Observation**: Two wrist-camera RGB images (right/left), 384×384.
* **Instruction**: Short imperative text.

### Action Representation (UMI bimanual, per 24-step chunk)

* 20-D per step = right (10) + left (10):
  * pos (x, y, z): 3
  * rot (6D rotation): 6
  * gripper width: 1
* Output tensor shape: **(T=24, D=20)**, relative deltas.

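The 6-D rotation entries are typically the continuous rotation parameterization of Zhou et al., recovered via Gram-Schmidt. Below is a hedged sketch assuming the common "first two columns of the rotation matrix" convention; verify against the RDT2 repository's conventions before deploying:

```python
import numpy as np

def rot6d_to_matrix(r6):
    """Recover a 3x3 rotation matrix from a 6-D rotation vector via
    Gram-Schmidt (assumes the first two matrix columns are stacked;
    check the RDT2 repo's convention before relying on this)."""
    a1 = np.asarray(r6[:3], dtype=np.float64)
    a2 = np.asarray(r6[3:6], dtype=np.float64)
    b1 = a1 / np.linalg.norm(a1)                 # normalize first column
    a2 = a2 - np.dot(b1, a2) * b1                # remove b1 component
    b2 = a2 / np.linalg.norm(a2)                 # normalize second column
    b3 = np.cross(b1, b2)                        # third column by cross product
    return np.stack([b1, b2, b3], axis=1)

R = rot6d_to_matrix([1, 0, 0, 0, 1, 0])  # R is the 3x3 identity matrix
```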
---

## Hardware & Software Requirements

| Mode                      |     RAM |   VRAM | GPU      |
| ------------------------- | ------: | -----: | -------- |
| Inference (FM head + VLM) | ≥ 32 GB | ~16 GB | RTX 4090 |
| Fine-tuning FM head       |       – | ~16 GB | RTX 4090 |

> **Note**: For real-world deployment, please follow the hardware setup and calibration guides in the [GitHub README](https://github.com/thu-ml/RDT2).

---

## Citation

```bibtex
@article{rdt2,
  title={RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization},
  author={RDT Team},
  journal={arXiv preprint arXiv:2602.03310},
  year={2025}
}

@software{rdt2_repo,
  title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
  author={RDT Team},
  url={https://github.com/thu-ml/RDT2},
  month={September},
  year={2025}
}
```