---
base_model:
- robotics-diffusion-transformer/rdt-1b
language:
- en
license: apache-2.0
pipeline_tag: robotics
arxiv: 2602.03310
tags:
- RDT
- rdt
- RDT 2
- Vision-Language-Action
- Bimanual
- Manipulation
- Zero-shot
- UMI
- Flowmatching
- Diffusion
- Action Expert
---

# RDT2-FM: Flow-Matching Action Expert for RDT 2

RDT2-FM builds on a vision-language backbone (RDT2-VQ) and predicts short-horizon relative action chunks through an action expert that combines an improved RDT architecture with a flow-matching objective. Flow matching lowers inference latency while maintaining strong instruction following and cross-embodiment generalization on UMI-style bimanual setups.

This repository provides only the action expert component of RDT2-FM.

[**Paper**](https://huggingface.co/papers/2602.03310) - [**Home**](https://rdt-robotics.github.io/rdt2/) - [**GitHub**](https://github.com/thu-ml/RDT2) - [**Discord**](https://discord.gg/vsZS3zmf9A)

---

## Highlights

* **Low-latency control**: Flow-matching policy head (no iterative denoising) for fast closed-loop actions.
* **Zero-shot cross-embodiment**: Designed to work with any bimanual platform (e.g., **UR5e**, **Franka FR3**) after proper calibration.
* **Scales with RDT2-VQ**: Pairs with the VLM backbone (**[RDT2-VQ](https://huggingface.co/robotics-diffusion-transformer/RDT2-VQ)**) trained on **10k+ hours** and **100+ scenes** of UMI manipulation.

---

## Quickstart (inference)

This model requires the [RDT2 repository](https://github.com/thu-ml/RDT2) for inference.

```python
import yaml
import torch
import numpy as np
from models.rdt_inferencer import RDTInferencer

# Load configuration from the official repo
with open("configs/rdt/post_train.yaml", "r") as f:
    model_config = yaml.safe_load(f)

# Initialize the inferencer
model = RDTInferencer(
    config=model_config,
    pretrained_path="robotics-diffusion-transformer/RDT2-FM",
    # Download the normalizer from
    # http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
    normalizer_path="umi_normalizer_wo_downsample_indentity_rot.pt",
    pretrained_vision_language_model_name_or_path="robotics-diffusion-transformer/RDT2-VQ",
    device="cuda:0",
    dtype=torch.bfloat16,
)

# Inference step
result = model.step(
    observations={
        "images": {
            "left_stereo": np.zeros((384, 384, 3), dtype=np.uint8),   # Placeholder: left arm RGB
            "right_stereo": np.zeros((384, 384, 3), dtype=np.uint8),  # Placeholder: right arm RGB
        },
        "state": np.zeros(model_config["common"]["state_dim"], dtype=np.float32),
    },
    instruction="Pick up the apple.",  # Recommended format: "Verb + Object."
)

# action_chunk shape: (24, 20) with dtype=np.float32.
# The .float() cast is a no-op for float32 tensors and guards against a bfloat16
# result, which NumPy cannot represent.
action_chunk = result.detach().float().cpu().numpy()

# Rescale gripper width from [0, 0.088] to [0, 0.1] for hardware.
# The gripper width is the last (index 9) entry of each arm's 10-D block.
for robot_idx in range(2):
    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
```

---

## Model Details

### Architecture

* **Backbone**: Vision-language backbone such as **RDT2-VQ** (Qwen2.5-VL-7B based).
* **Action head**: **Flow-Matching (FM)** expert mapping observations + instruction → continuous actions.
* **Observation**: Two wrist-camera RGB images (right/left), 384×384.
* **Instruction**: Short imperative text.

### Action Representation (UMI bimanual, per 24-step chunk)

* 20-D per step = right (10) + left (10):
  * pos (x,y,z): 3
  * rot (6D rotation): 6
  * gripper width: 1
* Output tensor shape: **(T=24, D=20)**, relative deltas.

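
The layout above can be unpacked with a few lines of NumPy. The following is a minimal sketch rather than the repository's own code: `split_action_chunk` and `rot6d_to_matrix` are hypothetical helper names, the position and rotation offsets inside each 10-D block are assumed to follow the order listed above (only the gripper index 9 is confirmed by the Quickstart), and the 6D rotation is assumed to hold two 3-D vectors that become the first two columns of the rotation matrix after Gram-Schmidt orthonormalization. Check these assumptions against the RDT2 repository before using the result on hardware.

```python
import numpy as np


def rot6d_to_matrix(rot6d: np.ndarray) -> np.ndarray:
    """Convert (..., 6) 6D rotations to (..., 3, 3) matrices via Gram-Schmidt.

    Assumption: values 0..2 and 3..5 are the (unnormalized) first and second
    columns of the rotation matrix.
    """
    a1, a2 = rot6d[..., :3], rot6d[..., 3:]
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    # Remove the component of a2 along b1, then normalize.
    a2_orth = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2_orth / np.linalg.norm(a2_orth, axis=-1, keepdims=True)
    b3 = np.cross(b1, b2)  # third column completes a right-handed frame
    return np.stack([b1, b2, b3], axis=-1)


def split_action_chunk(action_chunk: np.ndarray) -> dict:
    """Split a (T, 20) chunk into per-arm position, rotation, and gripper width."""
    assert action_chunk.ndim == 2 and action_chunk.shape[-1] == 20
    arms = {}
    # Assumed order: right arm occupies dims 0..9, left arm dims 10..19.
    for idx, name in enumerate(("right", "left")):
        arm = action_chunk[:, idx * 10:(idx + 1) * 10]
        arms[name] = {
            "pos": arm[:, 0:3],                    # (T, 3) relative translation
            "rot": rot6d_to_matrix(arm[:, 3:9]),   # (T, 3, 3) relative rotation
            "gripper": arm[:, 9],                  # (T,) gripper width
        }
    return arms


# Placeholder chunk with identity rotations for both arms.
chunk = np.zeros((24, 20), dtype=np.float32)
chunk[:, [3, 7]] = 1.0    # right arm: columns (1, 0, 0) and (0, 1, 0)
chunk[:, [13, 17]] = 1.0  # left arm: same identity rotation
arms = split_action_chunk(chunk)
print(arms["right"]["pos"].shape, arms["right"]["rot"].shape, arms["left"]["gripper"].shape)
```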
---

## Hardware & Software Requirements

| Mode                      |     RAM |    VRAM | GPU      |
| ------------------------- | ------: | ------: | -------- |
| Inference (FM head + VLM) | ≥ 32 GB | ~ 16 GB | RTX 4090 |
| Fine-tuning FM head       |       – | ~ 16 GB | RTX 4090 |

> **Note**: For real-world deployment, please follow the hardware setup and calibration guides in the [GitHub README](https://github.com/thu-ml/RDT2).

---

## Citation

```bibtex
@article{rdt2,
  title={RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization},
  author={RDT Team},
  journal={arXiv preprint arXiv:2602.03310},
  year={2025}
}

@software{rdt2_repo,
  title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
  author={RDT Team},
  url={https://github.com/thu-ml/RDT2},
  month={September},
  year={2025}
}
```