metadata
base_model:
- robotics-diffusion-transformer/rdt-1b
language:
- en
license: apache-2.0
pipeline_tag: robotics
arxiv: 2602.03310
tags:
- RDT
- rdt
- RDT 2
- Vision-Language-Action
- Bimanual
- Manipulation
- Zero-shot
- UMI
- Flowmatching
- Diffusion
- Action Expert
RDT2-FM: Flow-Matching Action Expert for RDT 2
RDT2-FM builds on a vision-language backbone (RDT2-VQ) and predicts short-horizon relative action chunks through an action expert that integrates an improved RDT architecture with a flow-matching objective. By leveraging flow matching, RDT2-FM achieves lower inference latency while maintaining strong instruction following and cross-embodiment generalization on UMI-style bimanual setups. This repository specifically provides the action expert component of RDT2-FM.
Highlights
- Low-latency control: Flow-matching policy head (no iterative denoising) for fast closed-loop actions.
- Zero-shot cross-embodiment: Designed to work with any bimanual platform (e.g., UR5e, Franka FR3) after proper calibration.
- Scales with RDT2-VQ: Pairs with the VLM backbone (RDT2-VQ) trained on 10k+ hours and 100+ scenes of UMI manipulation.
Quickstart (inference)
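Inference requires the RDT2 repository (https://github.com/thu-ml/RDT2). The snippet below assumes it is run from the repository root, so that `models.rdt_inferencer` and `configs/rdt/post_train.yaml` resolve.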
import yaml
import torch
import numpy as np
from models.rdt_inferencer import RDTInferencer

# Load configuration from the official repo
with open("configs/rdt/post_train.yaml", "r") as f:
    model_config = yaml.safe_load(f)

# Initialize the inferencer
model = RDTInferencer(
    config=model_config,
    pretrained_path="robotics-diffusion-transformer/RDT2-FM",
    # download normalizer from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
    normalizer_path="umi_normalizer_wo_downsample_indentity_rot.pt",
    pretrained_vision_language_model_name_or_path="robotics-diffusion-transformer/RDT2-VQ",
    device="cuda:0",
    dtype=torch.bfloat16,
)

# Inference step
result = model.step(
    observations={
        'images': {
            'left_stereo': np.zeros((384, 384, 3), dtype=np.uint8),   # Placeholder: Left arm RGB
            'right_stereo': np.zeros((384, 384, 3), dtype=np.uint8),  # Placeholder: Right arm RGB
        },
        'state': np.zeros(model_config["common"]["state_dim"]).astype(np.float32)
    },
    instruction="Pick up the apple."  # Recommended format: "Verb + Object."
)

# action_chunk shape: (24, 20) with dtype=np.float32
action_chunk = result.detach().cpu().numpy()

# Rescale gripper width from [0, 0.088] to [0, 0.1] for hardware
for robot_idx in range(2):
    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
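
The gripper rescaling at the end of the quickstart can also be written as a small helper that leaves its input untouched. This is only a sketch: `rescale_gripper`, `GRIPPER_INDICES`, `src_max`, and `dst_max` are hypothetical names that are not part of the RDT2 repository, and the target range of 0.1 should be replaced by whatever your gripper actually expects.

```python
import numpy as np

# Index 9 within each 10-D arm block holds the gripper width,
# matching `robot_idx * 10 + 9` in the quickstart loop.
GRIPPER_INDICES = [9, 19]


def rescale_gripper(action_chunk, src_max=0.088, dst_max=0.1):
    """Return a copy of a (T, 20) chunk with gripper widths mapped
    linearly from [0, src_max] to [0, dst_max].

    Hypothetical helper; the default ranges are taken from the quickstart.
    """
    out = np.array(action_chunk, dtype=np.float32, copy=True)
    out[:, GRIPPER_INDICES] = out[:, GRIPPER_INDICES] / src_max * dst_max
    return out


# Usage with a dummy chunk: fully open right-hand slot, half-open left-hand slot.
dummy = np.zeros((24, 20), dtype=np.float32)
dummy[:, 9] = 0.088
dummy[:, 19] = 0.044
scaled = rescale_gripper(dummy)
```

Returning a copy keeps the raw model output available for logging or replay, at the cost of one extra (24, 20) allocation per step.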
Model Details
Architecture
- Backbone: Vision-language backbone such as RDT2-VQ (Qwen2.5-VL-7B based).
- Action head: Flow-Matching (FM) expert mapping observations + instruction → continuous actions.
- Observation: Two wrist-camera RGB images (right/left), 384×384.
- Instruction: Short imperative text.
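
To make the flow-matching idea concrete, here is a minimal, generic sketch of how such a head produces an action chunk: start from Gaussian noise and integrate a velocity field from t=0 to t=1. It is not the RDT2-FM implementation. `sample_actions`, `toy_velocity`, and `num_steps` are illustrative names, and the toy velocity field (the conditional velocity of a straight-line path toward a fixed target) stands in for the learned network, which would be conditioned on the observations and the instruction. How many integration steps RDT2-FM actually uses is not stated in this card.

```python
import numpy as np


def sample_actions(velocity_fn, noise, num_steps=5):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = noise.astype(np.float32)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x


# Toy stand-in for the learned action expert (an assumption for illustration):
# for the straight-line path x_t = (1 - t) * x_0 + t * x_1, the conditional
# velocity is (x_1 - x_t) / (1 - t).
target = np.full((24, 20), 0.5, dtype=np.float32)


def toy_velocity(x, t):
    return (target - x) / (1.0 - t)


rng = np.random.default_rng(0)
noise = rng.standard_normal((24, 20)).astype(np.float32)
chunk = sample_actions(toy_velocity, noise, num_steps=5)
single_step = sample_actions(toy_velocity, noise, num_steps=1)
```

With this particular toy field, both the five-step and the single-step call land on `target`. A learned velocity field is generally not this well behaved, so the step count trades accuracy against latency.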
Action Representation (UMI bimanual, per 24-step chunk)
- 20-D per step = right (10) + left (10):
  - pos (x,y,z): 3
  - rot (6D rotation): 6
  - gripper width: 1
- Output tensor shape: (T=24, D=20), relative deltas.
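
The layout above can be unpacked as in the following sketch. Several details are assumptions rather than facts from this card: that the right arm occupies columns 0-9 and the left arm columns 10-19, that each arm block is ordered position, rotation, gripper (only the gripper at index 9 is confirmed by the quickstart), and that the 6D rotation stores the first two columns of the rotation matrix, recovered by Gram-Schmidt orthonormalization. `split_arm` and `rot6d_to_matrix` are hypothetical helpers; check the RDT2 repository for the actual conventions before using this on hardware.

```python
import numpy as np


def split_arm(block):
    """Split a (T, 10) arm block into position (T, 3), 6D rotation (T, 6),
    and gripper width (T,). The pos/rot/gripper ordering is an assumption."""
    return block[:, 0:3], block[:, 3:9], block[:, 9]


def rot6d_to_matrix(rot6d):
    """Convert (T, 6) rotations to (T, 3, 3) matrices via Gram-Schmidt.

    ASSUMPTION: the six numbers are the first two matrix columns, in order.
    """
    a1, a2 = rot6d[:, 0:3], rot6d[:, 3:6]
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    a2 = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2 / np.linalg.norm(a2, axis=-1, keepdims=True)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)  # b1, b2, b3 become the columns


# Dummy chunk whose rotation slots encode the identity rotation.
chunk = np.zeros((24, 20), dtype=np.float32)
chunk[:, 3:9] = [1, 0, 0, 0, 1, 0]
chunk[:, 13:19] = [1, 0, 0, 0, 1, 0]

right_pos, right_rot6d, right_grip = split_arm(chunk[:, :10])  # assumed: right arm first
left_pos, left_rot6d, left_grip = split_arm(chunk[:, 10:])
right_R = rot6d_to_matrix(right_rot6d)
```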
Hardware & Software Requirements
| Mode | RAM | VRAM | GPU |
|---|---|---|---|
| Inference (FM head + VLM) | ≥ 32 GB | ~ 16 GB | RTX 4090 |
| Fine-tuning FM head | – | ~ 16 GB | RTX 4090 |
Note: For real-world deployment, please follow the hardware setup and calibration guides in the GitHub README.
Citation
@article{rdt2,
  title={RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization},
  author={RDT Team},
  journal={arXiv preprint arXiv:2602.03310},
  year={2025}
}
@software{rdt2_repo,
  title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
  author={RDT Team},
  url={https://github.com/thu-ml/RDT2},
  month={September},
  year={2025}
}