base_model:
  - robotics-diffusion-transformer/rdt-1b
language:
  - en
license: apache-2.0
pipeline_tag: robotics
arxiv: 2602.03310
tags:
  - RDT
  - rdt
  - RDT 2
  - Vision-Language-Action
  - Bimanual
  - Manipulation
  - Zero-shot
  - UMI
  - Flowmatching
  - Diffusion
  - Action Expert

RDT2-FM: Flow-Matching Action Expert for RDT 2

RDT2-FM builds on a vision-language backbone (RDT2-VQ) and predicts short-horizon relative action chunks through an action expert that integrates an improved RDT architecture with a flow-matching objective. By leveraging flow matching, RDT2-FM achieves lower inference latency while maintaining strong instruction following and cross-embodiment generalization on UMI-style bimanual setups. This repository specifically provides the action expert component of RDT2-FM.

Paper - Home - GitHub - Discord

Highlights

Low-latency control: Flow-matching policy head (no iterative denoising) for fast closed-loop actions.
Zero-shot cross-embodiment: Designed to work with any bimanual platform (e.g., UR5e, Franka FR3) after proper calibration.
Scales with RDT2-VQ: Pairs with the VLM backbone (RDT2-VQ) trained on 10k+ hours and 100+ scenes of UMI manipulation.
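
To make the low-latency claim concrete, here is a generic sketch of how a flow-matching head turns noise into an action chunk: it integrates a velocity field from t=0 to t=1 with a few Euler steps. This is not the RDT2-FM sampler. The names `euler_sample` and `toy_velocity`, the step counts, and the fixed `target` are all assumptions for illustration; in a real policy the velocity would come from the learned action expert conditioned on images and the instruction.

```python
import numpy as np

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=np.float64).copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1.0, so the toy field below stays finite
        x = x + dt * velocity_fn(x, t)
    return x

# Hypothetical stand-in for a learned network: the exact velocity field that
# carries any point to a fixed target along a straight line.
target = np.full((24, 20), 0.5)

def toy_velocity(x, t):
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.standard_normal((24, 20))  # same shape as one action chunk

chunk_1 = euler_sample(toy_velocity, noise, num_steps=1)
chunk_5 = euler_sample(toy_velocity, noise, num_steps=5)
```

With this straight-line toy field, a single Euler step already lands on the target, which is the intuition behind flow matching needing fewer network evaluations than long denoising schedules. How many steps RDT2-FM actually uses is set in the RDT2 repository and is not assumed here.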

Quickstart (inference)

This model requires the RDT2 repository for inference.

import yaml
import torch
import numpy as np
from models.rdt_inferencer import RDTInferencer

# Load configuration from the official repo
with open("configs/rdt/post_train.yaml", "r") as f:
    model_config = yaml.safe_load(f)

# Initialize the inferencer
model = RDTInferencer(
    config=model_config,
    pretrained_path="robotics-diffusion-transformer/RDT2-FM",
    # download normalizer from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
    normalizer_path="umi_normalizer_wo_downsample_indentity_rot.pt",  
    pretrained_vision_language_model_name_or_path="robotics-diffusion-transformer/RDT2-VQ", 
    device="cuda:0",
    dtype=torch.bfloat16,
)

# Inference step
result = model.step(
    observations={
        'images': {
            'left_stereo': np.zeros((384, 384, 3), dtype=np.uint8),   # Placeholder: left wrist-camera RGB
            'right_stereo': np.zeros((384, 384, 3), dtype=np.uint8),  # Placeholder: right wrist-camera RGB
        },
        'state': np.zeros(model_config["common"]["state_dim"], dtype=np.float32)
    },
    instruction="Pick up the apple." # Recommended format: "Verb + Object."
)

# action_chunk shape: (24, 20) with dtype=np.float32
# Cast to float32 before converting, since NumPy has no bfloat16 dtype
action_chunk = result.detach().float().cpu().numpy()

# Rescale gripper width from [0, 0.088] to [0, 0.1] for hardware
for robot_idx in range(2):
    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
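
The gripper rescaling loop above can also be written as a small reusable helper. The sketch below is only an illustration: the function name `rescale_gripper` and its parameter names are hypothetical, while the gripper indices (9 and 19) and the ranges 0.088 and 0.1 are taken from the quickstart.

```python
import numpy as np

# Gripper width is the last entry of each 10-D arm block.
GRIPPER_DIMS = (9, 19)

def rescale_gripper(action_chunk, src_max=0.088, dst_max=0.1):
    """Return a copy of the chunk with gripper widths mapped from [0, src_max] to [0, dst_max]."""
    out = np.array(action_chunk, dtype=np.float32, copy=True)
    out[:, list(GRIPPER_DIMS)] *= dst_max / src_max
    return out

# Demo on a synthetic chunk with fully open grippers in the model's range.
demo_chunk = np.zeros((24, 20), dtype=np.float32)
demo_chunk[:, list(GRIPPER_DIMS)] = 0.088
hardware_chunk = rescale_gripper(demo_chunk)
```

Returning a copy keeps the raw model output available for logging; adjust `dst_max` if your gripper's maximum opening differs.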

Model Details

Architecture

Backbone: Vision-language backbone such as RDT2-VQ (Qwen2.5-VL-7B based).
Action head: Flow-Matching (FM) expert mapping observations + instruction → continuous actions.
Observation: Two wrist-camera RGB images (right/left), 384×384.
Instruction: Short imperative text.
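
Before calling the model, it can help to check that the observation dict matches the layout used in the quickstart. The following is a minimal sketch, not part of the RDT2 repository: `check_observations` is a hypothetical helper, and `demo_state_dim` is a placeholder value for the demo only (read the real one from `model_config["common"]["state_dim"]`).

```python
import numpy as np

EXPECTED_IMAGE_SHAPE = (384, 384, 3)          # H x W x RGB, as in the quickstart
CAMERA_KEYS = ("left_stereo", "right_stereo")  # keys used in the quickstart

def check_observations(observations, state_dim):
    """Raise ValueError if the observation dict does not match the quickstart layout."""
    images = observations["images"]
    for key in CAMERA_KEYS:
        if key not in images:
            raise ValueError(f"missing image '{key}'")
        img = images[key]
        if img.shape != EXPECTED_IMAGE_SHAPE or img.dtype != np.uint8:
            raise ValueError(f"'{key}' must be uint8 with shape {EXPECTED_IMAGE_SHAPE}")
    state = observations["state"]
    if state.shape != (state_dim,) or state.dtype != np.float32:
        raise ValueError(f"state must be float32 with shape ({state_dim},)")
    return True

demo_state_dim = 20  # placeholder for this demo only, not the official value
demo_obs = {
    "images": {k: np.zeros(EXPECTED_IMAGE_SHAPE, dtype=np.uint8) for k in CAMERA_KEYS},
    "state": np.zeros(demo_state_dim, dtype=np.float32),
}
ok = check_observations(demo_obs, demo_state_dim)
```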

Action Representation (UMI bimanual, per 24-step chunk)

20-D per step = right arm (10) + left arm (10). Each arm's 10 dimensions are:
- pos (x,y,z): 3
- rot (6D rotation): 6
- gripper width: 1
Output tensor shape: (T=24, D=20), relative deltas.
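
The layout above can be unpacked as in the sketch below, which slices one arm's block into position, 6D rotation, and gripper width, then converts the 6D rotation to a 3×3 matrix by Gram-Schmidt orthonormalization. The helper names are hypothetical. The sketch assumes the right arm occupies the first 10 dimensions (following the order listed above) and that the two 3-vectors of the 6D rotation are the first two columns of the rotation matrix; some codebases use rows instead, in which case the result should be transposed. Check the RDT2 repository for the convention it uses.

```python
import numpy as np

def split_arm(chunk, arm_index):
    """Return (pos, rot6d, gripper) for one arm; arm_index 0 = first block, 1 = second block."""
    block = chunk[:, arm_index * 10:(arm_index + 1) * 10]
    return block[:, 0:3], block[:, 3:9], block[:, 9]

def rot6d_to_matrix(rot6d):
    """Gram-Schmidt: build orthonormal b1, b2 from two 3-vectors, then b3 = b1 x b2."""
    a1, a2 = rot6d[..., 0:3], rot6d[..., 3:6]
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    a2_orth = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2_orth / np.linalg.norm(a2_orth, axis=-1, keepdims=True)
    b3 = np.cross(b1, b2)
    # Assumption: b1, b2, b3 are stacked as matrix columns.
    return np.stack([b1, b2, b3], axis=-1)

# Demo on a synthetic chunk whose rotations are all the identity.
demo = np.zeros((24, 20), dtype=np.float32)
for arm in range(2):
    demo[:, arm * 10 + 3:arm * 10 + 9] = [1, 0, 0, 0, 1, 0]

pos, rot6d, gripper = split_arm(demo, arm_index=0)
rot_mats = rot6d_to_matrix(rot6d)  # shape (24, 3, 3)
```

Because the outputs are relative deltas, how each position and rotation is composed with the robot's current pose depends on the controller; that step is left to the deployment code in the RDT2 repository.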

Hardware & Software Requirements

Mode	RAM	VRAM	GPU
Inference (FM head + VLM)	≥ 32 GB	~ 16 GB	RTX 4090
Fine-tuning FM head	–	~ 16 GB	RTX 4090

Note: For real-world deployment, please follow the hardware setup and calibration guides in the GitHub README.

Citation

@article{rdt2,
  title={RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization},
  author={RDT Team},
  journal={arXiv preprint arXiv:2602.03310},
  year={2026}
}

@software{rdt2_repo,
    title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
    author={RDT Team},
    url={https://github.com/thu-ml/RDT2},
    month={September},
    year={2025}
}