---
base_model:
- robotics-diffusion-transformer/rdt-1b
language:
- en
license: apache-2.0
pipeline_tag: robotics
arxiv: 2602.03310
tags:
- RDT
- rdt
- RDT 2
- Vision-Language-Action
- Bimanual
- Manipulation
- Zero-shot
- UMI
- Flowmatching
- Diffusion
- Action Expert
---
# RDT2-FM: Flow-Matching Action Expert for RDT 2
RDT2-FM builds on a vision-language backbone (RDT2-VQ) and predicts short-horizon relative action chunks through an action expert that integrates an improved RDT architecture with a flow-matching objective.
By leveraging flow matching, RDT2-FM achieves lower inference latency while maintaining strong instruction following and cross-embodiment generalization on UMI-style bimanual setups.
This repository specifically provides the action expert component of RDT2-FM.
[**Paper**](https://huggingface.co/papers/2602.03310) - [**Home**](https://rdt-robotics.github.io/rdt2/) - [**Github**](https://github.com/thu-ml/RDT2) - [**Discord**](https://discord.gg/vsZS3zmf9A)
---
## Highlights
* **Low-latency control**: Flow-matching policy head (no iterative denoising) for fast closed-loop actions.
* **Zero-shot cross-embodiment**: Designed to work with any bimanual platform (e.g., **UR5e**, **Franka FR3**) after proper calibration.
* **Scales with RDT2-VQ**: Pairs with the VLM backbone (**[RDT2-VQ](https://huggingface.co/robotics-diffusion-transformer/RDT2-VQ)**) trained on **10k+ hours** and **100+ scenes** of UMI manipulation.
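To illustrate the latency point above, here is a toy sketch (not the RDT2-FM implementation) of how a flow-matching head generates an action: it learns a velocity field and integrates a simple ODE from noise to an action with a handful of Euler steps, instead of running many denoising iterations. The `velocity_fn` below is a made-up stand-in; a real policy conditions the field on images and the instruction.

```python
import numpy as np

def sample_flow_matching(velocity_fn, action_dim, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (Gaussian noise) to t=1 (action)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(action_dim)  # start from noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)   # one Euler step
    return x

# Hypothetical velocity field whose flow pulls samples toward a fixed
# target action (for illustration only).
target = np.array([0.1, -0.2, 0.3])
v = lambda x, t: target - x

action = sample_flow_matching(v, action_dim=3)
```

Because sampling is just a short fixed-step ODE integration, the number of network evaluations per control step stays small, which is what enables fast closed-loop control.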
---
## Quickstart (inference)
This model requires the [RDT2 repository](https://github.com/thu-ml/RDT2) for inference.
```python
import yaml
import torch
import numpy as np
from models.rdt_inferencer import RDTInferencer

# Load configuration from the official repo
with open("configs/rdt/post_train.yaml", "r") as f:
    model_config = yaml.safe_load(f)

# Initialize the inferencer
model = RDTInferencer(
    config=model_config,
    pretrained_path="robotics-diffusion-transformer/RDT2-FM",
    # download normalizer from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
    normalizer_path="umi_normalizer_wo_downsample_indentity_rot.pt",
    pretrained_vision_language_model_name_or_path="robotics-diffusion-transformer/RDT2-VQ",
    device="cuda:0",
    dtype=torch.bfloat16,
)

# Inference step
result = model.step(
    observations={
        'images': {
            'left_stereo': np.zeros((384, 384, 3), dtype=np.uint8),   # Placeholder: Left arm RGB
            'right_stereo': np.zeros((384, 384, 3), dtype=np.uint8),  # Placeholder: Right arm RGB
        },
        'state': np.zeros(model_config["common"]["state_dim"]).astype(np.float32),
    },
    instruction="Pick up the apple."  # Recommended format: "Verb + Object."
)

# action_chunk shape: (24, 20) with dtype=np.float32
action_chunk = result.detach().cpu().numpy()

# Rescale gripper width from [0, 0.088] to [0, 0.1] for hardware
for robot_idx in range(2):
    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
```
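For continuous operation, the single `model.step` call above can be wrapped in a receding-horizon loop: execute only a prefix of each 24-step chunk, then re-observe and re-plan. The sketch below is an assumption about deployment style, not the official runner; `get_observation()` and `execute(action)` are hypothetical stand-ins for your robot driver.

```python
EXECUTE_STEPS = 12  # execute half of each 24-step chunk, then re-plan

def control_loop(model, instruction, get_observation, execute, n_chunks=4):
    """Hypothetical receding-horizon loop around model.step (sketch only)."""
    for _ in range(n_chunks):
        result = model.step(observations=get_observation(), instruction=instruction)
        chunk = result.detach().cpu().numpy()  # (24, 20) action chunk
        for action in chunk[:EXECUTE_STEPS]:
            execute(action)  # send one 20-D action to the robot driver
```

Executing only a prefix keeps the policy reacting to fresh observations while still amortizing each inference call over several control steps.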
---
## Model Details
### Architecture
* **Backbone**: Vision-language backbone such as **RDT2-VQ** (Qwen2.5-VL-7B based).
* **Action head**: **Flow-Matching (FM)** expert mapping observations + instruction → continuous actions.
* **Observation**: Two wrist-camera RGB images (right/left), 384×384.
* **Instruction**: Short imperative text.
### Action Representation (UMI bimanual, per 24-step chunk)
* 20-D per step = right (10) + left (10):
  * pos (x, y, z): 3
  * rot (6D rotation): 6
  * gripper width: 1
* Output tensor shape: **(T=24, D=20)**, relative deltas.
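A minimal sketch of decoding this layout, under the per-step structure listed above: slice each 20-D step into per-arm parts, and map the 6D rotation representation to a full rotation matrix via Gram-Schmidt orthonormalization (the standard construction for 6D rotations). The helper names here are ours, not part of the RDT2 API.

```python
import numpy as np

def rot6d_to_matrix(r6):
    """Map a 6D rotation (first two matrix columns) to a 3x3 rotation
    matrix via Gram-Schmidt orthonormalization."""
    a, b = r6[:3], r6[3:]
    x = a / np.linalg.norm(a)
    b = b - np.dot(x, b) * x
    y = b / np.linalg.norm(b)
    z = np.cross(x, y)
    return np.stack([x, y, z], axis=1)

def split_action_step(step):
    """Slice one 20-D step into per-arm (pos, rot6d, gripper) parts,
    following the right(10) + left(10) layout described above."""
    arms = {}
    for name, off in (("right", 0), ("left", 10)):
        arms[name] = {
            "pos": step[off:off + 3],        # x, y, z delta
            "rot6d": step[off + 3:off + 9],  # 6D rotation representation
            "gripper": step[off + 9],        # gripper width
        }
    return arms

# Example on a zero chunk with an identity rotation written into step 0
chunk = np.zeros((24, 20), dtype=np.float32)
chunk[0, 3:9] = [1, 0, 0, 0, 1, 0]  # identity rotation in 6D form
arms = split_action_step(chunk[0])
R = rot6d_to_matrix(arms["right"]["rot6d"])
```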
---
## Hardware & Software Requirements
| Mode | RAM | VRAM | GPU |
| ------------------------- | ---: | ---: | --- |
| Inference (FM head + VLM) | ≥ 32 GB | ~ 16 GB | RTX 4090 |
| Fine-tuning FM head | – | ~ 16 GB | RTX 4090 |
> **Note**: For real-world deployment, please follow the hardware setup and calibration guides in the [GitHub README](https://github.com/thu-ml/RDT2).
---
## Citation
```bibtex
@article{rdt2,
title={RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization},
author={RDT Team},
journal={arXiv preprint arXiv:2602.03310},
year={2025}
}
@software{rdt2_repo,
title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
author={RDT Team},
url={https://github.com/thu-ml/RDT2},
month={September},
year={2025}
}
```