---
license: apache-2.0
language:
- en
base_model:
- robotics-diffusion-transformer/rdt-1b
pipeline_tag: robotics
library_name: transformers
tags:
- RDT
- rdt
- RDT 2
- Vision-Language-Action
- Bimanual
- Manipulation
- Zero-shot
- UMI
- Flowmatching
- Diffusion
- Action Expert
---
# RDT2-FM: Flow-Matching Action Expert for RDT 2
RDT2-FM builds on a vision-language backbone ([RDT2-VQ](https://huggingface.co/robotics-diffusion-transformer/RDT2-VQ)) and predicts short-horizon relative action chunks through an action expert that integrates an improved RDT architecture with a flow-matching objective.
By leveraging flow matching, RDT2-FM achieves lower inference latency while maintaining strong instruction following and cross-embodiment generalization on UMI-style bimanual setups.
This repository specifically provides the action expert component of RDT2-FM.
[**Home**](https://rdt-robotics.github.io/rdt2/) - [**Github**](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file) - [**Discord**](https://discord.gg/vsZS3zmf9A) - [**Paper**](https://arxiv.org/abs/2602.03310)
---
## Table of contents
* [Highlights](#highlights)
* [Model details](#model-details)
* [Hardware & software requirements](#hardware--software-requirements)
* [Quickstart (inference)](#quickstart-inference)
* [Precision settings](#precision-settings)
* [Intended uses & limitations](#intended-uses--limitations)
* [Troubleshooting](#troubleshooting)
* [Changelog](#changelog)
* [Citation](#citation)
* [Contact](#contact)
---
## Highlights
* **Low-latency control**: Flow-matching policy head (no iterative denoising) for fast closed-loop actions.
* **Zero-shot cross-embodiment**: Designed to work with any bimanual platform (e.g., **UR5e**, **Franka FR3**) after proper calibration.
* **Scales with RDT2-VQ**: Pairs with the VLM backbone (**[RDT2-VQ](https://huggingface.co/robotics-diffusion-transformer/RDT2-VQ)**) trained on **10k+ hours** and **100+ scenes** of UMI manipulation.
---
## Model details
### Architecture
* **Backbone**: Vision-language backbone such as **RDT2-VQ** (Qwen2.5-VL-7B based).
* **Action head**: **Flow-Matching (FM)** expert mapping observations + instruction → continuous actions.
* **Observation**: Two wrist-camera RGB images (right/left), 384×384, JPEG-like statistics.
* **Instruction**: Short imperative text, recommended format **“Verb + Object.”** (e.g., “Pick up the apple.”).
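If you build instructions programmatically, a small helper like the one below can enforce the recommended format. This is an illustrative sketch only; the function name and its normalization rules are not part of the RDT2 API.
```python
def format_instruction(verb: str, obj: str) -> str:
    """Illustrative helper: build an instruction in the recommended "Verb + Object." format."""
    text = f"{verb.strip()} {obj.strip()}".rstrip(".")
    return text[0].upper() + text[1:] + "."

format_instruction("pick up", "the apple")  # -> "Pick up the apple."
```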
### Action representation (UMI bimanual, per 24-step chunk)
* 20-D per step = right (10) + left (10):
  * pos (x, y, z): 3
  * rot (6D rotation): 6
  * gripper width: 1
* Output tensor shape: **(T=24, D=20)**, relative deltas, `float32`.
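For reference, here is a minimal sketch of splitting a returned chunk into the components listed above. The helper name and dict layout are illustrative; the dimension ordering follows the layout described above (right arm first, then left, with pos/rot/gripper within each arm).
```python
import numpy as np

def split_action_chunk(action_chunk: np.ndarray) -> dict:
    """Illustrative helper: split a (24, 20) relative action chunk into per-arm parts."""
    assert action_chunk.shape == (24, 20)
    parts = {}
    for arm, offset in (("right", 0), ("left", 10)):
        block = action_chunk[:, offset:offset + 10]
        parts[arm] = {
            "pos": block[:, 0:3],          # (24, 3) relative x, y, z
            "rot_6d": block[:, 3:9],       # (24, 6) 6D rotation representation
            "gripper_width": block[:, 9],  # (24,) gripper width
        }
    return parts
```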
---
## Hardware & software requirements
Approximate **single-GPU** requirements:
| Mode | RAM | VRAM | Example GPU |
| ------------------------- | ------: | ------: | ----------------------- |
| Inference (FM head + VLM) | ≥ 32 GB | ~ 16 GB | RTX 4090 |
| Fine-tuning FM head | – | ~ 16 GB | RTX 4090 |
> For **deployment on real robots**, follow your platform’s **end-effector + camera** choices and perform **[hardware setup & calibration](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file#1-important-hard-ware-set-up-and-calibration)** (camera stand/pose, flange, etc.) before running closed-loop policies.
**Tested OS**: Ubuntu 24.04.
---
## Quickstart (inference)
```python
# Run from the root directory of the RDT2 GitHub repo: https://github.com/thu-ml/RDT2
import numpy as np
import torch
import yaml

from models.rdt_inferencer import RDTInferencer

with open("configs/rdt/post_train.yaml", "r") as f:
    model_config = yaml.safe_load(f)

model = RDTInferencer(
    config=model_config,
    pretrained_path="robotics-diffusion-transformer/RDT2-FM",
    # TODO: modify `normalizer_path` to your own downloaded normalizer path,
    # downloaded from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
    normalizer_path="umi_normalizer_wo_downsample_indentity_rot.pt",
    pretrained_vision_language_model_name_or_path="robotics-diffusion-transformer/RDT2-VQ",  # use RDT2-VQ as the VLM backbone
    device="cuda:0",
    dtype=torch.bfloat16,
)

# We suggest instructions in the "Verb + Object." format, with a capitalized first letter and a trailing period.
instruction = "Pick up the apple."

result = model.step(
    observations={
        'images': {
            # 'exterior_rs': np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8),
            'left_stereo': ...,   # left-arm RGB image: np.ndarray of shape (384, 384, 3), dtype=np.uint8
            'right_stereo': ...,  # right-arm RGB image: np.ndarray of shape (384, 384, 3), dtype=np.uint8
        },
        # The state input is currently unused at inference (pass zeros);
        # the interface is preserved for future fine-tuning.
        'state': np.zeros(model_config["common"]["state_dim"], dtype=np.float32),
    },
    instruction=instruction,  # language instruction
)

# Relative action chunk: np.ndarray of shape (24, 20), dtype=np.float32,
# in the same format as RDT2-VQ.
action_chunk = result.detach().cpu().numpy()

# Rescale gripper width from [0, 0.088] to [0, 0.1].
for robot_idx in range(2):
    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
```
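To turn the 6D rotation entries back into rotation matrices, the usual Gram-Schmidt construction can be applied. The sketch below assumes the first and second triplets are the first two (unnormalized) columns of the matrix; check the RDT2 repository's conversion utilities for the exact convention used in training.
```python
import numpy as np

def rot6d_to_matrix(rot6d: np.ndarray) -> np.ndarray:
    """Gram-Schmidt recovery of a 3x3 rotation matrix from a 6D rotation vector.
    Column convention is an assumption; verify against the RDT2 repo utilities."""
    a1, a2 = rot6d[:3], rot6d[3:6]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)

# Example: rotation matrix for the right arm at the first step of the chunk.
# R = rot6d_to_matrix(action_chunk[0, 3:9])
```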
> For guides on **installation and fine-tuning**, please refer to the official [GitHub repository](https://github.com/thu-ml/RDT2).
---
## Precision settings
* **RDT2-FM (action expert)**: `bfloat16` for training and inference.
* **RDT2-VQ (VLM backbone)**: `bfloat16` by default (Qwen2.5-VL practices).
---
## Intended uses & limitations
**Intended uses**
* Research in **robot manipulation** and **VLA modeling**.
* Low-latency, short-horizon control on bimanual systems following **hardware calibration** steps.
**Limitations**
* Performance depends on **calibration quality**, camera placement, and correct normalization.
* Shifts in dataset or action statistics can degrade behavior; verify action bounds and reconstruction quality when adapting.
**Safety & responsible use**
* Always test with **hardware limits** engaged (reduced speed, gravity compensation, E-stop within reach).
---
## Troubleshooting
| Symptom | Likely cause | Suggested fix |
| ---------------------------------- | ------------------------------- | ---------------------------------------------------------------------- |
| Drifting / unstable gripper widths | Scale mismatch | Apply **LinearNormalizer**; rescale widths ([0,0.088] → [0,0.1]). |
| Poor instruction following | Prompt format / backbone config | Use **“Verb + Object.”**; ensure the backbone is loaded on the same device. |
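As a quick sanity check for the first row, a minimal sketch is shown below. The helper name is illustrative, and it assumes the rescaled chunk from the quickstart, where gripper widths should land in [0, 0.1].
```python
import numpy as np

def check_gripper_widths(action_chunk: np.ndarray, width_max: float = 0.1, tol: float = 1e-3) -> None:
    """Illustrative check: after rescaling, gripper widths (dims 9 and 19) should lie in [0, 0.1]."""
    widths = action_chunk[:, [9, 19]]
    if widths.min() < -tol or widths.max() > width_max + tol:
        raise ValueError(
            f"Gripper widths out of range [{widths.min():.4f}, {widths.max():.4f}]; "
            "check the normalizer and the [0, 0.088] -> [0, 0.1] rescale."
        )
```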
---
## Changelog
* **2025-09**: Initial release of **RDT2-FM** on Hugging Face.
---
## Citation
```bibtex
@software{rdt2,
title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
author={RDT Team},
url={https://github.com/thu-ml/RDT2},
month={September},
year={2025}
}
```
---
## Contact
* Project page: [https://rdt-robotics.github.io/rdt2/](https://rdt-robotics.github.io/rdt2/)
* Organization: [https://huggingface.co/robotics-diffusion-transformer](https://huggingface.co/robotics-diffusion-transformer)
* Discord: [https://discord.gg/vsZS3zmf9A](https://discord.gg/vsZS3zmf9A)