---
base_model:
- robotics-diffusion-transformer/rdt-1b
language:
- en
license: apache-2.0
pipeline_tag: robotics
arxiv: 2602.03310
tags:
- RDT
- rdt
- RDT 2
- Vision-Language-Action
- Bimanual
- Manipulation
- Zero-shot
- UMI
- Flowmatching
- Diffusion
- Action Expert
---

# RDT2-FM: Flow-Matching Action Expert for RDT 2

RDT2-FM builds on a vision-language backbone (RDT2-VQ) and predicts short-horizon relative action chunks through an action expert that integrates an improved RDT architecture with a flow-matching objective.
By leveraging flow matching, RDT2-FM achieves lower inference latency while maintaining strong instruction following and cross-embodiment generalization on UMI-style bimanual setups.
This repository provides the action expert component of RDT2-FM.

[**Paper**](https://huggingface.co/papers/2602.03310) - [**Home**](https://rdt-robotics.github.io/rdt2/) - [**Github**](https://github.com/thu-ml/RDT2) - [**Discord**](https://discord.gg/vsZS3zmf9A)

---

## Highlights

* **Low-latency control**: Flow-matching policy head (no iterative denoising) for fast closed-loop actions (a generic sampling sketch follows this list).
* **Zero-shot cross-embodiment**: Designed to work with any bimanual platform (e.g., **UR5e**, **Franka FR3**) after proper calibration.
* **Scales with RDT2-VQ**: Pairs with the VLM backbone (**[RDT2-VQ](https://huggingface.co/robotics-diffusion-transformer/RDT2-VQ)**) trained on **10k+ hours** and **100+ scenes** of UMI manipulation.
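
The low-latency claim comes from the sampler: a flow-matching head integrates a learned velocity field over a small, fixed number of steps instead of running a long denoising schedule. The snippet below is a generic illustration only; `velocity_net` is a hypothetical module standing in for the policy head, and RDT2-FM's actual sampler is the one shipped in the official repository.

```python
import torch

@torch.no_grad()
def flow_matching_sample(velocity_net, condition, action_shape=(24, 20), num_steps=10, device="cuda"):
    """Illustrative Euler sampler for a flow-matching policy head.

    `velocity_net` is a hypothetical module predicting dx/dt from the current
    sample x, time t, and conditioning; step count and conditioning in RDT2-FM
    may differ from this sketch.
    """
    x = torch.randn(action_shape, device=device)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.tensor(i * dt, device=device)
        x = x + dt * velocity_net(x, t, condition)  # one Euler step along the flow
    return x  # e.g., a (24, 20) relative action chunk
```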

---

## Quickstart (inference)

This model requires the [RDT2 repository](https://github.com/thu-ml/RDT2) for inference.

```python
import yaml
import torch
import numpy as np
from models.rdt_inferencer import RDTInferencer

# Load configuration from the official repo
with open("configs/rdt/post_train.yaml", "r") as f:
    model_config = yaml.safe_load(f)

# Initialize the inferencer
model = RDTInferencer(
    config=model_config,
    pretrained_path="robotics-diffusion-transformer/RDT2-FM",
    # download the normalizer from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
    normalizer_path="umi_normalizer_wo_downsample_indentity_rot.pt",
    pretrained_vision_language_model_name_or_path="robotics-diffusion-transformer/RDT2-VQ",
    device="cuda:0",
    dtype=torch.bfloat16,
)

# Inference step
result = model.step(
    observations={
        'images': {
            'left_stereo': np.zeros((384, 384, 3), dtype=np.uint8),  # Placeholder: Left arm RGB
            'right_stereo': np.zeros((384, 384, 3), dtype=np.uint8), # Placeholder: Right arm RGB
        },
        'state': np.zeros(model_config["common"]["state_dim"]).astype(np.float32)
    },
    instruction="Pick up the apple." # Recommended format: "Verb + Object."
)

# action_chunk shape: (24, 20) with dtype=np.float32
action_chunk = result.detach().cpu().numpy()

# Rescale gripper width from [0, 0.088] to [0, 0.1] for hardware
for robot_idx in range(2):
    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
```
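
For real-robot use, the predicted chunk is typically consumed in a receding-horizon loop: execute part of the chunk, then re-query the model. The sketch below is a minimal illustration that reuses `model` and `model_config` from the quickstart; `get_wrist_images`, `get_robot_state`, and `send_relative_action` are hypothetical placeholders for your own camera and robot drivers, and the replan horizon is an arbitrary choice.

```python
import numpy as np

# Hypothetical hardware hooks -- replace with your own drivers.
def get_wrist_images():
    """Return (left_rgb, right_rgb) as 384x384x3 uint8 arrays."""
    raise NotImplementedError

def get_robot_state(state_dim):
    """Return the current proprioceptive state as a float32 vector."""
    raise NotImplementedError

def send_relative_action(step):
    """Send one 20-D relative action step to both arms."""
    raise NotImplementedError

state_dim = model_config["common"]["state_dim"]
instruction = "Pick up the apple."

for _ in range(100):  # replanning iterations
    left_rgb, right_rgb = get_wrist_images()
    result = model.step(
        observations={
            "images": {"left_stereo": left_rgb, "right_stereo": right_rgb},
            "state": get_robot_state(state_dim),
        },
        instruction=instruction,
    )
    action_chunk = result.detach().cpu().numpy()  # (24, 20)

    # Rescale gripper widths as in the quickstart before execution.
    for robot_idx in range(2):
        action_chunk[:, robot_idx * 10 + 9] *= 0.1 / 0.088

    # Execute part of the chunk, then replan (receding horizon).
    for step in action_chunk[:12]:
        send_relative_action(step)
```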

---

## Model Details

### Architecture

* **Backbone**: Vision-language backbone such as **RDT2-VQ** (Qwen2.5-VL-7B based).
* **Action head**: **Flow-Matching (FM)** expert mapping observations + instruction → continuous actions.
* **Observation**: Two wrist-camera RGB images (right/left), 384×384.
* **Instruction**: Short imperative text.

### Action Representation (UMI bimanual, per 24-step chunk)

* 20-D per step = right (10) + left (10):
  * pos (x,y,z): 3
  * rot (6D rotation): 6
  * gripper width: 1
* Output tensor shape: **(T=24, D=20)**, relative deltas (unpacked in the sketch below).
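
As a concrete illustration, a single 20-D step can be unpacked into per-arm components as follows. The index layout and the column convention assumed for the 6D rotation follow the list above; verify both against the official repository before relying on them.

```python
import numpy as np

def unpack_action_step(step: np.ndarray) -> dict:
    """Split one 20-D action step into per-arm components.

    Assumed layout (from the list above): [right(10), left(10)],
    each arm = pos(3) + 6D rotation(6) + gripper width(1).
    """
    assert step.shape == (20,)
    arms = {}
    for name, offset in (("right", 0), ("left", 10)):
        arms[name] = {
            "pos": step[offset : offset + 3],        # relative xyz translation
            "rot6d": step[offset + 3 : offset + 9],  # 6D rotation representation
            "gripper": step[offset + 9],             # gripper width
        }
    return arms

def rot6d_to_matrix(rot6d: np.ndarray) -> np.ndarray:
    """Convert a 6D rotation (two 3-vectors) to a 3x3 rotation matrix via
    Gram-Schmidt orthonormalization (column convention assumed)."""
    a1, a2 = rot6d[:3], rot6d[3:]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=1)

# Example: components = unpack_action_step(action_chunk[0])
#          R_right = rot6d_to_matrix(components["right"]["rot6d"])
```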

---

## Hardware & Software Requirements

| Mode                      | RAM | VRAM | GPU |
| ------------------------- | ---: | ---: | --- |
| Inference (FM head + VLM) | ≥ 32 GB | ~ 16 GB | RTX 4090 |
| Fine-tuning FM head       | – | ~ 16 GB | RTX 4090 |

> **Note**: For real-world deployment, please follow the hardware setup and calibration guides in the [GitHub README](https://github.com/thu-ml/RDT2).

---

## Citation

```bibtex
@article{rdt2,
  title={RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization},
  author={RDT Team},
  journal={arXiv preprint arXiv:2602.03310},
  year={2025}
}

@software{rdt2_repo,
    title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
    author={RDT Team},
    url={https://github.com/thu-ml/RDT2},
    month={September},
    year={2025}
}
```