---
base_model:
- robotics-diffusion-transformer/rdt-1b
language:
- en
license: apache-2.0
pipeline_tag: robotics
arxiv: 2602.03310
tags:
- RDT
- rdt
- RDT 2
- Vision-Language-Action
- Bimanual
- Manipulation
- Zero-shot
- UMI
- Flowmatching
- Diffusion
- Action Expert
---

# RDT2-FM: Flow-Matching Action Expert for RDT 2

RDT2-FM builds on a vision-language backbone (RDT2-VQ) and predicts short-horizon relative action chunks through an action expert that integrates an improved RDT architecture with a flow-matching objective. By leveraging flow matching, RDT2-FM achieves lower inference latency while maintaining strong instruction following and cross-embodiment generalization on UMI-style bimanual setups. This repository provides the action expert component of RDT2-FM.

[**Paper**](https://huggingface.co/papers/2602.03310) - [**Home**](https://rdt-robotics.github.io/rdt2/) - [**Github**](https://github.com/thu-ml/RDT2) - [**Discord**](https://discord.gg/vsZS3zmf9A)

---

## Highlights

* **Low-latency control**: Flow-matching policy head (no iterative denoising) for fast closed-loop actions.
* **Zero-shot cross-embodiment**: Designed to work with any bimanual platform (e.g., **UR5e**, **Franka FR3**) after proper calibration.
* **Scales with RDT2-VQ**: Pairs with the VLM backbone (**[RDT2-VQ](https://huggingface.co/robotics-diffusion-transformer/RDT2-VQ)**) trained on **10k+ hours** and **100+ scenes** of UMI manipulation data.

---

## Quickstart (inference)

This model requires the [RDT2 repository](https://github.com/thu-ml/RDT2) for inference.

```python
import yaml
import torch
import numpy as np
from models.rdt_inferencer import RDTInferencer

# Load configuration from the official repo
with open("configs/rdt/post_train.yaml", "r") as f:
    model_config = yaml.safe_load(f)

# Initialize the inferencer
model = RDTInferencer(
    config=model_config,
    pretrained_path="robotics-diffusion-transformer/RDT2-FM",
    # Download the normalizer from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
    normalizer_path="umi_normalizer_wo_downsample_indentity_rot.pt",
    pretrained_vision_language_model_name_or_path="robotics-diffusion-transformer/RDT2-VQ",
    device="cuda:0",
    dtype=torch.bfloat16,
)

# Inference step
result = model.step(
    observations={
        'images': {
            'left_stereo': np.zeros((384, 384, 3), dtype=np.uint8),   # Placeholder: left-arm RGB
            'right_stereo': np.zeros((384, 384, 3), dtype=np.uint8),  # Placeholder: right-arm RGB
        },
        'state': np.zeros(model_config["common"]["state_dim"], dtype=np.float32),
    },
    instruction="Pick up the apple."  # Recommended format: "Verb + Object."
)

# action_chunk shape: (24, 20), dtype=np.float32
action_chunk = result.detach().cpu().numpy()

# Rescale gripper width from [0, 0.088] to [0, 0.1] for the hardware
for robot_idx in range(2):
    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
```
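The gripper rescale above indexes each arm's 10-D sub-vector directly (right arm in dims 0-9, left arm in dims 10-19, gripper at offset 9). If you need the individual components of a step, here is a minimal NumPy sketch of that layout; the helper name `split_action_step` is illustrative, not part of the RDT2 API:

```python
import numpy as np

def split_action_step(step):
    """Split one 20-D action step into per-arm components.

    Layout (see the Action Representation section below): for each arm,
    3 position deltas, a 6-D rotation, and 1 gripper width; the right
    arm occupies dims 0-9 and the left arm dims 10-19.
    """
    arms = {}
    for name, offset in (("right", 0), ("left", 10)):
        arms[name] = {
            "pos": step[offset:offset + 3],        # (x, y, z) deltas
            "rot6d": step[offset + 3:offset + 9],  # 6-D rotation
            "gripper": step[offset + 9],           # gripper width
        }
    return arms

# Example on a placeholder chunk of shape (24, 20)
chunk = np.zeros((24, 20), dtype=np.float32)
first = split_action_step(chunk[0])  # first["right"]["pos"] has shape (3,)
```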

---

## Model Details

### Architecture

* **Backbone**: Vision-language backbone such as **RDT2-VQ** (Qwen2.5-VL-7B based).
* **Action head**: **Flow-Matching (FM)** expert mapping observations + instruction → continuous actions.
* **Observation**: Two wrist-camera RGB images (right/left), 384×384.
* **Instruction**: Short imperative text.

### Action Representation (UMI bimanual, per 24-step chunk)

* 20-D per step = right (10) + left (10):
  * pos (x, y, z): 3
  * rot (6D rotation): 6
  * gripper width: 1
* Output tensor shape: **(T=24, D=20)**, relative deltas.

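The 6-D rotation entries are typically the continuous rotation parameterization of Zhou et al., recovered via Gram-Schmidt. Below is a hedged sketch assuming the common "first two columns of the rotation matrix" convention; verify against the RDT2 repository's conventions before deploying:

```python
import numpy as np

def rot6d_to_matrix(r6):
    """Recover a 3x3 rotation matrix from a 6-D rotation vector via
    Gram-Schmidt (assumes the first two matrix columns are stacked;
    check the RDT2 repo's convention before relying on this)."""
    a1 = np.asarray(r6[:3], dtype=np.float64)
    a2 = np.asarray(r6[3:6], dtype=np.float64)
    b1 = a1 / np.linalg.norm(a1)                 # normalize first column
    a2 = a2 - np.dot(b1, a2) * b1                # remove b1 component
    b2 = a2 / np.linalg.norm(a2)                 # normalize second column
    b3 = np.cross(b1, b2)                        # third column by cross product
    return np.stack([b1, b2, b3], axis=1)

R = rot6d_to_matrix([1, 0, 0, 0, 1, 0])  # R is the 3x3 identity matrix
```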
---

## Hardware & Software Requirements

| Mode                      |     RAM |   VRAM | GPU      |
| ------------------------- | ------: | -----: | -------- |
| Inference (FM head + VLM) | ≥ 32 GB | ~16 GB | RTX 4090 |
| Fine-tuning FM head       |       – | ~16 GB | RTX 4090 |

> **Note**: For real-world deployment, please follow the hardware setup and calibration guides in the [GitHub README](https://github.com/thu-ml/RDT2).

---

## Citation

```bibtex
@article{rdt2,
  title={RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization},
  author={RDT Team},
  journal={arXiv preprint arXiv:2602.03310},
  year={2025}
}

@software{rdt2_repo,
  title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
  author={RDT Team},
  url={https://github.com/thu-ml/RDT2},
  month={September},
  year={2025}
}
```