---
license: apache-2.0
language:
- en
base_model:
- robotics-diffusion-transformer/rdt-1b
pipeline_tag: robotics
library_name: transformers
tags:
- RDT
- rdt
- RDT 2
- Vision-Language-Action
- Bimanual
- Manipulation
- Zero-shot
- UMI
- Flowmatching
- Diffusion
- Action Expert
---
# RDT2-FM: Flow-Matching Action Expert for RDT 2
RDT2-FM builds on a vision-language backbone ([RDT2-VQ](https://huggingface.co/robotics-diffusion-transformer/RDT2-VQ)) and predicts short-horizon relative action chunks through an action expert that integrates an improved RDT architecture with a flow-matching objective.
By leveraging flow matching, RDT2-FM achieves lower inference latency while maintaining strong instruction following and cross-embodiment generalization on UMI-style bimanual setups.
This repository specifically provides the action expert component of RDT2-FM.
[**Home**](https://rdt-robotics.github.io/rdt2/) - [**Github**](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file) - [**Discord**](https://discord.gg/vsZS3zmf9A) - [**Paper**](https://arxiv.org/abs/2602.03310)
---
## Table of contents
* [Highlights](#highlights)
* [Model details](#model-details)
* [Hardware & software requirements](#hardware--software-requirements)
* [Quickstart (inference)](#quickstart-inference)
* [Precision settings](#precision-settings)
* [Intended uses & limitations](#intended-uses--limitations)
* [Troubleshooting](#troubleshooting)
* [Changelog](#changelog)
* [Citation](#citation)
* [Contact](#contact)
---
## Highlights
* **Low-latency control**: Flow-matching policy head (no iterative denoising) for fast closed-loop actions.
* **Zero-shot cross-embodiment**: Designed to work with any bimanual platform (e.g., **UR5e**, **Franka FR3**) after proper calibration.
* **Scales with RDT2-VQ**: Pairs with the VLM backbone (**[RDT2-VQ](https://huggingface.co/robotics-diffusion-transformer/RDT2-VQ)**) trained on **10k+ hours** and **100+ scenes** of UMI manipulation.
---
## Model details
### Architecture
* **Backbone**: Vision-language backbone such as **RDT2-VQ** (Qwen2.5-VL-7B based).
* **Action head**: **Flow-Matching (FM)** expert mapping observations + instruction → continuous actions.
* **Observation**: Two wrist-camera RGB images (right/left), 384×384, JPEG-like statistics.
* **Instruction**: Short imperative text, recommended format **“Verb + Object.”** (e.g., “Pick up the apple.”).
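If you build instructions programmatically, a small helper like the one below can enforce the recommended format. This is an illustrative sketch only; the function name and its normalization rules are not part of the RDT2 API.
```python
def format_instruction(verb: str, obj: str) -> str:
    """Illustrative helper: build an instruction in the recommended "Verb + Object." format."""
    text = f"{verb.strip()} {obj.strip()}".rstrip(".")
    return text[0].upper() + text[1:] + "."

format_instruction("pick up", "the apple")  # -> "Pick up the apple."
```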
### Action representation (UMI bimanual, per 24-step chunk)
* 20-D per step = right (10) + left (10):
  * pos (x, y, z): 3
  * rot (6D rotation): 6
  * gripper width: 1
* Output tensor shape: **(T=24, D=20)**, relative deltas, `float32`.
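For reference, here is a minimal sketch of splitting a returned chunk into the components listed above. The helper name and dict layout are illustrative; the dimension ordering follows the layout described above (right arm first, then left, with pos/rot/gripper within each arm).
```python
import numpy as np

def split_action_chunk(action_chunk: np.ndarray) -> dict:
    """Illustrative helper: split a (24, 20) relative action chunk into per-arm parts."""
    assert action_chunk.shape == (24, 20)
    parts = {}
    for arm, offset in (("right", 0), ("left", 10)):
        block = action_chunk[:, offset:offset + 10]
        parts[arm] = {
            "pos": block[:, 0:3],          # (24, 3) relative x, y, z
            "rot_6d": block[:, 3:9],       # (24, 6) 6D rotation representation
            "gripper_width": block[:, 9],  # (24,) gripper width
        }
    return parts
```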
---
## Hardware & software requirements
Approximate **single-GPU** requirements:
| Mode | RAM | VRAM | Example GPU |
| ------------------------- | ------: | ------: | ----------------------- |
| Inference (FM head + VLM) | ≥ 32 GB | ~ 16 GB | RTX 4090 |
| Fine-tuning FM head | – | ~ 16 GB | RTX 4090 |
> For **deployment on real robots**, follow your platform’s **end-effector + camera** choices and perform **[hardware setup & calibration](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file#1-important-hard-ware-set-up-and-calibration)** (camera stand/pose, flange, etc.) before running closed-loop policies.
**Tested OS**: Ubuntu 24.04.
---
## Quickstart (inference)
```python
# Run from the root directory of the RDT2 GitHub repo: https://github.com/thu-ml/RDT2
import numpy as np
import torch
import yaml

from models.rdt_inferencer import RDTInferencer

with open("configs/rdt/post_train.yaml", "r") as f:
    model_config = yaml.safe_load(f)

model = RDTInferencer(
    config=model_config,
    pretrained_path="robotics-diffusion-transformer/RDT2-FM",
    # TODO: modify `normalizer_path` to your own downloaded normalizer path,
    # downloaded from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
    normalizer_path="umi_normalizer_wo_downsample_indentity_rot.pt",
    pretrained_vision_language_model_name_or_path="robotics-diffusion-transformer/RDT2-VQ",  # use RDT2-VQ as the VLM backbone
    device="cuda:0",
    dtype=torch.bfloat16,
)

# We suggest instructions in the "Verb + Object." format, with a capitalized first letter and a trailing period.
instruction = "Pick up the apple."

result = model.step(
    observations={
        'images': {
            # 'exterior_rs': np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8),
            'left_stereo': ...,   # left-arm RGB image: np.ndarray of shape (384, 384, 3), dtype=np.uint8
            'right_stereo': ...,  # right-arm RGB image: np.ndarray of shape (384, 384, 3), dtype=np.uint8
        },
        # The state input is currently unused at inference (pass zeros);
        # the interface is preserved for future fine-tuning.
        'state': np.zeros(model_config["common"]["state_dim"], dtype=np.float32),
    },
    instruction=instruction,  # language instruction
)

# Relative action chunk: np.ndarray of shape (24, 20), dtype=np.float32,
# in the same format as RDT2-VQ.
action_chunk = result.detach().cpu().numpy()

# Rescale gripper width from [0, 0.088] to [0, 0.1].
for robot_idx in range(2):
    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
```
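To turn the 6D rotation entries back into rotation matrices, the usual Gram-Schmidt construction can be applied. The sketch below assumes the first and second triplets are the first two (unnormalized) columns of the matrix; check the RDT2 repository's conversion utilities for the exact convention used in training.
```python
import numpy as np

def rot6d_to_matrix(rot6d: np.ndarray) -> np.ndarray:
    """Gram-Schmidt recovery of a 3x3 rotation matrix from a 6D rotation vector.
    Column convention is an assumption; verify against the RDT2 repo utilities."""
    a1, a2 = rot6d[:3], rot6d[3:6]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)

# Example: rotation matrix for the right arm at the first step of the chunk.
# R = rot6d_to_matrix(action_chunk[0, 3:9])
```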
> For guides on **installation and fine-tuning**, please refer to the official [GitHub repository](https://github.com/thu-ml/RDT2).
---
## Precision settings
* **RDT2-FM (action expert)**: `bfloat16` for training and inference.
* **RDT2-VQ (VLM backbone)**: `bfloat16` by default (Qwen2.5-VL practices).
---
## Intended uses & limitations
**Intended uses**
* Research in **robot manipulation** and **VLA modeling**.
* Low-latency, short-horizon control on bimanual systems following **hardware calibration** steps.
**Limitations**
* Performance depends on **calibration quality**, camera placement, and correct normalization.
* Shifts in dataset or action statistics can degrade behavior; verify action bounds and reconstruction quality when adapting.
**Safety & responsible use**
* Always test with **hardware limits** engaged (reduced speed, gravity compensation, E-stop within reach).
---
## Troubleshooting
| Symptom | Likely cause | Suggested fix |
| ---------------------------------- | ------------------------------- | ---------------------------------------------------------------------- |
| Drifting / unstable gripper widths | Scale mismatch | Apply **LinearNormalizer**; rescale widths ([0,0.088] → [0,0.1]). |
| Poor instruction following | Prompt format / backbone config | Use **“Verb + Object.”**; ensure the backbone is loaded on the same device. |
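As a quick sanity check for the first row, a minimal sketch is shown below. The helper name is illustrative, and it assumes the rescaled chunk from the quickstart, where gripper widths should land in [0, 0.1].
```python
import numpy as np

def check_gripper_widths(action_chunk: np.ndarray, width_max: float = 0.1, tol: float = 1e-3) -> None:
    """Illustrative check: after rescaling, gripper widths (dims 9 and 19) should lie in [0, 0.1]."""
    widths = action_chunk[:, [9, 19]]
    if widths.min() < -tol or widths.max() > width_max + tol:
        raise ValueError(
            f"Gripper widths out of range [{widths.min():.4f}, {widths.max():.4f}]; "
            "check the normalizer and the [0, 0.088] -> [0, 0.1] rescale."
        )
```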
---
## Changelog
* **2025-09**: Initial release of **RDT2-FM** on Hugging Face.
---
## Citation
```bibtex
@software{rdt2,
title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
author={RDT Team},
url={https://github.com/thu-ml/RDT2},
month={September},
year={2025}
}
```
---
## Contact
* Project page: [https://rdt-robotics.github.io/rdt2/](https://rdt-robotics.github.io/rdt2/)
* Organization: [https://huggingface.co/robotics-diffusion-transformer](https://huggingface.co/robotics-diffusion-transformer)
* Discord: [https://discord.gg/vsZS3zmf9A](https://discord.gg/vsZS3zmf9A)