Add paper link and improve model card
#2
by nielsr (HF Staff) - opened

README.md CHANGED

@@ -1,11 +1,11 @@
---
-license: apache-2.0
-language:
-- en
base_model:
- robotics-diffusion-transformer/rdt-1b
+language:
+- en
+license: apache-2.0
pipeline_tag: robotics
-
+arxiv: 2602.03310
tags:
- RDT
- rdt

@@ -20,32 +20,13 @@ tags:
- Action Expert
---

-
# RDT2-FM: Flow-Matching Action Expert for RDT 2

-<!-- RDT2-FM conditions on a vision-language backbone ([RDT2-VQ](https://huggingface.co/robotics-diffusion-transformer/RDT2-VQ)) and predicts short-horizon **relative action chunks** with an action expert with improved RDT architecture and flow-matching objective.
-Using a **flow-matching** objective, RDT2-FM delivering **lower inference latency** while preserving strong instruction following and cross-embodiment generalization on UMI-style bimanual setups.
-Concretely, This repository contains the **action expert** for RDT2-FM. -->
RDT2-FM builds on a vision-language backbone (RDT2-VQ) and predicts short-horizon relative action chunks through an action expert that integrates an improved RDT architecture with a flow-matching objective.
By leveraging flow matching, RDT2-FM achieves lower inference latency while maintaining strong instruction following and cross-embodiment generalization on UMI-style bimanual setups.
This repository specifically provides the action expert component of RDT2-FM.

-[**Home**](https://rdt-robotics.github.io/rdt2/) - [**Github**](https://github.com/thu-ml/RDT2)
-
----
-
-## Table of contents
-
-* [Highlights](#highlights)
-* [Model details](#model-details)
-* [Hardware & software requirements](#hardware--software-requirements)
-* [Quickstart (inference)](#quickstart-inference)
-* [Precision settings](#precision-settings)
-* [Intended uses & limitations](#intended-uses--limitations)
-* [Troubleshooting](#troubleshooting)
-* [Changelog](#changelog)
-* [Citation](#citation)
-* [Contact](#contact)
+[**Paper**](https://huggingface.co/papers/2602.03310) - [**Home**](https://rdt-robotics.github.io/rdt2/) - [**Github**](https://github.com/thu-ml/RDT2) - [**Discord**](https://discord.gg/vsZS3zmf9A)

---

@@ -57,149 +38,98 @@ This repository specifically provides the action expert component of RDT2-FM.

---

-## Model details
-
-### Architecture
-
-* **Backbone**: Vision-language backbone such as **RDT2-VQ** (Qwen2.5-VL-7B based).
-* **Action head**: **Flow-Matching (FM)** expert mapping observations + instruction → continuous actions.
-* **Observation**: Two wrist-camera RGB images (right/left), 384×384, JPEG-like statistics.
-* **Instruction**: Short imperative text, recommended format **“Verb + Object.”** (e.g., “Pick up the apple.”).
-
-### Action representation (UMI bimanual, per 24-step chunk)
-
-* 20-D per step = right (10) + left (10):
-  * pos (x,y,z): 3
-  * rot (6D rotation): 6
-  * gripper width: 1
-* Output tensor shape: **(T=24, D=20)**, relative deltas, `float32`.
-
----
-
-## Hardware & software requirements
-
-Approximate **single-GPU** requirements:
-
-| Mode | RAM | VRAM | Example GPU |
-| ------------------------- | ------: | ------: | ----------------------- |
-| Inference (FM head + VLM) | ≥ 32 GB | ~ 16 GB | RTX 4090 |
-| Fine-tuning FM head | – | ~ 16 GB | RTX 4090 |
-
-> For **deployment on real robots**, follow your platform’s **end-effector + camera** choices and perform **[hardware setup & calibration](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file#1-important-hard-ware-set-up-and-calibration)** (camera stand/pose, flange, etc.) before running closed-loop policies.
-
-**Tested OS**: Ubuntu 24.04.
-
----
-
## Quickstart (inference)

+This model requires the [RDT2 repository](https://github.com/thu-ml/RDT2) for inference.
+
```python
-# Run under root directory of RDT2 GitHub Repo: https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file#1-important-hard-ware-set-up-and-calibration
import yaml
+import torch
+import numpy as np
from models.rdt_inferencer import RDTInferencer

+# Load configuration from the official repo
with open("configs/rdt/post_train.yaml", "r") as f:
+    model_config = yaml.safe_load(f)

+# Initialize the inferencer
model = RDTInferencer(
+    config=model_config,
+    pretrained_path="robotics-diffusion-transformer/RDT2-FM",
+    # download the normalizer from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
+    normalizer_path="umi_normalizer_wo_downsample_indentity_rot.pt",
+    pretrained_vision_language_model_name_or_path="robotics-diffusion-transformer/RDT2-VQ",
+    device="cuda:0",
    dtype=torch.bfloat16,
)

+# Inference step
result = model.step(
    observations={
        'images': {
-            'right_stereo': ...,  # right arm RGB image in np.ndarray of shape (384, 384, 3) with dtype=np.uint8
+            'left_stereo': np.zeros((384, 384, 3), dtype=np.uint8),   # Placeholder: Left arm RGB
+            'right_stereo': np.zeros((384, 384, 3), dtype=np.uint8),  # Placeholder: Right arm RGB
        },
-        # use zero input current state for currently
-        # preserve input interface for future fine-tuning
        'state': np.zeros(model_config["common"]["state_dim"]).astype(np.float32)
    },
-    instruction=
-    # We suggest using Instruction in format "verb + object" with Capitalized First Letter and trailing period
+    instruction="Pick up the apple."  # Recommended format: "Verb + Object."
)

-# relative action chunk in np.ndarray of shape (24, 20) with dtype=np.float32
-# with the same format as RDT2-VQ
+# action_chunk shape: (24, 20) with dtype=np.float32
action_chunk = result.detach().cpu().numpy()

+# Rescale gripper width from [0, 0.088] to [0, 0.1] for hardware
for robot_idx in range(2):
    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
```

-> For guides on **installation and fine-tuning**, please refer to the official [GitHub repository](https://github.com/thu-ml/RDT2).
-
---

-## Precision settings
-
-* **RDT2-FM (action expert)**: `bfloat16` for training and inference.
-* **RDT2-VQ (VLM backbone)**: `bfloat16` by default (Qwen2.5-VL practices).
-
-## Intended uses & limitations
-
-**Intended uses**
-
-* Research in **robot manipulation** and **VLA modeling**.
-* Low-latency, short-horizon control on bimanual systems following **hardware calibration** steps.
-
-**Limitations**
+## Model Details
+
+### Architecture
+
+* **Backbone**: Vision-language backbone such as **RDT2-VQ** (Qwen2.5-VL-7B based).
+* **Action head**: **Flow-Matching (FM)** expert mapping observations + instruction → continuous actions.
+* **Observation**: Two wrist-camera RGB images (right/left), 384×384.
+* **Instruction**: Short imperative text.
+
+### Action Representation (UMI bimanual, per 24-step chunk)
+
+* 20-D per step = right (10) + left (10):
+  * pos (x,y,z): 3
+  * rot (6D rotation): 6
+  * gripper width: 1
+* Output tensor shape: **(T=24, D=20)**, relative deltas.

---

-## Troubleshooting
-
-| Symptom | Likely cause | Suggested fix |
-| ---------------------------------- | ------------------------------- | ---------------------------------------------------------------------- |
-| Drifting / unstable gripper widths | Scale mismatch | Apply **LinearNormalizer**; rescale widths ([0,0.088] → [0,0.1]). |
-| Poor instruction following | Prompt format / backbone config | Use **“Verb + Object.”**; ensure backbone is loaded on same device. |
-
----
+## Hardware & Software Requirements
+
+| Mode | RAM | VRAM | GPU |
+| ------------------------- | ---: | ---: | --- |
+| Inference (FM head + VLM) | ≥ 32 GB | ~ 16 GB | RTX 4090 |
+| Fine-tuning FM head | – | ~ 16 GB | RTX 4090 |
+
+> **Note**: For real-world deployment, please follow the hardware setup and calibration guides in the [GitHub README](https://github.com/thu-ml/RDT2).

---

## Citation

```bibtex
+@article{rdt2,
+  title={RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization},
+  author={RDT Team},
+  journal={arXiv preprint arXiv:2602.03310},
+  year={2025}
+}
+
+@software{rdt2_repo,
  title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
  author={RDT Team},
  url={https://github.com/thu-ml/RDT2},
  month={September},
  year={2025}
}
```
-
----
-
-## Contact
-
-* Project page: [https://rdt-robotics.github.io/rdt2/](https://rdt-robotics.github.io/rdt2/)
-* Organization: [https://huggingface.co/robotics-diffusion-transformer](https://huggingface.co/robotics-diffusion-transformer)
-* Discord: [https://discord.gg/vsZS3zmf9A](https://discord.gg/vsZS3zmf9A)
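
The new quickstart returns a (24, 20) relative action chunk. Below is a minimal sketch of unpacking it per arm, assuming each arm's 10 values are ordered position delta (0-2), 6D rotation (3-8), gripper width (9); this layout is consistent with the `robot_idx * 10 + 9` rescale in the quickstart but is an assumption rather than a documented API.

```python
import numpy as np

# Stand-in for the model output: (24, 20) float32,
# right arm = columns 0:10, left arm = columns 10:20.
action_chunk = np.zeros((24, 20), dtype=np.float32)

# Assumed per-arm layout: [0:3] position delta, [3:9] 6D rotation, [9] gripper width.
arms = {}
for idx, name in enumerate(["right", "left"]):
    block = action_chunk[:, idx * 10:(idx + 1) * 10]
    arms[name] = {
        "pos": block[:, 0:3],     # (24, 3) relative xyz
        "rot6d": block[:, 3:9],   # (24, 6) continuous 6D rotation
        "gripper": block[:, 9],   # (24,) gripper width
    }

# Gripper rescale from the quickstart: [0, 0.088] -> [0, 0.1]
for name in arms:
    arms[name]["gripper"] = arms[name]["gripper"] / 0.088 * 0.1
```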
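
The 6-D rotation entries are usually decoded with the Gram-Schmidt construction from the continuous 6D rotation representation (Zhou et al.). A sketch follows, assuming the six numbers encode the first two columns of the rotation matrix, which is the common convention and not specified by this card.

```python
import numpy as np

def rot6d_to_matrix(rot6d: np.ndarray) -> np.ndarray:
    """Convert a 6D rotation (two 3-vectors) to a 3x3 rotation matrix via Gram-Schmidt."""
    a1, a2 = rot6d[:3], rot6d[3:6]
    b1 = a1 / np.linalg.norm(a1)            # normalize the first axis
    a2_proj = a2 - np.dot(b1, a2) * b1      # remove the component along b1
    b2 = a2_proj / np.linalg.norm(a2_proj)  # normalize the second axis
    b3 = np.cross(b1, b2)                   # third axis completes the right-handed frame
    return np.stack([b1, b2, b3], axis=-1)  # columns are the basis vectors

# Example: the identity rotation encoded as the first two columns of I
print(rot6d_to_matrix(np.array([1., 0., 0., 0., 1., 0.])))
```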