nielsr (HF Staff) committed · Commit d37314a · verified · 1 Parent(s): c061c41

Add paper link and improve model card

This PR improves the model card for RDT2-FM by:
- Linking the model to its research paper: [RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization](https://huggingface.co/papers/2602.03310).
- Adding the `arxiv: 2602.03310` metadata tag.
- Removing `library_name: transformers` from the metadata. The model uses a custom `RDTInferencer` class from the official repository and is not directly compatible with standard `AutoModel` loading, so removing this prevents the display of broken automated code snippets.
- Improving the sample usage section with full imports and clearer instructions.

Files changed (1)
  1. README.md +52 -122
README.md CHANGED
@@ -1,11 +1,11 @@
1
  ---
2
- license: apache-2.0
3
- language:
4
- - en
5
  base_model:
6
  - robotics-diffusion-transformer/rdt-1b
 
 
 
7
  pipeline_tag: robotics
8
- library_name: transformers
9
  tags:
10
  - RDT
11
  - rdt
@@ -20,32 +20,13 @@ tags:
20
  - Action Expert
21
  ---
22
 
23
-
24
  # RDT2-FM: Flow-Matching Action Expert for RDT 2
25
 
26
- <!-- RDT2-FM conditions on a vision-language backbone ([RDT2-VQ](https://huggingface.co/robotics-diffusion-transformer/RDT2-VQ)) and predicts short-horizon **relative action chunks** with an action expert with improved RDT architecture and flow-matching objective.
27
- Using a **flow-matching** objective, RDT2-FM delivering **lower inference latency** while preserving strong instruction following and cross-embodiment generalization on UMI-style bimanual setups.
28
- Concretely, This repository contains the **action expert** for RDT2-FM. -->
29
  RDT2-FM builds on a vision-language backbone (RDT2-VQ) and predicts short-horizon relative action chunks through an action expert that integrates an improved RDT architecture with a flow-matching objective.
30
  By leveraging flow matching, RDT2-FM achieves lower inference latency while maintaining strong instruction following and cross-embodiment generalization on UMI-style bimanual setups.
31
  This repository specifically provides the action expert component of RDT2-FM.
32
 
33
- [**Home**](https://rdt-robotics.github.io/rdt2/) - [**Github**](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file) - [**Discord**](https://discord.gg/vsZS3zmf9A)
34
-
35
- ---
36
-
37
- ## Table of contents
38
-
39
- * [Highlights](#highlights)
40
- * [Model details](#model-details)
41
- * [Hardware & software requirements](#hardware--software-requirements)
42
- * [Quickstart (inference)](#quickstart-inference)
43
- * [Precision settings](#precision-settings)
44
- * [Intended uses & limitations](#intended-uses--limitations)
45
- * [Troubleshooting](#troubleshooting)
46
- * [Changelog](#changelog)
47
- * [Citation](#citation)
48
- * [Contact](#contact)
49
 
50
  ---
51
 
@@ -57,149 +38,98 @@ This repository specifically provides the action expert component of RDT2-FM.
57
 
58
  ---
59
 
60
- ## Model details
61
-
62
- ### Architecture
63
-
64
- * **Backbone**: Vision-language backbone such as **RDT2-VQ** (Qwen2.5-VL-7B based).
65
- * **Action head**: **Flow-Matching (FM)** expert mapping observations + instruction → continuous actions.
66
- * **Observation**: Two wrist-camera RGB images (right/left), 384×384, JPEG-like statistics.
67
- * **Instruction**: Short imperative text, recommended format **“Verb + Object.”** (e.g., “Pick up the apple.”).
68
-
69
- ### Action representation (UMI bimanual, per 24-step chunk)
70
-
71
- * 20-D per step = right (10) + left (10):
72
-
73
- * pos (x,y,z): 3
74
- * rot (6D rotation): 6
75
- * gripper width: 1
76
- * Output tensor shape: **(T=24, D=20)**, relative deltas, `float32`.
77
-
78
- ---
79
-
80
- ## Hardware & software requirements
81
-
82
- Approximate **single-GPU** requirements:
83
-
84
- | Mode | RAM | VRAM | Example GPU |
85
- | ------------------------- | ------: | ------: | ----------------------- |
86
- | Inference (FM head + VLM) | ≥ 32 GB | ~ 16 GB | RTX 4090 |
87
- | Fine-tuning FM head | – | ~ 16 GB | RTX 4090 |
88
-
89
- > For **deployment on real robots**, follow your platform’s **end-effector + camera** choices and perform **[hardware setup & calibration](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file#1-important-hard-ware-set-up-and-calibration)** (camera stand/pose, flange, etc.) before running closed-loop policies.
90
-
91
- **Tested OS**: Ubuntu 24.04.
92
-
93
- ---
94
-
95
  ## Quickstart (inference)
96
 
 
 
  ```python
- # Run under root directory of RDT2 GitHub Repo: https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file#1-important-hard-ware-set-up-and-calibration
  import yaml
-
  from models.rdt_inferencer import RDTInferencer

-
  with open("configs/rdt/post_train.yaml", "r") as f:
-     model_config = yaml.safe_load(f)

  model = RDTInferencer(
-     config=model_config,
-     pretrained_path="robotics-diffusion-transformer/RDT2-FM",
-     # TODO: modify `normalizer_path` to your own downloaded normalizer path
-     # download from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
-     normalizer_path="umi_normalizer_wo_downsample_indentity_rot.pt",
-     pretrained_vision_language_model_name_or_path="robotics-diffusion-transformer/RDT2-VQ",  # use RDT2-VQ as the VLM backbone
-     device="cuda:0",
-     dtype=torch.bfloat16,
  )

  result = model.step(
      observations={
          'images': {
-             # 'exterior_rs': np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8),
-             'left_stereo': ...,  # left arm RGB image in np.ndarray of shape (384, 384, 3) with dtype=np.uint8
-             'right_stereo': ...,  # right arm RGB image in np.ndarray of shape (384, 384, 3) with dtype=np.uint8
          },
-         # use zero input current state for currently
-         # preserve input interface for future fine-tuning
          'state': np.zeros(model_config["common"]["state_dim"]).astype(np.float32)
      },
-     instruction=instruction  # Language instruction
-     # We suggest using Instruction in format "verb + object" with Capitalized First Letter and trailing period
  )

-
- # relative action chunk in np.ndarray of shape (24, 20) with dtype=np.float32
- # with the same format as RDT2-VQ
  action_chunk = result.detach().cpu().numpy()

- # rescale gripper width from [0, 0.088] to [0, 0.1]
  for robot_idx in range(2):
      action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
  ```
142
 
143
- > For guides on **installation and fine-tuning**, please refer to the official [GitHub repository](https://github.com/thu-ml/RDT2).
144
-
145
  ---
146
 
147
- ## Precision settings
148
-
149
- * **RDT2-FM (action expert)**: `bfloat16` for training and inference.
150
- * **RDT2-VQ (VLM backbone)**: `bfloat16` by default (Qwen2.5-VL practices).
151
 
152
- ---
153
-
154
- ## Intended uses & limitations
155
-
156
- **Intended uses**
157
-
158
- * Research in **robot manipulation** and **VLA modeling**.
159
- * Low-latency, short-horizon control on bimanual systems following **hardware calibration** steps.
160
-
161
- **Limitations**
162
 
163
- * Performance depends on **calibration quality**, camera placement, and correct normalization.
164
- * Dataset/action-stat shift can degrade behavior—verify bounds and reconstruction when adapting.
 
 
165
 
166
- **Safety & responsible use**
167
 
168
- * Always test with **hardware limits** engaged (reduced speed, gravity compensation, E-stop within reach).
 
 
 
 
169
 
170
  ---
171
 
172
- ## Troubleshooting
173
-
174
- | Symptom | Likely cause | Suggested fix |
175
- | ---------------------------------- | ------------------------------- | ---------------------------------------------------------------------- |
176
- | Drifting / unstable gripper widths | Scale mismatch | Apply **LinearNormalizer**; rescale widths ([0,0.088] → [0,0.1]). |
177
- | Poor instruction following | Prompt format / backbone config | Use **“Verb + Object.”**; ensure backbone is loaded on same device. |
178
-
179
- ---
180
 
181
- ## Changelog
 
 
 
182
 
183
- * **2025-09**: Initial release of **RDT2-FM** on Hugging Face.
184
 
185
  ---
186
 
187
  ## Citation
188
 
189
  ```bibtex
190
- @software{rdt2,
 
 
 
 
 
 
 
191
  title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
192
  author={RDT Team},
193
  url={https://github.com/thu-ml/RDT2},
194
  month={September},
195
  year={2025}
196
  }
197
- ```
198
-
199
- ---
200
-
201
- ## Contact
202
-
203
- * Project page: [https://rdt-robotics.github.io/rdt2/](https://rdt-robotics.github.io/rdt2/)
204
- * Organization: [https://huggingface.co/robotics-diffusion-transformer](https://huggingface.co/robotics-diffusion-transformer)
205
- * Discord: [https://discord.gg/vsZS3zmf9A](https://discord.gg/vsZS3zmf9A)
 
1
  ---
 
 
 
2
  base_model:
3
  - robotics-diffusion-transformer/rdt-1b
4
+ language:
5
+ - en
6
+ license: apache-2.0
7
  pipeline_tag: robotics
8
+ arxiv: 2602.03310
9
  tags:
10
  - RDT
11
  - rdt
 
20
  - Action Expert
21
  ---
22
 
 
23
  # RDT2-FM: Flow-Matching Action Expert for RDT 2
24
 
 
 
 
25
  RDT2-FM builds on a vision-language backbone (RDT2-VQ) and predicts short-horizon relative action chunks through an action expert that integrates an improved RDT architecture with a flow-matching objective.
26
  By leveraging flow matching, RDT2-FM achieves lower inference latency while maintaining strong instruction following and cross-embodiment generalization on UMI-style bimanual setups.
27
  This repository specifically provides the action expert component of RDT2-FM.
28
 
29
+ [**Paper**](https://huggingface.co/papers/2602.03310) - [**Home**](https://rdt-robotics.github.io/rdt2/) - [**Github**](https://github.com/thu-ml/RDT2) - [**Discord**](https://discord.gg/vsZS3zmf9A)
30
 
31
  ---
32
 
 
38
 
39
  ---
40
 
41
  ## Quickstart (inference)
42
 
43
+ This model requires the [RDT2 repository](https://github.com/thu-ml/RDT2) for inference.
44
+
  ```python
  import yaml
+ import torch
+ import numpy as np
  from models.rdt_inferencer import RDTInferencer

+ # Load configuration from the official repo
  with open("configs/rdt/post_train.yaml", "r") as f:
+     model_config = yaml.safe_load(f)

+ # Initialize the inferencer
  model = RDTInferencer(
+     config=model_config,
+     pretrained_path="robotics-diffusion-transformer/RDT2-FM",
+     # download normalizer from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
+     normalizer_path="umi_normalizer_wo_downsample_indentity_rot.pt",
+     pretrained_vision_language_model_name_or_path="robotics-diffusion-transformer/RDT2-VQ",
+     device="cuda:0",
+     dtype=torch.bfloat16,
  )

+ # Inference step
  result = model.step(
      observations={
          'images': {
+             'left_stereo': np.zeros((384, 384, 3), dtype=np.uint8),  # Placeholder: Left arm RGB
+             'right_stereo': np.zeros((384, 384, 3), dtype=np.uint8),  # Placeholder: Right arm RGB
          },
          'state': np.zeros(model_config["common"]["state_dim"]).astype(np.float32)
      },
+     instruction="Pick up the apple."  # Recommended format: "Verb + Object."
  )

+ # action_chunk shape: (24, 20) with dtype=np.float32
  action_chunk = result.detach().cpu().numpy()

+ # Rescale gripper width from [0, 0.088] to [0, 0.1] for hardware
  for robot_idx in range(2):
      action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
  ```
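
The gripper rescaling at the end of the quickstart can also be written as a small helper that leaves the input untouched. This is only a sketch: `rescale_gripper_width` and its parameter names are hypothetical and not part of the RDT2 repository, and it assumes the gripper width sits at index 9 of each 10-D arm block, as in the loop above.

```python
import numpy as np


def rescale_gripper_width(action_chunk, src_max=0.088, dst_max=0.1):
    """Return a copy whose gripper-width columns are mapped from [0, src_max] to [0, dst_max].

    Hypothetical helper; assumes a (T, 20) chunk with the gripper width
    stored at column 9 of each 10-D arm block (columns 9 and 19).
    """
    rescaled = np.array(action_chunk, dtype=np.float32, copy=True)
    rescaled[:, 9::10] *= dst_max / src_max  # selects columns 9 and 19
    return rescaled


# Dummy chunk standing in for the model output.
demo_chunk = np.zeros((24, 20), dtype=np.float32)
demo_chunk[:, 9] = 0.088   # first arm block: fully open in the source range
demo_chunk[:, 19] = 0.044  # second arm block: half open in the source range
rescaled_chunk = rescale_gripper_width(demo_chunk)
```

Returning a copy keeps the raw model output available for logging or debugging, at the cost of one extra allocation per chunk.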
85
 
 
 
86
  ---
87
 
88
+ ## Model Details
 
 
 
89
 
90
+ ### Architecture
 
 
 
 
 
 
 
 
 
91
 
92
+ * **Backbone**: Vision-language backbone such as **RDT2-VQ** (Qwen2.5-VL-7B based).
93
+ * **Action head**: **Flow-Matching (FM)** expert mapping observations + instruction → continuous actions.
94
+ * **Observation**: Two wrist-camera RGB images (right/left), 384×384.
95
+ * **Instruction**: Short imperative text.
96
 
97
+ ### Action Representation (UMI bimanual, per 24-step chunk)
98
 
99
+ * 20-D per step = right (10) + left (10):
100
+ * pos (x,y,z): 3
101
+ * rot (6D rotation): 6
102
+ * gripper width: 1
103
+ * Output tensor shape: **(T=24, D=20)**, relative deltas.
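
To make the layout above concrete, here is a minimal sketch that splits a `(24, 20)` chunk into per-arm components and converts the 6D rotation into rotation matrices. The function names are hypothetical and not part of the RDT2 repository. The within-arm ordering (position, then rotation, then gripper width) follows the list above and the gripper index used in the quickstart. The 6D convention (the six values read as the first two columns of the rotation matrix, completed by Gram-Schmidt) is an assumption that should be checked against the official code before use.

```python
import numpy as np


def split_action_chunk(chunk):
    """Split a (T, 20) action chunk into per-arm pos / rot6d / gripper arrays.

    Assumes right arm first, then left, each as [pos(3), rot6d(6), gripper(1)].
    """
    chunk = np.asarray(chunk, dtype=np.float32)
    if chunk.ndim != 2 or chunk.shape[1] != 20:
        raise ValueError(f"expected shape (T, 20), got {chunk.shape}")
    arms = {}
    for idx, name in enumerate(("right", "left")):
        block = chunk[:, idx * 10:(idx + 1) * 10]
        arms[name] = {
            "pos": block[:, 0:3],
            "rot6d": block[:, 3:9],
            "gripper": block[:, 9],
        }
    return arms


def rot6d_to_matrix(rot6d):
    """Convert (T, 6) 6D rotations to (T, 3, 3) matrices via Gram-Schmidt.

    Assumption: the six values are the first two columns of the matrix.
    """
    a1, a2 = rot6d[:, 0:3], rot6d[:, 3:6]
    b1 = a1 / np.linalg.norm(a1, axis=1, keepdims=True)
    a2_orth = a2 - np.sum(b1 * a2, axis=1, keepdims=True) * b1
    b2 = a2_orth / np.linalg.norm(a2_orth, axis=1, keepdims=True)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)  # b1, b2, b3 become the columns


# Dummy chunk with an identity rotation for both arms.
chunk = np.zeros((24, 20), dtype=np.float32)
for offset in (0, 10):
    chunk[:, offset + 3:offset + 9] = [1, 0, 0, 0, 1, 0]

arms = split_action_chunk(chunk)
rotations = rot6d_to_matrix(arms["right"]["rot6d"])
```

Because the actions are relative deltas, the resulting matrices describe a change in orientation, not an absolute pose.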
104
 
105
  ---
106
 
107
+ ## Hardware & Software Requirements
 
 
 
 
 
 
 
108
 
109
+ | Mode | RAM | VRAM | GPU |
110
+ | ------------------------- | ---: | ---: | --- |
111
+ | Inference (FM head + VLM) | ≥ 32 GB | ~ 16 GB | RTX 4090 |
112
+ | Fine-tuning FM head | – | ~ 16 GB | RTX 4090 |
113
 
114
+ > **Note**: For real-world deployment, please follow the hardware setup and calibration guides in the [GitHub README](https://github.com/thu-ml/RDT2).
115
 
116
  ---
117
 
118
  ## Citation
119
 
120
  ```bibtex
121
+ @article{rdt2,
122
+ title={RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization},
123
+ author={RDT Team},
124
+ journal={arXiv preprint arXiv:2602.03310},
125
+ year={2026}
126
+ }
127
+
128
+ @software{rdt2_repo,
129
  title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
130
  author={RDT Team},
131
  url={https://github.com/thu-ml/RDT2},
132
  month={September},
133
  year={2025}
134
  }
135
+ ```