---
license: mit
language:
- en
pipeline_tag: robotics
library_name: transformers
tags:
- robotics
- pytorch
- multimodal
- pretraining
- vla
- diffusion
- rdt
---

# RDT-1B

RDT-1B is a 1B-parameter imitation learning Diffusion Transformer pre-trained on 1M+ multi-robot episodes. Given a language instruction and RGB images from up to three views, RDT predicts the next 64 robot actions. RDT is compatible with almost all modern mobile manipulators, from single-arm to dual-arm, joint space to EEF, position to velocity control, and even robots with a mobile chassis.

All the [code](https://github.com/thu-ml/RoboticsDiffusionTransformer/tree/main?tab=readme-ov-file), pre-trained model weights, and [data](https://github.com/thu-ml/RoboticsDiffusionTransformer) are licensed under the MIT license.

Please refer to our [project page](https://rdt-robotics.github.io/rdt-robotics/) and [paper](https://rdt-robotics.github.io/rdt-robotics/static/paper.pdf) for more information.

## Model Details

- **Developed by:** The RDT team consisting of researchers from the [TSAIL group](https://ml.cs.tsinghua.edu.cn/) at Tsinghua University
- **Task Type:** Vision-Language-Action (language, image => robot actions)
- **Model Type:** Diffusion Policy with Transformers
- **License:** MIT
- **Language(s) (NLP):** en
- **Multi-Modal Encoders:**
  - **Vision Backbone:** [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
  - **Language Model:** [t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl)
- **Pre-Training Datasets:** 46 datasets consisting of [RT-1 Dataset](https://robotics-transformer1.github.io/), [RH20T](https://rh20t.github.io/), [DROID](https://droid-dataset.github.io/), [BridgeData V2](https://rail-berkeley.github.io/bridgedata/), [RoboSet](https://robopen.github.io/roboset/), and a subset of [Open X-Embodiment](https://robotics-transformer-x.github.io/). See [this link](https://github.com/thu-ml/RoboticsDiffusionTransformer/blob/main/docs/pretrain.md#download-and-prepare-datasets) for a detailed list.
- **Repository:** https://github.com/thu-ml/RoboticsDiffusionTransformer
- **Paper:** https://rdt-robotics.github.io/rdt-robotics/static/paper.pdf
- **Project Page:** https://rdt-robotics.github.io/rdt-robotics/

## Uses

RDT takes a language instruction, RGB images (of up to three views), control frequency (if any), and proprioception as input and predicts the next 64 robot actions. RDT supports control of almost all robot manipulators via its unified action space, which includes all the main physical quantities of a robot manipulator (e.g., end-effector and joint, position and velocity, and wheeled locomotion). To deploy on your robot platform, fill the relevant quantities of the raw action vector into the unified space vector. See [our repository](https://github.com/thu-ml/RoboticsDiffusionTransformer) for more information.
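As a minimal sketch, filling a raw action vector into the unified space vector might look like the following. Note that the 128-dimensional size and the specific indices here are illustrative assumptions; the actual index mapping is defined in the repository's configuration:

```python
import numpy as np

# Hypothetical layout for illustration only: the real index mapping and the
# unified-space dimension are defined in the repository's configuration.
UNIFIED_DIM = 128              # assumed size of the unified action space
ARM_JOINT_POS = slice(0, 7)    # hypothetical slice for arm joint positions
GRIPPER_OPEN = 7               # hypothetical index for gripper open degree

def to_unified(joint_pos, gripper_open):
    """Scatter a robot's raw quantities into the unified space vector.

    Quantities the robot does not have simply stay at zero (padding).
    """
    vec = np.zeros(UNIFIED_DIM, dtype=np.float32)
    vec[ARM_JOINT_POS] = joint_pos
    vec[GRIPPER_OPEN] = gripper_open
    return vec

unified = to_unified(np.zeros(7, dtype=np.float32), 1.0)
```

Entries that your robot does not use stay zero, so single-arm, dual-arm, and wheeled platforms can all share the same vector layout.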

**Out-of-Scope**: Due to the embodiment gap, RDT cannot yet generalize to new robot platforms (ones not seen in the pre-training datasets). In this case, we recommend collecting a small dataset on the target robot and then using it to fine-tune RDT. See [our repository](https://github.com/thu-ml/RoboticsDiffusionTransformer) for a tutorial.

Here's an example of how to use the RDT-1B model for inference on a robot:
```python
# Please first clone the repository and install dependencies
# Then switch to the root directory of the repository by "cd RoboticsDiffusionTransformer"
from typing import List

import torch
from PIL import Image

# Import the create function from the code base
from scripts.agilex_model import create_model

# Names of cameras used for visual input
CAMERA_NAMES = ['cam_high', 'cam_right_wrist', 'cam_left_wrist']
config = {
    'episode_len': 1000,  # Max length of one episode
    'state_dim': 14,      # Dimension of the robot's state
    'chunk_size': 64,     # Number of actions to predict in one step
    'camera_names': CAMERA_NAMES,
}
pretrained_vision_encoder_name_or_path = "google/siglip-so400m-patch14-384"
# Create the model with the specified configuration
model = create_model(
    args=config,
    dtype=torch.bfloat16,
    pretrained_vision_encoder_name_or_path=pretrained_vision_encoder_name_or_path,
    pretrained='robotics-diffusion-transformer/rdt-1b',
    control_frequency=25,
)

# Start the inference process
# Load the pre-computed language embeddings
lang_embeddings_path = 'your/language/embedding/path'
text_embedding = torch.load(lang_embeddings_path)['embeddings']
images: List[Image.Image] = ...  # The images from the last 2 frames
proprio = ...  # The current robot state
# Perform inference to predict the next `chunk_size` actions
actions = model.step(
    proprio=proprio,
    images=images,
    text_embeds=text_embedding,
)
```
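Each inference call returns a chunk of `chunk_size` (here, 64) actions. A common deployment pattern is to execute only the first few actions of each chunk before re-querying the model, so the policy stays reactive to fresh observations. The sketch below illustrates this receding-horizon pattern; the horizon length and the loop body are illustrative assumptions, not part of the repository's API:

```python
import numpy as np

def execute_chunk(actions, execute_horizon=8):
    """Execute only the first `execute_horizon` actions of a predicted chunk.

    Re-querying the model after a short horizon keeps the policy reactive
    to new observations instead of replaying all 64 actions open-loop.
    """
    executed = []
    for action in actions[:execute_horizon]:
        # Stand-in for sending the command to the robot controller
        executed.append(action)
    return executed

chunk = np.zeros((64, 14), dtype=np.float32)  # dummy chunk: 64 steps of 14-dim actions
first_actions = execute_chunk(chunk)
```

A shorter horizon reacts faster but queries the model more often; tune it against your control frequency and inference latency.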

<!-- RDT-1B supports finetuning on custom datasets, deploying and inferencing on real robots, and retraining the model.
Please refer to [our repository](https://github.com/GeneralEmbodiedSystem/RoboticsDiffusionTransformer/blob/main/docs/pretrain.md) for all the above guides. -->