---
license: other
license_name: rlwrld-model-license-v1.0
license_link: LICENSE.md
library_name: transformers
pipeline_tag: robotics
tags:
- robotics
- vla
- vision-language-action
- manipulation
- flow-matching
- rldx
- libero
base_model: RLWRLD/RLDX-1-PT
---
# RLDX-1-FT-LIBERO

[Paper](https://arxiv.org/abs/2605.03269) · [Project page](https://rlwrld.ai/rldx-1) · [Code](https://github.com/RLWRLD/RLDX-1) · [Models](https://huggingface.co/collections/RLWRLD/rldx-1)

<p align="center">
<img src="teaser.png" width="100%" alt="RLDX-1 teaser">
</p>

**RLDX-1** is a general-purpose Robot Foundation Model designed for dexterous
manipulation. Powered by a **Multi-Stream Action Transformer (MSAT)**, it
unifies multimodal perception (visual + tactile), high-DoF actuation, and
memory-aware decision-making in a single architecture.

This repository hosts **`RLDX-1-FT-LIBERO`**: RLDX-1 finetuned on the four
**LIBERO task suites (goal, object, spatial, long)**. The same checkpoint is
also evaluated on the **LIBERO-Plus** generalization suite. It achieves an
average success rate of **97.8%** on LIBERO and **87.6%** on LIBERO-Plus.

## Highlights

- **Multi-Stream Action Transformer (MSAT).** Cognition, physics, and
  action each get a dedicated stream, and the streams are coupled by joint
  self-attention, an extension of MM-DiT to action modeling.
- **Motion awareness.** Multi-frame observations + a motion module
  capture temporal dynamics; intermediate VLM layers compress video
  tokens to keep the policy efficient.
- **Long-term memory.** A memory module fuses past cognition features
  with the current ones for history-grounded decisions beyond a short
  multi-frame window.
- **Physical sensing.** Tactile and torque signals enter as a dedicated
  physics stream; the decoder is jointly trained to predict future physical
  signals.
- **Three-stage training.** Pre-training (generalization) → mid-training
  (functionality) → post-training (task adaptation), with synthetic data
  augmenting rare manipulation scenarios.
- **Real-time inference.** Static graph capture + custom fused kernels
  bring the all-modality model to **43.7 ms / step on an RTX 5090
  (1.63× speedup, >22 Hz)**.

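
The real-time inference figures in the list above can be cross-checked with a few lines of arithmetic. This is a minimal sketch, assuming the reported speedup is the ratio of unoptimized to optimized per-step latency; the implied baseline latency it prints is derived here and is not a number stated in this card.

```python
# Figures quoted in the Highlights list.
latency_ms = 43.7  # optimized latency per policy step on an RTX 5090
speedup = 1.63     # reported speedup from graph capture + fused kernels

# Steps per second follow directly from the per-step latency.
control_rate_hz = 1000.0 / latency_ms

# Assumption: speedup = baseline latency / optimized latency.
implied_baseline_ms = latency_ms * speedup

print(f"control rate: {control_rate_hz:.1f} Hz")                      # 22.9 Hz
print(f"implied unoptimized latency: {implied_baseline_ms:.1f} ms")   # 71.2 ms
```

The computed rate of about 22.9 Hz is consistent with the ">22 Hz" claim.
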
## Performance

| Benchmark | Success Rate |
|---|---|
| LIBERO (Avg) | **97.8%** |
| LIBERO-Plus | **87.6%** |

## Quick start

### Installation

```bash
git clone https://github.com/RLWRLD/RLDX-1.git
cd RLDX-1
uv sync --python 3.10
uv pip install -e .
```

### Inference

```python
from rldx.policy.rldx_policy import RLDXPolicy
from rldx.data.embodiment_tags import EmbodimentTag

policy = RLDXPolicy(
    model_path="RLWRLD/RLDX-1-FT-LIBERO",
    embodiment_tag=EmbodimentTag.GENERAL_EMBODIMENT,
    device="cuda:0",
)

# `observation` is supplied by the caller and carries the model inputs
# (RGB frames, proprioceptive state, language instruction).
action = policy.get_action(observation)
```

### Real-time serving (ZeroMQ)

```bash
uv run python rldx/eval/run_rldx_server.py \
    --model-path RLWRLD/RLDX-1-FT-LIBERO \
    --embodiment-tag GENERAL_EMBODIMENT \
    --host 0.0.0.0 --port 20000
```

To reproduce the benchmark numbers end-to-end:

- LIBERO: [`run_scripts/eval/libero/README.md`](https://github.com/RLWRLD/RLDX-1/blob/main/run_scripts/eval/libero/README.md)
- LIBERO-Plus: [`run_scripts/eval/libero_plus/README.md`](https://github.com/RLWRLD/RLDX-1/blob/main/run_scripts/eval/libero_plus/README.md)

## Model details

- **Architecture:** Multi-Stream Action Transformer (MSAT) policy on a
  Qwen3-VL backbone with a cognition-token perceptual summary. Trained with
  flow matching.
- **Inputs:** RGB video (default 4 frames), proprioceptive state, language
  instruction.
- **Outputs:** Action chunks of length 16.
- **Embodiment tag:** `GENERAL_EMBODIMENT`.
- **Base model:** [`RLWRLD/RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT).
- **Backbone:** [`Qwen/Qwen3-VL-8B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct).
- **Finetune data:** The four LIBERO task suites (goal, object, spatial, long).
- **Params:** 6.9B.

For the full architectural walkthrough, see
[`docs/architecture.md`](https://github.com/RLWRLD/RLDX-1/blob/main/docs/architecture.md).

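
To make the input and output shapes listed under Model details concrete, the sketch below shows one common way to drive a chunked policy: keep a rolling window of the last 4 frames, request a new chunk of 16 actions only when the previous chunk is exhausted, and execute the actions one by one. Everything except the numbers 4 and 16 is an assumption for illustration: `DummyPolicy`, `DummyEnv`, the observation keys, and the 7-dimensional action are hypothetical, and the repository may stack frames or re-plan differently.

```python
from collections import deque

NUM_FRAMES = 4     # default video context from the model card
CHUNK_LENGTH = 16  # actions returned per policy call, from the model card
ACTION_DIM = 7     # hypothetical action size, chosen only for this demo


class DummyPolicy:
    """Hypothetical stand-in for a chunked policy; it is not RLDXPolicy."""

    def __init__(self):
        self.last_observation = None

    def get_action(self, observation):
        self.last_observation = observation
        # One chunk: CHUNK_LENGTH actions, each a list of ACTION_DIM zeros.
        return [[0.0] * ACTION_DIM for _ in range(CHUNK_LENGTH)]


class DummyEnv:
    """Hypothetical environment that only counts executed actions."""

    def __init__(self):
        self.steps = 0

    def get_frame(self):
        return f"frame-{self.steps}"

    def get_state(self):
        return [0.0] * ACTION_DIM

    def apply_action(self, action):
        self.steps += 1


def run_episode(policy, env, instruction, max_steps=64):
    frames = deque(maxlen=NUM_FRAMES)  # rolling window of recent frames
    pending = deque()                  # actions left in the current chunk
    policy_calls = 0
    for _ in range(max_steps):
        frames.append(env.get_frame())
        if not pending:
            # Early in the episode, pad the history with the oldest frame.
            history = [frames[0]] * (NUM_FRAMES - len(frames)) + list(frames)
            observation = {  # hypothetical keys, not the repository's schema
                "video": history,
                "state": env.get_state(),
                "language": instruction,
            }
            pending.extend(policy.get_action(observation))
            policy_calls += 1
        env.apply_action(pending.popleft())
    return policy_calls


policy, env = DummyPolicy(), DummyEnv()
calls = run_episode(policy, env, "put the bowl on the plate")
print(calls, env.steps)  # 4 policy calls for 64 executed actions
```

Executing the whole chunk before querying again keeps the number of policy calls low; re-planning after only part of a chunk is an alternative design that trades more inference for faster reaction to new observations.
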
## RLDX-1 model family

| Checkpoint | Description |
|---|---|
| [`RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT) | Multi-source pretrained foundation |
| [`RLDX-1-VLM`](https://huggingface.co/RLWRLD/RLDX-1-VLM) | Qwen3-VL-8B vision-language backbone |
| [`RLDX-1-FT-ROBOCASA`](https://huggingface.co/RLWRLD/RLDX-1-FT-ROBOCASA) | RoboCasa Kitchen 24-task finetune |
| [`RLDX-1-FT-RC365`](https://huggingface.co/RLWRLD/RLDX-1-FT-RC365) | RoboCasa-365 cross-task finetune |
| [`RLDX-1-FT-LIBERO`](https://huggingface.co/RLWRLD/RLDX-1-FT-LIBERO) | LIBERO finetune on the four task suites (goal, object, spatial, long); this repo |
| [`RLDX-1-FT-SIMPLER-GOOGLE`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-GOOGLE) | SIMPLER Google VM/VA finetune |
| [`RLDX-1-FT-SIMPLER-WIDOWX`](https://huggingface.co/RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX) | SIMPLER WidowX finetune |
| [`RLDX-1-FT-GR1`](https://huggingface.co/RLWRLD/RLDX-1-FT-GR1) | GR-1 Tabletop finetune |
| [`RLDX-1-MT-DROID`](https://huggingface.co/RLWRLD/RLDX-1-MT-DROID) | DROID mid-train |
| [`RLDX-1-MT-ALLEX`](https://huggingface.co/RLWRLD/RLDX-1-MT-ALLEX) | All add-ons (memory + motion + physics + video) |

## Intended use & limitations

**Intended use.** Research on robotic manipulation, simulation benchmarking
on LIBERO and LIBERO-Plus, and non-commercial real-robot deployment under
the conditions of the RLWRLD Model License v1.0.

**Out of scope.** Commercial deployment, military or weapons applications,
non-consensual surveillance, and any use that violates applicable laws or
regulations. See [`LICENSE.md`](LICENSE.md) §3.5 for the full list.

**Limitations.** Performance is reported on the four LIBERO task suites
(goal, object, spatial, long), which match the finetuning data, and on the
LIBERO-Plus generalization suite. Performance on out-of-distribution scenes,
novel objects, or non-Franka embodiments is not guaranteed. For other
embodiments or datasets, finetune from
[`RLDX-1-PT`](https://huggingface.co/RLWRLD/RLDX-1-PT) instead.

## Citation

```bibtex
@article{rldx2026,
  title={RLDX-1 Technical Report},
  author={Kim, Dongyoung and Jang, Huiwon and Koo, Myungkyu and Jang, Suhyeok and Kim, Taeyoung and others},
  year={2026},
  note={RLWRLD},
  eprint={2605.03269},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2605.03269}
}
```

## License

Released under the **RLWRLD Model License v1.0**, a non-commercial license
with attribution and share-alike requirements. See [`LICENSE.md`](LICENSE.md) for
the full text. By using this model you agree to those terms, including the
use restrictions in §3.5.