RLDX-1

RLDX-1 teaser

RLDX-1 is a general-purpose Robot Foundation Model designed for dexterous manipulation. Built on a Multi-Stream Action Transformer (MSAT), it unifies multimodal perception (visual + tactile), high-DoF actuation, and memory-aware decision-making in a single architecture. RLDX-1 achieves state-of-the-art performance across diverse simulation benchmarks and is validated on real-world hardware.

This repository hosts RLDX-1-PT, the foundation checkpoint pretrained on a broad mixture of public manipulation corpora. All downstream RLDX-1-{FT,MT}-* releases are finetuned from it. Use it as your starting point for new embodiments and tasks.

RLDX-1 architecture

Highlights

Multi-Stream Action Transformer (MSAT). Cognition, physics, and action each get a dedicated stream, and the streams are coupled by joint self-attention, extending MM-DiT to action modeling.
Motion awareness. Multi-frame observations + a motion module capture temporal dynamics; intermediate VLM layers compress video tokens to keep the policy efficient.
Long-term memory. A memory module fuses past cognition features with the current ones for history-grounded decisions beyond a short multi-frame window.
Physical sensing. Tactile and torque signals enter through a dedicated physics stream; the decoder is jointly trained to predict future physical signals.
Three-stage training. Pre-training (generalization) → mid-training (functionality) → post-training (task adaptation), with synthetic data augmenting rare manipulation scenarios.
Real-time inference. Static graph capture + custom fused kernels bring the all-modality model to 43.7 ms / step on RTX 5090 (1.63× speedup, >22 Hz).
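
The joint self-attention coupling described for MSAT can be illustrated with a small sketch. This is not the RLDX-1 implementation: it assumes the MM-DiT pattern in which each stream keeps its own query/key/value projections, all tokens are concatenated for a single attention pass, and the result is split back per stream. The token counts, the feature width `d`, and the helper names are hypothetical.

```python
import math
import random

def project(tokens, weight):
    """Multiply each token vector (length d) by a d x d weight matrix."""
    return [[sum(x * w for x, w in zip(tok, col)) for col in zip(*weight)]
            for tok in tokens]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def joint_self_attention(streams, weights):
    """streams: name -> list of token vectors; weights: name -> (Wq, Wk, Wv)."""
    names = list(streams)
    q, k, v = [], [], []
    for name in names:
        w_q, w_k, w_v = weights[name]      # stream-specific projections
        q += project(streams[name], w_q)   # concatenate along the token axis
        k += project(streams[name], w_k)
        v += project(streams[name], w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    # Every token attends to every token of every stream.
    attn = [softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
            for qi in q]
    mixed = [[sum(p * vj[c] for p, vj in zip(row, v)) for c in range(len(v[0]))]
             for row in attn]
    # Split the mixed tokens back into their original streams.
    out, start = {}, 0
    for name in names:
        n = len(streams[name])
        out[name] = mixed[start:start + n]
        start += n
    return out, attn

rng = random.Random(0)
d = 4  # hypothetical feature width

def rand(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

# Hypothetical token counts for the three streams named in the card.
streams = {"cognition": rand(3, d), "physics": rand(2, d), "action": rand(4, d)}
weights = {name: (rand(d, d), rand(d, d), rand(d, d)) for name in streams}
out, attn = joint_self_attention(streams, weights)
```

Each stream gets back as many tokens as it put in, but every output token is a mixture over all nine tokens, which is how action tokens can read from cognition and physics tokens in a single pass.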

Released Checkpoints

This card describes RLDX-1-PT (foundation). The full RLDX-1 model family:

Checkpoint	Description	Params	Embodiment Tag
`RLDX-1-PT`	Multi-source pretrained foundation (this repo)	6.9B	per-dataset
`RLDX-1-VLM`	Qwen3-VL-8B vision-language backbone	8B	—
`RLDX-1-FT-ROBOCASA`	RoboCasa Kitchen 24-task finetune	6.9B	`GENERAL_EMBODIMENT`
`RLDX-1-FT-RC365`	RoboCasa-365 cross-task finetune	6.9B	`GENERAL_EMBODIMENT`
`RLDX-1-FT-LIBERO`	LIBERO 4-task suite (goal, object, spatial, long) finetune	6.9B	`GENERAL_EMBODIMENT`
`RLDX-1-FT-SIMPLER-GOOGLE`	SIMPLER Google VM/VA finetune	6.9B	`OXE_FRACTAL`
`RLDX-1-FT-SIMPLER-WIDOWX`	SIMPLER WidowX finetune	6.9B	`OXE_BRIDGE_ORIG`
`RLDX-1-FT-GR1`	GR-1 Tabletop finetune	6.9B	`GENERAL_EMBODIMENT`
`RLDX-1-MT-DROID`	DROID mid-train	8.1B	`OXE_DROID`
`RLDX-1-MT-ALLEX`	All add-ons (memory + motion + physics + video)	8.1B	`GENERAL_EMBODIMENT`

Performance

Success rate (%) of RLDX-1 finetuned on each benchmark's training set, evaluated with the linked checkpoint.

Benchmark	Success Rate	Checkpoint
LIBERO (Avg)	97.8	`RLDX-1-FT-LIBERO`
LIBERO-Plus	87.6	`RLDX-1-FT-LIBERO`
SIMPLER Google-VM	81.5	`RLDX-1-FT-SIMPLER-GOOGLE`
SIMPLER Google-VA	77.4	`RLDX-1-FT-SIMPLER-GOOGLE`
SIMPLER WidowX	71.9	`RLDX-1-FT-SIMPLER-WIDOWX`
RoboCasa Kitchen (24 tasks)	70.6	`RLDX-1-FT-ROBOCASA`
GR-1 Tabletop	58.7	`RLDX-1-FT-GR1`
RoboCasa365 (Avg)	31.5	`RLDX-1-FT-RC365`

Quick start

git clone https://github.com/RLWRLD/RLDX-1.git
cd RLDX-1
uv sync --python 3.10
uv pip install -e .

Inference (single step)

from rldx.policy.rldx_policy import RLDXPolicy
from rldx.data.embodiment_tags import EmbodimentTag

policy = RLDXPolicy(
    model_path="RLWRLD/RLDX-1-FT-ROBOCASA",
    embodiment_tag=EmbodimentTag.GENERAL_EMBODIMENT,
    device="cuda:0",
)

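# `observation` is built by the caller from the inputs listed under Model details
# (RGB video, state proprioception, optional tactile / torque, language instruction).
action = policy.get_action(observation)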

RLDX-1-PT is pretrained on a multi-source mixture, so for direct inference pair it with the embodiment tag that matches your data source, e.g. OXE_FRACTAL, OXE_BRIDGE_ORIG, OXE_DROID, GALAXEA, AGIBOT_GRIPPER, AGIBOT_DEXHAND, NEURAL_GR1, HUMANOID_EVERYDAY_G1, or HUMANOID_EVERYDAY_H1. For custom robots, finetune from RLDX-1-PT instead.
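
The checkpoint-to-tag pairing from the Released Checkpoints table can be kept in one place so that a mismatched tag fails early. The sketch below only encodes that table as plain strings; the helper name `embodiment_tag_for` and its `data_source_tag` parameter are hypothetical and not part of the `rldx` package.

```python
# Tag names copied from the Released Checkpoints table.
CHECKPOINT_TAGS = {
    "RLDX-1-FT-ROBOCASA": "GENERAL_EMBODIMENT",
    "RLDX-1-FT-RC365": "GENERAL_EMBODIMENT",
    "RLDX-1-FT-LIBERO": "GENERAL_EMBODIMENT",
    "RLDX-1-FT-SIMPLER-GOOGLE": "OXE_FRACTAL",
    "RLDX-1-FT-SIMPLER-WIDOWX": "OXE_BRIDGE_ORIG",
    "RLDX-1-FT-GR1": "GENERAL_EMBODIMENT",
    "RLDX-1-MT-DROID": "OXE_DROID",
    "RLDX-1-MT-ALLEX": "GENERAL_EMBODIMENT",
}

def embodiment_tag_for(checkpoint, data_source_tag=None):
    """Return the tag name to pair with a checkpoint (hypothetical helper).

    RLDX-1-PT has no single tag: the caller must pass the tag of the
    data source they are running on.
    """
    name = checkpoint.split("/")[-1]  # accept "RLWRLD/<name>" or "<name>"
    if name == "RLDX-1-PT":
        if data_source_tag is None:
            raise ValueError("RLDX-1-PT needs a per-dataset tag, e.g. OXE_DROID")
        return data_source_tag
    return CHECKPOINT_TAGS[name]

tag = embodiment_tag_for("RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX")
```

If `EmbodimentTag` is a standard Python `Enum` (an assumption about the package), the returned string could be converted with `EmbodimentTag[tag]` before it is passed to `RLDXPolicy`.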

Real-time serving (ZeroMQ)

uv run python rldx/eval/run_rldx_server.py \
    --model-path RLWRLD/RLDX-1-FT-ROBOCASA \
    --embodiment-tag GENERAL_EMBODIMENT \
    --host 0.0.0.0 --port 20000

A WebSocket server (run_rldx_server_pi.py) is also available for openpi-compatible clients.
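
When sizing a serving loop, it can help to convert the latency figure quoted in Highlights into a rate. The arithmetic below uses only the numbers from this card; it assumes that the 1.63× speedup is measured against the same model without graph capture and fused kernels, which the card does not state explicitly.

```python
step_latency_ms = 43.7   # reported latency per step on RTX 5090
speedup = 1.63           # reported speedup

# Steps per second the policy can sustain: about 22.9, consistent with ">22 Hz".
steps_per_second = 1000.0 / step_latency_ms

# Latency implied for the unoptimized baseline (assumption: same model, same GPU):
# about 71.2 ms, i.e. roughly 14 Hz.
implied_baseline_ms = step_latency_ms * speedup
implied_baseline_hz = 1000.0 / implied_baseline_ms

print(f"{steps_per_second:.1f} Hz optimized, {implied_baseline_hz:.1f} Hz baseline")
```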

Finetune from `RLDX-1-PT`

uv run python rldx/experiment/launch_train.py \
    --base-model-path RLWRLD/RLDX-1-PT \
    --dataset-path /path/to/your/dataset \
    --embodiment-tag GENERAL_EMBODIMENT \
    --video-length 4 --n-cog-tokens 64 \
    --global-batch-size 64 --learning-rate 1e-4 \
    --max-steps 60000 --save-steps 5000 \
    --output-dir ./outputs/my_finetune

To enable add-ons (memory / motion / physics) see the recipes in the main README and the training.md guide.
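
As a quick sanity check on the finetuning command above, the flags imply the following training budget. The per-device split assumes a hypothetical 8-GPU node, and the checkpoint count assumes one save at every multiple of `--save-steps`, including the final step; neither is stated in this card.

```python
global_batch_size = 64
max_steps = 60_000
save_steps = 5_000
num_gpus = 8  # hypothetical; adjust to your hardware

samples_seen = global_batch_size * max_steps        # training samples drawn in total
num_checkpoints = max_steps // save_steps           # saves at 5000, 10000, ..., 60000
per_device_batch = global_batch_size // num_gpus    # without gradient accumulation

print(samples_seen, num_checkpoints, per_device_batch)
```

With these values the run draws 3,840,000 samples and writes 12 checkpoints.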

Model details

Architecture: Multi-Stream Action Transformer (MSAT) policy with a Qwen3-VL vision-language backbone, cognition-token perceptual summary, optional Transformer memory, motion module, and tactile/torque physics encoder/decoder. Trained with flow matching.
Inputs: RGB video (default 4 frames), state proprioception, optional tactile / torque signals, language instruction.
Outputs: Action chunks of length 16 (default --action-horizon 16).
Backbone: Qwen/Qwen3-VL-8B-Instruct.
Pretraining data: A mixture of public manipulation corpora, covering 27 Open X-Embodiment (OXE) datasets (DROID, Bridge, Fractal, Language Table, …) plus Galaxea, AgiBot World (Gripper + Dexhand), ActionNet, Neural-Curated GR-1 humanoid trajectories, and Unitree G1 / H1 from HumanoidEveryday.
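
For readers unfamiliar with flow matching, the sketch below shows the regression target and a fixed-step sampler on a toy problem. It assumes the common straight-path convention (noise at t = 0, data at t = 1, target velocity `x1 - x0`); the card does not specify which variant RLDX-1 uses, and the oracle velocity function stands in for a trained network.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target on that path: d x_t / dt = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predicted_v, x0, x1):
    """Mean squared error between predicted and target velocity."""
    u = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted_v, u)) / len(u)

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
target_action = [0.5, -0.2, 0.1]                      # toy 3-D "action"
noise = [rng.gauss(0.0, 1.0) for _ in target_action]

def oracle_velocity(x, t):
    # With a single known target, the exact velocity is (x1 - x) / (1 - t).
    return [(b - a) / (1.0 - t) for a, b in zip(x, target_action)]

midpoint = interpolate(noise, target_action, 0.5)
sampled = euler_sample(oracle_velocity, noise, num_steps=10)
```

In a real policy the oracle would be replaced by the network's velocity prediction conditioned on the observation, and `sampled` would be a full action chunk rather than a 3-vector.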

For a full architectural walkthrough see docs/architecture.md.
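
Because the policy outputs chunks of 16 actions rather than single actions, a control loop has to decide how many actions from each chunk to execute before querying the policy again. The following is a minimal sketch of that receding-horizon pattern with a stand-in policy; `DummyChunkPolicy`, the `{"step": ...}` observation, and the `replan_every` parameter are hypothetical and do not reflect the `rldx` API.

```python
from collections import deque

ACTION_HORIZON = 16  # default chunk length from this card

class DummyChunkPolicy:
    """Stand-in for a chunked policy; returns placeholder actions."""

    def __init__(self):
        self.calls = 0

    def get_action(self, observation):
        self.calls += 1
        # Each placeholder action records the query step and its index in the chunk.
        return [(observation["step"], i) for i in range(ACTION_HORIZON)]

def run_episode(policy, num_steps, replan_every=ACTION_HORIZON):
    """Execute `replan_every` actions from each chunk, then query again."""
    queue = deque()
    executed = []
    for step in range(num_steps):
        if not queue:
            chunk = policy.get_action({"step": step})
            queue.extend(chunk[:replan_every])
        executed.append(queue.popleft())
    return executed

policy = DummyChunkPolicy()
executed = run_episode(policy, num_steps=32, replan_every=8)
```

Executing the full chunk minimizes the number of policy calls, while a smaller `replan_every` reacts faster to new observations at the cost of more inference calls; here 32 steps with `replan_every=8` trigger 4 calls.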

Intended use & limitations

Intended use. Research on robotic manipulation, finetuning on custom embodiments, simulation benchmarking, and non-commercial real-robot deployment under the conditions of the RLWRLD Model License v1.0.

Out of scope. Commercial deployment, military or weapons applications, non-consensual surveillance, and any use that violates applicable laws or regulations. See LICENSE.md §3.5 for the full list.

Limitations. Performance depends heavily on embodiment match and data distribution. The pretrained checkpoint is OXE-conditioned and is not guaranteed to work zero-shot on novel embodiments without finetuning. Memory, motion, and physics modules are dormant in RLDX-1-PT and activate only when the corresponding flags are enabled during finetuning (see RLDX-1-MT-ALLEX).

Citation

@article{rldx2026,
  title={RLDX-1 Technical Report},
  author={Kim, Dongyoung and Jang, Huiwon and Koo, Myungkyu and Jang, Suhyeok and Kim, Taeyoung and others},
  year={2026},
  note={RLWRLD},
  eprint={2605.03269},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2605.03269}
}

License

Released under the RLWRLD Model License v1.0 — a non-commercial license with attribution and share-alike requirements. See LICENSE.md for the full text. By using this model you agree to those terms, including the use restrictions in §3.5.
