WALL-OSS

Wall-OSS-0.5: A Deployment-Ready VLA with Gradient-Bridged Pretraining

We introduce Wall-OSS-0.5, an open-source 4B Vision-Language-Action (VLA) foundation model built on a 3B VLM backbone augmented with dedicated action-generation components. Whereas pretrained VLA checkpoints are usually treated only as initializations for downstream optimization, Wall-OSS-0.5 is designed so that its pretrained robotic capability is directly executable and measurable on physical hardware without any downstream fine-tuning.

The model is pretrained on more than 20 distinct robot embodiments, processing over one million trajectories per epoch alongside a grounded multimodal corpus. We adopt a gradient-bridged co-training recipe that optimizes three complementary objectives:

Discrete Action Prediction: Routes strong VLM-native gradients into the backbone.
Multimodal Prediction: Preserves and strengthens grounded vision-language understanding.
Continuous Flow Matching: Serves as the deployment-time continuous action interface.
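The card does not say how the three losses are computed or weighted. Purely as an illustration, the sketch below combines two cross-entropy terms with a generic straight-path flow-matching loss. The loss weights, the function names, and the flow-matching formulation itself are assumptions made for this example, not the authors' recipe.

```python
import math
import random

def cross_entropy(logits, target_index):
    """Negative log-likelihood of target_index under softmax(logits)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target_index]

def flow_matching_pair(noise, action, t):
    """Point on the straight path from noise to action, plus its target velocity."""
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    velocity = [a - n for n, a in zip(noise, action)]
    return x_t, velocity

def mse(pred, target):
    """Mean squared error between two equally long vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def co_training_loss(discrete_ce, multimodal_ce, flow_mse, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three objectives; the weights are hypothetical."""
    w_d, w_m, w_f = weights
    return w_d * discrete_ce + w_m * multimodal_ce + w_f * flow_mse

rng = random.Random(0)
action = [0.2, -0.1, 0.5]                      # toy continuous action
noise = [rng.gauss(0.0, 1.0) for _ in action]  # Gaussian starting point
x_t, target_velocity = flow_matching_pair(noise, action, t=0.3)

# Stand-ins for model outputs; a real model would predict these from x_t and t.
predicted_velocity = [0.0, 0.0, 0.0]
total = co_training_loss(
    discrete_ce=cross_entropy([2.0, 0.5, -1.0], 0),
    multimodal_ce=cross_entropy([0.1, 1.2, 0.3], 1),
    flow_mse=mse(predicted_velocity, target_velocity),
)
print(f"total loss: {total:.4f}")
```

In this toy setup all three terms contribute to one scalar loss, so each would send gradients to any parameters they share. How Wall-OSS-0.5 actually routes or bridges those gradients is not described here.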

🌟 Key Highlights

Zero-Shot Real-Robot Behavior: Achieves non-trivial zero-shot completion on a 17-task suite (including held-out deformable manipulation tasks) directly from the pretrained checkpoint.
Markedly Stronger Adaptation Prior: After task-specific fine-tuning, Wall-OSS-0.5 reaches 60.5% average task progress on 15 real-robot tasks, outperforming $\pi_{0.5}$ by 17.5%.
No Capability Erosion: Multimodal evaluations confirm that action pretraining preserves broad vision-language competence while significantly sharpening embodied grounding.
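The 17.5% margin over $\pi_{0.5}$ can be read either as absolute percentage points or as a relative gain, and the card does not say which. The short calculation below shows the baseline score each reading would imply. Both baselines are derived here for illustration and are not reported figures.

```python
wall_oss_progress = 60.5  # average task progress in percent, from the text
margin = 17.5             # reported margin over the baseline, from the text

# Reading 1: the margin is in absolute percentage points.
baseline_if_points = wall_oss_progress - margin

# Reading 2: the margin is relative to the baseline's own score.
baseline_if_relative = wall_oss_progress / (1.0 + margin / 100.0)

print(f"baseline if percentage points: {baseline_if_points:.1f}%")
print(f"baseline if relative gain:     {baseline_if_relative:.1f}%")
```

The two readings imply baselines of about 43.0% and 51.5% respectively, so the distinction matters when comparing against other reports.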

🚀 Quick Start

Installation

# Create conda environment
conda create --name wallx python=3.10
conda activate wallx

# Install base requirements
pip install torch torchvision transformers
pip install huggingface_hub

# Install Wall-X from GitHub
git clone https://github.com/X-Square-Robot/wall-x.git
cd wall-x
pip install -e .
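If you prefer to load the checkpoint from a local path, one option is to fetch it ahead of time with `huggingface_hub`. This is a minimal sketch: it assumes the weights are hosted on the Hugging Face Hub under the repository id used in the usage example, and `download_checkpoint` is a hypothetical helper name, not part of Wall-X.

```python
def download_checkpoint(repo_id="X-Square-Robot/wall-oss-0.5", local_dir=None):
    """Fetch every file in the model repository and return the local folder."""
    # Imported lazily so that defining this helper needs no extra packages.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)

# Example (requires network access and enough free disk space):
# model_path = download_checkpoint()
# print(f"Checkpoint files are in: {model_path}")
```

The returned folder can then be passed as `model_path` in place of the repository id.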

Basic Usage

import torch
from wall_x.model.qwen2_5_based.modeling_qwen2_5_vl_act import Qwen2_5_VLMoEForAction

# Load the model
model_path = "X-Square-Robot/wall-oss-0.5"  # or your local path
model = Qwen2_5_VLMoEForAction.from_pretrained(model_path)
model.eval()

# Configuration
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).bfloat16()

# Your inference code here...

🎯 Supervised Fine-Tuning (SFT)

To train Wall-X on your own robotics datasets, see the training guide:

📖 Training Documentation

The training process includes:

Dataset Preparation: How to prepare your robotics datasets in LeRobot format
Configuration Setup: Detailed configuration for GPU setup, model paths, and robot DOF settings
Training Scripts: Ready-to-use training scripts with proper hyperparameters

Quick Training Start

# Run training (see workspace/README.md for detailed configuration)
bash ./workspace/lerobot_example/run.sh

🔮 Inference

For detailed inference examples and model evaluation:

📖 Inference Documentation

Basic Inference Example

import torch
from wall_x.model.qwen2_5_based.modeling_qwen2_5_vl_act import Qwen2_5_VLMoEForAction

# Load model
model_path = "X-Square-Robot/wall-oss-0.5"
model = Qwen2_5_VLMoEForAction.from_pretrained(model_path)
model.eval()

# Setup
batch_size = 1
seq_length = 50
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).bfloat16()

# Prepare inputs (example with synthetic data)
torch.manual_seed(0)
input_ids = torch.randint(0, len(model.processor.tokenizer), (batch_size, seq_length), dtype=torch.long)
attention_mask = torch.ones((batch_size, seq_length), dtype=torch.long)
moe_token_types = torch.zeros((batch_size, seq_length), dtype=torch.long)
position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0).expand(batch_size, -1)

# Robotics-specific inputs
proprioception = torch.randn((batch_size, 1, 20), dtype=torch.float32)  # Joint states
agent_pos_mask = torch.ones((batch_size, 1, 20), dtype=torch.float32)
dof_mask = torch.ones((batch_size, 32, 20), dtype=torch.float32)  # DOF mask
dataset_names = ["x2_normal"]

# Move to device
inputs = {
    "input_ids": input_ids.to(device),
    "attention_mask": attention_mask.to(device),
    "moe_token_types": moe_token_types.to(device),
    "position_ids": position_ids.to(device),
    "proprioception": proprioception.to(device).bfloat16(),
    "agent_pos_mask": agent_pos_mask.to(device).bfloat16(),
    "dof_mask": dof_mask.to(device).bfloat16(),
    "dataset_names": dataset_names,
    "mode": "validate"
}

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    print(f"Output logits shape: {outputs.logits.shape}")
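The example above fills `dof_mask` with ones, which marks every slot as used. For a robot with fewer degrees of freedom, a mask might look like the sketch below. It assumes that 32 is the action horizon, that 20 is the padded action dimension, that 1 marks an active slot, and that active slots come first. These are inferences from the tensor shapes above rather than documented behavior, and `build_dof_mask` is a hypothetical helper.

```python
def build_dof_mask(active_dofs, horizon=32, action_dim=20):
    """Return a horizon x action_dim nested list with 1.0 for active slots."""
    if not 0 <= active_dofs <= action_dim:
        raise ValueError("active_dofs must be between 0 and action_dim")
    row = [1.0] * active_dofs + [0.0] * (action_dim - active_dofs)
    # Copy the row per time step so editing one step does not alias the others.
    return [list(row) for _ in range(horizon)]

mask = build_dof_mask(active_dofs=7)
print(len(mask), len(mask[0]), sum(mask[0]))
```

Wrapping the result as `torch.tensor([mask])` gives a tensor of shape `(1, 32, 20)`, matching the `dof_mask` shape used above. Check the repository's configuration for the real slot layout of your robot before relying on this.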

Advanced Inference Scripts

The repository also provides inference and evaluation scripts:

# Basic inference test
python ./scripts/fake_inference.py

# Generate open-loop comparison plots
python ./scripts/draw_openloop_plot.py
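An open-loop comparison typically feeds recorded observations to the model and compares the predicted actions against the recorded ones. As a rough illustration of a number you might report alongside such plots, the sketch below computes a per-dimension mean absolute error. It is an assumption about what a useful summary looks like, not a description of what `draw_openloop_plot.py` computes, and `per_dimension_mae` is a hypothetical helper.

```python
def per_dimension_mae(predicted, ground_truth):
    """Mean absolute error for each action dimension across all time steps."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and equally long")
    dims = len(predicted[0])
    totals = [0.0] * dims
    for pred_step, true_step in zip(predicted, ground_truth):
        for d in range(dims):
            totals[d] += abs(pred_step[d] - true_step[d])
    return [total / len(predicted) for total in totals]

# Toy trajectories: three time steps, two action dimensions.
predicted = [[0.0, 1.0], [0.5, 1.0], [1.0, 1.0]]
ground_truth = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
print(per_dimension_mae(predicted, ground_truth))
```

A dimension with a noticeably larger error than the rest is a reasonable place to start when inspecting the plots.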

📁 View all inference scripts

📚 Complete Documentation

For comprehensive setup, training, and inference instructions:

🚀 Visit our GitHub Repository: https://github.com/X-Square-Robot/wall-x

The repository contains:

Detailed Installation Guide: Complete environment setup with all dependencies
Training Tutorials: Step-by-step SFT process with LeRobot datasets
Inference Examples: Multiple inference scripts and evaluation tools
Configuration Templates: Ready-to-use configs for different robot setups
Troubleshooting Guide: Common issues and solutions

📄 Cite Us

If you find WALL-OSS models useful, please cite:

@misc{walloss_paper_2025,
  title        = {WALL-OSS: Igniting VLMs toward the Embodied Space},
  author       = {X Square Robot},
  year         = {2025},
  howpublished = {\url{https://x2robot.cn-wlcb.ufileos.com/wall_oss.pdf}},
  note         = {White paper}
}
