# WALL-OSS

<div align="center">
  <img src="assets/logo.png" width="600"/>
</div>

<div align="center">

[Paper](https://x2robot.cn-wlcb.ufileos.com/wall_oss.pdf) | [Hugging Face](https://huggingface.co/x-square-robot) | [GitHub](https://github.com/X-Square-Robot/wall-x) | [Project Page](https://x2robot.com/en/research/68bc2cde8497d7f238dde690)

</div>
## <a href="https://x2robot.cn-wlcb.ufileos.com/wall_oss.pdf" target="_blank"><strong>WALL-OSS: Igniting VLMs toward the Embodied Space</strong></a>
We introduce **WALL-OSS**, an end-to-end embodied foundation model that leverages large-scale multimodal pretraining to achieve (1) embodiment-aware vision-language understanding, (2) strong language-action association, and (3) robust manipulation capability.

Our approach employs a tightly coupled architecture and a multi-strategy training curriculum that enables Unified Cross-Level Chain-of-Thought (CoT), seamlessly unifying instruction reasoning, subgoal decomposition, and fine-grained action synthesis within a single differentiable framework.

Our results show that WALL-OSS attains high success rates on complex, long-horizon manipulation tasks, demonstrates strong instruction-following, understanding, and reasoning capabilities, and outperforms strong baselines, thereby providing a reliable and scalable path from VLMs to embodied foundation models.
## 🎬 Video Demos

<div align="center">
  <video width="80%" controls>
    <source src="https://x2robot.com/api/videos/file/wall-oss_top_720p-1.mp4" type="video/mp4">
    Your browser does not support the video tag.
  </video>
  <p><strong>WALL-OSS in Action: Demonstrating advanced manipulation capabilities and embodied AI performance</strong></p>
</div>
## 🚀 Quick Start

### Installation

```bash
# Create conda environment
conda create --name wallx python=3.10
conda activate wallx

# Install base requirements
pip install torch torchvision transformers
pip install huggingface_hub

# Install Wall-X from GitHub
git clone https://github.com/X-Square-Robot/wall-x.git
cd wall-x
pip install -e .
```
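If you prefer to download the pretrained weights ahead of time instead of letting `from_pretrained` fetch them on first use, the `huggingface_hub` package installed above can mirror the checkpoint locally. This is only a convenience sketch: the repo id `X-Square-Robot/wall-oss-flow` is taken from the usage example below, and the local directory is an arbitrary choice.

```python
from huggingface_hub import snapshot_download

# Mirror the WALL-OSS checkpoint into a local directory (repo id taken from
# the Basic Usage example below; the target path is just an illustration).
local_path = snapshot_download(
    repo_id="X-Square-Robot/wall-oss-flow",
    local_dir="./checkpoints/wall-oss-flow",
)
print(f"Model files downloaded to: {local_path}")
```

The returned directory can then be passed to `from_pretrained` in place of the Hub repo id.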
### Basic Usage

```python
import torch
from wall_x.model.qwen2_5_based.modeling_qwen2_5_vl_act import Qwen2_5_VLMoEForAction

# Load the model
model_path = "X-Square-Robot/wall-oss-flow"  # or your local path
model = Qwen2_5_VLMoEForAction.from_pretrained(model_path)
model.eval()

# Configuration
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).bfloat16()

# Your inference code here...
```
## 🎯 Supervised Fine-Tuning (SFT)

To fine-tune Wall-X on your own robotics datasets, please refer to our comprehensive training guide:

**📖 [Training Documentation](https://github.com/X-Square-Robot/wall-x/blob/main/workspace/README.md)**

The training process includes:

- **Dataset Preparation**: How to prepare your robotics datasets in LeRobot format
- **Configuration Setup**: Detailed configuration for GPU setup, model paths, and robot DOF settings (a rough sketch of these fields appears below)
- **Training Scripts**: Ready-to-use training scripts with proper hyperparameters
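The exact schema is defined in `workspace/README.md`; the snippet below is only a hypothetical illustration of the kinds of settings that list covers (base checkpoint, dataset location, GPUs, robot DOF), so treat every key name and value as an assumption rather than the real config format.

```python
# Hypothetical sketch of the settings a Wall-X fine-tuning config covers.
# The actual schema and key names live in workspace/README.md, not here.
training_config = {
    "model_path": "X-Square-Robot/wall-oss-flow",     # base checkpoint to fine-tune
    "dataset_root": "/path/to/your/lerobot_dataset",  # data prepared in LeRobot format
    "gpu_ids": [0, 1, 2, 3],                          # GPUs used for training
    "robot_dof": 20,                                  # per-joint dimension, cf. the masks in the inference example
    "batch_size": 8,
    "learning_rate": 1e-4,
}

for key, value in training_config.items():
    print(f"{key}: {value}")
```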
### Quick Training Start

```bash
# Run training (see workspace/README.md for detailed configuration)
bash ./workspace/lerobot_example/run.sh
```
## 🎮 Inference

For detailed inference examples and model evaluation:

**📖 [Inference Documentation](https://github.com/X-Square-Robot/wall-x/blob/main/scripts/)**

### Basic Inference Example
```python
import torch
from wall_x.model.qwen2_5_based.modeling_qwen2_5_vl_act import Qwen2_5_VLMoEForAction

# Load model
model_path = "X-Square-Robot/wall-x"
model = Qwen2_5_VLMoEForAction.from_pretrained(model_path)
model.eval()

# Setup
batch_size = 1
seq_length = 50
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).bfloat16()

# Prepare inputs (example with synthetic data)
torch.manual_seed(0)
input_ids = torch.randint(0, len(model.processor.tokenizer), (batch_size, seq_length), dtype=torch.long)
attention_mask = torch.ones((batch_size, seq_length), dtype=torch.long)
moe_token_types = torch.zeros((batch_size, seq_length), dtype=torch.long)
position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0).expand(batch_size, -1)

# Robotics-specific inputs
proprioception = torch.randn((batch_size, 1, 20), dtype=torch.float32)  # Joint states
agent_pos_mask = torch.ones((batch_size, 1, 20), dtype=torch.float32)
dof_mask = torch.ones((batch_size, 32, 20), dtype=torch.float32)  # DOF mask
dataset_names = ["x2_normal"]

# Move to device
inputs = {
    "input_ids": input_ids.to(device),
    "attention_mask": attention_mask.to(device),
    "moe_token_types": moe_token_types.to(device),
    "position_ids": position_ids.to(device),
    "proprioception": proprioception.to(device).bfloat16(),
    "agent_pos_mask": agent_pos_mask.to(device).bfloat16(),
    "dof_mask": dof_mask.to(device).bfloat16(),
    "dataset_names": dataset_names,
    "mode": "validate"
}

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    print(f"Output logits shape: {outputs.logits.shape}")
```
### Advanced Inference Scripts

For production-ready inference and evaluation scripts:

```bash
# Basic inference test
python ./scripts/fake_inference.py

# Generate open-loop comparison plots
python ./scripts/draw_openloop_plot.py
```

**🔗 [View all inference scripts](https://github.com/X-Square-Robot/wall-x/tree/main/scripts)**
## 📚 Complete Documentation

For comprehensive setup, training, and inference instructions:

### 🌟 **[Visit our GitHub Repository](https://github.com/X-Square-Robot/wall-x)**

The repository contains:

- **Detailed Installation Guide**: Complete environment setup with all dependencies
- **Training Tutorials**: Step-by-step SFT process with LeRobot datasets
- **Inference Examples**: Multiple inference scripts and evaluation tools
- **Configuration Templates**: Ready-to-use configs for different robot setups
- **Troubleshooting Guide**: Common issues and solutions
## 📝 Cite Us

If you find the WALL-OSS models useful, please cite:

```bibtex
@misc{walloss_paper_2025,
  title        = {WALL-OSS: Igniting VLMs toward the Embodied Space},
  author       = {X Square Robot},
  year         = {2025},
  howpublished = {\url{https://x2robot.cn-wlcb.ufileos.com/wall_oss.pdf}},
  note         = {White paper}
}
```