qualiaadmin committed
Commit d8fb128 · verified · 1 Parent(s): 5cd0c1a

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/logo.png filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,190 @@
# WALL-OSS

<div align="left">

<p align="center">
  <img src="assets/logo.png" width="600"/>
</p>

<div align="center">

[![Paper](https://img.shields.io/badge/📄%20Paper-PDF-EA1B22?style=for-the-badge&logo=adobeacrobatreader&logoColor=fff)](https://x2robot.cn-wlcb.ufileos.com/wall_oss.pdf)
&nbsp;&nbsp;
[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-x--square--robot-FFB000?style=for-the-badge&logo=huggingface&logoColor=000)](https://huggingface.co/x-square-robot)
&nbsp;&nbsp;
[![GitHub](https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=fff)](https://github.com/X-Square-Robot/wall-x)
&nbsp;&nbsp;
[![Project Page](https://img.shields.io/badge/Project-1E90FF?style=for-the-badge&logo=google-chrome&logoColor=fff)](https://x2robot.com/en/research/68bc2cde8497d7f238dde690)

</div>

</div>

## <a href="https://x2robot.cn-wlcb.ufileos.com/wall_oss.pdf" target="_blank"><strong>WALL-OSS: Igniting VLMs toward the Embodied Space</strong></a>

We introduce **WALL-OSS**, an end-to-end embodied foundation model that leverages large-scale multimodal pretraining to achieve (1) embodiment-aware vision-language understanding, (2) strong language-action association, and (3) robust manipulation capability.
Our approach employs a tightly coupled architecture and a multi-strategy training curriculum that enable Unified Cross-Level Chain-of-Thought (CoT), seamlessly unifying instruction reasoning, subgoal decomposition, and fine-grained action synthesis within a single differentiable framework.
Our results show that WALL-OSS attains high success rates on complex long-horizon manipulation tasks, demonstrates strong instruction-following, understanding, and reasoning capabilities, and outperforms strong baselines, thereby providing a reliable and scalable path from VLMs to embodied foundation models.

## 🎬 Video Demos

<div align="center">
  <video width="80%" controls>
    <source src="https://x2robot.com/api/videos/file/wall-oss_top_720p-1.mp4" type="video/mp4">
    Your browser does not support the video tag.
  </video>
  <p><strong>WALL-OSS in action: demonstrating advanced manipulation capabilities and embodied AI performance</strong></p>
</div>

## 🚀 Quick Start

### Installation

```bash
# Create and activate a conda environment
conda create --name wallx python=3.10
conda activate wallx

# Install base requirements
pip install torch torchvision transformers
pip install huggingface_hub

# Install Wall-X from GitHub
git clone https://github.com/X-Square-Robot/wall-x.git
cd wall-x
pip install -e .
```

### Basic Usage

```python
import torch
from wall_x.model.qwen2_5_based.modeling_qwen2_5_vl_act import Qwen2_5_VLMoEForAction

# Load the model
model_path = "X-Square-Robot/wall-oss-flow"  # or your local path
model = Qwen2_5_VLMoEForAction.from_pretrained(model_path)
model.eval()

# Move to GPU if available and cast to bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).bfloat16()

# Your inference code here...
```
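
For real inputs, this checkpoint ships a Qwen2.5-VL processor (see `preprocessor_config.json` and `chat_template.json` below), so images and instructions can be prepared with the standard `transformers` `AutoProcessor` API. The following is a minimal sketch under that assumption, not an official WALL-OSS pipeline; the image path is hypothetical, and the robotics-specific inputs shown in the inference example further down still need to be supplied.

```python
import torch
from PIL import Image
from transformers import AutoProcessor

# Assumption: AutoProcessor resolves to Qwen2_5_VLProcessor via this repo's
# preprocessor_config.json; adjust if the wall_x package exposes its own loader.
processor = AutoProcessor.from_pretrained("X-Square-Robot/wall-oss-flow")

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Pick up the red cup and place it on the tray."},
    ]},
]

# Render the chat template (inserts <|vision_start|><|image_pad|><|vision_end|>)
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

image = Image.open("observation.png")  # hypothetical camera frame
inputs = processor(text=[prompt], images=[image], return_tensors="pt")
# inputs now holds input_ids, attention_mask, pixel_values, image_grid_thw
print({k: tuple(v.shape) for k, v in inputs.items()})
```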

## 🎯 Supervised Fine-Tuning (SFT)

For training Wall-X on your robotics datasets, please refer to our comprehensive training guide:

**📖 [Training Documentation](https://github.com/X-Square-Robot/wall-x/blob/main/workspace/README.md)**

The training process includes:
- **Dataset Preparation**: How to prepare your robotics datasets in LeRobot format
- **Configuration Setup**: Detailed configuration for GPU setup, model paths, and robot DOF settings
- **Training Scripts**: Ready-to-use training scripts with proper hyperparameters

### Quick Training Start

```bash
# Run training (see workspace/README.md for detailed configuration)
bash ./workspace/lerobot_example/run.sh
```

## 🔮 Inference

For detailed inference examples and model evaluation:

**📖 [Inference Documentation](https://github.com/X-Square-Robot/wall-x/blob/main/scripts/)**

### Basic Inference Example

```python
import torch
from wall_x.model.qwen2_5_based.modeling_qwen2_5_vl_act import Qwen2_5_VLMoEForAction

# Load model
model_path = "X-Square-Robot/wall-x"
model = Qwen2_5_VLMoEForAction.from_pretrained(model_path)
model.eval()

# Setup
batch_size = 1
seq_length = 50
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).bfloat16()

# Prepare inputs (example with synthetic data)
torch.manual_seed(0)
input_ids = torch.randint(0, len(model.processor.tokenizer), (batch_size, seq_length), dtype=torch.long)
attention_mask = torch.ones((batch_size, seq_length), dtype=torch.long)
moe_token_types = torch.zeros((batch_size, seq_length), dtype=torch.long)
position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0).expand(batch_size, -1)

# Robotics-specific inputs
proprioception = torch.randn((batch_size, 1, 20), dtype=torch.float32)  # Joint states
agent_pos_mask = torch.ones((batch_size, 1, 20), dtype=torch.float32)
dof_mask = torch.ones((batch_size, 32, 20), dtype=torch.float32)  # DOF mask
dataset_names = ["x2_normal"]

# Move to device
inputs = {
    "input_ids": input_ids.to(device),
    "attention_mask": attention_mask.to(device),
    "moe_token_types": moe_token_types.to(device),
    "position_ids": position_ids.to(device),
    "proprioception": proprioception.to(device).bfloat16(),
    "agent_pos_mask": agent_pos_mask.to(device).bfloat16(),
    "dof_mask": dof_mask.to(device).bfloat16(),
    "dataset_names": dataset_names,
    "mode": "validate"
}

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
print(f"Output logits shape: {outputs.logits.shape}")
```
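
A note on the shapes above: the trailing dimension of 20 in `proprioception`, `agent_pos_mask`, and `dof_mask` matches the sum of the per-component degrees of freedom declared in `dof_config` in this repo's `config.json` (3+3+1 per arm for end-effector position, rotation, and gripper, plus 2 head joints, 1 height, and 3 for base pose = 20). The 32 in `dof_mask` is presumably the action-chunk horizon; consult the scripts on GitHub for the authoritative input contract.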

### Advanced Inference Scripts

For production-ready inference and evaluation scripts:

```bash
# Basic inference test
python ./scripts/fake_inference.py

# Generate open-loop comparison plots
python ./scripts/draw_openloop_plot.py
```

**📁 [View all inference scripts](https://github.com/X-Square-Robot/wall-x/tree/main/scripts)**

## 📚 Complete Documentation

For comprehensive setup, training, and inference instructions:

### 🚀 **[Visit our GitHub Repository](https://github.com/X-Square-Robot/wall-x)**

The repository contains:
- **Detailed Installation Guide**: Complete environment setup with all dependencies
- **Training Tutorials**: Step-by-step SFT process with LeRobot datasets
- **Inference Examples**: Multiple inference scripts and evaluation tools
- **Configuration Templates**: Ready-to-use configs for different robot setups
- **Troubleshooting Guide**: Common issues and solutions

## 📄 Cite Us

If you find WALL-OSS models useful, please cite:

```bibtex
@misc{walloss_paper_2025,
  title = {WALL-OSS: Igniting VLMs toward the Embodied Space},
  author = {X Square Robot},
  year = {2025},
  howpublished = {\url{https://x2robot.cn-wlcb.ufileos.com/wall_oss.pdf}},
  note = {White paper}
}
```
assets/logo.png ADDED

Git LFS Details

  • SHA256: 721ada7f102cac8b9be8a006998e8248ee62075111bd8290896b7b4a9e12e55a
  • Pointer size: 131 Bytes
  • Size of remote file: 202 kB
chat_template.json ADDED
@@ -0,0 +1,3 @@
{
  "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
}
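
This is standard Qwen2.5-VL chat formatting: it injects a default system prompt, wraps each turn in `<|im_start|>`/`<|im_end|>`, and replaces image and video entries with `<|vision_start|><|image_pad|><|vision_end|>` placeholders that the processor later expands. A minimal rendering sketch, assuming the repo's tokenizer and this template load through the stock `transformers` API:

```python
from transformers import AutoTokenizer

# Assumption: the repo's tokenizer files and chat_template.json load via AutoTokenizer.
tok = AutoTokenizer.from_pretrained("X-Square-Robot/wall-oss-flow")

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What should the robot do next?"},
    ]},
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
# Expected, per the template above:
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# <|vision_start|><|image_pad|><|vision_end|>What should the robot do next?<|im_end|>
# <|im_start|>assistant
```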
config.json ADDED
@@ -0,0 +1,106 @@
{
  "architectures": [
    "Qwen2_5_VLForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "vision_start_token_id": 151652,
  "vision_end_token_id": 151653,
  "vision_token_id": 151654,
  "image_token_id": 151655,
  "video_token_id": 151656,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 128000,
  "max_window_layers": 70,
  "model_type": "qwen2_5_vl",
  "num_attention_heads": 16,
  "num_hidden_layers": 36,
  "num_key_value_heads": 2,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.2",
  "_attn_implementation": "flash_attention_2",
  "use_cache": true,
  "use_sliding_window": false,
  "vision_config": {
    "depth": 32,
    "hidden_act": "silu",
    "hidden_size": 1280,
    "intermediate_size": 3420,
    "num_heads": 16,
    "in_chans": 3,
    "out_hidden_size": 2048,
    "patch_size": 14,
    "spatial_merge_size": 2,
    "spatial_patch_size": 14,
    "window_size": 112,
    "fullatt_block_indexes": [7, 15, 23, 31],
    "tokens_per_second": 2,
    "temporal_patch_size": 2
  },
  "rope_scaling": {
    "type": "mrope",
    "mrope_section": [16, 24, 24]
  },
  "vocab_size": 151936,
  "num_experts": 2,
  "experts": [
    {
      "hidden_size": 2048,
      "intermediate_size": 11008,
      "hidden_act": "silu"
    },
    {
      "hidden_size": 2048,
      "intermediate_size": 2048,
      "hidden_act": "silu"
    }
  ],
  "dof_config": {
    "follow_left_ee_cartesian_pos": 3,
    "follow_left_ee_rotation": 3,
    "follow_left_gripper": 1,
    "follow_right_ee_cartesian_pos": 3,
    "follow_right_ee_rotation": 3,
    "follow_right_gripper": 1,
    "head_actions": 2,
    "height": 1,
    "car_pose": 3
  },
  "agent_pos_config": {
    "follow_left_ee_cartesian_pos": 3,
    "follow_left_ee_rotation": 3,
    "follow_left_gripper": 1,
    "follow_right_ee_cartesian_pos": 3,
    "follow_right_ee_rotation": 3,
    "follow_right_gripper": 1,
    "head_actions": 2,
    "height": 1,
    "car_pose": 3
  },
  "noise_scheduler": {
    "beta_alpha": 1.5,
    "beta_beta": 1.0,
    "s": 0.999,
    "num_inference_timesteps": 5
  },
  "dim_inputs": [2048, 2048],
  "attention_moe": false,
  "mlp_moe": true
}
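
The `dof_config` and `agent_pos_config` blocks define the per-component layout of the 20-dimensional state and action vectors used in the README's inference example. A quick sanity check, a minimal sketch assuming a local copy of this `config.json`:

```python
import json

# Sum the per-component DOFs declared in config.json (local copy assumed).
with open("config.json") as f:
    cfg = json.load(f)

total_dof = sum(cfg["dof_config"].values())
print(total_dof)  # 3+3+1 + 3+3+1 (two arms) + 2 head + 1 height + 3 base = 20
assert total_dof == sum(cfg["agent_pos_config"].values()) == 20
```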
configuration.json ADDED
@@ -0,0 +1 @@
{"framework": "pytorch", "task": "vision-understanding", "allow_remote": true}
generation_config.json ADDED
@@ -0,0 +1,12 @@
{
  "bos_token_id": 151643,
  "pad_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "repetition_penalty": 1.05,
  "temperature": 0.000001,
  "transformers_version": "4.49.0"
}
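
Note that `do_sample` is true but `temperature` is 1e-6, so text decoding is effectively greedy by default. If stochastic generation is wanted, the defaults can be overridden at call time; a minimal sketch, assuming the standard `transformers` generation API:

```python
from transformers import GenerationConfig

# Load the shipped defaults and raise the temperature for stochastic decoding.
gen_cfg = GenerationConfig.from_pretrained("X-Square-Robot/wall-oss-flow")
gen_cfg.temperature = 0.7
# output = model.generate(**inputs, generation_config=gen_cfg)  # model from Quick Start
```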
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ff409a6f18b6ac70e115db3b80e5010c6044fe9afc938fe2a7788fd717eafaaa
size 8448201904
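
At 8,448,201,904 bytes (~8.45 GB) with `torch_dtype: bfloat16` (2 bytes per weight, per `config.json` above), the checkpoint corresponds to roughly 4.2B parameters, ignoring safetensors metadata overhead.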
preprocessor_config.json ADDED
@@ -0,0 +1,19 @@
{
  "min_pixels": 3136,
  "max_pixels": 12845056,
  "patch_size": 14,
  "temporal_patch_size": 2,
  "merge_size": 2,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "processor_class": "Qwen2_5_VLProcessor"
}
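
With `patch_size` 14 and `merge_size` 2, each 28×28 pixel block becomes one vision token, and the Qwen2VL image processor resizes images so their area stays roughly within [`min_pixels`, `max_pixels`] (3136 = 56², 12845056 = 3584²). A rough per-image token estimate under those settings, a sketch only:

```python
# Rough vision-token estimate: one token per (patch_size * merge_size)^2
# = 28x28 pixel block after resizing (ignores exact rounding in the processor).
def approx_vision_tokens(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    block = patch * merge  # 28 px per merged patch
    return (height // block) * (width // block)

print(approx_vision_tokens(56, 56))      # 4 tokens at the min_pixels bound (56*56 = 3136)
print(approx_vision_tokens(3584, 3584))  # 16384 tokens at the max_pixels bound
```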
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8a5df236d417e062783cda976a6c21955fe386a1dd8fb9aa06f29694a6d3a4de
size 11826664
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff
 
vocab.json ADDED
The diff for this file is too large to render. See raw diff