qualiaadmin committed
Commit d8fb128 · verified · 1 Parent(s): 5cd0c1a

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/logo.png filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,190 @@
# WALL-OSS

<div align="left">

<p align="center">
  <img src="assets/logo.png" width="600"/>
</p>

<div align="center">

[![Paper](https://img.shields.io/badge/📄%20Paper-PDF-EA1B22?style=for-the-badge&logo=adobeacrobatreader&logoColor=fff)](https://x2robot.cn-wlcb.ufileos.com/wall_oss.pdf)
&nbsp;&nbsp;
[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-x--square--robot-FFB000?style=for-the-badge&logo=huggingface&logoColor=000)](https://huggingface.co/x-square-robot)
&nbsp;&nbsp;
[![GitHub](https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=fff)](https://github.com/X-Square-Robot/wall-x)
&nbsp;&nbsp;
[![Project Page](https://img.shields.io/badge/Project-1E90FF?style=for-the-badge&logo=google-chrome&logoColor=fff)](https://x2robot.com/en/research/68bc2cde8497d7f238dde690)

</div>

</div>

## <a href="https://x2robot.cn-wlcb.ufileos.com/wall_oss.pdf" target="_blank"><strong>WALL-OSS: Igniting VLMs toward the Embodied Space</strong></a>

We introduce **WALL-OSS**, an end-to-end embodied foundation model that leverages large-scale multimodal pretraining to achieve (1) embodiment-aware vision-language understanding, (2) strong language-action association, and (3) robust manipulation capability.
Our approach employs a tightly coupled architecture and a multi-strategy training curriculum that enable Unified Cross-Level Chain-of-Thought (CoT), seamlessly unifying instruction reasoning, subgoal decomposition, and fine-grained action synthesis within a single differentiable framework.
Our results show that WALL-OSS attains high success rates on complex long-horizon manipulation tasks, demonstrates strong instruction-following, understanding, and reasoning capabilities, and outperforms strong baselines, thereby providing a reliable and scalable path from VLMs to embodied foundation models.

## 🎬 Video Demos

<div align="center">
  <video width="80%" controls>
    <source src="https://x2robot.com/api/videos/file/wall-oss_top_720p-1.mp4" type="video/mp4">
    Your browser does not support the video tag.
  </video>
  <p><strong>WALL-OSS in action: demonstrating advanced manipulation capabilities and embodied AI performance</strong></p>
</div>

## 🚀 Quick Start

### Installation

```bash
# Create and activate a conda environment
conda create --name wallx python=3.10
conda activate wallx

# Install base requirements
pip install torch torchvision transformers
pip install huggingface_hub

# Install Wall-X from GitHub
git clone https://github.com/X-Square-Robot/wall-x.git
cd wall-x
pip install -e .
```

### Basic Usage

```python
import torch
from wall_x.model.qwen2_5_based.modeling_qwen2_5_vl_act import Qwen2_5_VLMoEForAction

# Load the model
model_path = "X-Square-Robot/wall-oss-flow"  # or your local path
model = Qwen2_5_VLMoEForAction.from_pretrained(model_path)
model.eval()

# Move to GPU if available and cast to bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).bfloat16()

# Your inference code here...
```
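
For real inputs, this checkpoint ships a Qwen2.5-VL processor (see `preprocessor_config.json` and `chat_template.json` below), so images and instructions can be prepared with the standard `transformers` `AutoProcessor` API. The following is a minimal sketch under that assumption, not an official WALL-OSS pipeline; the image path is hypothetical, and the robotics-specific inputs shown in the inference example further down still need to be supplied.

```python
import torch
from PIL import Image
from transformers import AutoProcessor

# Assumption: AutoProcessor resolves to Qwen2_5_VLProcessor via this repo's
# preprocessor_config.json; adjust if the wall_x package exposes its own loader.
processor = AutoProcessor.from_pretrained("X-Square-Robot/wall-oss-flow")

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Pick up the red cup and place it on the tray."},
    ]},
]

# Render the chat template (inserts <|vision_start|><|image_pad|><|vision_end|>)
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

image = Image.open("observation.png")  # hypothetical camera frame
inputs = processor(text=[prompt], images=[image], return_tensors="pt")
# inputs now holds input_ids, attention_mask, pixel_values, image_grid_thw
print({k: tuple(v.shape) for k, v in inputs.items()})
```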

## 🎯 Supervised Fine-Tuning (SFT)

For training Wall-X on your robotics datasets, please refer to our comprehensive training guide:

**📖 [Training Documentation](https://github.com/X-Square-Robot/wall-x/blob/main/workspace/README.md)**

The training process includes:
- **Dataset Preparation**: How to prepare your robotics datasets in LeRobot format
- **Configuration Setup**: Detailed configuration for GPU setup, model paths, and robot DOF settings
- **Training Scripts**: Ready-to-use training scripts with proper hyperparameters

### Quick Training Start

```bash
# Run training (see workspace/README.md for detailed configuration)
bash ./workspace/lerobot_example/run.sh
```

## 🔮 Inference

For detailed inference examples and model evaluation:

**📖 [Inference Documentation](https://github.com/X-Square-Robot/wall-x/blob/main/scripts/)**

### Basic Inference Example

```python
import torch
from wall_x.model.qwen2_5_based.modeling_qwen2_5_vl_act import Qwen2_5_VLMoEForAction

# Load model
model_path = "X-Square-Robot/wall-x"
model = Qwen2_5_VLMoEForAction.from_pretrained(model_path)
model.eval()

# Setup
batch_size = 1
seq_length = 50
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).bfloat16()

# Prepare inputs (example with synthetic data)
torch.manual_seed(0)
input_ids = torch.randint(0, len(model.processor.tokenizer), (batch_size, seq_length), dtype=torch.long)
attention_mask = torch.ones((batch_size, seq_length), dtype=torch.long)
moe_token_types = torch.zeros((batch_size, seq_length), dtype=torch.long)
position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0).expand(batch_size, -1)

# Robotics-specific inputs
proprioception = torch.randn((batch_size, 1, 20), dtype=torch.float32)  # Joint states
agent_pos_mask = torch.ones((batch_size, 1, 20), dtype=torch.float32)
dof_mask = torch.ones((batch_size, 32, 20), dtype=torch.float32)  # DOF mask
dataset_names = ["x2_normal"]

# Move to device
inputs = {
    "input_ids": input_ids.to(device),
    "attention_mask": attention_mask.to(device),
    "moe_token_types": moe_token_types.to(device),
    "position_ids": position_ids.to(device),
    "proprioception": proprioception.to(device).bfloat16(),
    "agent_pos_mask": agent_pos_mask.to(device).bfloat16(),
    "dof_mask": dof_mask.to(device).bfloat16(),
    "dataset_names": dataset_names,
    "mode": "validate"
}

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
print(f"Output logits shape: {outputs.logits.shape}")
```
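
A note on the shapes above: the trailing dimension of 20 in `proprioception`, `agent_pos_mask`, and `dof_mask` matches the sum of the per-component degrees of freedom declared in `dof_config` in this repo's `config.json` (3+3+1 per arm for end-effector position, rotation, and gripper, plus 2 head joints, 1 height, and 3 for base pose = 20). The 32 in `dof_mask` is presumably the action-chunk horizon; consult the scripts on GitHub for the authoritative input contract.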

### Advanced Inference Scripts

For production-ready inference and evaluation scripts:

```bash
# Basic inference test
python ./scripts/fake_inference.py

# Generate open-loop comparison plots
python ./scripts/draw_openloop_plot.py
```

**📁 [View all inference scripts](https://github.com/X-Square-Robot/wall-x/tree/main/scripts)**

## 📚 Complete Documentation

For comprehensive setup, training, and inference instructions:

### 🚀 **[Visit our GitHub Repository](https://github.com/X-Square-Robot/wall-x)**

The repository contains:
- **Detailed Installation Guide**: Complete environment setup with all dependencies
- **Training Tutorials**: Step-by-step SFT process with LeRobot datasets
- **Inference Examples**: Multiple inference scripts and evaluation tools
- **Configuration Templates**: Ready-to-use configs for different robot setups
- **Troubleshooting Guide**: Common issues and solutions

## 📄 Cite Us

If you find WALL-OSS models useful, please cite:

```bibtex
@misc{walloss_paper_2025,
  title = {WALL-OSS: Igniting VLMs toward the Embodied Space},
  author = {X Square Robot},
  year = {2025},
  howpublished = {\url{https://x2robot.cn-wlcb.ufileos.com/wall_oss.pdf}},
  note = {White paper}
}
```
assets/logo.png ADDED

Git LFS Details

  • SHA256: 721ada7f102cac8b9be8a006998e8248ee62075111bd8290896b7b4a9e12e55a
  • Pointer size: 131 Bytes
  • Size of remote file: 202 kB
chat_template.json ADDED
@@ -0,0 +1,3 @@
{
  "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
}
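
This is standard Qwen2.5-VL chat formatting: it injects a default system prompt, wraps each turn in `<|im_start|>`/`<|im_end|>`, and replaces image and video entries with `<|vision_start|><|image_pad|><|vision_end|>` placeholders that the processor later expands. A minimal rendering sketch, assuming the repo's tokenizer and this template load through the stock `transformers` API:

```python
from transformers import AutoTokenizer

# Assumption: the repo's tokenizer files and chat_template.json load via AutoTokenizer.
tok = AutoTokenizer.from_pretrained("X-Square-Robot/wall-oss-flow")

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What should the robot do next?"},
    ]},
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
# Expected, per the template above:
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# <|vision_start|><|image_pad|><|vision_end|>What should the robot do next?<|im_end|>
# <|im_start|>assistant
```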
config.json ADDED
@@ -0,0 +1,106 @@
{
  "architectures": [
    "Qwen2_5_VLForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "vision_start_token_id": 151652,
  "vision_end_token_id": 151653,
  "vision_token_id": 151654,
  "image_token_id": 151655,
  "video_token_id": 151656,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 128000,
  "max_window_layers": 70,
  "model_type": "qwen2_5_vl",
  "num_attention_heads": 16,
  "num_hidden_layers": 36,
  "num_key_value_heads": 2,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.2",
  "_attn_implementation": "flash_attention_2",
  "use_cache": true,
  "use_sliding_window": false,
  "vision_config": {
    "depth": 32,
    "hidden_act": "silu",
    "hidden_size": 1280,
    "intermediate_size": 3420,
    "num_heads": 16,
    "in_chans": 3,
    "out_hidden_size": 2048,
    "patch_size": 14,
    "spatial_merge_size": 2,
    "spatial_patch_size": 14,
    "window_size": 112,
    "fullatt_block_indexes": [7, 15, 23, 31],
    "tokens_per_second": 2,
    "temporal_patch_size": 2
  },
  "rope_scaling": {
    "type": "mrope",
    "mrope_section": [16, 24, 24]
  },
  "vocab_size": 151936,
  "num_experts": 2,
  "experts": [
    {
      "hidden_size": 2048,
      "intermediate_size": 11008,
      "hidden_act": "silu"
    },
    {
      "hidden_size": 2048,
      "intermediate_size": 2048,
      "hidden_act": "silu"
    }
  ],
  "dof_config": {
    "follow_left_ee_cartesian_pos": 3,
    "follow_left_ee_rotation": 3,
    "follow_left_gripper": 1,
    "follow_right_ee_cartesian_pos": 3,
    "follow_right_ee_rotation": 3,
    "follow_right_gripper": 1,
    "head_actions": 2,
    "height": 1,
    "car_pose": 3
  },
  "agent_pos_config": {
    "follow_left_ee_cartesian_pos": 3,
    "follow_left_ee_rotation": 3,
    "follow_left_gripper": 1,
    "follow_right_ee_cartesian_pos": 3,
    "follow_right_ee_rotation": 3,
    "follow_right_gripper": 1,
    "head_actions": 2,
    "height": 1,
    "car_pose": 3
  },
  "noise_scheduler": {
    "beta_alpha": 1.5,
    "beta_beta": 1.0,
    "s": 0.999,
    "num_inference_timesteps": 5
  },
  "dim_inputs": [2048, 2048],
  "attention_moe": false,
  "mlp_moe": true
}
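
The `dof_config` and `agent_pos_config` blocks define the per-component layout of the 20-dimensional state and action vectors used in the README's inference example. A quick sanity check, a minimal sketch assuming a local copy of this `config.json`:

```python
import json

# Sum the per-component DOFs declared in config.json (local copy assumed).
with open("config.json") as f:
    cfg = json.load(f)

total_dof = sum(cfg["dof_config"].values())
print(total_dof)  # 3+3+1 + 3+3+1 (two arms) + 2 head + 1 height + 3 base = 20
assert total_dof == sum(cfg["agent_pos_config"].values()) == 20
```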
configuration.json ADDED
@@ -0,0 +1 @@
{"framework": "pytorch", "task": "vision-understanding", "allow_remote": true}
generation_config.json ADDED
@@ -0,0 +1,12 @@
{
  "bos_token_id": 151643,
  "pad_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "repetition_penalty": 1.05,
  "temperature": 0.000001,
  "transformers_version": "4.49.0"
}
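
Note that `do_sample` is true but `temperature` is 1e-6, so text decoding is effectively greedy by default. If stochastic generation is wanted, the defaults can be overridden at call time; a minimal sketch, assuming the standard `transformers` generation API:

```python
from transformers import GenerationConfig

# Load the shipped defaults and raise the temperature for stochastic decoding.
gen_cfg = GenerationConfig.from_pretrained("X-Square-Robot/wall-oss-flow")
gen_cfg.temperature = 0.7
# output = model.generate(**inputs, generation_config=gen_cfg)  # model from Quick Start
```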
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ff409a6f18b6ac70e115db3b80e5010c6044fe9afc938fe2a7788fd717eafaaa
size 8448201904
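
At 8,448,201,904 bytes (~8.45 GB) with `torch_dtype: bfloat16` (2 bytes per weight, per `config.json` above), the checkpoint corresponds to roughly 4.2B parameters, ignoring safetensors metadata overhead.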
preprocessor_config.json ADDED
@@ -0,0 +1,19 @@
{
  "min_pixels": 3136,
  "max_pixels": 12845056,
  "patch_size": 14,
  "temporal_patch_size": 2,
  "merge_size": 2,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "processor_class": "Qwen2_5_VLProcessor"
}
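
With `patch_size` 14 and `merge_size` 2, each 28×28 pixel block becomes one vision token, and the Qwen2VL image processor resizes images so their area stays roughly within [`min_pixels`, `max_pixels`] (3136 = 56², 12845056 = 3584²). A rough per-image token estimate under those settings, a sketch only:

```python
# Rough vision-token estimate: one token per (patch_size * merge_size)^2
# = 28x28 pixel block after resizing (ignores exact rounding in the processor).
def approx_vision_tokens(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    block = patch * merge  # 28 px per merged patch
    return (height // block) * (width // block)

print(approx_vision_tokens(56, 56))      # 4 tokens at the min_pixels bound (56*56 = 3136)
print(approx_vision_tokens(3584, 3584))  # 16384 tokens at the max_pixels bound
```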
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8a5df236d417e062783cda976a6c21955fe386a1dd8fb9aa06f29694a6d3a4de
size 11826664
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff
 
vocab.json ADDED
The diff for this file is too large to render. See raw diff