---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
- Qwen/Qwen3.5-4B
tags:
- multimodal
- vision-language-model
- 3d-spatial-reasoning
- geometry
- qwen3_5
- vggt
- image-text-to-text
- cvpr2026
language:
- en
model-index:
- name: SpatialStack-Qwen3.5-4B
  results:
  - task:
      type: visual-question-answering
      name: 3D Spatial Reasoning
    dataset:
      type: vsibench
      name: VSI-Bench
    metrics:
    - type: accuracy
      name: Average
      value: 67.5
  - task:
      type: visual-question-answering
      name: 3D Spatial Reasoning
    dataset:
      type: cvbench
      name: CV-Bench
    metrics:
    - type: accuracy
      name: Average
      value: 85.5
    - type: accuracy
      name: 3D
      value: 92.2
---

<div align="center">

# 🌐 SpatialStack-Qwen3.5-4B

### Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

**CVPR 2026**

<a href="https://arxiv.org/abs/2603.27437"><img src="https://img.shields.io/badge/πŸ“„_Paper-arXiv-b31b1b.svg"></a>&nbsp;
<a href="https://spatial-stack.github.io/"><img src="https://img.shields.io/badge/🌐_Project-Site-4CAF50.svg"></a>&nbsp;
<a href="https://github.com/jzh15/SpatialStack"><img src="https://img.shields.io/badge/πŸ’»_Code-GitHub-181717.svg"></a>&nbsp;
<a href="https://huggingface.co/datasets/Journey9ni/SpatialStackData"><img src="https://img.shields.io/badge/πŸ“¦_Data-HuggingFace-FFD21E.svg"></a>&nbsp;
<a href="https://huggingface.co/Journey9ni/SpatialStack-Qwen3.5-4B"><img src="https://img.shields.io/badge/πŸ€—_Model-HuggingFace-FF9D00.svg"></a>&nbsp;
<img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg">

</div>

---

<div align="center">
<img src="https://spatial-stack.github.io/static/images/fig1_teaser_v6.png" alt="SpatialStack Teaser" width="85%">
</div>

## πŸ“‹ Overview

**SpatialStack-Qwen3.5-4B** is a geometry-augmented vision-language model designed for **3D spatial reasoning**. It extends [Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) with a parallel [VGGT-1B](https://huggingface.co/facebook/VGGT-1B) geometry stream, using a novel **layered geometry-language fusion** mechanism that progressively aligns multi-level geometric and language features across model layers.

> Geometry features from encoder layers **[11, 17, 23]** are projected and injected into decoder layers **[0, 1, 2]**, preserving both fine local structure and higher-level spatial context.
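The injection scheme above can be sketched in PyTorch. This is an illustrative sketch, not the repo's actual implementation: the names `GeometryMerger` and `layered_fusion`, the tensor shapes, and the mask convention are assumptions; only the MLP merger and the [11, 17, 23] → [0, 1, 2] layer mapping come from this card.

```python
import torch
import torch.nn as nn


class GeometryMerger(nn.Module):
    """Hypothetical MLP merger projecting VGGT geometry features to the LLM hidden size."""

    def __init__(self, geo_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(geo_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, geo_feats: torch.Tensor) -> torch.Tensor:
        # (B, N_vis, geo_dim) -> (B, N_vis, llm_dim)
        return self.proj(geo_feats)


def layered_fusion(hidden_states, geo_levels, mergers, visual_mask,
                   fusion_layers=(0, 1, 2)):
    """DeepStack-style language-add sketch: the k-th geometry level (tapped
    from VGGT encoder layers [11, 17, 23]) is projected by its merger and
    added onto the visual-token positions of the input to decoder layer
    fusion_layers[k] (i.e. layers [0, 1, 2]).

    hidden_states: dict layer -> (B, T, D) tensor (input to that decoder layer)
    geo_levels:    list of (B, N_vis, D_geo) tensors, one per encoder tap
    visual_mask:   (B, T) bool marking the N_vis visual-token positions per row
    """
    for k, layer in enumerate(fusion_layers):
        injected = mergers[k](geo_levels[k])          # (B, N_vis, D)
        h = hidden_states[layer].clone()
        # Add projected geometry features only at visual-token positions.
        h[visual_mask] += injected.reshape(-1, injected.shape[-1])
        hidden_states[layer] = h
    return hidden_states
```

Because the add happens at three early decoder layers rather than once at the input, shallow taps (layer 11) contribute fine local structure while deeper taps (layer 23) contribute higher-level spatial context, as the note above describes.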

## πŸ—οΈ Architecture

<div align="center">
<img src="https://spatial-stack.github.io/static/images/fig2_arch_v1.png" alt="SpatialStack Architecture" width="85%">
</div>

<table>
<tr><td><b>Component</b></td><td><b>Detail</b></td></tr>
<tr><td>Base Model</td><td><a href="https://huggingface.co/Qwen/Qwen3.5-4B">Qwen/Qwen3.5-4B</a></td></tr>
<tr><td>Geometry Encoder</td><td><a href="https://huggingface.co/facebook/VGGT-1B">facebook/VGGT-1B</a></td></tr>
<tr><td>Encoder Layers</td><td>[11, 17, 23]</td></tr>
<tr><td>Fusion Layers</td><td>[0, 1, 2]</td></tr>
<tr><td>Fusion Method</td><td>DeepStack Language-Add</td></tr>
<tr><td>Geometry Merger</td><td>MLP</td></tr>
<tr><td>Precision</td><td>bfloat16</td></tr>
</table>

## πŸ“Š Benchmark Results

| Benchmark | Metric | Score |
|:---|:---|:---:|
| **VSI-Bench** | Average | **67.5** |
| **CV-Bench** | Average | **85.5** |
| **CV-Bench** | 3D | **92.2** |

> Results from the [SpatialStack project page](https://spatial-stack.github.io/) and [paper](https://arxiv.org/abs/2603.27437).

## πŸš€ Quick Start

### Installation

```bash
git clone https://github.com/jzh15/SpatialStack.git
cd SpatialStack
pip install -e . --no-deps
```

> For full environment setup (PyTorch, flash_attn, Qwen3.5 dependencies), see the [repo README](https://github.com/jzh15/SpatialStack#setup).

### Single-Image Inference

```bash
python scripts/inference/infer.py \
  --model-path Journey9ni/SpatialStack-Qwen3.5-4B \
  --image assets/sofas.jpg \
  --prompt "Describe this scene in a few complete sentences." \
  --disable-thinking \
  --max-new-tokens 128
```

### VSI-Bench Evaluation

```bash
MODEL_PATH=Journey9ni/SpatialStack-Qwen3.5-4B \
MODEL_IMPL=qwen3_5 \
MODEL_ARGS_BASE="pretrained=Journey9ni/SpatialStack-Qwen3.5-4B,use_flash_attention_2=true,max_num_frames=32,max_length=12800,geometry_encoder_path=facebook/VGGT-1B,disable_thinking=true" \
OUTPUT_ROOT=logs/eval/spatialstack_qwen35_4b \
BENCHMARKS="vsibench" \
bash scripts/evaluation/eval.sh
```

## ⚠️ Limitations

- Requires a separate geometry encoder ([VGGT-1B](https://huggingface.co/facebook/VGGT-1B)) alongside the vision-language backbone.
- Optimized for spatial reasoning benchmarks; not intended for general-purpose multimodal chat.
- Not validated for safety-critical use, robotics deployment, or real-world decision making.

## πŸ“ Citation

```bibtex
@article{zhang2026spatialstack,
  title={SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning},
  author={Zhang, Jiang and Zhou, Shijie and Liu, Bangya and Kadambi, Achuta and Fan, Zhiwen},
  journal={arXiv preprint arXiv:2603.27437},
  year={2026}
}
```