---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
- Qwen/Qwen3.5-4B
tags:
- multimodal
- vision-language-model
- 3d-spatial-reasoning
- geometry
- qwen3_5
- vggt
- image-text-to-text
- cvpr2026
language:
- en
model-index:
- name: SpatialStack-Qwen3.5-4B
results:
- task:
type: visual-question-answering
name: 3D Spatial Reasoning
dataset:
type: vsibench
name: VSI-Bench
metrics:
- type: accuracy
name: Average
value: 67.5
- task:
type: visual-question-answering
name: 3D Spatial Reasoning
dataset:
type: cvbench
name: CV-Bench
metrics:
- type: accuracy
name: Average
value: 85.5
- type: accuracy
name: 3D
value: 92.2
---
<div align="center">
# SpatialStack-Qwen3.5-4B
### Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
**CVPR 2026**
<a href="https://arxiv.org/abs/2603.27437"><img src="https://img.shields.io/badge/π_Paper-arXiv-b31b1b.svg"></a>
<a href="https://spatial-stack.github.io/"><img src="https://img.shields.io/badge/π_Project-Site-4CAF50.svg"></a>
<a href="https://github.com/jzh15/SpatialStack"><img src="https://img.shields.io/badge/π»_Code-GitHub-181717.svg"></a>
<a href="https://huggingface.co/datasets/Journey9ni/SpatialStackData"><img src="https://img.shields.io/badge/π¦_Data-HuggingFace-FFD21E.svg"></a>
<a href="https://huggingface.co/Journey9ni/SpatialStack-Qwen3.5-4B"><img src="https://img.shields.io/badge/π€_Model-HuggingFace-FF9D00.svg"></a>
<img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg">
</div>
---
<div align="center">
<img src="https://spatial-stack.github.io/static/images/fig1_teaser_v6.png" alt="SpatialStack Teaser" width="85%">
</div>
## 📖 Overview
**SpatialStack-Qwen3.5-4B** is a geometry-augmented vision-language model designed for **3D spatial reasoning**. It extends [Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) with a parallel [VGGT-1B](https://huggingface.co/facebook/VGGT-1B) geometry stream, using a novel **layered geometry-language fusion** mechanism that progressively aligns multi-level geometric and language features across model layers.
> Geometry features from encoder layers **[11, 17, 23]** are projected and injected into decoder layers **[0, 1, 2]**, preserving both fine local structure and higher-level spatial context.
## 🏗️ Architecture
<div align="center">
<img src="https://spatial-stack.github.io/static/images/fig2_arch_v1.png" alt="SpatialStack Architecture" width="85%">
</div>
<table>
<tr><td><b>Component</b></td><td><b>Detail</b></td></tr>
<tr><td>Base Model</td><td><a href="https://huggingface.co/Qwen/Qwen3.5-4B">Qwen/Qwen3.5-4B</a></td></tr>
<tr><td>Geometry Encoder</td><td><a href="https://huggingface.co/facebook/VGGT-1B">facebook/VGGT-1B</a></td></tr>
<tr><td>Encoder Layers</td><td>[11, 17, 23]</td></tr>
<tr><td>Fusion Layers</td><td>[0, 1, 2]</td></tr>
<tr><td>Fusion Method</td><td>DeepStack Language-Add</td></tr>
<tr><td>Geometry Merger</td><td>MLP</td></tr>
<tr><td>Precision</td><td>bfloat16</td></tr>
</table>
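The fusion recipe in the table can be sketched in a few lines: geometry features tapped from three encoder layers are each passed through an MLP merger and added to the language model's hidden states at the corresponding early decoder layers. This is an illustrative sketch only; the tensor sizes, the ReLU MLP, and the `mlp_merger` name are assumptions for demonstration, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration (not the real model's sizes).
GEO_DIM, LM_DIM, N_TOKENS = 256, 512, 64
ENCODER_LAYERS = [11, 17, 23]  # VGGT encoder layers tapped for geometry features
FUSION_LAYERS = [0, 1, 2]      # decoder layers that receive each geometry stream


def mlp_merger(x, w1, w2):
    """Two-layer ReLU MLP projecting geometry features into the LM width."""
    return np.maximum(x @ w1, 0.0) @ w2


# One independent merger per tapped encoder layer.
mergers = [
    (rng.standard_normal((GEO_DIM, GEO_DIM)) * 0.02,
     rng.standard_normal((GEO_DIM, LM_DIM)) * 0.02)
    for _ in ENCODER_LAYERS
]

# Stand-ins for geometry features (per tapped layer) and LM hidden states.
geo_feats = {l: rng.standard_normal((N_TOKENS, GEO_DIM)) for l in ENCODER_LAYERS}
hidden = rng.standard_normal((N_TOKENS, LM_DIM))

# DeepStack-style "language-add" fusion: at each early decoder layer, add the
# projected geometry stream to the hidden states at the visual-token positions.
for enc_l, dec_l, (w1, w2) in zip(ENCODER_LAYERS, FUSION_LAYERS, mergers):
    hidden = hidden + mlp_merger(geo_feats[enc_l], w1, w2)

print(hidden.shape)  # (64, 512)
```

Pairing deeper encoder taps with successive decoder layers is what lets the model keep both fine local structure (layer 11) and higher-level spatial context (layer 23) in the language stream.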
## 📊 Benchmark Results
| Benchmark | Metric | Score |
|:---|:---|:---:|
| **VSI-Bench** | Average | **67.5** |
| **CV-Bench** | Average | **85.5** |
| **CV-Bench** | 3D | **92.2** |
> Results from the [SpatialStack project page](https://spatial-stack.github.io/) and [paper](https://arxiv.org/abs/2603.27437).
## 🚀 Quick Start
### Installation
```bash
git clone https://github.com/jzh15/SpatialStack.git
cd SpatialStack
pip install -e . --no-deps
```
> For full environment setup (PyTorch, flash_attn, Qwen3.5 dependencies), see the [repo README](https://github.com/jzh15/SpatialStack#setup).
### Single-Image Inference
```bash
python scripts/inference/infer.py \
--model-path Journey9ni/SpatialStack-Qwen3.5-4B \
--image assets/sofas.jpg \
--prompt "Describe this scene in a few complete sentences." \
--disable-thinking \
--max-new-tokens 128
```
### VSI-Bench Evaluation
```bash
MODEL_PATH=Journey9ni/SpatialStack-Qwen3.5-4B \
MODEL_IMPL=qwen3_5 \
MODEL_ARGS_BASE="pretrained=Journey9ni/SpatialStack-Qwen3.5-4B,use_flash_attention_2=true,max_num_frames=32,max_length=12800,geometry_encoder_path=facebook/VGGT-1B,disable_thinking=true" \
OUTPUT_ROOT=logs/eval/spatialstack_qwen35_4b \
BENCHMARKS="vsibench" \
bash scripts/evaluation/eval.sh
```
## ⚠️ Limitations
- Requires a separate geometry encoder ([VGGT-1B](https://huggingface.co/facebook/VGGT-1B)) alongside the vision-language backbone.
- Optimized for spatial reasoning benchmarks; not intended for general-purpose multimodal chat.
- Not validated for safety-critical use, robotics deployment, or real-world decision making.
## 📚 Citation
```bibtex
@article{zhang2026spatialstack,
title={SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning},
author={Zhang, Jiang and Zhou, Shijie and Liu, Bangya and Kadambi, Achuta and Fan, Zhiwen},
journal={arXiv preprint arXiv:2603.27437},
year={2026}
}
```