--- license: apache-2.0 library_name: transformers pipeline_tag: image-text-to-text base_model: - Qwen/Qwen3.5-4B tags: - multimodal - vision-language-model - 3d-spatial-reasoning - geometry - qwen3_5 - vggt - image-text-to-text - cvpr2026 language: - en model-index: - name: SpatialStack-Qwen3.5-4B results: - task: type: visual-question-answering name: 3D Spatial Reasoning dataset: type: vsibench name: VSI-Bench metrics: - type: accuracy name: Average value: 67.5 - task: type: visual-question-answering name: 3D Spatial Reasoning dataset: type: cvbench name: CV-Bench metrics: - type: accuracy name: Average value: 85.5 - type: accuracy name: 3D value: 92.2 ---

# 🌐 SpatialStack-Qwen3.5-4B ### Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning **CVPR 2026**

---

## 📋 Overview **SpatialStack-Qwen3.5-4B** is a geometry-augmented vision-language model designed for **3D spatial reasoning**. It extends [Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) with a parallel [VGGT-1B](https://huggingface.co/facebook/VGGT-1B) geometry stream, using a novel **layered geometry-language fusion** mechanism that progressively aligns multi-level geometric and language features across model layers. > Geometry features from encoder layers **[11, 17, 23]** are projected and injected into decoder layers **[0, 1, 2]**, preserving both fine local structure and higher-level spatial context. ## 🏗️ Architecture

Component	Detail
Base Model	Qwen/Qwen3.5-4B
Geometry Encoder	facebook/VGGT-1B
Encoder Layers	[11, 17, 23]
Fusion Layers	[0, 1, 2]
Fusion Method	DeepStack Language-Add
Geometry Merger	MLP
Precision	bfloat16

## 📊 Benchmark Results | Benchmark | Metric | Score | |:---|:---|:---:| | **VSI-Bench** | Average | **67.5** | | **CV-Bench** | Average | **85.5** | | **CV-Bench** | 3D | **92.2** | > Results from the [SpatialStack project page](https://spatial-stack.github.io/) and [paper](https://arxiv.org/abs/2603.27437). ## 🚀 Quick Start ### Installation ```bash git clone https://github.com/jzh15/SpatialStack.git cd SpatialStack pip install -e . --no-deps ``` > For full environment setup (PyTorch, flash_attn, Qwen3.5 dependencies), see the [repo README](https://github.com/jzh15/SpatialStack#setup). ### Single-Image Inference ```bash python scripts/inference/infer.py \ --model-path Journey9ni/SpatialStack-Qwen3.5-4B \ --image assets/sofas.jpg \ --prompt "Describe this scene in a few complete sentences." \ --disable-thinking \ --max-new-tokens 128 ``` ### VSI-Bench Evaluation ```bash MODEL_PATH=Journey9ni/SpatialStack-Qwen3.5-4B \ MODEL_IMPL=qwen3_5 \ MODEL_ARGS_BASE="pretrained=Journey9ni/SpatialStack-Qwen3.5-4B,use_flash_attention_2=true,max_num_frames=32,max_length=12800,geometry_encoder_path=facebook/VGGT-1B,disable_thinking=true" \ OUTPUT_ROOT=logs/eval/spatialstack_qwen35_4b \ BENCHMARKS="vsibench" \ bash scripts/evaluation/eval.sh ``` ## ⚠️ Limitations - Requires a separate geometry encoder ([VGGT-1B](https://huggingface.co/facebook/VGGT-1B)) alongside the vision-language backbone. - Optimized for spatial reasoning benchmarks; not intended for general-purpose multimodal chat. - Not validated for safety-critical use, robotics deployment, or real-world decision making. ## 📝 Citation ```bibtex @article{zhang2026spatialstack, title={SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning}, author={Zhang, Jiang and Zhou, Shijie and Liu, Bangya and Kadambi, Achuta and Fan, Zhiwen}, journal={arXiv preprint arXiv:2603.27437}, year={2026} } ```