---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
- Qwen/Qwen3.5-4B
tags:
- multimodal
- vision-language-model
- 3d-spatial-reasoning
- geometry
- qwen3_5
- vggt
- image-text-to-text
- cvpr2026
language:
- en
model-index:
- name: SpatialStack-Qwen3.5-4B
  results:
  - task:
      type: visual-question-answering
      name: 3D Spatial Reasoning
    dataset:
      type: vsibench
      name: VSI-Bench
    metrics:
    - type: accuracy
      name: Average
      value: 67.5
  - task:
      type: visual-question-answering
      name: 3D Spatial Reasoning
    dataset:
      type: cvbench
      name: CV-Bench
    metrics:
    - type: accuracy
      name: Average
      value: 85.5
    - type: accuracy
      name: 3D
      value: 92.2
---

<div align="center">

# SpatialStack-Qwen3.5-4B

### Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

**CVPR 2026**

<a href="https://arxiv.org/abs/2603.27437"><img src="https://img.shields.io/badge/📄_Paper-arXiv-b31b1b.svg"></a>
<a href="https://spatial-stack.github.io/"><img src="https://img.shields.io/badge/🌐_Project-Site-4CAF50.svg"></a>
<a href="https://github.com/jzh15/SpatialStack"><img src="https://img.shields.io/badge/💻_Code-GitHub-181717.svg"></a>
<a href="https://huggingface.co/datasets/Journey9ni/SpatialStackData"><img src="https://img.shields.io/badge/📦_Data-HuggingFace-FFD21E.svg"></a>
<a href="https://huggingface.co/Journey9ni/SpatialStack-Qwen3.5-4B"><img src="https://img.shields.io/badge/🤗_Model-HuggingFace-FF9D00.svg"></a>
<img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg">

</div>

---

<div align="center">
<img src="https://spatial-stack.github.io/static/images/fig1_teaser_v6.png" alt="SpatialStack Teaser" width="85%">
</div>

## 🔎 Overview

**SpatialStack-Qwen3.5-4B** is a geometry-augmented vision-language model designed for **3D spatial reasoning**. It extends [Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) with a parallel [VGGT-1B](https://huggingface.co/facebook/VGGT-1B) geometry stream, using a novel **layered geometry-language fusion** mechanism that progressively aligns multi-level geometric and language features across model layers.

> Geometry features from encoder layers **[11, 17, 23]** are projected and injected into decoder layers **[0, 1, 2]**, preserving both fine local structure and higher-level spatial context.

## 🏗️ Architecture

<div align="center">
<img src="https://spatial-stack.github.io/static/images/fig2_arch_v1.png" alt="SpatialStack Architecture" width="85%">
</div>

<table>
<tr><td><b>Component</b></td><td><b>Detail</b></td></tr>
<tr><td>Base Model</td><td><a href="https://huggingface.co/Qwen/Qwen3.5-4B">Qwen/Qwen3.5-4B</a></td></tr>
<tr><td>Geometry Encoder</td><td><a href="https://huggingface.co/facebook/VGGT-1B">facebook/VGGT-1B</a></td></tr>
<tr><td>Encoder Layers</td><td>[11, 17, 23]</td></tr>
<tr><td>Fusion Layers</td><td>[0, 1, 2]</td></tr>
<tr><td>Fusion Method</td><td>DeepStack Language-Add</td></tr>
<tr><td>Geometry Merger</td><td>MLP</td></tr>
<tr><td>Precision</td><td>bfloat16</td></tr>
</table>

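The table above implies a simple data flow: features tapped from three geometry-encoder layers are each run through an MLP merger and added onto the language model's hidden states at decoder layers 0–2. The sketch below is an illustrative toy of that "language-add" pattern, not the released implementation — the dimensions, the ReLU MLP, and the one-to-one layer mapping `{11→0, 17→1, 23→2}` are assumptions inferred from the table.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dims (illustrative only): geometry width 1024, LM hidden 2560.
GEO_DIM, LM_DIM, N_TOKENS = 1024, 2560, 16
ENC_LAYERS, FUSE_LAYERS = [11, 17, 23], [0, 1, 2]

def mlp_merger(x, w1, w2):
    """Two-layer MLP projecting geometry features into the LM hidden space
    (the activation choice is an assumption)."""
    return np.maximum(x @ w1, 0.0) @ w2

# One merger per tapped encoder layer.
mergers = [(rng.standard_normal((GEO_DIM, GEO_DIM)) * 0.02,
            rng.standard_normal((GEO_DIM, LM_DIM)) * 0.02)
           for _ in ENC_LAYERS]

# Stand-in for VGGT per-layer token features: {encoder_layer: (N_TOKENS, GEO_DIM)}.
geo_feats = {l: rng.standard_normal((N_TOKENS, GEO_DIM)) for l in ENC_LAYERS}

def fuse(hidden, layer_idx):
    """DeepStack-style language-add: at decoder layers 0-2, add the projected
    geometry features of the matching encoder tap onto the hidden states."""
    if layer_idx in FUSE_LAYERS:
        i = FUSE_LAYERS.index(layer_idx)
        w1, w2 = mergers[i]
        hidden = hidden + mlp_merger(geo_feats[ENC_LAYERS[i]], w1, w2)
    return hidden

h = rng.standard_normal((N_TOKENS, LM_DIM))
for layer in range(4):   # first few decoder layers
    h = fuse(h, layer)   # geometry is added at layers 0, 1, 2 only
print(h.shape)  # (16, 2560)
```

Deeper decoder layers pass through `fuse` unchanged, so the geometry signal enters only at the bottom of the stack, matching the "Fusion Layers [0, 1, 2]" row above.
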
## 📊 Benchmark Results

| Benchmark | Metric | Score |
|:---|:---|:---:|
| **VSI-Bench** | Average | **67.5** |
| **CV-Bench** | Average | **85.5** |
| **CV-Bench** | 3D | **92.2** |

> Results from the [SpatialStack project page](https://spatial-stack.github.io/) and [paper](https://arxiv.org/abs/2603.27437).

## 🚀 Quick Start

### Installation

```bash
git clone https://github.com/jzh15/SpatialStack.git
cd SpatialStack
pip install -e . --no-deps
```

> For full environment setup (PyTorch, flash_attn, Qwen3.5 dependencies), see the [repo README](https://github.com/jzh15/SpatialStack#setup).

### Single-Image Inference

```bash
python scripts/inference/infer.py \
  --model-path Journey9ni/SpatialStack-Qwen3.5-4B \
  --image assets/sofas.jpg \
  --prompt "Describe this scene in a few complete sentences." \
  --disable-thinking \
  --max-new-tokens 128
```
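
For programmatic use, Qwen-family processors expect the conversation as a list of role/content messages with interleaved image and text parts. The helper below only builds that payload; whether this checkpoint's processor accepts it unchanged is an assumption — `scripts/inference/infer.py` in the repo is the authoritative interface.

```python
# Hypothetical message payload mirroring the CLI call above, following the
# Qwen-VL chat convention of typed content parts inside a user turn.
def build_messages(image_path: str, prompt: str) -> list:
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]

messages = build_messages(
    "assets/sofas.jpg",
    "Describe this scene in a few complete sentences.",
)
print(messages[0]["role"])  # user
```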

### VSI-Bench Evaluation

```bash
MODEL_PATH=Journey9ni/SpatialStack-Qwen3.5-4B \
MODEL_IMPL=qwen3_5 \
MODEL_ARGS_BASE="pretrained=Journey9ni/SpatialStack-Qwen3.5-4B,use_flash_attention_2=true,max_num_frames=32,max_length=12800,geometry_encoder_path=facebook/VGGT-1B,disable_thinking=true" \
OUTPUT_ROOT=logs/eval/spatialstack_qwen35_4b \
BENCHMARKS="vsibench" \
bash scripts/evaluation/eval.sh
```

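`MODEL_ARGS_BASE` is a comma-separated `key=value` string, the convention common VLM evaluation harnesses use to pass model kwargs. The parser below is purely illustrative (it is not the harness's own code) but shows how the string above decomposes into typed arguments:

```python
def parse_model_args(s: str) -> dict:
    """Split a comma-separated key=value string into typed kwargs.
    Booleans and integers are coerced; everything else stays a string."""
    out = {}
    for pair in s.split(","):
        key, _, value = pair.partition("=")
        if value.lower() in ("true", "false"):
            out[key] = value.lower() == "true"
        elif value.isdigit():
            out[key] = int(value)
        else:
            out[key] = value
    return out

args = parse_model_args(
    "pretrained=Journey9ni/SpatialStack-Qwen3.5-4B,"
    "use_flash_attention_2=true,max_num_frames=32,max_length=12800,"
    "geometry_encoder_path=facebook/VGGT-1B,disable_thinking=true"
)
print(args["max_num_frames"])    # 32
print(args["disable_thinking"])  # True
```

Note that `geometry_encoder_path` must point at the VGGT-1B checkpoint, since the geometry stream is loaded separately from the language backbone.
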
## ⚠️ Limitations

- Requires a separate geometry encoder ([VGGT-1B](https://huggingface.co/facebook/VGGT-1B)) alongside the vision-language backbone.
- Optimized for spatial-reasoning benchmarks; not intended for general-purpose multimodal chat.
- Not validated for safety-critical use, robotics deployment, or real-world decision making.

## 📚 Citation

```bibtex
@article{zhang2026spatialstack,
  title={SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning},
  author={Zhang, Jiang and Zhou, Shijie and Liu, Bangya and Kadambi, Achuta and Fan, Zhiwen},
  journal={arXiv preprint arXiv:2603.27437},
  year={2026}
}
```