---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
- Qwen/Qwen3.5-4B
tags:
- multimodal
- vision-language-model
- 3d-spatial-reasoning
- geometry
- qwen3_5
- vggt
- image-text-to-text
- cvpr2026
language:
- en
model-index:
- name: SpatialStack-Qwen3.5-4B
results:
- task:
type: visual-question-answering
name: 3D Spatial Reasoning
dataset:
type: vsibench
name: VSI-Bench
metrics:
- type: accuracy
name: Average
value: 67.5
- task:
type: visual-question-answering
name: 3D Spatial Reasoning
dataset:
type: cvbench
name: CV-Bench
metrics:
- type: accuracy
name: Average
value: 85.5
- type: accuracy
name: 3D
value: 92.2
---
<div align="center">
# 🌐 SpatialStack-Qwen3.5-4B
### Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
**CVPR 2026**
<a href="https://arxiv.org/abs/2603.27437"><img src="https://img.shields.io/badge/πŸ“„_Paper-arXiv-b31b1b.svg"></a>&nbsp;
<a href="https://spatial-stack.github.io/"><img src="https://img.shields.io/badge/🌐_Project-Site-4CAF50.svg"></a>&nbsp;
<a href="https://github.com/jzh15/SpatialStack"><img src="https://img.shields.io/badge/πŸ’»_Code-GitHub-181717.svg"></a>&nbsp;
<a href="https://huggingface.co/datasets/Journey9ni/SpatialStackData"><img src="https://img.shields.io/badge/πŸ“¦_Data-HuggingFace-FFD21E.svg"></a>&nbsp;
<a href="https://huggingface.co/Journey9ni/SpatialStack-Qwen3.5-4B"><img src="https://img.shields.io/badge/πŸ€—_Model-HuggingFace-FF9D00.svg"></a>&nbsp;
<img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg">
</div>
---
<div align="center">
<img src="https://spatial-stack.github.io/static/images/fig1_teaser_v6.png" alt="SpatialStack Teaser" width="85%">
</div>
## πŸ“‹ Overview
**SpatialStack-Qwen3.5-4B** is a geometry-augmented vision-language model designed for **3D spatial reasoning**. It extends [Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) with a parallel [VGGT-1B](https://huggingface.co/facebook/VGGT-1B) geometry stream, using a novel **layered geometry-language fusion** mechanism that progressively aligns multi-level geometric and language features across model layers.
> Geometry features from encoder layers **[11, 17, 23]** are projected and injected into decoder layers **[0, 1, 2]**, preserving both fine local structure and higher-level spatial context.
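The injection described above can be sketched in a few lines. The following is an illustrative NumPy mock-up of the DeepStack-style "language-add" fusion, not the repository's actual implementation: geometry features tapped from the listed encoder layers are each passed through a small MLP merger (hypothetical weights here) and added residually to the visual-token positions of the corresponding early decoder layers.

```python
import numpy as np

# Illustrative sketch only (not the SpatialStack source): DeepStack-style
# language-add fusion of geometry features into early LM decoder layers.

ENCODER_LAYERS = [11, 17, 23]  # VGGT encoder layers tapped for geometry features
FUSION_LAYERS = [0, 1, 2]      # early decoder layers that receive them

def mlp_project(feats, w1, w2):
    """Two-layer ReLU MLP merger (hypothetical weights): d_geom -> d_model."""
    return np.maximum(feats @ w1, 0.0) @ w2

def fuse(hidden_states, geometry_feats, projectors, visual_slice):
    """Add projected geometry features onto the visual-token hidden states.

    hidden_states:  dict {decoder_layer: (seq_len, d_model) array}
    geometry_feats: dict {encoder_layer: (n_visual_tokens, d_geom) array}
    projectors:     dict {encoder_layer: (w1, w2)} MLP merger weights
    visual_slice:   slice selecting the visual tokens within the sequence
    """
    for enc_layer, dec_layer in zip(ENCODER_LAYERS, FUSION_LAYERS):
        projected = mlp_project(geometry_feats[enc_layer], *projectors[enc_layer])
        # "language-add": residual sum onto the visual token positions only
        hidden_states[dec_layer][visual_slice] += projected
    return hidden_states
```

Pairing deeper encoder layers with successively deeper fusion layers is what lets the model keep both fine local structure (early taps) and higher-level spatial context (late taps) in the language stream.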
## πŸ—οΈ Architecture
<div align="center">
<img src="https://spatial-stack.github.io/static/images/fig2_arch_v1.png" alt="SpatialStack Architecture" width="85%">
</div>
<table>
<tr><td><b>Component</b></td><td><b>Detail</b></td></tr>
<tr><td>Base Model</td><td><a href="https://huggingface.co/Qwen/Qwen3.5-4B">Qwen/Qwen3.5-4B</a></td></tr>
<tr><td>Geometry Encoder</td><td><a href="https://huggingface.co/facebook/VGGT-1B">facebook/VGGT-1B</a></td></tr>
<tr><td>Encoder Layers</td><td>[11, 17, 23]</td></tr>
<tr><td>Fusion Layers</td><td>[0, 1, 2]</td></tr>
<tr><td>Fusion Method</td><td>DeepStack Language-Add</td></tr>
<tr><td>Geometry Merger</td><td>MLP</td></tr>
<tr><td>Precision</td><td>bfloat16</td></tr>
</table>
## πŸ“Š Benchmark Results
| Benchmark | Metric | Score |
|:---|:---|:---:|
| **VSI-Bench** | Average | **67.5** |
| **CV-Bench** | Average | **85.5** |
| **CV-Bench** | 3D | **92.2** |
> Results from the [SpatialStack project page](https://spatial-stack.github.io/) and [paper](https://arxiv.org/abs/2603.27437).
## πŸš€ Quick Start
### Installation
```bash
git clone https://github.com/jzh15/SpatialStack.git
cd SpatialStack
pip install -e . --no-deps
```
> For full environment setup (PyTorch, flash_attn, Qwen3.5 dependencies), see the [repo README](https://github.com/jzh15/SpatialStack#setup).
### Single-Image Inference
```bash
python scripts/inference/infer.py \
--model-path Journey9ni/SpatialStack-Qwen3.5-4B \
--image assets/sofas.jpg \
--prompt "Describe this scene in a few complete sentences." \
--disable-thinking \
--max-new-tokens 128
```
### VSI-Bench Evaluation
```bash
MODEL_PATH=Journey9ni/SpatialStack-Qwen3.5-4B \
MODEL_IMPL=qwen3_5 \
MODEL_ARGS_BASE="pretrained=Journey9ni/SpatialStack-Qwen3.5-4B,use_flash_attention_2=true,max_num_frames=32,max_length=12800,geometry_encoder_path=facebook/VGGT-1B,disable_thinking=true" \
OUTPUT_ROOT=logs/eval/spatialstack_qwen35_4b \
BENCHMARKS="vsibench" \
bash scripts/evaluation/eval.sh
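The `MODEL_ARGS_BASE` string above follows the common lmms-eval-style convention of comma-separated `key=value` pairs. If you want to build or inspect such a string programmatically, a minimal parser sketch (an assumption about the format, not code from this repo) could look like:

```python
def parse_model_args(args_str: str) -> dict:
    """Split a comma-separated key=value string into a dict.

    "true"/"false" values become booleans; everything else stays a string.
    Note: this simple split assumes values contain no commas.
    """
    parsed = {}
    for pair in args_str.split(","):
        key, _, value = pair.partition("=")
        if value.lower() in ("true", "false"):
            parsed[key] = value.lower() == "true"
        else:
            parsed[key] = value
    return parsed

args = parse_model_args(
    "pretrained=Journey9ni/SpatialStack-Qwen3.5-4B,"
    "use_flash_attention_2=true,max_num_frames=32,"
    "geometry_encoder_path=facebook/VGGT-1B,disable_thinking=true"
)
# args["use_flash_attention_2"] is True; args["max_num_frames"] is "32"
```

Numeric values are left as strings here; the evaluation harness decides how to coerce them.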
```
## ⚠️ Limitations
- Requires a separate geometry encoder ([VGGT-1B](https://huggingface.co/facebook/VGGT-1B)) alongside the vision-language backbone.
- Optimized for spatial reasoning benchmarks; not intended for general-purpose multimodal chat.
- Not validated for safety-critical use, robotics deployment, or real-world decision making.
## πŸ“ Citation
```bibtex
@article{zhang2026spatialstack,
title={SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning},
author={Zhang, Jiang and Zhou, Shijie and Liu, Bangya and Kadambi, Achuta and Fan, Zhiwen},
journal={arXiv preprint arXiv:2603.27437},
year={2026}
}
```