---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
- Qwen/Qwen3.5-4B
tags:
- multimodal
- vision-language-model
- 3d-spatial-reasoning
- geometry
- qwen3_5
- vggt
- image-text-to-text
- cvpr2026
language:
- en
model-index:
- name: SpatialStack-Qwen3.5-4B
  results:
  - task:
      type: visual-question-answering
      name: 3D Spatial Reasoning
    dataset:
      type: vsibench
      name: VSI-Bench
    metrics:
    - type: accuracy
      name: Average
      value: 67.5
  - task:
      type: visual-question-answering
      name: 3D Spatial Reasoning
    dataset:
      type: cvbench
      name: CV-Bench
    metrics:
    - type: accuracy
      name: Average
      value: 85.5
    - type: accuracy
      name: 3D
      value: 92.2
---

<div align="center">

# SpatialStack-Qwen3.5-4B

### Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

**CVPR 2026**

<a href="https://arxiv.org/abs/2603.27437"><img src="https://img.shields.io/badge/📄_Paper-arXiv-b31b1b.svg"></a>
<a href="https://spatial-stack.github.io/"><img src="https://img.shields.io/badge/🌐_Project-Site-4CAF50.svg"></a>
<a href="https://github.com/jzh15/SpatialStack"><img src="https://img.shields.io/badge/💻_Code-GitHub-181717.svg"></a>
<a href="https://huggingface.co/datasets/Journey9ni/SpatialStackData"><img src="https://img.shields.io/badge/📦_Data-HuggingFace-FFD21E.svg"></a>
<a href="https://huggingface.co/Journey9ni/SpatialStack-Qwen3.5-4B"><img src="https://img.shields.io/badge/🤗_Model-HuggingFace-FF9D00.svg"></a>
<img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg">

</div>

---

<div align="center">
<img src="https://spatial-stack.github.io/static/images/fig1_teaser_v6.png" alt="SpatialStack Teaser" width="85%">
</div>

## 🔎 Overview

**SpatialStack-Qwen3.5-4B** is a geometry-augmented vision-language model designed for **3D spatial reasoning**. It extends [Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) with a parallel [VGGT-1B](https://huggingface.co/facebook/VGGT-1B) geometry stream, using a novel **layered geometry-language fusion** mechanism that progressively aligns multi-level geometric and language features across model layers.

> Geometry features from encoder layers **[11, 17, 23]** are projected and injected into decoder layers **[0, 1, 2]**, preserving both fine local structure and higher-level spatial context.

## 🏗️ Architecture

<div align="center">
<img src="https://spatial-stack.github.io/static/images/fig2_arch_v1.png" alt="SpatialStack Architecture" width="85%">
</div>

<table>
<tr><td><b>Component</b></td><td><b>Detail</b></td></tr>
<tr><td>Base Model</td><td><a href="https://huggingface.co/Qwen/Qwen3.5-4B">Qwen/Qwen3.5-4B</a></td></tr>
<tr><td>Geometry Encoder</td><td><a href="https://huggingface.co/facebook/VGGT-1B">facebook/VGGT-1B</a></td></tr>
<tr><td>Encoder Layers</td><td>[11, 17, 23]</td></tr>
<tr><td>Fusion Layers</td><td>[0, 1, 2]</td></tr>
<tr><td>Fusion Method</td><td>DeepStack Language-Add</td></tr>
<tr><td>Geometry Merger</td><td>MLP</td></tr>
<tr><td>Precision</td><td>bfloat16</td></tr>
</table>

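The table above implies a simple data flow: features tapped from three geometry-encoder layers are each run through an MLP merger and added onto the language model's hidden states at decoder layers 0–2. The sketch below is an illustrative toy of that "language-add" pattern, not the released implementation — the dimensions, the ReLU MLP, and the one-to-one layer mapping `{11→0, 17→1, 23→2}` are assumptions inferred from the table.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dims (illustrative only): geometry width 1024, LM hidden 2560.
GEO_DIM, LM_DIM, N_TOKENS = 1024, 2560, 16
ENC_LAYERS, FUSE_LAYERS = [11, 17, 23], [0, 1, 2]

def mlp_merger(x, w1, w2):
    """Two-layer MLP projecting geometry features into the LM hidden space
    (the activation choice is an assumption)."""
    return np.maximum(x @ w1, 0.0) @ w2

# One merger per tapped encoder layer.
mergers = [(rng.standard_normal((GEO_DIM, GEO_DIM)) * 0.02,
            rng.standard_normal((GEO_DIM, LM_DIM)) * 0.02)
           for _ in ENC_LAYERS]

# Stand-in for VGGT per-layer token features: {encoder_layer: (N_TOKENS, GEO_DIM)}.
geo_feats = {l: rng.standard_normal((N_TOKENS, GEO_DIM)) for l in ENC_LAYERS}

def fuse(hidden, layer_idx):
    """DeepStack-style language-add: at decoder layers 0-2, add the projected
    geometry features of the matching encoder tap onto the hidden states."""
    if layer_idx in FUSE_LAYERS:
        i = FUSE_LAYERS.index(layer_idx)
        w1, w2 = mergers[i]
        hidden = hidden + mlp_merger(geo_feats[ENC_LAYERS[i]], w1, w2)
    return hidden

h = rng.standard_normal((N_TOKENS, LM_DIM))
for layer in range(4):   # first few decoder layers
    h = fuse(h, layer)   # geometry is added at layers 0, 1, 2 only
print(h.shape)  # (16, 2560)
```

Deeper decoder layers pass through `fuse` unchanged, so the geometry signal enters only at the bottom of the stack, matching the "Fusion Layers [0, 1, 2]" row above.
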
## 📊 Benchmark Results

| Benchmark | Metric | Score |
|:---|:---|:---:|
| **VSI-Bench** | Average | **67.5** |
| **CV-Bench** | Average | **85.5** |
| **CV-Bench** | 3D | **92.2** |

> Results from the [SpatialStack project page](https://spatial-stack.github.io/) and [paper](https://arxiv.org/abs/2603.27437).

## 🚀 Quick Start

### Installation

```bash
git clone https://github.com/jzh15/SpatialStack.git
cd SpatialStack
pip install -e . --no-deps
```

> For full environment setup (PyTorch, flash_attn, Qwen3.5 dependencies), see the [repo README](https://github.com/jzh15/SpatialStack#setup).

### Single-Image Inference

```bash
python scripts/inference/infer.py \
  --model-path Journey9ni/SpatialStack-Qwen3.5-4B \
  --image assets/sofas.jpg \
  --prompt "Describe this scene in a few complete sentences." \
  --disable-thinking \
  --max-new-tokens 128
```
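
For programmatic use, Qwen-family processors expect the conversation as a list of role/content messages with interleaved image and text parts. The helper below only builds that payload; whether this checkpoint's processor accepts it unchanged is an assumption — `scripts/inference/infer.py` in the repo is the authoritative interface.

```python
# Hypothetical message payload mirroring the CLI call above, following the
# Qwen-VL chat convention of typed content parts inside a user turn.
def build_messages(image_path: str, prompt: str) -> list:
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]

messages = build_messages(
    "assets/sofas.jpg",
    "Describe this scene in a few complete sentences.",
)
print(messages[0]["role"])  # user
```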

### VSI-Bench Evaluation

```bash
MODEL_PATH=Journey9ni/SpatialStack-Qwen3.5-4B \
MODEL_IMPL=qwen3_5 \
MODEL_ARGS_BASE="pretrained=Journey9ni/SpatialStack-Qwen3.5-4B,use_flash_attention_2=true,max_num_frames=32,max_length=12800,geometry_encoder_path=facebook/VGGT-1B,disable_thinking=true" \
OUTPUT_ROOT=logs/eval/spatialstack_qwen35_4b \
BENCHMARKS="vsibench" \
bash scripts/evaluation/eval.sh
```

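`MODEL_ARGS_BASE` is a comma-separated `key=value` string, the convention common VLM evaluation harnesses use to pass model kwargs. The parser below is purely illustrative (it is not the harness's own code) but shows how the string above decomposes into typed arguments:

```python
def parse_model_args(s: str) -> dict:
    """Split a comma-separated key=value string into typed kwargs.
    Booleans and integers are coerced; everything else stays a string."""
    out = {}
    for pair in s.split(","):
        key, _, value = pair.partition("=")
        if value.lower() in ("true", "false"):
            out[key] = value.lower() == "true"
        elif value.isdigit():
            out[key] = int(value)
        else:
            out[key] = value
    return out

args = parse_model_args(
    "pretrained=Journey9ni/SpatialStack-Qwen3.5-4B,"
    "use_flash_attention_2=true,max_num_frames=32,max_length=12800,"
    "geometry_encoder_path=facebook/VGGT-1B,disable_thinking=true"
)
print(args["max_num_frames"])    # 32
print(args["disable_thinking"])  # True
```

Note that `geometry_encoder_path` must point at the VGGT-1B checkpoint, since the geometry stream is loaded separately from the language backbone.
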
## ⚠️ Limitations

- Requires a separate geometry encoder ([VGGT-1B](https://huggingface.co/facebook/VGGT-1B)) alongside the vision-language backbone.
- Optimized for spatial-reasoning benchmarks; not intended for general-purpose multimodal chat.
- Not validated for safety-critical use, robotics deployment, or real-world decision making.

## 📚 Citation

```bibtex
@article{zhang2026spatialstack,
  title={SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning},
  author={Zhang, Jiang and Zhou, Shijie and Liu, Bangya and Kadambi, Achuta and Fan, Zhiwen},
  journal={arXiv preprint arXiv:2603.27437},
  year={2026}
}
```