Journey9ni/SpatialStack-Qwen3.5-4B

@@ -8,40 +8,100 @@ tags:
 - multimodal
 - vision-language-model
 - 3d-spatial-reasoning
 - qwen3_5
 - vggt
 - image-text-to-text
 language:
 - en
 ---
-# SpatialStack-Qwen3.5-4B
-[Project Page](https://spatial-stack.github.io/) | [Paper](https://arxiv.org/abs/2603.27437) | [Code](https://github.com/jzh15/SpatialStack)
-SpatialStack-Qwen3.5-4B is a geometry-augmented vision-language model for 3D spatial reasoning. It builds on `Qwen/Qwen3.5-4B` and adds a parallel `facebook/VGGT-1B` geometry stream with layered geometry-language fusion. In this checkpoint, geometry features from encoder layers `[11, 17, 23]` are projected and injected into decoder layers `[0, 1, 2]` to preserve both fine local structure and higher-level spatial context.
-The SpatialStack project page reports the Qwen3.5-based model at `67.5` average on VSI-Bench and `85.5` average / `92.2` 3D on CV-Bench.
-## Model Details
-- Base model: `Qwen/Qwen3.5-4B`
-- Geometry encoder: `facebook/VGGT-1B`
-- Geometry encoder layers: `[11, 17, 23]`
-- Geometry fusion layers: `[0, 1, 2]`
-- Fusion method: `deepstack_language_add`
-- Geometry merger: `mlp`
-- Precision: `bfloat16`
-## Intended Use
-This model is intended for research use in 3D spatial reasoning, geometry-aware multimodal understanding, and benchmark evaluation on tasks such as distance estimation, size estimation, route planning, and appearance order reasoning.
-It is not validated for safety-critical use, robotics deployment, or real-world decision making without additional task-specific evaluation.
-## Usage
-This checkpoint relies on the SpatialStack codebase and custom model classes. The recommended way to run it is through the repository:
 ```bash
 git clone https://github.com/jzh15/SpatialStack.git
@@ -49,7 +109,9 @@ cd SpatialStack
 pip install -e . --no-deps
 ```
-Single-image inference:
 ```bash
 python scripts/inference/infer.py \
@@ -60,7 +122,7 @@ python scripts/inference/infer.py \
   --max-new-tokens 128
 ```
-VSI-Bench evaluation:
 ```bash
 MODEL_PATH=Journey9ni/SpatialStack-Qwen3.5-4B \
@@ -71,23 +133,13 @@ BENCHMARKS="vsibench" \
 bash scripts/evaluation/eval.sh
 ```
-## Benchmark Snapshot
-| Benchmark | Metric | Score |
-| --- | --- | ---: |
-| VSI-Bench | Average | 67.5 |
-| CV-Bench | Average | 85.5 |
-| CV-Bench | 3D | 92.2 |
-These numbers come from the SpatialStack project page and paper.
-## Limitations
-- The model depends on a separate geometry encoder (`facebook/VGGT-1B`) in addition to the language-vision backbone.
-- It is optimized for spatial reasoning benchmarks and may not be the best choice for general multimodal chat workloads.
-- Reported benchmark scores depend on the evaluation setup described in the SpatialStack codebase and paper.
-## Citation
 ```bibtex
 @article{zhang2026spatialstack,

 - multimodal
 - vision-language-model
 - 3d-spatial-reasoning
+- geometry
 - qwen3_5
 - vggt
 - image-text-to-text
+- cvpr2026
 language:
 - en
+model-index:
+- name: SpatialStack-Qwen3.5-4B
+  results:
+  - task:
+      type: visual-question-answering
+      name: 3D Spatial Reasoning
+    dataset:
+      type: vsibench
+      name: VSI-Bench
+    metrics:
+    - type: accuracy
+      name: Average
+      value: 67.5
+  - task:
+      type: visual-question-answering
+      name: 3D Spatial Reasoning
+    dataset:
+      type: cvbench
+      name: CV-Bench
+    metrics:
+    - type: accuracy
+      name: Average
+      value: 85.5
+    - type: accuracy
+      name: 3D
+      value: 92.2
 ---
+<div align="center">
+# 🌐 SpatialStack-Qwen3.5-4B
+### Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
+**CVPR 2026**
+<a href="https://arxiv.org/abs/2603.27437"><img src="https://img.shields.io/badge/📄_Paper-arXiv-b31b1b.svg"></a>&nbsp;
+<a href="https://spatial-stack.github.io/"><img src="https://img.shields.io/badge/🌐_Project-Site-4CAF50.svg"></a>&nbsp;
+<a href="https://github.com/jzh15/SpatialStack"><img src="https://img.shields.io/badge/💻_Code-GitHub-181717.svg"></a>&nbsp;
+<a href="https://huggingface.co/datasets/Journey9ni/SpatialStackData"><img src="https://img.shields.io/badge/📦_Data-HuggingFace-FFD21E.svg"></a>&nbsp;
+<a href="https://huggingface.co/Journey9ni/SpatialStack-Qwen3.5-4B"><img src="https://img.shields.io/badge/🤗_Model-HuggingFace-FF9D00.svg"></a>&nbsp;
+<img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg">
+</div>
+---
+<div align="center">
+<img src="https://spatial-stack.github.io/static/images/fig1_teaser_v6.png" alt="SpatialStack Teaser" width="85%">
+</div>
+## 📋 Overview
+**SpatialStack-Qwen3.5-4B** is a geometry-augmented vision-language model for **3D spatial reasoning**. It extends [Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) with a parallel [VGGT-1B](https://huggingface.co/facebook/VGGT-1B) geometry stream and uses **layered geometry-language fusion** to progressively align multi-level geometric and language features across model layers.
+> In this checkpoint, features from geometry encoder layers **[11, 17, 23]** are projected and injected into decoder layers **[0, 1, 2]**, preserving both fine local structure and higher-level spatial context.
+## 🏗️ Architecture
+<div align="center">
+<img src="https://spatial-stack.github.io/static/images/fig2_arch_v1.png" alt="SpatialStack Architecture" width="85%">
+</div>
+<table>
+<tr><td><b>Component</b></td><td><b>Detail</b></td></tr>
+<tr><td>Base Model</td><td><a href="https://huggingface.co/Qwen/Qwen3.5-4B">Qwen/Qwen3.5-4B</a></td></tr>
+<tr><td>Geometry Encoder</td><td><a href="https://huggingface.co/facebook/VGGT-1B">facebook/VGGT-1B</a></td></tr>
+<tr><td>Geometry Encoder Layers</td><td>[11, 17, 23]</td></tr>
+<tr><td>Geometry Fusion Layers</td><td>[0, 1, 2]</td></tr>
+<tr><td>Fusion Method</td><td>DeepStack Language-Add (<code>deepstack_language_add</code>)</td></tr>
+<tr><td>Geometry Merger</td><td>MLP (<code>mlp</code>)</td></tr>
+<tr><td>Precision</td><td>bfloat16</td></tr>
+</table>
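The layer pairing in the table can be illustrated with a small NumPy sketch. It is not the SpatialStack implementation: the function names, the two-layer ReLU MLP, the per-layer mergers, the choice to add geometry after (rather than before) each fusion layer, and the assumption that geometry tokens align one-to-one with visual-token positions are all illustrative guesses. Only the layer indices come from this card.

```python
import numpy as np

# Layer indices as listed in this model card.
ENCODER_LAYERS = [11, 17, 23]  # geometry encoder layers that are tapped
FUSION_LAYERS = [0, 1, 2]      # decoder layers that receive geometry


def mlp_merger(x, w1, b1, w2, b2):
    """Project geometry features to the decoder width (two layers, ReLU assumed)."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2


def fuse(hidden, geometry_feats, mergers, visual_mask, decoder_layers):
    """Run the decoder, adding projected geometry after each fusion layer.

    hidden:          (tokens, d_model) decoder input
    geometry_feats:  {encoder_layer: (n_visual, d_geo)} tapped features
    mergers:         {encoder_layer: (w1, b1, w2, b2)}, one MLP per tapped layer
    visual_mask:     (tokens,) bool, True at visual-token positions
    decoder_layers:  list of callables mapping hidden -> hidden
    """
    tap_for = dict(zip(FUSION_LAYERS, ENCODER_LAYERS))
    hidden = hidden.copy()  # leave the caller's array untouched
    for idx, layer in enumerate(decoder_layers):
        hidden = layer(hidden)
        if idx in tap_for:
            enc = tap_for[idx]
            # Additive fusion: only visual-token positions are modified.
            hidden[visual_mask] += mlp_merger(geometry_feats[enc], *mergers[enc])
    return hidden
```

With this pairing, decoder layer 0 receives features from encoder layer 11, layer 1 from layer 17, and layer 2 from layer 23; text-token positions pass through unchanged.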
+## 📊 Benchmark Results
+| Benchmark | Metric | Score |
+|:---|:---|:---:|
+| **VSI-Bench** | Average | **67.5** |
+| **CV-Bench** | Average | **85.5** |
+| **CV-Bench** | 3D | **92.2** |
+> Results from the [SpatialStack project page](https://spatial-stack.github.io/) and [paper](https://arxiv.org/abs/2603.27437).
+## 🚀 Quick Start
+### Installation
 ```bash
 git clone https://github.com/jzh15/SpatialStack.git
 cd SpatialStack
 pip install -e . --no-deps
 ```
+> For full environment setup (PyTorch, flash_attn, Qwen3.5 dependencies), see the [repo README](https://github.com/jzh15/SpatialStack#setup).
+### Single-Image Inference
 ```bash
 python scripts/inference/infer.py \
   --max-new-tokens 128
 ```
+### VSI-Bench Evaluation
 ```bash
 MODEL_PATH=Journey9ni/SpatialStack-Qwen3.5-4B \
 BENCHMARKS="vsibench" \
 bash scripts/evaluation/eval.sh
 ```
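For scripted runs, the same evaluation can be launched from Python. The sketch below only sets the `MODEL_PATH` and `BENCHMARKS` environment variables that appear in this card's evaluation command and then calls `scripts/evaluation/eval.sh`. The helper names `build_eval_env` and `run_eval` and the default `repo_dir` are hypothetical, and any further variables the script may expect are not covered here.

```python
import os
import subprocess


def build_eval_env(model_path, benchmarks="vsibench", base_env=None):
    """Return a copy of the environment with the evaluation variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MODEL_PATH"] = model_path
    env["BENCHMARKS"] = benchmarks
    return env


def run_eval(model_path, benchmarks="vsibench", repo_dir="SpatialStack"):
    """Run eval.sh from the repository root (repo_dir is an assumed clone path)."""
    return subprocess.run(
        ["bash", "scripts/evaluation/eval.sh"],
        cwd=repo_dir,
        env=build_eval_env(model_path, benchmarks),
        check=True,  # raise CalledProcessError if the script fails
    )
```

Calling `run_eval("Journey9ni/SpatialStack-Qwen3.5-4B")` from the directory that contains the clone would then mirror the shell command above.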
+## ⚠️ Limitations
+- Requires a separate geometry encoder ([VGGT-1B](https://huggingface.co/facebook/VGGT-1B)) alongside the vision-language backbone.
+- Optimized for spatial reasoning benchmarks; not intended for general-purpose multimodal chat.
+- Not validated for safety-critical use, robotics deployment, or real-world decision making.
+## 📝 Citation
 ```bibtex
 @article{zhang2026spatialstack,