Journey9ni committed on
Commit 777b228 · verified · 1 Parent(s): 66f7478

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +86 -34

README.md CHANGED
@@ -8,40 +8,100 @@ tags:
 - multimodal
 - vision-language-model
 - 3d-spatial-reasoning
 - qwen3_5
 - vggt
 - image-text-to-text
 language:
 - en
 ---
 
- # SpatialStack-Qwen3.5-4B
-
- [Project Page](https://spatial-stack.github.io/) | [Paper](https://arxiv.org/abs/2603.27437) | [Code](https://github.com/jzh15/SpatialStack)
-
- SpatialStack-Qwen3.5-4B is a geometry-augmented vision-language model for 3D spatial reasoning. It builds on `Qwen/Qwen3.5-4B` and adds a parallel `facebook/VGGT-1B` geometry stream with layered geometry-language fusion. In this checkpoint, geometry features from encoder layers `[11, 17, 23]` are projected and injected into decoder layers `[0, 1, 2]` to preserve both fine local structure and higher-level spatial context.
-
- The SpatialStack project page reports the Qwen3.5-based model at `67.5` average on VSI-Bench and `85.5` average / `92.2` 3D on CV-Bench.
-
- ## Model Details
-
- - Base model: `Qwen/Qwen3.5-4B`
- - Geometry encoder: `facebook/VGGT-1B`
- - Geometry encoder layers: `[11, 17, 23]`
- - Geometry fusion layers: `[0, 1, 2]`
- - Fusion method: `deepstack_language_add`
- - Geometry merger: `mlp`
- - Precision: `bfloat16`
 
- ## Intended Use
-
- This model is intended for research use in 3D spatial reasoning, geometry-aware multimodal understanding, and benchmark evaluation on tasks such as distance estimation, size estimation, route planning, and appearance order reasoning.
-
- It is not validated for safety-critical use, robotics deployment, or real-world decision making without additional task-specific evaluation.
-
- ## Usage
-
- This checkpoint relies on the SpatialStack codebase and custom model classes. The recommended way to run it is through the repository:

 ```bash
 git clone https://github.com/jzh15/SpatialStack.git
@@ -49,7 +109,9 @@ cd SpatialStack
 pip install -e . --no-deps
 ```

- Single-image inference:

 ```bash
 python scripts/inference/infer.py \
@@ -60,7 +122,7 @@ python scripts/inference/infer.py \
 --max-new-tokens 128
 ```

- VSI-Bench evaluation:

 ```bash
 MODEL_PATH=Journey9ni/SpatialStack-Qwen3.5-4B \
@@ -71,23 +133,13 @@ BENCHMARKS="vsibench" \
 bash scripts/evaluation/eval.sh
 ```
73
 
74
- ## Benchmark Snapshot
75
-
76
- | Benchmark | Metric | Score |
77
- | --- | --- | ---: |
78
- | VSI-Bench | Average | 67.5 |
79
- | CV-Bench | Average | 85.5 |
80
- | CV-Bench | 3D | 92.2 |
81
-
82
- These numbers come from the SpatialStack project page and paper.
83
-
84
- ## Limitations
85
 
86
- - The model depends on a separate geometry encoder (`facebook/VGGT-1B`) in addition to the language-vision backbone.
87
- - It is optimized for spatial reasoning benchmarks and may not be the best choice for general multimodal chat workloads.
88
- - Reported benchmark scores depend on the evaluation setup described in the SpatialStack codebase and paper.
89
 
90
- ## Citation
91
 
92
  ```bibtex
93
  @article{zhang2026spatialstack,
 
 - multimodal
 - vision-language-model
 - 3d-spatial-reasoning
+ - geometry
 - qwen3_5
 - vggt
 - image-text-to-text
+ - cvpr2026
 language:
 - en
+ model-index:
+ - name: SpatialStack-Qwen3.5-4B
+   results:
+   - task:
+       type: visual-question-answering
+       name: 3D Spatial Reasoning
+     dataset:
+       type: vsibench
+       name: VSI-Bench
+     metrics:
+     - type: accuracy
+       name: Average
+       value: 67.5
+   - task:
+       type: visual-question-answering
+       name: 3D Spatial Reasoning
+     dataset:
+       type: cvbench
+       name: CV-Bench
+     metrics:
+     - type: accuracy
+       name: Average
+       value: 85.5
+     - type: accuracy
+       name: 3D
+       value: 92.2
 ---
+ <div align="center">
+
+ # 🌐 SpatialStack-Qwen3.5-4B
+
+ ### Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
+
+ **CVPR 2026**
+
+ <a href="https://arxiv.org/abs/2603.27437"><img src="https://img.shields.io/badge/📄_Paper-arXiv-b31b1b.svg"></a>&nbsp;
+ <a href="https://spatial-stack.github.io/"><img src="https://img.shields.io/badge/🌐_Project-Site-4CAF50.svg"></a>&nbsp;
+ <a href="https://github.com/jzh15/SpatialStack"><img src="https://img.shields.io/badge/💻_Code-GitHub-181717.svg"></a>&nbsp;
+ <a href="https://huggingface.co/datasets/Journey9ni/SpatialStackData"><img src="https://img.shields.io/badge/📦_Data-HuggingFace-FFD21E.svg"></a>&nbsp;
+ <a href="https://huggingface.co/Journey9ni/SpatialStack-Qwen3.5-4B"><img src="https://img.shields.io/badge/🤗_Model-HuggingFace-FF9D00.svg"></a>&nbsp;
+ <img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg">
+
+ </div>
+ ---
+
+ <div align="center">
+ <img src="https://spatial-stack.github.io/static/images/fig1_teaser_v6.png" alt="SpatialStack Teaser" width="85%">
+ </div>
+
+ ## 📋 Overview
+
+ **SpatialStack-Qwen3.5-4B** is a geometry-augmented vision-language model designed for **3D spatial reasoning**. It extends [Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) with a parallel [VGGT-1B](https://huggingface.co/facebook/VGGT-1B) geometry stream, using a novel **layered geometry-language fusion** mechanism that progressively aligns multi-level geometric and language features across model layers.
+
+ > Geometry features from encoder layers **[11, 17, 23]** are projected and injected into decoder layers **[0, 1, 2]**, preserving both fine local structure and higher-level spatial context.
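
As a rough illustration of this layered injection, here is a minimal pure-Python sketch under assumed semantics: the one-to-one pairing of tapped encoder layers with early decoder layers, the `merge` stand-in for the MLP merger, and all names are ours, not the SpatialStack code.

```python
# Sketch of DeepStack-style "language-add" fusion (assumed semantics, NOT the
# SpatialStack implementation): geometry features tapped from encoder layers
# [11, 17, 23] are paired one-to-one with decoder layers [0, 1, 2], projected
# by a per-layer merger, and added to the decoder hidden states.

ENCODER_TAP_LAYERS = [11, 17, 23]  # geometry-encoder layers supplying features
DECODER_FUSE_LAYERS = [0, 1, 2]    # early decoder layers receiving them

def merge(feat, tap_index):
    """Stand-in for the per-layer MLP merger (here: a simple scaling)."""
    return [x * (tap_index + 1) for x in feat]

def fuse(hidden, decoder_layer, geo_feats):
    """Add the projected geometry stream into one decoder layer, if fused."""
    if decoder_layer in DECODER_FUSE_LAYERS:
        i = DECODER_FUSE_LAYERS.index(decoder_layer)
        hidden = [h + g for h, g in zip(hidden, merge(geo_feats[i], i))]
    return hidden

# Toy "features": one vector per tapped encoder layer, two tokens each.
geo_feats = {0: [1, 1], 1: [2, 2], 2: [3, 3]}

hidden = [0, 0]
for layer in range(4):  # layers 3 and beyond pass through unchanged
    hidden = fuse(hidden, layer, geo_feats)
print(hidden)  # [14, 14]
```

The real model operates on full token tensors with learned MLP mergers in `bfloat16`; this toy only mirrors the layer pairing described above.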
+
+ ## 🏗️ Architecture
+
+ <div align="center">
+ <img src="https://spatial-stack.github.io/static/images/fig2_arch_v1.png" alt="SpatialStack Architecture" width="85%">
+ </div>
+
+ <table>
+ <tr><td><b>Component</b></td><td><b>Detail</b></td></tr>
+ <tr><td>Base Model</td><td><a href="https://huggingface.co/Qwen/Qwen3.5-4B">Qwen/Qwen3.5-4B</a></td></tr>
+ <tr><td>Geometry Encoder</td><td><a href="https://huggingface.co/facebook/VGGT-1B">facebook/VGGT-1B</a></td></tr>
+ <tr><td>Encoder Layers</td><td>[11, 17, 23]</td></tr>
+ <tr><td>Fusion Layers</td><td>[0, 1, 2]</td></tr>
+ <tr><td>Fusion Method</td><td>DeepStack Language-Add</td></tr>
+ <tr><td>Geometry Merger</td><td>MLP</td></tr>
+ <tr><td>Precision</td><td>bfloat16</td></tr>
+ </table>
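
The earlier model card exposed these same settings as raw values (`deepstack_language_add`, `mlp`, `bfloat16`). A hypothetical dict mirroring them might look like the following; the key names are illustrative assumptions, not the repository's actual config schema.

```python
# Hypothetical configuration mirroring the architecture table; key names are
# assumptions for illustration, not SpatialStack's actual config schema.
spatialstack_config = {
    "base_model": "Qwen/Qwen3.5-4B",
    "geometry_encoder": "facebook/VGGT-1B",
    "geometry_encoder_layers": [11, 17, 23],
    "geometry_fusion_layers": [0, 1, 2],
    "fusion_method": "deepstack_language_add",  # DeepStack language-add
    "geometry_merger": "mlp",
    "torch_dtype": "bfloat16",
}

# One tapped encoder layer feeds each fusion layer, so the lists must align.
assert len(spatialstack_config["geometry_encoder_layers"]) == len(
    spatialstack_config["geometry_fusion_layers"]
)
```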
91
+
92
+ ## πŸ“Š Benchmark Results
93
+
94
+ | Benchmark | Metric | Score |
95
+ |:---|:---|:---:|
96
+ | **VSI-Bench** | Average | **67.5** |
97
+ | **CV-Bench** | Average | **85.5** |
98
+ | **CV-Bench** | 3D | **92.2** |
99
 
100
+ > Results from the [SpatialStack project page](https://spatial-stack.github.io/) and [paper](https://arxiv.org/abs/2603.27437).
101
+
102
+ ## πŸš€ Quick Start
103
+
104
+ ### Installation
105
 
106
 ```bash
 git clone https://github.com/jzh15/SpatialStack.git
 cd SpatialStack
 pip install -e . --no-deps
 ```

+ > For full environment setup (PyTorch, flash_attn, Qwen3.5 dependencies), see the [repo README](https://github.com/jzh15/SpatialStack#setup).
+
+ ### Single-Image Inference

 ```bash
 python scripts/inference/infer.py \

 --max-new-tokens 128
 ```
+
+ ### VSI-Bench Evaluation

 ```bash
 MODEL_PATH=Journey9ni/SpatialStack-Qwen3.5-4B \

 bash scripts/evaluation/eval.sh
 ```
+ ## ⚠️ Limitations
+
+ - Requires a separate geometry encoder ([VGGT-1B](https://huggingface.co/facebook/VGGT-1B)) alongside the vision-language backbone.
+ - Optimized for spatial reasoning benchmarks; not intended for general-purpose multimodal chat.
+ - Not validated for safety-critical use, robotics deployment, or real-world decision making.
+
+ ## 📝 Citation

 ```bibtex
 @article{zhang2026spatialstack,