Add pipeline tag and improve model card (#1)
Opened by nielsr (HF Staff)

README.md CHANGED
|
@@ -1,17 +1,52 @@
|
|
| 1 |
---
|
| 2 |
-
license: mit
|
| 3 |
-
datasets:
|
| 4 |
-
- nyu-visionx/VSI-Bench
|
| 5 |
-
- lmms-lab-si/EASI-Leaderboard-Requests
|
| 6 |
base_model:
|
| 7 |
- Qwen/Qwen3-VL-8B-Instruct
|
| 8 |
- Qwen/Qwen2.5-VL-7B-Instruct
|
| 9 |
---
|
| 10 |
|
| 11 |
-
# GeoThinker
|
| 12 |
|
| 13 |
-
|
| 14 |
|
| 15 |
-
|
| 16 |
|
| 17 |
-
|
| 1 |
---
|
| 2 |
base_model:
|
| 3 |
- Qwen/Qwen3-VL-8B-Instruct
|
| 4 |
- Qwen/Qwen2.5-VL-7B-Instruct
|
| 5 |
+
datasets:
|
| 6 |
+
- nyu-visionx/VSI-Bench
|
| 7 |
+
- lmms-lab-si/EASI-Leaderboard-Requests
|
| 8 |
+
license: mit
|
| 9 |
+
pipeline_tag: video-text-to-text
|
| 10 |
---
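One way to check that the new `pipeline_tag` survives in the card's front matter is a small stdlib-only reader. This is a minimal sketch, not a full YAML parser: it assumes flat keys with either a scalar value or a `- item` list, which is all this front matter uses. The names `read_front_matter`, `CARD`, and `META` are hypothetical.

```python
def read_front_matter(card_text):
    """Read flat `key: value` pairs and `- item` lists from a front matter block.

    Assumption: no nesting, quoting, or multi-line scalars. Use a real YAML
    parser for anything beyond this simple layout.
    """
    lines = card_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta, current_key = {}, None
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter ends the block
            break
        if line.startswith("- ") and isinstance(meta.get(current_key), list):
            meta[current_key].append(line[2:].strip())
        elif ":" in line:
            key, _, value = line.partition(":")
            current_key = key.strip()
            value = value.strip()
            # An empty value means a list follows on the next lines.
            meta[current_key] = value if value else []
    return meta


# Front matter as it appears on the new side of this diff.
CARD = """---
base_model:
- Qwen/Qwen3-VL-8B-Instruct
- Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- nyu-visionx/VSI-Bench
- lmms-lab-si/EASI-Leaderboard-Requests
license: mit
pipeline_tag: video-text-to-text
---
"""

META = read_front_matter(CARD)
```
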
|
| 11 |
|
| 12 |
+
# GeoThinker: Active Geometry Integration for Spatial Reasoning
|
| 13 |
+
|
| 14 |
+
GeoThinker is a framework that moves spatial reasoning in Multimodal Large Language Models (MLLMs) from passive fusion to active perception. Rather than mixing features indiscriminately, it lets the model selectively retrieve geometric evidence from a 3D encoder, conditioned on what its internal reasoning requires.
|
| 15 |
+
|
| 16 |
+
- **Paper:** [Thinking with Geometry: Active Geometry Integration for Spatial Reasoning](https://huggingface.co/papers/2602.06037)
|
| 17 |
+
- **Project Page:** [li-hao-yuan.github.io/GeoThinker/](https://li-hao-yuan.github.io/GeoThinker/)
|
| 18 |
+
- **Repository:** [github.com/Li-Hao-yuan/GeoThinker](https://github.com/Li-Hao-yuan/GeoThinker)
|
| 19 |
+
|
| 20 |
+
## Architecture Overview
|
| 21 |
+
|
| 22 |
+
GeoThinker gives MLLMs 3D understanding through three components:
|
| 23 |
+
- **Dual-Encoder Processing:** Combines a 2D vision encoder for high-level semantic features with a 3D visual geometry encoder (VGGT) for capturing fine-grained spatial structures.
|
| 24 |
+
- **Spatial-Grounded Fusion (SGF):** Uses frame-strict cross-attention to ensure visual tokens query geometric cues from the corresponding frame, maintaining spatial consistency.
|
| 25 |
+
- **Importance Gating:** Calibrates per-frame attention toward task-relevant structures (like object boundaries) while filtering redundant noise.
|
| 26 |
+
|
| 27 |
+
## Performance Highlights
|
| 28 |
+
|
| 29 |
+
GeoThinker reports the following results on spatial-intelligence benchmarks:
|
| 30 |
+
- **VSI-Bench:** Achieves a peak score of **72.6%**.
|
| 31 |
+
- **EASI-leaderboard:** Achieves an average score of **55.0%**, ranking 6th.
|
| 32 |
+
- Demonstrates robust generalization across complex scenarios, including embodied referring and autonomous driving.
|
| 33 |
+
|
| 34 |
+
## Model Variants
|
| 35 |
+
|
| 36 |
+
| Backbone | Training Regime | Access |
|
| 37 |
+
|---|---|---|
|
| 38 |
+
| Qwen2.5-VL-7B | Vanilla | [GeoThinker-Qwen2.5VL-7B-Vanilla](https://huggingface.co/lihy285/GeoThinker/tree/main/QeoThinker-VGGT-Qwen25VL-7B-Vanilla) |
|
| 39 |
+
| Qwen2.5-VL-7B | Scaled | [GeoThinker-Qwen2.5VL-7B-Scaled](https://huggingface.co/lihy285/GeoThinker/tree/main/QeoThinker-VGGT-Qwen25VL-7B-Scaled) |
|
| 40 |
+
| Qwen3-VL-8B | Scaled | [GeoThinker-Qwen3VL-8B-Scaled](https://huggingface.co/lihy285/GeoThinker/tree/main/QeoThinker-VGGT-Qwen3VL-8B-Scaled) |
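Since all three variants live as subfolders of one repository, a single variant can be fetched without downloading the others. The sketch below assumes `huggingface_hub` is installed and uses its `snapshot_download` with `allow_patterns`. The folder names are copied verbatim from the links in the table, and the helper names are hypothetical. It only downloads files and makes no assumption about how the checkpoints are loaded afterwards.

```python
REPO_ID = "lihy285/GeoThinker"

# Folder names copied verbatim from the links in the table above.
VARIANT_FOLDERS = {
    ("Qwen2.5-VL-7B", "Vanilla"): "QeoThinker-VGGT-Qwen25VL-7B-Vanilla",
    ("Qwen2.5-VL-7B", "Scaled"): "QeoThinker-VGGT-Qwen25VL-7B-Scaled",
    ("Qwen3-VL-8B", "Scaled"): "QeoThinker-VGGT-Qwen3VL-8B-Scaled",
}


def variant_patterns(backbone, regime):
    """Return the glob patterns that select one variant's subfolder."""
    folder = VARIANT_FOLDERS[(backbone, regime)]
    return [f"{folder}/*"]


def download_variant(backbone, regime, local_dir=None):
    """Download a single variant and return the local snapshot path."""
    # Imported lazily so the helpers above work without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=variant_patterns(backbone, regime),
        local_dir=local_dir,
    )


# Example (requires network access):
# path = download_variant("Qwen3-VL-8B", "Scaled")
```
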
|
| 41 |
|
| 42 |
+
## Citation
|
| 43 |
|
| 44 |
+
If you find this work useful, please consider citing:
|
| 45 |
|
| 46 |
+
```bibtex
|
| 47 |
+
@article{li2026thinking,
|
| 48 |
+
title={Thinking with Geometry: Active Geometry Integration for Spatial Reasoning},
|
| 49 |
+
  author={Li, Haoyuan and Cao, Qihang and Tang, Tao and Xiang, Kun and Guo, Zihan and Han, Jianhua and Bian, JiaWang and Xu, Hang and Liang, Xiaodan},
|
| 50 |
+
year={2026}
|
| 51 |
+
}
|
| 52 |
+
```
|