metadata
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- geometry
- math
- vision-language
- reasoning
GeoFocus-3B
GeoFocus is a framework designed to enhance multimodal geometry reasoning in Large Multimodal Models (LMMs). It addresses the challenges of geometry problem-solving by focusing on both global shape recognition and intricate local relationships through two core modules:
- Critical Local Perceptor: Automatically identifies and emphasizes critical local structures (e.g., angles, parallel lines, comparative distances) using theory-based perception templates.
- VertexLang: A compact topology formal language that encodes global figures through vertex coordinates and connectivity relations, improving efficiency and accuracy compared to traditional code-based encodings.
Model Details
- Architecture: Based on Qwen2.5-VL
- Paper: GeoFocus: Blending Efficient Global-to-Local Perception for Multimodal Geometry Problem-Solving
- Repository: GitHub - dle666/GeoFocus
Evaluation Results
GeoFocus demonstrates significant improvements over specialized models across major geometry benchmarks:
| Model Name | Geo3K | GeoQA | Formalgeo7k |
|---|---|---|---|
| GeoFocus-3B | 50.4 | 64.3 | 55.4 |
| GeoFocus-7B | 55.3 | 71.9 | 63.5 |
Environment and Installation
To use this model, ensure you have the following requirements installed:
- Python 3.9+
- transformers>=4.51.0
- flash-attn>=2.4.3
- vllm>=0.8.3
pip install transformers>=4.51.0 flash-attn>=2.4.3 vllm>=0.8.3
Citation
@article{chen2026geofocus,
title={GeoFocus: Blending Efficient Global-to-Local Perception for Multimodal Geometry Problem-Solving},
author={Jiuhai Chen and Jianwei Yang and Haiping Wu and Dianqi Li and Jianfeng Gao and Tianyi Zhou and Bin Xiao},
journal={arXiv preprint arXiv:2602.08524},
year={2026}
}