Abstract
TerraScope is a unified vision-language model that enables pixel-grounded geospatial reasoning through modality-flexible and multi-temporal capabilities, evaluated on a new benchmark with detailed visual reasoning outputs.
Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses both modalities into the reasoning process when they are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset of 1 million samples with pixel-level masks embedded in reasoning chains, drawn from multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning, comprising six sub-tasks that evaluate both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
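A minimal sketch of how such a dual criterion (answer accuracy alongside mask quality) could be scored, assuming binary masks and a mean-IoU report; the function names and the side-by-side reporting are illustrative assumptions, not TerraScope-Bench's actual protocol.

```python
# Illustrative dual scoring: answer accuracy plus mean mask IoU.
# This sketches the *idea* of joint evaluation, not the paper's exact metric.
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 1.0

def dual_score(answers_correct: list, ious: list) -> dict:
    """Report answer accuracy alongside mean mask IoU."""
    accuracy = sum(answers_correct) / len(answers_correct)
    mean_iou = sum(ious) / len(ious)
    return {"answer_accuracy": accuracy, "mean_mask_iou": mean_iou}
```

Reporting both numbers separately, rather than a single blended score, makes it visible when a model answers correctly without genuinely grounding its reasoning in the right pixels.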
Community
CVPR 2026: Pixel-grounded reasoning for Earth Observation.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API.
- Perception-Aware Multimodal Spatial Reasoning from Monocular Images (2026)
- VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing (2026)
- WalkGPT: Grounded Vision-Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation (2026)
- GeoSeg: Training-Free Reasoning-Driven Segmentation in Remote Sensing Imagery (2026)
- OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents (2026)
- EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery (2026)
- GroundSet: A Cadastral-Grounded Dataset for Spatial Understanding with Vector Data (2026)
TerraScope not only improves VLM training through the Terra-CoT dataset, but also equips the model with pixel-grounded reasoning by aligning its reasoning chains with pixel-level segmentation masks, thereby enabling multi-temporal change analysis and multimodal fusion.
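To make the aligned reasoning concrete, below is a hypothetical illustration of how a Terra-CoT-style sample might interleave chain-of-thought steps with references to segmentation masks; every field name and value here is an assumption for illustration, not the dataset's actual schema.

```python
# Hypothetical Terra-CoT-style sample: reasoning steps carry optional
# references to pixel-level masks. Field names are assumed, not official.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ReasoningStep:
    text: str                      # one step of the chain of thought
    mask_id: Optional[str] = None  # optional reference to a pixel-level mask

@dataclass
class TerraCoTSample:
    question: str
    modalities: list               # e.g. ["optical"] or ["optical", "sar"]
    timestamps: list               # acquisition dates for multi-temporal input
    steps: list = field(default_factory=list)
    answer: str = ""

sample = TerraCoTSample(
    question="Which structures were demolished between the two acquisitions?",
    modalities=["optical", "sar"],
    timestamps=["2023-04", "2024-04"],
    steps=[
        ReasoningStep("Locate built-up areas in the earlier image.", "mask_t0"),
        ReasoningStep("Locate built-up areas in the later image.", "mask_t1"),
        ReasoningStep("Their difference marks demolished structures.", "mask_diff"),
    ],
    answer="The structures in the north-east quadrant were demolished.",
)
```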
Suggestions:
1. Although TerraScope achieves pixel-level grounding, its boundary precision could be further improved by incorporating boundary-aware losses or high-resolution feature fusion to reduce mask ambiguity (see the first sketch after this list).
2. While TerraScope supports adaptive multimodal fusion, explicit cross-modal alignment (e.g., contrastive learning or a shared latent space) could reduce discrepancies between optical and SAR representations (see the second sketch after this list).
3. Incorporating uncertainty estimation (e.g., probabilistic masks or confidence scores) could improve reliability in complex geospatial scenarios (see the third sketch after this list).
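First sketch: one common way to realize a boundary-aware loss is a morphological gradient (dilation minus erosion via max-pooling) that extracts soft boundary maps, whose mismatch is then penalized. This is not TerraScope's loss; the kernel size and the L1 penalty are illustrative choices.

```python
# Boundary-aware loss sketch: penalize mismatch between soft boundary maps
# extracted from predicted and ground-truth masks via a morphological gradient.
import torch
import torch.nn.functional as F

def soft_boundary(mask: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Morphological gradient (dilation minus erosion) as a soft boundary map.
    mask: (N, 1, H, W) tensor of probabilities in [0, 1]."""
    pad = k // 2
    dilated = F.max_pool2d(mask, k, stride=1, padding=pad)
    eroded = -F.max_pool2d(-mask, k, stride=1, padding=pad)
    return dilated - eroded

def boundary_aware_loss(pred_probs: torch.Tensor,
                        gt_mask: torch.Tensor) -> torch.Tensor:
    """L1 mismatch between boundary maps, intended as an auxiliary term
    added to the usual region loss (e.g., Dice or cross-entropy)."""
    return (soft_boundary(pred_probs) - soft_boundary(gt_mask.float())).abs().mean()
```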
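Second sketch: a symmetric InfoNCE objective, in the style of CLIP, that pulls co-registered optical and SAR patch embeddings together in a shared latent space. Again an illustration of the suggestion, not a component of TerraScope; the temperature value is a conventional default.

```python
# Cross-modal contrastive alignment sketch: matching optical/SAR pairs share
# a row index, so positives sit on the diagonal of the similarity matrix.
import torch
import torch.nn.functional as F

def cross_modal_infonce(opt_emb: torch.Tensor,
                        sar_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """opt_emb, sar_emb: (batch, dim) embeddings of co-registered patches."""
    opt = F.normalize(opt_emb, dim=-1)
    sar = F.normalize(sar_emb, dim=-1)
    logits = opt @ sar.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(opt.size(0), device=opt.device)
    # Score both retrieval directions (optical->SAR and SAR->optical).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```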
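Third sketch: deriving a probabilistic mask and a scalar confidence from raw segmentation logits, so that ambiguous masks can be flagged downstream. The threshold and the confidence formula are illustrative assumptions.

```python
# Uncertainty sketch: keep the probabilistic mask, and summarize how far
# foreground pixels sit from the decision boundary as a confidence score.
import torch

def mask_with_confidence(logits: torch.Tensor, threshold: float = 0.5):
    """logits: (H, W) raw logits from the segmentation head."""
    probs = torch.sigmoid(logits)       # probabilistic mask in [0, 1]
    mask = probs > threshold            # hard mask for downstream use
    if mask.any():
        # Mean margin of foreground probabilities above the boundary,
        # rescaled to [0, 1]; low values flag ambiguous masks.
        confidence = float((probs[mask] - threshold).mean() / (1.0 - threshold))
    else:
        confidence = 0.0
    return mask, probs, confidence
```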
TerraScope represents a significant step toward pixel-grounded geospatial reasoning, but there remains room for improvement in boundary precision, cross-modal alignment, temporal modeling, and fine-grained reasoning.