Abstract
TerraScope is a unified vision-language model that enables pixel-grounded geospatial reasoning through modality-flexible and multi-temporal capabilities, evaluated on a new benchmark with detailed visual reasoning outputs.
Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses both modalities into the reasoning process when they are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset of 1 million samples with pixel-level masks embedded in reasoning chains, drawn from multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning, comprising six sub-tasks that evaluate both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
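A minimal sketch of how such a dual criterion (answer accuracy alongside mask quality) could be scored, assuming binary masks and a mean-IoU report; the function names and the side-by-side reporting are illustrative assumptions, not TerraScope-Bench's actual protocol.

```python
# Illustrative dual scoring: answer accuracy plus mean mask IoU.
# This sketches the *idea* of joint evaluation, not the paper's exact metric.
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 1.0

def dual_score(answers_correct: list, ious: list) -> dict:
    """Report answer accuracy alongside mean mask IoU."""
    accuracy = sum(answers_correct) / len(answers_correct)
    mean_iou = sum(ious) / len(ious)
    return {"answer_accuracy": accuracy, "mean_mask_iou": mean_iou}
```

Reporting both numbers separately, rather than a single blended score, makes it visible when a model answers correctly without genuinely grounding its reasoning in the right pixels.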
Community
CVPR 2026: Pixel-grounded reasoning for Earth Observation.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API.
- Perception-Aware Multimodal Spatial Reasoning from Monocular Images (2026)
- VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing (2026)
- WalkGPT: Grounded Vision-Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation (2026)
- GeoSeg: Training-Free Reasoning-Driven Segmentation in Remote Sensing Imagery (2026)
- OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents (2026)
- EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery (2026)
- GroundSet: A Cadastral-Grounded Dataset for Spatial Understanding with Vector Data (2026)
TerraScope not only improves VLM training through the Terra-CoT dataset, but also equips the model with pixel-grounded reasoning by aligning its reasoning chains with pixel-level segmentation masks, thereby enabling multi-temporal change analysis and multimodal fusion.
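To make the aligned reasoning concrete, below is a hypothetical illustration of how a Terra-CoT-style sample might interleave chain-of-thought steps with references to segmentation masks; every field name and value here is an assumption for illustration, not the dataset's actual schema.

```python
# Hypothetical Terra-CoT-style sample: reasoning steps carry optional
# references to pixel-level masks. Field names are assumed, not official.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ReasoningStep:
    text: str                      # one step of the chain of thought
    mask_id: Optional[str] = None  # optional reference to a pixel-level mask

@dataclass
class TerraCoTSample:
    question: str
    modalities: list               # e.g. ["optical"] or ["optical", "sar"]
    timestamps: list               # acquisition dates for multi-temporal input
    steps: list = field(default_factory=list)
    answer: str = ""

sample = TerraCoTSample(
    question="Which structures were demolished between the two acquisitions?",
    modalities=["optical", "sar"],
    timestamps=["2023-04", "2024-04"],
    steps=[
        ReasoningStep("Locate built-up areas in the earlier image.", "mask_t0"),
        ReasoningStep("Locate built-up areas in the later image.", "mask_t1"),
        ReasoningStep("Their difference marks demolished structures.", "mask_diff"),
    ],
    answer="The structures in the north-east quadrant were demolished.",
)
```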
Suggestions:
1. Although TerraScope achieves pixel-level grounding, its boundary precision could be further improved by incorporating boundary-aware losses or high-resolution feature fusion to reduce mask ambiguity (see the first sketch after this list).
2. While TerraScope supports adaptive multimodal fusion, explicit cross-modal alignment (e.g., contrastive learning or a shared latent space) could reduce discrepancies between optical and SAR representations (see the second sketch after this list).
3. Incorporating uncertainty estimation (e.g., probabilistic masks or confidence scores) could improve reliability in complex geospatial scenarios (see the third sketch after this list).
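First sketch: one common way to realize a boundary-aware loss is a morphological gradient (dilation minus erosion via max-pooling) that extracts soft boundary maps, whose mismatch is then penalized. This is not TerraScope's loss; the kernel size and the L1 penalty are illustrative choices.

```python
# Boundary-aware loss sketch: penalize mismatch between soft boundary maps
# extracted from predicted and ground-truth masks via a morphological gradient.
import torch
import torch.nn.functional as F

def soft_boundary(mask: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Morphological gradient (dilation minus erosion) as a soft boundary map.
    mask: (N, 1, H, W) tensor of probabilities in [0, 1]."""
    pad = k // 2
    dilated = F.max_pool2d(mask, k, stride=1, padding=pad)
    eroded = -F.max_pool2d(-mask, k, stride=1, padding=pad)
    return dilated - eroded

def boundary_aware_loss(pred_probs: torch.Tensor,
                        gt_mask: torch.Tensor) -> torch.Tensor:
    """L1 mismatch between boundary maps, intended as an auxiliary term
    added to the usual region loss (e.g., Dice or cross-entropy)."""
    return (soft_boundary(pred_probs) - soft_boundary(gt_mask.float())).abs().mean()
```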
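Second sketch: a symmetric InfoNCE objective, in the style of CLIP, that pulls co-registered optical and SAR patch embeddings together in a shared latent space. Again an illustration of the suggestion, not a component of TerraScope; the temperature value is a conventional default.

```python
# Cross-modal contrastive alignment sketch: matching optical/SAR pairs share
# a row index, so positives sit on the diagonal of the similarity matrix.
import torch
import torch.nn.functional as F

def cross_modal_infonce(opt_emb: torch.Tensor,
                        sar_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """opt_emb, sar_emb: (batch, dim) embeddings of co-registered patches."""
    opt = F.normalize(opt_emb, dim=-1)
    sar = F.normalize(sar_emb, dim=-1)
    logits = opt @ sar.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(opt.size(0), device=opt.device)
    # Score both retrieval directions (optical->SAR and SAR->optical).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```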
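Third sketch: deriving a probabilistic mask and a scalar confidence from raw segmentation logits, so that ambiguous masks can be flagged downstream. The threshold and the confidence formula are illustrative assumptions.

```python
# Uncertainty sketch: keep the probabilistic mask, and summarize how far
# foreground pixels sit from the decision boundary as a confidence score.
import torch

def mask_with_confidence(logits: torch.Tensor, threshold: float = 0.5):
    """logits: (H, W) raw logits from the segmentation head."""
    probs = torch.sigmoid(logits)       # probabilistic mask in [0, 1]
    mask = probs > threshold            # hard mask for downstream use
    if mask.any():
        # Mean margin of foreground probabilities above the boundary,
        # rescaled to [0, 1]; low values flag ambiguous masks.
        confidence = float((probs[mask] - threshold).mean() / (1.0 - threshold))
    else:
        confidence = 0.0
    return mask, probs, confidence
```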
TerraScope represents a significant step toward pixel-grounded geospatial reasoning, but there remains room for improvement in boundary precision, cross-modal alignment, temporal modeling, and fine-grained reasoning.