๐ŸŒ GroundSet Baseline: LLaVA-1.6 for Spatial Understanding in Earth Observation


This repository hosts the official baseline model for GroundSet, a large-scale Earth Observation dataset grounded in verifiable cadastral vector data.

This baseline is fine-tuned on 1.8 million instructions from GroundSet's fine-tuning split.

๐ŸŽฏ Supported Spatial Tasks

The model has been explicitly trained to handle highly granular semantic categories (135 classes, including specific crop types, heritage sites, and civil infrastructure) across the following tasks:

  • Captioning: Generating coherent scene descriptions.
  • Localized Classification: Classifying given regions (bounding boxes or polygons).
  • Object Detection: Localizing specific classes using Horizontal Bounding Boxes (HBB).
  • Segmentation: Localizing target classes using polygonal masks.
  • Referring Expression Comprehension (REC): Localizing objects based on textual descriptions.
  • Visual Question Answering (VQA): Binary verification of object presence.

๐Ÿ—๏ธ Model Architecture & Training

The model is built upon the LLaVA-1.6 architecture (using the Vicuna 7B language model). This architecture relies on dynamic resolution (AnyRes) for processing high-resolution aerial imagery.

Training Details

  • Hardware: Fine-tuned on 8x A100 (80GB) GPUs for 1 epoch (approx. 72 hours).
  • Method: Parameter-efficient fine-tuning using LoRA (Low-Rank Adaptation).
  • LoRA Config: Rank $r=32$, alpha $\alpha=64$, and dropout $p=0.1$, applied to all linear layers of the language model.
  • Projector & Vision Tower: The multi-modal projector was fully fine-tuned, while the vision tower remained frozen.
  • Optimization: AdamW optimizer (batch size 128), cosine learning rate scheduler (peak LR $2\times10^{-4}$, warmup ratio 0.03), BFloat16 precision, DeepSpeed ZeRO-2, and FlashAttention-2.
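With the Hugging Face peft library, the LoRA setup above could be expressed along the following lines. This is a sketch, not the training script: the target_modules list is an assumption enumerating the linear layers of a Vicuna/LLaMA-style decoder.

```python
from peft import LoraConfig

# Hyperparameters as reported above (r=32, alpha=64, dropout=0.1).
# target_modules is an assumption: the attention and MLP linear layers
# of a Vicuna/LLaMA-style language model.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    task_type="CAUSAL_LM",
)
```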

๐Ÿ’ก Note: Pixel coordinates are discretized into [0-1000] bins.


๐Ÿ“Š Evaluation & Results

The GroundSet baseline establishes a strong reference point for spatial understanding in Earth Observation.

Key Performance Highlights

The model is evaluated zero-shot on GroundSet's test set, achieving the following results:

  • Classification: 94.18% (Acc@0.8), outperforming Gemini-2.5 Flash (49.84%) and LLaVA-1.6 base (29.20%).
  • Segmentation: 39.45 (F1@0.5), outperforming PaliGemma-2 (17.47) and Ferret (13.87).
  • Object Detection: 49.47 (F1@0.5), outperforming Remote Sensing specialists like GeoChat (5.52) and SkySenseGPT (3.13).
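The detection and segmentation scores above are reported at an IoU threshold of 0.5 (a prediction counts as a match only when its overlap with a ground-truth instance reaches 0.5). For horizontal boxes, IoU takes only a few lines; this is an illustrative sketch, not the paper's evaluation code:

```python
def iou_hbb(a, b):
    """Intersection-over-union of two horizontal boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two boxes overlapping in a 50x50 corner
print(iou_hbb((0, 0, 100, 100), (50, 50, 150, 150)))  # ≈ 0.1429
```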

Cross-Dataset Generalization

To assess generalization, the model was also evaluated zero-shot on the VRSBench dataset. Despite operating strictly out-of-distribution (VRSBench focuses heavily on vehicles and planes, which are absent from GroundSet's cadastral data), the model still outperformed leading RS specialists in core spatial grounding tasks such as Detection and REC.


๐Ÿ’ป Usage Example

Because this model uses the LLaVA-1.6 architecture, it can be easily loaded using the standard Hugging Face transformers library via the LlavaNext implementation.

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image

# 1. Load the processor and model
model_id = "RogerFerrod/GroundSet-LLaVA-1.6-7B" 
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.bfloat16, 
    low_cpu_mem_usage=True
).to("cuda")

# 2. Load an image and define the question
image = Image.open("path_to_aerial_patch.png")
question = "Detect all instances of Building in this image and provide their bounding boxes."

# 3. Format the prompt using the official chat template
msgs = [
    {
        "role": "user", 
        "content": [
            {"type": "image"},
            {"type": "text", "text": question}
        ]
    }
]

prompt = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)

# 4. Process the inputs
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")

# 5. Generate and decode only the newly generated tokens
output = model.generate(**inputs, max_new_tokens=100, do_sample=False)
generated_text = processor.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

print(generated_text)
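For detection-style prompts like the one above, the generated text has to be parsed back into boxes. The exact output format is defined by GroundSet's instruction data; assuming boxes appear as bracketed quadruples of binned integers (an assumption, adjust the pattern to the model's actual format), a regex sketch:

```python
import re

def extract_boxes(text):
    """Extract (x1, y1, x2, y2) integer quadruples from generated text.

    Assumes boxes are printed as four comma-separated integers in brackets,
    e.g. "[120, 85, 340, 410]"; adapt the pattern to the model's real format.
    """
    pattern = r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]"
    return [tuple(map(int, m)) for m in re.findall(pattern, text)]

sample = "Two buildings: [120, 85, 340, 410] and [500, 500, 900, 950]."
print(extract_boxes(sample))  # [(120, 85, 340, 410), (500, 500, 900, 950)]
```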

๐Ÿ’ก Note: For a complete, reproducible inference pipeline and the exact scripts used to compute the benchmark metrics reported in the paper, please refer to the official GitHub repository.


๐Ÿ“ Citation

If you use this model or the associated dataset in your research, please cite the original work:

@article{groundset,
  title={GroundSet: A Cadastral-Grounded Dataset for Spatial Understanding with Vector Data},
  author={Ferrod, Roger and Lecene, Ma{\"e}l and Sapkota, Krishna and Leifman, George and Silverman, Vered and Beryozkin, Genady and Lobry, Sylvain},
  journal={arXiv preprint},
  year={2026}
}

๐Ÿ™Œ Acknowledgements

This work was supported by Google under a research collaboration agreement with Universitรฉ Paris Citรฉ. The underlying GroundSet dataset leverages official data from IGN (French National Institute of Geographic and Forest Information), specifically BD ORTHOยฎ and BD TOPOยฎ, released under Open Licence 2.0.
