๐ŸŒ GroundSet Baseline: LLaVA-1.6 for Spatial Understanding in Earth Observation


This repository hosts the official baseline model for GroundSet, a large-scale Earth Observation dataset grounded in verifiable cadastral vector data.

This baseline is fine-tuned on 1.8 million instructions from GroundSet's fine-tuning split.

๐ŸŽฏ Supported Spatial Tasks

The model has been explicitly trained to handle highly granular semantic categories (135 classes, including specific crop types, heritage sites, and civil infrastructure) across the following tasks:

  • Captioning: Generating coherent scene descriptions.
  • Localized Classification: Classifying given regions (bounding boxes or polygons).
  • Object Detection: Localizing specific classes using Horizontal Bounding Boxes (HBB).
  • Segmentation: Localizing target classes using polygonal masks.
  • Referring Expression Comprehension (REC): Localizing objects based on textual descriptions.
  • Visual Question Answering (VQA): Binary verification of object presence.

๐Ÿ—๏ธ Model Architecture & Training

The model is built upon the LLaVA-1.6 architecture (using the Vicuna 7B language model). This architecture relies on dynamic resolution (AnyRes) for processing high-resolution aerial imagery.

Training Details

  • Hardware: Fine-tuned on 8x A100 (80GB) GPUs for 1 epoch (approx. 72 hours).
  • Method: Parameter-efficient fine-tuning using LoRA (Low-Rank Adaptation).
  • LoRA Config: Rank $r=32$, alpha $\alpha=64$, and dropout $p=0.1$, applied to all linear layers of the language model.
  • Projector & Vision Tower: The multi-modal projector was fully fine-tuned, while the vision tower remained frozen.
  • Optimization: AdamW optimizer (batch size 128), cosine learning rate scheduler (peak LR $2\times10^{-4}$, warmup ratio 0.03), BFloat16 precision, DeepSpeed ZeRO-2, and FlashAttention-2.
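With the Hugging Face peft library, the LoRA setup above could be expressed along the following lines. This is a sketch, not the training script: the target_modules list is an assumption enumerating the linear layers of a Vicuna/LLaMA-style decoder.

```python
from peft import LoraConfig

# Hyperparameters as reported above (r=32, alpha=64, dropout=0.1).
# target_modules is an assumption: the attention and MLP linear layers
# of a Vicuna/LLaMA-style language model.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    task_type="CAUSAL_LM",
)
```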

๐Ÿ’ก Note: Pixel coordinates are discretized into [0-1000] bins.


๐Ÿ“Š Evaluation & Results

The GroundSet baseline establishes a strong reference point for spatial understanding in Earth Observation.

Key Performance Highlights

The model is evaluated zero-shot on GroundSet's test set, achieving the following results:

  • Classification: 94.18% (Acc@0.8), outperforming Gemini-2.5 Flash (49.84%) and LLaVA-1.6 base (29.20%).
  • Segmentation: 39.45 (F1@0.5), outperforming PaliGemma-2 (17.47) and Ferret (13.87).
  • Object Detection: 49.47 (F1@0.5), outperforming Remote Sensing specialists like GeoChat (5.52) and SkySenseGPT (3.13).
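The detection and segmentation scores above are reported at an IoU threshold of 0.5 (a prediction counts as a match only when its overlap with a ground-truth instance reaches 0.5). For horizontal boxes, IoU takes only a few lines; this is an illustrative sketch, not the paper's evaluation code:

```python
def iou_hbb(a, b):
    """Intersection-over-union of two horizontal boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two boxes overlapping in a 50x50 corner
print(iou_hbb((0, 0, 100, 100), (50, 50, 150, 150)))  # ≈ 0.1429
```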

Cross-Dataset Generalization

To assess generalization, the model was also evaluated zero-shot on the VRSBench dataset. Despite operating strictly out-of-distribution (VRSBench focuses heavily on vehicles and planes, which are absent from GroundSet's cadastral data), the model still outperformed leading RS specialists in core spatial grounding tasks such as Detection and REC.


๐Ÿ’ป Usage Example

Because this model uses the LLaVA-1.6 architecture, it can be easily loaded using the standard Hugging Face transformers library via the LlavaNext implementation.

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image

# 1. Load the processor and model
model_id = "RogerFerrod/GroundSet-LLaVA-1.6-7B" 
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.bfloat16, 
    low_cpu_mem_usage=True
).to("cuda")

# 2. Load an image and define the question
image = Image.open("path_to_aerial_patch.png")
question = "Detect all instances of Building in this image and provide their bounding boxes."

# 3. Format the prompt using the official chat template
msgs = [
    {
        "role": "user", 
        "content": [
            {"type": "image"},
            {"type": "text", "text": question}
        ]
    }
]

prompt = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)

# 4. Process the inputs
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")

# 5. Generate and decode only the newly generated tokens
output = model.generate(**inputs, max_new_tokens=100, do_sample=False)
generated_text = processor.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

print(generated_text)
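For detection-style prompts like the one above, the generated text has to be parsed back into boxes. The exact output format is defined by GroundSet's instruction data; assuming boxes appear as bracketed quadruples of binned integers (an assumption, adjust the pattern to the model's actual format), a regex sketch:

```python
import re

def extract_boxes(text):
    """Extract (x1, y1, x2, y2) integer quadruples from generated text.

    Assumes boxes are printed as four comma-separated integers in brackets,
    e.g. "[120, 85, 340, 410]"; adapt the pattern to the model's real format.
    """
    pattern = r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]"
    return [tuple(map(int, m)) for m in re.findall(pattern, text)]

sample = "Two buildings: [120, 85, 340, 410] and [500, 500, 900, 950]."
print(extract_boxes(sample))  # [(120, 85, 340, 410), (500, 500, 900, 950)]
```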

๐Ÿ’ก Note: For a complete, reproducible inference pipeline and the exact scripts used to compute the benchmark metrics reported in the paper, please refer to the official GitHub repository.


๐Ÿ“ Citation

If you use this model or the associated dataset in your research, please cite the original work:

@article{groundset,
  title={GroundSet: A Cadastral-Grounded Dataset for Spatial Understanding with Vector Data},
  author={Ferrod, Roger and Lecene, Ma{\"e}l and Sapkota, Krishna and Leifman, George and Silverman, Vered and Beryozkin, Genady and Lobry, Sylvain},
  journal={arXiv preprint},
  year={2026}
}

๐Ÿ™Œ Acknowledgements

This work was supported by Google under a research collaboration agreement with Universitรฉ Paris Citรฉ. The underlying GroundSet dataset leverages official data from IGN (French National Institute of Geographic and Forest Information), specifically BD ORTHOยฎ and BD TOPOยฎ, released under Open Licence 2.0.
