LandAI-L1: Explicit geometric grounding enables data-efficient and interpretable geospatial intelligence

[Paper (Under Review)]   |   [Dataset]

πŸ“– Introduction

LandAI-L1 is a multimodal large language model designed for verifiable land-use reasoning. Unlike traditional black-box classification models, LandAI-L1 enforces a strict cognitive path: "Visual Indexing、Geometric Localization and Language Reasoning".

By compelling the model to explicitly localize visual evidence (bounding boxes) before drawing semantic conclusions, we achieve state-of-the-art accuracy in land-use classification while significantly mitigating multimodal hallucinations.

This model is built upon the Qwen2.5-VL-7B-Instruct architecture and trained using the GRPO-L1 algorithm.

πŸš€ Key Features

  • Explicit Geometric Grounding: Mitigates "disembodied explanations" by anchoring reasoning steps in verifiable pixel coordinates.
  • Data Efficiency: Achieves SOTA performance (86.41% accuracy) using only 25% of the training data required by comparable models (e.g., LandGPT).
  • Hallucination Resistance: Demonstrates superior resistance to text-based misinformation in visual-linguistic conflict scenarios (37.0% vision-adherence vs. 7.3% baseline).
  • Standardized Architecture: Fully follows the Qwen2.5-VL inference architecture to minimize version conflicts and maximize ecosystem compatibility.
  • Reproducible Training: The training phase utilizes the ms-swift framework, facilitating easy fine-tuning and further research.

πŸ“Š Performance

LandAI-L1 establishes a new benchmark on the independent CN-MSLU test set, outperforming both open-source baselines and commercial models.

Model Architecture Training Samples Accuracy (%) Hallucination Resistance
LandAI-L1 (Ours) Qwen2.5-VL-7B ~20k 86.41 High
LandAI-L1-Zero (Baseline) Qwen2.5-VL-7B ~20k 72.21 Low
LandGPT InternVL2 ~80k 82.5 (approx) Low
Gemini 2.5 Pro Closed N/A 52.21 Medium

Note: Hallucination resistance refers to the model's ability to reject misleading textual priors in favor of visual evidence (Visual-Linguistic Conflict Experiment).

πŸ› οΈ Quick Start

Since LandAI-L1 strictly follows the Qwen2.5-VL architecture, you can load it directly using transformers without custom modeling code.

Installation

pip install git+https://github.com/huggingface/transformers
pip install qwen-vl-utils

βš™οΈ Training & Fine-tuning

The model was trained using ms-swift, a lightweight and extensible framework for LLM/MLLM fine-tuning.

To reproduce the training or fine-tune on your own geospatial data:

Clone ms-swift: git clone https://github.com/modelscope/swift.git

Prepare your dataset in the standard format.

Run the training ms-swift script.

Downloads last month
17
Safetensors
Model size
8B params
Tensor type
BF16
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for zhou777/LandAI-L1

Quantizations
2 models