---
license: apache-2.0
language:
- en
- zh
pipeline_tag: image-text-to-text
tags:
- remote-sensing
- land-use
- qwen2.5-vl
- multimodal
- ms-swift
---

# LandAI-L1: Explicit geometric grounding enables data-efficient and interpretable geospatial intelligence
**[Paper (Under Review)]**   |   **[Dataset]**
## 📖 Introduction

**LandAI-L1** is a multimodal large language model designed for **verifiable land-use reasoning**. Unlike traditional black-box classification models, LandAI-L1 enforces a strict cognitive path: **"Visual Indexing → Geometric Localization → Language Reasoning"**. By compelling the model to explicitly localize visual evidence (bounding boxes) before drawing semantic conclusions, we achieve state-of-the-art accuracy in land-use classification while significantly mitigating multimodal hallucinations.

This model is built upon the **Qwen2.5-VL-7B-Instruct** architecture and trained using the **GRPO-L1** algorithm.

## 🚀 Key Features

- **Explicit Geometric Grounding**: Mitigates "disembodied explanations" by anchoring reasoning steps in verifiable pixel coordinates.
- **Data Efficiency**: Achieves SOTA performance (86.41% accuracy) using only **25%** of the training data required by comparable models (e.g., LandGPT).
- **Hallucination Resistance**: Demonstrates superior resistance to text-based misinformation in visual-linguistic conflict scenarios (37.0% vision-adherence vs. 7.3% for the baseline).
- **Standardized Architecture**: Fully follows the **Qwen2.5-VL** inference architecture to minimize version conflicts and maximize ecosystem compatibility.
- **Reproducible Training**: The training phase utilizes the **[ms-swift](https://github.com/modelscope/swift)** framework, facilitating easy fine-tuning and further research.

## 📊 Performance

LandAI-L1 establishes a new benchmark on the independent CN-MSLU test set, outperforming both open-source baselines and commercial models.
| Model | Architecture | Training Samples | Accuracy (%) | Hallucination Resistance |
| :--- | :--- | :--- | :--- | :--- |
| **LandAI-L1 (Ours)** | **Qwen2.5-VL-7B** | **~20k** | **86.41** | **High** |
| LandAI-L1-Zero (Baseline) | Qwen2.5-VL-7B | ~20k | 72.21 | Low |
| LandGPT | InternVL2 | ~80k | 82.5 (approx.) | Low |
| Gemini 2.5 Pro | Closed | N/A | 52.21 | Medium |

> **Note**: Hallucination resistance refers to the model's ability to reject misleading textual priors in favor of visual evidence (Visual-Linguistic Conflict Experiment).

## 🛠️ Quick Start

Since LandAI-L1 strictly follows the **Qwen2.5-VL** architecture, you can load it directly using `transformers`, without custom modeling code.

### Installation

```bash
pip install git+https://github.com/huggingface/transformers
pip install qwen-vl-utils
```

## ⚙️ Training & Fine-tuning

The model was trained using **[ms-swift](https://github.com/modelscope/swift)**, a lightweight and extensible framework for LLM/MLLM fine-tuning. To reproduce the training or fine-tune on your own geospatial data:

1. Clone ms-swift: `git clone https://github.com/modelscope/swift.git`
2. Prepare your dataset in the standard ms-swift format.
3. Run the ms-swift training script.
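Because the model follows the stock Qwen2.5-VL architecture, the standard Qwen2.5-VL inference flow from the Quick Start dependencies applies. Below is a minimal sketch; the repo id `LandAI/LandAI-L1`, the image path, and the prompt are placeholders, and the heavy imports are deferred inside the helper so the message-building function can be reused without the vision/LLM dependencies installed:

```python
def build_messages(image_path: str, question: str) -> list:
    """Build a Qwen2.5-VL-style chat payload for one image plus a question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]


def run_inference(model_id: str, image_path: str, question: str) -> str:
    """Load the checkpoint and generate a grounded land-use answer.

    Imports are deferred so build_messages() stays importable on its own.
    """
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
    from qwen_vl_utils import process_vision_info

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = build_messages(image_path, question)
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to(model.device)

    generated = model.generate(**inputs, max_new_tokens=512)
    # Strip the prompt tokens, keeping only the newly generated answer.
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]


# Example call (placeholder repo id; requires the published weights):
# answer = run_inference(
#     "LandAI/LandAI-L1",
#     "sample_tile.png",
#     "What is the land-use class? Ground your answer with bounding boxes.",
# )
```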
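Since the model's value lies in the verifiability of its grounding, downstream code will usually want to extract the emitted coordinates before trusting the semantic conclusion. The sketch below assumes the model keeps the Qwen2.5-VL grounding convention of a fenced JSON array with `bbox_2d` fields; LandAI-L1's exact output format is not specified here, so treat this as a hypothetical post-processing step:

```python
import json
import re


def extract_boxes(response: str) -> list:
    """Pull grounded bounding boxes out of a model response.

    Assumes (Qwen2.5-VL convention) a fenced JSON array of objects with a
    "bbox_2d" field holding [x1, y1, x2, y2] pixel coordinates; returns []
    when no well-formed grounding block is found.
    """
    match = re.search(r"```json\s*(\[.*?\])\s*```", response, re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []
    boxes = []
    for item in items:
        bbox = item.get("bbox_2d")
        # Keep only well-formed boxes with non-negative width and height.
        if (
            isinstance(bbox, list) and len(bbox) == 4
            and bbox[0] <= bbox[2] and bbox[1] <= bbox[3]
        ):
            boxes.append({"label": item.get("label", ""), "bbox": bbox})
    return boxes


sample = (
    "The tile is dominated by cropland.\n"
    '```json\n[{"bbox_2d": [12, 40, 220, 198], "label": "cropland"}]\n```'
)
print(extract_boxes(sample))
# → [{'label': 'cropland', 'bbox': [12, 40, 220, 198]}]
```

Answers whose boxes fail these sanity checks can then be flagged or rejected, which is the practical payoff of forcing localization before semantic conclusions.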