---
license: apache-2.0
language:
- en
- zh
pipeline_tag: image-text-to-text
tags:
- remote-sensing
- land-use
- qwen2.5-vl
- multimodal
- ms-swift
---

# LandAI-L1: Explicit geometric grounding enables data-efficient and interpretable geospatial intelligence

<div align="center">

**[NMI Submission]** | **[Paper (Under Review)]** | **[Dataset]**

</div>

## 📖 Introduction

**LandAI-L1** is a multimodal large language model designed for **verifiable land-use reasoning**. Unlike traditional black-box classification models, LandAI-L1 enforces a strict cognitive path: **"Visual Indexing → Geometric Localization → Language Reasoning"**.

By compelling the model to explicitly localize visual evidence (bounding boxes) before drawing semantic conclusions, we achieve state-of-the-art accuracy in land-use classification while significantly mitigating multimodal hallucinations.

This model is built upon the **Qwen2.5-VL-7B-Instruct** architecture and trained using the **GRPO-L1** algorithm.

## 🚀 Key Features

- **Explicit Geometric Grounding**: Mitigates "disembodied explanations" by anchoring reasoning steps in verifiable pixel coordinates.
- **Data Efficiency**: Achieves SOTA performance (86.41% accuracy) using only **25%** of the training data required by comparable models (e.g., LandGPT).
- **Hallucination Resistance**: Demonstrates superior resistance to text-based misinformation in visual-linguistic conflict scenarios (37.0% vision-adherence vs. 7.3% baseline).
- **Standardized Architecture**: Fully follows the **Qwen2.5-VL** inference architecture to minimize version conflicts and maximize ecosystem compatibility.
- **Reproducible Training**: The training phase utilizes the **[ms-swift](https://github.com/modelscope/swift)** framework, facilitating easy fine-tuning and further research.

## 📊 Performance

LandAI-L1 establishes a new benchmark on the independent CN-MSLU test set, outperforming both open-source baselines and commercial models.

| Model | Architecture | Training Samples | Accuracy (%) | Hallucination Resistance |
| :--- | :--- | :--- | :--- | :--- |
| **LandAI-L1 (Ours)** | **Qwen2.5-VL-7B** | **~20k** | **86.41** | **High** |
| LandAI-L1-Zero (Baseline) | Qwen2.5-VL-7B | ~20k | 72.21 | Low |
| LandGPT | InternVL2 | ~80k | 82.5 (approx.) | Low |
| Gemini 2.5 Pro | Closed | N/A | 52.21 | Medium |

> **Note**: Hallucination resistance refers to the model's ability to reject misleading textual priors in favor of visual evidence (Visual-Linguistic Conflict Experiment).

## 🛠️ Quick Start

Since LandAI-L1 strictly follows the **Qwen2.5-VL** architecture, you can load it directly using `transformers` without custom modeling code.

### Installation

```bash
pip install git+https://github.com/huggingface/transformers
pip install qwen-vl-utils
```
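
### Inference

A minimal sketch using the standard Qwen2.5-VL `transformers` interface that this model follows. The model id (`your-namespace/LandAI-L1`), the image path, and the prompt are placeholders; replace them with this repository's actual id and your own remote-sensing patch.

```python
# Minimal inference sketch following the standard Qwen2.5-VL transformers API.
# The model id, image path, and prompt are placeholders, not fixed values.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "your-namespace/LandAI-L1"  # placeholder: use this repository's id

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One remote-sensing patch plus a land-use question asking for grounded evidence.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your_patch.png"},
            {
                "type": "text",
                "text": "What is the dominant land-use class of this patch? "
                        "Localize the visual evidence before answering.",
            },
        ],
    }
]

# Build the chat prompt and vision inputs, then generate.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens before decoding the generated answer.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```
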
## ⚙️ Training & Fine-tuning

The model was trained using **[ms-swift](https://github.com/modelscope/swift)**, a lightweight and extensible framework for LLM/MLLM fine-tuning.

To reproduce the training or fine-tune on your own geospatial data:

1. Clone ms-swift: `git clone https://github.com/modelscope/swift.git`
2. Prepare your dataset in ms-swift's standard format (a sketch of a single record is shown below).
3. Run the ms-swift training script.
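
The exact dataset schema is defined by ms-swift; the sketch below assumes its common conversational JSONL layout (a `messages` list plus an `images` list, with `<image>` marking where the patch is inserted) and uses an invented question, answer, and file path purely for illustration. Check the ms-swift documentation for the schema your version expects.

```python
# Hypothetical single training record written as one JSONL line.
# Field names ("messages", "images") are assumed from ms-swift's common
# conversational layout; the question, answer, and image path are invented.
import json

record = {
    "messages": [
        {
            "role": "user",
            "content": "<image>What is the dominant land-use class of this patch? "
                       "Localize the supporting evidence before answering.",
        },
        {
            "role": "assistant",
            "content": "The patch is dominated by cropland; the key evidence is the "
                       "regular field grid around [112, 40, 318, 262].",
        },
    ],
    "images": ["data/patches/000001.png"],
}

with open("landuse_train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```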
|