zhou777
/

LandAI-L1

Image-Text-to-Text

Model card Files Files and versions

LandAI-L1 / README.md

zhou777's picture

Update README.md

4d4f80f verified about 1 month ago

|

history blame contribute delete

3.2 kB

	---
	license: apache-2.0
	language:
	- en
	- zh
	pipeline_tag: image-text-to-text
	tags:
	- remote-sensing
	- land-use
	- qwen2.5-vl
	- multimodal
	- ms-swift
	---

	# LandAI-L1: Explicit geometric grounding enables data-efficient and interpretable geospatial intelligence

	<div align="center">

	[Paper (Under Review)]   \|   [Dataset]

	</div>

	## 📖 Introduction

	LandAI-L1 is a multimodal large language model designed for verifiable land-use reasoning. Unlike traditional black-box classification models, LandAI-L1 enforces a strict cognitive path: "Visual Indexing、Geometric Localization and Language Reasoning".

	By compelling the model to explicitly localize visual evidence (bounding boxes) before drawing semantic conclusions, we achieve state-of-the-art accuracy in land-use classification while significantly mitigating multimodal hallucinations.

	This model is built upon the Qwen2.5-VL-7B-Instruct architecture and trained using the GRPO-L1 algorithm.

	## 🚀 Key Features

	- Explicit Geometric Grounding: Mitigates "disembodied explanations" by anchoring reasoning steps in verifiable pixel coordinates.
	- Data Efficiency: Achieves SOTA performance (86.41% accuracy) using only 25% of the training data required by comparable models (e.g., LandGPT).
	- Hallucination Resistance: Demonstrates superior resistance to text-based misinformation in visual-linguistic conflict scenarios (37.0% vision-adherence vs. 7.3% baseline).
	- Standardized Architecture: Fully follows the Qwen2.5-VL inference architecture to minimize version conflicts and maximize ecosystem compatibility.
	- Reproducible Training: The training phase utilizes the [ms-swift](https://github.com/modelscope/swift) framework, facilitating easy fine-tuning and further research.

	## 📊 Performance

	LandAI-L1 establishes a new benchmark on the independent CN-MSLU test set, outperforming both open-source baselines and commercial models.

	\| Model \| Architecture \| Training Samples \| Accuracy (%) \| Hallucination Resistance \|
	\| :--- \| :--- \| :--- \| :--- \| :--- \|
	\| LandAI-L1 (Ours) \| Qwen2.5-VL-7B \| ~20k \| 86.41 \| High \|
	\| LandAI-L1-Zero (Baseline) \| Qwen2.5-VL-7B \| ~20k \| 72.21 \| Low \|
	\| LandGPT \| InternVL2 \| ~80k \| 82.5 (approx) \| Low \|
	\| Gemini 2.5 Pro \| Closed \| N/A \| 52.21 \| Medium \|

	> Note: Hallucination resistance refers to the model's ability to reject misleading textual priors in favor of visual evidence (Visual-Linguistic Conflict Experiment).

	## 🛠️ Quick Start

	Since LandAI-L1 strictly follows the Qwen2.5-VL architecture, you can load it directly using `transformers` without custom modeling code.

	### Installation

	```bash
	pip install git+https://github.com/huggingface/transformers
	pip install qwen-vl-utils
	```
	## ⚙️ Training & Fine-tuning
	The model was trained using [ms-swift](https://github.com/modelscope/swift), a lightweight and extensible framework for LLM/MLLM fine-tuning.

	To reproduce the training or fine-tune on your own geospatial data:

	Clone ms-swift: git clone https://github.com/modelscope/swift.git

	Prepare your dataset in the standard format.

	Run the training ms-swift script.