LandAI-L1: Explicit geometric grounding enables data-efficient and interpretable geospatial intelligence
[Paper (Under Review)] | [Dataset]
π Introduction
LandAI-L1 is a multimodal large language model designed for verifiable land-use reasoning. Unlike traditional black-box classification models, LandAI-L1 enforces a strict cognitive path: "Visual IndexingγGeometric Localization and Language Reasoning".
By compelling the model to explicitly localize visual evidence (bounding boxes) before drawing semantic conclusions, we achieve state-of-the-art accuracy in land-use classification while significantly mitigating multimodal hallucinations.
This model is built upon the Qwen2.5-VL-7B-Instruct architecture and trained using the GRPO-L1 algorithm.
π Key Features
- Explicit Geometric Grounding: Mitigates "disembodied explanations" by anchoring reasoning steps in verifiable pixel coordinates.
- Data Efficiency: Achieves SOTA performance (86.41% accuracy) using only 25% of the training data required by comparable models (e.g., LandGPT).
- Hallucination Resistance: Demonstrates superior resistance to text-based misinformation in visual-linguistic conflict scenarios (37.0% vision-adherence vs. 7.3% baseline).
- Standardized Architecture: Fully follows the Qwen2.5-VL inference architecture to minimize version conflicts and maximize ecosystem compatibility.
- Reproducible Training: The training phase utilizes the ms-swift framework, facilitating easy fine-tuning and further research.
π Performance
LandAI-L1 establishes a new benchmark on the independent CN-MSLU test set, outperforming both open-source baselines and commercial models.
| Model | Architecture | Training Samples | Accuracy (%) | Hallucination Resistance |
|---|---|---|---|---|
| LandAI-L1 (Ours) | Qwen2.5-VL-7B | ~20k | 86.41 | High |
| LandAI-L1-Zero (Baseline) | Qwen2.5-VL-7B | ~20k | 72.21 | Low |
| LandGPT | InternVL2 | ~80k | 82.5 (approx) | Low |
| Gemini 2.5 Pro | Closed | N/A | 52.21 | Medium |
Note: Hallucination resistance refers to the model's ability to reject misleading textual priors in favor of visual evidence (Visual-Linguistic Conflict Experiment).
π οΈ Quick Start
Since LandAI-L1 strictly follows the Qwen2.5-VL architecture, you can load it directly using transformers without custom modeling code.
Installation
pip install git+https://github.com/huggingface/transformers
pip install qwen-vl-utils
βοΈ Training & Fine-tuning
The model was trained using ms-swift, a lightweight and extensible framework for LLM/MLLM fine-tuning.
To reproduce the training or fine-tune on your own geospatial data:
Clone ms-swift: git clone https://github.com/modelscope/swift.git
Prepare your dataset in the standard format.
Run the training ms-swift script.
- Downloads last month
- 17