---
license: apache-2.0
language:
- en
- zh
pipeline_tag: image-text-to-text
tags:
- remote-sensing
- land-use
- qwen2.5-vl
- multimodal
- ms-swift
---

# LandAI-L1: Explicit geometric grounding enables data-efficient and interpretable geospatial intelligence
**[Paper (Under Review)]**   |   **[Dataset]**
## 📖 Introduction

**LandAI-L1** is a multimodal large language model designed for **verifiable land-use reasoning**. Unlike traditional black-box classification models, LandAI-L1 enforces a strict cognitive path: **"Visual Indexing → Geometric Localization → Language Reasoning"**. By compelling the model to explicitly localize visual evidence (bounding boxes) before drawing semantic conclusions, we achieve state-of-the-art accuracy in land-use classification while significantly mitigating multimodal hallucinations.

This model is built upon the **Qwen2.5-VL-7B-Instruct** architecture and trained using the **GRPO-L1** algorithm.

## 🚀 Key Features

- **Explicit Geometric Grounding**: Mitigates "disembodied explanations" by anchoring reasoning steps in verifiable pixel coordinates.
- **Data Efficiency**: Achieves SOTA performance (86.41% accuracy) using only **25%** of the training data required by comparable models (e.g., LandGPT).
- **Hallucination Resistance**: Demonstrates superior resistance to text-based misinformation in visual-linguistic conflict scenarios (37.0% vision-adherence vs. 7.3% for the baseline).
- **Standardized Architecture**: Fully follows the **Qwen2.5-VL** inference architecture to minimize version conflicts and maximize ecosystem compatibility.
- **Reproducible Training**: The training phase utilizes the **[ms-swift](https://github.com/modelscope/swift)** framework, facilitating easy fine-tuning and further research.

## 📊 Performance

LandAI-L1 establishes a new benchmark on the independent CN-MSLU test set, outperforming both open-source baselines and commercial models.
| Model | Architecture | Training Samples | Accuracy (%) | Hallucination Resistance |
| :--- | :--- | :--- | :--- | :--- |
| **LandAI-L1 (Ours)** | **Qwen2.5-VL-7B** | **~20k** | **86.41** | **High** |
| LandAI-L1-Zero (Baseline) | Qwen2.5-VL-7B | ~20k | 72.21 | Low |
| LandGPT | InternVL2 | ~80k | 82.5 (approx.) | Low |
| Gemini 2.5 Pro | Closed | N/A | 52.21 | Medium |

> **Note**: Hallucination resistance refers to the model's ability to reject misleading textual priors in favor of visual evidence (Visual-Linguistic Conflict Experiment).

## 🛠️ Quick Start

Since LandAI-L1 strictly follows the **Qwen2.5-VL** architecture, you can load it directly using `transformers`, without custom modeling code.

### Installation

```bash
pip install git+https://github.com/huggingface/transformers
pip install qwen-vl-utils
```

## ⚙️ Training & Fine-tuning

The model was trained using **[ms-swift](https://github.com/modelscope/swift)**, a lightweight and extensible framework for LLM/MLLM fine-tuning. To reproduce the training or fine-tune on your own geospatial data:

1. Clone ms-swift: `git clone https://github.com/modelscope/swift.git`
2. Prepare your dataset in the standard ms-swift format.
3. Run the ms-swift training script.
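Because the model follows the stock Qwen2.5-VL architecture, the standard Qwen2.5-VL inference flow from the Quick Start dependencies applies. Below is a minimal sketch; the repo id `LandAI/LandAI-L1`, the image path, and the prompt are placeholders, and the heavy imports are deferred inside the helper so the message-building function can be reused without the vision/LLM dependencies installed:

```python
def build_messages(image_path: str, question: str) -> list:
    """Build a Qwen2.5-VL-style chat payload for one image plus a question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]


def run_inference(model_id: str, image_path: str, question: str) -> str:
    """Load the checkpoint and generate a grounded land-use answer.

    Imports are deferred so build_messages() stays importable on its own.
    """
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
    from qwen_vl_utils import process_vision_info

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = build_messages(image_path, question)
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to(model.device)

    generated = model.generate(**inputs, max_new_tokens=512)
    # Strip the prompt tokens, keeping only the newly generated answer.
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]


# Example call (placeholder repo id; requires the published weights):
# answer = run_inference(
#     "LandAI/LandAI-L1",
#     "sample_tile.png",
#     "What is the land-use class? Ground your answer with bounding boxes.",
# )
```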
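Since the model's value lies in the verifiability of its grounding, downstream code will usually want to extract the emitted coordinates before trusting the semantic conclusion. The sketch below assumes the model keeps the Qwen2.5-VL grounding convention of a fenced JSON array with `bbox_2d` fields; LandAI-L1's exact output format is not specified here, so treat this as a hypothetical post-processing step:

```python
import json
import re


def extract_boxes(response: str) -> list:
    """Pull grounded bounding boxes out of a model response.

    Assumes (Qwen2.5-VL convention) a fenced JSON array of objects with a
    "bbox_2d" field holding [x1, y1, x2, y2] pixel coordinates; returns []
    when no well-formed grounding block is found.
    """
    match = re.search(r"```json\s*(\[.*?\])\s*```", response, re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []
    boxes = []
    for item in items:
        bbox = item.get("bbox_2d")
        # Keep only well-formed boxes with non-negative width and height.
        if (
            isinstance(bbox, list) and len(bbox) == 4
            and bbox[0] <= bbox[2] and bbox[1] <= bbox[3]
        ):
            boxes.append({"label": item.get("label", ""), "bbox": bbox})
    return boxes


sample = (
    "The tile is dominated by cropland.\n"
    '```json\n[{"bbox_2d": [12, 40, 220, 198], "label": "cropland"}]\n```'
)
print(extract_boxes(sample))
# → [{'label': 'cropland', 'bbox': [12, 40, 220, 198]}]
```

Answers whose boxes fail these sanity checks can then be flagged or rejected, which is the practical payoff of forcing localization before semantic conclusions.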