| | --- |
| | license: apache-2.0 |
| | language: |
| | - en |
| | - zh |
| | pipeline_tag: image-text-to-text |
| | tags: |
| | - remote-sensing |
| | - land-use |
| | - qwen2.5-vl |
| | - multimodal |
| | - ms-swift |
| | --- |
| | |
| | # LandAI-L1: Explicit geometric grounding enables data-efficient and interpretable geospatial intelligence |
| |
|
| | <div align="center"> |
| |
|
| | **[Paper (Under Review)]** | **[Dataset]** |
| |
|
| | </div> |
| |
|
| | ## 📖 Introduction |
| |
|
| | **LandAI-L1** is a multimodal large language model designed for **verifiable land-use reasoning**. Unlike traditional black-box classification models, LandAI-L1 enforces a strict cognitive path: **"Visual Indexing、Geometric Localization and Language Reasoning"**. |
| |
|
| | By compelling the model to explicitly localize visual evidence (bounding boxes) before drawing semantic conclusions, we achieve state-of-the-art accuracy in land-use classification while significantly mitigating multimodal hallucinations. |
| |
|
| | This model is built upon the **Qwen2.5-VL-7B-Instruct** architecture and trained using the **GRPO-L1** algorithm. |
| |
|
| | ## 🚀 Key Features |
| |
|
| | - **Explicit Geometric Grounding**: Mitigates "disembodied explanations" by anchoring reasoning steps in verifiable pixel coordinates. |
| | - **Data Efficiency**: Achieves SOTA performance (86.41% accuracy) using only **25%** of the training data required by comparable models (e.g., LandGPT). |
| | - **Hallucination Resistance**: Demonstrates superior resistance to text-based misinformation in visual-linguistic conflict scenarios (37.0% vision-adherence vs. 7.3% baseline). |
| | - **Standardized Architecture**: Fully follows the **Qwen2.5-VL** inference architecture to minimize version conflicts and maximize ecosystem compatibility. |
| | - **Reproducible Training**: The training phase utilizes the **[ms-swift](https://github.com/modelscope/swift)** framework, facilitating easy fine-tuning and further research. |
| |
|
| | ## 📊 Performance |
| |
|
| | LandAI-L1 establishes a new benchmark on the independent CN-MSLU test set, outperforming both open-source baselines and commercial models. |
| |
|
| | | Model | Architecture | Training Samples | Accuracy (%) | Hallucination Resistance | |
| | | :--- | :--- | :--- | :--- | :--- | |
| | | **LandAI-L1 (Ours)** | **Qwen2.5-VL-7B** | **~20k** | **86.41** | **High** | |
| | | LandAI-L1-Zero (Baseline) | Qwen2.5-VL-7B | ~20k | 72.21 | Low | |
| | | LandGPT | InternVL2 | ~80k | 82.5 (approx) | Low | |
| | | Gemini 2.5 Pro | Closed | N/A | 52.21 | Medium | |
| |
|
| | > **Note**: Hallucination resistance refers to the model's ability to reject misleading textual priors in favor of visual evidence (Visual-Linguistic Conflict Experiment). |
| |
|
| | ## 🛠️ Quick Start |
| |
|
| | Since LandAI-L1 strictly follows the **Qwen2.5-VL** architecture, you can load it directly using `transformers` without custom modeling code. |
| |
|
| | ### Installation |
| |
|
| | ```bash |
| | pip install git+https://github.com/huggingface/transformers |
| | pip install qwen-vl-utils |
| | ``` |
| | ## ⚙️ Training & Fine-tuning |
| | The model was trained using **[ms-swift](https://github.com/modelscope/swift)**, a lightweight and extensible framework for LLM/MLLM fine-tuning. |
| |
|
| | To reproduce the training or fine-tune on your own geospatial data: |
| |
|
| | Clone ms-swift: git clone https://github.com/modelscope/swift.git |
| |
|
| | Prepare your dataset in the standard format. |
| |
|
| | Run the training ms-swift script. |