Improve model card: add pipeline tag, license, paper and code links, and evaluation usage
This PR significantly enhances the model card for `initiacms/GeoLLaVA-8K` by:
- Adding the `pipeline_tag: image-text-to-text` to improve discoverability on the Hugging Face Hub.
- Specifying `license: cc-by-nc-4.0`, in line with the likely license of the associated GeoLLaVA-Data dataset.
- Including a direct link to the paper: [GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution](https://huggingface.co/papers/2505.21375).
- Adding a link to the official GitHub repository: https://github.com/MiliLab/GeoLLaVA-8K.
- Providing a concrete `lmms-eval` code snippet for evaluation, copied directly from the GitHub README, to guide users on model usage.
- Adding the BibTeX citation for proper attribution.
Please review and merge this PR if everything looks good.

The full updated model card follows:

---
base_model:
- lmms-lab/LongVA-7B
language:
- en
library_name: transformers
license: cc-by-nc-4.0
pipeline_tag: image-text-to-text
---

<div align="center">
<h2><strong>GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution</strong></h2>
<h5>
<em>
Fengxiang Wang<sup>1</sup>, Mingshuo Chen<sup>2</sup>, Yueying Li<sup>1</sup>, Di Wang<sup>4,5</sup>, Haotian Wang<sup>1</sup>, <br/>
Zonghao Guo<sup>3</sup>, Zefan Wang<sup>3</sup>, Boqi Shan<sup>6</sup>, Long Lan<sup>1</sup>, Yulin Wang<sup>3 †</sup>, <br/>
Hongzhen Wang<sup>3 †</sup>, Wenjing Yang<sup>1 †</sup>, Bo Du<sup>4</sup>, Jing Zhang<sup>4 †</sup>
</em>
<br/><br/>
<sup>1</sup> National University of Defense Technology, China<br/>
<sup>2</sup> Beijing University of Posts and Telecommunications, China<br/>
<sup>3</sup> Tsinghua University, China, <sup>4</sup> Wuhan University, China<br/>
<sup>5</sup> Zhongguancun Academy, China, <sup>6</sup> Beihang University, China
</h5>
<p>
📃 <a href="https://arxiv.org/abs/2505.21375" target="_blank">Paper</a> |
🤗 <a href="https://huggingface.co/initiacms/GeoLLaVA-8K" target="_blank">Model</a> |
🤗 <a href="https://huggingface.co/datasets/initiacms/GeoLLaVA-Data" target="_blank">Dataset</a>
</p>
</div>

GeoLLaVA-8K is the first remote-sensing-focused multimodal large language model capable of handling inputs up to 8K×8K resolution, built on the LLaVA framework. It addresses two key bottlenecks in processing ultra-high-resolution (UHR) remote sensing imagery: (1) the limited availability of UHR training data, and (2) token explosion caused by large image sizes. To overcome these, GeoLLaVA-8K introduces novel UHR vision-language datasets (SuperRS-VQA and HighRS-VQA) and proposes strategies such as Background Token Pruning and Anchored Token Selection.
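
To make the idea concrete, the toy sketch below mimics a two-stage token-reduction scheme in the spirit of these strategies: first prune presumed-background tokens, then keep only the tokens most relevant to an anchor embedding. It is an illustrative simplification written for this card, not the paper's implementation; every heuristic in it (feature variance as a background proxy, cosine similarity to a pooled text anchor) is an assumption.

```python
import torch

def reduce_visual_tokens(tokens: torch.Tensor, anchor: torch.Tensor,
                         bg_keep_ratio: float = 0.5, top_k: int = 256) -> torch.Tensor:
    """Toy two-stage token reduction, for illustration only.

    tokens: (N, D) visual token embeddings; anchor: (D,) pooled query embedding.
    """
    # Stage 1 (background-pruning stand-in): drop the lowest-variance tokens,
    # using low feature variance as a crude proxy for uniform background.
    n_keep = max(1, int(tokens.size(0) * bg_keep_ratio))
    keep_idx = tokens.var(dim=-1).topk(n_keep).indices
    tokens = tokens[keep_idx]

    # Stage 2 (anchored-selection stand-in): keep the top-k tokens most
    # similar to the anchor embedding under cosine similarity.
    sims = torch.nn.functional.cosine_similarity(tokens, anchor.unsqueeze(0), dim=-1)
    top_idx = sims.topk(min(top_k, tokens.size(0))).indices
    return tokens[top_idx]

# A 64x64 patch grid already yields 4,096 visual tokens; reduce them to 256.
visual_tokens = torch.randn(4096, 1024)
text_anchor = torch.randn(1024)
print(reduce_visual_tokens(visual_tokens, text_anchor).shape)  # torch.Size([256, 1024])
```

Even this small example shows the budget pressure: a 64×64 patch grid produces 4,096 visual tokens before a single language token is added, which is why aggressive token reduction is central to scaling toward 8K inputs.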

This model was presented in the paper [GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution](https://huggingface.co/papers/2505.21375).
Official GitHub repository: [https://github.com/MiliLab/GeoLLaVA-8K](https://github.com/MiliLab/GeoLLaVA-8K)

## Usage

GeoLLaVA-8K is built upon [LongVA](https://github.com/EvolvingLMMs-Lab/LongVA). For detailed installation and finetuning instructions, please refer to the [GitHub repository](https://github.com/MiliLab/GeoLLaVA-8K).
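
Since the model follows LongVA's architecture, single-image inference should work with LongVA's loading utilities. The snippet below is a minimal, untested sketch adapted from LongVA's image-inference example; it assumes the checkpoint keeps LongVA's `llava_qwen` model type and Qwen chat template, and `example_uhr_image.png` is a placeholder path. The GeoLLaVA-8K repository remains the authoritative reference.

```python
import torch
from PIL import Image
from longva.model.builder import load_pretrained_model
from longva.mm_utils import tokenizer_image_token, process_images
from longva.constants import IMAGE_TOKEN_INDEX

# Load the checkpoint with LongVA's builder (model type assumed to be "llava_qwen").
tokenizer, model, image_processor, _ = load_pretrained_model(
    "initiacms/GeoLLaVA-8K", None, "llava_qwen", device_map="cuda:0"
)

# Qwen-style chat prompt with an <image> placeholder, as in LongVA's demo.
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n<image>\nDescribe this remote-sensing image in detail.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to(model.device)

image = Image.open("example_uhr_image.png").convert("RGB")
images_tensor = process_images([image], image_processor, model.config).to(
    model.device, dtype=torch.float16
)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=images_tensor,
        image_sizes=[image.size],
        do_sample=False,
        max_new_tokens=512,
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```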

For evaluation, you can use the `lmms-eval` framework as demonstrated below:

```bash
CKPT_PATH=initiacms/GeoLLaVA-8K  # or local path
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
    --model "longva" \
    --model_args "pretrained=${CKPT_PATH},use_flash_attention_2=True" \
    --tasks xlrs-lite \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix longva_xlrs_lite \
    --output_path ./logs/
```
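
If `lmms_eval` reports `xlrs-lite` as an unknown task, the task configuration likely ships with the GeoLLaVA-8K codebase rather than with upstream `lmms-eval`, so run the command from an environment prepared per the repository's installation instructions.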

## Citation

If you find our work helpful, please consider citing:

```bibtex
@article{wang2025geollava8kscalingremotesensingmultimodal,
  title={GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution},
  author={Fengxiang Wang and Mingshuo Chen and Yueying Li and Di Wang and Haotian Wang and Zonghao Guo and Zefan Wang and Boqi Shan and Long Lan and Yulin Wang and Hongzhen Wang and Wenjing Yang and Bo Du and Jing Zhang},
  journal={arXiv preprint arXiv:2505.21375},
  year={2025}
}
```
|