|
|
--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- vision-language |
|
|
- vlm |
|
|
- grpo |
|
|
- earthmind |
|
|
- geospatial |
|
|
- remote-sensing |
|
|
library_name: transformers |
|
|
pipeline_tag: image-text-to-text |
|
|
--- |
|
|
|
|
|
# EarthMind-R1 |
|
|
|
|
|
EarthMind-R1 is a vision-language model fine-tuned using GRPO (Group Relative Policy Optimization) for geospatial and remote sensing image understanding tasks. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
- **Base Model:** EarthMind-4B |
|
|
- **Training Method:** GRPO (Group Relative Policy Optimization) |
|
|
- **Training Data:** Geospatial instruction dataset |
|
|
- **Fine-tuning:** LoRA adapters merged into base weights |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Quick Start |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from PIL import Image |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
|
|
|
# Load model and tokenizer |
|
|
model_id = "aadex/Earthmind-R1" |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_id, |
|
|
trust_remote_code=True, |
|
|
torch_dtype=torch.bfloat16, |
|
|
device_map="auto", |
|
|
) |
|
|
|
|
|
# Load an image |
|
|
image = Image.open("your_image.jpg").convert("RGB") |
|
|
|
|
|
# Ask a question |
|
|
question = "Describe what you see in this satellite image." |
|
|
|
|
|
# Use model's chat interface |
|
|
response = model.chat( |
|
|
tokenizer=tokenizer, |
|
|
question=question, |
|
|
images=[image], |
|
|
generation_config={ |
|
|
"max_new_tokens": 512, |
|
|
"temperature": 0.7, |
|
|
"do_sample": True, |
|
|
}, |
|
|
) |
|
|
|
|
|
print(response) |
|
|
``` |
|
|
|
|
|
### Expected Output Format |
|
|
|
|
|
The model is trained to wrap its reasoning and final answer in explicit tags:
|
|
|
|
|
``` |
|
|
<think> |
|
|
[Reasoning about the image content] |
|
|
</think> |
|
|
<answer> |
|
|
[Final answer to the question] |
|
|
</answer> |
|
|
``` |
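The tagged output above can be split into its reasoning and answer parts with a small amount of post-processing. The helper below is a sketch (the function name `parse_response` is illustrative, not part of the model's API); it falls back to the raw text when the tags are missing, since sampling does not guarantee well-formed tags.

```python
import re

def parse_response(text: str):
    """Split a model response into its <think> and <answer> parts.

    Returns (reasoning, answer); falls back to the raw text as the
    answer when the tags are absent.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else "",
        answer.group(1).strip() if answer else text.strip(),
    )

reasoning, answer = parse_response(
    "<think>Dense rooftops and a grid of roads.</think>"
    "<answer>An urban residential area.</answer>"
)
print(answer)  # An urban residential area.
```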
|
|
|
|
|
## Requirements |
|
|
|
|
|
``` |
|
|
torch>=2.0 |
|
|
transformers>=4.40 |
|
|
accelerate |
|
|
pillow |
|
|
``` |
|
|
|
|
|
## Hardware Requirements |
|
|
|
|
|
- **Minimum:** 16GB VRAM (with bfloat16) |
|
|
- **Recommended:** 24GB VRAM for comfortable inference |
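If VRAM is below the 16GB minimum, loading the weights in 4-bit via bitsandbytes may still work. This is an untested sketch against this checkpoint, assuming the model's custom code tolerates `transformers` quantization; expect some quality loss relative to bfloat16.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "aadex/Earthmind-R1"

# NF4 4-bit quantization roughly quarters weight memory versus bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",
)
```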
|
|
|
|
|
## Training Details |
|
|
|
|
|
- **Framework:** VLM-R1 + TRL |
|
|
- **Optimizer:** AdamW |
|
|
- **Learning Rate:** 1e-6 |
|
|
- **LoRA Configuration:** |
|
|
- r: 32 |
|
|
- alpha: 64 |
|
|
- dropout: 0.05 |
|
|
- **GRPO Settings:** |
|
|
- num_generations: 4 |
|
|
- num_iterations: 2 |
|
|
- beta: 0.01 |
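The hyperparameters above map roughly onto peft's `LoraConfig` and TRL's `GRPOConfig`. The snippet below is an illustrative reconstruction under that assumption, not the exact VLM-R1 training script:

```python
from peft import LoraConfig
from trl import GRPOConfig

# LoRA adapters as listed above; merged into the base weights after training.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# GRPO settings as listed above.
training_args = GRPOConfig(
    learning_rate=1e-6,
    num_generations=4,   # completions sampled per prompt for the group baseline
    num_iterations=2,    # optimization passes per generation batch
    beta=0.01,           # KL penalty coefficient against the reference model
    bf16=True,
)
```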
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Optimized for geospatial/remote sensing imagery |
|
|
- May not perform as well on general-domain images
|
|
- Response quality depends on image resolution and clarity |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{earthmind-r1, |
|
|
title={EarthMind-R1: GRPO Fine-tuned Vision-Language Model for Geospatial Understanding}, |
|
|
author={Your Name}, |
|
|
year={2024}, |
|
|
publisher={HuggingFace} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
Apache 2.0 |
|
|
|