---
license: apache-2.0
language:
- en
tags:
- vision-language
- vlm
- grpo
- earthmind
- geospatial
- remote-sensing
library_name: transformers
pipeline_tag: image-text-to-text
---

# EarthMind-R1

EarthMind-R1 is a vision-language model fine-tuned using GRPO (Group Relative Policy Optimization) for geospatial and remote sensing image understanding tasks.

## Model Description

- **Base Model:** EarthMind-4B
- **Training Method:** GRPO (Group Relative Policy Optimization)
- **Training Data:** Geospatial instruction dataset
- **Fine-tuning:** LoRA adapters merged into the base weights

## Usage

### Quick Start

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_id = "aadex/Earthmind-R1"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load an image
image = Image.open("your_image.jpg").convert("RGB")

# Ask a question
question = "Describe what you see in this satellite image."
# Use the model's chat interface
response = model.chat(
    tokenizer=tokenizer,
    question=question,
    images=[image],
    generation_config={
        "max_new_tokens": 512,
        "temperature": 0.7,
        "do_sample": True,
    },
)
print(response)
```

### Expected Output Format

The model is trained to provide structured responses:

```
[Reasoning about the image content]
[Final answer to the question]
```

## Requirements

```
torch>=2.0
transformers>=4.40
accelerate
pillow
```

## Hardware Requirements

- **Minimum:** 16 GB VRAM (with bfloat16)
- **Recommended:** 24 GB VRAM for comfortable inference

## Training Details

- **Framework:** VLM-R1 + TRL
- **Optimizer:** AdamW
- **Learning Rate:** 1e-6
- **LoRA Configuration:**
  - r: 32
  - alpha: 64
  - dropout: 0.05
- **GRPO Settings:**
  - num_generations: 4
  - num_iterations: 2
  - beta: 0.01

## Limitations

- Optimized for geospatial/remote sensing imagery
- May not perform as well on general-domain images
- Response quality depends on image resolution and clarity

## Citation

If you use this model, please cite:

```bibtex
@misc{earthmind-r1,
  title={EarthMind-R1: GRPO Fine-tuned Vision-Language Model for Geospatial Understanding},
  author={Your Name},
  year={2024},
  publisher={HuggingFace}
}
```

## License

Apache 2.0
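As a closing sanity check on the hardware requirements listed earlier: a 4B-parameter model (the count is inferred from the base model name, EarthMind-4B) stored in bfloat16 needs roughly 8 GB for the weights alone, before activations and KV cache, which is consistent with the 16 GB minimum above. A quick back-of-the-envelope sketch (the 2x headroom factor is an illustrative rule of thumb, not a measured value):

```python
# Rough VRAM estimate for a 4B-parameter model in bfloat16.
# The parameter count comes from the base model name ("EarthMind-4B");
# the headroom multiplier is an illustrative assumption, not measured.

num_params = 4e9        # ~4 billion parameters
bytes_per_param = 2     # bfloat16 stores each parameter in 2 bytes

weights_gb = num_params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")           # ~8 GB

# Activations, KV cache, and CUDA context need extra room on top of the
# weights; roughly doubling the weight footprint is a common rule of thumb.
budget_gb = weights_gb * 2
print(f"Comfortable budget: ~{budget_gb:.0f} GB")       # ~16 GB
```

This is why the card lists 16 GB as a minimum rather than a comfortable target; longer generations and larger images grow the KV cache and activation memory beyond the weight footprint.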