---
license: apache-2.0
language:
- en
tags:
- vision-language
- vlm
- grpo
- earthmind
- geospatial
- remote-sensing
library_name: transformers
pipeline_tag: image-text-to-text
---

# EarthMind-R1

EarthMind-R1 is a vision-language model fine-tuned using GRPO (Group Relative Policy Optimization) for geospatial and remote sensing image understanding tasks.

## Model Description

- **Base Model:** EarthMind-4B
- **Training Method:** GRPO (Group Relative Policy Optimization)
- **Training Data:** Geospatial instruction dataset
- **Fine-tuning:** LoRA adapters merged into the base weights

## Usage

### Quick Start

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_id = "aadex/Earthmind-R1"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load an image
image = Image.open("your_image.jpg").convert("RGB")

# Ask a question
question = "Describe what you see in this satellite image."
# Use the model's chat interface
response = model.chat(
    tokenizer=tokenizer,
    question=question,
    images=[image],
    generation_config={
        "max_new_tokens": 512,
        "temperature": 0.7,
        "do_sample": True,
    },
)
print(response)
```

### Expected Output Format

The model is trained to provide structured responses:

```
[Reasoning about the image content]
[Final answer to the question]
```

## Requirements

```
torch>=2.0
transformers>=4.40
accelerate
pillow
```

## Hardware Requirements

- **Minimum:** 16 GB VRAM (with bfloat16)
- **Recommended:** 24 GB VRAM for comfortable inference

## Training Details

- **Framework:** VLM-R1 + TRL
- **Optimizer:** AdamW
- **Learning Rate:** 1e-6
- **LoRA Configuration:**
  - r: 32
  - alpha: 64
  - dropout: 0.05
- **GRPO Settings:**
  - num_generations: 4
  - num_iterations: 2
  - beta: 0.01

## Limitations

- Optimized for geospatial/remote sensing imagery
- May not perform as well on general-domain images
- Response quality depends on image resolution and clarity

## Citation

If you use this model, please cite:

```bibtex
@misc{earthmind-r1,
  title={EarthMind-R1: GRPO Fine-tuned Vision-Language Model for Geospatial Understanding},
  author={Your Name},
  year={2024},
  publisher={HuggingFace}
}
```

## License

Apache 2.0
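As a closing sanity check on the hardware requirements listed earlier: a 4B-parameter model (the count is inferred from the base model name, EarthMind-4B) stored in bfloat16 needs roughly 8 GB for the weights alone, before activations and KV cache, which is consistent with the 16 GB minimum above. A quick back-of-the-envelope sketch (the 2x headroom factor is an illustrative rule of thumb, not a measured value):

```python
# Rough VRAM estimate for a 4B-parameter model in bfloat16.
# The parameter count comes from the base model name ("EarthMind-4B");
# the headroom multiplier is an illustrative assumption, not measured.

num_params = 4e9        # ~4 billion parameters
bytes_per_param = 2     # bfloat16 stores each parameter in 2 bytes

weights_gb = num_params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")           # ~8 GB

# Activations, KV cache, and CUDA context need extra room on top of the
# weights; roughly doubling the weight footprint is a common rule of thumb.
budget_gb = weights_gb * 2
print(f"Comfortable budget: ~{budget_gb:.0f} GB")       # ~16 GB
```

This is why the card lists 16 GB as a minimum rather than a comfortable target; longer generations and larger images grow the KV cache and activation memory beyond the weight footprint.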