aadex committed on
Commit
6028ebb
·
verified ·
1 Parent(s): 44a2d50

Upload EarthMind-4B GRPO fine-tuned model

README.md ADDED
@@ -0,0 +1,130 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ tags:
+ - vision-language
+ - vlm
+ - grpo
+ - earthmind
+ - geospatial
+ - remote-sensing
+ library_name: transformers
+ pipeline_tag: image-text-to-text
+ ---
+
+ # EarthMind-R1
+
+ EarthMind-R1 is a vision-language model fine-tuned using GRPO (Group Relative Policy Optimization) for geospatial and remote sensing image understanding tasks.
+
+ ## Model Description
+
+ - **Base Model:** EarthMind-4B
+ - **Training Method:** GRPO (Group Relative Policy Optimization)
+ - **Training Data:** Geospatial instruction dataset
+ - **Fine-tuning:** LoRA adapters merged into base weights
+
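The last bullet above, merging LoRA adapters into the base weights, amounts to adding the scaled low-rank product into each frozen weight matrix. A minimal NumPy sketch with toy shapes (the shapes are illustrative; in practice the merge is handled by the fine-tuning tooling, not by hand):

```python
import numpy as np

rng = np.random.default_rng(0)

# Values from this card's LoRA configuration: r=32, alpha=64
lora_r, lora_alpha = 32, 64
scale = lora_alpha / lora_r  # scaling factor alpha/r = 2.0

# Toy weight shapes; the real projection matrices are much larger
d_out, d_in = 16, 16
W = rng.standard_normal((d_out, d_in))          # frozen base weight
A = rng.standard_normal((lora_r, d_in)) * 0.01  # LoRA down-projection
B = np.zeros((d_out, lora_r))                   # LoRA up-projection (zero-init)

# Merging the adapter into the base weight: W' = W + (alpha/r) * B @ A
W_merged = W + scale * (B @ A)

# With B still zero-initialized, the merge is a no-op; training moves B away from zero
assert np.allclose(W_merged, W)
```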
+ ## Usage
+
+ ### Quick Start
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load model and tokenizer
+ model_id = "aadex/Earthmind-R1"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     trust_remote_code=True,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ # Load an image
+ image = Image.open("your_image.jpg").convert("RGB")
+
+ # Ask a question
+ question = "Describe what you see in this satellite image."
+
+ # Use the model's chat interface
+ response = model.chat(
+     tokenizer=tokenizer,
+     question=question,
+     images=[image],
+     generation_config={
+         "max_new_tokens": 512,
+         "temperature": 0.7,
+         "do_sample": True,
+     },
+ )
+
+ print(response)
+ ```
+
+ ### Expected Output Format
+
+ The model is trained to produce structured responses:
+
+ ```
+ <think>
+ [Reasoning about the image content]
+ </think>
+ <answer>
+ [Final answer to the question]
+ </answer>
+ ```
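The tagged format above is straightforward to post-process. A small regex-based sketch (the helper name is hypothetical, not part of the model's API) for pulling out the final answer:

```python
import re

def extract_answer(response: str) -> str:
    """Return the text inside <answer>...</answer>, or the raw response if absent."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return match.group(1).strip() if match else response.strip()

# Example response in the structured format described above
response = (
    "<think>\nRegular rectangular plots with an irrigation canal are visible.\n</think>\n"
    "<answer>\nAgricultural farmland crossed by an irrigation canal.\n</answer>"
)
print(extract_answer(response))
# -> Agricultural farmland crossed by an irrigation canal.
```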
+
+ ## Requirements
+
+ ```
+ torch>=2.0
+ transformers>=4.40
+ accelerate
+ pillow
+ ```
+
+ ## Hardware Requirements
+
+ - **Minimum:** 16 GB VRAM (with bfloat16)
+ - **Recommended:** 24 GB VRAM for comfortable inference
+
+ ## Training Details
+
+ - **Framework:** VLM-R1 + TRL
+ - **Optimizer:** AdamW
+ - **Learning Rate:** 1e-6
+ - **LoRA Configuration:**
+   - r: 32
+   - alpha: 64
+   - dropout: 0.05
+ - **GRPO Settings:**
+   - num_generations: 4
+   - num_iterations: 2
+   - beta: 0.01
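The num_generations setting above is the GRPO group size: for each prompt the policy samples a group of completions, and each completion's advantage is its reward normalized against the group's mean and standard deviation. An illustrative pure-Python sketch with toy scalar rewards (not this repo's training code):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-4):
    """GRPO-style advantages: (r - mean(group)) / (stdev(group) + eps)."""
    mu, sigma = mean(rewards), stdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group of num_generations=4 completions with toy rewards
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

# Completions scoring above the group mean get positive advantage,
# those below get negative; the group's advantages sum to ~zero
assert advantages[0] > 0 > advantages[1]
assert abs(sum(advantages)) < 1e-6
```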
+
+ ## Limitations
+
+ - Optimized for geospatial and remote-sensing imagery
+ - May not perform as well on general-domain images
+ - Response quality depends on image resolution and clarity
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{earthmind-r1,
+   title={EarthMind-R1: GRPO Fine-tuned Vision-Language Model for Geospatial Understanding},
+   author={Your Name},
+   year={2024},
+   publisher={HuggingFace}
+ }
+ ```
+
+ ## License
+
+ Apache 2.0
model-00001-of-00002.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:8c6e07df93c166ba98e65cd235559e512d79b5a92aa3adb1d967c4d1f3a741d4
+ oid sha256:97f3792a0d86308d529a858ac40fb0d704ffa3a4da4a042a6acb77b184e5eb97
  size 4993044040
model-00002-of-00002.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c305d389e82ee348d9dbbe3b8fd23637842a98663562bfa94a33aa74bc4554e2
+ oid sha256:8a5ada102da6c3ed05981f12a81dc425a3aa173c9e18778530ff3fab08ee9313
  size 2890805372
modeling_earthmind_chat.py CHANGED
@@ -38,7 +38,9 @@ from types import MethodType
  import torch.nn.functional as F

  try:
-     from .flash_attention import FlashAttention
+     # flash_attention import removed for inference without flash_attn
+     # from .flash_attention import FlashAttention
+     FlashAttention = None
      has_flash_attn = True
  except:
      print('FlashAttention is not installed.')
modeling_intern_vit.py CHANGED
@@ -21,7 +21,9 @@ from transformers.utils import logging
  from .configuration_intern_vit import InternVisionConfig

  try:
-     from .flash_attention import FlashAttention
+     # flash_attention import removed for inference without flash_attn
+     # from .flash_attention import FlashAttention
+     FlashAttention = None
      has_flash_attn = True
  except:
      print('FlashAttention is not installed.')
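Both diffs apply the same optional-dependency pattern: attempt the fast-path import and fall back to a sentinel when flash-attn is unavailable, so the model still loads. A standalone sketch of the pattern (note that the patched files leave has_flash_attn = True inside the try block even with the import commented out; the stricter variant below keys it off the import actually succeeding):

```python
# Optional-dependency fallback: try the fast attention path, degrade gracefully.
try:
    from flash_attn import flash_attn_func  # fast path; only present if flash-attn is installed
    has_flash_attn = True
except ImportError:
    flash_attn_func = None
    has_flash_attn = False
    print("FlashAttention is not installed; falling back to standard attention.")

def attention_backend() -> str:
    """Pick the attention backend based on what imported successfully."""
    return "flash" if has_flash_attn else "eager"

print(attention_backend())
```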