---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- R2R
- VLN
- Room-to-Room
- LVLM
---

# Qwen2.5-VL-3B-R2R-panoramic

**Qwen2.5-VL-3B-R2R-panoramic** is a Vision-and-Language Navigation (VLN) model fine-tuned from [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) on the [Room-to-Room (R2R)](https://bringmeaspoon.org/) dataset using the Matterport3D (MP3D) simulator. The model uses a panoramic action space: at each step it receives a preprocessed panoramic image together with a set of candidate views, each pointing towards a neighboring node in a Matterport3D simulator environment.

Only the LLM component is fine-tuned; the vision encoder and cross-modal projector are kept frozen.

## 🧠 Model Summary

- **Base Model**: Qwen2.5-VL-3B-Instruct
- **Dataset**: Room-to-Room (R2R) via the Matterport3D simulator
- **Image Resolution**: 320×240 for candidate images, 960×240 for panoramic images (see the resizing sketch below)
- **Action Space**: Panoramic

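For reference, here is a minimal sketch of bringing raw frames to these resolutions with PIL before inference. The file names are placeholders, not files shipped with this repo:

```python
from PIL import Image

# Hypothetical file names; resize candidate views to 320x240
# and panoramas to 960x240 to match the training resolution.
candidate = Image.open("candidate_0.png").resize((320, 240), Image.Resampling.LANCZOS)
panorama = Image.open("pano_0.png").resize((960, 240), Image.Resampling.LANCZOS)
```
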
## 🧪 Training Setup

- **Frozen Modules**: Vision encoder and cross-modal projector
- **Fine-Tuned Module**: LLM decoder (Qwen2.5)
- **Optimizer**: AdamW
- **Batch Size**: `1` (with gradient accumulation over each episode)
- **Learning Rate**: `1e-5`
- **Weight Decay**: `0.1`
- **Precision**: `bfloat16`
- **LR Scheduler**: Linear schedule with warmup (first 10% of steps)
- **Hardware**: Single NVIDIA A100 80GB GPU

Training was done using supervised learning for next-action prediction. At each step the model was conditioned on a system prompt, a panoramic RGB observation (960×240) of the current view, a variable number of candidate RGB images (320×240), and the cumulative episode history including previous panoramas. The model was trained offline (outside the MP3D simulator) using teacher forcing on a preprocessed R2R dataset.

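For illustration, here is a minimal sketch of what such a teacher-forced next-action objective could look like. The masking scheme and names below are assumptions, not the exact training code:

```python
import torch

# Hypothetical sketch: the prompt tokens are masked out with -100 so that
# cross-entropy is computed only over the ground-truth action tokens
# (e.g. the tokens of "Candidate: 2").
def next_action_loss(model, batch, num_action_tokens):
    labels = batch["input_ids"].clone()
    labels[:, :-num_action_tokens] = -100  # -100 is ignored by the loss
    outputs = model(**batch, labels=labels)  # HF models compute the shifted CE loss internally
    return outputs.loss
```
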
## 📦 Usage
```python
import os

import torch
from torch.utils.data import Dataset, DataLoader
from datasets import Dataset as DT
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image

class CustomDataset(Dataset):
    def __init__(self, data):
        self.text = data["text"]
        self.panoramas = data["panoramas"]
        self.candidates = data["candidates"]

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        return self.text[index], self.panoramas[index], self.candidates[index]

# TODO: make the collate functor work with batches
class CollateFunctor:
    # No batching, therefore no max length
    def __init__(self, processor, width, height):
        self.processor = processor
        self.width = width
        self.height = height

    def __call__(self, batch):
        text, panoramas, candidates = batch[0]
        # Append the start of the assistant turn so the next predicted token is the candidate number
        label_start = self.processor.tokenizer("<|im_start|>assistant\nCandidate: ", return_tensors="pt").input_ids

        images = [Image.open(img) for img in panoramas]
        candidate_images = [Image.open(img) for img in candidates]
        # candidate_images = [Image.open(img).resize((self.width, self.height), Image.Resampling.LANCZOS) for img in candidates]
        images.extend(candidate_images)

        processed = self.processor(text=text, images=[images], return_tensors="pt")

        prompt_input_ids = processed["input_ids"]
        input_ids = torch.cat([prompt_input_ids, label_start], dim=1)

        attention_mask = torch.ones(1, input_ids.shape[1])
        processed["input_ids"] = input_ids
        processed["attention_mask"] = attention_mask

        return processed


def format_prompt(images_path, path_id, route_instruction, step_id, distance_traveled, candidates, processor, system_prompt):
    # Images should be in the order: panorama history, current panorama, candidate views from left to right
    images = os.listdir(images_path)
    panoramas = [os.path.join(images_path, img) for img in images if img.startswith("pano")]
    panoramas = sorted(panoramas, key=lambda x: int(x.split("_")[-1].split(".")[-2]))

    # These are probably sorted by default, but you may want to verify
    candidate_images = [os.path.join(images_path, img) for img in images if not img.startswith("pano")]
    candidate_images = sorted(candidate_images, key=lambda x: int(x.split("_")[-1].split(".")[0]))

    current_panorama = panoramas.pop(-1)

    # Route instruction, current step, cumulative distance
    content = [
        {
            "type": "text",
            "text": f"Route instruction: {route_instruction}\nCurrent step: {step_id}\nCumulative Distance Traveled: {distance_traveled} meters\n\nPanorama Images from Previous Steps:"
        }
    ]

    # Panoramas from previous steps
    for i, img in enumerate(panoramas):
        content.append({
            "type": "text",
            "text": f"\n\tPanorama at step: {i}: "
        })
        content.append({
            "type": "image",
            "image": img
        })

    if len(panoramas) == 0:
        content[0]["text"] += "[]"

    # Current panorama
    content.append({
        "type": "text",
        "text": "\n\nCurrent Panorama Image:\n\t"
    })
    content.append({
        "type": "image",
        "image": current_panorama
    })

    # Candidate directions
    content.append({
        "type": "text",
        "text": "\n\nCandidate Directions:"
    })

    for i, candidate in enumerate(candidates.values()):
        relative_angle = round(candidate["relative_angle"], 0)
        distance = round(candidate["distance"], 2)
        direction = "Left" if relative_angle < 0 else "Right"

        content.append({
            "type": "text",
            "text": f"\n\tCandidate: {i}:\n\t\tRelative angle: {abs(relative_angle)} degrees to the {direction}\n\t\tDistance: {distance} meters\n\t\tview: "
        })
        content.append({
            "type": "image",
            "image": candidate_images[i]
        })

    # Add the STOP candidate and the selection instruction
    content.append({
        "type": "text",
        "text": "\n\tCandidate: Stop\n\nNow, analyze the route instruction, your current position, and the available candidate directions. Select the candidate that best matches the instruction and helps you continue along the correct path. Answer on the format: Candidate: (and then the number)"
    })

    messages = [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user", "content": content},
    ]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)

    panoramas.append(current_panorama)

    formatted_sample = {
        "text": text,
        "candidates": candidate_images,
        "panoramas": panoramas,
    }

    return DT.from_list([formatted_sample])


processor = AutoProcessor.from_pretrained("Vebbern/Qwen2.5-VL-3B-R2R-panoramic")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Vebbern/Qwen2.5-VL-3B-R2R-panoramic",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda"
)

# Remember to set the correct image resolution (a higher one might still work, since the vision encoder was not trained)
collate_fn = CollateFunctor(processor, 320, 240)

# Load the mandatory system prompt (included in this repo)
with open("system_prompt.txt", "r") as f:
    system_prompt = f.read()

path_id = 4332  # ID of the R2R path
route_instruction = "Walk to the other end of the lobby and wait near the exit. "
images_path = f"./images/{path_id}"
step_id = 0
cumulative_distance = 0
candidates = {
    "0": {
        "relative_angle": -60.62797609213225,
        "relative_direction": "Left",
        "distance": 2.3325929641723633
    },
    "1": {
        "relative_angle": -0.00397697185949581,
        "relative_direction": "Front",
        "distance": 4.637096405029297
    },
    "2": {
        "relative_angle": 25.24592108757226,
        "relative_direction": "Front",
        "distance": 3.3661904335021973
    }
}

prompt = format_prompt(images_path, path_id, route_instruction, step_id, cumulative_distance, candidates, processor, system_prompt)

dataset = CustomDataset(prompt)
data_loader = DataLoader(
    dataset,
    batch_size=1,
    collate_fn=collate_fn
)

# Run inference
for batch in data_loader:
    batch = batch.to("cuda")

    with torch.no_grad():
        outputs = model(**batch)
    argmax = torch.argmax(outputs.logits, dim=2)[0]
    # The logit at the last position predicts the next token, i.e. the chosen candidate
    model_prediction = processor.decode(argmax[-1])
    print(f"Predicted action: {model_prediction}")

```

> ⚠️ Sorry for the rough code; the goal here is to show how the system prompt and inputs should be structured for inference. The system prompt is included in the repo.

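Instead of a single forward pass and an argmax over the last position, you could also decode the action with `model.generate`. A sketch, untested against this exact pipeline:

```python
# Sketch: greedy decoding of the action with generate() instead of a raw forward pass.
with torch.no_grad():
    generated = model.generate(**batch, max_new_tokens=2, do_sample=False)
# Decode only the newly generated tokens after the prompt.
prediction = processor.decode(generated[0, batch["input_ids"].shape[1]:], skip_special_tokens=True)
print(f"Predicted action: {prediction}")
```
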
## 📊 Evaluation Results

The model was evaluated on the standard Room-to-Room (R2R) validation and test splits using the Matterport3D simulator. Performance is measured with the standard VLN metrics.

| Metric                     | Val Seen | Val Unseen | Test |
|----------------------------|----------|------------|------|
| Path Length (m, ↓)         | 9.98     | 9.83       | 9.96 |
| Navigation Error (m, ↓)    | 5.69     | 6.65       | 6.53 |
| Oracle Success Rate (↑)    | 56%      | 46%        | 50%  |
| Success Rate (↑)           | 50%      | 38%        | 41%  |
| SPL (↑)                    | 47%      | 35%        | 38%  |

### 🧾 Metric Definitions
- **Navigation Error**: Mean distance from the goal when the agent stops.
- **Success Rate**: Percentage of episodes where the agent stops within 3 meters of the goal.
- **SPL (Success weighted by Path Length)**: Success rate weighted by path efficiency; penalizes long or inefficient paths (see the formula below).
- **Oracle Success Rate**: Success rate if the agent had stopped at its closest point to the goal.

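For completeness, SPL as defined by Anderson et al. (2018):

$$\mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{\ell_i}{\max(p_i, \ell_i)}$$

where \\(S_i\\) indicates success on episode \\(i\\), \\(\ell_i\\) is the shortest-path distance from start to goal, and \\(p_i\\) is the length of the agent's actual path.
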
### 📝 Remarks

This model performs far behind state-of-the-art R2R models, likely due to a combination of factors such as the underlying model architecture, the training strategy, and the panoramic representation.

## 🔁 Related Models
A low-level action space equivalent of this model also exists:
- **Low-Level Action Space Version**: [Qwen2.5-VL-3B-R2R-low-level](https://huggingface.co/Vebbern/Qwen2.5-VL-3B-R2R-low-level)

## 🪪 License

This model is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).