---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- Room-to-Room
- R2R
- VLN
- Vision-and-Language-Navigation
---

# Qwen2.5-VL-3B-R2R-low-level

**Qwen2.5-VL-3B-R2R-low-level** is a Vision-and-Language Navigation (VLN) model fine-tuned from [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) on the [Room-to-Room (R2R)](https://bringmeaspoon.org/) dataset using the Matterport3D (MP3D) simulator. The model is trained using a low-level action space, where it perceives the environment through egocentric RGB images at a resolution of 320x240.  

Only the LLM component is fine-tuned — the vision encoder and cross-modal projector are kept frozen.
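
A minimal sketch of what this freezing looks like in practice (not the original training script; the attribute path to the vision tower is an assumption and can differ across `transformers` versions):

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16
)

# Freeze the vision tower (which also holds the cross-modal merger/projector),
# leaving only the language-model weights trainable. The attribute path is an
# assumption: it may be `model.visual` or `model.model.visual` depending on
# the transformers version.
vision_tower = getattr(model, "visual", None) or model.model.visual
for param in vision_tower.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```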


## 🧠 Model Summary

- **Base Model**: Qwen2.5-VL-3B-Instruct
- **Dataset**: Room-to-Room (R2R) via the Matterport3D simulator.
- **Image Resolution**: 320x240.
- **Action Space** (a minimal mapping sketch follows after this list):
  - `Move`: Move to the adjacent node closest to the center of the field of view.
  - `Left`: Turn 30° to the left.
  - `Right`: Turn 30° to the right.
  - `Stop`: Select when the agent believes it has reached the goal.
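
The sketch below is hypothetical (not part of the released code) and shows one way the discrete actions could be translated into low-level arguments for the Matterport3D simulator; the exact simulator interface used during training is an assumption here.

```python
import math

# Hypothetical mapping from a predicted action string to the
# (location_index, heading_delta, elevation_delta) arguments that a
# makeAction-style MP3D simulator interface expects. Details of the
# real interface used in training may differ.
def action_to_sim_args(action: str, forward_location_index: int = 1):
    turn = math.radians(30)  # the agent turns in fixed 30-degree increments
    if action == "Move":
        # go to the navigable node closest to the center of the field of view
        return forward_location_index, 0.0, 0.0
    if action == "Left":
        return 0, -turn, 0.0
    if action == "Right":
        return 0, turn, 0.0
    if action == "Stop":
        return 0, 0.0, 0.0  # stay in place; the episode ends here
    raise ValueError(f"Unknown action: {action}")
```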

## 🧪 Training Setup

- **Frozen Modules**: Vision encoder and cross-modal projector  
- **Fine-Tuned Module**: LLM decoder (Qwen2.5)  
- **Optimizer**: AdamW  
- **Batch Size**: `1` (with gradient accumulation over each episode)  
- **Learning Rate**: `1e-5`  
- **Weight Decay**: `0.1`  
- **Precision**: `bfloat16`  
- **LR Scheduler**:  Linear scheduler with warmup (first 10% of steps)  
- **Hardware**: Trained on a single NVIDIA A100 80GB GPU  

Training was done using supervised learning for next-action prediction. The model was conditioned at each step with a system prompt, egocentric RGB image observations (320×240), and cumulative episode history (images + actions). The model was trained offline (not in the MP3D simulator) using teacher-forcing on a preprocessed R2R dataset.
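
The training script itself is not part of this card; the sketch below is a hedged reconstruction of the setup described above (names such as `model`, `train_loader`, `num_epochs`, and `is_last_step_of_episode` are hypothetical placeholders).

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Only the parameters left trainable (the LLM decoder) go to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-5,
    weight_decay=0.1,
)

num_training_steps = num_epochs * len(train_loader)  # hypothetical dataloader of teacher-forced steps
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # linear warmup over the first 10% of steps
    num_training_steps=num_training_steps,
)

model.train()
for batch in train_loader:
    batch = {k: v.to(model.device) for k, v in batch.items()}
    # `labels` mask everything except the ground-truth action tokens,
    # so the loss is supervised next-action prediction (teacher forcing).
    loss = model(**batch).loss
    loss.backward()
    # Batch size 1: gradients are accumulated over an episode before each update.
    if is_last_step_of_episode(batch):  # hypothetical helper
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```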


## 📦 Usage 
```python
import os
import torch
from torch.utils.data import Dataset, DataLoader
from datasets import Dataset as DT
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image

class CustomDataset(Dataset):
    def __init__(self, data):
        self.text = data["text"]
        self.images = data["images"]
        
    def __len__(self):
        return len(self.text)
    
    def __getitem__(self, index):
        return self.text[index], self.images[index]

class CollateFunctor:
    # Single-sample "batches", so no padding to a max length is needed
    def __init__(self, processor, width, height):
        self.processor = processor
        self.width = width
        self.height = height

    def __call__(self, batch):
        text, images = batch[0]
        label_start = self.processor.tokenizer("<|im_start|>assistant\nAction: ", return_tensors="pt").input_ids

        images = [Image.open(img).resize((self.width, self.height), Image.Resampling.LANCZOS) for img in images]

        processed = self.processor(text=text, images=[images], return_tensors="pt")

        prompt_input_ids = processed["input_ids"]
        input_ids = torch.cat([prompt_input_ids, label_start], dim=1)

        attention_mask = torch.ones(1, input_ids.shape[1])
        processed["input_ids"] = input_ids
        processed["attention_mask"] = attention_mask
        
        return processed

def format_prompt(images_path, step_id, route_instruction, distance_traveled, previous_actions, move_possible, processor, system_prompt):
    images = os.listdir(images_path)
    images = [os.path.join(images_path, img) for img in images]
    images = sorted(images, key=lambda x: int(x.split("_")[-1].split(".")[0]))

    current_image = images.pop(-1)
    
    # NOTE: the prompt wording below (including its spelling) is kept verbatim,
    # since it matches the prompts the model was fine-tuned with.
    content = [
            {
                "type" : "text",
                "text" : f"Route Instruction: {route_instruction}\nCurrent Step: {step_id}\nCummulative Distance Traveled: {distance_traveled}\nImages from Previous Steps: "
            },
        ]

    for img in images:
        content.append({"type" : "image", "image" : img}) 

    if len(images) == 0:
        content[0]["text"] += "[]"

    content.append(
            {
                "type" : "text", 
                "text" : f"\nActions performed at Previous Steps: {previous_actions.__str__()}\nCurrent image:"
            }
        )
    content.append(
            {
                "type" : "image", 
                "image" : current_image
            }
        )
    if move_possible:
        possible_actions = ["Left", "Right", "Move", "Stop"]

    else:
        possible_actions = ["Left", "Right", "Stop"]
        
    content.append(
            {
                "type" : "text", 
                "text" : f"\nPossible actions: {possible_actions.__str__()}\nNow predict the next action based on the input you have recived. Answer on the format: Action: (an the action you choose)"
            }
        )

    messages = [
            {"role" : "system", "content" : [{"type" : "text", "text" : system_prompt}]},
            {"role" : "user", "content" : content},
        ]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    images.append(current_image)
    
    formatted_sample = {}
    formatted_sample["text"] = text
    formatted_sample["images"] = images

    formatted_data = [formatted_sample] 
    formatted_data = DT.from_list(formatted_data)
    return formatted_data

# Load model and processor
processor = AutoProcessor.from_pretrained("Vebbern/Qwen2.5-VL-3B-R2R-low-level")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Vebbern/Qwen2.5-VL-3B-R2R-low-level",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda"
)

# Remember to set the image resolution used during training (a higher resolution may still work, since the vision encoder was not fine-tuned)
collate_fn = CollateFunctor(processor, 320, 240)

# Load the required system prompt (system_prompt.txt is included in the model repo)
with open("system_prompt.txt", "r") as f:
    system_prompt = f.read()

path_id = 1021 # id for the R2R path
route_instruction = "Turn around and keep walking on the hallway across the first doorway and wait at the top of some stairs. "
images_path = f"./images/{path_id}" # paths to images for the whole episode, images are on the format: step_0.png, step_1.png....
step_id = 2
distance = 8.223
previous_actions = ["Left", "Move"]
move_possible = True # set to False if there are no navigable nodes within the current field of view

# This code will load all images in the path from step 0 up to the current step.
prompt = format_prompt(images_path, step_id, route_instruction, distance, previous_actions, move_possible, processor, system_prompt)

dataset = CustomDataset(prompt)
data_loader = DataLoader(
    dataset,
    batch_size=1,
    collate_fn=collate_fn
)

# Run inference
for batch in data_loader:
    batch = batch.to("cuda")

    outputs = model(**batch)
    argmax = torch.argmax(outputs.logits, dim=2)[0]
    model_prediction = processor.decode(argmax[-1]) # the logits at the last position predict the token that follows "Action: ", i.e. the chosen action
    print(f"Predicted action: {model_prediction}")

```

> ⚠️ The code above is deliberately minimal; its purpose is to show how the system prompt and inputs should be structured for inference. The system prompt is included in the repo.
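
As an alternative to reading a single argmax token from the logits, standard Hugging Face generation should work with the same batches. This is an untested sketch, not part of the original example:

```python
# Decode the action with model.generate instead of a single forward pass.
for batch in data_loader:
    batch = batch.to("cuda")
    generated = model.generate(**batch, max_new_tokens=3, do_sample=False)
    # Keep only the newly generated tokens (the prompt already ends with "Action: ").
    new_tokens = generated[:, batch["input_ids"].shape[1]:]
    print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```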


## 📊 Evaluation Results

The model was evaluated on the standard Room-to-Room (R2R) validation and test splits using the Matterport3D simulator. Performance is reported with the standard VLN metrics.

| Metric                  | Val Seen | Val Unseen | Test  |
|-------------------------|----------|------------|-------|
| Path Length (m) (↓)     | 10.27    | 10.50      | 10.59 |
| Navigation Error (m) (↓)| 7.14     | 7.84       | 7.99  |
| Oracle Success Rate (↑) | 41%      | 34%        | 34%   |
| Success Rate (↑)        | 35%      | 27%        | 26%   |
| SPL (↑)                 | 32%      | 24%        | 24%   |

### 🧾 Metric Definitions
- **Path Length**: Average length of the agent's trajectory, in meters.
- **Navigation Error**: Mean distance (in meters) between the goal and the position where the agent stops.
- **Oracle Success Rate**: Success rate if the agent had stopped at the point on its path closest to the goal.
- **Success Rate**: Percentage of episodes in which the agent stops within 3 meters of the goal.
- **SPL (Success weighted by Path Length)**: Success rate weighted by path efficiency, penalizing long or inefficient paths (a computation sketch follows after this list).
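
For reference, a minimal sketch of the standard SPL computation (following the usual VLN definition; this is not code from this repository):

```python
def spl(successes, shortest_path_lengths, taken_path_lengths):
    """Success weighted by Path Length, averaged over all episodes.

    successes: 1.0/0.0 per episode (agent stopped within 3 meters of the goal)
    shortest_path_lengths: geodesic start-to-goal distance per episode (meters)
    taken_path_lengths: length of the path the agent actually walked (meters)
    """
    terms = [
        s * l / max(p, l)
        for s, l, p in zip(successes, shortest_path_lengths, taken_path_lengths)
    ]
    return sum(terms) / len(terms)
```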

### 📝 Remarks

While this model performs competitively with other low-level action space approaches on the R2R task, it still falls significantly short of state-of-the-art methods that use a panoramic action space.

Nonetheless, it provides a useful and interpretable Large Vision-Language Model baseline for VLN using a low-level action space.

## 🔁 Related Models
A panoramic action space equivalent of this model is also available.
- **Panoramic Action Space Version**: [Qwen2.5-VL-3B-R2R-panoramic](https://huggingface.co/Vebbern/Qwen2.5-VL-3B-R2R-panoramic)

## 🪪 License

This model is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).