---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- Room-to-Room
- R2R
- VLN
- Vision-and-Language-Navigation
---

# Qwen2.5-VL-3B-R2R-low-level

**Qwen2.5-VL-3B-R2R-low-level** is a Vision-and-Language Navigation (VLN) model fine-tuned from [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) on the [Room-to-Room (R2R)](https://bringmeaspoon.org/) dataset using the Matterport3D (MP3D) simulator. The model uses a low-level action space and perceives the environment through egocentric RGB images at a resolution of 320×240.

Only the LLM component is fine-tuned; the vision encoder and cross-modal projector are kept frozen.
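
A minimal sketch of this freezing scheme, assuming the module layout of the Hugging Face Qwen2.5-VL implementation (where `model.visual` holds both the vision encoder and the merger/projector):

```python
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct"
)

# Freeze the vision encoder and cross-modal projector (both live under
# `model.visual` in the Hugging Face implementation); only the LLM decoder
# remains trainable.
for param in model.visual.parameters():
    param.requires_grad = False
```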

## 🧠 Model Summary

- **Base Model**: Qwen2.5-VL-3B-Instruct
- **Dataset**: Room-to-Room (R2R) via the Matterport3D simulator
- **Image Resolution**: 320×240
- **Action Space** (see the sketch after this list):
  - `Move`: Move to the adjacent node closest to the center of the field of view.
  - `Left`: Turn 30° to the left.
  - `Right`: Turn 30° to the right.
  - `Stop`: Select when the agent believes it has reached the goal.

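A hypothetical sketch of these action semantics. The actual transitions are executed by the MP3D simulator, not by local state updates; only the 30° turn step is taken from the list above:

```python
# Hypothetical helper illustrating the turn actions; "Move" and "Stop" are
# resolved by the Matterport3D simulator, not by this code.
HEADING_STEP_DEG = 30

def apply_turn(heading_deg: float, action: str) -> float:
    """Return the agent's heading (degrees, mod 360) after a turn action."""
    if action == "Left":
        return (heading_deg - HEADING_STEP_DEG) % 360
    if action == "Right":
        return (heading_deg + HEADING_STEP_DEG) % 360
    return heading_deg  # "Move" and "Stop" leave the heading unchanged
```
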
## 🧪 Training Setup

- **Frozen Modules**: Vision encoder and cross-modal projector
- **Fine-Tuned Module**: LLM decoder (Qwen2.5)
- **Optimizer**: AdamW (see the configuration sketch after this list)
- **Batch Size**: `1` (with gradient accumulation over each episode)
- **Learning Rate**: `1e-5`
- **Weight Decay**: `0.1`
- **Precision**: `bfloat16`
- **LR Scheduler**: Linear scheduler with warmup (first 10% of steps)
- **Hardware**: Trained on a single NVIDIA A100 80GB GPU

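A hedged sketch of this optimizer and scheduler configuration; the total step count is a placeholder, and only the unfrozen LLM parameters are passed to the optimizer:

```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Assumes `model` with the vision modules frozen as shown earlier.
trainable = [p for p in model.parameters() if p.requires_grad]  # LLM decoder only
optimizer = AdamW(trainable, lr=1e-5, weight_decay=0.1)

num_training_steps = 10_000  # placeholder; depends on dataset size and epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # warmup over first 10%
    num_training_steps=num_training_steps,
)
```
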
Training used supervised learning for next-action prediction. At each step the model was conditioned on a system prompt, egocentric RGB image observations (320×240), and the cumulative episode history (images + actions). Training was performed offline (not in the MP3D simulator) with teacher forcing on a preprocessed R2R dataset.

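The objective can be pictured as a cross-entropy loss on the action token alone. A minimal sketch, assuming the ground-truth action occupies a single token after the `Action: ` prefix (tokenization details are assumptions):

```python
import torch
import torch.nn.functional as F

def next_action_loss(logits: torch.Tensor, target_token_id: int) -> torch.Tensor:
    # logits: (1, seq_len, vocab). The last position predicts the token that
    # follows the "Action: " prefix, i.e. the action word itself.
    return F.cross_entropy(
        logits[0, -1:].float(),           # shape (1, vocab)
        torch.tensor([target_token_id]),  # ground-truth action token id
    )
```
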
## 📦 Usage

```python
import os

import torch
from torch.utils.data import Dataset, DataLoader
from datasets import Dataset as DT
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

class CustomDataset(Dataset):
    def __init__(self, data):
        self.text = data["text"]
        self.images = data["images"]

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        return self.text[index], self.images[index]

class CollateFunctor:
    # Batch size is always 1, so there is no padding to a shared max length.
    def __init__(self, processor, width, height):
        self.processor = processor
        self.width = width
        self.height = height

    def __call__(self, batch):
        text, images = batch[0]
        # Tokenize the assistant prefix so the model's next prediction is the
        # action token itself.
        label_start = self.processor.tokenizer(
            "<|im_start|>assistant\nAction: ", return_tensors="pt"
        ).input_ids

        images = [
            Image.open(img).resize((self.width, self.height), Image.Resampling.LANCZOS)
            for img in images
        ]

        processed = self.processor(text=text, images=[images], return_tensors="pt")

        prompt_input_ids = processed["input_ids"]
        input_ids = torch.cat([prompt_input_ids, label_start], dim=1)

        attention_mask = torch.ones(1, input_ids.shape[1])
        processed["input_ids"] = input_ids
        processed["attention_mask"] = attention_mask

        return processed

def format_prompt(images_path, step_id, route_instruction, distance_traveled,
                  previous_actions, move_possible, processor, system_prompt):
    images = os.listdir(images_path)
    images = [os.path.join(images_path, img) for img in images]
    # Sort by the step index encoded in the filename (step_0.png, step_1.png, ...).
    images = sorted(images, key=lambda x: int(x.split("_")[-1].split(".")[0]))

    current_image = images.pop(-1)

    # The prompt text below (including its spelling) is kept verbatim to match
    # the formatting used during fine-tuning.
    content = [
        {
            "type": "text",
            "text": f"Route Instruction: {route_instruction}\nCurrent Step: {step_id}\nCummulative Distance Traveled: {distance_traveled}\nImages from Previous Steps: ",
        },
    ]

    for img in images:
        content.append({"type": "image", "image": img})

    if len(images) == 0:
        content[0]["text"] += "[]"

    content.append(
        {
            "type": "text",
            "text": f"\nActions performed at Previous Steps: {str(previous_actions)}\nCurrent image:",
        }
    )
    content.append(
        {
            "type": "image",
            "image": current_image,
        }
    )

    # "Move" is only offered when an adjacent node lies within the field of view.
    if move_possible:
        possible_actions = ["Left", "Right", "Move", "Stop"]
    else:
        possible_actions = ["Left", "Right", "Stop"]

    content.append(
        {
            "type": "text",
            "text": f"\nPossible actions: {str(possible_actions)}\nNow predict the next action based on the input you have recived. Answer on the format: Action: (an the action you choose)",
        }
    )

    messages = [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user", "content": content},
    ]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    images.append(current_image)

    formatted_sample = {"text": text, "images": images}
    return DT.from_list([formatted_sample])

# Load model and processor
processor = AutoProcessor.from_pretrained("Vebbern/Qwen2.5-VL-3B-R2R-low-level")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Vebbern/Qwen2.5-VL-3B-R2R-low-level",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)

# Remember to set the correct image resolution (a higher resolution may still
# work, since the vision encoder was not trained).
collate_fn = CollateFunctor(processor, 320, 240)

# Load the mandatory system prompt (included in this repo)
with open("system_prompt.txt", "r") as f:
    system_prompt = f.read()

path_id = 1021  # ID of the R2R path
route_instruction = "Turn around and keep walking on the hallway across the first doorway and wait at the top of some stairs. "
images_path = f"./images/{path_id}"  # episode images, named step_0.png, step_1.png, ...
step_id = 2
distance = 8.223
previous_actions = ["Left", "Move"]
move_possible = True  # set to False if no adjacent node is within the field of view

# Loads all images for the path from step 0 up to the current step.
prompt = format_prompt(images_path, step_id, route_instruction, distance,
                       previous_actions, move_possible, processor, system_prompt)

dataset = CustomDataset(prompt)
data_loader = DataLoader(
    dataset,
    batch_size=1,
    collate_fn=collate_fn,
)

# Run inference
for batch in data_loader:
    batch = batch.to("cuda")

    with torch.no_grad():
        outputs = model(**batch)
    argmax = torch.argmax(outputs.logits, dim=2)[0]
    # The logit at the last position predicts the token that follows the
    # "Action: " prefix, i.e. the chosen action.
    model_prediction = processor.decode(argmax[-1])
    print(f"Predicted action: {model_prediction}")
```

> ⚠️ Sorry for the rough code; the goal here is simply to show how the system prompt and inputs should be structured for inference. The system prompt is included in the repo.

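As an alternative to taking the argmax of the final logit, autoregressive decoding should work as well. A sketch under the same setup as above (`max_new_tokens=5` is an assumption that comfortably covers one action word):

```python
for batch in data_loader:
    batch = batch.to("cuda")
    generated = model.generate(**batch, max_new_tokens=5)
    # Decode only the newly generated tokens after the prompt.
    new_tokens = generated[0, batch["input_ids"].shape[1]:]
    print(processor.decode(new_tokens, skip_special_tokens=True))
```
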
## 📊 Evaluation Results

The model was evaluated on the standard Room-to-Room (R2R) validation and test splits using the Matterport3D simulator. Performance is measured with the standard VLN metrics.

| Metric                   | Val Seen | Val Unseen | Test  |
|--------------------------|----------|------------|-------|
| Path Length (m, ↓)       | 10.27    | 10.50      | 10.59 |
| Navigation Error (m, ↓)  | 7.14     | 7.84       | 7.99  |
| Oracle Success Rate (↑)  | 41%      | 34%        | 34%   |
| Success Rate (↑)         | 35%      | 27%        | 26%   |
| SPL (↑)                  | 32%      | 24%        | 24%   |

### 🧾 Metric Definitions

- **Path Length**: Mean length of the agent's trajectory, in meters.
- **Navigation Error**: Mean distance from the goal when the agent stops, in meters.
- **Success Rate**: Percentage of episodes in which the agent stops within 3 meters of the goal.
- **SPL (Success weighted by Path Length)**: Success rate weighted by path efficiency; penalizes long or inefficient paths (see the sketch after this list).
- **Oracle Success Rate**: Success rate if the agent had stopped at its closest point to the goal.

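For reference, a sketch of the standard SPL computation, with per-episode success `S_i`, shortest-path distance `l_i`, and executed path length `p_i`:

```python
def spl(successes, shortest_dists, path_lengths):
    """Standard SPL: mean of S_i * l_i / max(p_i, l_i) over episodes."""
    terms = [
        s * (l / max(p, l))
        for s, l, p in zip(successes, shortest_dists, path_lengths)
    ]
    return sum(terms) / len(terms)
```
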
### 📝 Remarks

While this model performs competitively with other low-level action space approaches on the R2R task, it still falls significantly short of state-of-the-art methods that use a panoramic action space.

Nonetheless, it provides a useful and interpretable large vision-language model baseline for VLN with a low-level action space.

## 🔁 Related Models

A panoramic action space equivalent of this model is also available:

- **Panoramic Action Space Version**: [Qwen2.5-VL-3B-R2R-panoramic](https://huggingface.co/Vebbern/Qwen2.5-VL-3B-R2R-panoramic)

## 🪪 License

This model is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).