---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- Room-to-Room
- R2R
- VLN
- Vision-and-Language-Navigation
---

# Qwen2.5-VL-3B-R2R-low-level

**Qwen2.5-VL-3B-R2R-low-level** is a Vision-and-Language Navigation (VLN) model fine-tuned from [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) on the [Room-to-Room (R2R)](https://bringmeaspoon.org/) dataset using the Matterport3D (MP3D) simulator. The model is trained using a low-level action space, where it perceives the environment through egocentric RGB images at a resolution of 320x240.  

Only the LLM component is fine-tuned — the vision encoder and cross-modal projector are kept frozen.
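
A minimal sketch of what this freezing looks like in practice (not the original training script; the attribute path to the vision tower is an assumption and can differ across `transformers` versions):

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16
)

# Freeze the vision tower (which also holds the cross-modal merger/projector),
# leaving only the language-model weights trainable. The attribute path is an
# assumption: it may be `model.visual` or `model.model.visual` depending on
# the transformers version.
vision_tower = getattr(model, "visual", None) or model.model.visual
for param in vision_tower.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```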


## 🧠 Model Summary

- **Base Model**: Qwen2.5-VL-3B-Instruct
- **Dataset**: Room-to-Room (R2R) via the Matterport3D simulator.
- **Image Resolution**: 320x240.
- **Action Space** (a minimal mapping sketch follows after this list):
  - `Move`: Move to the adjacent node closest to the center of the field of view.
  - `Left`: Turn 30° to the left.
  - `Right`: Turn 30° to the right.
  - `Stop`: Select when the agent believes it has reached the goal.
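
The sketch below is hypothetical (not part of the released code) and shows one way the discrete actions could be translated into low-level arguments for the Matterport3D simulator; the exact simulator interface used during training is an assumption here.

```python
import math

# Hypothetical mapping from a predicted action string to the
# (location_index, heading_delta, elevation_delta) arguments that a
# makeAction-style MP3D simulator interface expects. Details of the
# real interface used in training may differ.
def action_to_sim_args(action: str, forward_location_index: int = 1):
    turn = math.radians(30)  # the agent turns in fixed 30-degree increments
    if action == "Move":
        # go to the navigable node closest to the center of the field of view
        return forward_location_index, 0.0, 0.0
    if action == "Left":
        return 0, -turn, 0.0
    if action == "Right":
        return 0, turn, 0.0
    if action == "Stop":
        return 0, 0.0, 0.0  # stay in place; the episode ends here
    raise ValueError(f"Unknown action: {action}")
```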

## 🧪 Training Setup

- **Frozen Modules**: Vision encoder and cross-modal projector  
- **Fine-Tuned Module**: LLM decoder (Qwen2.5)  
- **Optimizer**: AdamW  
- **Batch Size**: `1` (with gradient accumulation over each episode)  
- **Learning Rate**: `1e-5`  
- **Weight Decay**: `0.1`  
- **Precision**: `bfloat16`  
- **LR Scheduler**:  Linear scheduler with warmup (first 10% of steps)  
- **Hardware**: Trained on a single NVIDIA A100 80GB GPU  

Training was done using supervised learning for next-action prediction. The model was conditioned at each step with a system prompt, egocentric RGB image observations (320×240), and cumulative episode history (images + actions). The model was trained offline (not in the MP3D simulator) using teacher-forcing on a preprocessed R2R dataset.
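
The training script itself is not part of this card; the sketch below is a hedged reconstruction of the setup described above (names such as `model`, `train_loader`, `num_epochs`, and `is_last_step_of_episode` are hypothetical placeholders).

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Only the parameters left trainable (the LLM decoder) go to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-5,
    weight_decay=0.1,
)

num_training_steps = num_epochs * len(train_loader)  # hypothetical dataloader of teacher-forced steps
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # linear warmup over the first 10% of steps
    num_training_steps=num_training_steps,
)

model.train()
for batch in train_loader:
    batch = {k: v.to(model.device) for k, v in batch.items()}
    # `labels` mask everything except the ground-truth action tokens,
    # so the loss is supervised next-action prediction (teacher forcing).
    loss = model(**batch).loss
    loss.backward()
    # Batch size 1: gradients are accumulated over an episode before each update.
    if is_last_step_of_episode(batch):  # hypothetical helper
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```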


## 📦 Usage 
```python
import os
import torch
from torch.utils.data import Dataset, DataLoader
from datasets import Dataset as DT
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image

class CustomDataset(Dataset):
    def __init__(self, data):
        self.text = data["text"]
        self.images = data["images"]
        
    def __len__(self):
        return len(self.text)
    
    def __getitem__(self, index):
        return self.text[index], self.images[index]

class CollateFunctor:
    # Single-sample "batches", so no padding to a max length is needed
    def __init__(self, processor, width, height):
        self.processor = processor
        self.width = width
        self.height = height

    def __call__(self, batch):
        text, images = batch[0]
        label_start = self.processor.tokenizer("<|im_start|>assistant\nAction: ", return_tensors="pt").input_ids

        images = [Image.open(img).resize((self.width, self.height), Image.Resampling.LANCZOS) for img in images]

        processed = self.processor(text=text, images=[images], return_tensors="pt")

        prompt_input_ids = processed["input_ids"]
        input_ids = torch.cat([prompt_input_ids, label_start], dim=1)

        attention_mask = torch.ones(1, input_ids.shape[1])
        processed["input_ids"] = input_ids
        processed["attention_mask"] = attention_mask
        
        return processed

def format_prompt(images_path, step_id, route_instruction, distance_traveled, previous_actions, move_possible, processor, system_prompt):
    images = os.listdir(images_path)
    images = [os.path.join(images_path, img) for img in images]
    images = sorted(images, key=lambda x: int(x.split("_")[-1].split(".")[0]))

    current_image = images.pop(-1)
    
    # NOTE: the prompt wording below (including its spelling) is kept verbatim,
    # since it matches the prompts the model was fine-tuned with.
    content = [
            {
                "type" : "text",
                "text" : f"Route Instruction: {route_instruction}\nCurrent Step: {step_id}\nCummulative Distance Traveled: {distance_traveled}\nImages from Previous Steps: "
            },
        ]

    for img in images:
        content.append({"type" : "image", "image" : img}) 

    if len(images) == 0:
        content[0]["text"] += "[]"

    content.append(
            {
                "type" : "text", 
                "text" : f"\nActions performed at Previous Steps: {previous_actions.__str__()}\nCurrent image:"
            }
        )
    content.append(
            {
                "type" : "image", 
                "image" : current_image
            }
        )
    if move_possible:
        possible_actions = ["Left", "Right", "Move", "Stop"]

    else:
        possible_actions = ["Left", "Right", "Stop"]
        
    content.append(
            {
                "type" : "text", 
                "text" : f"\nPossible actions: {possible_actions.__str__()}\nNow predict the next action based on the input you have recived. Answer on the format: Action: (an the action you choose)"
            }
        )

    messages = [
            {"role" : "system", "content" : [{"type" : "text", "text" : system_prompt}]},
            {"role" : "user", "content" : content},
        ]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    images.append(current_image)
    
    formatted_sample = {}
    formatted_sample["text"] = text
    formatted_sample["images"] = images

    formatted_data = [formatted_sample] 
    formatted_data = DT.from_list(formatted_data)
    return formatted_data

# Load model and processor
processor = AutoProcessor.from_pretrained("Vebbern/Qwen2.5-VL-3B-R2R-low-level")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Vebbern/Qwen2.5-VL-3B-R2R-low-level",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda"
)

# Remember to set the image resolution used during training (a higher resolution may still work, since the vision encoder was not fine-tuned)
collate_fn = CollateFunctor(processor, 320, 240)

# Load the required system prompt (system_prompt.txt is included in the model repo)
with open("system_prompt.txt", "r") as f:
    system_prompt = f.read()

path_id = 1021 # id for the R2R path
route_instruction = "Turn around and keep walking on the hallway across the first doorway and wait at the top of some stairs. "
images_path = f"./images/{path_id}" # paths to images for the whole episode, images are on the format: step_0.png, step_1.png....
step_id = 2
distance = 8.223
previous_actions = ["Left", "Move"]
move_possible = True # set to False if there are no navigable nodes within the current field of view

# This code will load all images in the path from step 0 up to the current step.
prompt = format_prompt(images_path, step_id, route_instruction, distance, previous_actions, move_possible, processor, system_prompt)

dataset = CustomDataset(prompt)
data_loader = DataLoader(
    dataset,
    batch_size=1,
    collate_fn=collate_fn
)

# Run inference
for batch in data_loader:
    batch = batch.to("cuda")

    outputs = model(**batch)
    argmax = torch.argmax(outputs.logits, dim=2)[0]
    model_prediction = processor.decode(argmax[-1]) # the logits at the last position predict the token that follows "Action: ", i.e. the chosen action
    print(f"Predicted action: {model_prediction}")

```

> ⚠️ The code above is deliberately minimal; its purpose is to show how the system prompt and inputs should be structured for inference. The system prompt is included in the repo.
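
As an alternative to reading a single argmax token from the logits, standard Hugging Face generation should work with the same batches. This is an untested sketch, not part of the original example:

```python
# Decode the action with model.generate instead of a single forward pass.
for batch in data_loader:
    batch = batch.to("cuda")
    generated = model.generate(**batch, max_new_tokens=3, do_sample=False)
    # Keep only the newly generated tokens (the prompt already ends with "Action: ").
    new_tokens = generated[:, batch["input_ids"].shape[1]:]
    print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```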


## 📊 Evaluation Results

The model was evaluated on the standard Room-to-Room (R2R) validation and test splits using the Matterport3D simulator. Performance is reported with the standard VLN metrics.

| Metric                  | Val Seen | Val Unseen | Test  |
|-------------------------|----------|------------|-------|
| Path Length (m) (↓)     | 10.27    | 10.50      | 10.59 |
| Navigation Error (m) (↓)| 7.14     | 7.84       | 7.99  |
| Oracle Success Rate (↑) | 41%      | 34%        | 34%   |
| Success Rate (↑)        | 35%      | 27%        | 26%   |
| SPL (↑)                 | 32%      | 24%        | 24%   |

### 🧾 Metric Definitions
- **Path Length**: Average length of the agent's trajectory, in meters.
- **Navigation Error**: Mean distance (in meters) between the goal and the position where the agent stops.
- **Oracle Success Rate**: Success rate if the agent had stopped at the point on its path closest to the goal.
- **Success Rate**: Percentage of episodes in which the agent stops within 3 meters of the goal.
- **SPL (Success weighted by Path Length)**: Success rate weighted by path efficiency, penalizing long or inefficient paths (a computation sketch follows after this list).
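
For reference, a minimal sketch of the standard SPL computation (following the usual VLN definition; this is not code from this repository):

```python
def spl(successes, shortest_path_lengths, taken_path_lengths):
    """Success weighted by Path Length, averaged over all episodes.

    successes: 1.0/0.0 per episode (agent stopped within 3 meters of the goal)
    shortest_path_lengths: geodesic start-to-goal distance per episode (meters)
    taken_path_lengths: length of the path the agent actually walked (meters)
    """
    terms = [
        s * l / max(p, l)
        for s, l, p in zip(successes, shortest_path_lengths, taken_path_lengths)
    ]
    return sum(terms) / len(terms)
```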

### 📝 Remarks

While this model performs competitively with other low-level action space approaches on the R2R task, it still falls significantly short of state-of-the-art methods that use a panoramic action space.

Nonetheless, it provides a useful and interpretable Large Vision-Language Model baseline for VLN using a low-level action space.

## 🔁 Related Models
A panoramic action space equivalent of this model is also available.
- **Panoramic Action Space Version**: [Qwen2.5-VL-3B-R2R-panoramic](https://huggingface.co/Vebbern/Qwen2.5-VL-3B-R2R-panoramic)

## 🪪 License

This model is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).