---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- Room-to-Room
- R2R
- VLN
- Vision-and-Language-Navigation
---

# Qwen2.5-VL-3B-R2R-low-level

**Qwen2.5-VL-3B-R2R-low-level** is a Vision-and-Language Navigation (VLN) model fine-tuned from [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) on the [Room-to-Room (R2R)](https://bringmeaspoon.org/) dataset using the Matterport3D (MP3D) simulator.

The model is trained with a low-level action space and perceives the environment through egocentric RGB images at a resolution of 320×240. Only the LLM component is fine-tuned; the vision encoder and cross-modal projector are kept frozen.

## 🧠 Model Summary

- **Base Model**: Qwen2.5-VL-3B-Instruct
- **Dataset**: Room-to-Room (R2R) via the Matterport3D simulator
- **Image Resolution**: 320×240
- **Action Space**:
  - `Move`: Move to the adjacent node closest to the center of the field of view.
  - `Left`: Turn 30° to the left.
  - `Right`: Turn 30° to the right.
  - `Stop`: Select when the agent believes it has reached the goal.

## 🧪 Training Setup

- **Frozen Modules**: Vision encoder and cross-modal projector
- **Fine-Tuned Module**: LLM decoder (Qwen2.5)
- **Optimizer**: AdamW
- **Batch Size**: `1` (with gradient accumulation over each episode)
- **Learning Rate**: `1e-5`
- **Weight Decay**: `0.1`
- **Precision**: `bfloat16`
- **LR Scheduler**: Linear scheduler with warmup (first 10% of steps)
- **Hardware**: Trained on a single NVIDIA A100 80GB GPU

Training used supervised learning for next-action prediction. At each step the model was conditioned on a system prompt, the egocentric RGB image observation (320×240), and the cumulative episode history (images + actions). Training was performed offline (outside the MP3D simulator) using teacher forcing on a preprocessed R2R dataset.
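For illustration, below is a minimal sketch of this teacher-forced objective. It is **not** the original training script: `build_step_prompt`, `gt_action`, and the episode structure are hypothetical placeholders, and the prompt is assumed to be built as in `format_prompt` from the Usage section, with the loss computed only on the assistant/action tokens.

```python
import torch

# Hypothetical sketch of the teacher-forced next-action objective described above.
# `build_step_prompt` is a placeholder that returns the chat-formatted prompt text
# (system prompt + instruction + image history) and the PIL images for one step.
def train_on_episode(model, processor, optimizer, episode, device="cuda"):
    optimizer.zero_grad()
    for step in episode:  # batch size 1; gradients accumulated over the whole episode
        prompt_text, images = build_step_prompt(step)
        target_text = f"<|im_start|>assistant\nAction: {step['gt_action']}<|im_end|>"
        inputs = processor(text=prompt_text, images=[images], return_tensors="pt")
        target_ids = processor.tokenizer(target_text, return_tensors="pt").input_ids
        input_ids = torch.cat([inputs["input_ids"], target_ids], dim=1)
        labels = input_ids.clone()
        labels[:, : inputs["input_ids"].shape[1]] = -100  # loss only on the action tokens
        outputs = model(
            input_ids=input_ids.to(device),
            attention_mask=torch.ones_like(input_ids).to(device),
            pixel_values=inputs["pixel_values"].to(device),
            image_grid_thw=inputs["image_grid_thw"].to(device),
            labels=labels.to(device),
        )
        (outputs.loss / len(episode)).backward()  # teacher forcing: ground-truth history in the prompt
    optimizer.step()
```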
## 📦 Usage

```python
import os
import torch
from torch.utils.data import Dataset, DataLoader
from datasets import Dataset as DT
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image


class CustomDataset(Dataset):
    def __init__(self, data):
        self.text = data["text"]
        self.images = data["images"]

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        return self.text[index], self.images[index]


class CollateFunctor:
    # Batch size is always 1, so no padding to a maximum length is needed.
    def __init__(self, processor, width, height):
        self.processor = processor
        self.width = width
        self.height = height

    def __call__(self, batch):
        text, images = batch[0]
        # Append the start of the assistant turn so that the next predicted token is the action.
        label_start = self.processor.tokenizer("<|im_start|>assistant\nAction: ", return_tensors="pt").input_ids
        images = [Image.open(img).resize((self.width, self.height), Image.Resampling.LANCZOS) for img in images]
        processed = self.processor(text=text, images=[images], return_tensors="pt")
        prompt_input_ids = processed["input_ids"]
        input_ids = torch.cat([prompt_input_ids, label_start], dim=1)
        attention_mask = torch.ones_like(input_ids)
        processed["input_ids"] = input_ids
        processed["attention_mask"] = attention_mask
        return processed


def format_prompt(images_path, step_id, route_instruction, distance_traveled, previous_actions, move_possible, processor, system_prompt):
    # Collect all step images for the episode, sorted by step index (step_0.png, step_1.png, ...).
    images = os.listdir(images_path)
    images = [os.path.join(images_path, img) for img in images]
    images = sorted(images, key=lambda x: int(x.split("_")[-1].split(".")[0]))
    current_image = images.pop(-1)

    # NOTE: the prompt wording below is kept verbatim (including spelling) so that it
    # matches the input format the model expects.
    content = [
        {
            "type": "text",
            "text": f"Route Instruction: {route_instruction}\nCurrent Step: {step_id}\nCummulative Distance Traveled: {distance_traveled}\nImages from Previous Steps: "
        },
    ]
    for img in images:
        content.append({"type": "image", "image": img})
    if len(images) == 0:
        content[0]["text"] += "[]"  # no previous images yet
    content.append(
        {
            "type": "text",
            "text": f"\nActions performed at Previous Steps: {str(previous_actions)}\nCurrent image:"
        }
    )
    content.append(
        {
            "type": "image",
            "image": current_image
        }
    )

    if move_possible:
        possible_actions = ["Left", "Right", "Move", "Stop"]
    else:
        possible_actions = ["Left", "Right", "Stop"]
    content.append(
        {
            "type": "text",
            "text": f"\nPossible actions: {str(possible_actions)}\nNow predict the next action based on the input you have recived. \nAnswer on the format: Action: (an the action you choose)"
        }
    )

    messages = [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user", "content": content},
    ]
    # The assistant header is appended later in the collate functor, so no generation prompt here.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)

    images.append(current_image)
    formatted_sample = {"text": text, "images": images}
    formatted_data = DT.from_list([formatted_sample])
    return formatted_data


# Load model and processor
processor = AutoProcessor.from_pretrained("Vebbern/Qwen2.5-VL-3B-R2R-low-level")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Vebbern/Qwen2.5-VL-3B-R2R-low-level",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn; drop this argument if it is unavailable
    device_map="cuda"
)

# Remember to set the correct image resolution (a higher resolution might still work,
# since the vision encoder was not fine-tuned).
collate_fn = CollateFunctor(processor, 320, 240)

# Load the mandatory system prompt (included in this repo).
with open("system_prompt.txt", "r") as f:
    system_prompt = f.read()

path_id = 1021  # id of the R2R path
route_instruction = "Turn around and keep walking on the hallway across the first doorway and wait at the top of some stairs. "
images_path = f"./images/{path_id}"  # images for the whole episode, named step_0.png, step_1.png, ...
step_id = 2
distance = 8.223
previous_actions = ["Left", "Move"]
move_possible = True  # set to False if there are no nodes within the field of view

# This loads all images along the path from step 0 up to the current step.
prompt = format_prompt(images_path, step_id, route_instruction, distance, previous_actions, move_possible, processor, system_prompt)
dataset = CustomDataset(prompt)
data_loader = DataLoader(
    dataset,
    batch_size=1,
    collate_fn=collate_fn
)

# Run inference
for batch in data_loader:
    batch = batch.to("cuda")
    with torch.no_grad():
        outputs = model(**batch)
    argmax = torch.argmax(outputs.logits, dim=2)[0]
    # The prediction at the last position is the token following "Action: ".
    model_prediction = processor.decode(argmax[-1])
    print(f"Predicted action: {model_prediction}")
```

> ⚠️ Sorry for the rough code; the goal here is to show how the system prompt and inputs should be structured for inference. The system prompt is included in the repo.

## 📊 Evaluation Results

The model was evaluated on the standard Room-to-Room (R2R) validation sets using the Matterport3D simulator. Performance is measured using the standard VLN (Vision-and-Language Navigation) metrics.

| Metric                   | Val Seen | Val Unseen | Test  |
|--------------------------|----------|------------|-------|
| Path Length (m) (↓)      | 10.27    | 10.50      | 10.59 |
| Navigation Error (m) (↓) | 7.14     | 7.84       | 7.99  |
| Oracle Success Rate (↑)  | 41%      | 34%        | 34%   |
| Success Rate (↑)         | 35%      | 27%        | 26%   |
| SPL (↑)                  | 32%      | 24%        | 24%   |

### 🧾 Metric Definitions

- **Navigation Error**: Mean distance (in meters) from the goal when the agent stops.
- **Success Rate**: Percentage of episodes in which the agent stops within 3 meters of the goal.
- **SPL (Success weighted by Path Length)**: Success rate weighted by path efficiency; penalizes long or inefficient paths.
- **Oracle Success Rate**: The success rate if the agent had stopped at its closest point to the goal along its path.
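For reference, here is a minimal sketch of how these metrics are typically computed from per-episode results, using the standard VLN definitions and a 3 m success threshold. This is an illustration rather than the evaluation script behind the table above, and the field names (`path_length`, `shortest_path`, `final_distance`, `oracle_distance`) are assumptions.

```python
import numpy as np

def vln_metrics(episodes, success_threshold=3.0):
    """Standard VLN metrics. Each episode dict is assumed to contain (in meters):
    path_length (agent's path), shortest_path (start-to-goal geodesic),
    final_distance (to the goal when stopping), oracle_distance (closest point to the goal)."""
    pl  = np.mean([e["path_length"] for e in episodes])
    ne  = np.mean([e["final_distance"] for e in episodes])
    osr = np.mean([e["oracle_distance"] <= success_threshold for e in episodes])
    sr  = np.mean([e["final_distance"] <= success_threshold for e in episodes])
    # SPL: success weighted by shortest_path / max(path_length, shortest_path)
    spl = np.mean([
        (e["final_distance"] <= success_threshold)
        * e["shortest_path"] / max(e["path_length"], e["shortest_path"])
        for e in episodes
    ])
    return {"PL": pl, "NE": ne, "OSR": osr, "SR": sr, "SPL": spl}
```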
### 📝 Remarks

While this model performs competitively compared to other low-level action space approaches on the R2R task, it still falls significantly short of state-of-the-art methods that use a panoramic action space. Nonetheless, it provides a useful and interpretable Large Vision-Language Model baseline for VLN with a low-level action space.

## 🔍 Related Models

A panoramic action space equivalent of this model is also available:

- **Panoramic Action Space Version**: [Qwen2.5-VL-3B-R2R-panoramic](https://huggingface.co/Vebbern/Qwen2.5-VL-3B-R2R-panoramic)

## 🪪 License

This model is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).