---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- Room-to-Room
- R2R
- VLN
- Vision-and-Language-Navigation
---

# Qwen2.5-VL-3B-R2R-low-level

**Qwen2.5-VL-3B-R2R-low-level** is a Vision-and-Language Navigation (VLN) model fine-tuned from [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) on the [Room-to-Room (R2R)](https://bringmeaspoon.org/) dataset using the Matterport3D (MP3D) simulator. The model uses a low-level action space and perceives the environment through egocentric RGB images at a resolution of 320×240.

Only the LLM component is fine-tuned; the vision encoder and cross-modal projector are kept frozen.
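
A minimal sketch of this freezing scheme, assuming the module layout of the Hugging Face Qwen2.5-VL implementation (where `model.visual` holds both the vision encoder and the merger/projector):

```python
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct"
)

# Freeze the vision encoder and cross-modal projector (both live under
# `model.visual` in the Hugging Face implementation); only the LLM decoder
# remains trainable.
for param in model.visual.parameters():
    param.requires_grad = False
```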

## 🧠 Model Summary

- **Base Model**: Qwen2.5-VL-3B-Instruct
- **Dataset**: Room-to-Room (R2R) via the Matterport3D simulator
- **Image Resolution**: 320×240
- **Action Space** (see the sketch after this list):
  - `Move`: Move to the adjacent node closest to the center of the field of view.
  - `Left`: Turn 30° to the left.
  - `Right`: Turn 30° to the right.
  - `Stop`: Select when the agent believes it has reached the goal.

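A hypothetical sketch of these action semantics. The actual transitions are executed by the MP3D simulator, not by local state updates; only the 30° turn step is taken from the list above:

```python
# Hypothetical helper illustrating the turn actions; "Move" and "Stop" are
# resolved by the Matterport3D simulator, not by this code.
HEADING_STEP_DEG = 30

def apply_turn(heading_deg: float, action: str) -> float:
    """Return the agent's heading (degrees, mod 360) after a turn action."""
    if action == "Left":
        return (heading_deg - HEADING_STEP_DEG) % 360
    if action == "Right":
        return (heading_deg + HEADING_STEP_DEG) % 360
    return heading_deg  # "Move" and "Stop" leave the heading unchanged
```
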
## 🧪 Training Setup

- **Frozen Modules**: Vision encoder and cross-modal projector
- **Fine-Tuned Module**: LLM decoder (Qwen2.5)
- **Optimizer**: AdamW (see the configuration sketch after this list)
- **Batch Size**: `1` (with gradient accumulation over each episode)
- **Learning Rate**: `1e-5`
- **Weight Decay**: `0.1`
- **Precision**: `bfloat16`
- **LR Scheduler**: Linear scheduler with warmup (first 10% of steps)
- **Hardware**: Trained on a single NVIDIA A100 80GB GPU

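A hedged sketch of this optimizer and scheduler configuration; the total step count is a placeholder, and only the unfrozen LLM parameters are passed to the optimizer:

```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Assumes `model` with the vision modules frozen as shown earlier.
trainable = [p for p in model.parameters() if p.requires_grad]  # LLM decoder only
optimizer = AdamW(trainable, lr=1e-5, weight_decay=0.1)

num_training_steps = 10_000  # placeholder; depends on dataset size and epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # warmup over first 10%
    num_training_steps=num_training_steps,
)
```
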
Training used supervised learning for next-action prediction. At each step the model was conditioned on a system prompt, egocentric RGB image observations (320×240), and the cumulative episode history (images + actions). Training was performed offline (not in the MP3D simulator) with teacher forcing on a preprocessed R2R dataset.

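The objective can be pictured as a cross-entropy loss on the action token alone. A minimal sketch, assuming the ground-truth action occupies a single token after the `Action: ` prefix (tokenization details are assumptions):

```python
import torch
import torch.nn.functional as F

def next_action_loss(logits: torch.Tensor, target_token_id: int) -> torch.Tensor:
    # logits: (1, seq_len, vocab). The last position predicts the token that
    # follows the "Action: " prefix, i.e. the action word itself.
    return F.cross_entropy(
        logits[0, -1:].float(),           # shape (1, vocab)
        torch.tensor([target_token_id]),  # ground-truth action token id
    )
```
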
## 📦 Usage

```python
import os

import torch
from torch.utils.data import Dataset, DataLoader
from datasets import Dataset as DT
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

class CustomDataset(Dataset):
    def __init__(self, data):
        self.text = data["text"]
        self.images = data["images"]

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        return self.text[index], self.images[index]

class CollateFunctor:
    # Batch size is always 1, so there is no padding to a shared max length.
    def __init__(self, processor, width, height):
        self.processor = processor
        self.width = width
        self.height = height

    def __call__(self, batch):
        text, images = batch[0]
        # Tokenize the assistant prefix so the model's next prediction is the
        # action token itself.
        label_start = self.processor.tokenizer(
            "<|im_start|>assistant\nAction: ", return_tensors="pt"
        ).input_ids

        images = [
            Image.open(img).resize((self.width, self.height), Image.Resampling.LANCZOS)
            for img in images
        ]

        processed = self.processor(text=text, images=[images], return_tensors="pt")

        prompt_input_ids = processed["input_ids"]
        input_ids = torch.cat([prompt_input_ids, label_start], dim=1)

        attention_mask = torch.ones(1, input_ids.shape[1])
        processed["input_ids"] = input_ids
        processed["attention_mask"] = attention_mask

        return processed

def format_prompt(images_path, step_id, route_instruction, distance_traveled,
                  previous_actions, move_possible, processor, system_prompt):
    images = os.listdir(images_path)
    images = [os.path.join(images_path, img) for img in images]
    # Sort by the step index encoded in the filename (step_0.png, step_1.png, ...).
    images = sorted(images, key=lambda x: int(x.split("_")[-1].split(".")[0]))

    current_image = images.pop(-1)

    # The prompt text below (including its spelling) is kept verbatim to match
    # the formatting used during fine-tuning.
    content = [
        {
            "type": "text",
            "text": f"Route Instruction: {route_instruction}\nCurrent Step: {step_id}\nCummulative Distance Traveled: {distance_traveled}\nImages from Previous Steps: ",
        },
    ]

    for img in images:
        content.append({"type": "image", "image": img})

    if len(images) == 0:
        content[0]["text"] += "[]"

    content.append(
        {
            "type": "text",
            "text": f"\nActions performed at Previous Steps: {str(previous_actions)}\nCurrent image:",
        }
    )
    content.append(
        {
            "type": "image",
            "image": current_image,
        }
    )

    # "Move" is only offered when an adjacent node lies within the field of view.
    if move_possible:
        possible_actions = ["Left", "Right", "Move", "Stop"]
    else:
        possible_actions = ["Left", "Right", "Stop"]

    content.append(
        {
            "type": "text",
            "text": f"\nPossible actions: {str(possible_actions)}\nNow predict the next action based on the input you have recived. Answer on the format: Action: (an the action you choose)",
        }
    )

    messages = [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user", "content": content},
    ]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    images.append(current_image)

    formatted_sample = {"text": text, "images": images}
    return DT.from_list([formatted_sample])

# Load model and processor
processor = AutoProcessor.from_pretrained("Vebbern/Qwen2.5-VL-3B-R2R-low-level")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Vebbern/Qwen2.5-VL-3B-R2R-low-level",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)

# Remember to set the correct image resolution (a higher resolution may still
# work, since the vision encoder was not trained).
collate_fn = CollateFunctor(processor, 320, 240)

# Load the mandatory system prompt (included in this repo)
with open("system_prompt.txt", "r") as f:
    system_prompt = f.read()

path_id = 1021  # ID of the R2R path
route_instruction = "Turn around and keep walking on the hallway across the first doorway and wait at the top of some stairs. "
images_path = f"./images/{path_id}"  # episode images, named step_0.png, step_1.png, ...
step_id = 2
distance = 8.223
previous_actions = ["Left", "Move"]
move_possible = True  # set to False if no adjacent node is within the field of view

# Loads all images for the path from step 0 up to the current step.
prompt = format_prompt(images_path, step_id, route_instruction, distance,
                       previous_actions, move_possible, processor, system_prompt)

dataset = CustomDataset(prompt)
data_loader = DataLoader(
    dataset,
    batch_size=1,
    collate_fn=collate_fn,
)

# Run inference
for batch in data_loader:
    batch = batch.to("cuda")

    with torch.no_grad():
        outputs = model(**batch)
    argmax = torch.argmax(outputs.logits, dim=2)[0]
    # The logit at the last position predicts the token that follows the
    # "Action: " prefix, i.e. the chosen action.
    model_prediction = processor.decode(argmax[-1])
    print(f"Predicted action: {model_prediction}")
```

> ⚠️ Sorry for the rough code; the goal here is simply to show how the system prompt and inputs should be structured for inference. The system prompt is included in the repo.

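As an alternative to taking the argmax of the final logit, autoregressive decoding should work as well. A sketch under the same setup as above (`max_new_tokens=5` is an assumption that comfortably covers one action word):

```python
for batch in data_loader:
    batch = batch.to("cuda")
    generated = model.generate(**batch, max_new_tokens=5)
    # Decode only the newly generated tokens after the prompt.
    new_tokens = generated[0, batch["input_ids"].shape[1]:]
    print(processor.decode(new_tokens, skip_special_tokens=True))
```
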
## 📊 Evaluation Results

The model was evaluated on the standard Room-to-Room (R2R) validation and test splits using the Matterport3D simulator. Performance is measured with the standard VLN metrics.

| Metric                   | Val Seen | Val Unseen | Test  |
|--------------------------|----------|------------|-------|
| Path Length (m, ↓)       | 10.27    | 10.50      | 10.59 |
| Navigation Error (m, ↓)  | 7.14     | 7.84       | 7.99  |
| Oracle Success Rate (↑)  | 41%      | 34%        | 34%   |
| Success Rate (↑)         | 35%      | 27%        | 26%   |
| SPL (↑)                  | 32%      | 24%        | 24%   |

### 🧾 Metric Definitions

- **Path Length**: Mean length of the agent's trajectory, in meters.
- **Navigation Error**: Mean distance from the goal when the agent stops, in meters.
- **Success Rate**: Percentage of episodes in which the agent stops within 3 meters of the goal.
- **SPL (Success weighted by Path Length)**: Success rate weighted by path efficiency; penalizes long or inefficient paths (see the sketch after this list).
- **Oracle Success Rate**: Success rate if the agent had stopped at its closest point to the goal.

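For reference, a sketch of the standard SPL computation, with per-episode success `S_i`, shortest-path distance `l_i`, and executed path length `p_i`:

```python
def spl(successes, shortest_dists, path_lengths):
    """Standard SPL: mean of S_i * l_i / max(p_i, l_i) over episodes."""
    terms = [
        s * (l / max(p, l))
        for s, l, p in zip(successes, shortest_dists, path_lengths)
    ]
    return sum(terms) / len(terms)
```
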
### 📝 Remarks

While this model performs competitively with other low-level action space approaches on the R2R task, it still falls significantly short of state-of-the-art methods that use a panoramic action space.

Nonetheless, it provides a useful and interpretable large vision-language model baseline for VLN with a low-level action space.

## 🔁 Related Models

A panoramic action space equivalent of this model is also available:

- **Panoramic Action Space Version**: [Qwen2.5-VL-3B-R2R-panoramic](https://huggingface.co/Vebbern/Qwen2.5-VL-3B-R2R-panoramic)

## 🪪 License

This model is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).