---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen3-VL-8B-Instruct
tags:
- reward-model
- robotics
- reinforcement-learning
- vision-language-model
- qwen3-vl
- robot-learning
library_name: transformers
---
# Large Reward Models (LRMs)
**Large Reward Models: Generalizable Online Robot Reward Generation with Vision-Language Models**
[Project Page](https://yanru-wu.github.io/Large-Reward-Models/) | [Paper](https://arxiv.org/abs/2603.16065)
**Authors:** Yanru Wu, Weiduo Yuan, Ang Qi, Vitor Guizilini, Jiageng Mao†, Yue Wang†
**Affiliations:** USC Physical Superintelligence Lab, Toyota Research Institute
## Overview
This repository contains three specialized Large Reward Models (LRMs) fine-tuned from [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) for generating reward signals in robot reinforcement learning. Each model serves a distinct role in the reward pipeline:
| Model | Path | Description |
|-------|------|-------------|
| **Temporal Contrastive** | `contrastive/` | Compares two observations to determine which is closer to task completion |
| **Absolute Progress** | `progress/` | Estimates the completion progress (0.0–1.0) from a single observation |
| **Task Completion** | `completion/` | Binary classifier for whether a task has been completed (yes/no) |
## Usage
### Requirements
```bash
pip install transformers torch pillow
```
### Temporal Contrastive Model
Given an initial observation and two later observations, this model predicts which of the two is closer to task completion.
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch
from PIL import Image
model_path = "USC-PSI-Lab/LRM-models"
subfolder = "contrastive"
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path, subfolder=subfolder,
    torch_dtype=torch.bfloat16, device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    model_path, subfolder=subfolder,
)
# Load images
initial_img = Image.open("initial.jpg").convert("RGB")
image_a = Image.open("image_a.jpg").convert("RGB")
image_b = Image.open("image_b.jpg").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "text", "text": "Task: Compare the completion progress.\n\nThe task is: Pick up the cup.\n\nYou are given:\n- Initial observation: "},
    {"type": "image", "image": initial_img},
    {"type": "text", "text": "\n- Later observation (Image A): "},
    {"type": "image", "image": image_a},
    {"type": "text", "text": "\n- Later observation (Image B): "},
    {"type": "image", "image": image_b},
    {"type": "text", "text": '\n\nQuestion: Which of Image A or Image B is closer to completing the task?\nSelect one value from the following list:\n["ImageA", "ImageB"]\n\nPlease provide a step-by-step visual analysis first, and then output your answer in the following JSON format:\n{ "more_complete_image": "selected_value" }'},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[initial_img, image_a, image_b], padding=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
response = processor.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
# Output: { "more_complete_image": "ImageA" }
```
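Because the model emits a free-form visual analysis before the JSON answer, downstream code needs to pull the JSON object out of the response text. A minimal helper sketch (not part of the released code; the regex assumes a flat, non-nested JSON object as prompted above):

```python
import json
import re

def extract_json(response: str) -> dict:
    """Extract the last flat JSON object from a model response.

    The models emit a step-by-step analysis followed by a JSON
    answer, so we take the last {...} span in the text.
    """
    matches = re.findall(r"\{[^{}]*\}", response)
    if not matches:
        raise ValueError("no JSON object found in response")
    return json.loads(matches[-1])

# Example: turn the contrastive answer into a preference label.
answer = extract_json('Step-by-step analysis...\n{ "more_complete_image": "ImageA" }')
preferred = answer["more_complete_image"]  # "ImageA" or "ImageB"
```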
### Absolute Progress Model
Estimates completion progress as a value between 0.0 and 1.0.
```python
subfolder = "progress"
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path, subfolder=subfolder,
    torch_dtype=torch.bfloat16, device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    model_path, subfolder=subfolder,
)
observation = Image.open("observation.jpg").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "text", "text": "Task: Estimate the completion progress.\n\nThe task is: Pick up the cup.\n\nYou are given:\n- Current observation: "},
    {"type": "image", "image": observation},
    {"type": "text", "text": '\n\nEstimate the task completion progress from 0.0 (not started) to 1.0 (fully completed).\nOutput your answer in the following JSON format:\n{ "completion_progress": value }'},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[observation], padding=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
response = processor.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
# Output: { "completion_progress": 0.7 }
```
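The scalar progress estimate can serve as a dense shaping signal, for example by differencing the estimates of consecutive observations so that forward progress yields a positive reward. The sketch below is illustrative only — the exact reward formulation here is an assumption, not necessarily the paper's definition:

```python
import re

def parse_progress(response: str) -> float:
    """Parse the completion_progress value, clamped to [0.0, 1.0]."""
    match = re.search(r'"completion_progress"\s*:\s*([0-9.]+)', response)
    if match is None:
        raise ValueError("no completion_progress found in response")
    return min(1.0, max(0.0, float(match.group(1))))

def progress_delta_reward(prev: float, curr: float) -> float:
    """Dense shaping reward: positive when the task moves forward."""
    return curr - prev

# Example with two consecutive model responses.
r = progress_delta_reward(
    parse_progress('{ "completion_progress": 0.4 }'),
    parse_progress('{ "completion_progress": 0.7 }'),
)
```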
### Task Completion Model
Binary prediction of whether a task has been completed.
```python
subfolder = "completion"
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path, subfolder=subfolder,
    torch_dtype=torch.bfloat16, device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    model_path, subfolder=subfolder,
)
observation = Image.open("observation.jpg").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "text", "text": "Task: Determine task completion.\n\nThe task is: Pick up the cup.\n\nYou are given:\n- Current observation: "},
    {"type": "image", "image": observation},
    {"type": "text", "text": '\n\nHas the task been completed?\nOutput your answer in the following JSON format:\n{ "task_completed": "yes" or "no" }'},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[observation], padding=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = processor.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
# Output: { "task_completed": "no" }
```
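For use as a sparse terminal reward, the yes/no answer can be mapped to a binary signal. A hedged sketch, assuming the JSON field name and values are exactly as prompted above:

```python
import re

def completion_reward(response: str) -> float:
    """Map the task_completed answer to a sparse binary reward."""
    match = re.search(r'"task_completed"\s*:\s*"(yes|no)"', response)
    if match is None:
        raise ValueError("no task_completed answer found in response")
    return 1.0 if match.group(1) == "yes" else 0.0

reward = completion_reward('{ "task_completed": "no" }')
```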
## License
This project is licensed under the Apache 2.0 License.