---
license: mit
library_name: transformers
tags:
- robotics
- reward-model
- video-language-model
- reasoning
- reinforcement-learning
- qwen3-vl
- bf16
pipeline_tag: image-text-to-text
datasets:
- Philip-MIT/sole_training_data
---
# SOLE-R1-8B
SOLE-R1-8B is a video-language reward reasoning model for robotics. It is designed to estimate task progress from robot video frames and a natural-language task description, producing both per-timestep reasoning traces and scalar progress predictions that can be used as rewards for online robot reinforcement learning.
This model accompanies the paper **“SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot RL”** by Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, and Ondrej Biza.
- Paper: https://arxiv.org/abs/2603.28730
- Project page: https://philip-mit.github.io/sole-r1/
- Code: https://github.com/Philip-MIT/sole-r1-model
- Training data: https://huggingface.co/datasets/Philip-MIT/sole_training_data
## Model Description
SOLE-R1 predicts robot task progress from visual observations. Given a video and a task description, the model outputs a reasoning trace and a scalar progress estimate.
Expected output format:
<think>reasoning about task progress</think><answer>progress%</answer>
The progress estimate is intended to serve as a dense reward signal for robotic reinforcement learning, especially when manually engineered rewards are unavailable.
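The model card does not say how a progress percentage should be converted into a reward, so the following is only a sketch under stated assumptions. It shows two common options: scaling progress to the range [0, 1], and rewarding the change in progress between consecutive timesteps. Both function names are hypothetical and are not part of any SOLE-R1 or RoboReason API.

```python
from typing import List, Sequence


def progress_to_rewards(progress_pct: Sequence[float]) -> List[float]:
    """Clamp progress percentages to [0, 100] and scale them to [0, 1]."""
    return [min(max(p, 0.0), 100.0) / 100.0 for p in progress_pct]


def progress_to_delta_rewards(
    progress_pct: Sequence[float], start_pct: float = 0.0
) -> List[float]:
    """Reward each timestep by its change in progress since the previous one.

    Assumption: the episode starts at start_pct (0% by default).
    """
    rewards = []
    previous = start_pct
    for p in progress_pct:
        rewards.append((p - previous) / 100.0)
        previous = p
    return rewards
```

The delta variant gives a negative reward when estimated progress drops, which may or may not be desirable for a given RL setup.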
## Quick Start
The recommended interface for inference is [RoboReason](https://github.com/Philip-MIT/roboreason):
# pip install -U roboreason
import roboreason as rr
video_paths = [
    "test_videos/robosuite/lift/unsuccessful/robosuite_lift_episode_12_unsuccessful_max_reward_38.mp4"
]
task_description = "Pick up the cube from the table."
rewards, reasoning_traces = rr.generate(
    model="SOLE-R1",
    task_description=task_description,
    video_paths=video_paths,
    view_type_per_video=["external and wrist"],
    verbose=False,
)
print(rewards)
print(reasoning_traces)
# Plotting with show_reasoning_traces=True
output_sole = {"model": "SOLE-R1", "rewards": rewards[0], "reasoning_traces": reasoning_traces[0]}
rr.video_plot(
    outputs=[output_sole],
    plot_save_path="model_outputs/sole-r1/robosuite/lift/unsuccessful/robosuite_lift_episode_12_unsuccessful_max_reward_38.mp4",
    video_path=video_paths[0],
    show_reasoning_traces=True,
    task_description=task_description,
    verbose=False,
)
Optional pre-download:
from roboreason.utils.model_utils import get_model_dir
get_model_dir("sole-r1")
## Input Format
The model is trained to reason over robot task progress using prompts that include:
- A robot task description
- The progress at the first timestep, typically `0%`
- The progress at the previous timestep
- Visual observations from the first, previous, and current timesteps
- Multiple camera views when available, such as external and wrist cameras
Example task description:
Pick up the cube from the table.
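The inputs listed above imply a feedback loop: the progress predicted at one timestep becomes the "previous progress" for the next. The sketch below illustrates only that bookkeeping. `StepInputs` and `rollout_progress` are hypothetical names, the predictor is supplied by the caller, and the actual prompt template used by the model is not reproduced here. A frame can be any object, for example a tuple of external and wrist images.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class StepInputs:
    """Hypothetical container for the per-timestep inputs."""
    task_description: str
    first_progress_pct: float
    previous_progress_pct: float
    first_frame: Any
    previous_frame: Any
    current_frame: Any


def rollout_progress(
    task_description: str,
    frames: Sequence[Any],
    predict: Callable[[StepInputs], float],
    first_progress_pct: float = 0.0,
) -> List[float]:
    """Run a caller-supplied predictor over frames, feeding back its last estimate."""
    if not frames:
        return []
    # The first timestep is assigned the initial progress without a prediction.
    progress = [first_progress_pct]
    for t in range(1, len(frames)):
        step = StepInputs(
            task_description=task_description,
            first_progress_pct=first_progress_pct,
            previous_progress_pct=progress[-1],
            first_frame=frames[0],
            previous_frame=frames[t - 1],
            current_frame=frames[t],
        )
        progress.append(predict(step))
    return progress
```

In practice `predict` would wrap a model call and parse its output; for the recommended inference path, use RoboReason as shown in Quick Start.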
## Output Format
The expected output format is:
<think>[reasoning about visual task progress]</think><answer>[current task progress]%</answer>
Example:
<think>The gripper has moved closer to the cube but has not yet grasped or lifted it. This indicates incremental progress from the previous timestep.</think><answer>22%</answer>
Downstream systems should parse the numeric value inside `<answer>...</answer>` as the reward/progress estimate.
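A minimal parsing sketch using only the Python standard library follows. The function name `parse_progress` is hypothetical, and returning `None` for malformed output is an assumption about how a downstream system might want to handle missing answers.

```python
import re
from typing import Optional

# Matches a number, optionally followed by "%", inside <answer>...</answer>.
_ANSWER_RE = re.compile(r"<answer>\s*(-?\d+(?:\.\d+)?)\s*%?\s*</answer>")


def parse_progress(output: str) -> Optional[float]:
    """Return the progress percentage from model output, or None if absent."""
    match = _ANSWER_RE.search(output)
    if match is None:
        return None
    return float(match.group(1))
```

Callers can then decide whether to skip, retry, or reuse the previous estimate when the result is `None`.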
## Training Data
The model was trained on the [sole_training_data](https://huggingface.co/datasets/Philip-MIT/sole_training_data) dataset.
The dataset contains robot task progress examples with images, prompts, reasoning completions, and progress labels. The full dataset is approximately 2TB.
Streaming example:
from datasets import load_dataset
ds = load_dataset(
    "Philip-MIT/sole_training_data",
    split="train",
    streaming=True,
)
for row in ds:
    print(row)
    break
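Because the full dataset is large, it can help to look at a handful of rows before writing a training pipeline. The helpers below are a small sketch that works on any iterable of dict rows, including a streaming dataset. `peek_rows` and `column_names` are hypothetical names, and no assumption is made about which columns the dataset contains.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def peek_rows(rows: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Read at most n rows from an iterable of dict rows."""
    return list(islice(rows, n))


def column_names(rows: Iterable[Dict[str, Any]]) -> List[str]:
    """Return the sorted union of keys across the given rows."""
    names = set()
    for row in rows:
        names.update(row.keys())
    return sorted(names)


# Example with the streaming dataset (requires network access):
# sample = peek_rows(ds, n=3)
# print(column_names(sample))
```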
## Citation
BibTeX:
@misc{schroeder2026soler1,
title={SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot RL},
author={Philip Schroeder and Thomas Weng and Karl Schmeckpeper and Eric Rosen and Stephen Hart and Ondrej Biza},
year={2026},
eprint={2603.28730},
archivePrefix={arXiv},
primaryClass={cs.RO}
}
## License
This repository is released under the MIT License.