---
license: mit
library_name: transformers
tags:
  - robotics
  - reward-model
  - video-language-model
  - reasoning
  - reinforcement-learning
  - qwen3-vl
  - bf16
pipeline_tag: image-text-to-text
datasets:
  - Philip-MIT/sole_training_data
---

# SOLE-R1-8B

SOLE-R1-8B is a video-language reward reasoning model for robotics. It is designed to estimate task progress from robot video frames and a natural-language task description, producing both per-timestep reasoning traces and scalar progress predictions that can be used as rewards for online robot reinforcement learning.

This model accompanies the paper **“SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot RL”** by Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, and Ondrej Biza.

- Paper: https://arxiv.org/abs/2603.28730
- Project page: https://philip-mit.github.io/sole-r1/
- Code: https://github.com/Philip-MIT/sole-r1-model
- Training data: https://huggingface.co/datasets/Philip-MIT/sole_training_data

## Model Description

SOLE-R1 predicts robot task progress from visual observations. Given an image or video-frame montage containing the first, previous, and current timestep views, plus a task description and prior progress value, the model outputs a reasoning trace and a scalar progress estimate.

Expected output format:

    <think>reasoning about task progress</think><answer>progress%</answer>

The progress estimate is intended to serve as a dense reward signal for robotic reinforcement learning, especially when manually engineered rewards are unavailable.
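
As a rough illustration of how a progress estimate could become a dense reward, the sketch below converts a sequence of per-timestep progress values into per-step rewards by taking differences. This difference-based convention, the function name `progress_to_rewards`, and the scaling to `[-1, 1]` are assumptions made for illustration; the paper and RoboReason may define the reward differently.

```python
from typing import List, Sequence


def progress_to_rewards(progress_percent: Sequence[float]) -> List[float]:
    """Convert per-timestep progress (0-100) into per-step rewards in [-1, 1].

    Assumed convention (not taken from the paper):
    reward_t = (progress_t - progress_{t-1}) / 100,
    with progress before the first step taken to be 0.
    """
    rewards = []
    previous = 0.0
    for value in progress_percent:
        # Clip to the valid percentage range before differencing.
        clipped = min(100.0, max(0.0, float(value)))
        rewards.append((clipped - previous) / 100.0)
        previous = clipped
    return rewards


print(progress_to_rewards([0, 25, 50, 25]))  # [0.0, 0.25, 0.25, -0.25]
```

With this convention, a step that loses progress receives a negative reward. Using the clipped progress value directly as the reward is an equally plausible alternative.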

## Intended Use

SOLE-R1-8B is intended for:

- Robotics reward prediction
- Online robot RL reward generation
- Evaluating task progress from robot videos
- Interpretable video-language reasoning for manipulation tasks
- Research on learned reward models and robotic foundation models

It is not intended as a general-purpose or safety-critical robotics controller. Validate the model in the target environment before using it in closed-loop robotic systems.

## Quick Start

The recommended interface is through RoboReason:

    # pip install -U roboreason

    import roboreason as rr

    video_paths = [
        "test_videos/robosuite/lift/unsuccessful/robosuite_lift_episode_12_unsuccessful_max_reward_38.mp4"
    ]

    task_description = "Pick up the cube from the table."

    rewards, reasoning_traces = rr.generate(
        model="SOLE-R1",
        task_description=task_description,
        video_paths=video_paths,
        view_type_per_video=["external and wrist"],
        verbose=False,
    )

    print(rewards)
    print(reasoning_traces)

Optional pre-download:

    from roboreason.utils.model_utils import get_model_dir

    get_model_dir("sole-r1")

## Input Format

The model is trained to reason over robot task progress using prompts that include:

- A robot task description
- The first timestep progress, typically `0%`
- The previous timestep progress
- Visual observations from the first, previous, and current timesteps
- Multiple camera views when available, such as external and wrist cameras

Example task description:

    Pick up the cube from the table.
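
The inputs listed above can be pictured as one record per timestep. The sketch below is a hypothetical container, not part of RoboReason or the released prompt format: `ProgressQuery`, `build_query`, and the `stride` parameter are names invented here. It only shows how the first, previous, and current frame indices might be selected alongside the task description and prior progress value.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ProgressQuery:
    """Hypothetical container for one progress query; not a released API."""
    task_description: str
    first_progress: float                # percent, typically 0.0
    previous_progress: float             # percent, estimate at the previous step
    frame_indices: Tuple[int, int, int]  # (first, previous, current)


def build_query(task_description: str, step: int, previous_progress: float,
                stride: int = 1) -> ProgressQuery:
    """Select the first, previous, and current frame indices for a timestep."""
    if step < 0 or stride < 1:
        raise ValueError("step must be >= 0 and stride must be >= 1")
    # The previous frame never goes before the first frame of the episode.
    previous = max(0, step - stride)
    return ProgressQuery(task_description, 0.0, previous_progress,
                         (0, previous, step))


query = build_query("Pick up the cube from the table.", step=8,
                    previous_progress=22.0, stride=4)
print(query.frame_indices)  # (0, 4, 8)
```

How the selected frames and camera views are composed into the actual model prompt is handled by the released code and is not reproduced here.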

## Output Format

The expected output format is:

    <think>[reasoning about visual task progress]</think><answer>[current task progress]%</answer>

Example:

    <think>The gripper has moved closer to the cube but has not yet grasped or lifted it. This indicates incremental progress from the previous timestep.</think><answer>22%</answer>

Downstream systems should parse the numeric value inside `<answer>...</answer>` as the reward/progress estimate.
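
A minimal parsing sketch, assuming the output follows the format above. The helper name `parse_progress` and the choice to return `None` on malformed output are illustrative and not part of the released code.

```python
import re
from typing import Optional

# Matches the number inside <answer>...</answer>, with an optional trailing "%".
_ANSWER_RE = re.compile(r"<answer>\s*(-?\d+(?:\.\d+)?)\s*%?\s*</answer>")


def parse_progress(output: str) -> Optional[float]:
    """Return the progress value in percent, or None if no answer tag is found."""
    matches = _ANSWER_RE.findall(output)
    if not matches:
        return None
    # Use the last match in case the reasoning trace itself quotes an answer tag.
    return float(matches[-1])


example = (
    "<think>The gripper has moved closer to the cube but has not yet grasped "
    "or lifted it.</think><answer>22%</answer>"
)
print(parse_progress(example))  # 22.0
```

Returning `None` instead of raising lets the caller decide how to handle a malformed generation, for example by reusing the previous progress value.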

## Training Data

SOLE-R1-8B was trained on the SOLE-R1 training dataset, available at:

    Philip-MIT/sole_training_data

The dataset contains robot task progress examples with images, prompts, reasoning completions, and progress labels. The full dataset is approximately 2 TB.

Streaming example:

    from datasets import load_dataset

    ds = load_dataset(
        "Philip-MIT/sole_training_data",
        split="train",
        streaming=True,
    )

    for row in ds:
        print(row)
        break

## Limitations

- Predictions may be unreliable outside the robot embodiments, tasks, camera views, and visual distributions represented in training.
- The model estimates progress rather than guaranteeing physical task success.
- Reasoning traces may be plausible but incorrect; treat the parsed progress score as a model prediction, not as ground truth.
- Closed-loop robot use should include safeguards, reward validation, and environment-specific testing.
- Performance can depend on prompt format, camera viewpoint, video sampling, and task wording.
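
One simple form of reward validation is to sanity-check each predicted progress value before it reaches the RL loop. The sketch below is an assumption-laden example rather than a recommended safeguard: `guarded_progress` and the `max_jump` threshold of 30 points are hypothetical and would need tuning per task and environment.

```python
import math
from typing import Optional


def guarded_progress(new_value: Optional[float], previous: float,
                     max_jump: float = 30.0) -> float:
    """Return a sanitized progress value in percent.

    Falls back to the previous value when the prediction is missing, is NaN,
    or changes by more than `max_jump` points in a single step.
    """
    if new_value is None or math.isnan(new_value):
        return previous
    # Clip to the valid percentage range before checking the jump size.
    clipped = min(100.0, max(0.0, new_value))
    if abs(clipped - previous) > max_jump:
        return previous
    return clipped


print(guarded_progress(26.0, previous=22.0))  # 26.0
print(guarded_progress(95.0, previous=22.0))  # 22.0
print(guarded_progress(None, previous=22.0))  # 22.0
```

A check like this only filters implausible values; it does not replace hardware-level safety mechanisms or environment-specific testing.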

## Ethical and Safety Considerations

This model is intended for robotics research. When using it in real robotic systems, users should account for hardware safety, collision risks, task constraints, and human supervision. Do not deploy the model as the sole safety mechanism for physical robots.

## Citation

BibTeX:

    @misc{schroeder2026soler1,
      title={SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot RL},
      author={Philip Schroeder and Thomas Weng and Karl Schmeckpeper and Eric Rosen and Stephen Hart and Ondrej Biza},
      year={2026},
      eprint={2603.28730},
      archivePrefix={arXiv},
      primaryClass={cs.RO}
    }

## License

This repository is released under the MIT License.