---
license: mit
library_name: transformers
tags:
  - robotics
  - reward-model
  - video-language-model
  - reasoning
  - reinforcement-learning
  - qwen3-vl
  - bf16
pipeline_tag: image-text-to-text
datasets:
  - Philip-MIT/sole_training_data
---

# SOLE-R1-8B

SOLE-R1-8B is a video-language reward reasoning model for robotics. It is designed to estimate task progress from robot video frames and a natural-language task description, producing both per-timestep reasoning traces and scalar progress predictions that can be used as rewards for online robot reinforcement learning.

This model accompanies the paper **“SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot RL”** by Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, and Ondrej Biza.

- Paper: https://arxiv.org/abs/2603.28730
- Project page: https://philip-mit.github.io/sole-r1/
- Code: https://github.com/Philip-MIT/sole-r1-model
- Training data: https://huggingface.co/datasets/Philip-MIT/sole_training_data

## Model Description

SOLE-R1 predicts robot task progress from visual observations. Given a video and a task description, the model outputs a reasoning trace and a scalar progress estimate.

Expected output format:

    <think>reasoning about task progress</think><answer>progress%</answer>

The progress estimate is intended to serve as a dense reward signal for robotic reinforcement learning, especially when manually engineered rewards are unavailable.
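
As a rough illustration of how progress estimates could feed an RL loop, the sketch below maps a sequence of progress percentages to per-step rewards. The function name `progress_to_rewards` and the `use_delta` option are hypothetical and not part of this repository or RoboReason; whether to use absolute progress or the change in progress as the reward is an assumption left to the downstream setup.

```python
from typing import List


def progress_to_rewards(progress_pct: List[float], use_delta: bool = False) -> List[float]:
    """Map progress percentages (0-100) to per-step rewards.

    Hypothetical helper for illustration only; not the authors' reward pipeline.
    """
    # Clamp each estimate to [0, 100] and rescale to [0, 1].
    scaled = [min(max(p, 0.0), 100.0) / 100.0 for p in progress_pct]
    if not use_delta:
        # Absolute progress is used directly as the reward.
        return scaled

    # Delta mode: the reward is the change in progress since the previous step.
    # The first step is measured against an assumed starting progress of 0.
    rewards = []
    previous = 0.0
    for value in scaled:
        rewards.append(value - previous)
        previous = value
    return rewards
```

For example, `progress_to_rewards([0, 25, 50], use_delta=True)` rewards each step by how much progress it added, so a regression in progress yields a negative reward.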


## Quick Start

The recommended interface for inference is [RoboReason](https://github.com/Philip-MIT/roboreason):

    # pip install -U roboreason

    import roboreason as rr

    video_paths = [
        "test_videos/robosuite/lift/unsuccessful/robosuite_lift_episode_12_unsuccessful_max_reward_38.mp4"
    ]

    task_description = "Pick up the cube from the table."

    rewards, reasoning_traces = rr.generate(
        model="SOLE-R1",
        task_description=task_description,
        video_paths=video_paths,
        view_type_per_video=["external and wrist"],
        verbose=False,
    )
    print(rewards)
    print(reasoning_traces)

    # Plotting with show_reasoning_traces=True
    output_sole = {"model": "SOLE-R1", "rewards": rewards[0], "reasoning_traces": reasoning_traces[0]}
    rr.video_plot(
        outputs=[output_sole], 
        plot_save_path='model_outputs/sole-r1/robosuite/lift/unsuccessful/robosuite_lift_episode_12_unsuccessful_max_reward_38.mp4', 
        video_path=video_paths[0],
        show_reasoning_traces=True,
        task_description=task_description,
        verbose=False
    )




Optional pre-download:

    from roboreason.utils.model_utils import get_model_dir

    get_model_dir("sole-r1")

## Input Format

The model is trained to reason over robot task progress using prompts that include:

- A robot task description
- The first timestep progress, typically `0%`
- The previous timestep progress
- Visual observations from the first, previous, and current timesteps
- Multiple camera views when available, such as external and wrist cameras
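
The items above imply some bookkeeping across timesteps: the caller has to remember the first and previous observations and the previous progress value. The sketch below shows one way to track that state. The `ProgressContext` class, its field names, and the dictionary layout are all hypothetical; this is not the prompt template used in training and not the RoboReason API, which handles prompt construction internally.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

Frames = Dict[str, str]  # assumed layout: camera view name -> image path


@dataclass
class ProgressContext:
    """Hypothetical per-episode state for building progress queries."""

    task_description: str
    first_progress: float = 0.0      # first timestep progress, typically 0%
    previous_progress: float = 0.0
    first_frames: Optional[Frames] = None
    previous_frames: Optional[Frames] = None

    def build_inputs(self, current_frames: Frames) -> Dict[str, Any]:
        # On the first call, the current frames also serve as first/previous.
        if self.first_frames is None:
            self.first_frames = current_frames
            self.previous_frames = current_frames
        return {
            "task_description": self.task_description,
            "first_progress": self.first_progress,
            "previous_progress": self.previous_progress,
            "first_frames": self.first_frames,
            "previous_frames": self.previous_frames,
            "current_frames": current_frames,
        }

    def update(self, current_frames: Frames, predicted_progress: float) -> None:
        # After the model answers, the current step becomes the previous one.
        self.previous_frames = current_frames
        self.previous_progress = predicted_progress
```

A caller would invoke `build_inputs` before each model query and `update` after parsing the model's answer, so the next query sees the latest progress value.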

Example task description:

    Pick up the cube from the table.

## Output Format

The expected output format is:

    <think>[reasoning about visual task progress]</think><answer>[current task progress]%</answer>

Example:

    <think>The gripper has moved closer to the cube but has not yet grasped or lifted it. This indicates incremental progress from the previous timestep.</think><answer>22%</answer>

Downstream systems should parse the numeric value inside `<answer>...</answer>` as the reward/progress estimate.
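
A minimal parsing sketch, assuming the output follows the format shown above. The helper name `parse_progress` is hypothetical, and returning `None` for malformed output is a design choice of this sketch rather than documented behavior.

```python
import re
from typing import Optional

# Matches a number (optionally signed or decimal) inside <answer>...</answer>,
# with an optional trailing percent sign.
_ANSWER_RE = re.compile(r"<answer>\s*(-?\d+(?:\.\d+)?)\s*%?\s*</answer>")


def parse_progress(model_output: str) -> Optional[float]:
    """Return the progress percentage from a model output, or None if absent."""
    match = _ANSWER_RE.search(model_output)
    if match is None:
        return None
    return float(match.group(1))
```

Returning `None` instead of raising lets the caller decide how to handle a malformed generation, for example by reusing the previous timestep's progress.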

## Training Data

The model was trained on the [sole_training_data](https://huggingface.co/datasets/Philip-MIT/sole_training_data) dataset.

The dataset contains robot task progress examples with images, prompts, reasoning completions, and progress labels. The full dataset is approximately 2 TB, so streaming is recommended over a full download.

Streaming example:

    from datasets import load_dataset

    ds = load_dataset(
        "Philip-MIT/sole_training_data",
        split="train",
        streaming=True,
    )

    for row in ds:
        print(row)
        break

## Citation

BibTeX:

    @misc{schroeder2026soler1,
      title={SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot RL},
      author={Philip Schroeder and Thomas Weng and Karl Schmeckpeper and Eric Rosen and Stephen Hart and Ondrej Biza},
      year={2026},
      eprint={2603.28730},
      archivePrefix={arXiv},
      primaryClass={cs.RO}
    }

## License

This repository is released under the MIT License.