---
license: apache-2.0
datasets:
- allenai/MolmoWeb-SyntheticTraj
- allenai/MolmoWeb-HumanTrajs
- allenai/MolmoWeb-HumanSkills
- allenai/MolmoWeb-SyntheticSkills
- allenai/MolmoWeb-SyntheticQA
- allenai/MolmoWeb-SyntheticGround
language:
- en
base_model:
- Qwen/Qwen3-8B
- google/siglip-so400m-patch14-384
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- olmo
- molmo
- molmo2
---

<img src="molmoweb_logo.png" alt="Logo for the MolmoWeb Project" style="width: auto; height: 50px;">

# MolmoWeb-8B

<span style="color:red; font-weight: bold;">Important Update!</span> 
On **March 29, 2026 ~6PM PST**, we made a few small but important updates to this HF/transformers-compatible checkpoint so that its outputs exactly match our native model checkpoint. 
If you downloaded this checkpoint before that time, we recommend re-downloading it. See PRs [2](https://huggingface.co/allenai/MolmoWeb-8B/discussions/2) and [3](https://huggingface.co/allenai/MolmoWeb-8B/discussions/3) for details. Thanks for your understanding! 

MolmoWeb is a family of fully open multimodal web agents. MolmoWeb agents achieve state-of-the-art results, outperforming similar-scale open-weight-only
models such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B. MolmoWeb-8B also surpasses set-of-marks
(SoM) agents built on much larger closed frontier models like GPT-4o. We further demonstrate
consistent gains through test-time scaling via parallel rollouts with best-of-N selection, achieving 94.7%
and 60.5% pass@4 (compared to 78.2% and 35.3% pass@1) on WebVoyager and Online-Mind2Web,
respectively. 
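
The best-of-N selection described above can be sketched in a few lines. Here, `run_rollout` and `score_trajectory` are hypothetical stand-ins for executing one agent episode and judging its trajectory; the actual rollout and selection machinery used for MolmoWeb may differ (see the tech report):

```python
from concurrent.futures import ThreadPoolExecutor

def best_of_n(task, run_rollout, score_trajectory, n=4):
    """Run n independent agent rollouts in parallel and keep the best one.

    run_rollout(task, seed) -> trajectory   (hypothetical episode executor)
    score_trajectory(trajectory) -> float   (hypothetical verifier/judge)
    """
    with ThreadPoolExecutor(max_workers=n) as pool:
        trajectories = list(pool.map(lambda seed: run_rollout(task, seed), range(n)))
    # Keep the rollout the judge scores highest (best-of-N selection).
    return max(trajectories, key=score_trajectory)

# Toy usage: "rollouts" are string prefixes, and the "judge" prefers longer ones.
demo = best_of_n(
    "find the docs page",
    run_rollout=lambda task, seed: task[: 5 + seed],
    score_trajectory=len,
)
```

With pass@4 selection, only one of the four trajectories needs to succeed, which is why the pass@4 numbers above sit well above pass@1.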

**Learn more** about the MolmoWeb family in our announcement [blog post](https://allenai.org/blog/molmoweb) and [tech report](https://allenai.org/papers/molmoweb).

MolmoWeb-8B is based on the [Molmo2](https://arxiv.org/abs/2601.10611) architecture, which uses [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) as the language model and [SigLIP 2](https://huggingface.co/google/siglip-so400m-patch14-384) as the vision backbone. 

Ai2 is committed to open science. The MolmoWeb datasets are available [here](https://huggingface.co/collections/allenai/molmoweb-data). 
All other artifacts used in creating MolmoWeb (training code, [evaluations](https://github.com/allenai/molmoweb), intermediate checkpoints) will be made available, furthering our commitment to open-source AI development and reproducibility.

Quick links:
- ๐Ÿ’ฌ [Demo](https://molmoweb.allen.ai/)
- ๐Ÿ“‚ [All Models](https://huggingface.co/collections/allenai/molmoweb)
- ๐Ÿ“š [All Data](https://huggingface.co/collections/allenai/molmoweb-data)
- ๐Ÿ“ƒ [Paper](https://allenai.org/papers/molmoweb)
- ๐ŸŽฅ [Blog with Videos](https://allenai.org/blog/molmoweb)

## Quick Start
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import requests
import torch
from jinja2 import Template

checkpoint_dir = "allenai/MolmoWeb-8B"

model = AutoModelForImageTextToText.from_pretrained(
    checkpoint_dir,
    trust_remote_code=True,
    torch_dtype=torch.float32, # we recommend using the default float32 precision 
    attn_implementation="sdpa",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(
    checkpoint_dir,
    trust_remote_code=True,
    padding_side="left",
)


MOLMOWEB_THINK_TEMPLATE = Template(
"""
# GOAL
{{ task_description }}

# PREVIOUS STEPS
{% for action in past_actions -%}
## Step {{ action['index'] }}
THOUGHT: {{ action['thought'] }}
ACTION: {{ action['action'] }}
{% endfor %}
# CURRENTLY ACTIVE PAGE
Page {{ page_index }}: {{ page_title }} | {{ page_url }}

# NEXT STEP

"""
)

task_description = "Tell me about the Ai2 PRIOR team's recent projects"
past_actions = []
user_message = MOLMOWEB_THINK_TEMPLATE.render(
    page_title=None,
    page_url="about:blank",
    page_index=0,
    task_description=task_description,
    past_actions=[]
)
system_message = "molmo_web_think"
prompt = f"{system_message}: {user_message}"

blank_image = Image.new("RGB", (1280, 720), color="white")

image_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image", "image": blank_image},
        ]
    }
]

inputs = processor.apply_chat_template(
    image_messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    padding=True,
)

# Remove token_type_ids: HF uses it to enable bidirectional attention for image tokens; molmoweb is trained with causal attention only
inputs = {k: v.to("cuda") for k, v in inputs.items() if k != "token_type_ids"} 

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=200)

generated_tokens = output[0, inputs["input_ids"].size(1):]
print(processor.decode(generated_tokens, skip_special_tokens=True))
```
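
Completions for the `molmo_web_think` prompt follow the `THOUGHT:`/`ACTION:` layout shown in the template above. A minimal parser for that layout might look like the sketch below; the exact action syntax in the example string is an assumption, so check the tech report for the model's full action space:

```python
import re

def parse_step(completion: str):
    """Split a MolmoWeb-style completion into its THOUGHT and ACTION parts.

    Assumes the 'THOUGHT: ... ACTION: ...' layout from the prompt template;
    returns None if the completion does not match.
    """
    match = re.search(r"THOUGHT:\s*(.*?)\s*ACTION:\s*(.*)", completion, re.DOTALL)
    if match is None:
        return None
    return {"thought": match.group(1).strip(), "action": match.group(2).strip()}

# Hypothetical completion; the action string is illustrative only.
step = parse_step(
    "THOUGHT: The page is blank, so I should navigate first.\n"
    "ACTION: goto('https://allenai.org')"
)
```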

## License and Use

This model is licensed under Apache 2.0. It is intended for research and educational use in accordance with Ai2โ€™s [Responsible Use Guidelines](https://allenai.org/responsible-use).

## Citation

If you use this model, please cite:

[arXiv:2604.08516](https://arxiv.org/abs/2604.08516)

```bibtex
@misc{gupta2026molmowebopenvisualweb,
      title={MolmoWeb: Open Visual Web Agent and Open Data for the Open Web}, 
      author={Tanmay Gupta and Piper Wolters and Zixian Ma and Peter Sushko and Rock Yuren Pang and Diego Llanes and Yue Yang and Taira Anderson and Boyuan Zheng and Zhongzheng Ren and Harsh Trivedi and Taylor Blanton and Caleb Ouellette and Winson Han and Ali Farhadi and Ranjay Krishna},
      year={2026},
      eprint={2604.08516},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.08516}, 
}
```