---
license: apache-2.0
datasets:
- allenai/MolmoWeb-SyntheticTraj
- allenai/MolmoWeb-HumanTrajs
- allenai/MolmoWeb-HumanSkills
- allenai/MolmoWeb-SyntheticSkills
- allenai/MolmoWeb-SyntheticQA
- allenai/MolmoWeb-SyntheticGround
language:
- en
base_model:
- Qwen/Qwen3-8B
- google/siglip-so400m-patch14-384
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- olmo
- molmo
- molmo2
---

<img src="molmoweb_logo.png" alt="Logo for the MolmoWeb Project" style="width: auto; height: 50px;">

# MolmoWeb-4B-Native 

**Note:** this is the Molmo-native checkpoint and is NOT Hugging Face `transformers`-compatible. Check out [allenai/MolmoWeb-4B](https://huggingface.co/allenai/MolmoWeb-4B) for the HF-compatible checkpoint.

MolmoWeb is a family of fully open multimodal web agents. MolmoWeb agents achieve state-of-the-art results, outperforming similar-scale open-weight-only
models such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B. MolmoWeb-8B also surpasses set-of-marks
(SoM) agents built on much larger closed frontier models like GPT-4o. We further demonstrate
consistent gains through test-time scaling via parallel rollouts with best-of-N selection, achieving 94.7%
and 60.5% pass@4 (compared to 78.2% and 35.3% pass@1) on WebVoyager and Online-Mind2Web,
respectively.

**Learn more** about the MolmoWeb family in our announcement [blog post](https://allenai.org/blog/molmoweb) and [tech report](https://allenai.org/papers/molmoweb).

MolmoWeb-4B-Native is based on the [Molmo2](https://arxiv.org/abs/2601.10611) architecture, which pairs [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) with [SigLIP 2](https://huggingface.co/google/siglip-so400m-patch14-384) as the vision backbone.

Ai2 is committed to open science. The MolmoWeb datasets are available [here](https://huggingface.co/collections/allenai/molmoweb-data). 
All other artifacts used in creating MolmoWeb (training code, [evaluations](https://github.com/allenai/molmoweb), intermediate checkpoints) will be made available, furthering our commitment to open-source AI development and reproducibility.

Quick links:
- 💬 [Demo](https://molmoweb.allen.ai/)
- 📂 [All Models](https://huggingface.co/collections/allenai/molmoweb)
- 📚 [All Data](https://huggingface.co/collections/allenai/molmoweb-data)
- 📃 [Paper](https://allenai.org/papers/molmoweb)
- 🎥 [Blog with Videos](https://allenai.org/blog/molmoweb)

## Usage
Please refer to our [GitHub repo](https://github.com/allenai/molmoweb/) for inference code.
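
Because this Molmo-native checkpoint cannot be loaded with `transformers`, inference goes through the project repository linked above. A minimal setup sketch (the install step is an assumption; follow the repository's own README for the exact environment and entry points):

```shell
# Fetch the inference code (clone URL from the GitHub link above).
git clone https://github.com/allenai/molmoweb.git
cd molmoweb

# Hypothetical: install in editable mode; check the repository README
# for the actual setup instructions and inference entry points.
pip install -e .
```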

## License and Use

This model is licensed under Apache 2.0. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use).

## Citation

If you use this model, please cite:

[arXiv:2604.08516](https://arxiv.org/abs/2604.08516)

```bibtex
@misc{gupta2026molmowebopenvisualweb,
      title={MolmoWeb: Open Visual Web Agent and Open Data for the Open Web}, 
      author={Tanmay Gupta and Piper Wolters and Zixian Ma and Peter Sushko and Rock Yuren Pang and Diego Llanes and Yue Yang and Taira Anderson and Boyuan Zheng and Zhongzheng Ren and Harsh Trivedi and Taylor Blanton and Caleb Ouellette and Winson Han and Ali Farhadi and Ranjay Krishna},
      year={2026},
      eprint={2604.08516},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.08516}, 
}