Instructions for using ravilution/MolmoWeb-4B with libraries and local apps.

Libraries

How to use ravilution/MolmoWeb-4B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="ravilution/MolmoWeb-4B", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# Run inference and print the generated reply
print(pipe(text=messages))

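# Load the model and its processor directly
from transformers import AutoModelForImageTextToText, AutoProcessor

processor = AutoProcessor.from_pretrained("ravilution/MolmoWeb-4B", trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained("ravilution/MolmoWeb-4B", trust_remote_code=True, dtype="auto")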

Local Apps

vLLM

How to use ravilution/MolmoWeb-4B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ravilution/MolmoWeb-4B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ravilution/MolmoWeb-4B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
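
The curl request above can also be sent from Python. The following is a minimal sketch using only the standard library; it assumes the vLLM server from the previous step is listening on `localhost:8000`, and the helper names `build_chat_payload` and `post_chat` are hypothetical, chosen for this example rather than taken from vLLM.

```python
import json
import urllib.request

# Assumed defaults, copied from the curl example above.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "ravilution/MolmoWeb-4B"


def build_chat_payload(prompt: str, image_url: str, model: str = MODEL_ID) -> dict:
    """Build an OpenAI-style chat request with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(payload: dict, base_url: str = BASE_URL, timeout: float = 120.0) -> dict:
    """POST the payload to /chat/completions and return the decoded JSON body."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server running, `post_chat(build_chat_payload("Describe this image in one sentence.", image_url))` should return the parsed JSON response. Pointing `base_url` at another OpenAI-compatible endpoint reuses the same payload unchanged.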

SGLang

How to use ravilution/MolmoWeb-4B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ravilution/MolmoWeb-4B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ravilution/MolmoWeb-4B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
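
An OpenAI-compatible server answers the request above with a JSON object whose generated text sits under `choices[0].message.content`. The sketch below shows one way to pull that text out defensively. The function name `extract_reply` is hypothetical, and the sample response is a hand-written placeholder for illustration, not real model output.

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion response."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    content = choices[0].get("message", {}).get("content")
    if content is None:
        raise ValueError("first choice has no message content")
    return content


# Hand-written placeholder with the same shape as a chat completion response.
sample_response = {
    "model": "ravilution/MolmoWeb-4B",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "<generated text>"},
            "finish_reason": "stop",
        }
    ],
}

print(extract_reply(sample_response))
```

Raising on a missing field, rather than returning an empty string, makes a malformed or error response visible at the call site.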

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ravilution/MolmoWeb-4B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server with the same curl request as in the pip-based setup above (the port is unchanged).

Docker Model Runner
How to use ravilution/MolmoWeb-4B with Docker Model Runner:
```
docker model run hf.co/ravilution/MolmoWeb-4B
```

MolmoWeb-4B / config.yaml

Latest commit by ravilution (b6e1d00, about 2 months ago): Initial upload: MolmoWeb-4B with HF/vLLM compatibility patches

File size: 15.4 kB

	run_name: train_hero_4b_03-14-06-17
	model:
	model_name: molmo
	llm:
	d_model: 2560
	n_heads: 32
	n_kv_heads: 8
	head_dim: 128
	qkv_bias: false
	clip_qkv: null
	n_layers: 36
	mlp_ratio: 4
	mlp_hidden_size: 19456
	activation_type: swiglu
	block_type: sequential
	rope: true
	rope_full_precision: true
	rope_theta: 1000000.0
	rope_type: default
	rope_factor: null
	rope_high_freq_factor: null
	rope_low_freq_factor: null
	rope_original_max_position_embeddings: null
	rope_attention_factor: null
	rope_beta_fast: null
	rope_beta_slow: null
	rope_mscale: null
	rope_mscale_all_dim: null
	rope_truncate: null
	attention_type: sdpa
	full_attention_layers: null
	sliding_attention_rope_scaling: false
	float32_attention: true
	attention_dropout: 0.0
	attention_layer_norm: true
	attention_layer_norm_type: qwen3
	residual_dropout: 0.1
	response_residual_dropout: 0.0
	layer_norm_type: rms
	layer_norm_with_affine: true
	layer_norm_eps: 1.0e-06
	attention_layer_norm_with_affine: true
	max_sequence_length: 10240
	max_position_embeddings: null
	include_bias: false
	bias_for_layer_norm: null
	norm_after: false
	moe_num_experts: 8
	moe_top_k: 2
	moe_mlp_impl: sparse
	moe_log_expert_assignment: false
	moe_shared_expert: false
	moe_lbl_in_fp32: false
	moe_interleave: false
	moe_loss_weight: 0.1
	moe_zloss_weight: null
	moe_dropless: true
	moe_capacity_factor: 1.25
	embedding_dropout: 0.0
	scale_logits: false
	vocab_size: 151936
	additional_vocab_size: 128
	weight_tying: true
	embedding_size: 151936
	use_position_ids: true
	tokenizer:
	identifier: Qwen/Qwen3-4B
	tokenizer_dir: null
	init_path: /weka/oe-training-default/mm-olmo/pretrained_llms/qwen3-4b.pt
	init_incremental: null
	new_embedding_init_range: 0.02
	initializer_range: 0.02
	normalize_input_embeds: false
	activation_checkpoint: whole_layer
	compile: blocks
	fix_pad_tokenizer: false
	init_std: 0.02
	init_fn: normal
	init_cutoff_factor: null
	vision_backbone:
	vit:
	image_model_type: siglip
	image_default_input_size:
	- 378
	- 378
	image_patch_size: 14
	image_pos_patch_size: 14
	image_emb_dim: 1152
	image_num_heads: 16
	image_num_key_value_heads: 16
	image_num_layers: 27
	image_head_dim: 72
	image_mlp_dim: 4304
	image_mlp_activations: gelu_pytorch_tanh
	image_dropout_rate: 0.0
	image_num_pos: 729
	image_norm_eps: 1.0e-06
	attention_dropout: 0.0
	residual_dropout: 0.0
	initializer_range: 0.02
	float32_attention: true
	attention_type: sdpa
	sdpa_backend: all
	activation_checkpointing: true
	init_path: /weka/oe-training-default/mm-olmo/pretrained_image_encoders/siglip2-so400m-14-384.pt
	resize_mode: siglip
	pad_value: 0.0
	normalize: siglip
	image_pooling_2d: attention_meanq
	pooling_attention_mask: true
	image_projector: mlp
	image_padding_embed: null
	vit_layers:
	- -3
	- -9
	skip_unused_layers: true
	use_deepstack: false
	share_connector: false
	image_feature_dropout: 0.0
	connector_activation_checkpointing: true
	compile_vit: blocks
	pool_size_embeds: null
	compile_connector: dynamic
	normalize_on_gpu: true
	data_formatter:
	prompt_templates: uber_model
	message_format: role
	system_prompt: demo_or_style
	always_start_with_space: false
	default_inference_len: 65
	select_answer: best
	debug: false
	image_last: false
	format_message_list: null
	p_one_message: 0.0
	eval_system_prompt_mapping: null
	p_choice_content_in_mc: 1.0
	template_video_mc_questions: true
	pointing_format: html-v2
	points_decimal_places: 1
	use_seperate_non_pointing_qa_style: false
	timestamp_mode: 50-percent-seconds
	output_timestamp_mode: seconds
	seconds_decimal_places: 1
	p_multi_point_all_image: 0.0
	use_seperate_count_without_pointing_style: false
	sample_random_initial_point: true
	mm_preprocessor:
	crop_mode: overlap-and-resize-c2
	use_col_tokens: true
	max_crops: 8
	high_res_max_crops: 24
	p_high_res: 0.0
	pooling_w: 2
	pooling_h: 2
	overlap_margins:
	- 4
	- 4
	max_images: null
	max_multi_image_crops: 4
	multi_image_pooling_w: 2
	multi_image_pooling_h: 2
	use_single_crop_col_tokens: null
	use_single_crop_start_token: false
	max_answer_len: null
	last_message_loss_only: false
	max_text_tokens: null
	loss_token_weighting: root_subsegments
	image_padding_mask: false
	legacy_image_mask: false
	bi_directional_attn: null
	parallelism:
	data_parallel_replicate_degree: 1
	enable_compiled_autograd: false
	data_parallel_shard_degree: -1
	fsdp_reshard_after_forward: default
	context_parallel_config:
	degree: 1
	attention_type: ulysses
	load_balancer: ulysses
	head_stride: 1
	tensor_parallel_config:
	degree: 1
	enable_async: false
	data_parallel_config:
	name: fsdp
	param_dtype: null
	reduce_dtype: float32
	num_replicas: null
	shard_degree: null
	wrapping_strategy: full
	prefetch_factor: 0
	context_parallel_rotate_method: allgather
	seed: 6198
	epoch: null
	dry_run: false
	ft_llm: true
	ft_vit: true
	ft_connector: true
	ft_embedding: lm_head
	optimizer:
	name: adamw
	learning_rate: 0.0001
	weight_decay: 0.01
	betas:
	- 0.9
	- 0.95
	eps: 1.0e-05
	connector_learning_rate: 5.0e-06
	vit_learning_rate: 5.0e-06
	llm_learning_rate: 1.0e-05
	frame_selector_learning_rate: 0.0001
	temporal_token_scorer_learning_rate: 0.0001
	connector_weight_decay: 0.0
	vit_weight_decay: 0.0
	llm_weight_decay: 0.0
	frame_selector_weight_decay: 0.01
	temporal_token_scorer_weight_decay: 0.01
	connector_betas:
	- 0.9
	- 0.95
	vit_betas:
	- 0.9
	- 0.95
	llm_betas:
	- 0.9
	- 0.95
	frame_selector_betas:
	- 0.9
	- 0.95
	temporal_token_scorer_betas:
	- 0.9
	- 0.95
	connector_eps: 1.0e-06
	vit_eps: 1.0e-06
	llm_eps: 1.0e-06
	frame_selector_eps: 1.0e-06
	temporal_token_scorer_eps: 1.0e-06
	metrics_log_interval: -1
	scheduler:
	name: multimodal
	units: steps
	t_warmup: 100
	t_max: null
	alpha_f: 0.1
	connector_t_warmup: 200
	vit_t_warmup: 200
	llm_t_warmup: 200
	frame_selector_t_warmup: 200
	temporal_token_scorer_t_warmup: 200
	grad_clip_warmup_steps: null
	grad_clip_warmup_factor: null
	warmup_min_lr: 0.0
	data:
	dataset: null
	mixture: null
	root_size_mixture: null
	kwargs_mixture:
	- rate: 0.2
	datasets:
	- dataset_name: webolmoSyntheticGround__v0__template
	sampling_rate: null
	root_size_factor: null
	message_weight: null
	override_p_high_res: null
	- dataset_name: webolmoSyntheticGround__v0__gpt
	sampling_rate: null
	root_size_factor: null
	message_weight: null
	override_p_high_res: null
	- dataset_name: pixmo_points_single_web
	sampling_rate: null
	root_size_factor: null
	message_weight: null
	override_p_high_res: null
	- dataset_name: screenshot_qa
	sampling_rate: null
	root_size_factor: null
	message_weight: null
	override_p_high_res: null
	name: null
	- rate: 0.05
	datasets:
	- dataset_name: webolmoSynthetic__train_gemini_3_v0_like_combined_postprocessed_version2__weighted_221__random_gaussian__molmo_web_think__steps_10
	sampling_rate: null
	root_size_factor: null
	message_weight: null
	override_p_high_res: null
	name: null
	- rate: 0.1
	datasets:
	- dataset_name: webolmoSynthetic__gemini_webvoyager_like_19k_feb20_version2__weighted_221__random_gaussian__molmo_web_think__steps_10
	sampling_rate: null
	root_size_factor: null
	message_weight: null
	override_p_high_res: null
	name: null
	- rate: 0.2
	datasets:
	- dataset_name: webolmoSynthetic__gemini_om2w_combined_33k__weighted_221__random_gaussian__molmo_web_think__steps_10
	sampling_rate: null
	root_size_factor: null
	message_weight: null
	override_p_high_res: null
	name: null
	- rate: 0.18
	datasets:
	- dataset_name: webolmoSynthetic__heuristic_filtered_multi_agent_combined_version2__goal__random_gaussian__molmo_web_think__steps_10
	sampling_rate: null
	root_size_factor: null
	message_weight: null
	override_p_high_res: null
	name: null
	- rate: 0.18
	datasets:
	- dataset_name: snorkel_0312_with_gemini_thoughts__weighted_1111__random_gaussian__molmo_web_think__steps_10
	sampling_rate: null
	root_size_factor: null
	message_weight: null
	override_p_high_res: null
	name: null
	- rate: 0.02
	datasets:
	- dataset_name: webolmoSynthetic__atomic_actions_find_and_open_successful__goal__random_gaussian__molmo_web_think__steps_10
	sampling_rate: null
	root_size_factor: null
	message_weight: null
	override_p_high_res: null
	- dataset_name: webolmoSynthetic__atomic_actions_fill_form_successful__goal__random_gaussian__molmo_web_think__steps_10
	sampling_rate: null
	root_size_factor: null
	message_weight: null
	override_p_high_res: null
	name: null
	- rate: 0.05
	datasets:
	- dataset_name: snorkel_0312_STEPS_with_gemini_thoughts__goal__random_gaussian__molmo_web_think__steps_10
	sampling_rate: null
	root_size_factor: null
	message_weight: null
	override_p_high_res: null
	name: null
	- rate: 0.02
	datasets:
	- dataset_name: webolmoSynthetic__node_traversal_successful_ML_scroll_100_720_1280__weighted_221__random_gaussian__molmo_web_think__steps_10
	sampling_rate: null
	root_size_factor: null
	message_weight: null
	override_p_high_res: null
	name: null
	split: train
	seed: 50189
	pad: to_max
	sequence_length: 10240
	max_text_seq_len: null
	shuffle: true
	start_index: 0
	packing: null
	enable_variable_sized_token_pooling: true
	num_workers: 2
	drop_last: true
	pin_memory: true
	prefetch_factor: 4
	persistent_workers: false
	timeout: 0
	restore_dataloader: true
	fast_forward_batches: null
	evaluators: []
	eval_interval: 50000
	inf_evaluators:
	- label: webolmoSynthetic__train_gemini_3_v0_like_combined_postprocessed__HL__random_gaussian__molmo_web_think__steps_10
	data:
	dataset: webolmoSynthetic__train_gemini_3_v0_like_combined_postprocessed__HL__random_gaussian__molmo_web_think__steps_10
	mixture: null
	root_size_mixture: null
	kwargs_mixture: null
	split: val
	seed: 691203
	pad: to_max
	sequence_length: 10240
	max_text_seq_len: null
	shuffle: true
	start_index: 0
	packing: null
	enable_variable_sized_token_pooling: true
	num_workers: 2
	drop_last: true
	pin_memory: true
	prefetch_factor: 4
	persistent_workers: true
	timeout: 0
	evaluator:
	n_to_log: 0
	num_wandb_examples: 32
	save_predictions: null
	save_tokens: false
	vqa_eval: ''
	pointing_eval: false
	point_bench_eval: false
	count_eval: false
	point_count_eval: false
	android_eval: false
	clock_eval: false
	clock_bench_eval: false
	math_vista_eval: false
	temp_compass_eval: ''
	temp_compass_disable_api: false
	video_mme_eval: ''
	mme_videoocr_eval: false
	mlvu_gen_eval: false
	lvbench_eval: false
	long_video_bench_eval: false
	plm_fgqa_eval: false
	video_hallucer: false
	long_video_bench_caption_eval: false
	vinoground_eval: false
	vixmo_caption_eval: false
	vixmo_caption_eval2: false
	dream1k_caption_eval: false
	vixmo_point_count_eval: false
	vixmo_point_eval: false
	video_object_tracking_eval: ''
	video_single_point_prediction: ''
	video_point_tracking_eval: ''
	refexp_eval: false
	coco_caption_eval: false
	qv_highlights_eval: false
	tomato: false
	temporal_bench: false
	open_qa_eval: false
	mmiu_eval: false
	mulset_eval: false
	ego3d_bench_eval: false
	vsi_bench_eval: false
	uground_eval: false
	web_ground_eval: false
	web_trajs_eval: true
	screenshot_qa_eval: false
	websrc_eval: false
	max_new_tokens: 1024
	device_batch_size: 2
	sampling:
	temperature: 0.0
	top_p: 1.0
	top_k: null
	ngram_size: null
	repetition_penalty: null
	frequency_penalty: null
	subset_num_batches: null
	max_examples: 512
	console_log_interval: 20
	include_image: false
	inf_eval_interval: 50000
	eval_on_last_step: true
	eval_on_load: false
	eval_on: []
	save_folder: /weka/oe-training-default/webolmo/zixianm/checkpoints/train_hero_4b_03-14-06-17
	checkpointer_config:
	save_thread_count: null
	load_thread_count: null
	pre_download: false
	work_dir: null
	throttle_uploads: false
	canceled_check_interval: 50
	save_interval: 1000
	save_at: null
	save_final_optim: false
	save_num_checkpoints_to_keep: 31
	save_final_unsharded_checkpoint: false
	save_interval_ephemeral: null
	save_overwrite: true
	load_path: null
	reset_optimizer_state: false
	reset_trainer_state: false
	initial_model_checkpoint: /weka/oe-training-default/sanghol/molmo/models/uber-v1/uber3.4-synthetic-siglip2-qwen3_4b/step30000
	allow_resume: true
	max_duration: 50000
	global_train_batch_size: 128
	device_train_microbatch_size: 4
	max_grad_norm: 1.0
	multi_component_grad_norm: true
	batch_divisor: global_batch
	max_grad_norm_ratio: null
	precision: amp_bf16
	wandb:
	project: zixianm_webolmo
	entity: prior-ai2
	group: null
	name: train_hero_4b_03-14-06-17
	tags:
	- watching
	log_artifacts: false
	rank_zero_only: true
	log_interval: 20
	allow_resume: true
	finish_on_sigterm: true
	beaker_log_interval: 50
	speed_monitor:
	window_size: 20
	gpu_flops_available: null
	console_log_interval: 20
	enable_timing_logs: false
	gen1_gc_interval: 1
	compile: null
	activation_checkpointing: true
	fsdp:
	fsdp2: true
	precision: float
	use_orig_params: true
	wrapping_strategy: by_block_and_size
	sharding_strategy: FULL_SHARD
	hybrid_sharding_num_model_replicas: null
	softmax_auxiliary_loss: true
	softmax_auxiliary_loss_scale: 0.0001
	response_logits_only: true
	saliency_score_loss_wt: null
	frame_score_loss_wt: null
	frame_score_loss_type: mse
	frame_score_loss_target: 0.7
	time_limit: null
	extra_steps_after_cancel: 0
	python_profiling: false
	torch_profiling: false
	stop_at: 50000
	stop_after: null
	fused_loss: false
	compile_loss: true
	runtime_data:
	args: /gantry-runtime/launch_scripts/train_multitask_model.py hero /weka/oe-training-default/sanghol/molmo/models/uber-v1/uber3.4-synthetic-siglip2-qwen3_4b/step30000
	--save_folder=/weka/oe-training-default/webolmo/zixianm/checkpoints/train_hero_4b_03-14-06-17
	--run_name=train_hero_4b_03-14-06-17 --global_batch_size 128 --device_train_batch_size
	2 --device_eval_batch_size 2 --device_inf_batch_size 2 --duration 50000 --action_token_weight
	1 --num_checkpoints_to_keep 31 --max_crops 8 --seq_len 10240 --save_interval 1000
	--eval_interval 50000 --inf_eval_interval 50000 --save_overwrite
	hostname: jupiter-cs-aus-115.reviz.ai2.in
	date: 03/16/2026, 22:07
	world_size: 64
	resuming_from: /weka/oe-training-default/webolmo/zixianm/checkpoints/train_hero_4b_03-14-06-17/step32000
	beaker_experiment_id: 01KKNFX4X43ASSGM27F63TFDN2
	beaker_experiment_url: null
	wandb_id: jym0efjn
	wandb_url: https://wandb.ai/prior-ai2/zixianm_webolmo/runs/jym0efjn
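
Several values in the config above can be cross-checked with simple arithmetic. The sketch below copies the numbers from the `llm`, `vit`, and `kwargs_mixture` entries and derives the attention widths, the ViT patch grid, and the sum of the mixture sampling rates. The KV-cache estimate assumes a 2-byte (bf16) cache element, which is an assumption for illustration and not something the config states for inference.

```python
import math

# Values copied from the config above.
llm = {"d_model": 2560, "n_heads": 32, "n_kv_heads": 8, "head_dim": 128,
       "n_layers": 36, "max_sequence_length": 10240}
vit = {"input_size": 378, "patch_size": 14, "emb_dim": 1152,
       "num_heads": 16, "head_dim": 72, "num_pos": 729}
mixture_rates = [0.2, 0.05, 0.1, 0.2, 0.18, 0.18, 0.02, 0.05, 0.02]

# Grouped-query attention: the query projection is wider than d_model,
# and each key/value head is shared by several query heads.
q_width = llm["n_heads"] * llm["head_dim"]
kv_width = llm["n_kv_heads"] * llm["head_dim"]
queries_per_kv_head = llm["n_heads"] // llm["n_kv_heads"]

# ViT patch grid: a 378-pixel side with 14-pixel patches.
patches_per_side = vit["input_size"] // vit["patch_size"]
num_patches = patches_per_side ** 2

# KV cache size per token: keys and values for every layer.
BYTES_PER_ELEMENT = 2  # assumption: bf16 cache
kv_bytes_per_token = 2 * llm["n_layers"] * kv_width * BYTES_PER_ELEMENT
kv_gib_full_context = kv_bytes_per_token * llm["max_sequence_length"] / 2**30

print(f"query width: {q_width}, key/value width: {kv_width}")
print(f"query heads per KV head: {queries_per_kv_head}")
print(f"ViT patches per crop: {num_patches}")
print(f"KV cache per token: {kv_bytes_per_token} bytes")
print(f"KV cache at full context: {kv_gib_full_context:.2f} GiB")
print(f"mixture rates sum to 1: {math.isclose(sum(mixture_rates), 1.0)}")
```

The checks agree with the config: the patch count matches `image_num_pos`, the ViT head size matches `image_emb_dim / image_num_heads`, and the nine mixture rates sum to 1.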