
	---
	language:
	- en
	license: apache-2.0
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- web-agent
	- process-reward-model
	- preference
	- reward-model
	- web-navigation
	- reasoning
	- grpo
	base_model: Qwen/Qwen2.5-3B-Instruct
	datasets:
	- ZYao720/WebArbiter-Data
	model-index:
	- name: WebArbiter-3B
	  results:
	  - task:
	      type: text-generation
	      name: Web Process Reward Modeling
	    dataset:
	      name: WebPRMBench
	      type: ZYao720/WEBPRMBENCH
	    metrics:
	    - name: Avg Pairwise Accuracy
	      type: accuracy
	      value: 83.65
	    - name: Avg BoN Accuracy
	      type: accuracy
	      value: 59.06
	---

	<div align="center">

	# WebArbiter-3B

	A principle-guided reasoning Process Reward Model for web agents

	Published at ICLR 2026

	[Paper](https://arxiv.org/abs/2601.21872) | [Code](https://github.com/YaoZhang720/WebArbiter) | [Website](https://yaozhang.ai/WebArbiter/) | [Collection](https://huggingface.co/collections/ZYao720/ZYao720-69cd5263871b22e11d90f80f) | [Demo](https://yaozhang.ai/WebArbiter/demo.html)

	</div>

	## Introduction

	WebArbiter-3B is a 3B reasoning Process Reward Model (PRM) for web agents, built on [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct). Unlike scalar or checklist-based reward models, WebArbiter formulates step-level reward modeling as structured text generation — producing interpretable, principle-inducing justifications that conclude with a preference verdict identifying the action most conducive to task completion.

	Despite its compact size, WebArbiter-3B achieves an Avg. BoN Acc of 59.06% on [WEBPRMBENCH](https://huggingface.co/datasets/ZYao720/WEBPRMBENCH), outperforming the previous SOTA WebPRM (WebShepherd-3B, 43.60%) by 15.5 points and surpassing all open-source LLM-as-judge baselines up to 70B parameters. For a larger and stronger variant, see [WebArbiter-7B](https://huggingface.co/ZYao720/WebArbiter-7B).

	## Highlights

	- Reasoning as reward: Generates structured `<State>`, `<Criteria>`, `<Analysis>`, and `<Answer>` outputs with auditable reasoning chains, instead of scalar scores or brittle checklists.
	- Principle-inducing evaluation: Dynamically derives evaluation principles from user intent and page state, enabling robust assessment that generalizes across environments.
	- Two-stage training: Reasoning distillation from o3 (SFT) followed by RL with Verifiable Rewards (GRPO) to correct teacher biases and align verdicts with ground-truth correctness.
	- Efficient and deployable: Strong performance at 3B parameters, suitable for resource-constrained deployment scenarios.

	## Results on WebPRMBench

	Models marked with ⋆ are ours. Bold = best at comparable scale.

	| Model | Mind2Web | | WebArena | | AssistantBench | | WorkArena | | Avg. | |
	|-------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
	| | Pair | BoN | Pair | BoN | Pair | BoN | Pair | BoN | Pair | BoN |
	| Proprietary LLM-as-judge | | | | | | | | | | |
	| GPT-4o-mini | 81.74 | 50.92 | 78.23 | 56.72 | 89.17 | 73.33 | 81.43 | 46.70 | 82.64 | 56.92 |
	| GPT-4o | 79.99 | 52.62 | 84.58 | 66.67 | 85.83 | 66.67 | 84.33 | 55.19 | 83.68 | 60.29 |
	| GPT-5 | 80.86 | 62.39 | 84.83 | 71.64 | 81.67 | 63.33 | 81.14 | 64.62 | 82.13 | 65.50 |
	| Open-source LLM-as-judge | | | | | | | | | | |
	| Qwen2.5-3B-Instruct | 76.46 | 36.93 | 60.32 | 15.42 | 75.83 | 33.33 | 64.45 | 19.34 | 69.27 | 26.76 |
	| Qwen2.5-7B-Instruct | 77.79 | 39.18 | 74.88 | 42.79 | 84.17 | 53.33 | 77.58 | 35.85 | 77.61 | 42.78 |
	| Llama-3-70B-Instruct | 80.55 | 49.36 | 77.36 | 50.75 | 85.83 | 70.00 | 79.08 | 40.09 | 80.71 | 52.55 |
	| WebPRMs (3B) | | | | | | | | | | |
	| WebShepherd-3B | 87.50 | 65.21 | 68.16 | 41.29 | 66.67 | 46.67 | 50.00 | 21.23 | 68.08 | 43.60 |
	| ⋆ WebArbiter-3B | 93.32 | 78.42 | 81.97 | 56.22 | 78.33 | 46.67 | 81.01 | 54.81 | 83.65 | 59.06 |
	| WebPRMs (7B+) | | | | | | | | | | |
	| WebShepherd-8B | 86.66 | 73.69 | 68.33 | 43.88 | 55.92 | 30.00 | 54.56 | 25.53 | 64.34 | 43.28 |
	| ⋆ WebArbiter-7B | 97.07 | 89.53 | 88.43 | 68.66 | 89.17 | 70.00 | 82.09 | 70.19 | 89.19 | 74.60 |

	WebArbiter-3B outperforms the larger WebShepherd-8B on Avg. BoN Acc (59.06 vs. 43.28), which points to the efficiency of the principle-guided reasoning approach.
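
The gap between the Pair and BoN columns can be illustrated with a small sketch. It assumes (this card does not spell out the metric definitions) that each benchmark instance pairs one correct action with several rejected ones, that Pair counts every correctly judged pair, and that BoN counts an instance only when the correct action wins all of its pairs. The function names and the toy `results` data are hypothetical.

```python
from typing import List

# Each inner list holds one instance: True means the judge preferred the
# correct action over that rejected alternative (toy data, not benchmark data).
def pairwise_accuracy(results: List[List[bool]]) -> float:
    """Percentage of individual (correct, rejected) pairs judged correctly."""
    pairs = [won for instance in results for won in instance]
    return 100.0 * sum(pairs) / len(pairs)

def bon_accuracy(results: List[List[bool]]) -> float:
    """Percentage of instances where the correct action wins every pair."""
    return 100.0 * sum(all(instance) for instance in results) / len(results)

results = [
    [True, True, True, True],
    [True, False, True, True],
    [True, True, True, False],
    [True, True, True, True],
]
print(pairwise_accuracy(results))  # 87.5
print(bon_accuracy(results))       # 50.0
```

Under these assumptions a single wrong pair fails the whole instance, which is why BoN scores sit well below Pair scores for every model in the table.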

	## Quick Start

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_name = "ZYao720/WebArbiter-3B"

	tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	    trust_remote_code=True,
	)

	# Construct your prompt following the WebPRMBench format.
	# See https://huggingface.co/datasets/ZYao720/WEBPRMBENCH for examples.
	user_prompt = "..."  # evaluation prompt with intent, AXTree, trajectory, two responses

	messages = [{"role": "user", "content": user_prompt}]
	input_ids = tokenizer.apply_chat_template(
	    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
	).to(model.device)

	with torch.no_grad():
	    output = model.generate(input_ids=input_ids, max_new_tokens=2048, do_sample=False)

	response = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
	print(response)
	```

	Example output:
	```xml
	<State>The user is on the DuckDuckGo homepage with a search box visible.
	Relevant AXTree elements: [1] textbox 'Search', [2] button 'Search'.</State>
	<Criteria>1. Goal alignment (weight 0.6) — Does the action advance the search task?
	2. Element reference accuracy (weight 0.25) — Is the referenced element correct?
	3. Efficiency (weight 0.15) — Does the action avoid unnecessary steps?</Criteria>
	<Analysis>Response 1 directly fills the search query into the textbox, which is the
	most direct path to completing the search task. Response 2 clicks an irrelevant link
	that does not contribute to the search goal.</Analysis>
	<Answer>Response 1</Answer>
	```
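
To use the verdict programmatically, the tagged sections can be pulled out with regular expressions. This is a minimal sketch that assumes the output follows the four-tag layout shown above; `parse_verdict` and `preferred_index` are hypothetical helper names, not part of the released code.

```python
import re
from typing import Dict, Optional

SECTION_TAGS = ("State", "Criteria", "Analysis", "Answer")

def parse_verdict(response: str) -> Dict[str, Optional[str]]:
    """Split a response into its tagged sections (None for a missing tag)."""
    sections: Dict[str, Optional[str]] = {}
    for tag in SECTION_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
        sections[tag] = match.group(1).strip() if match else None
    return sections

def preferred_index(response: str) -> Optional[int]:
    """Return 1 or 2 for 'Response 1' / 'Response 2', or None if unparseable."""
    answer = parse_verdict(response)["Answer"]
    if answer is None:
        return None
    match = re.fullmatch(r"Response\s*([12])", answer, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None

example = (
    "<State>The user is on the DuckDuckGo homepage.</State>\n"
    "<Criteria>1. Goal alignment</Criteria>\n"
    "<Analysis>Response 1 fills the search box.</Analysis>\n"
    "<Answer>Response 1</Answer>"
)
print(preferred_index(example))  # 1
```

Returning `None` for malformed output lets the caller decide whether to retry, skip the step, or fall back to a default choice.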

	## Training Details

	| | Stage 1: Reasoning Distillation | Stage 2: RLVR |
	|---|---|---|
	| Method | Supervised fine-tuning (SFT) | GRPO with binary verifiable rewards |
	| Data | 9,642 teacher-distilled examples | 18,921 preference pairs |
	| Teacher | o3 | n/a |
	| Base Model | [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) | Stage 1 checkpoint |
	| Fine-tuning | LoRA (rank 128, lr 8e-4) | FSDP + LoRA (lr 9e-6) |
	| Framework | [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) | [veRL](https://github.com/volcengine/verl) |
	| Hardware | 8 × NVIDIA A100-80GB | 8 × NVIDIA A100-80GB |
	| Source Data | [WebPRM Collection](https://huggingface.co/datasets/LangAGI-Lab/WebPRMCollection_preference_pair) (~30k step-level preference pairs from Mind2Web) |
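
As a rough illustration of what "binary verifiable rewards" could look like in Stage 2, the sketch below scores a sampled response 1.0 when its `<Answer>` matches the ground-truth preference and 0.0 otherwise, then normalizes rewards within a sampled group in the GRPO style. It is a sketch under stated assumptions, not the training code: the function names, the strict answer pattern, and the use of the population standard deviation are all assumptions.

```python
import re
from statistics import mean, pstdev
from typing import List

def binary_reward(response: str, gold: int) -> float:
    """1.0 if the <Answer> names the gold response (1 or 2), else 0.0."""
    match = re.search(r"<Answer>\s*Response\s*([12])\s*</Answer>", response)
    return 1.0 if match and int(match.group(1)) == gold else 0.0

def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Center and scale rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]

samples = [
    "<Analysis>...</Analysis><Answer>Response 1</Answer>",
    "<Analysis>...</Analysis><Answer>Response 2</Answer>",
    "<Analysis>...</Analysis><Answer>Response 1</Answer>",
    "malformed output",
]
rewards = [binary_reward(s, gold=1) for s in samples]
print(rewards)  # [1.0, 0.0, 1.0, 0.0]
print(group_advantages(rewards))
```

Because malformed outputs score 0.0, a reward of this shape also penalizes responses that break the expected format.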

	## Intended Uses

	WebArbiter-3B is designed to:
	- Evaluate web agent actions: Given a web state and two candidate actions, determine which better advances the user's task.
	- Guide trajectory search: Serve as a reward signal for Best-of-N sampling or tree search during web agent execution.
	- Provide interpretable feedback: Generate structured justifications explaining why one action is preferred, useful for debugging and analysis.
	- Support resource-efficient deployment: Fit scenarios where 7B+ models are too large, while still outperforming the larger checklist-based WebShepherd-8B on WebPRMBench.
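
One way a pairwise judge can drive Best-of-N selection is a sequential knockout, which needs N-1 comparisons for N candidates. The sketch below is illustrative only: `best_of_n` and the `Judge` signature are hypothetical, and `toy_judge` stands in for a real call that would build the evaluation prompt, run the model, and parse its `<Answer>` tag.

```python
from typing import Callable, Sequence

# judge(state, a, b) returns 1 if `a` is preferred, 2 if `b` is preferred.
Judge = Callable[[str, str, str], int]

def best_of_n(state: str, candidates: Sequence[str], judge: Judge) -> str:
    """Pick one candidate with N-1 pairwise comparisons (sequential knockout)."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    best = candidates[0]
    for challenger in candidates[1:]:
        if judge(state, best, challenger) == 2:
            best = challenger
    return best

def toy_judge(state: str, a: str, b: str) -> int:
    """Stand-in for the model: prefers actions that fill the search box."""
    def score(action: str) -> int:
        return int(action.startswith("fill"))
    return 1 if score(a) >= score(b) else 2

actions = ["click('7')", "hover('2')", "fill('1', 'weather')", "scroll(0, 200)"]
print(best_of_n("search page", actions, toy_judge))  # fill('1', 'weather')
```

A knockout keeps the number of model calls linear in N; a full round-robin would be more robust to a single wrong verdict at a quadratic cost.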

	## Limitations

	- Text-only observations: WebArbiter relies on accessibility tree representations without visual observations. In environments where layout, spatial arrangement, or visual cues carry task-relevant information, this text-only formulation may miss critical signals.
	- English-only: Training and evaluation are conducted exclusively in English-language web environments.
	- Safe-action bias: The model may sometimes overvalue cautious actions (e.g., preferring `hover` to `click`) because the accessibility tree does not encode interaction effects.
	- Element reference hallucination: When a candidate action's reasoning is strongly task-aligned, the model may trust the semantic signal over low-level bid verification, potentially missing incorrect element references.
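
The element-reference limitation can be partly mitigated outside the model by checking each candidate action against the accessibility tree before judging it. The sketch below assumes AXTree entries are written as `[id] role 'name'` (as in the example output above) and that actions use a `name('id', ...)` call syntax; both the action format and the helper names are assumptions, not something this card specifies.

```python
import re
from typing import Optional, Set

def axtree_bids(axtree: str) -> Set[str]:
    """Collect every element id that appears as [id] in the AXTree text."""
    return set(re.findall(r"\[(\w+)\]", axtree))

def referenced_bid(action: str) -> Optional[str]:
    """Extract the first quoted argument of an action such as click('12')."""
    match = re.match(r"\s*\w+\(\s*['\"](\w+)['\"]", action)
    return match.group(1) if match else None

def references_known_element(action: str, axtree: str) -> bool:
    """False when the action names an element id absent from the AXTree."""
    bid = referenced_bid(action)
    return bid is None or bid in axtree_bids(axtree)

axtree = "[1] textbox 'Search'\n[2] button 'Search'"
print(references_known_element("fill('1', 'weather')", axtree))  # True
print(references_known_element("click('9')", axtree))            # False
```

Actions without an element argument, such as a scroll, pass the check unchanged, so the filter only rejects references that cannot be resolved.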

	## License

	This model is released under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0), following the base model [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct).

	## Related Resources

	| Resource | Link |
	|----------|------|
	| WebArbiter-8B-Qwen3 (strongest) | [ZYao720/WebArbiter-8B-Qwen3](https://huggingface.co/ZYao720/WebArbiter-8B-Qwen3) |
	| WebArbiter-7B | [ZYao720/WebArbiter-7B](https://huggingface.co/ZYao720/WebArbiter-7B) |
	| WebArbiter-4B-Qwen3 | [ZYao720/WebArbiter-4B-Qwen3](https://huggingface.co/ZYao720/WebArbiter-4B-Qwen3) |
	| WEBPRMBENCH (benchmark) | [ZYao720/WEBPRMBENCH](https://huggingface.co/datasets/ZYao720/WEBPRMBENCH) |
	| Training Data | [ZYao720/WebArbiter-Data](https://huggingface.co/datasets/ZYao720/WebArbiter-Data) |
	| Search Trajectories | [ZYao720/WebArbiter-Trajectories](https://huggingface.co/datasets/ZYao720/WebArbiter-Trajectories) |

	## Citation

	```bibtex
	@misc{zhang2026ZYao720principleguidedreasoningprocess,
	  title={WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents},
	  author={Yao Zhang and Shijie Tang and Zeyu Li and Zhen Han and Volker Tresp},
	  year={2026},
	  eprint={2601.21872},
	  archivePrefix={arXiv},
	  primaryClass={cs.AI},
	  url={https://arxiv.org/abs/2601.21872},
	}
	```