---
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- web-agent
- process-reward-model
- preference
- reward-model
- web-navigation
- reasoning
- grpo
base_model: Qwen/Qwen2.5-3B-Instruct
datasets:
- ZYao720/WebArbiter-Data
model-index:
- name: WebArbiter-3B
  results:
  - task:
      type: text-generation
      name: Web Process Reward Modeling
    dataset:
      name: WebPRMBench
      type: ZYao720/WEBPRMBENCH
    metrics:
    - name: Avg Pairwise Accuracy
      type: accuracy
      value: 83.65
    - name: Avg BoN Accuracy
      type: accuracy
      value: 59.06
---
<div align="center">

# WebArbiter-3B

**A principle-guided reasoning Process Reward Model for web agents**

**Published at ICLR 2026**

[Paper](https://arxiv.org/abs/2601.21872) | [Code](https://github.com/YaoZhang720/WebArbiter) | [Website](https://yaozhang.ai/WebArbiter/) | [Collection](https://huggingface.co/collections/ZYao720/ZYao720-69cd5263871b22e11d90f80f) | [Demo](https://yaozhang.ai/WebArbiter/demo.html)

</div>
## Introduction

**WebArbiter-3B** is a 3B-parameter reasoning Process Reward Model (PRM) for web agents, built on [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct). Unlike scalar or checklist-based reward models, WebArbiter formulates step-level reward modeling as structured text generation: it produces interpretable, principle-inducing justifications that conclude with a preference verdict identifying the action most conducive to task completion.

Despite its compact size, WebArbiter-3B achieves an **Avg. BoN Acc of 59.06%** on [WEBPRMBENCH](https://huggingface.co/datasets/ZYao720/WEBPRMBENCH), outperforming the previous SOTA WebPRM (WebShepherd-3B, 43.60%) by **15.5 points** and surpassing all open-source LLM-as-judge baselines up to 70B parameters. For a stronger variant, see [WebArbiter-7B](https://huggingface.co/ZYao720/WebArbiter-7B).
## Highlights

- **Reasoning as reward**: Generates structured `<State>`, `<Criteria>`, `<Analysis>`, and `<Answer>` outputs with auditable reasoning chains, instead of scalar scores or brittle checklists.
- **Principle-inducing evaluation**: Dynamically derives evaluation principles from user intent and page state, enabling robust assessment that generalizes across environments.
- **Two-stage training**: Reasoning distillation from o3 (SFT) followed by RL with Verifiable Rewards (GRPO) to correct teacher biases and align verdicts with ground-truth correctness.
- **Efficient and deployable**: Strong performance at 3B parameters, suitable for resource-constrained deployment scenarios.
## Results on WebPRMBench

Models marked with ⋆ are ours. **Bold** = best at comparable scale. Pair = pairwise accuracy, BoN = Best-of-N accuracy; all values are percentages.

| Model | Mind2Web | | WebArena | | AssistantBench | | WorkArena | | Avg. | |
|-------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| | Pair | BoN | Pair | BoN | Pair | BoN | Pair | BoN | Pair | BoN |
| *Proprietary LLM-as-judge* | | | | | | | | | | |
| GPT-4o-mini | 81.74 | 50.92 | 78.23 | 56.72 | 89.17 | 73.33 | 81.43 | 46.70 | 82.64 | 56.92 |
| GPT-4o | 79.99 | 52.62 | 84.58 | 66.67 | 85.83 | 66.67 | 84.33 | 55.19 | 83.68 | 60.29 |
| GPT-5 | 80.86 | 62.39 | 84.83 | 71.64 | 81.67 | 63.33 | 81.14 | 64.62 | 82.13 | 65.50 |
| *Open-source LLM-as-judge* | | | | | | | | | | |
| Qwen2.5-3B-Instruct | 76.46 | 36.93 | 60.32 | 15.42 | 75.83 | 33.33 | 64.45 | 19.34 | 69.27 | 26.76 |
| Qwen2.5-7B-Instruct | 77.79 | 39.18 | 74.88 | 42.79 | 84.17 | 53.33 | 77.58 | 35.85 | 77.61 | 42.78 |
| Llama-3-70B-Instruct | 80.55 | 49.36 | 77.36 | 50.75 | 85.83 | 70.00 | 79.08 | 40.09 | 80.71 | 52.55 |
| *WebPRMs (3B)* | | | | | | | | | | |
| WebShepherd-3B | 87.50 | 65.21 | 68.16 | 41.29 | 66.67 | 46.67 | 50.00 | 21.23 | 68.08 | 43.60 |
| ⋆ **WebArbiter-3B** | **93.32** | **78.42** | **81.97** | **56.22** | **78.33** | 46.67 | **81.01** | **54.81** | **83.65** | **59.06** |
| *WebPRMs (7B+)* | | | | | | | | | | |
| WebShepherd-8B | 86.66 | 73.69 | 68.33 | 43.88 | 55.92 | 30.00 | 54.56 | 25.53 | 64.34 | 43.28 |
| ⋆ WebArbiter-7B | 97.07 | 89.53 | 88.43 | 68.66 | 89.17 | 70.00 | 82.09 | 70.19 | 89.19 | 74.60 |

WebArbiter-3B also outperforms the larger WebShepherd-8B on Avg. BoN Acc (59.06 vs. 43.28), demonstrating the efficiency of the principle-guided reasoning approach.
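The gap between the Pair and BoN columns is easier to read with the two metrics written out. The sketch below assumes that each benchmark instance compares one chosen action against several rejected ones, that Pair accuracy averages over individual comparisons, and that BoN accuracy credits an instance only when the chosen action wins every comparison. This is one illustrative reading of the metrics, not the benchmark's evaluation script; `pairwise_accuracy`, `bon_accuracy`, and the toy data are hypothetical.

```python
from typing import List

# Each inner list holds one instance's comparison outcomes:
# True means the reward model preferred the chosen action over a rejected one.
Outcomes = List[List[bool]]


def pairwise_accuracy(outcomes: Outcomes) -> float:
    """Fraction of individual chosen-vs-rejected comparisons judged correctly."""
    flat = [ok for instance in outcomes for ok in instance]
    return sum(flat) / len(flat)


def bon_accuracy(outcomes: Outcomes) -> float:
    """Fraction of instances where the chosen action won every comparison
    (assumed definition, see the note above)."""
    return sum(all(instance) for instance in outcomes) / len(outcomes)


# Toy data: 4 instances, 4 comparisons each (made up for illustration).
toy = [
    [True, True, True, True],
    [True, False, True, True],
    [True, True, True, False],
    [True, True, True, True],
]
print(pairwise_accuracy(toy))  # 14 of 16 comparisons correct
print(bon_accuracy(toy))       # 2 of 4 instances fully correct
```

Under this reading a single wrong comparison fails the whole instance, which is why BoN scores sit well below Pair scores for every model in the table.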
## Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZYao720/WebArbiter-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Construct your prompt following the WebPRMBench format.
# See https://huggingface.co/datasets/ZYao720/WEBPRMBENCH for examples.
user_prompt = "..."  # evaluation prompt with intent, AXTree, trajectory, two responses
messages = [{"role": "user", "content": user_prompt}]

# Apply the chat template and move the token IDs to the model's device.
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
).to(model.device)

# Greedy decoding; gradients are not needed for inference.
with torch.no_grad():
    output = model.generate(input_ids=input_ids, max_new_tokens=2048, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
print(response)
```
**Example output:**

```xml
<State>The user is on the DuckDuckGo homepage with a search box visible.
Relevant AXTree elements: [1] textbox 'Search', [2] button 'Search'.</State>
<Criteria>1. Goal alignment (weight 0.6) — Does the action advance the search task?
2. Element reference accuracy (weight 0.25) — Is the referenced element correct?
3. Efficiency (weight 0.15) — Does the action avoid unnecessary steps?</Criteria>
<Analysis>Response 1 directly fills the search query into the textbox, which is the
most direct path to completing the search task. Response 2 clicks an irrelevant link
that does not contribute to the search goal.</Analysis>
<Answer>Response 1</Answer>
```
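To use the verdict programmatically, the tagged sections need to be pulled out of the generated text. The following is a minimal sketch that assumes the output follows the `<State>`, `<Criteria>`, `<Analysis>`, `<Answer>` layout shown above and that the answer reads `Response 1` or `Response 2`. The helper names `parse_judgment` and `preferred_response` are hypothetical and not part of the released code.

```python
import re
from typing import Dict, Optional

TAGS = ("State", "Criteria", "Analysis", "Answer")


def parse_judgment(text: str) -> Dict[str, Optional[str]]:
    """Return the content of each tagged section, or None when a section is missing."""
    sections: Dict[str, Optional[str]] = {}
    for tag in TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        sections[tag] = match.group(1).strip() if match else None
    return sections


def preferred_response(text: str) -> Optional[int]:
    """Map the <Answer> section to 1 or 2; return None if it cannot be read."""
    answer = parse_judgment(text)["Answer"]
    if answer is None:
        return None
    match = re.fullmatch(r"Response\s*([12])", answer)
    return int(match.group(1)) if match else None


# Shortened stand-in for a model output (illustration only).
sample = (
    "<State>Search box visible.</State>\n"
    "<Criteria>1. Goal alignment</Criteria>\n"
    "<Analysis>Response 1 fills the query.</Analysis>\n"
    "<Answer>Response 1</Answer>"
)
print(preferred_response(sample))
```

Returning `None` for a missing or malformed answer lets the caller decide how to handle outputs that were truncated by `max_new_tokens` instead of silently defaulting to one response.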
## Training Details

| | Stage 1: Reasoning Distillation | Stage 2: RLVR |
|---|---|---|
| Method | Supervised fine-tuning (SFT) | GRPO with binary verifiable rewards |
| Data | 9,642 teacher-distilled examples | 18,921 preference pairs |
| Teacher | o3 | N/A |
| Base Model | [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) | Stage 1 checkpoint |
| Fine-tuning | LoRA (rank 128, lr 8e-4) | FSDP + LoRA (lr 9e-6) |
| Framework | [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) | [veRL](https://github.com/volcengine/verl) |
| Hardware | 8 × NVIDIA A100-80GB | 8 × NVIDIA A100-80GB |

Source data: [WebPRM Collection](https://huggingface.co/datasets/LangAGI-Lab/WebPRMCollection_preference_pair) (~30k step-level preference pairs from Mind2Web).
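As a rough illustration of what "binary verifiable rewards" can look like in Stage 2, the sketch below scores a completion 1.0 when its `<Answer>` verdict matches the ground-truth preferred response and 0.0 otherwise, then normalizes rewards within a sampled group as GRPO does. It is an assumption-laden sketch, not the authors' reward function and not the veRL API: `binary_reward` and `group_advantages` are hypothetical names, malformed outputs are assumed to score 0.0, and the population standard deviation is used where implementations may differ.

```python
import re
from statistics import mean, pstdev
from typing import List


def binary_reward(completion: str, gold: int) -> float:
    """1.0 if the <Answer> verdict equals the gold response index (1 or 2), else 0.0."""
    match = re.search(r"<Answer>\s*Response\s*([12])\s*</Answer>", completion)
    return 1.0 if match and int(match.group(1)) == gold else 0.0


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: each reward standardized against its own group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt whose gold verdict is Response 1.
group = [
    "<Analysis>...</Analysis><Answer>Response 1</Answer>",
    "<Analysis>...</Analysis><Answer>Response 2</Answer>",
    "<Analysis>...</Analysis>",  # no verdict, treated as incorrect
    "<Answer>Response 1</Answer>",
]
rewards = [binary_reward(c, gold=1) for c in group]
print(rewards)
print(group_advantages(rewards))
```

When every completion in a group earns the same reward, the advantages are all zero, so such prompts contribute no learning signal in this formulation.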
## Intended Uses

WebArbiter-3B is designed to:

- **Evaluate web agent actions**: Given a web state and two candidate actions, determine which better advances the user's task.
- **Guide trajectory search**: Serve as a reward signal for Best-of-N sampling or tree search during web agent execution.
- **Provide interpretable feedback**: Generate structured justifications explaining why one action is preferred, useful for debugging and analysis.
- **Run in resource-constrained settings**: Serve scenarios where 7B+ models are too large, while still significantly outperforming larger checklist-based WebPRMs.
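Because the model compares two candidates at a time, Best-of-N selection needs some way to reduce N candidates to a single winner. One simple option, sketched below, is a sequential knockout that keeps the current best and challenges it with each remaining candidate. The `judge` callable is a hypothetical wrapper around the model that returns 1 or 2, and `stub_judge` is a stand-in so the example runs without the model; neither comes from the released code, and the authors' search procedure may differ.

```python
from typing import Callable, Sequence

# judge(a, b) returns 1 if candidate `a` is preferred and 2 if `b` is preferred.
Judge = Callable[[str, str], int]


def best_of_n(candidates: Sequence[str], judge: Judge) -> str:
    """Pick one candidate using N-1 pairwise comparisons (sequential knockout)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best = candidates[0]
    for challenger in candidates[1:]:
        if judge(best, challenger) == 2:
            best = challenger
    return best


def stub_judge(a: str, b: str) -> int:
    """Stand-in judge that prefers actions that fill a textbox (illustration only)."""
    return 1 if a.startswith("fill") or not b.startswith("fill") else 2


actions = ["click('7')", "hover('2')", "fill('1', 'weather')", "scroll(0, 200)"]
print(best_of_n(actions, stub_judge))
```

A knockout costs N-1 model calls per step. If the judge is sensitive to presentation order, each comparison can be run in both orders at roughly twice the cost.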
## Limitations

- **Text-only observations**: WebArbiter relies on accessibility tree representations without visual observations. In environments where layout, spatial arrangement, or visual cues carry task-relevant information, this text-only formulation may miss critical signals.
- **English-only**: Training and evaluation are conducted exclusively in English-language web environments.
- **Safe-action bias**: The model may sometimes overvalue cautious actions (e.g., preferring a hover to a click) because the accessibility tree does not encode interaction effects.
- **Element reference hallucination**: When a candidate action's reasoning is strongly task-aligned, the model may trust the semantic signal over low-level verification of element IDs (bids), potentially missing incorrect element references.
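One way to guard against the element reference limitation is a cheap rule-based check that runs before or alongside the model's verdict. The sketch below assumes the accessibility tree lists elements as `[id] role 'name'` (as in the example output) and that actions are written as calls whose first quoted argument is the target bid, such as `click('2')`. Both formats and all helper names are assumptions for illustration, not part of WebArbiter.

```python
import re
from typing import Optional, Set


def axtree_bids(axtree: str) -> Set[str]:
    """Collect element IDs written as [id] in an accessibility-tree dump."""
    return set(re.findall(r"\[(\w+)\]", axtree))


def action_bid(action: str) -> Optional[str]:
    """Return the first quoted argument of an action string, assumed to be the target bid."""
    match = re.match(r"\s*\w+\(\s*['\"]([^'\"]+)['\"]", action)
    return match.group(1) if match else None


def references_known_element(action: str, axtree: str) -> bool:
    """True only if the action names a bid that appears in the accessibility tree."""
    bid = action_bid(action)
    return bid is not None and bid in axtree_bids(axtree)


axtree = "[1] textbox 'Search'\n[2] button 'Search'"
print(references_known_element("fill('1', 'weather')", axtree))
print(references_known_element("click('9')", axtree))
```

Actions that take no element argument, such as a scroll, return `False` here, so a caller would need to exempt them before treating the result as a rejection signal.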
## License

This model is released under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0), following the base model [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct).
## Related Resources

| Resource | Link |
|----------|------|
| WebArbiter-8B-Qwen3 (strongest) | [ZYao720/WebArbiter-8B-Qwen3](https://huggingface.co/ZYao720/WebArbiter-8B-Qwen3) |
| WebArbiter-7B | [ZYao720/WebArbiter-7B](https://huggingface.co/ZYao720/WebArbiter-7B) |
| WebArbiter-4B-Qwen3 | [ZYao720/WebArbiter-4B-Qwen3](https://huggingface.co/ZYao720/WebArbiter-4B-Qwen3) |
| WEBPRMBENCH (benchmark) | [ZYao720/WEBPRMBENCH](https://huggingface.co/datasets/ZYao720/WEBPRMBENCH) |
| Training Data | [ZYao720/WebArbiter-Data](https://huggingface.co/datasets/ZYao720/WebArbiter-Data) |
| Search Trajectories | [ZYao720/WebArbiter-Trajectories](https://huggingface.co/datasets/ZYao720/WebArbiter-Trajectories) |
## Citation

```bibtex
@misc{zhang2026ZYao720principleguidedreasoningprocess,
  title={WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents},
  author={Yao Zhang and Shijie Tang and Zeyu Li and Zhen Han and Volker Tresp},
  year={2026},
  eprint={2601.21872},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.21872},
}
```