---
license: mit
base_model:
- Qwen/Qwen3-4B
pipeline_tag: text-classification
---
<div align="center">
<img src="https://raw.githubusercontent.com/OpenHands/docs/main/openhands/static/img/logo.png?raw=true" alt="Logo" width="200">
<h1 align="center">OpenHands Critic 4B v1.0</h1>
</div>
<p align="center">
<a href="https://arxiv.org/abs/2603.03800">Paper</a> •
<a href="https://github.com/OpenHands/critic-rubrics">GitHub (definitions & prompts)</a> <br>
</p>
A 4B-parameter critic model for evaluating AI agent trajectories, trained to predict behavioral rubrics and task success.
- Docs (Use it in OpenHands Software Agent SDK): https://docs.openhands.dev/sdk/guides/critic
- Docs (Use it in OpenHands CLI): https://docs.openhands.dev/openhands/usage/cli/critic
## Model Details
- **Base Model**: Qwen/Qwen3-4B
- **Training**: Full-parameter fine-tuning with BCE loss
- **Context Length**: Trained at 64K tokens; supports up to 256K tokens
- **Task**: Multi-label classification (26 labels: 25 rubric features + 1 success prediction)
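To make the objective concrete, here is a minimal sketch of multi-label binary cross-entropy over 26 independent outputs, written in plain Python. The function names and example values are illustrative assumptions, not the training code used for this model.

```python
import math

NUM_LABELS = 26  # 25 rubric features + 1 success prediction (from this model card)


def sigmoid(x: float) -> float:
    """Map a raw logit to an independent probability, avoiding overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def multilabel_bce(logits: list[float], targets: list[float]) -> float:
    """Mean binary cross-entropy; every label is scored independently."""
    if len(logits) != len(targets):
        raise ValueError("logits and targets must have the same length")
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += -(y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps))
    return total / len(logits)


# Illustrative values only: all-zero logits give p = 0.5 for every label.
example_logits = [0.0] * NUM_LABELS
example_targets = [1.0] + [0.0] * (NUM_LABELS - 1)
loss = multilabel_bce(example_logits, example_targets)
```

Because each output passes through its own sigmoid rather than a shared softmax, the 26 probabilities do not need to sum to 1, which is what allows several rubric features to fire on the same trajectory.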
## Serving (vLLM Classification API)
We serve this model using vLLM’s classification task:
```bash
vllm serve <MODEL_PATH> \
--host 0.0.0.0 \
--port 8000 \
--api-key <API_KEY> \
--served-model-name <MODEL_NAME> \
--task classify \
--max-model-len 262144 \
--dtype bfloat16 \
--trust-remote-code \
--enable-prefix-caching
```
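For a quick connectivity check against a server started this way, a raw request could be sketched as follows. This assumes vLLM's `/classify` route, a Bearer token matching `--api-key`, and a response whose `data[0]["probs"]` field holds the per-label probabilities; verify all three against the vLLM version you run. `BASE_URL` and the helper names are hypothetical, and real trajectories should be formatted by the OpenHands SDK rather than by hand.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: matches --host/--port above


def build_classify_request(text: str, model_name: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the assumed vLLM /classify route."""
    payload = {"model": model_name, "input": text}
    return urllib.request.Request(
        url=f"{BASE_URL}/classify",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def extract_probs(response_body: dict) -> list[float]:
    """Pull per-label probabilities out of the assumed response shape."""
    return list(response_body["data"][0]["probs"])


def classify(text: str, model_name: str, api_key: str, timeout: float = 60.0) -> list[float]:
    """Send the request and return the probabilities; needs a running server."""
    request = build_classify_request(text, model_name, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_probs(json.load(response))
```

Keeping request construction separate from the network call makes the payload easy to inspect before anything is sent.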
## Usage
We recommend using the **OpenHands SDK** for inference instead of calling the vLLM classification endpoint directly.
Follow the SDK guide: https://docs.openhands.dev/sdk/guides/critic
In particular, reuse the SDK client implementation here (it already handles formatting and API calls): https://github.com/OpenHands/software-agent-sdk/blob/main/openhands-sdk/openhands/sdk/critic/impl/api/critic.py
At a high level, you will:
1. Start a critic server (see the **Serving** section above)
2. Configure the SDK to point to your critic endpoint and API key
3. Call the SDK critic to score trajectories (returns rubric probabilities and a success score)
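If you end up with the scores as a flat list of 26 probabilities, one way to separate the success prediction from the rubric features is sketched below. `CriticScores`, `split_scores`, `success_index`, and the 0.5 threshold are all hypothetical; the SDK may already return a structured result, and the actual label order comes from the rubric definitions, not from this example.

```python
from dataclasses import dataclass

NUM_LABELS = 26  # 25 rubric features + 1 success prediction (from this model card)


@dataclass(frozen=True)
class CriticScores:
    """Hypothetical container for one scored trajectory."""
    rubric_probs: list[float]
    success_prob: float

    def predicts_success(self, threshold: float = 0.5) -> bool:
        # The threshold is an illustrative default, not a calibrated value.
        return self.success_prob >= threshold


def split_scores(probs: list[float], success_index: int) -> CriticScores:
    """Split a flat probability list into rubric features and the success score.

    success_index is deliberately a required argument: the position of the
    success label is an assumption the caller must supply.
    """
    if len(probs) != NUM_LABELS:
        raise ValueError(f"expected {NUM_LABELS} probabilities, got {len(probs)}")
    if not 0 <= success_index < NUM_LABELS:
        raise ValueError("success_index is out of range")
    rubric = [p for i, p in enumerate(probs) if i != success_index]
    return CriticScores(rubric_probs=rubric, success_prob=probs[success_index])
```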
## Citation
```bibtex
@misc{wang2026rubricsupervisedcriticsparserealworld,
title={A Rubric-Supervised Critic from Sparse Real-World Outcomes},
author={Xingyao Wang and Valerie Chen and Heng Ji and Graham Neubig},
year={2026},
eprint={2603.03800},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2603.03800},
}
```