---
license: mit
base_model:
- Qwen/Qwen3-4B
pipeline_tag: text-classification
---

# OpenHands Critic 4B v1.0

A 4B parameter critic model for evaluating AI agent trajectories, trained to predict behavioral rubrics and task success.

- Docs (Use it in OpenHands Software Agent SDK): https://docs.openhands.dev/sdk/guides/critic
- Docs (Use it in OpenHands CLI): https://docs.openhands.dev/openhands/usage/cli/critic

## Model Details

- **Base Model**: Qwen/Qwen3-4B
- **Training**: Full-parameter fine-tuning with BCE loss
- **Context Length**: Trained on 64K, supports up to 256K tokens
- **Task**: Multi-label classification (26 labels: 25 rubric features + 1 success prediction)

## Serving (vLLM Classification API)

We serve this model using vLLM’s classification task:

```bash
vllm serve \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key \
  --served-model-name \
  --task classify \
  --max-model-len 262144 \
  --dtype bfloat16 \
  --trust-remote-code \
  --enable-prefix-caching
```

## Usage

We recommend using the **OpenHands SDK** for inference instead of calling the vLLM classification endpoint directly.

Follow the SDK guide: https://docs.openhands.dev/sdk/guides/critic

In particular, reuse the SDK client implementation here (it already handles formatting and API calls):
https://github.com/OpenHands/software-agent-sdk/blob/main/openhands-sdk/openhands/sdk/critic/impl/api/critic.py

At a high level, you will:

1. Start a critic server (see **Serving** section above)
2. Configure the SDK to point to your critic endpoint + API key
3. Call the SDK critic to score trajectories (returns rubric probabilities + success score)

## Citation

```bibtex
@misc{wang2026rubricsupervisedcriticsparserealworld,
  title={A Rubric-Supervised Critic from Sparse Real-World Outcomes},
  author={Xingyao Wang and Valerie Chen and Heng Ji and Graham Neubig},
  year={2026},
  eprint={2603.03800},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2603.03800},
}
```
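
## Appendix: Interpreting the 26 Outputs (Illustrative Sketch)

For intuition about the multi-label setup listed under Model Details, here is a minimal sketch of how 26 raw scores could be turned into independent per-label probabilities and split into rubric and success parts. It is not the SDK's implementation. The function name `split_critic_scores`, the `success_index` parameter, and the assumption that the success label comes last are all hypothetical. The SDK client already handles the real label ordering, and the server may return probabilities rather than logits.

```python
import math
from typing import Dict, List, Union

NUM_RUBRICS = 25              # from the model card: 25 rubric features
NUM_LABELS = NUM_RUBRICS + 1  # plus 1 success prediction


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def split_critic_scores(
    logits: List[float],
    success_index: int = NUM_LABELS - 1,  # ASSUMPTION: success label is last
) -> Dict[str, Union[List[float], float]]:
    """Map 26 raw scores to per-label probabilities.

    Each label gets its own sigmoid, which is the usual pairing with a
    BCE training objective, so the probabilities need not sum to 1.
    """
    if len(logits) != NUM_LABELS:
        raise ValueError(f"expected {NUM_LABELS} scores, got {len(logits)}")
    probs = [sigmoid(v) for v in logits]
    rubric_probs = [p for i, p in enumerate(probs) if i != success_index]
    return {"rubric_probs": rubric_probs, "success_prob": probs[success_index]}


if __name__ == "__main__":
    # Hypothetical scores: neutral rubric logits and a positive success logit.
    scores = split_critic_scores([0.0] * NUM_RUBRICS + [2.0])
    print(len(scores["rubric_probs"]), round(scores["success_prob"], 3))
```

Because each label is scored independently, a trajectory can have several high rubric probabilities at once. A softmax would instead force the labels to compete, which does not match a multi-label task.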