Stochastic-TBRM Agent Benchmark Checkpoints
This repository contains the final reported model artifacts for the WebShop, ScienceWorld, and ALFWorld experiments:
- `tbrm/`: Stochastic-TBRM reward-model checkpoints. Each directory contains a Skywork reward-model LoRA adapter, tokenizer files, and `tbrm_heads.pt` with the learned Q and xi heads.
- `dmpo-qwen35/`: Qwen3.5-4B LoRA adapters trained as DMPO baselines.
The artifacts are adapter/checkpoint files only; they require the corresponding base models named in the Usage Notes below.
Held-out Results
| Method | Benchmark | Path | Episodes | Avg. reward | Success | Avg. steps |
|---|---|---|---|---|---|---|
| Stochastic-TBRM | WebShop | `tbrm/webshop-no-scorer` | 200 | 0.6823 | 1.0000 | 4.03 |
| Stochastic-TBRM | ScienceWorld | `tbrm/scienceworld` | 211 | 0.3230 | 0.3886 | 25.93 |
| Stochastic-TBRM | ALFWorld | `tbrm/alfworld` | 134 | 0.4701 | 0.4701 | 26.04 |
| DMPO-Qwen3.5-4B | WebShop | `dmpo-qwen35/webshop` | 200 | 0.6490 | 0.9550 | 3.97 |
| DMPO-Qwen3.5-4B | ScienceWorld | `dmpo-qwen35/scienceworld` | 211 | 0.4112 | 0.4218 | 30.66 |
| DMPO-Qwen3.5-4B | ALFWorld | `dmpo-qwen35/alfworld` | 134 | 0.5746 | 0.5746 | 25.37 |
Usage Notes
Stochastic-TBRM checkpoints are not standalone `transformers` causal language models. They should be loaded
with the project critic loader: the `backbone/` LoRA adapter is attached to
`Skywork/Skywork-Reward-V2-Llama-3.1-8B`, and `tbrm_heads.pt` provides the Q and xi output heads over the
macro-action vocabulary. The Qwen DMPO adapters are standard PEFT LoRA adapters for `Qwen/Qwen3.5-4B`.
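As a rough illustration of how the heads in `tbrm_heads.pt` sit on top of the backbone, the sketch below applies linear Q and xi heads to a final hidden state. The head architecture, hidden size, and macro-action vocabulary size here are assumptions for illustration only; the project critic loader defines the real ones.

```python
import torch
import torch.nn as nn

# Assumed dimensions -- illustrative, not taken from the checkpoints.
HIDDEN_DIM = 4096        # Llama-3.1-8B hidden size
NUM_MACRO_ACTIONS = 32   # hypothetical macro-action vocabulary size

# Hypothetical linear heads standing in for the learned Q and xi heads.
q_head = nn.Linear(HIDDEN_DIM, NUM_MACRO_ACTIONS)
xi_head = nn.Linear(HIDDEN_DIM, NUM_MACRO_ACTIONS)

# In practice `hidden` would be the backbone's final hidden state at the
# scoring position; a random tensor keeps the sketch self-contained.
hidden = torch.randn(1, HIDDEN_DIM)
q_values = q_head(hidden)    # one Q-value per macro-action
xi_values = xi_head(hidden)  # one xi value per macro-action
print(q_values.shape, xi_values.shape)
```

The real checkpoints would restore the head weights with something like `torch.load("tbrm_heads.pt")` before applying them, rather than using freshly initialized layers as above.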
The WebShop Stochastic-TBRM result included here is the no-scorer variant: no structured product scorer, no structured option scorer, no learned click ranker, and no action repair.