Stochastic-TBRM Agent Benchmark Checkpoints

This repository contains the final reported model artifacts for the WebShop, ScienceWorld, and ALFWorld experiments:

  • tbrm/: Stochastic-TBRM reward-model checkpoints. Each directory contains a Skywork reward-model LoRA adapter, tokenizer files, and tbrm_heads.pt with the learned Q and xi heads.
  • dmpo-qwen35/: Qwen3.5-4B LoRA adapters trained as DMPO baselines.

The artifacts are adapter/checkpoint files only; they do not include base-model weights and require the corresponding base models (Skywork/Skywork-Reward-V2-Llama-3.1-8B and Qwen/Qwen3.5-4B; see Usage Notes).
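For orientation, a Stochastic-TBRM checkpoint directory looks roughly like the sketch below. Only backbone/ and tbrm_heads.pt are named in this card; the tokenizer file names are illustrative.

```
tbrm/webshop-no-scorer/
├── backbone/             # LoRA adapter for Skywork/Skywork-Reward-V2-Llama-3.1-8B
├── tokenizer files       # e.g. tokenizer.json, tokenizer_config.json (names illustrative)
└── tbrm_heads.pt         # learned Q and xi heads over the macro-action vocabulary
```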

Held-out Results

| Method | Benchmark | Path | Episodes | Avg. reward | Success | Avg. steps |
|---|---|---|---|---|---|---|
| Stochastic-TBRM | WebShop | tbrm/webshop-no-scorer | 200 | 0.6823 | 1.0000 | 4.03 |
| Stochastic-TBRM | ScienceWorld | tbrm/scienceworld | 211 | 0.3230 | 0.3886 | 25.93 |
| Stochastic-TBRM | ALFWorld | tbrm/alfworld | 134 | 0.4701 | 0.4701 | 26.04 |
| DMPO-Qwen3.5-4B | WebShop | dmpo-qwen35/webshop | 200 | 0.6490 | 0.9550 | 3.97 |
| DMPO-Qwen3.5-4B | ScienceWorld | dmpo-qwen35/scienceworld | 211 | 0.4112 | 0.4218 | 30.66 |
| DMPO-Qwen3.5-4B | ALFWorld | dmpo-qwen35/alfworld | 134 | 0.5746 | 0.5746 | 25.37 |

Usage Notes

Stochastic-TBRM checkpoints are not standalone causal language models and cannot be loaded directly with transformers. Load them with the project critic loader: the backbone/ LoRA adapter is attached to Skywork/Skywork-Reward-V2-Llama-3.1-8B, and tbrm_heads.pt provides the Q and xi output heads over the macro-action vocabulary. The Qwen DMPO adapters are standard PEFT LoRA adapters for Qwen/Qwen3.5-4B.
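To make the tbrm_heads.pt convention concrete, here is a minimal sketch of how a pair of Q/xi heads could be saved to and restored from a .pt file. The TBRMHeads module, its hidden size, and the action-vocabulary size are illustrative assumptions, not the checkpoint's actual architecture or dimensions; the real heads must be restored with the project critic loader after the LoRA backbone is attached.

```python
import torch
import torch.nn as nn

class TBRMHeads(nn.Module):
    """Illustrative Q and xi heads over a macro-action vocabulary.
    Hidden size and number of actions are placeholders, not the
    dimensions used by the released tbrm_heads.pt."""
    def __init__(self, hidden_size=4096, num_actions=32):
        super().__init__()
        self.q_head = nn.Linear(hidden_size, num_actions)
        self.xi_head = nn.Linear(hidden_size, num_actions)

    def forward(self, h):
        # h: backbone hidden state, shape (batch, hidden_size)
        return self.q_head(h), self.xi_head(h)

# Round-trip the heads through a .pt file, mirroring how the heads
# checkpoint would be restored on top of a loaded backbone.
heads = TBRMHeads()
torch.save(heads.state_dict(), "/tmp/tbrm_heads_demo.pt")

restored = TBRMHeads()
restored.load_state_dict(torch.load("/tmp/tbrm_heads_demo.pt"))

h = torch.randn(1, 4096)                # stand-in for a backbone hidden state
q_values, xi_values = restored(h)
print(q_values.shape, xi_values.shape)  # torch.Size([1, 32]) twice
```

The state_dict round-trip is the standard PyTorch pattern for auxiliary heads stored separately from the backbone weights.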

The WebShop Stochastic-TBRM result included here is the no-scorer variant: no structured product scorer, no structured option scorer, no learned click ranker, and no action repair.
