Stochastic-TBRM Agent Benchmark Checkpoints
This repository contains the final reported model artifacts for the WebShop, ScienceWorld, and ALFWorld experiments:
- `tbrm/`: Stochastic-TBRM reward-model checkpoints. Each directory contains a Skywork reward-model LoRA adapter, tokenizer files, and `tbrm_heads.pt` with the learned Q and xi heads.
- `dmpo-qwen35/`: Qwen3.5-4B LoRA adapters trained as DMPO baselines.
The artifacts are adapter/checkpoint files only; they require the corresponding base models named in the Usage Notes below.
Held-out Results
| Method | Benchmark | Path | Episodes | Avg. reward | Success | Avg. steps |
|---|---|---|---|---|---|---|
| Stochastic-TBRM | WebShop | `tbrm/webshop-no-scorer` | 200 | 0.6823 | 1.0000 | 4.03 |
| Stochastic-TBRM | ScienceWorld | `tbrm/scienceworld` | 211 | 0.3230 | 0.3886 | 25.93 |
| Stochastic-TBRM | ALFWorld | `tbrm/alfworld` | 134 | 0.4701 | 0.4701 | 26.04 |
| DMPO-Qwen3.5-4B | WebShop | `dmpo-qwen35/webshop` | 200 | 0.6490 | 0.9550 | 3.97 |
| DMPO-Qwen3.5-4B | ScienceWorld | `dmpo-qwen35/scienceworld` | 211 | 0.4112 | 0.4218 | 30.66 |
| DMPO-Qwen3.5-4B | ALFWorld | `dmpo-qwen35/alfworld` | 134 | 0.5746 | 0.5746 | 25.37 |
Usage Notes
Stochastic-TBRM checkpoints are not standalone `transformers` causal language models. They should be loaded
with the project critic loader: the `backbone/` LoRA adapter is attached to
`Skywork/Skywork-Reward-V2-Llama-3.1-8B`, and `tbrm_heads.pt` provides the Q and xi output heads over the
macro-action vocabulary. The Qwen DMPO adapters are standard PEFT LoRA adapters for `Qwen/Qwen3.5-4B`.
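As a rough illustration of how the heads in `tbrm_heads.pt` sit on top of the backbone, the sketch below applies linear Q and xi heads to a final hidden state. The head architecture, hidden size, and macro-action vocabulary size here are assumptions for illustration only; the project critic loader defines the real ones.

```python
import torch
import torch.nn as nn

# Assumed dimensions -- illustrative, not taken from the checkpoints.
HIDDEN_DIM = 4096        # Llama-3.1-8B hidden size
NUM_MACRO_ACTIONS = 32   # hypothetical macro-action vocabulary size

# Hypothetical linear heads standing in for the learned Q and xi heads.
q_head = nn.Linear(HIDDEN_DIM, NUM_MACRO_ACTIONS)
xi_head = nn.Linear(HIDDEN_DIM, NUM_MACRO_ACTIONS)

# In practice `hidden` would be the backbone's final hidden state at the
# scoring position; a random tensor keeps the sketch self-contained.
hidden = torch.randn(1, HIDDEN_DIM)
q_values = q_head(hidden)    # one Q-value per macro-action
xi_values = xi_head(hidden)  # one xi value per macro-action
print(q_values.shape, xi_values.shape)
```

The real checkpoints would restore the head weights with something like `torch.load("tbrm_heads.pt")` before applying them, rather than using freshly initialized layers as above.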
The WebShop Stochastic-TBRM result included here is the no-scorer variant: no structured product scorer, no structured option scorer, no learned click ranker, and no action repair.