sts-rl-agent — a learned non-combat policy for Slay the Spire

English | 中文

To our knowledge, this is the first working learned policy published for sts_lightspeed, the fast C++ Slay the Spire simulator, which ships a 412-dim neural-network observation interface (NNInterface) but no trained weights.

A small MLP (~100k params, [128,128], trained from scratch with REINFORCE) makes all non-combat decisions — map pathing, card rewards, shops, campfires, events — while the simulator's built-in MCTS plays combat.
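
As a rough check on the parameter figure, the sketch below counts the weights and biases of a fully connected network. It assumes the input is the 780-dim concatenation of the observation (412) and the candidate descriptor (368) described under Usage, two hidden layers of 128 units, and a single scalar output. The layer layout and the helper name `mlp_param_count` are assumptions for illustration; the actual `Scorer` class in the repo may be laid out differently.

```python
def mlp_param_count(layer_sizes):
    """Count weights plus biases of a dense MLP with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Assumed layout: obs(412) + candidate descriptor(368) -> 128 -> 128 -> 1 score.
sizes = [412 + 368, 128, 128, 1]
total = mlp_param_count(sizes)
print(total)  # 116609
```

Under these assumptions the count is on the order of 100k, dominated by the first layer (780 x 128 weights).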

Headline result

Same MCTS combat, same 50 held-out seeds, A0 Ironclad — only the non-combat "brain" differs:

non-combat decisions	combat	avg floor	win rate
stock bot heuristics (map = random)	MCTS @50000	31.2	6%
this model	MCTS @50000	42.5	14%

The learned non-combat layer is worth ~11 floors (31.2 to 42.5) over the stock bot, whose biggest weakness was never combat: it was walking the map at random.

Files

file	what
`armG_model_G128x128_15k.pt`	the non-combat policy behind the headline number (15k games)
`armG_model_G128x128.pt`	earlier 8k-game checkpoint
`armS_card_vocab.json`	card vocabulary (required to encode candidates)
`armB_model_B256x256.pt`	combat behavior-cloning model — negative result (floor ~12 vs teacher 23)
`armB_model_VAL256x256.pt`	combat value net — negative result (1-ply lookahead: floor ~8)
`armB_model_ATTN_d64L2.pt`	combat attention model — negative result (floor ~14)

The combat models are published on purpose as negative results: six different attempts to distill MCTS combat into a feed-forward network all failed the same way. Imitation caps at 0.44 train accuracy, because the MCTS teacher effectively sees the future draw order and a policy that sees only the current frame cannot. Judgment-type decisions compress into small networks easily; planning-type decisions resist compression.

Usage

The model scores one candidate at a time: the input is obs(412) ⊕ candidate-descriptor(368), the output is a scalar score per candidate, and the decision is the argmax. You need the patched simulator and the encoding code; the full code, sim patch, training scripts and eval protocol are at github.com/valiant-wjl/sts-rl-agent.

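import torch
from agent.armG_train import Scorer, build_choices, obs_vec   # from the GitHub repo

# Build the [128, 128] scorer and load the 15k-game checkpoint on CPU.
net = Scorer((128, 128))
net.load_state_dict(torch.load("armG_model_G128x128_15k.pt", map_location="cpu", weights_only=True))
net.eval()  # inference mode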
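
The scoring step described above can be sketched without the repo code. The snippet below is a minimal illustration, assuming the observation and each candidate descriptor are plain float lists of length 412 and 368. `pick_candidate` and `toy_score` are hypothetical names, and `toy_score` is only a stand-in for the trained network; the real encoders (`obs_vec`, `build_choices`) live in the GitHub repo and their signatures are not reproduced here.

```python
OBS_DIM, CAND_DIM = 412, 368

def pick_candidate(score_fn, obs, candidates):
    """Score obs ⊕ candidate for every candidate and return (argmax index, scores)."""
    if len(obs) != OBS_DIM:
        raise ValueError(f"expected {OBS_DIM}-dim observation, got {len(obs)}")
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = []
    for cand in candidates:
        if len(cand) != CAND_DIM:
            raise ValueError(f"expected {CAND_DIM}-dim descriptor, got {len(cand)}")
        scores.append(score_fn(obs + cand))  # list concatenation -> 780-dim input
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

def toy_score(x):
    """Placeholder scorer (NOT the trained model): sums the candidate part."""
    return sum(x[OBS_DIM:])

obs = [0.0] * OBS_DIM
candidates = [[0.0] * CAND_DIM, [0.1] * CAND_DIM, [0.05] * CAND_DIM]
best, scores = pick_candidate(toy_score, obs, candidates)
print(best)  # 1
```

In real use, `score_fn` would wrap a forward pass of the loaded network on the concatenated vector; how the repo batches candidates is not shown here.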

Limitations

A0 (lowest difficulty), Ironclad only (the simulator only fully implements Ironclad).
Combat is still search (MCTS), not learned.
Single-run numbers on 50 fixed seeds, no confidence intervals.
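
To get a rough sense of the uncertainty behind the win rates, one option is a Wilson score interval on the win counts. The sketch below is not part of the published eval protocol; it assumes the win counts are 3 and 7 out of 50 (inferred from the 6% and 14% in the table) and uses a 95% level (z = 1.96). It says nothing about the average-floor numbers, which would need per-seed results that are not listed here.

```python
import math

def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (default ~95%)."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Win counts inferred from the table: 6% and 14% of 50 seeds.
stock = wilson_interval(3, 50)
model = wilson_interval(7, 50)
print(f"stock bot:  {stock[0]:.3f} to {stock[1]:.3f}")
print(f"this model: {model[0]:.3f} to {model[1]:.3f}")
```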

Slay the Spire is a trademark of Mega Crit Games; this is an unaffiliated research project on a clean-room simulator (MIT).

中文说明

据我们所知,这是第一个针对 sts_lightspeed 神经网络接口公开发布的、能用的学习型策略——这个快速 C++《杀戮尖塔》模拟器自带 412 维观测接口(NNInterface),但作者从未公开过训好的权重。

一个从零训练的小 MLP(约 10 万参数,[128,128],REINFORCE)做全部非战斗决策——地图选路、奖励选卡、商店、篝火、事件;战斗由模拟器内置的 MCTS 执行。

核心结果

同样的 MCTS 战斗、同样的 50 个留出关卡(A0 铁甲),只换"非战斗的脑子":

非战斗决策	战斗	平均楼层	通关率
原生 bot 启发式(地图=随机)	MCTS @50000	31.2	6%
本模型	MCTS @50000	42.5	14%

学习型非战斗层比原生 bot 高出约 11 层——它最大的软肋从来不是打牌,而是选路基本靠随机。

文件说明

文件	说明
`armG_model_G128x128_15k.pt`	核心结果背后的非战斗策略(1.5 万局训练)
`armG_model_G128x128.pt`	更早的 8 千局 checkpoint
`armS_card_vocab.json`	卡牌词表(编码候选项必需)
`armB_model_B256x256.pt`	战斗行为克隆模型——负结果(实战 ~12 层 vs 老师 23)
`armB_model_VAL256x256.pt`	战斗价值网——负结果(1 步前瞻只有 ~8 层)
`armB_model_ATTN_d64L2.pt`	战斗注意力模型——负结果(~14 层)

战斗模型是有意公开的负结果:六种把 MCTS 战斗蒸馏进前馈网络的方法全部以同样方式失败(模仿训练准确率卡死 0.44——MCTS 模拟用的是真实未来抽牌顺序,老师"看得见未来",只看当前一帧的策略学不像)。判断型决策容易压进小网络;规划型决策会抵抗。

使用

输入 = obs(412) ⊕ 候选描述符(368) → 每个候选一个分,取 argmax。需要打过 patch 的模拟器和编码代码——完整代码、sim patch、训练脚本、评测协议见 github.com/valiant-wjl/sts-rl-agent(含中文 README 和架构图)。

局限

只测了 A0(最低难度)、只有铁甲(模拟器只完整实现了铁甲);
战斗仍是搜索(MCTS),不是学出来的;
50 个固定 seed 的单次结果,无置信区间。

《杀戮尖塔》(Slay the Spire)是 Mega Crit Games 的商标;本项目是基于净室模拟器的非官方研究项目(MIT)。
