sts-rl-agent — a learned non-combat policy for Slay the Spire
To our knowledge, this is the first working learned policy published for sts_lightspeed, the fast C++ Slay the Spire simulator, which ships a 412-dim neural-network observation interface (NNInterface) but no trained weights.
A small MLP (about 100k parameters, hidden layers [128,128], trained from scratch with REINFORCE) makes all non-combat decisions (map pathing, card rewards, shops, campfires, events), while the simulator's built-in MCTS plays combat.
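As a rough sanity check on the parameter count, the sketch below counts weights and biases for a plain fully connected network. It assumes the input is the 412-dim observation concatenated with the 368-dim candidate descriptor (the sizes given under Usage), two hidden layers of 128 units, a single scalar output, and a bias on every layer. The real `Scorer` in the repo may differ in those details, so treat the number as an estimate.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected net with the given layer widths."""
    return sum(
        (fan_in + 1) * fan_out  # +1 accounts for the bias of each output unit
        for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:])
    )

OBS_DIM, CAND_DIM = 412, 368  # dimensions quoted in this README

# Assumed layout: input -> 128 -> 128 -> scalar score
sizes = [OBS_DIM + CAND_DIM, 128, 128, 1]
total = mlp_param_count(sizes)
print(total)  # 116609
```

Under these assumptions the total is on the order of 100k, with most of it in the first layer (780 inputs times 128 units).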
Headline result
Both rows use the same MCTS combat, the same 50 held-out seeds, and Ascension 0 (A0) Ironclad; only the non-combat "brain" differs:
| non-combat decisions | combat | avg floor | win rate |
|---|---|---|---|
| stock bot heuristics (map = random) | MCTS @50000 | 31.2 | 6% |
| this model | MCTS @50000 | 42.5 | 14% |
The learned non-combat layer is worth about 11 floors (42.5 vs. 31.2 average floor) and 8 percentage points of win rate over the stock bot. The stock bot's biggest weakness was never combat; it was walking the map at random.
Files
| file | what |
|---|---|
| armG_model_G128x128_15k.pt | the non-combat policy behind the headline number (15k games) |
| armG_model_G128x128.pt | earlier 8k-game checkpoint |
| armS_card_vocab.json | card vocabulary (required to encode candidates) |
| armB_model_B256x256.pt | combat behavior-cloning model: negative result (floor ~12 vs teacher 23) |
| armB_model_VAL256x256.pt | combat value net: negative result (1-ply lookahead: floor ~8) |
| armB_model_ATTN_d64L2.pt | combat attention model: negative result (floor ~14) |
The combat models are published on purpose as negative results: six different attempts to distill MCTS combat into a feed-forward network all failed the same way. Imitation accuracy caps at 0.44 even on the training set, because the MCTS teacher's simulations use the real future draw order. The teacher effectively sees the future; a policy that looks only at the current frame cannot. Judgment-type decisions compress into small networks easily; planning-type decisions resist compression.
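One way to see why imitation accuracy can hit a ceiling on the training set itself: if the teacher's action depends on information the student never observes, identical observations carry different labels, and no function of the observation can match them all. The toy sketch below computes that ceiling for a handful of made-up samples. The observation names, actions, and counts are purely illustrative and are not data from this project.

```python
from collections import Counter, defaultdict

def best_observation_only_accuracy(samples):
    """Highest accuracy any deterministic observation -> action rule can reach.

    samples: iterable of (observation, teacher_action) pairs.
    For each distinct observation the best rule predicts the teacher's most
    frequent action, so the ceiling is the sum of those majority counts.
    """
    by_obs = defaultdict(Counter)
    for obs, action in samples:
        by_obs[obs][action] += 1
    total = sum(sum(counts.values()) for counts in by_obs.values())
    correct = sum(max(counts.values()) for counts in by_obs.values())
    return correct / total

# Hypothetical toy data: the teacher's choice flips with a hidden variable
# (for example, the next card to be drawn) that is not part of the observation.
samples = (
    [("low_hp", "block")] * 6 + [("low_hp", "attack")] * 4
    + [("high_hp", "attack")] * 5 + [("high_hp", "block")] * 5
)
ceiling = best_observation_only_accuracy(samples)
print(ceiling)  # 0.55
```

More capacity or more training does not move this ceiling; only giving the student the missing information does.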
Usage
The network takes the 412-dim observation concatenated with a 368-dim candidate descriptor, obs(412) ⊕ candidate-descriptor(368), and returns one scalar score per candidate; pick the argmax. You need the patched simulator and the encoding code. The full code, simulator patch, training scripts, and evaluation protocol are at github.com/valiant-wjl/sts-rl-agent.
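```python
import torch
from agent.armG_train import Scorer, build_choices, obs_vec  # from the GitHub repo

# Hidden layer sizes must match the checkpoint ([128,128] here).
net = Scorer((128, 128))
# weights_only=True restricts unpickling to tensors and plain containers.
net.load_state_dict(torch.load("armG_model_G128x128_15k.pt", weights_only=True))
```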
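The scoring step described above (concatenate the observation with each candidate descriptor, score, take the argmax) can be sketched as follows. `StandInScorer` and `pick_candidate` are hypothetical names written for this illustration, and the random tensors stand in for real encodings; in practice the repo's `Scorer`, `obs_vec`, and `build_choices` produce the model and its inputs, and their exact signatures are not shown here.

```python
import torch
from torch import nn

OBS_DIM, CAND_DIM = 412, 368  # dimensions quoted in this README

class StandInScorer(nn.Module):
    """Hypothetical stand-in for the repo's Scorer: an MLP with a scalar output."""

    def __init__(self, hidden=(128, 128)):
        super().__init__()
        layers, width = [], OBS_DIM + CAND_DIM
        for h in hidden:
            layers += [nn.Linear(width, h), nn.ReLU()]
            width = h
        layers.append(nn.Linear(width, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)  # (num_candidates,)

def pick_candidate(net, obs, candidates):
    """Score every candidate against one observation and return the argmax."""
    # Repeat the observation once per candidate, then concatenate feature-wise.
    pairs = torch.cat([obs.expand(candidates.shape[0], -1), candidates], dim=1)
    with torch.no_grad():
        scores = net(pairs)
    return int(scores.argmax()), scores

torch.manual_seed(0)
net = StandInScorer().eval()
obs = torch.randn(OBS_DIM)            # placeholder for an encoded observation
candidates = torch.randn(3, CAND_DIM)  # placeholder for 3 encoded candidates
best, scores = pick_candidate(net, obs, candidates)
print(best, scores.shape)
```

Scoring candidates one at a time against a shared observation lets the same network handle decisions with any number of options, from a three-card reward to a full shop.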
Limitations
- A0 (lowest difficulty), Ironclad only (the simulator only fully implements Ironclad).
- Combat is still search (MCTS), not learned.
- Single-run numbers on 50 fixed seeds, no confidence intervals.
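To give a feel for how wide the uncertainty is with 50 seeds, the sketch below computes 95% Wilson score intervals for the two win rates. It assumes the reported 6% and 14% correspond to 3 and 7 wins out of 50, which is inferred from the table rather than stated in the source, and it says nothing about the average-floor numbers.

```python
import math

def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Assumed win counts: 6% and 14% of 50 seeds.
stock = wilson_interval(3, 50)
model = wilson_interval(7, 50)
print(f"stock bot:  {stock[0]:.3f} to {stock[1]:.3f}")
print(f"this model: {model[0]:.3f} to {model[1]:.3f}")
```

Under that assumption the intervals come out at roughly 2% to 16% for the stock bot and 7% to 26% for this model. They overlap, so the win-rate gap on its own should be read as suggestive rather than conclusive.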
Slay the Spire is a trademark of Mega Crit Games; this is an unaffiliated research project on a clean-room simulator (MIT).
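Chinese description (translated)
To our knowledge, this is the first working learned policy publicly released for the sts_lightspeed neural-network interface. This fast C++ Slay the Spire simulator ships with a 412-dim observation interface (NNInterface), but its author never published trained weights.
A small MLP trained from scratch (about 100k parameters, [128,128], REINFORCE) makes all non-combat decisions: map pathing, card rewards, shops, campfires, and events. Combat is played by the simulator's built-in MCTS.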
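Core result
Same MCTS combat, same 50 held-out seeds (A0 Ironclad); only the non-combat "brain" changes:
| non-combat decisions | combat | avg floor | win rate |
|---|---|---|---|
| stock bot heuristics (map = random) | MCTS @50000 | 31.2 | 6% |
| this model | MCTS @50000 | 42.5 | 14% |
The learned non-combat layer gets about 11 floors further than the stock bot, whose biggest weakness was never playing cards but choosing its path essentially at random.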
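File descriptions
| file | description |
|---|---|
| armG_model_G128x128_15k.pt | the non-combat policy behind the core result (trained on 15k games) |
| armG_model_G128x128.pt | earlier 8k-game checkpoint |
| armS_card_vocab.json | card vocabulary (required to encode candidates) |
| armB_model_B256x256.pt | combat behavior-cloning model: negative result (~12 floors in play vs teacher 23) |
| armB_model_VAL256x256.pt | combat value net: negative result (1-ply lookahead reaches only ~8 floors) |
| armB_model_ATTN_d64L2.pt | combat attention model: negative result (~14 floors) |
The combat models are deliberately published negative results: all six methods for distilling MCTS combat into a feed-forward network failed in the same way. Imitation training accuracy is stuck at 0.44, because the MCTS simulations use the real future draw order, so the teacher "sees the future" and a policy that looks only at the current frame cannot imitate it. Judgment-type decisions compress easily into small networks; planning-type decisions resist.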
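Usage
Input = obs(412) ⊕ candidate descriptor(368) → one score per candidate; take the argmax. You need the patched simulator and the encoding code. The full code, sim patch, training scripts, and evaluation protocol are at github.com/valiant-wjl/sts-rl-agent (which also includes a Chinese README and architecture diagrams).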
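Limitations
- Tested only on A0 (lowest difficulty) and only with Ironclad (the simulator only fully implements Ironclad).
- Combat is still search (MCTS), not learned.
- Single-run results on 50 fixed seeds, with no confidence intervals.
Slay the Spire is a trademark of Mega Crit Games; this is an unofficial research project built on a clean-room simulator (MIT).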