YahtzeeRL Checkpoints
This repository hosts trained checkpoints for YahtzeeRL, a JAX/Flax/RLax self-play Yahtzee agent using Stochastic MuZero-style MCTS.
The current published checkpoint is a competitive two-player, head-to-head agent trained on simplified standard Yahtzee scoring: 13 categories plus upper bonus, without Joker rules or extra Yahtzee bonuses.
Best Checkpoint
The best competitive checkpoint is:
win_loss_margin_32simsrun4/step_011800
This checkpoint should be treated as the default competitive agent. A later
checkpoint, step_012800, regressed in direct greedy-policy comparison.
Download
Install the Hugging Face CLI, then download the checkpoint folder into the project's expected local checkpoint path:
hf download itxtx/yahtzee-rl-checkpoints \
--local-dir checkpoints/win_loss_margin_32simsrun4
The checkpoint directory includes the run-level config.json plus Orbax
checkpoint folders such as step_011800/. The YahtzeeRL loader needs both the
step folder and the adjacent config.json.
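Because the loader needs both pieces, it can be worth checking the local layout before running an evaluation. The following is a minimal sketch, assuming only the directory structure described above; `check_checkpoint_layout` is a hypothetical helper and is not part of the YahtzeeRL code.

```python
from pathlib import Path


def check_checkpoint_layout(step_dir: str) -> list[str]:
    """Return a list of layout problems for a step folder (empty if none found)."""
    step = Path(step_dir)
    problems = []
    if not step.is_dir():
        problems.append(f"missing step folder: {step}")
    # Assumption from the text above: config.json sits next to the step folder,
    # i.e. in the run directory that contains it.
    config = step.parent / "config.json"
    if not config.is_file():
        problems.append(f"missing run config: {config}")
    return problems


if __name__ == "__main__":
    issues = check_checkpoint_layout(
        "checkpoints/win_loss_margin_32simsrun4/step_011800"
    )
    print("layout looks complete" if not issues else "\n".join(issues))
```

This only checks that the files exist; it does not validate the Orbax checkpoint contents.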
Usage
Clone and install the code:
git clone https://github.com/itxtx/yahtzeeRL.git
cd yahtzeeRL
uv sync
Evaluate the checkpoint against the hand-written heuristic baseline:
uv run python -m yahtzee_rl.evaluate \
--agent-a mcts \
--checkpoint-a checkpoints/win_loss_margin_32simsrun4/step_011800 \
--agent-b heuristic \
--num-games 512 \
--sims-a 32
Play against the checkpoint:
uv run python -m yahtzee_rl.play_cli \
--checkpoint checkpoints/win_loss_margin_32simsrun4/step_011800 \
--num-simulations 32
Evaluation
Best competitive checkpoint versus heuristic:
Agent A: mcts@step_11800
Agent B: heuristic
games: 512
A win: 0.811 | B win: 0.186 | draw: 0.004
mean score A: 204.28 | B: 164.48 | margin: 39.80
Search versus the same checkpoint's greedy policy head:
mcts@step_11800 vs greedy@step_11800, 512 games, sims=64:
A win 0.494 | B win 0.492 | draw 0.014 | margin +0.96
mcts@step_11800 vs greedy@step_11800, 512 games, sims=128:
A win 0.525 | B win 0.471 | draw 0.004 | margin +0.26
mcts@step_11800 vs greedy@step_11800, 512 games, sims=256:
A win 0.508 | B win 0.486 | draw 0.006 | margin -0.58
Checkpoint regression check:
Agent A: greedy@step_11800
Agent B: greedy@step_12800
games: 512
A win: 0.553 | B win: 0.436 | draw: 0.012
mean score A: 209.03 | B: 201.90 | margin: 7.13
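These results suggest that step_011800 is the best competitive checkpoint
from the observed runs. Against its own greedy policy head, MCTS with 64 to 256
simulations wins between 49% and 53% of games with a mean margin within about
one point, so the greedy policy appears to have already absorbed most of the
useful shallow/medium search behavior.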
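To judge how much weight a 512-game result can carry, a rough confidence interval on the win rate helps. The sketch below uses the normal approximation to the binomial and treats draws as non-wins, which is a simplification; `win_rate_interval` is a hypothetical helper, and the win rates are copied from the logs above.

```python
import math


def win_rate_interval(
    win_rate: float, games: int, z: float = 1.96
) -> tuple[float, float]:
    """Normal-approximation interval for a win rate observed over `games` games."""
    stderr = math.sqrt(win_rate * (1.0 - win_rate) / games)
    return win_rate - z * stderr, win_rate + z * stderr


if __name__ == "__main__":
    # (label, Agent A win rate) pairs taken from the 512-game evaluations above.
    results = [
        ("mcts@32 vs heuristic", 0.811),
        ("mcts@128 vs greedy", 0.525),
        ("greedy 11800 vs greedy 12800", 0.553),
    ]
    for label, rate in results:
        low, high = win_rate_interval(rate, 512)
        print(f"{label}: {rate:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

Under this approximation the interval for the sims=128 comparison spans 0.5, so that gap between search and the greedy policy is within sampling noise, while the interval for the heuristic comparison lies well above 0.5.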
Training Objective
The checkpoint was trained for two-player competitive play with margin-shaped terminal rewards:
sign(score_margin) * (1 - 0.25) + 0.25 * tanh(score_margin / 50)
Training uses self-play, replay-buffer minibatches, and policy/value targets derived from MCTS search.
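The terminal reward formula above can be written as a small function. This is a sketch in plain Python rather than the repository's JAX code; the parameter names `shaping_weight` and `margin_scale` are illustrative labels for the constants 0.25 and 50.

```python
import math


def terminal_reward(
    score_margin: float,
    shaping_weight: float = 0.25,
    margin_scale: float = 50.0,
) -> float:
    """Win/loss sign blended with a tanh-squashed score margin."""
    # sign() is +1 for a win, -1 for a loss, 0 for a draw.
    sign = (score_margin > 0) - (score_margin < 0)
    shaped = math.tanh(score_margin / margin_scale)
    return sign * (1.0 - shaping_weight) + shaping_weight * shaped


if __name__ == "__main__":
    for margin in (-100, -1, 0, 1, 50, 100):
        print(margin, round(terminal_reward(margin), 4))
```

With these constants, a win by any margin earns more than 0.75, and the margin term adds at most another 0.25, so the win/loss outcome dominates the reward.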
Intended Use
This checkpoint is intended for:
- evaluating a trained Yahtzee RL policy against baselines
- playing against the agent locally
- reproducing or extending the YahtzeeRL experiments
- studying a compact JAX/Flax self-play setup with exact dice chance nodes
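For readers unfamiliar with the last item, an "exact" chance node typically enumerates every dice outcome with its true probability instead of sampling. The sketch below illustrates that idea for a Yahtzee reroll; it is an assumption about what the term means here, and `reroll_distribution` is a hypothetical function, not the repository's implementation.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def reroll_distribution(
    kept: tuple[int, ...], num_rerolled: int
) -> dict[tuple[int, ...], Fraction]:
    """Exact distribution over sorted hands after rerolling `num_rerolled` dice."""
    counts: Counter = Counter()
    # Enumerate every ordered roll, then merge rolls that give the same hand.
    for roll in product(range(1, 7), repeat=num_rerolled):
        counts[tuple(sorted(kept + roll))] += 1
    total = 6 ** num_rerolled
    return {hand: Fraction(n, total) for hand, n in counts.items()}


if __name__ == "__main__":
    full_roll = reroll_distribution((), 5)
    print(len(full_roll), "distinct hands from rolling all five dice")
    print("P(five sixes) =", full_roll[(6, 6, 6, 6, 6)])
```

Grouping rolls by sorted hand keeps the branching factor small: rolling all five dice has 7776 ordered outcomes but only 252 distinct hands.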
Limitations
- This is a competitive head-to-head agent, not a pure score-maximizing Yahtzee solver.
- The environment omits Joker rules and extra Yahtzee bonuses.
- The agent's mean score is in the low 200s in the observed evaluations; it was not optimized to maximize solo final score.
- Later training on the same objective produced regressions, so
step_011800 should be preserved as the default checkpoint unless a new run beats it in direct evaluation.
Future Direction
A separate score-maximizing agent would likely need a different terminal reward, for example:
tanh((own_score - 200) / 50)
Such an agent should be evaluated by greedy mean score over at least 1,000 games rather than by head-to-head win rate.
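A minimal sketch of the proposed reward and of the mean-score summary such an evaluation would report is shown below. Both function names are hypothetical, the `target` and `scale` defaults simply mirror the constants in the example formula, and no real scores are included.

```python
import math
import statistics


def score_reward(own_score: float, target: float = 200.0, scale: float = 50.0) -> float:
    """Solo reward: tanh of the distance between the final score and a target."""
    return math.tanh((own_score - target) / scale)


def summarize_scores(scores: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for a list of final scores."""
    mean = statistics.fmean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


if __name__ == "__main__":
    # Placeholder values only; replace with final scores from real greedy games.
    placeholder_scores = [190.0, 210.0, 205.0, 195.0]
    mean, stderr = summarize_scores(placeholder_scores)
    print(f"mean score {mean:.2f} +/- {stderr:.2f} (standard error)")
```

Reporting the standard error alongside the mean makes it easier to tell whether a difference between two checkpoints exceeds sampling noise.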