YahtzeeRL Checkpoints

This repository hosts trained checkpoints for YahtzeeRL, a JAX/Flax/RLax self-play Yahtzee agent using Stochastic MuZero-style MCTS.

The current published checkpoint is a competitive two-player, head-to-head agent trained on simplified standard Yahtzee scoring: 13 categories plus upper bonus, without Joker rules or extra Yahtzee bonuses.

Best Checkpoint

The best competitive checkpoint is:

win_loss_margin_32simsrun4/step_011800

Treat this checkpoint as the default competitive agent. A later checkpoint, step_012800, regressed: in a direct greedy-policy comparison over 512 games it won 43.6% of games against 55.3% for step_011800.

Download

Install the Hugging Face CLI, then download the checkpoint folder into the project's expected local checkpoint path:

hf download itxtx/yahtzee-rl-checkpoints \
  --local-dir checkpoints/win_loss_margin_32simsrun4

The checkpoint directory includes the run-level config.json plus Orbax checkpoint folders such as step_011800/. The YahtzeeRL loader needs both the step folder and the adjacent config.json.
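Because the loader needs both pieces, a quick layout check after downloading can save a confusing failure later. The following is a minimal sketch, not part of YahtzeeRL: the helper name check_checkpoint_layout is hypothetical, and it only assumes what the paragraph above states, namely that config.json sits in the same directory as the step folder.

```python
from pathlib import Path


def check_checkpoint_layout(step_dir):
    """Return a list of problems found with a checkpoint step folder.

    Hypothetical helper: it assumes only that the run-level config.json
    lives in the parent directory of the step folder.
    """
    step_path = Path(step_dir)
    problems = []
    if not step_path.is_dir():
        problems.append(f"missing step folder: {step_path}")
    config_path = step_path.parent / "config.json"
    if not config_path.is_file():
        problems.append(f"missing run-level config: {config_path}")
    return problems


if __name__ == "__main__":
    issues = check_checkpoint_layout(
        "checkpoints/win_loss_margin_32simsrun4/step_011800"
    )
    print("layout OK" if not issues else "\n".join(issues))
```

An empty list means both the step folder and the adjacent config.json were found. The check does not validate the contents of either.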

Usage

Clone and install the code:

git clone https://github.com/itxtx/yahtzeeRL.git
cd yahtzeeRL
uv sync

Evaluate the checkpoint against the hand-written heuristic baseline:

uv run python -m yahtzee_rl.evaluate \
  --agent-a mcts \
  --checkpoint-a checkpoints/win_loss_margin_32simsrun4/step_011800 \
  --agent-b heuristic \
  --num-games 512 \
  --sims-a 32

Play against the checkpoint:

uv run python -m yahtzee_rl.play_cli \
  --checkpoint checkpoints/win_loss_margin_32simsrun4/step_011800 \
  --num-simulations 32

Evaluation

Best competitive checkpoint versus heuristic:

Agent A: mcts@step_11800
Agent B: heuristic
games: 512
A win: 0.811 | B win: 0.186 | draw: 0.004
mean score A: 204.28 | B: 164.48 | margin: 39.80

Search versus the same checkpoint's greedy policy head:

mcts@step_11800 vs greedy@step_11800, 512 games, sims=64:
  A win 0.494 | B win 0.492 | draw 0.014 | margin +0.96

mcts@step_11800 vs greedy@step_11800, 512 games, sims=128:
  A win 0.525 | B win 0.471 | draw 0.004 | margin +0.26

mcts@step_11800 vs greedy@step_11800, 512 games, sims=256:
  A win 0.508 | B win 0.486 | draw 0.006 | margin -0.58

Checkpoint regression check:

Agent A: greedy@step_11800
Agent B: greedy@step_12800
games: 512
A win: 0.553 | B win: 0.436 | draw: 0.012
mean score A: 209.03 | B: 201.90 | margin: 7.13

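These results suggest that step_011800 is the best competitive checkpoint from the observed runs. They also suggest that the greedy policy has already absorbed most of the useful shallow-to-medium search behavior: with 64 to 256 simulations, MCTS wins between 49.4% and 52.5% of games against the same checkpoint's greedy policy, and the mean score margin stays within one point of zero.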
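To judge how much of each gap could be sampling noise, the reported win rates can be given rough confidence intervals. The sketch below uses a normal approximation and assumes that games are independent and that draws count as non-wins. The helper win_rate_interval is illustrative and not part of the YahtzeeRL code.

```python
import math


def win_rate_interval(win_rate, num_games, z=1.96):
    """Normal-approximation confidence interval for a win rate.

    z=1.96 gives roughly 95% coverage. Assumes independent games and
    treats draws as non-wins.
    """
    std_err = math.sqrt(win_rate * (1.0 - win_rate) / num_games)
    return win_rate - z * std_err, win_rate + z * std_err


# Agent A win rates reported above, each measured over 512 games.
reported = {
    "mcts vs heuristic": 0.811,
    "mcts vs greedy, sims=64": 0.494,
    "mcts vs greedy, sims=128": 0.525,
    "mcts vs greedy, sims=256": 0.508,
    "greedy 11800 vs greedy 12800": 0.553,
}

for label, rate in reported.items():
    low, high = win_rate_interval(rate, 512)
    print(f"{label}: {rate:.3f} (approx. 95% CI {low:.3f} to {high:.3f})")
```

Under these assumptions the standard error near a 50% win rate is about 0.022 for 512 games, so differences of a few percentage points between the search settings are within sampling noise, while the gap to the heuristic baseline is not.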

Training Objective

The checkpoint was trained for two-player competitive play with margin-shaped terminal rewards:

sign(score_margin) * (1 - 0.25) + 0.25 * tanh(score_margin / 50)

Training uses self-play, replay-buffer minibatches, and policy/value targets derived from MCTS search.
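The terminal reward formula above can be written out as a small function. This is a minimal sketch in plain Python rather than the project's JAX code. The parameter names shaping_weight and scale are assumptions that name the constants 0.25 and 50 from the formula, and score_margin is assumed to be the agent's final score minus the opponent's.

```python
import math


def margin_shaped_reward(score_margin, shaping_weight=0.25, scale=50.0):
    """Terminal reward mixing win/loss sign with a bounded margin term.

    Computes sign(m) * (1 - w) + w * tanh(m / scale), with w = 0.25 and
    scale = 50 by default, matching the formula in the text.
    """
    # Integer sign: +1 for a win, -1 for a loss, 0 for a draw.
    sign = (score_margin > 0) - (score_margin < 0)
    margin_term = math.tanh(score_margin / scale)
    return sign * (1.0 - shaping_weight) + shaping_weight * margin_term


if __name__ == "__main__":
    for margin in (-100, -10, 0, 10, 100):
        print(margin, round(margin_shaped_reward(margin), 3))
```

With these constants a one-point win already earns just over 0.75, so the win/loss outcome dominates and the score margin contributes at most an extra 0.25. The reward is antisymmetric in the margin and stays strictly between -1 and 1.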

Intended Use

This checkpoint is intended for:

- evaluating a trained Yahtzee RL policy against baselines
- playing against the agent locally
- reproducing or extending the YahtzeeRL experiments
- studying a compact JAX/Flax self-play setup with exact dice chance nodes

Limitations

This is a competitive head-to-head agent, not a pure score-maximizing Yahtzee solver.
The environment omits Joker rules and extra Yahtzee bonuses.
The agent's mean score was roughly 204 to 209 in the observed evaluations; it was not optimized to maximize solo final score.
Later training on the same objective produced regressions, so step_011800 should be preserved as the default checkpoint unless a new run beats it in direct evaluation.

Future Direction

A separate score-maximizing agent would likely need a different terminal reward, for example:

tanh((own_score - 200) / 50)

Such an agent should be evaluated by greedy mean score over at least 1,000 games rather than by head-to-head win rate.
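As a rough illustration of this direction, the sketch below implements the proposed score-based reward and a summary of final scores for evaluation. Both function names are hypothetical, target and scale name the constants 200 and 50 from the formula above, and none of this exists in the published code.

```python
import math
import statistics


def score_reward(own_score, target=200.0, scale=50.0):
    """Proposed terminal reward: tanh((own_score - target) / scale)."""
    return math.tanh((own_score - target) / scale)


def summarize_scores(final_scores):
    """Return (mean, standard error of the mean) for a list of final scores.

    Needs at least two scores, because the sample standard deviation is
    undefined for a single value.
    """
    mean = statistics.fmean(final_scores)
    std_err = statistics.stdev(final_scores) / math.sqrt(len(final_scores))
    return mean, std_err
```

Passing the final scores from a greedy evaluation run to summarize_scores gives the mean together with its standard error, which shows whether a difference in mean score between two checkpoints is larger than sampling noise.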
