---
license: mit
library_name: flax
tags:
  - reinforcement-learning
  - jax
  - flax
  - mcts
  - board-games
  - yahtzee
pipeline_tag: reinforcement-learning
---

# YahtzeeRL Checkpoints

This repository hosts trained checkpoints for
[YahtzeeRL](https://github.com/itxtx/yahtzeeRL), a JAX/Flax/RLax self-play
Yahtzee agent using Stochastic MuZero-style MCTS.

The current published checkpoint is a competitive two-player, head-to-head
agent trained on a simplified version of standard Yahtzee scoring: the 13
categories plus the upper-section bonus, without Joker rules or extra Yahtzee
bonuses.

## Best Checkpoint

The best competitive checkpoint is:

```text
win_loss_margin_32simsrun4/step_011800
```

Treat this checkpoint as the default competitive agent. A later checkpoint,
`step_012800`, regressed in a direct greedy-policy comparison: it won 0.436 of
512 games against 0.553 for `step_011800`.

## Download

Install the Hugging Face CLI (the `hf` command ships with the
`huggingface_hub` Python package), then download the checkpoint folder into
the project's expected local checkpoint path:

```bash
hf download itxtx/yahtzee-rl-checkpoints \
  --local-dir checkpoints/win_loss_margin_32simsrun4
```

The checkpoint directory includes the run-level `config.json` plus Orbax
checkpoint folders such as `step_011800/`. The YahtzeeRL loader needs both the
step folder and the adjacent `config.json`.
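A quick way to confirm that a download has the layout described above is to check for both pieces before calling the loader. The sketch below is illustrative only: `check_checkpoint_layout` is a hypothetical helper, not part of YahtzeeRL, and it checks only that the paths exist, not that the Orbax data is valid.

```python
from pathlib import Path


def check_checkpoint_layout(step_dir: str) -> list[str]:
    """Return a list of problems found; an empty list means the layout looks usable."""
    step = Path(step_dir)
    problems = []
    if not step.is_dir():
        problems.append(f"missing step folder: {step}")
    # The run-level config is expected next to the step folders, not inside them.
    config = step.parent / "config.json"
    if not config.is_file():
        problems.append(f"missing run config: {config}")
    return problems


# Path taken from the download command above.
issues = check_checkpoint_layout("checkpoints/win_loss_margin_32simsrun4/step_011800")
print("layout OK" if not issues else "\n".join(issues))
```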

## Usage

Clone and install the code:

```bash
git clone https://github.com/itxtx/yahtzeeRL.git
cd yahtzeeRL
uv sync
```

Evaluate the checkpoint against the hand-written heuristic baseline:

```bash
uv run python -m yahtzee_rl.evaluate \
  --agent-a mcts \
  --checkpoint-a checkpoints/win_loss_margin_32simsrun4/step_011800 \
  --agent-b heuristic \
  --num-games 512 \
  --sims-a 32
```

Play against the checkpoint:

```bash
uv run python -m yahtzee_rl.play_cli \
  --checkpoint checkpoints/win_loss_margin_32simsrun4/step_011800 \
  --num-simulations 32
```

## Evaluation

Best competitive checkpoint versus heuristic:

```text
Agent A: mcts@step_11800
Agent B: heuristic
games: 512
A win: 0.811 | B win: 0.186 | draw: 0.004
mean score A: 204.28 | B: 164.48 | margin: 39.80
```

Search versus the same checkpoint's greedy policy head:

```text
mcts@step_11800 vs greedy@step_11800, 512 games, sims=64:
  A win 0.494 | B win 0.492 | draw 0.014 | margin +0.96

mcts@step_11800 vs greedy@step_11800, 512 games, sims=128:
  A win 0.525 | B win 0.471 | draw 0.004 | margin +0.26

mcts@step_11800 vs greedy@step_11800, 512 games, sims=256:
  A win 0.508 | B win 0.486 | draw 0.006 | margin -0.58
```

Checkpoint regression check:

```text
Agent A: greedy@step_11800
Agent B: greedy@step_12800
games: 512
A win: 0.553 | B win: 0.436 | draw: 0.012
mean score A: 209.03 | B: 201.90 | margin: 7.13
```

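These results suggest that `step_011800` is the best competitive checkpoint
from the observed runs. At 512 games the standard error of a win rate near 0.5
is about 0.02, so the MCTS-versus-greedy results at 64, 128, and 256
simulations are all consistent with parity. That in turn suggests the greedy
policy has already absorbed most of the useful shallow/medium search behavior.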
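To gauge how much of each reported win rate could be sampling noise, a normal-approximation confidence interval is a reasonable first check. The sketch below uses only the Agent A win rates and game counts from the logs above; `win_rate_interval` is a hypothetical helper, and it treats each game as an independent win/not-win trial, which ignores draws.

```python
import math


def win_rate_interval(win_rate: float, games: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for a win rate (z=1.96 is about 95%)."""
    se = math.sqrt(win_rate * (1.0 - win_rate) / games)
    return win_rate - z * se, win_rate + z * se


# Agent A win rates copied from the evaluation logs above (512 games each).
reported = {
    "mcts vs heuristic": 0.811,
    "mcts vs greedy, sims=64": 0.494,
    "mcts vs greedy, sims=128": 0.525,
    "mcts vs greedy, sims=256": 0.508,
    "greedy 11800 vs greedy 12800": 0.553,
}

for name, p in reported.items():
    low, high = win_rate_interval(p, 512)
    print(f"{name}: {p:.3f}  [{low:.3f}, {high:.3f}]")
```

An interval that contains 0.5 means the match does not separate the two agents at this sample size; more games would be needed to resolve small differences.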

## Training Objective

The checkpoint was trained for two-player competitive play with margin-shaped
terminal rewards:

```text
sign(score_margin) * (1 - 0.25) + 0.25 * tanh(score_margin / 50)
```

Training uses self-play, replay-buffer minibatches, and policy/value targets
derived from MCTS search.
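The reward formula above can be written out as a small function to see how it behaves. This is a sketch under stated assumptions, not the project's implementation: the function and constant names are hypothetical, `score_margin` is assumed to be the player's final score minus the opponent's, and a tied game is assumed to give `sign(0) = 0`.

```python
import math

MARGIN_WEIGHT = 0.25                   # weight of the tanh margin term
WIN_LOSS_WEIGHT = 1.0 - MARGIN_WEIGHT  # the (1 - 0.25) factor in the formula
MARGIN_SCALE = 50.0                    # divisor inside tanh


def terminal_reward(score_margin: float) -> float:
    """Margin-shaped terminal reward: mostly win/loss, with a bounded margin bonus."""
    # sign() as an integer in {-1, 0, 1}; a tie yields 0 (assumption).
    sign = (score_margin > 0) - (score_margin < 0)
    return WIN_LOSS_WEIGHT * sign + MARGIN_WEIGHT * math.tanh(score_margin / MARGIN_SCALE)


for margin in (-100.0, -1.0, 0.0, 1.0, 39.8, 100.0):
    print(f"margin {margin:+7.1f} -> reward {terminal_reward(margin):+.4f}")
```

Because `tanh` is bounded by 1 in magnitude, the reward stays within [-1, 1], and even a one-point win earns at least 0.75. The win/loss outcome therefore dominates, with the margin acting as a secondary shaping signal.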

## Intended Use

This checkpoint is intended for:

- evaluating a trained Yahtzee RL policy against baselines
- playing against the agent locally
- reproducing or extending the YahtzeeRL experiments
- studying a compact JAX/Flax self-play setup with exact dice chance nodes
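For readers new to the idea of exact dice chance nodes, the sketch below enumerates every distinct outcome of rolling a given number of fair dice together with its exact probability. It is a generic illustration of the concept, not the YahtzeeRL implementation; `reroll_distribution` is a hypothetical name, and how the project actually represents chance outcomes is not specified here.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations_with_replacement
from math import factorial


def reroll_distribution(num_dice: int) -> dict[tuple[int, ...], Fraction]:
    """Exact probability of each sorted outcome when rolling `num_dice` fair dice."""
    total = 6 ** num_dice
    dist = {}
    for outcome in combinations_with_replacement(range(1, 7), num_dice):
        # Number of ordered rolls that sort to this outcome (multinomial coefficient).
        ways = factorial(num_dice)
        for count in Counter(outcome).values():
            ways //= factorial(count)
        dist[outcome] = Fraction(ways, total)
    return dist


dist = reroll_distribution(5)
print(len(dist), "distinct outcomes; probabilities sum to", sum(dist.values()))
```

Enumerating sorted outcomes instead of ordered rolls keeps the branching factor of a chance node small while still giving exact probabilities.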

## Limitations

- This is a competitive head-to-head agent, not a pure score-maximizing Yahtzee
  solver.
- The environment omits Joker rules and extra Yahtzee bonuses.
- The agent's mean score is in the low 200s in the observed evaluations (204
  to 209); it was not optimized to maximize solo final score.
- Later training on the same objective produced regressions, so `step_011800`
  should be preserved as the default checkpoint unless a new run beats it in
  direct evaluation.

## Future Direction

A separate score-maximizing agent would likely need a different terminal reward,
for example:

```text
tanh((own_score - 200) / 50)
```

Such an agent should be evaluated by greedy-policy mean score over at least
1,000 games rather than by head-to-head win rate.
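A minimal sketch of both pieces, the proposed reward and a mean-score summary, might look like the following. The names `solo_reward` and `summarize_scores` are hypothetical, the 200 and 50 constants come from the formula above, and the score list is placeholder data used only to show the call, not an evaluation result.

```python
import math
import statistics


def solo_reward(own_score: float, baseline: float = 200.0, scale: float = 50.0) -> float:
    """Score-maximizing terminal reward: zero at the baseline, saturating toward +/-1."""
    return math.tanh((own_score - baseline) / scale)


def summarize_scores(scores: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for a list of final scores."""
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


# Placeholder scores for illustration only.
example_scores = [190.0, 210.0, 185.0, 230.0]
mean, sem = summarize_scores(example_scores)
print(f"mean {mean:.2f} +/- {sem:.2f} (standard error)")
print(f"reward at 250 points: {solo_reward(250.0):.3f}")
```

The standard error shrinks with the square root of the number of games, which is why a larger evaluation set is needed to resolve differences of a few points in mean score.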