---
license: mit
library_name: flax
tags:
- reinforcement-learning
- jax
- flax
- mcts
- board-games
- yahtzee
pipeline_tag: reinforcement-learning
---

# YahtzeeRL Checkpoints

This repository hosts trained checkpoints for
[YahtzeeRL](https://github.com/itxtx/yahtzeeRL), a JAX/Flax/RLax self-play
Yahtzee agent using Stochastic MuZero-style MCTS.

The currently published checkpoint is a competitive two-player, head-to-head
agent trained on simplified Yahtzee scoring: the 13 standard categories plus
the upper bonus, without Joker rules or extra Yahtzee bonuses.

## Best Checkpoint

The best competitive checkpoint is:

```text
win_loss_margin_32simsrun4/step_011800
```

Treat this checkpoint as the default competitive agent. A later checkpoint,
`step_012800`, regressed: it lost to `step_011800` in a direct greedy-policy
comparison over 512 games (win rate 0.436 versus 0.553).

## Download

Install the Hugging Face CLI (the `hf` command from the `huggingface_hub`
package), then download the checkpoint folder into the project's expected
local checkpoint path:

```bash
hf download itxtx/yahtzee-rl-checkpoints \
  --local-dir checkpoints/win_loss_margin_32simsrun4
```

The checkpoint directory includes the run-level `config.json` plus Orbax
checkpoint folders such as `step_011800/`. The YahtzeeRL loader needs both the
step folder and the adjacent `config.json`.

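Before running evaluation, it can help to confirm that the download produced the layout described above. The following is a minimal sketch, not part of YahtzeeRL: the helper name `missing_checkpoint_parts` is hypothetical, and it only assumes that `config.json` sits next to the step folder.

```python
from pathlib import Path


def missing_checkpoint_parts(step_dir):
    """Return a list of problems with a downloaded checkpoint layout.

    Assumed layout (taken from the description above, not from the loader):
        <run_dir>/config.json
        <run_dir>/step_XXXXXX/
    """
    step_dir = Path(step_dir)
    problems = []
    if not step_dir.is_dir():
        problems.append(f"missing step folder: {step_dir}")
    config_path = step_dir.parent / "config.json"
    if not config_path.is_file():
        problems.append(f"missing run config: {config_path}")
    return problems


if __name__ == "__main__":
    step = "checkpoints/win_loss_margin_32simsrun4/step_011800"
    for problem in missing_checkpoint_parts(step):
        print(problem)
```

An empty result only means both paths exist; it does not validate the Orbax checkpoint contents.
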
## Usage

Clone and install the code:

```bash
git clone https://github.com/itxtx/yahtzeeRL.git
cd yahtzeeRL
uv sync
```

Evaluate the checkpoint against the hand-written heuristic baseline:

```bash
uv run python -m yahtzee_rl.evaluate \
  --agent-a mcts \
  --checkpoint-a checkpoints/win_loss_margin_32simsrun4/step_011800 \
  --agent-b heuristic \
  --num-games 512 \
  --sims-a 32
```

Play against the checkpoint:

```bash
uv run python -m yahtzee_rl.play_cli \
  --checkpoint checkpoints/win_loss_margin_32simsrun4/step_011800 \
  --num-simulations 32
```

## Evaluation

Best competitive checkpoint versus heuristic:

```text
Agent A: mcts@step_11800
Agent B: heuristic
games: 512
A win: 0.811 | B win: 0.186 | draw: 0.004
mean score A: 204.28 | B: 164.48 | margin: 39.80
```

Search versus the same checkpoint's greedy policy head:

```text
mcts@step_11800 vs greedy@step_11800, 512 games, sims=64:
A win 0.494 | B win 0.492 | draw 0.014 | margin +0.96

mcts@step_11800 vs greedy@step_11800, 512 games, sims=128:
A win 0.525 | B win 0.471 | draw 0.004 | margin +0.26

mcts@step_11800 vs greedy@step_11800, 512 games, sims=256:
A win 0.508 | B win 0.486 | draw 0.006 | margin -0.58
```

Checkpoint regression check:

```text
Agent A: greedy@step_11800
Agent B: greedy@step_12800
games: 512
A win: 0.553 | B win: 0.436 | draw: 0.012
mean score A: 209.03 | B: 201.90 | margin: 7.13
```

These results suggest that `step_011800` is the best competitive checkpoint
from the observed runs, and that the greedy policy has already absorbed most of
the useful shallow/medium search behavior. With 512 games per match, the gap
between the two win rates has a standard error of about 0.044 (treating games
as independent), so the search-versus-greedy gaps of 0.002 to 0.054 are within
sampling noise, while the 0.117 gap in the regression check is about 2.7
standard errors.

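As a rough check on how much of each win-rate gap could be sampling noise, the sketch below computes the gap and its standard error from the reported win rates. It assumes games are independent and scores each game as +1 (A wins), -1 (B wins), or 0 (draw); the function name `win_gap_stats` is illustrative and not part of YahtzeeRL.

```python
import math


def win_gap_stats(win_a, win_b, games):
    """Gap in win rate between A and B, with its standard error.

    Each game contributes +1 (A wins), -1 (B wins), or 0 (draw).
    Games are assumed to be independent.
    """
    gap = win_a - win_b
    variance = (win_a + win_b) - gap ** 2  # E[X^2] - E[X]^2 for X in {-1, 0, +1}
    std_err = math.sqrt(variance / games)
    return gap, std_err


# Win rates copied from the evaluation logs above (512 games each).
matches = {
    "mcts vs greedy, sims=64": (0.494, 0.492),
    "mcts vs greedy, sims=128": (0.525, 0.471),
    "mcts vs greedy, sims=256": (0.508, 0.486),
    "greedy step_11800 vs step_12800": (0.553, 0.436),
}
for name, (win_a, win_b) in matches.items():
    gap, std_err = win_gap_stats(win_a, win_b, games=512)
    print(f"{name}: gap {gap:+.3f}, std err {std_err:.3f}, z {gap / std_err:+.2f}")
```

Under these assumptions the standard error is about 0.044 in every match, so the three search-versus-greedy gaps stay below roughly 1.3 standard errors, while the regression check sits near 2.7.
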
## Training Objective

The checkpoint was trained for two-player competitive play with a
margin-shaped terminal reward:

```text
sign(score_margin) * (1 - 0.25) + 0.25 * tanh(score_margin / 50)
```

The sign term carries 75% of the weight, so winning or losing dominates; the
tanh term adds at most 0.25 for the size of the margin and saturates once the
margin is well beyond 50 points.

Training uses self-play, replay-buffer minibatches, and policy/value targets
derived from MCTS search.

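To make the shape of this reward concrete, here is a small plain-Python sketch of the formula above. It is not the project's JAX implementation: the function and parameter names are hypothetical, and it assumes `score_margin` is the agent's final score minus the opponent's, with a tied game mapping to a reward of 0.

```python
import math


def margin_shaped_reward(score_margin, margin_weight=0.25, margin_scale=50.0):
    """Terminal reward: mostly win/loss sign, partly tanh-squashed margin.

    Assumption: score_margin = own final score - opponent final score.
    """
    sign = (score_margin > 0) - (score_margin < 0)  # -1, 0, or +1
    shaped = math.tanh(score_margin / margin_scale)
    return sign * (1.0 - margin_weight) + margin_weight * shaped


for margin in (-100, -10, 0, 10, 100):
    print(margin, round(margin_shaped_reward(margin), 3))
```

A 10-point win already earns about 0.80 and a 100-point win about 0.99, so the reward mainly encodes who won and only mildly encourages larger margins.
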
## Intended Use

This checkpoint is intended for:

- evaluating a trained Yahtzee RL policy against baselines
- playing against the agent locally
- reproducing or extending the YahtzeeRL experiments
- studying a compact JAX/Flax self-play setup with exact dice chance nodes

## Limitations

- This is a competitive head-to-head agent, not a pure score-maximizing Yahtzee
  solver.
- The environment omits Joker rules and extra Yahtzee bonuses.
- The agent's mean score is in the low 200s in the observed evaluations (204.28
  against the heuristic, 209.03 in the greedy checkpoint comparison); it was
  not optimized to maximize solo final score.
- Later training on the same objective produced regressions, so `step_011800`
  should be preserved as the default checkpoint unless a new run beats it in
  direct evaluation.

## Future Direction

A separate score-maximizing agent would likely need a different terminal reward,
for example:

```text
tanh((own_score - 200) / 50)
```

Such an agent should be evaluated by greedy mean score over at least 1,000
games rather than by head-to-head win rate.

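The proposed reward and the suggested evaluation metric can be sketched in plain Python as follows. Both helpers are hypothetical and not part of YahtzeeRL; `mean_score_summary` expects per-game final scores collected from greedy play and reports the mean with its standard error, assuming independent games.

```python
import math
import statistics


def solo_score_reward(own_score, center=200.0, scale=50.0):
    """Score-centered terminal reward: tanh((own_score - center) / scale)."""
    return math.tanh((own_score - center) / scale)


def mean_score_summary(scores):
    """Mean final score and its standard error over a list of games."""
    mean = statistics.fmean(scores)
    std_err = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, std_err


# The reward is 0 at a score of 200 and flattens out far from it.
for score in (100, 150, 200, 250, 300):
    print(score, round(solo_score_reward(score), 3))
```

Because the standard error shrinks with the square root of the number of games, going from 512 to 1,000 or more games tightens the estimate of mean score by roughly 30%.
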