---
license: mit
library_name: flax
tags:
- reinforcement-learning
- jax
- flax
- mcts
- board-games
- yahtzee
pipeline_tag: reinforcement-learning
---

# YahtzeeRL Checkpoints

This repository hosts trained checkpoints for [YahtzeeRL](https://github.com/itxtx/yahtzeeRL), a JAX/Flax/RLax self-play Yahtzee agent using Stochastic MuZero-style MCTS.

The current published checkpoint is a competitive two-player, head-to-head agent trained on simplified standard Yahtzee scoring: 13 categories plus upper bonus, without Joker rules or extra Yahtzee bonuses.

## Best Checkpoint

The best competitive checkpoint is:

```text
win_loss_margin_32simsrun4/step_011800
```

This checkpoint should be treated as the default competitive agent. A later checkpoint, `step_012800`, regressed in direct greedy-policy comparison.

## Download

Install the Hugging Face CLI, then download the checkpoint folder into the project's expected local checkpoint path:

```bash
hf download itxtx/yahtzee-rl-checkpoints \
  --local-dir checkpoints/win_loss_margin_32simsrun4
```

The checkpoint directory includes the run-level `config.json` plus Orbax checkpoint folders such as `step_011800/`. The YahtzeeRL loader needs both the step folder and the adjacent `config.json`.

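
As a quick sanity check after downloading, the layout described above can be verified with a few lines of Python. This is a minimal sketch, not part of YahtzeeRL: the helper name `check_checkpoint_layout` is hypothetical, and it only assumes what the text states, namely that `config.json` sits next to the step folder.

```python
from pathlib import Path


def check_checkpoint_layout(step_dir):
    """Return a list of problems; an empty list means the layout looks complete.

    Hypothetical helper (not part of YahtzeeRL). It checks only that the step
    folder exists and that a config.json sits in its parent directory.
    """
    step_path = Path(step_dir)
    problems = []
    if not step_path.is_dir():
        problems.append(f"missing step folder: {step_path}")
    config_path = step_path.parent / "config.json"
    if not config_path.is_file():
        problems.append(f"missing run config: {config_path}")
    return problems


if __name__ == "__main__":
    step = "checkpoints/win_loss_margin_32simsrun4/step_011800"
    issues = check_checkpoint_layout(step)
    print("layout OK" if not issues else "\n".join(issues))
```

The check does not inspect the Orbax files inside the step folder, so a passing result only confirms that the two required pieces are present.
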
## Usage

Clone and install the code:

```bash
git clone https://github.com/itxtx/yahtzeeRL.git
cd yahtzeeRL
uv sync
```

Evaluate the checkpoint against the hand-written heuristic baseline:

```bash
uv run python -m yahtzee_rl.evaluate \
  --agent-a mcts \
  --checkpoint-a checkpoints/win_loss_margin_32simsrun4/step_011800 \
  --agent-b heuristic \
  --num-games 512 \
  --sims-a 32
```

Play against the checkpoint:

```bash
uv run python -m yahtzee_rl.play_cli \
  --checkpoint checkpoints/win_loss_margin_32simsrun4/step_011800 \
  --num-simulations 32
```

## Evaluation

Best competitive checkpoint versus heuristic:

```text
Agent A: mcts@step_11800
Agent B: heuristic
games: 512
A win: 0.811 | B win: 0.186 | draw: 0.004
mean score A: 204.28 | B: 164.48 | margin: 39.80
```

Search versus the same checkpoint's greedy policy head:

```text
mcts@step_11800 vs greedy@step_11800, 512 games, sims=64:
A win 0.494 | B win 0.492 | draw 0.014 | margin +0.96

mcts@step_11800 vs greedy@step_11800, 512 games, sims=128:
A win 0.525 | B win 0.471 | draw 0.004 | margin +0.26

mcts@step_11800 vs greedy@step_11800, 512 games, sims=256:
A win 0.508 | B win 0.486 | draw 0.006 | margin -0.58
```

Checkpoint regression check:

```text
Agent A: greedy@step_11800
Agent B: greedy@step_12800
games: 512
A win: 0.553 | B win: 0.436 | draw: 0.012
mean score A: 209.03 | B: 201.90 | margin: 7.13
```

These results suggest that `step_011800` is the best competitive checkpoint from the observed runs, and that the greedy policy has already absorbed most of the useful shallow/medium search behavior.

## Training Objective

The checkpoint was trained for two-player competitive play with margin-shaped terminal rewards:

```text
sign(score_margin) * (1 - 0.25) + 0.25 * tanh(score_margin / 50)
```

Training uses self-play, replay-buffer minibatches, and policy/value targets derived from MCTS search.

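
To make the reward formula above concrete, here is a minimal Python sketch of it. The function name `margin_shaped_reward` and the parameter names `shaping_weight` and `scale` are illustrative assumptions, not identifiers from the YahtzeeRL code; only the constants 0.25 and 50 come from the formula.

```python
import math


def margin_shaped_reward(score_margin, shaping_weight=0.25, scale=50.0):
    """Margin-shaped terminal reward, following the formula in this card.

    Names are illustrative; the defaults 0.25 and 50 are the constants
    from the formula.
    """
    # sign(score_margin): +1 for a win, -1 for a loss, 0 for a draw
    sign = (score_margin > 0) - (score_margin < 0)
    win_loss_term = sign * (1.0 - shaping_weight)
    margin_term = shaping_weight * math.tanh(score_margin / scale)
    return win_loss_term + margin_term


if __name__ == "__main__":
    for margin in (-100, -10, 0, 10, 100):
        print(margin, round(margin_shaped_reward(margin), 4))
```

With these constants, any win is worth at least 0.75 and any loss at most -0.75, while the `tanh` term adds a bounded bonus for larger margins. The total stays within [-1, 1], and the outcome of the game always dominates the margin.
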
## Intended Use

This checkpoint is intended for:

- evaluating a trained Yahtzee RL policy against baselines
- playing against the agent locally
- reproducing or extending the YahtzeeRL experiments
- studying a compact JAX/Flax self-play setup with exact dice chance nodes

## Limitations

- This is a competitive head-to-head agent, not a pure score-maximizing Yahtzee solver.
- The environment omits Joker rules and extra Yahtzee bonuses.
- The agent's average score is around the low 200s in the observed evaluations; it was not optimized to maximize solo final score.
- Later training on the same objective produced regressions, so `step_011800` should be preserved as the default checkpoint unless a new run beats it in direct evaluation.

## Future Direction

A separate score-maximizing agent would likely need a different terminal reward, for example:

```text
tanh((own_score - 200) / 50)
```

That should be evaluated by greedy mean score over at least 1k games rather than head-to-head win rate.
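
The proposed reward and the suggested evaluation metric can be sketched as follows. This is an assumption-laden illustration rather than YahtzeeRL code: `score_shaped_reward` and `summarize_scores` are hypothetical names, and the list of final scores is assumed to come from whatever evaluation loop produces them.

```python
import math
from statistics import mean, stdev


def score_shaped_reward(own_score, target=200.0, scale=50.0):
    """Solo-score reward from the formula above: tanh((own_score - 200) / 50)."""
    return math.tanh((own_score - target) / scale)


def summarize_scores(scores):
    """Return (mean, standard error of the mean) for a list of final scores."""
    n = len(scores)
    avg = mean(scores)
    # The standard error shrinks as 1/sqrt(n), which is one reason to
    # prefer a large game count when comparing mean scores.
    stderr = stdev(scores) / math.sqrt(n) if n > 1 else float("nan")
    return avg, stderr


if __name__ == "__main__":
    # Placeholder scores for illustration only, not measured results.
    example_scores = [180, 195, 210, 225, 240]
    print(summarize_scores(example_scores))
    print(score_shaped_reward(250))
```

Reporting the standard error alongside the mean makes it easier to tell whether a difference between two checkpoints exceeds sampling noise.
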