	---
	license: mit
	library_name: flax
	tags:
	- reinforcement-learning
	- jax
	- flax
	- mcts
	- board-games
	- yahtzee
	pipeline_tag: reinforcement-learning
	---

	# YahtzeeRL Checkpoints

	This repository hosts trained checkpoints for
	[YahtzeeRL](https://github.com/itxtx/yahtzeeRL), a JAX/Flax/RLax self-play
	Yahtzee agent using Stochastic MuZero-style MCTS.

	The current published checkpoint is a competitive two-player, head-to-head
	agent trained on simplified Yahtzee scoring: the 13 standard categories plus
	the upper-section bonus, without Joker rules or extra Yahtzee bonuses.

	## Best Checkpoint

	The best competitive checkpoint is:

	```text
	win_loss_margin_32simsrun4/step_011800
	```

	This checkpoint should be treated as the default competitive agent. A later
	checkpoint, `step_012800`, lost to it in a direct greedy-policy comparison
	(see Evaluation).

	## Download

	Install the Hugging Face CLI, then download the checkpoint folder into the
	project's expected local checkpoint path:

	```bash
	hf download itxtx/yahtzee-rl-checkpoints \
	  --local-dir checkpoints/win_loss_margin_32simsrun4
	```

	The checkpoint directory includes the run-level `config.json` plus Orbax
	checkpoint folders such as `step_011800/`. The YahtzeeRL loader needs both the
	step folder and the adjacent `config.json`.
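
Because the loader expects the step folder and `config.json` side by side, a quick pre-flight check can catch a partial download. The helper below is a hypothetical sketch and is not part of YahtzeeRL: the function name is invented for illustration, and it only inspects the filesystem under the layout described above.

```python
from pathlib import Path


def missing_checkpoint_parts(step_dir: str) -> list[str]:
    """Return the paths that are missing for a loadable checkpoint.

    Hypothetical helper (not part of YahtzeeRL). It assumes the layout
    described above: <run_dir>/config.json next to <run_dir>/step_XXXXXX/.
    """
    step_path = Path(step_dir)
    config_path = step_path.parent / "config.json"
    missing = []
    if not step_path.is_dir():
        missing.append(str(step_path))
    if not config_path.is_file():
        missing.append(str(config_path))
    return missing


if __name__ == "__main__":
    problems = missing_checkpoint_parts(
        "checkpoints/win_loss_margin_32simsrun4/step_011800"
    )
    print("ok" if not problems else f"missing: {problems}")
```

An empty list means both pieces are present. The check does not validate the Orbax contents themselves.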

	## Usage

	Clone and install the code:

	```bash
	git clone https://github.com/itxtx/yahtzeeRL.git
	cd yahtzeeRL
	uv sync
	```

	Evaluate the checkpoint against the hand-written heuristic baseline:

	```bash
	uv run python -m yahtzee_rl.evaluate \
	  --agent-a mcts \
	  --checkpoint-a checkpoints/win_loss_margin_32simsrun4/step_011800 \
	  --agent-b heuristic \
	  --num-games 512 \
	  --sims-a 32
	```

	Play against the checkpoint:

	```bash
	uv run python -m yahtzee_rl.play_cli \
	  --checkpoint checkpoints/win_loss_margin_32simsrun4/step_011800 \
	  --num-simulations 32
	```

	## Evaluation

	Best competitive checkpoint versus heuristic:

	```text
	Agent A: mcts@step_11800
	Agent B: heuristic
	games: 512
	A win: 0.811 | B win: 0.186 | draw: 0.004
	mean score A: 204.28 | B: 164.48 | margin: 39.80
	```

	Search versus the same checkpoint's greedy policy head:

	```text
	mcts@step_11800 vs greedy@step_11800, 512 games, sims=64:
	A win 0.494 | B win 0.492 | draw 0.014 | margin +0.96

	mcts@step_11800 vs greedy@step_11800, 512 games, sims=128:
	A win 0.525 | B win 0.471 | draw 0.004 | margin +0.26

	mcts@step_11800 vs greedy@step_11800, 512 games, sims=256:
	A win 0.508 | B win 0.486 | draw 0.006 | margin -0.58
	```

	Checkpoint regression check:

	```text
	Agent A: greedy@step_11800
	Agent B: greedy@step_12800
	games: 512
	A win: 0.553 | B win: 0.436 | draw: 0.012
	mean score A: 209.03 | B: 201.90 | margin: 7.13
	```

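	These results suggest that `step_011800` is the best competitive checkpoint
	among the observed runs. They also suggest that the greedy policy has already
	absorbed most of the useful shallow/medium search behavior: with 64 to 256
	simulations, search wins between 49% and 53% of games against the greedy
	policy head, and the mean score margin stays within one point of zero.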
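
To judge how much weight the search-versus-greedy gaps can bear, it helps to estimate the sampling noise of a 512-game match. The sketch below uses a plain normal approximation to the binomial standard error. It is a back-of-the-envelope check, not part of the YahtzeeRL evaluation code, and it assumes games are independent and counts draws as non-wins.

```python
import math


def win_rate_interval(win_rate: float, num_games: int, z: float = 1.96):
    """Approximate 95% interval for a win rate (normal approximation).

    Assumes independent games; draws simply count as non-wins.
    """
    stderr = math.sqrt(win_rate * (1.0 - win_rate) / num_games)
    return win_rate - z * stderr, win_rate + z * stderr


if __name__ == "__main__":
    # Agent A win rates copied from the evaluation logs above.
    for label, rate in [
        ("mcts vs heuristic", 0.811),
        ("mcts (64 sims) vs greedy", 0.494),
        ("mcts (128 sims) vs greedy", 0.525),
        ("mcts (256 sims) vs greedy", 0.508),
        ("greedy 11800 vs greedy 12800", 0.553),
    ]:
        low, high = win_rate_interval(rate, 512)
        print(f"{label}: {rate:.3f} in [{low:.3f}, {high:.3f}]")
```

Under this approximation the half-width at 512 games is roughly 0.04, so all three search-versus-greedy win rates are compatible with 0.5, while the 0.811 against the heuristic lies far outside that range.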

	## Training Objective

	The checkpoint was trained for two-player competitive play with margin-shaped
	terminal rewards:

	```text
	sign(score_margin) * (1 - 0.25) + 0.25 * tanh(score_margin / 50)
	```

	Training uses self-play, replay-buffer minibatches, and policy/value targets
	derived from MCTS search.
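
The reward formula above can be written as a small function to see how the win/loss term and the margin term combine. This is a minimal sketch in plain Python, not the YahtzeeRL implementation: the parameter names `shaping` and `scale` are illustrative, and `score_margin` is assumed to be the player's own final score minus the opponent's.

```python
import math


def margin_shaped_reward(score_margin: float, shaping: float = 0.25,
                         scale: float = 50.0) -> float:
    """Terminal reward from the formula above, from one player's perspective.

    Assumption: score_margin = own final score - opponent final score.
    `shaping` and `scale` are illustrative names for the constants 0.25 and 50.
    """
    sign = (score_margin > 0) - (score_margin < 0)  # -1, 0, or +1
    return sign * (1.0 - shaping) + shaping * math.tanh(score_margin / scale)


if __name__ == "__main__":
    for margin in (-100, -1, 0, 1, 40, 100):
        print(f"margin {margin:+4d} -> reward {margin_shaped_reward(margin):+.3f}")
```

A one-point win already earns about 0.75, and the remaining 0.25 saturates as the margin grows, so the reward stays within [-1, 1] and is antisymmetric between the two players.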

	## Intended Use

	This checkpoint is intended for:

	- evaluating a trained Yahtzee RL policy against baselines
	- playing against the agent locally
	- reproducing or extending the YahtzeeRL experiments
	- studying a compact JAX/Flax self-play setup with exact dice chance nodes

	## Limitations

	- This is a competitive head-to-head agent, not a pure score-maximizing Yahtzee
	solver.
	- The environment omits Joker rules and extra Yahtzee bonuses.
	- The agent's mean score is in the low 200s in the observed evaluations (about
	204 to 209); it was not optimized to maximize solo final score.
	- Later training on the same objective produced regressions, so `step_011800`
	should be preserved as the default checkpoint unless a new run beats it in
	direct evaluation.

	## Future Direction

	A separate score-maximizing agent would likely need a different terminal reward,
	for example:

	```text
	tanh((own_score - 200) / 50)
	```

	Such an agent should be evaluated by greedy mean score over at least 1,000
	games rather than by head-to-head win rate.
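
As a rough illustration of this direction, the sketch below pairs the suggested score-based reward with a mean-score summary that reports a standard error, which is what makes the number of games matter. Both functions are hypothetical and not part of YahtzeeRL; `target` and `scale` are illustrative names for the constants 200 and 50, and the scores in the usage lines are placeholders, not evaluation results.

```python
import math
import statistics


def solo_score_reward(own_score: float, target: float = 200.0,
                      scale: float = 50.0) -> float:
    """Score-based terminal reward from the formula above (illustrative names)."""
    return math.tanh((own_score - target) / scale)


def mean_with_stderr(scores):
    """Mean final score and its standard error over a list of game scores."""
    mean = statistics.fmean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


if __name__ == "__main__":
    # Placeholder scores for demonstration only, not measured results.
    placeholder_scores = [150.0, 200.0, 250.0]
    print([round(solo_score_reward(s), 3) for s in placeholder_scores])
    print(mean_with_stderr(placeholder_scores))
```

The standard error shrinks with the square root of the number of games, so quadrupling the game count halves the uncertainty on the mean score.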