Beasty Bar PPO Agent

A neural network trained with PPO (Proximal Policy Optimization) to play Beasty Bar, a strategic card game where animals compete to enter Heaven (the bar) while avoiding Hell.

GitHub: diegooprime/beastybar

Latest Model

Recommended: v4/final.pt - trained with a diverse opponent pool for robust play.

Performance

Evaluated over 500 games per opponent (playing both sides), with greedy action selection.

Opponent            Win Rate   95% CI
Random              93.4%      [0.91, 0.95]
Defensive           81.0%      [0.77, 0.84]
Heuristic           76.8%      [0.73, 0.80]
Queue               75.6%      [0.72, 0.79]
Skunk               75.6%      [0.72, 0.79]
Noisy               75.6%      [0.72, 0.79]
Aggressive          75.0%      [0.71, 0.79]
Online              70.2%      [0.66, 0.74]
Distilled Outcome   67.4%      [0.63, 0.71]
Outcome Heuristic   66.0%      [0.62, 0.70]

Overall: 75.7% win rate across 5,000 games, ~1379 Elo
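
Intervals like those above can be recomputed from raw win counts. A minimal sketch using the Wilson score interval (whether the table used Wilson or a normal approximation is an assumption):

import math

# Sketch: Wilson 95% interval for a binomial win rate over n games.
def wilson_ci(wins: int, games: int, z: float = 1.96) -> tuple[float, float]:
    p = wins / games
    denom = 1 + z**2 / games
    center = (p + z**2 / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
    return center - half, center + half

print(wilson_ci(467, 500))  # 467/500 = 93.4% vs. Random -> ~(0.909, 0.953)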

Quick Start

import torch
from huggingface_hub import hf_hub_download

# Download the latest model
checkpoint_path = hf_hub_download(
    repo_id="shiptoday101/beastybar-ppo",
    filename="v4/final.pt"
)

# Load the full checkpoint (weights_only=False because it stores
# config metadata alongside the tensors)
checkpoint = torch.load(checkpoint_path, map_location="cpu", weights_only=False)

# Access the network weights
state_dict = checkpoint["model_state_dict"]
config = checkpoint["config"]

print(f"Iteration: {checkpoint['iteration']}")
print(f"Network config: {config['network_config']}")

Full Integration (with game repo)

# Clone the game repo first: git clone https://github.com/diegooprime/beastybar
# Make sure the repo root is on your PYTHONPATH so the imports below resolve.
from _02_agents.neural.network import BeastyBarNetwork
from _02_agents.neural.utils import NetworkConfig
from _03_training.checkpoint_manager import load_for_inference

# Load for inference (smaller footprint)
state_dict, config = load_for_inference("path/to/v4_final.pt")
network = BeastyBarNetwork(NetworkConfig.from_dict(config))
network.load_state_dict(state_dict)
network.eval()
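
With the network loaded, greedy play takes the highest-scoring legal action. A minimal sketch, assuming the network returns (policy logits, value) and that a legal-action mask is available from the game repo (both are assumptions about the API):

import torch

obs = torch.zeros(1, 988)                          # placeholder 988-dim observation
legal_mask = torch.ones(1, 124, dtype=torch.bool)  # placeholder legal-action mask

with torch.no_grad():
    logits, value = network(obs)  # assumed (policy logits, value) output
logits = logits.masked_fill(~legal_mask, float("-inf"))  # rule out illegal actions
action = logits.argmax(dim=-1).item()
print(f"Greedy action: {action}, value estimate: {value.item():+.3f}")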

Architecture

  • Type: Transformer policy-value network
  • Parameters: ~1.3M
  • Input: 988-dimensional observation vector
  • Output: 124-dim action logits + scalar value in [-1, 1]

Network Details

Component            Specification
Hidden dimension     256
Attention heads      8
Transformer layers   4
Species embedding    64-dim
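
The canonical implementation is BeastyBarNetwork in the game repo. For orientation only, here is a sketch of a transformer policy-value network with the dimensions above (the layer structure is an assumption, and the 64-dim species embedding is omitted for brevity):

import torch
import torch.nn as nn

class PolicyValueSketch(nn.Module):
    """Illustrative stand-in for BeastyBarNetwork -- not the actual repo code."""
    def __init__(self, obs_dim=988, hidden=256, heads=8, layers=4, actions=124):
        super().__init__()
        self.embed = nn.Linear(obs_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.policy_head = nn.Linear(hidden, actions)                     # 124 action logits
        self.value_head = nn.Sequential(nn.Linear(hidden, 1), nn.Tanh())  # value in [-1, 1]

    def forward(self, obs):
        h = self.encoder(self.embed(obs).unsqueeze(1)).squeeze(1)
        return self.policy_head(h), self.value_head(h)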

Training Details

  • Algorithm: PPO with GAE (sketched below)
  • Method: Self-play with opponent diversity
  • Hardware: RunPod A100/H200 GPUs
  • Games: ~5M games over 600 iterations
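
The GAE step referenced above computes advantages by walking each trajectory backwards and accumulating discounted TD errors. A minimal sketch (lambda comes from the hyperparameter table below; the discount factor gamma is not listed on this card and is an assumption):

import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # rewards, values, dones: 1-D tensors over one trajectory
    advantages = torch.zeros_like(rewards)
    gae, next_value = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        gae = delta + gamma * lam * gae * nonterminal
        advantages[t] = gae
        next_value = values[t]
    return advantages, advantages + values  # advantages and value targets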

Opponent Pool

Type                         Weight
Current policy (self-play)   60%
Historical checkpoints       20%
Random agent                 10%
Heuristic variants           10%
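
Per game, an opponent can be drawn from this pool by weighted sampling; a minimal sketch (how the training code actually samples is an assumption):

import random

# Pool weights taken from the table above
POOL = {
    "self_play": 0.60,   # current policy
    "historical": 0.20,  # past checkpoints
    "random": 0.10,
    "heuristic": 0.10,
}
opponent = random.choices(list(POOL), weights=list(POOL.values()), k=1)[0]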

PPO Hyperparameters

Parameter             Value
Learning rate         0.0001 (cosine decay)
Clip epsilon          0.2
Value coefficient     0.5
Entropy coefficient   0.04 -> 0.01
GAE lambda            0.95
PPO epochs            4
Minibatch size        2048
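
Plugging these values into the standard PPO clipped surrogate objective gives a loss like the sketch below (variable names and shapes are illustrative, not the repo's code):

import torch

def ppo_loss(new_logp, old_logp, advantages, values, returns, entropy,
             clip_eps=0.2, value_coef=0.5, entropy_coef=0.04):
    ratio = (new_logp - old_logp).exp()  # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = (values - returns).pow(2).mean()
    # entropy_coef is annealed from 0.04 down to 0.01 over training
    return policy_loss + value_coef * value_loss - entropy_coef * entropy.mean()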

Available Checkpoints

Model            Description
v4/final.pt      Latest: 600 iterations, diverse opponent pool
v4/iter_*.pt     Intermediate v4 checkpoints
v3/final.pt      Previous training run
v2/iter_*.pt     v2 checkpoints
v1/iter_074.pt   Early experiment
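
The iter_*.pt entries above are name patterns; the concrete files can be enumerated from the Hub:

from huggingface_hub import list_repo_files

files = list_repo_files("shiptoday101/beastybar-ppo")
print(sorted(f for f in files if f.endswith(".pt")))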

License

MIT
