Beasty Bar PPO Agent

A neural network trained with PPO (Proximal Policy Optimization) to play Beasty Bar, a strategic card game where animals compete to enter Heaven (the bar) while avoiding Hell.

GitHub: diegooprime/beastybar

Latest Model

Recommended: v4/final.pt - trained with a diverse opponent pool for robust play.

Performance

Evaluated over 500 games per opponent (playing both sides), with greedy action selection.

Opponent            Win Rate   95% CI
Random              93.4%      [0.91, 0.95]
Defensive           81.0%      [0.77, 0.84]
Heuristic           76.8%      [0.73, 0.80]
Queue               75.6%      [0.72, 0.79]
Skunk               75.6%      [0.72, 0.79]
Noisy               75.6%      [0.72, 0.79]
Aggressive          75.0%      [0.71, 0.79]
Online              70.2%      [0.66, 0.74]
Distilled Outcome   67.4%      [0.63, 0.71]
Outcome Heuristic   66.0%      [0.62, 0.70]

Overall: 75.7% win rate across 5,000 games, ~1379 Elo
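
Intervals like those above can be recomputed from raw win counts. A minimal sketch using the Wilson score interval (whether the table used Wilson or a normal approximation is an assumption):

import math

# Sketch: Wilson 95% interval for a binomial win rate over n games.
def wilson_ci(wins: int, games: int, z: float = 1.96) -> tuple[float, float]:
    p = wins / games
    denom = 1 + z**2 / games
    center = (p + z**2 / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
    return center - half, center + half

print(wilson_ci(467, 500))  # 467/500 = 93.4% vs. Random -> ~(0.909, 0.953)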

Quick Start

import torch
from huggingface_hub import hf_hub_download

# Download the latest model
checkpoint_path = hf_hub_download(
    repo_id="shiptoday101/beastybar-ppo",
    filename="v4/final.pt"
)

# Load the full checkpoint (weights_only=False because it stores
# config metadata alongside the tensors)
checkpoint = torch.load(checkpoint_path, map_location="cpu", weights_only=False)

# Access the network weights
state_dict = checkpoint["model_state_dict"]
config = checkpoint["config"]

print(f"Iteration: {checkpoint['iteration']}")
print(f"Network config: {config['network_config']}")

Full Integration (with game repo)

# Clone the game repo first: git clone https://github.com/diegooprime/beastybar
# Make sure the repo root is on your PYTHONPATH so the imports below resolve.
from _02_agents.neural.network import BeastyBarNetwork
from _02_agents.neural.utils import NetworkConfig
from _03_training.checkpoint_manager import load_for_inference

# Load for inference (smaller footprint)
state_dict, config = load_for_inference("path/to/v4_final.pt")
network = BeastyBarNetwork(NetworkConfig.from_dict(config))
network.load_state_dict(state_dict)
network.eval()
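
With the network loaded, greedy play takes the highest-scoring legal action. A minimal sketch, assuming the network returns (policy logits, value) and that a legal-action mask is available from the game repo (both are assumptions about the API):

import torch

obs = torch.zeros(1, 988)                          # placeholder 988-dim observation
legal_mask = torch.ones(1, 124, dtype=torch.bool)  # placeholder legal-action mask

with torch.no_grad():
    logits, value = network(obs)  # assumed (policy logits, value) output
logits = logits.masked_fill(~legal_mask, float("-inf"))  # rule out illegal actions
action = logits.argmax(dim=-1).item()
print(f"Greedy action: {action}, value estimate: {value.item():+.3f}")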

Architecture

  • Type: Transformer policy-value network
  • Parameters: ~1.3M
  • Input: 988-dimensional observation vector
  • Output: 124-dim action logits + scalar value in [-1, 1]

Network Details

Component            Specification
Hidden dimension     256
Attention heads      8
Transformer layers   4
Species embedding    64-dim
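
The canonical implementation is BeastyBarNetwork in the game repo. For orientation only, here is a sketch of a transformer policy-value network with the dimensions above (the layer structure is an assumption, and the 64-dim species embedding is omitted for brevity):

import torch
import torch.nn as nn

class PolicyValueSketch(nn.Module):
    """Illustrative stand-in for BeastyBarNetwork -- not the actual repo code."""
    def __init__(self, obs_dim=988, hidden=256, heads=8, layers=4, actions=124):
        super().__init__()
        self.embed = nn.Linear(obs_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.policy_head = nn.Linear(hidden, actions)                     # 124 action logits
        self.value_head = nn.Sequential(nn.Linear(hidden, 1), nn.Tanh())  # value in [-1, 1]

    def forward(self, obs):
        h = self.encoder(self.embed(obs).unsqueeze(1)).squeeze(1)
        return self.policy_head(h), self.value_head(h)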

Training Details

  • Algorithm: PPO with GAE (sketched below)
  • Method: Self-play with opponent diversity
  • Hardware: RunPod A100/H200 GPUs
  • Games: ~5M games over 600 iterations
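
The GAE step referenced above computes advantages by walking each trajectory backwards and accumulating discounted TD errors. A minimal sketch (lambda comes from the hyperparameter table below; the discount factor gamma is not listed on this card and is an assumption):

import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # rewards, values, dones: 1-D tensors over one trajectory
    advantages = torch.zeros_like(rewards)
    gae, next_value = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        gae = delta + gamma * lam * gae * nonterminal
        advantages[t] = gae
        next_value = values[t]
    return advantages, advantages + values  # advantages and value targets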

Opponent Pool

Type                         Weight
Current policy (self-play)   60%
Historical checkpoints       20%
Random agent                 10%
Heuristic variants           10%
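
Per game, an opponent can be drawn from this pool by weighted sampling; a minimal sketch (how the training code actually samples is an assumption):

import random

# Pool weights taken from the table above
POOL = {
    "self_play": 0.60,   # current policy
    "historical": 0.20,  # past checkpoints
    "random": 0.10,
    "heuristic": 0.10,
}
opponent = random.choices(list(POOL), weights=list(POOL.values()), k=1)[0]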

PPO Hyperparameters

Parameter             Value
Learning rate         0.0001 (cosine decay)
Clip epsilon          0.2
Value coefficient     0.5
Entropy coefficient   0.04 -> 0.01
GAE lambda            0.95
PPO epochs            4
Minibatch size        2048
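
Plugging these values into the standard PPO clipped surrogate objective gives a loss like the sketch below (variable names and shapes are illustrative, not the repo's code):

import torch

def ppo_loss(new_logp, old_logp, advantages, values, returns, entropy,
             clip_eps=0.2, value_coef=0.5, entropy_coef=0.04):
    ratio = (new_logp - old_logp).exp()  # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = (values - returns).pow(2).mean()
    # entropy_coef is annealed from 0.04 down to 0.01 over training
    return policy_loss + value_coef * value_loss - entropy_coef * entropy.mean()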

Available Checkpoints

Model            Description
v4/final.pt      Latest: 600 iterations, diverse opponent pool
v4/iter_*.pt     Intermediate v4 checkpoints
v3/final.pt      Previous training run
v2/iter_*.pt     v2 checkpoints
v1/iter_074.pt   Early experiment
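
The iter_*.pt entries above are name patterns; the concrete files can be enumerated from the Hub:

from huggingface_hub import list_repo_files

files = list_repo_files("shiptoday101/beastybar-ppo")
print(sorted(f for f in files if f.endswith(".pt")))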

License

MIT
