UTDG MaskablePPO Agent

License: MIT

A trained reinforcement learning agent for the Untitled Tower Defense Game using MaskablePPO.

Model Details

Description

This model is a MaskablePPO (Proximal Policy Optimization with invalid action masking) agent trained on the UTDG (Untitled Tower Defense Game) environment. The agent learns to strategically place and upgrade towers to defend against waves of enemies.

Model Architecture

  • Algorithm: MaskablePPO from sb3-contrib
  • Policy Network: MlpPolicy (Multi-layer Perceptron)
  • Framework: Stable-Baselines3
  • Environment: Custom UTDG Gymnasium environment with action masking

Training Hyperparameters

Parameter | Value
--- | ---
Total Timesteps | 0
Learning Rate | 0.0003
N Steps | 2048
Batch Size | 64
N Epochs | 10
Gamma (γ) | 0.99
GAE Lambda (λ) | 0.95
Clip Range | 0.2
Entropy Coefficient | 0.001
Value Function Coefficient | 0.5
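For reference, the parameters above correspond to the standard Stable-Baselines3 constructor arguments as sketched below (the mapping of names is an inference from the table, not taken from the training script):

```python
# Training hyperparameters from the table above, keyed by the
# Stable-Baselines3 constructor argument names they correspond to.
HYPERPARAMS = {
    "learning_rate": 3e-4,   # optimizer learning rate
    "n_steps": 2048,         # rollout length per environment
    "batch_size": 64,        # minibatch size per gradient step
    "n_epochs": 10,          # optimization epochs per rollout
    "gamma": 0.99,           # discount factor γ
    "gae_lambda": 0.95,      # GAE λ
    "clip_range": 0.2,       # PPO clipping parameter
    "ent_coef": 0.001,       # entropy bonus coefficient
    "vf_coef": 0.5,          # value-function loss coefficient
}

# These would be passed as, e.g.:
# model = MaskablePPO("MlpPolicy", env, **HYPERPARAMS)
```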

Usage

Quick Start

from huggingface_hub import hf_hub_download
from sb3_contrib import MaskablePPO

# Download the model from Hugging Face Hub
model_path = hf_hub_download(
    repo_id="chrisjcc/utdg-maskableppo-policy",
    filename="model_policy_v0.3.5.zip"
)

# Load the trained model
model = MaskablePPO.load(model_path)

Inference with Action Masking

import gymnasium as gym
from huggingface_hub import hf_hub_download
from sb3_contrib import MaskablePPO

# Assuming you have the UTDG environment installed
# from utdg_env import UTDGEnv

# Download and load the model
model_path = hf_hub_download(
    repo_id="chrisjcc/utdg-maskableppo-policy",
    filename="model_policy_v0.3.5.zip"
)
model = MaskablePPO.load(model_path)

# Create environment
env = gym.make("UTDGEnv-v0")
obs, info = env.reset()

# Run inference loop
done = False
total_reward = 0

while not done:
    # Get action mask from environment info
    action_masks = info.get("action_mask", None)

    # Predict action with masking
    action, _states = model.predict(
        obs,
        action_masks=action_masks,
        deterministic=True  # Set False for stochastic behavior
    )

    # Step environment
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    total_reward += reward

print(f"Episode reward: {total_reward}")
env.close()

Load Specific Revision

from huggingface_hub import hf_hub_download
from sb3_contrib import MaskablePPO

# Download the checkpoint from a specific branch/revision, then load it
# (MaskablePPO.load expects a local path, so the revision is selected
# at download time)
model_path = hf_hub_download(
    repo_id="chrisjcc/utdg-maskableppo-policy",
    filename="model_policy_v0.3.5.zip",
    revision="production"  # or "main", a specific commit hash, etc.
)
model = MaskablePPO.load(model_path)

Environment

UTDG (Untitled Tower Defense Game)

The agent is trained on a custom tower defense environment with the following characteristics:

Observation Space

  • Grid-based game state representation
  • Tower positions and types
  • Enemy positions and health
  • Player resources (gold, lives)
  • Wave information

Action Space

  • Discrete action space with invalid action masking
  • Actions include: place tower, upgrade tower, sell tower, skip turn
  • Action masking prevents invalid actions (e.g., placing towers on occupied tiles)
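A mask over this kind of action space is just a boolean vector with one entry per discrete action. The sketch below is hypothetical: the action ordering and the validity rules are illustrative, not the actual UTDG encoding.

```python
import numpy as np

# Illustrative action ids: 0 = place tower, 1 = upgrade tower,
# 2 = sell tower, 3 = skip turn (not the actual UTDG encoding).
NUM_ACTIONS = 4

def build_action_mask(tile_occupied: bool, has_tower_selected: bool) -> np.ndarray:
    """Return True for currently legal actions, False for invalid ones."""
    mask = np.ones(NUM_ACTIONS, dtype=bool)
    if tile_occupied:
        mask[0] = False  # cannot place a tower on an occupied tile
    if not has_tower_selected:
        mask[1] = False  # cannot upgrade without a selected tower
        mask[2] = False  # cannot sell without a selected tower
    return mask          # skip turn (action 3) is always legal
```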

Reward Structure

  • Positive rewards for defeating enemies
  • Negative rewards for losing lives
  • Bonus rewards for completing waves
  • Efficiency bonuses for resource management
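The reward terms above might combine along the following lines. The weights and helper names here are illustrative placeholders, not the actual UTDG reward function.

```python
# Hypothetical sketch of the per-step reward combining the terms listed
# above; all coefficients are illustrative, not the trained values.
def step_reward(enemies_killed: int, lives_lost: int,
                wave_completed: bool, gold_unspent: int) -> float:
    reward = 1.0 * enemies_killed      # positive reward for defeated enemies
    reward -= 5.0 * lives_lost         # penalty for losing lives
    if wave_completed:
        reward += 10.0                 # bonus for completing the wave
        reward += 0.01 * gold_unspent  # small resource-efficiency bonus
    return reward
```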

Training

Methodology

The model was trained using the MaskablePPO algorithm, which extends standard PPO with support for invalid action masking. This is crucial for the tower defense domain where many actions are contextually invalid (e.g., placing a tower on an occupied cell).
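The core idea behind invalid action masking can be shown in a few lines: the logits of invalid actions are pushed to negative infinity before the softmax, so the policy assigns them zero probability. This is a minimal NumPy sketch of that mechanism, not MaskablePPO's internal implementation.

```python
import numpy as np

def masked_softmax(logits: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Softmax over logits where masked-out (False) actions get zero probability."""
    masked_logits = np.where(mask, logits, -np.inf)
    z = masked_logits - masked_logits.max()  # shift for numerical stability
    probs = np.exp(z)                        # exp(-inf) -> 0 for invalid actions
    return probs / probs.sum()
```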

Key Features

  1. Action Masking: Prevents the agent from selecting invalid actions, improving sample efficiency
  2. Curriculum Learning: Progressive difficulty increase through wave complexity
  3. Reward Shaping: Carefully designed reward function to encourage strategic play

Repository Contents

File | Description
--- | ---
model_policy_v0.3.5.zip | Trained MaskablePPO model checkpoint (SB3 format)
README.md | This model card with full documentation
config.yaml | Hydra configuration snapshot (if included)

Limitations and Intended Use

Intended Use

  • Research and experimentation with RL agents in game environments
  • Baseline comparisons for tower defense AI development
  • Educational purposes for understanding action-masked RL

Limitations

  • Trained on a specific map configuration; may not generalize to significantly different layouts
  • Performance may vary with different enemy compositions not seen during training
  • Requires the UTDG environment to be installed for inference

Ethical Considerations

This model is designed for entertainment and research purposes in a game simulation context.

Citation

If you use this model in your research, please cite:

@misc{utdg-maskableppo,
  author = {Chris Cadonic},
  title = {UTDG MaskablePPO Agent},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/chrisjcc/utdg-maskableppo-policy}}
}


Generated on 2025-12-27T18:33:32.663706 UTC
