Add OpenRA-Bench leaderboard, evaluation harness, and rubrics
- Gradio leaderboard app with agent ranking, type/opponent filters, search
- OpenEnv-compatible rubrics (win/loss, military efficiency, economy)
- CLI evaluation harness for running N-game benchmarks
- Seed data with ScriptedBot and LLM-Agent baselines
- GitHub Actions workflow for HuggingFace Space sync
- HF Space-compatible README with YAML frontmatter
- .github/workflows/sync-to-hf.yml +40 -0
- .gitignore +9 -0
- README.md +69 -0
- app.py +305 -0
- data/results.csv +6 -0
- data/schema.md +17 -0
- evaluate.py +261 -0
- requirements.txt +3 -0
- rubrics.py +152 -0
.github/workflows/sync-to-hf.yml (ADDED, +40)

```yaml
name: Sync to HuggingFace Space

on:
  push:
    branches:
      - main
    paths:
      - 'app.py'
      - 'requirements.txt'
      - 'data/**'
      - 'README.md'

jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
          lfs: true

      - name: Push to HuggingFace Hub
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: |
          # Only sync if HF_TOKEN is configured
          if [ -z "$HF_TOKEN" ]; then
            echo "HF_TOKEN not set, skipping sync"
            exit 0
          fi

          git config user.name "GitHub Actions"
          git config user.email "actions@github.com"

          # Add HF remote and push
          git remote add hf https://huggingface.co/spaces/yxc20089/OpenRA-Bench || true
          git remote set-url hf https://x-access-token:${HF_TOKEN}@huggingface.co/spaces/yxc20089/OpenRA-Bench

          # Push main branch to HF
          git push hf main --force
```
.gitignore (ADDED, +9)

```gitignore
__pycache__/
*.pyc
*.egg-info/
dist/
build/
.env
.venv/
*.orarep
flagged/
```
README.md (ADDED, +69)

````markdown
---
title: OpenRA-Bench
emoji: 🎮
colorFrom: red
colorTo: blue
sdk: gradio
sdk_version: "5.12.0"
app_file: app.py
pinned: true
license: gpl-3.0
---

# OpenRA-Bench

Standardized benchmark and leaderboard for AI agents playing Red Alert through [OpenRA-RL](https://openra-rl.dev).

## Features

- **Leaderboard**: Ranked agent comparison with composite scoring
- **Filtering**: By agent type (Scripted/LLM/RL) and opponent difficulty
- **Evaluation harness**: Automated N-game benchmarking with metrics collection
- **OpenEnv rubrics**: Composable scoring (win/loss, military efficiency, economy)
- **Replay verification**: Replay files linked to leaderboard entries

## Quick Start

### View the leaderboard

```bash
pip install -r requirements.txt
python app.py
# Opens at http://localhost:7860
```

### Run an evaluation

```bash
# Start OpenRA-RL server
cd /path/to/OpenRA-RL
docker compose up openra-rl

# Run benchmark
cd /path/to/OpenRA-Bench
python evaluate.py \
    --agent scripted \
    --agent-name "MyBot-v1" \
    --opponent Normal \
    --games 10
```

### Submit results

1. Fork this repo
2. Run evaluation (appends to `data/results.csv`)
3. Open a PR with your results

## Scoring

| Component | Weight | Description |
|-----------|--------|-------------|
| Win Rate | 50% | Games won / total games |
| Military Efficiency | 25% | Kill/death cost ratio (normalized) |
| Economy | 25% | Final asset value (normalized) |

## Links

- [OpenRA-RL Documentation](https://openra-rl.dev)
- [OpenRA-RL GitHub](https://github.com/yxc20089/OpenRA-RL)
- [OpenEnv Framework](https://huggingface.co/openenv)
````
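The scoring table above weights win rate at 50% and the two normalized components at 25% each. A minimal standalone sketch of that composite score (the `composite_score` helper is hypothetical; the normalizations mirror the ones applied in `evaluate.py`):

```python
def composite_score(win_rate: float, kd_ratio: float, economy: float) -> float:
    """Composite score: 50% win rate + 25% normalized K/D + 25% normalized economy.

    win_rate is a fraction in [0, 1]; kd_ratio and economy are non-negative.
    """
    kd_norm = kd_ratio / (kd_ratio + 1)        # maps [0, inf) onto [0, 1)
    econ_norm = economy / (economy + 10_000)   # saturates around typical asset values
    return 100.0 * (0.5 * win_rate + 0.25 * kd_norm + 0.25 * econ_norm)


# An agent winning 60% of games with K/D 1.29 and ~8200 average assets
score = round(composite_score(0.6, 1.29, 8200), 1)
```

Because both normalizations are bounded below 1, the score of a non-winning agent can never reach 50, which keeps win rate the dominant ranking signal.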
app.py (ADDED, +305)

````python
"""OpenRA-Bench: Agent Leaderboard for OpenRA-RL.

A Gradio app that displays agent rankings, supports filtering by type
and opponent difficulty, and provides submission instructions.

Run locally:
    python app.py

Deploy on HuggingFace Spaces:
    Push app.py, requirements.txt, data/, and README.md to your HF Space.
"""

from pathlib import Path

import gradio as gr
import pandas as pd

# ── Data Loading ──────────────────────────────────────────────────────────────

DATA_PATH = Path(__file__).parent / "data" / "results.csv"

AGENT_TYPE_COLORS = {
    "Scripted": "#ffcd75",  # Gold
    "LLM": "#7497db",       # Blue
    "RL": "#75809c",        # Gray-blue
}

DISPLAY_COLUMNS = [
    "Rank",
    "Agent",
    "Type",
    "Opponent",
    "Games",
    "Win Rate (%)",
    "Score",
    "K/D Ratio",
    "Avg Kills",
    "Avg Deaths",
    "Avg Economy",
    "Avg Game Length",
    "Date",
]


def load_data() -> pd.DataFrame:
    """Load leaderboard data from CSV."""
    if not DATA_PATH.exists():
        return pd.DataFrame(columns=DISPLAY_COLUMNS)

    df = pd.read_csv(DATA_PATH)
    df = df.sort_values("score", ascending=False).reset_index(drop=True)
    df.insert(0, "Rank", range(1, len(df) + 1))

    # Rename for display
    df = df.rename(columns={
        "agent_name": "Agent",
        "agent_type": "Type",
        "opponent": "Opponent",
        "games": "Games",
        "win_rate": "Win Rate (%)",
        "score": "Score",
        "kd_ratio": "K/D Ratio",
        "avg_kills": "Avg Kills",
        "avg_deaths": "Avg Deaths",
        "avg_economy": "Avg Economy",
        "avg_game_length": "Avg Game Length",
        "timestamp": "Date",
    })

    return df[DISPLAY_COLUMNS]


def add_type_badges(df: pd.DataFrame) -> pd.DataFrame:
    """Add color-coded HTML badges to the Type column."""
    def badge(agent_type: str) -> str:
        color = AGENT_TYPE_COLORS.get(agent_type, "#ccc")
        text_color = "#fff" if agent_type != "Scripted" else "#333"
        return (
            f'<span style="background:{color};color:{text_color};'
            f'padding:2px 8px;border-radius:4px;font-size:0.85em">'
            f"{agent_type}</span>"
        )

    df = df.copy()
    df["Type"] = df["Type"].apply(badge)
    return df


# ── Filtering ─────────────────────────────────────────────────────────────────


def filter_leaderboard(
    search: str,
    agent_types: list[str],
    opponent: str,
) -> pd.DataFrame:
    """Filter leaderboard by search, agent type, and opponent."""
    df = load_data()

    # Filter by agent type
    if agent_types:
        df = df[df["Type"].isin(agent_types)]

    # Filter by opponent
    if opponent and opponent != "All":
        df = df[df["Opponent"] == opponent]

    # Search by agent name (regex)
    if search and search.strip():
        patterns = [p.strip() for p in search.split(",") if p.strip()]
        mask = pd.Series([False] * len(df), index=df.index)
        for pattern in patterns:
            mask |= df["Agent"].str.contains(pattern, case=False, regex=True, na=False)
        df = df[mask]

    # Re-rank after filtering
    df = df.reset_index(drop=True)
    df["Rank"] = range(1, len(df) + 1)

    return add_type_badges(df)


# ── UI ────────────────────────────────────────────────────────────────────────

ABOUT_MD = """
## What is OpenRA-Bench?

**OpenRA-Bench** is a standardized benchmark for evaluating AI agents playing
[Red Alert](https://www.openra.net/) through the
[OpenRA-RL](https://openra-rl.dev) environment.

### Evaluation Protocol

- **Game**: Red Alert (OpenRA engine)
- **Format**: 1v1 agent vs built-in AI
- **Opponents**: Easy, Normal, Hard difficulty
- **Games per entry**: Minimum 10 games per configuration
- **Metrics**: Win rate, composite score, K/D ratio, economy

### Composite Score

The benchmark score combines three components:

| Component | Weight | Description |
|-----------|--------|-------------|
| Win Rate | 50% | Percentage of games won |
| Military Efficiency | 25% | Kill/death cost ratio (normalized) |
| Economy | 25% | Final asset value (normalized) |

### Agent Types

- **Scripted**: Rule-based bots with hardcoded strategies
- **LLM**: Language model agents (Claude, GPT, etc.)
- **RL**: Reinforcement learning policies (PPO, SAC, etc.)

### Links

- [OpenRA-RL Documentation](https://openra-rl.dev)
- [GitHub Repository](https://github.com/yxc20089/OpenRA-RL)
- [OpenRA-Bench Source](https://github.com/yxc20089/OpenRA-Bench)
- [OpenEnv Framework](https://huggingface.co/openenv)
"""

SUBMIT_MD = """
## How to Submit Results

### 1. Set up the environment

```bash
git clone --recursive https://github.com/yxc20089/OpenRA-RL.git
cd OpenRA-RL
pip install -e .
docker compose up openra-rl
```

### 2. Run the evaluation

```bash
cd /path/to/OpenRA-Bench

python evaluate.py \\
    --agent scripted \\
    --agent-name "MyBot-v1" \\
    --agent-type Scripted \\
    --opponent Normal \\
    --games 10 \\
    --server http://localhost:8000
```

### 3. Submit via Pull Request

1. Fork [OpenRA-Bench](https://github.com/yxc20089/OpenRA-Bench)
2. Run the evaluation (results append to `data/results.csv`)
3. Commit and open a PR with:
   - Your updated `data/results.csv`
   - A description of your agent
   - (Optional) Replay files in `replays/`

### Evaluation Parameters

| Parameter | Description |
|-----------|-------------|
| `--agent` | Agent type: `scripted`, `llm`, `mcp`, `custom` |
| `--agent-name` | Display name on the leaderboard |
| `--agent-type` | Category: `Scripted`, `LLM`, `RL` |
| `--opponent` | AI difficulty: `Easy`, `Normal`, `Hard` |
| `--games` | Number of games (minimum 10) |
| `--server` | OpenRA-RL server URL |

### Custom Agents

For custom agents, implement the standard `reset/step` loop:

```python
from openra_env.client import OpenRAEnv
from openra_env.models import OpenRAAction

async with OpenRAEnv("http://localhost:8000") as env:
    obs = await env.reset()
    while not obs.done:
        action = your_agent.decide(obs)
        obs = await env.step(action)
```

Then run `evaluate.py --agent custom` with your agent integrated.
"""


def build_app() -> gr.Blocks:
    """Build the Gradio leaderboard app."""
    initial_df = add_type_badges(load_data())

    with gr.Blocks(title="OpenRA-Bench") as app:
        gr.Markdown(
            "# OpenRA-Bench\n"
            "**Agent Leaderboard for OpenRA-RL** — "
            "Train AI to Play Real-Time Strategy"
        )

        with gr.Tabs():
            # ── Leaderboard Tab ───────────────────────────────────────────
            with gr.Tab("Leaderboard"):
                with gr.Row():
                    search_box = gr.Textbox(
                        label="Search agents",
                        placeholder="Search by name (supports regex, comma-separated)...",
                        scale=3,
                    )
                    type_filter = gr.CheckboxGroup(
                        choices=["Scripted", "LLM", "RL"],
                        value=["Scripted", "LLM", "RL"],
                        label="Agent Type",
                        scale=2,
                    )
                    opponent_filter = gr.Dropdown(
                        choices=["All", "Easy", "Normal", "Hard"],
                        value="All",
                        label="Opponent",
                        scale=1,
                    )

                leaderboard = gr.Dataframe(
                    value=initial_df,
                    datatype=[
                        "number",  # Rank
                        "str",     # Agent
                        "html",    # Type (badge)
                        "str",     # Opponent
                        "number",  # Games
                        "number",  # Win Rate
                        "number",  # Score
                        "number",  # K/D Ratio
                        "number",  # Avg Kills
                        "number",  # Avg Deaths
                        "number",  # Avg Economy
                        "number",  # Avg Game Length
                        "str",     # Date
                    ],
                    interactive=False,
                    show_label=False,
                )

                # Wire up filters
                for component in [search_box, type_filter, opponent_filter]:
                    component.change(
                        fn=filter_leaderboard,
                        inputs=[search_box, type_filter, opponent_filter],
                        outputs=leaderboard,
                    )

            # ── About Tab ─────────────────────────────────────────────────
            with gr.Tab("About"):
                gr.Markdown(ABOUT_MD)

            # ── Submit Tab ────────────────────────────────────────────────
            with gr.Tab("Submit"):
                gr.Markdown(SUBMIT_MD)

    return app


if __name__ == "__main__":
    app = build_app()
    app.launch()
````
data/results.csv (ADDED, +6)

```csv
agent_name,agent_type,opponent,games,win_rate,score,avg_kills,avg_deaths,kd_ratio,avg_economy,avg_game_length,timestamp,replay_url
ScriptedBot-v1,Scripted,Easy,10,90.0,72.5,8450,2100,4.02,12500,1850,2026-02-19,
ScriptedBot-v1,Scripted,Normal,10,60.0,52.3,6200,4800,1.29,8200,2400,2026-02-19,
ScriptedBot-v1,Scripted,Hard,10,20.0,28.1,3100,7200,0.43,4500,1600,2026-02-19,
LLM-Agent-v1,LLM,Easy,10,80.0,65.8,7200,3400,2.12,11000,2200,2026-02-19,
LLM-Agent-v1,LLM,Normal,10,50.0,48.7,5800,5200,1.12,7800,2800,2026-02-19,
```
data/schema.md (ADDED, +17)

```markdown
# Results CSV Schema

| Column | Type | Description |
|--------|------|-------------|
| `agent_name` | str | Agent identifier displayed on leaderboard |
| `agent_type` | str | Category: "Scripted", "LLM", or "RL" |
| `opponent` | str | AI difficulty: "Easy", "Normal", or "Hard" |
| `games` | int | Number of games played (minimum 10) |
| `win_rate` | float | Win percentage (0.0 - 100.0) |
| `score` | float | Composite benchmark score (0.0 - 100.0) |
| `avg_kills` | float | Average enemy cost destroyed per game |
| `avg_deaths` | float | Average own cost lost per game |
| `kd_ratio` | float | Average kills_cost / deaths_cost ratio |
| `avg_economy` | float | Average final assets_value per game |
| `avg_game_length` | int | Average game duration in ticks |
| `timestamp` | str | Evaluation date (ISO 8601, YYYY-MM-DD) |
| `replay_url` | str | URL to replay file(s), empty if none |
```
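Submitted rows can be sanity-checked against this schema before merging a PR. A minimal sketch (the `validate_row` helper and its exact checks are hypothetical; column names and value ranges follow the table above):

```python
REQUIRED_COLUMNS = [
    "agent_name", "agent_type", "opponent", "games", "win_rate", "score",
    "avg_kills", "avg_deaths", "kd_ratio", "avg_economy", "avg_game_length",
    "timestamp", "replay_url",
]


def validate_row(row: dict) -> list[str]:
    """Return a list of validation errors for one results.csv row."""
    errors = []
    missing = [c for c in REQUIRED_COLUMNS if c not in row]
    if missing:
        errors.append(f"missing columns: {missing}")
        return errors
    if row["agent_type"] not in {"Scripted", "LLM", "RL"}:
        errors.append(f"bad agent_type: {row['agent_type']}")
    if row["opponent"] not in {"Easy", "Normal", "Hard"}:
        errors.append(f"bad opponent: {row['opponent']}")
    if int(row["games"]) < 10:
        errors.append("minimum 10 games per configuration")
    if not 0.0 <= float(row["win_rate"]) <= 100.0:
        errors.append("win_rate must be in [0, 100]")
    return errors
```

Running this over each `csv.DictReader` row would catch malformed submissions before they reach the leaderboard.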
evaluate.py (ADDED, +261)

```python
#!/usr/bin/env python3
"""OpenRA-Bench evaluation harness.

Runs N games of an agent against a built-in AI opponent, collects metrics,
and appends aggregate results to data/results.csv.

Usage:
    # Start the OpenRA-RL server first:
    docker compose up openra-rl

    # Run evaluation:
    python evaluate.py \
        --agent scripted \
        --agent-name "ScriptedBot-v1" \
        --opponent Hard \
        --games 10 \
        --server http://localhost:8000

    # Dry run (validate args without connecting):
    python evaluate.py --dry-run --agent-name "Test" --games 5
"""

import argparse
import asyncio
import csv
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List

# Evaluation results file
RESULTS_FILE = Path(__file__).parent / "data" / "results.csv"

RESULTS_COLUMNS = [
    "agent_name",
    "agent_type",
    "opponent",
    "games",
    "win_rate",
    "score",
    "avg_kills",
    "avg_deaths",
    "kd_ratio",
    "avg_economy",
    "avg_game_length",
    "timestamp",
    "replay_url",
]


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="OpenRA-Bench: Evaluate agents against AI opponents"
    )
    parser.add_argument(
        "--agent",
        choices=["scripted", "llm", "mcp", "custom"],
        default="scripted",
        help="Agent type to run (default: scripted)",
    )
    parser.add_argument(
        "--agent-name",
        required=True,
        help="Name for this agent on the leaderboard",
    )
    parser.add_argument(
        "--agent-type",
        choices=["Scripted", "LLM", "RL"],
        help="Leaderboard category (auto-detected from --agent if not set)",
    )
    parser.add_argument(
        "--opponent",
        choices=["Easy", "Normal", "Hard"],
        default="Normal",
        help="AI opponent difficulty (default: Normal)",
    )
    parser.add_argument(
        "--games",
        type=int,
        default=10,
        help="Number of games to play (default: 10)",
    )
    parser.add_argument(
        "--server",
        default="http://localhost:8000",
        help="OpenRA-RL server URL (default: http://localhost:8000)",
    )
    parser.add_argument(
        "--max-steps",
        type=int,
        default=5000,
        help="Max steps per game before timeout (default: 5000)",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Validate arguments and show what would run, without connecting",
    )
    parser.add_argument(
        "--output",
        type=Path,
        default=RESULTS_FILE,
        help=f"Output CSV path (default: {RESULTS_FILE})",
    )
    args = parser.parse_args()

    # Auto-detect agent type
    if args.agent_type is None:
        type_map = {"scripted": "Scripted", "llm": "LLM", "mcp": "Scripted", "custom": "RL"}
        args.agent_type = type_map[args.agent]

    return args


async def run_game(env: Any, agent_fn: Any, max_steps: int) -> Dict[str, Any]:
    """Run a single game and return metrics.

    Args:
        env: OpenRAEnv client instance.
        agent_fn: Callable(obs) -> action.
        max_steps: Maximum steps before timeout.

    Returns:
        Dict with game metrics (from rubrics.compute_game_metrics).
    """
    from rubrics import compute_game_metrics

    obs = await env.reset()
    steps = 0

    while not obs.done and steps < max_steps:
        action = agent_fn(obs)
        obs = await env.step(action)
        steps += 1

    return compute_game_metrics(obs)


def get_agent_fn(agent_type: str) -> Any:
    """Get the agent decision function for the specified type.

    Returns a callable that takes an observation and returns an action.
    """
    # Import inline to avoid a hard dependency at module import time
    from openra_env.models import OpenRAAction

    # Simple no-op agent for evaluation framework testing;
    # replace with an actual ScriptedBot / LLM / custom integration.
    return lambda obs: OpenRAAction(commands=[])


async def run_evaluation(args: argparse.Namespace) -> Dict[str, Any]:
    """Run the full evaluation: N games, collect metrics, compute aggregates."""
    from openra_env.client import OpenRAEnv

    agent_fn = get_agent_fn(args.agent)
    game_results: List[Dict[str, Any]] = []

    async with OpenRAEnv(args.server) as env:
        for i in range(args.games):
            print(f"  Game {i + 1}/{args.games}...", end=" ", flush=True)
            metrics = await run_game(env, agent_fn, args.max_steps)
            game_results.append(metrics)
            result_str = metrics["result"] or "timeout"
            print(f"{result_str} (ticks: {metrics['ticks']}, K/D: {metrics['kd_ratio']:.1f})")

    # Aggregate results
    wins = sum(1 for g in game_results if g["win"])
    total = len(game_results)

    return {
        "agent_name": args.agent_name,
        "agent_type": args.agent_type,
        "opponent": args.opponent,
        "games": total,
        "win_rate": round(100.0 * wins / max(total, 1), 1),
        "score": round(compute_composite_score(game_results), 1),
        "avg_kills": round(sum(g["kills_cost"] for g in game_results) / max(total, 1)),
        "avg_deaths": round(sum(g["deaths_cost"] for g in game_results) / max(total, 1)),
        "kd_ratio": round(
            sum(g["kd_ratio"] for g in game_results) / max(total, 1), 2
        ),
        "avg_economy": round(
            sum(g["assets_value"] for g in game_results) / max(total, 1)
        ),
        "avg_game_length": round(
            sum(g["ticks"] for g in game_results) / max(total, 1)
        ),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%d"),
        "replay_url": "",
    }


def compute_composite_score(game_results: List[Dict[str, Any]]) -> float:
    """Compute the OpenRA-Bench composite score.

    Score = 50% win_rate + 25% avg_kd_normalized + 25% avg_economy_normalized
    """
    total = len(game_results)
    if total == 0:
        return 0.0

    win_rate = sum(1 for g in game_results if g["win"]) / total

    # K/D ratio normalized: kd / (kd + 1) maps [0, inf) -> [0, 1)
    avg_kd = sum(g["kd_ratio"] for g in game_results) / total
    kd_norm = avg_kd / (avg_kd + 1)

    # Economy normalized: assets / (assets + 10000)
    avg_assets = sum(g["assets_value"] for g in game_results) / total
    econ_norm = avg_assets / (avg_assets + 10000) if avg_assets >= 0 else 0.0

    return 100.0 * (0.5 * win_rate + 0.25 * kd_norm + 0.25 * econ_norm)


def append_results(results: Dict[str, Any], output_path: Path) -> None:
    """Append evaluation results to CSV file."""
    file_exists = output_path.exists() and output_path.stat().st_size > 0

    with open(output_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=RESULTS_COLUMNS)
        if not file_exists:
            writer.writeheader()
        writer.writerow(results)


def main() -> None:
    args = parse_args()

    print("OpenRA-Bench Evaluation")
    print(f"  Agent:    {args.agent_name} ({args.agent_type})")
    print(f"  Opponent: {args.opponent}")
    print(f"  Games:    {args.games}")
    print(f"  Server:   {args.server}")
    print()

    if args.dry_run:
        print("[DRY RUN] Would run evaluation with the above settings.")
        print(f"[DRY RUN] Results would be written to: {args.output}")
        return

    results = asyncio.run(run_evaluation(args))

    print()
    print("Results:")
    print(f"  Win Rate: {results['win_rate']}%")
    print(f"  Score: {results['score']}")
    print(f"  K/D Ratio: {results['kd_ratio']}")
    print(f"  Avg Economy: {results['avg_economy']}")
    print(f"  Avg Game Length: {results['avg_game_length']} ticks")

    append_results(results, args.output)
    print(f"\nResults appended to {args.output}")


if __name__ == "__main__":
    main()
```
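The `kd / (kd + 1)` normalization used in the composite score is monotone and bounded, so the military-efficiency component saturates: doubling an already-high K/D ratio adds little score. A quick illustration (the `kd_norm` helper name is hypothetical; the formula is the one from `compute_composite_score`):

```python
def kd_norm(kd: float) -> float:
    # Maps [0, inf) onto [0, 1): monotone and saturating.
    return kd / (kd + 1)


# Going from K/D 1 to 2 gains more normalized score than going from 4 to 8.
gain_low = kd_norm(2.0) - kd_norm(1.0)    # 2/3 - 1/2
gain_high = kd_norm(8.0) - kd_norm(4.0)   # 8/9 - 4/5
```

This is a deliberate design choice: it rewards improving a mediocre military record more than running up the score against an already-beaten opponent.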
requirements.txt (ADDED, +3)

```
gradio>=4.44.0
pandas>=2.0.0
openenv-core>=0.2.0
```
rubrics.py
ADDED
@@ -0,0 +1,152 @@
"""OpenRA-Bench rubrics for agent evaluation.

Follows the OpenEnv rubric pattern (see openenv.core.rubrics).
These rubrics score game episodes based on win/loss, military efficiency,
and economic performance.

Usage:
    rubric = OpenRABenchRubric()
    rubric.reset()
    for action, obs in episode:
        reward = rubric(action, obs)  # 0.0 until done
    step_rewards = rubric.win_loss.compute_step_rewards()
"""

from typing import Any, Dict, List, Tuple

from openenv.core.rubrics import (
    ExponentialDiscountingTrajectoryRubric,
    TrajectoryRubric,
    WeightedSum,
)


class OpenRAWinLossRubric(ExponentialDiscountingTrajectoryRubric):
    """Score the game on its win/loss/draw outcome, with temporal discounting.

    Terminal rewards:
    - Win: +1.0
    - Loss: -1.0
    - Draw: 0.0
    """

    def score_trajectory(self, trajectory: List[Tuple[Any, Any]]) -> float:
        if not trajectory:
            return 0.0
        _, final_obs = trajectory[-1]
        result = getattr(final_obs, "result", "")
        if result == "win":
            return 1.0
        elif result == "lose":
            return -1.0
        return 0.0


class MilitaryEfficiencyRubric(TrajectoryRubric):
    """Score based on the kill/death cost ratio from the final observation.

    Score = kills_cost / (kills_cost + deaths_cost), or 0.5 if no combat
    occurred. Normalized to the 0.0-1.0 range.
    """

    def score_trajectory(self, trajectory: List[Tuple[Any, Any]]) -> float:
        if not trajectory:
            return 0.0
        _, final_obs = trajectory[-1]
        military = getattr(final_obs, "military", None)
        if military is None:
            return 0.0
        kills = getattr(military, "kills_cost", 0)
        deaths = getattr(military, "deaths_cost", 0)
        total = kills + deaths
        if total == 0:
            return 0.5  # No combat occurred
        return kills / total

    def compute_step_rewards(self) -> List[float]:
        if not self._trajectory:
            return []
        score = self.score_trajectory(self._trajectory)
        return [score] * len(self._trajectory)


class EconomyRubric(TrajectoryRubric):
    """Score based on final economic state.

    Score = assets_value / (assets_value + 10000)
    Saturating normalization to the 0.0-1.0 range.
    """

    def score_trajectory(self, trajectory: List[Tuple[Any, Any]]) -> float:
        if not trajectory:
            return 0.0
        _, final_obs = trajectory[-1]
        military = getattr(final_obs, "military", None)
        if military is None:
            return 0.0
        assets = getattr(military, "assets_value", 0)
        # Saturating normalization: maps [0, inf) -> [0, 1)
        return assets / (assets + 10000) if assets >= 0 else 0.0

    def compute_step_rewards(self) -> List[float]:
        if not self._trajectory:
            return []
        score = self.score_trajectory(self._trajectory)
        return [score] * len(self._trajectory)


class OpenRABenchRubric(WeightedSum):
    """Composite benchmark score combining win/loss, military, and economy.

    Weights: 50% win/loss, 25% military efficiency, 25% economy.
    """

    def __init__(self, gamma: float = 0.99):
        win_loss = OpenRAWinLossRubric(gamma=gamma)
        military = MilitaryEfficiencyRubric()
        economy = EconomyRubric()
        super().__init__(
            rubrics=[win_loss, military, economy],
            weights=[0.5, 0.25, 0.25],
        )
        # Keep named references for direct access
        self.win_loss = win_loss
        self.military = military
        self.economy = economy

    def reset(self) -> None:
        self.win_loss.reset()
        self.military.reset()
        self.economy.reset()


def compute_game_metrics(final_obs: Any) -> Dict[str, Any]:
    """Extract benchmark metrics from a final game observation.

    Args:
        final_obs: The terminal GameObservation (where done=True).

    Returns:
        Dict with keys: result, ticks, kills_cost, deaths_cost,
        kd_ratio, assets_value, cash, win (bool).
    """
    military = getattr(final_obs, "military", None)
    economy = getattr(final_obs, "economy", None)

    kills = getattr(military, "kills_cost", 0) if military else 0
    deaths = getattr(military, "deaths_cost", 0) if military else 0
    assets = getattr(military, "assets_value", 0) if military else 0
    cash = getattr(economy, "cash", 0) if economy else 0
    result = getattr(final_obs, "result", "")
    tick = getattr(final_obs, "tick", 0)

    return {
        "result": result,
        "win": result == "win",
        "ticks": tick,
        "kills_cost": kills,
        "deaths_cost": deaths,
        "kd_ratio": kills / max(deaths, 1),
        "assets_value": assets,
        "cash": cash,
    }
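Taken together, the composite score at a terminal observation works out as in this hedged, self-contained sketch. It mocks the observation with `SimpleNamespace` and re-derives each component by hand rather than importing rubrics.py; the real rubrics also discount earlier steps, which this ignores:

```python
from types import SimpleNamespace

# Mock terminal observation mirroring the attributes the rubrics read.
military = SimpleNamespace(kills_cost=3000, deaths_cost=1000, assets_value=10000)
obs = SimpleNamespace(result="win", military=military, tick=12000)

# Component scores, assumed to match the rubric logic above.
win_score = {"win": 1.0, "lose": -1.0}.get(obs.result, 0.0)
mil_score = military.kills_cost / (military.kills_cost + military.deaths_cost)  # 0.75
econ_score = military.assets_value / (military.assets_value + 10000)            # 0.5
composite = 0.5 * win_score + 0.25 * mil_score + 0.25 * econ_score

print(composite)  # 0.8125
```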