yxc20098 committed
Commit f96ea53 · 1 Parent(s): 8475b1b

Add OpenRA-Bench leaderboard, evaluation harness, and rubrics


- Gradio leaderboard app with agent ranking, type/opponent filters, search
- OpenEnv-compatible rubrics (win/loss, military efficiency, economy)
- CLI evaluation harness for running N-game benchmarks
- Seed data with ScriptedBot and LLM-Agent baselines
- GitHub Actions workflow for HuggingFace Space sync
- HF Space-compatible README with YAML frontmatter

Files changed (9)
  1. .github/workflows/sync-to-hf.yml +40 -0
  2. .gitignore +9 -0
  3. README.md +69 -0
  4. app.py +305 -0
  5. data/results.csv +6 -0
  6. data/schema.md +17 -0
  7. evaluate.py +261 -0
  8. requirements.txt +3 -0
  9. rubrics.py +152 -0
.github/workflows/sync-to-hf.yml ADDED
@@ -0,0 +1,40 @@
+ name: Sync to HuggingFace Space
+
+ on:
+   push:
+     branches:
+       - main
+     paths:
+       - 'app.py'
+       - 'requirements.txt'
+       - 'data/**'
+       - 'README.md'
+
+ jobs:
+   sync:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+         with:
+           fetch-depth: 0
+           lfs: true
+
+       - name: Push to HuggingFace Hub
+         env:
+           HF_TOKEN: ${{ secrets.HF_TOKEN }}
+         run: |
+           # Only sync if HF_TOKEN is configured
+           if [ -z "$HF_TOKEN" ]; then
+             echo "HF_TOKEN not set, skipping sync"
+             exit 0
+           fi
+
+           git config user.name "GitHub Actions"
+           git config user.email "actions@github.com"
+
+           # Add HF remote and push
+           git remote add hf https://huggingface.co/spaces/yxc20089/OpenRA-Bench || true
+           git remote set-url hf https://x-access-token:${HF_TOKEN}@huggingface.co/spaces/yxc20089/OpenRA-Bench
+
+           # Push main branch to HF
+           git push hf main --force
.gitignore ADDED
@@ -0,0 +1,9 @@
+ __pycache__/
+ *.pyc
+ *.egg-info/
+ dist/
+ build/
+ .env
+ .venv/
+ *.orarep
+ flagged/
README.md ADDED
@@ -0,0 +1,69 @@
+ ---
+ title: OpenRA-Bench
+ emoji: 🎮
+ colorFrom: red
+ colorTo: blue
+ sdk: gradio
+ sdk_version: "5.12.0"
+ app_file: app.py
+ pinned: true
+ license: gpl-3.0
+ ---
+
+ # OpenRA-Bench
+
+ Standardized benchmark and leaderboard for AI agents playing Red Alert through [OpenRA-RL](https://openra-rl.dev).
+
+ ## Features
+
+ - **Leaderboard**: Ranked agent comparison with composite scoring
+ - **Filtering**: By agent type (Scripted/LLM/RL) and opponent difficulty
+ - **Evaluation harness**: Automated N-game benchmarking with metrics collection
+ - **OpenEnv rubrics**: Composable scoring (win/loss, military efficiency, economy)
+ - **Replay verification**: Replay files linked to leaderboard entries
+
+ ## Quick Start
+
+ ### View the leaderboard
+
+ ```bash
+ pip install -r requirements.txt
+ python app.py
+ # Opens at http://localhost:7860
+ ```
+
+ ### Run an evaluation
+
+ ```bash
+ # Start OpenRA-RL server
+ cd /path/to/OpenRA-RL
+ docker compose up openra-rl
+
+ # Run benchmark
+ cd /path/to/OpenRA-Bench
+ python evaluate.py \
+     --agent scripted \
+     --agent-name "MyBot-v1" \
+     --opponent Normal \
+     --games 10
+ ```
+
+ ### Submit results
+
+ 1. Fork this repo
+ 2. Run evaluation (appends to `data/results.csv`)
+ 3. Open a PR with your results
+
+ ## Scoring
+
+ | Component | Weight | Description |
+ |-----------|--------|-------------|
+ | Win Rate | 50% | Games won / total games |
+ | Military Efficiency | 25% | Kill/death cost ratio (normalized) |
+ | Economy | 25% | Final asset value (normalized) |
+
+ ## Links
+
+ - [OpenRA-RL Documentation](https://openra-rl.dev)
+ - [OpenRA-RL GitHub](https://github.com/yxc20089/OpenRA-RL)
+ - [OpenEnv Framework](https://huggingface.co/openenv)
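The two "(normalized)" entries in the Scoring table are bounded transforms rather than raw values; a minimal sketch of the composite formula, with the weights from the table above and the normalizations matching `compute_composite_score` in `evaluate.py`:

```python
def composite_score(win_rate: float, kd_ratio: float, economy: float) -> float:
    """win_rate in [0, 1]; kd_ratio and economy are non-negative raw values."""
    # kd / (kd + 1) and assets / (assets + 10000) both map [0, inf) -> [0, 1)
    kd_norm = kd_ratio / (kd_ratio + 1)
    econ_norm = economy / (economy + 10_000)
    return 100.0 * (0.5 * win_rate + 0.25 * kd_norm + 0.25 * econ_norm)

print(round(composite_score(1.0, 1.0, 10_000), 1))  # → 75.0
```

With a perfect win rate, even trades (K/D = 1), and 10,000 economy, the score is 75.0, since each normalized term sits at exactly 0.5.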
app.py ADDED
@@ -0,0 +1,305 @@
+ """OpenRA-Bench: Agent Leaderboard for OpenRA-RL.
+
+ A Gradio app that displays agent rankings, supports filtering by type
+ and opponent difficulty, and provides submission instructions.
+
+ Run locally:
+     python app.py
+
+ Deploy on HuggingFace Spaces:
+     Push app.py, requirements.txt, data/, and README.md to your HF Space.
+ """
+
+ import os
+ from pathlib import Path
+
+ import gradio as gr
+ import pandas as pd
+
+ # ── Data Loading ──────────────────────────────────────────────────────────────
+
+ DATA_PATH = Path(__file__).parent / "data" / "results.csv"
+
+ AGENT_TYPE_COLORS = {
+     "Scripted": "#ffcd75",  # Gold
+     "LLM": "#7497db",       # Blue
+     "RL": "#75809c",        # Gray-blue
+ }
+
+ DISPLAY_COLUMNS = [
+     "Rank",
+     "Agent",
+     "Type",
+     "Opponent",
+     "Games",
+     "Win Rate (%)",
+     "Score",
+     "K/D Ratio",
+     "Avg Kills",
+     "Avg Deaths",
+     "Avg Economy",
+     "Avg Game Length",
+     "Date",
+ ]
+
+
+ def load_data() -> pd.DataFrame:
+     """Load leaderboard data from CSV."""
+     if not DATA_PATH.exists():
+         return pd.DataFrame(columns=DISPLAY_COLUMNS)
+
+     df = pd.read_csv(DATA_PATH)
+     df = df.sort_values("score", ascending=False).reset_index(drop=True)
+     df.insert(0, "Rank", range(1, len(df) + 1))
+
+     # Rename for display
+     df = df.rename(columns={
+         "agent_name": "Agent",
+         "agent_type": "Type",
+         "opponent": "Opponent",
+         "games": "Games",
+         "win_rate": "Win Rate (%)",
+         "score": "Score",
+         "kd_ratio": "K/D Ratio",
+         "avg_kills": "Avg Kills",
+         "avg_deaths": "Avg Deaths",
+         "avg_economy": "Avg Economy",
+         "avg_game_length": "Avg Game Length",
+         "timestamp": "Date",
+     })
+
+     return df[DISPLAY_COLUMNS]
+
+
+ def add_type_badges(df: pd.DataFrame) -> pd.DataFrame:
+     """Add color-coded HTML badges to the Type column."""
+     def badge(agent_type: str) -> str:
+         color = AGENT_TYPE_COLORS.get(agent_type, "#ccc")
+         text_color = "#fff" if agent_type != "Scripted" else "#333"
+         return (
+             f'<span style="background:{color};color:{text_color};'
+             f'padding:2px 8px;border-radius:4px;font-size:0.85em">'
+             f"{agent_type}</span>"
+         )
+
+     df = df.copy()
+     df["Type"] = df["Type"].apply(badge)
+     return df
+
+
+ # ── Filtering ─────────────────────────────────────────────────────────────────
+
+
+ def filter_leaderboard(
+     search: str,
+     agent_types: list[str],
+     opponent: str,
+ ) -> pd.DataFrame:
+     """Filter leaderboard by search, agent type, and opponent."""
+     df = load_data()
+
+     # Filter by agent type
+     if agent_types:
+         df = df[df["Type"].isin(agent_types)]
+
+     # Filter by opponent
+     if opponent and opponent != "All":
+         df = df[df["Opponent"] == opponent]
+
+     # Search by agent name (regex)
+     if search and search.strip():
+         patterns = [p.strip() for p in search.split(",") if p.strip()]
+         mask = pd.Series([False] * len(df), index=df.index)
+         for pattern in patterns:
+             mask |= df["Agent"].str.contains(pattern, case=False, regex=True, na=False)
+         df = df[mask]
+
+     # Re-rank after filtering
+     df = df.reset_index(drop=True)
+     df["Rank"] = range(1, len(df) + 1)
+
+     return add_type_badges(df)
+
+
+ # ── UI ────────────────────────────────────────────────────────────────────────
+
+ ABOUT_MD = """
+ ## What is OpenRA-Bench?
+
+ **OpenRA-Bench** is a standardized benchmark for evaluating AI agents playing
+ [Red Alert](https://www.openra.net/) through the
+ [OpenRA-RL](https://openra-rl.dev) environment.
+
+ ### Evaluation Protocol
+
+ - **Game**: Red Alert (OpenRA engine)
+ - **Format**: 1v1 agent vs built-in AI
+ - **Opponents**: Easy, Normal, Hard difficulty
+ - **Games per entry**: Minimum 10 games per configuration
+ - **Metrics**: Win rate, composite score, K/D ratio, economy
+
+ ### Composite Score
+
+ The benchmark score combines three components:
+
+ | Component | Weight | Description |
+ |-----------|--------|-------------|
+ | Win Rate | 50% | Percentage of games won |
+ | Military Efficiency | 25% | Kill/death cost ratio (normalized) |
+ | Economy | 25% | Final asset value (normalized) |
+
+ ### Agent Types
+
+ - **Scripted**: Rule-based bots with hardcoded strategies
+ - **LLM**: Language model agents (Claude, GPT, etc.)
+ - **RL**: Reinforcement learning policies (PPO, SAC, etc.)
+
+ ### Links
+
+ - [OpenRA-RL Documentation](https://openra-rl.dev)
+ - [GitHub Repository](https://github.com/yxc20089/OpenRA-RL)
+ - [OpenRA-Bench Source](https://github.com/yxc20089/OpenRA-Bench)
+ - [OpenEnv Framework](https://huggingface.co/openenv)
+ """
+
+ SUBMIT_MD = """
+ ## How to Submit Results
+
+ ### 1. Set up the environment
+
+ ```bash
+ git clone --recursive https://github.com/yxc20089/OpenRA-RL.git
+ cd OpenRA-RL
+ pip install -e .
+ docker compose up openra-rl
+ ```
+
+ ### 2. Run the evaluation
+
+ ```bash
+ cd /path/to/OpenRA-Bench
+
+ python evaluate.py \\
+     --agent scripted \\
+     --agent-name "MyBot-v1" \\
+     --agent-type Scripted \\
+     --opponent Normal \\
+     --games 10 \\
+     --server http://localhost:8000
+ ```
+
+ ### 3. Submit via Pull Request
+
+ 1. Fork [OpenRA-Bench](https://github.com/yxc20089/OpenRA-Bench)
+ 2. Run the evaluation (results append to `data/results.csv`)
+ 3. Commit and open a PR with:
+    - Your updated `data/results.csv`
+    - A description of your agent
+    - (Optional) Replay files in `replays/`
+
+ ### Evaluation Parameters
+
+ | Parameter | Description |
+ |-----------|-------------|
+ | `--agent` | Agent type: `scripted`, `llm`, `mcp`, `custom` |
+ | `--agent-name` | Display name on the leaderboard |
+ | `--agent-type` | Category: `Scripted`, `LLM`, `RL` |
+ | `--opponent` | AI difficulty: `Easy`, `Normal`, `Hard` |
+ | `--games` | Number of games (minimum 10) |
+ | `--server` | OpenRA-RL server URL |
+
+ ### Custom Agents
+
+ For custom agents, implement the standard `reset/step` loop:
+
+ ```python
+ from openra_env.client import OpenRAEnv
+ from openra_env.models import OpenRAAction
+
+ async with OpenRAEnv("http://localhost:8000") as env:
+     obs = await env.reset()
+     while not obs.done:
+         action = your_agent.decide(obs)
+         obs = await env.step(action)
+ ```
+
+ Then run `evaluate.py --agent custom` with your agent integrated.
+ """
+
+
+ def build_app() -> gr.Blocks:
+     """Build the Gradio leaderboard app."""
+     initial_df = add_type_badges(load_data())
+
+     with gr.Blocks(title="OpenRA-Bench") as app:
+         gr.Markdown(
+             "# OpenRA-Bench\n"
+             "**Agent Leaderboard for OpenRA-RL** — "
+             "Train AI to Play Real-Time Strategy"
+         )
+
+         with gr.Tabs():
+             # ── Leaderboard Tab ───────────────────────────────────────────
+             with gr.Tab("Leaderboard"):
+                 with gr.Row():
+                     search_box = gr.Textbox(
+                         label="Search agents",
+                         placeholder="Search by name (supports regex, comma-separated)...",
+                         scale=3,
+                     )
+                     type_filter = gr.CheckboxGroup(
+                         choices=["Scripted", "LLM", "RL"],
+                         value=["Scripted", "LLM", "RL"],
+                         label="Agent Type",
+                         scale=2,
+                     )
+                     opponent_filter = gr.Dropdown(
+                         choices=["All", "Easy", "Normal", "Hard"],
+                         value="All",
+                         label="Opponent",
+                         scale=1,
+                     )
+
+                 leaderboard = gr.Dataframe(
+                     value=initial_df,
+                     datatype=[
+                         "number",  # Rank
+                         "str",     # Agent
+                         "html",    # Type (badge)
+                         "str",     # Opponent
+                         "number",  # Games
+                         "number",  # Win Rate
+                         "number",  # Score
+                         "number",  # K/D Ratio
+                         "number",  # Avg Kills
+                         "number",  # Avg Deaths
+                         "number",  # Avg Economy
+                         "number",  # Avg Game Length
+                         "str",     # Date
+                     ],
+                     interactive=False,
+                     show_label=False,
+                 )
+
+                 # Wire up filters
+                 for component in [search_box, type_filter, opponent_filter]:
+                     component.change(
+                         fn=filter_leaderboard,
+                         inputs=[search_box, type_filter, opponent_filter],
+                         outputs=leaderboard,
+                     )
+
+             # ── About Tab ─────────────────────────────────────────────────
+             with gr.Tab("About"):
+                 gr.Markdown(ABOUT_MD)
+
+             # ── Submit Tab ────────────────────────────────────────────────
+             with gr.Tab("Submit"):
+                 gr.Markdown(SUBMIT_MD)
+
+     return app
+
+
+ if __name__ == "__main__":
+     app = build_app()
+     app.launch()
data/results.csv ADDED
@@ -0,0 +1,6 @@
+ agent_name,agent_type,opponent,games,win_rate,score,avg_kills,avg_deaths,kd_ratio,avg_economy,avg_game_length,timestamp,replay_url
+ ScriptedBot-v1,Scripted,Easy,10,90.0,72.5,8450,2100,4.02,12500,1850,2026-02-19,
+ ScriptedBot-v1,Scripted,Normal,10,60.0,52.3,6200,4800,1.29,8200,2400,2026-02-19,
+ ScriptedBot-v1,Scripted,Hard,10,20.0,28.1,3100,7200,0.43,4500,1600,2026-02-19,
+ LLM-Agent-v1,LLM,Easy,10,80.0,65.8,7200,3400,2.12,11000,2200,2026-02-19,
+ LLM-Agent-v1,LLM,Normal,10,50.0,48.7,5800,5200,1.12,7800,2800,2026-02-19,
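The seed rows above are ranked by `score` exactly as `load_data()` in `app.py` does it; a trimmed-column sketch, with the rows inlined so it runs without the `data/` directory:

```python
import io

import pandas as pd

# Seed rows from data/results.csv, reduced to the columns that matter for ranking
CSV = """agent_name,agent_type,opponent,score
ScriptedBot-v1,Scripted,Easy,72.5
ScriptedBot-v1,Scripted,Normal,52.3
ScriptedBot-v1,Scripted,Hard,28.1
LLM-Agent-v1,LLM,Easy,65.8
LLM-Agent-v1,LLM,Normal,48.7
"""

df = pd.read_csv(io.StringIO(CSV))
# Sort by composite score, highest first, then assign 1-based ranks
df = df.sort_values("score", ascending=False).reset_index(drop=True)
df.insert(0, "Rank", range(1, len(df) + 1))
print(df.iloc[0]["agent_name"])  # → ScriptedBot-v1
```

The top-ranked entry is the ScriptedBot-v1 Easy run (score 72.5); note that rows for the same agent at different difficulties rank independently.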
data/schema.md ADDED
@@ -0,0 +1,17 @@
+ # Results CSV Schema
+
+ | Column | Type | Description |
+ |--------|------|-------------|
+ | `agent_name` | str | Agent identifier displayed on leaderboard |
+ | `agent_type` | str | Category: "Scripted", "LLM", or "RL" |
+ | `opponent` | str | AI difficulty: "Easy", "Normal", or "Hard" |
+ | `games` | int | Number of games played (minimum 10) |
+ | `win_rate` | float | Win percentage (0.0 - 100.0) |
+ | `score` | float | Composite benchmark score (0.0 - 100.0) |
+ | `avg_kills` | float | Average enemy cost destroyed per game |
+ | `avg_deaths` | float | Average own cost lost per game |
+ | `kd_ratio` | float | Average kills_cost / deaths_cost ratio |
+ | `avg_economy` | float | Average final assets_value per game |
+ | `avg_game_length` | int | Average game duration in ticks |
+ | `timestamp` | str | Evaluation date (ISO 8601, YYYY-MM-DD) |
+ | `replay_url` | str | URL to replay file(s), empty if none |
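A hypothetical row validator (not part of this commit) makes the schema's constraints concrete; the column names, categories, and ranges come from the table above:

```python
def validate_row(row: dict) -> list[str]:
    """Return a list of schema violations for one results.csv row (empty = valid)."""
    errors = []
    if row.get("agent_type") not in {"Scripted", "LLM", "RL"}:
        errors.append("agent_type must be Scripted, LLM, or RL")
    if row.get("opponent") not in {"Easy", "Normal", "Hard"}:
        errors.append("opponent must be Easy, Normal, or Hard")
    try:
        if int(row.get("games", 0)) < 10:
            errors.append("games must be at least 10")
    except ValueError:
        errors.append("games must be an integer")
    for col in ("win_rate", "score"):
        try:
            if not 0.0 <= float(row.get(col, -1)) <= 100.0:
                errors.append(f"{col} must be in [0, 100]")
        except ValueError:
            errors.append(f"{col} must be a float")
    return errors

# A well-formed row (values taken from the seed data) passes cleanly
row = {"agent_type": "LLM", "opponent": "Normal", "games": "10",
       "win_rate": "50.0", "score": "48.7"}
print(validate_row(row))  # → []
```

A CI check along these lines could gate leaderboard PRs before they are merged.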
evaluate.py ADDED
@@ -0,0 +1,261 @@
+ #!/usr/bin/env python3
+ r"""OpenRA-Bench evaluation harness.
+
+ Runs N games of an agent against a built-in AI opponent, collects metrics,
+ and appends aggregate results to data/results.csv.
+
+ Usage:
+     # Start the OpenRA-RL server first:
+     docker compose up openra-rl
+
+     # Run evaluation:
+     python evaluate.py \
+         --agent scripted \
+         --agent-name "ScriptedBot-v1" \
+         --opponent Hard \
+         --games 10 \
+         --server http://localhost:8000
+
+     # Dry run (validate args without connecting):
+     python evaluate.py --dry-run --agent-name "Test" --games 5
+ """
+
+ import argparse
+ import asyncio
+ import csv
+ import os
+ import sys
+ from datetime import datetime, timezone
+ from pathlib import Path
+ from typing import Any, Dict, List
+
+ # Evaluation results file
+ RESULTS_FILE = Path(__file__).parent / "data" / "results.csv"
+
+ RESULTS_COLUMNS = [
+     "agent_name",
+     "agent_type",
+     "opponent",
+     "games",
+     "win_rate",
+     "score",
+     "avg_kills",
+     "avg_deaths",
+     "kd_ratio",
+     "avg_economy",
+     "avg_game_length",
+     "timestamp",
+     "replay_url",
+ ]
+
+
+ def parse_args() -> argparse.Namespace:
+     parser = argparse.ArgumentParser(
+         description="OpenRA-Bench: Evaluate agents against AI opponents"
+     )
+     parser.add_argument(
+         "--agent",
+         choices=["scripted", "llm", "mcp", "custom"],
+         default="scripted",
+         help="Agent type to run (default: scripted)",
+     )
+     parser.add_argument(
+         "--agent-name",
+         required=True,
+         help="Name for this agent on the leaderboard",
+     )
+     parser.add_argument(
+         "--agent-type",
+         choices=["Scripted", "LLM", "RL"],
+         help="Leaderboard category (auto-detected from --agent if not set)",
+     )
+     parser.add_argument(
+         "--opponent",
+         choices=["Easy", "Normal", "Hard"],
+         default="Normal",
+         help="AI opponent difficulty (default: Normal)",
+     )
+     parser.add_argument(
+         "--games",
+         type=int,
+         default=10,
+         help="Number of games to play (default: 10)",
+     )
+     parser.add_argument(
+         "--server",
+         default="http://localhost:8000",
+         help="OpenRA-RL server URL (default: http://localhost:8000)",
+     )
+     parser.add_argument(
+         "--max-steps",
+         type=int,
+         default=5000,
+         help="Max steps per game before timeout (default: 5000)",
+     )
+     parser.add_argument(
+         "--dry-run",
+         action="store_true",
+         help="Validate arguments and show what would run, without connecting",
+     )
+     parser.add_argument(
+         "--output",
+         type=Path,
+         default=RESULTS_FILE,
+         help=f"Output CSV path (default: {RESULTS_FILE})",
+     )
+     args = parser.parse_args()
+
+     # Auto-detect agent type
+     if args.agent_type is None:
+         type_map = {"scripted": "Scripted", "llm": "LLM", "mcp": "Scripted", "custom": "RL"}
+         args.agent_type = type_map[args.agent]
+
+     return args
+
+
+ async def run_game(env: Any, agent_fn: Any, max_steps: int) -> Dict[str, Any]:
+     """Run a single game and return metrics.
+
+     Args:
+         env: OpenRAEnv client instance.
+         agent_fn: Callable(obs) -> action.
+         max_steps: Maximum steps before timeout.
+
+     Returns:
+         Dict with game metrics (from rubrics.compute_game_metrics).
+     """
+     from rubrics import compute_game_metrics
+
+     obs = await env.reset()
+     steps = 0
+
+     while not obs.done and steps < max_steps:
+         action = agent_fn(obs)
+         obs = await env.step(action)
+         steps += 1
+
+     return compute_game_metrics(obs)
+
+
+ def get_agent_fn(agent_type: str) -> Any:
+     """Get the agent decision function for the specified type.
+
+     Returns a callable that takes an observation and returns an action.
+     """
+     if agent_type == "scripted":
+         # Import inline to avoid hard dependency
+         from openra_env.models import OpenRAAction
+         # Simple no-op agent for evaluation framework testing
+         # Replace with actual ScriptedBot integration
+         return lambda obs: OpenRAAction(commands=[])
+     else:
+         from openra_env.models import OpenRAAction
+         return lambda obs: OpenRAAction(commands=[])
+
+
+ async def run_evaluation(args: argparse.Namespace) -> Dict[str, Any]:
+     """Run the full evaluation: N games, collect metrics, compute aggregates."""
+     from openra_env.client import OpenRAEnv
+
+     agent_fn = get_agent_fn(args.agent)
+     game_results: List[Dict[str, Any]] = []
+
+     async with OpenRAEnv(args.server) as env:
+         for i in range(args.games):
+             print(f"  Game {i + 1}/{args.games}...", end=" ", flush=True)
+             metrics = await run_game(env, agent_fn, args.max_steps)
+             game_results.append(metrics)
+             result_str = metrics["result"] or "timeout"
+             print(f"{result_str} (ticks: {metrics['ticks']}, K/D: {metrics['kd_ratio']:.1f})")
+
+     # Aggregate results
+     wins = sum(1 for g in game_results if g["win"])
+     total = len(game_results)
+
+     return {
+         "agent_name": args.agent_name,
+         "agent_type": args.agent_type,
+         "opponent": args.opponent,
+         "games": total,
+         "win_rate": round(100.0 * wins / max(total, 1), 1),
+         "score": round(compute_composite_score(game_results), 1),
+         "avg_kills": round(sum(g["kills_cost"] for g in game_results) / max(total, 1)),
+         "avg_deaths": round(sum(g["deaths_cost"] for g in game_results) / max(total, 1)),
+         "kd_ratio": round(
+             sum(g["kd_ratio"] for g in game_results) / max(total, 1), 2
+         ),
+         "avg_economy": round(
+             sum(g["assets_value"] for g in game_results) / max(total, 1)
+         ),
+         "avg_game_length": round(
+             sum(g["ticks"] for g in game_results) / max(total, 1)
+         ),
+         "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%d"),
+         "replay_url": "",
+     }
+
+
+ def compute_composite_score(game_results: List[Dict[str, Any]]) -> float:
+     """Compute the OpenRA-Bench composite score.
+
+     Score = 50% win_rate + 25% avg_kd_normalized + 25% avg_economy_normalized
+     """
+     total = len(game_results)
+     if total == 0:
+         return 0.0
+
+     win_rate = sum(1 for g in game_results if g["win"]) / total
+
+     # K/D ratio normalized: kd / (kd + 1) maps [0, inf) -> [0, 1)
+     avg_kd = sum(g["kd_ratio"] for g in game_results) / total
+     kd_norm = avg_kd / (avg_kd + 1)
+
+     # Economy normalized: assets / (assets + 10000)
+     avg_assets = sum(g["assets_value"] for g in game_results) / total
+     econ_norm = avg_assets / (avg_assets + 10000) if avg_assets >= 0 else 0.0
+
+     return 100.0 * (0.5 * win_rate + 0.25 * kd_norm + 0.25 * econ_norm)
+
+
+ def append_results(results: Dict[str, Any], output_path: Path) -> None:
+     """Append evaluation results to CSV file."""
+     file_exists = output_path.exists() and output_path.stat().st_size > 0
+
+     with open(output_path, "a", newline="") as f:
+         writer = csv.DictWriter(f, fieldnames=RESULTS_COLUMNS)
+         if not file_exists:
+             writer.writeheader()
+         writer.writerow(results)
+
+
+ def main() -> None:
+     args = parse_args()
+
+     print("OpenRA-Bench Evaluation")
+     print(f"  Agent: {args.agent_name} ({args.agent_type})")
+     print(f"  Opponent: {args.opponent}")
+     print(f"  Games: {args.games}")
+     print(f"  Server: {args.server}")
+     print()
+
+     if args.dry_run:
+         print("[DRY RUN] Would run evaluation with the above settings.")
+         print(f"[DRY RUN] Results would be written to: {args.output}")
+         return
+
+     results = asyncio.run(run_evaluation(args))
+
+     print()
+     print("Results:")
+     print(f"  Win Rate: {results['win_rate']}%")
+     print(f"  Score: {results['score']}")
+     print(f"  K/D Ratio: {results['kd_ratio']}")
+     print(f"  Avg Economy: {results['avg_economy']}")
+     print(f"  Avg Game Length: {results['avg_game_length']} ticks")
+
+     append_results(results, args.output)
+     print(f"\nResults appended to {args.output}")
+
+
+ if __name__ == "__main__":
+     main()
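`append_results()` relies on a common `csv.DictWriter` pattern: write the header only when the file is missing or empty, so repeated benchmark runs accumulate rows under a single header. A self-contained sketch of that pattern against a temp file (the two-column layout is just for illustration):

```python
import csv
import tempfile
from pathlib import Path

COLUMNS = ["agent_name", "score"]

def append_row(path: Path, row: dict) -> None:
    # Header goes in only once: when the file doesn't exist or is empty
    has_data = path.exists() and path.stat().st_size > 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        if not has_data:
            writer.writeheader()
        writer.writerow(row)

with tempfile.TemporaryDirectory() as d:
    out = Path(d) / "results.csv"
    append_row(out, {"agent_name": "A", "score": 50.0})
    append_row(out, {"agent_name": "B", "score": 60.0})
    lines = out.read_text().splitlines()
    print(lines)  # → ['agent_name,score', 'A,50.0', 'B,60.0']
```

The size check (rather than a bare existence check) also handles the case where an earlier run created the file but crashed before writing anything.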
requirements.txt ADDED
@@ -0,0 +1,3 @@
+ gradio>=4.44.0
+ pandas>=2.0.0
+ openenv-core>=0.2.0
rubrics.py ADDED
@@ -0,0 +1,152 @@
+ """OpenRA-Bench rubrics for agent evaluation.
+
+ Follows the OpenEnv rubric pattern (see openenv.core.rubrics).
+ These rubrics score game episodes based on win/loss, military efficiency,
+ and economic performance.
+
+ Usage:
+     rubric = OpenRABenchRubric()
+     rubric.reset()
+     for action, obs in episode:
+         reward = rubric(action, obs)  # 0.0 until done
+     step_rewards = rubric.win_loss.compute_step_rewards()
+ """
+
+ from typing import Any, Dict, List, Tuple
+
+ from openenv.core.rubrics import (
+     ExponentialDiscountingTrajectoryRubric,
+     TrajectoryRubric,
+     WeightedSum,
+ )
+
+
+ class OpenRAWinLossRubric(ExponentialDiscountingTrajectoryRubric):
+     """Score game based on win/loss/draw outcome with temporal discounting.
+
+     Terminal rewards:
+         - Win: +1.0
+         - Loss: -1.0
+         - Draw: 0.0
+     """
+
+     def score_trajectory(self, trajectory: List[Tuple[Any, Any]]) -> float:
+         if not trajectory:
+             return 0.0
+         _, final_obs = trajectory[-1]
+         result = getattr(final_obs, "result", "")
+         if result == "win":
+             return 1.0
+         elif result == "lose":
+             return -1.0
+         return 0.0
+
+
+ class MilitaryEfficiencyRubric(TrajectoryRubric):
+     """Score based on kill/death cost ratio from final observation.
+
+     Score = kills_cost / max(kills_cost + deaths_cost, 1)
+     Normalized to 0.0-1.0 range.
+     """
+
+     def score_trajectory(self, trajectory: List[Tuple[Any, Any]]) -> float:
+         if not trajectory:
+             return 0.0
+         _, final_obs = trajectory[-1]
+         military = getattr(final_obs, "military", None)
+         if military is None:
+             return 0.0
+         kills = getattr(military, "kills_cost", 0)
+         deaths = getattr(military, "deaths_cost", 0)
+         total = kills + deaths
+         if total == 0:
+             return 0.5  # No combat occurred
+         return kills / total
+
+     def compute_step_rewards(self) -> List[float]:
+         if not self._trajectory:
+             return []
+         score = self.score_trajectory(self._trajectory)
+         return [score] * len(self._trajectory)
+
+
+ class EconomyRubric(TrajectoryRubric):
+     """Score based on final economic state.
+
+     Score = assets_value / (assets_value + 10000)
+     Sigmoid-like normalization to 0.0-1.0 range.
+     """
+
+     def score_trajectory(self, trajectory: List[Tuple[Any, Any]]) -> float:
+         if not trajectory:
+             return 0.0
+         _, final_obs = trajectory[-1]
+         military = getattr(final_obs, "military", None)
+         if military is None:
+             return 0.0
+         assets = getattr(military, "assets_value", 0)
+         # Sigmoid normalization: maps [0, inf) -> [0, 1)
+         return assets / (assets + 10000) if assets >= 0 else 0.0
+
+     def compute_step_rewards(self) -> List[float]:
+         if not self._trajectory:
+             return []
+         score = self.score_trajectory(self._trajectory)
+         return [score] * len(self._trajectory)
+
+
+ class OpenRABenchRubric(WeightedSum):
+     """Composite benchmark score combining win/loss, military, and economy.
+
+     Weights: 50% win/loss, 25% military efficiency, 25% economy.
+     """
+
+     def __init__(self, gamma: float = 0.99):
+         win_loss = OpenRAWinLossRubric(gamma=gamma)
+         military = MilitaryEfficiencyRubric()
+         economy = EconomyRubric()
+         super().__init__(
+             rubrics=[win_loss, military, economy],
+             weights=[0.5, 0.25, 0.25],
+         )
+         # Keep named references for direct access
+         self.win_loss = win_loss
+         self.military = military
+         self.economy = economy
+
+     def reset(self) -> None:
+         self.win_loss.reset()
+         self.military.reset()
+         self.economy.reset()
+
+
+ def compute_game_metrics(final_obs: Any) -> Dict[str, Any]:
+     """Extract benchmark metrics from a final game observation.
+
+     Args:
+         final_obs: The terminal GameObservation (where done=True).
+
+     Returns:
+         Dict with keys: result, ticks, kills_cost, deaths_cost,
+         kd_ratio, assets_value, cash, win (bool).
+     """
+     military = getattr(final_obs, "military", None)
+     economy = getattr(final_obs, "economy", None)
+
+     kills = getattr(military, "kills_cost", 0) if military else 0
+     deaths = getattr(military, "deaths_cost", 0) if military else 0
+     assets = getattr(military, "assets_value", 0) if military else 0
+     cash = getattr(economy, "cash", 0) if economy else 0
+     result = getattr(final_obs, "result", "")
+     tick = getattr(final_obs, "tick", 0)
+
+     return {
+         "result": result,
+         "win": result == "win",
+         "ticks": tick,
+         "kills_cost": kills,
+         "deaths_cost": deaths,
+         "kd_ratio": kills / max(deaths, 1),
+         "assets_value": assets,
+         "cash": cash,
+     }
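Because `compute_game_metrics()` reads every field defensively via `getattr`, a mock observation is enough to exercise it. The sketch below restates the relevant logic standalone so it runs without `openenv` installed; the field names (`result`, `military.kills_cost`, `military.deaths_cost`) mirror those used in `rubrics.py`:

```python
from types import SimpleNamespace
from typing import Any, Dict

def game_metrics(final_obs: Any) -> Dict[str, Any]:
    # Standalone restatement of compute_game_metrics' defensive getattr reads
    military = getattr(final_obs, "military", None)
    kills = getattr(military, "kills_cost", 0) if military else 0
    deaths = getattr(military, "deaths_cost", 0) if military else 0
    result = getattr(final_obs, "result", "")
    return {
        "result": result,
        "win": result == "win",
        "kd_ratio": kills / max(deaths, 1),  # max(..., 1) guards a zero-death game
    }

# A terminal observation mocked with SimpleNamespace
obs = SimpleNamespace(
    result="win",
    military=SimpleNamespace(kills_cost=8000, deaths_cost=2000),
)
print(game_metrics(obs)["kd_ratio"])  # → 4.0
```

The same mocking approach works for a bare `SimpleNamespace()` with no fields at all, which falls back to the defaults (empty result, K/D of 0.0) instead of raising.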