eval.py β Evaluation Runner Reference
scripts/eval.py β Standalone LifeStack evaluation runner using a random-action baseline.
No model, no GPU, no API key required.
Overview
Runs N independent episodes against LifeStackEnv using uniformly random actions as a
baseline policy. Prints a live per-episode table and aggregate statistics at the end.
Useful for:
- Verifying environment correctness after changes
- Establishing a random-baseline reward floor before training
- CI smoke checks (no external dependencies)
Usage
# Default: 10 episodes, any domain
python scripts/eval.py
# 20 episodes, flight_crisis domain only
python scripts/eval.py --episodes 20 --domain flight_crisis
# Verbose per-step output
python scripts/eval.py --episodes 5 --verbose
CLI Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
--episodes |
int |
10 |
Number of episodes to run |
--domain |
str |
None |
Optional domain filter passed to TaskGenerator.generate() |
--verbose |
flag | False |
Print per-step action, reward, and done status |
Supported --domain values: flight_crisis, code_merge_crisis (or omit for random).
Output
Per-episode table
EP TOTAL REWARD STEPS DOMAIN SUCCESS
ββββ ββββββββββββ ββββββ ββββββββββββββββββββ βββββββ
1 0.3120 8 flight_crisis β
2 1.8450 12 code_merge_crisis β
Aggregate stats
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Episodes : 10
Mean Reward : 0.8231
Success Rate : 30.0%
Mean Steps : 10.4
Action Space (Random Baseline)
Each step samples uniformly from:
execute, inspect, plan, wait, communicate, spend, delegate
executeactions target a real route ID from the active task.inspectactions target a real hidden-state key from the active task.- Other actions apply a small random metric nudge and resource cost.
Change Log
| Date | Change |
|---|---|
| 2026-04-23 | File created β implements random baseline evaluation runner |