| # eval.py β Evaluation Runner Reference |
|
|
| `scripts/eval.py` β Standalone LifeStack evaluation runner using a random-action baseline. |
|
|
| No model, no GPU, no API key required. |
|
|
| --- |
|
|
| ## Overview |
|
|
| Runs N independent episodes against `LifeStackEnv` using uniformly random actions as a |
| baseline policy. Prints a live per-episode table and aggregate statistics at the end. |
|
|
| Useful for: |
| - Verifying environment correctness after changes |
| - Establishing a random-baseline reward floor before training |
| - CI smoke checks (no external dependencies) |
|
|
| --- |
|
|
| ## Usage |
|
|
| ```bash |
| # Default: 10 episodes, any domain |
| python scripts/eval.py |
| |
| # 20 episodes, flight_crisis domain only |
| python scripts/eval.py --episodes 20 --domain flight_crisis |
| |
| # Verbose per-step output |
| python scripts/eval.py --episodes 5 --verbose |
| ``` |
|
|
| --- |
|
|
| ## CLI Arguments |
|
|
| | Argument | Type | Default | Description | |
| |---|---|---|---| |
| | `--episodes` | `int` | `10` | Number of episodes to run | |
| | `--domain` | `str` | `None` | Optional domain filter passed to `TaskGenerator.generate()` | |
| | `--verbose` | flag | `False` | Print per-step action, reward, and done status | |
|
|
| Supported `--domain` values: `flight_crisis`, `code_merge_crisis` (or omit for random). |
|
|
| --- |
|
|
| ## Output |
|
|
| ### Per-episode table |
|
|
| ``` |
| EP TOTAL REWARD STEPS DOMAIN SUCCESS |
| ββββ ββββββββββββ ββββββ ββββββββββββββββββββ βββββββ |
| 1 0.3120 8 flight_crisis β |
| 2 1.8450 12 code_merge_crisis β |
| ``` |
|
|
| ### Aggregate stats |
|
|
| ``` |
| ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| Episodes : 10 |
| Mean Reward : 0.8231 |
| Success Rate : 30.0% |
| Mean Steps : 10.4 |
| ``` |
|
|
| --- |
|
|
| ## Action Space (Random Baseline) |
|
|
| Each step samples uniformly from: |
| `execute`, `inspect`, `plan`, `wait`, `communicate`, `spend`, `delegate` |
|
|
| - `execute` actions target a real route ID from the active task. |
| - `inspect` actions target a real hidden-state key from the active task. |
| - Other actions apply a small random metric nudge and resource cost. |
|
|
| --- |
|
|
| ## Change Log |
|
|
| | Date | Change | |
| |---|---| |
| | 2026-04-23 | File created β implements random baseline evaluation runner | |
|
|