# eval.py: Evaluation Runner Reference

`scripts/eval.py`: standalone LifeStack evaluation runner using a random-action baseline.

No model, no GPU, no API key required.

---

## Overview

Runs N independent episodes against `LifeStackEnv` using uniformly random actions as a
baseline policy. Prints a live per-episode table and aggregate statistics at the end.

Useful for:
- Verifying environment correctness after changes
- Establishing a random-baseline reward floor before training
- CI smoke checks (no external dependencies)
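
The episode loop described in this overview can be sketched as follows. This is a minimal illustration, not the script's actual code: `StubEnv` is a hypothetical stand-in for `LifeStackEnv`, and its `reset()` / `step()` signatures and reward values are assumptions made so the example runs on its own.

```python
import random

ACTIONS = ["execute", "inspect", "plan", "wait", "communicate", "spend", "delegate"]


class StubEnv:
    """Hypothetical stand-in for LifeStackEnv; the real API may differ."""

    def __init__(self, max_steps=5, seed=0):
        self._rng = random.Random(seed)
        self._max_steps = max_steps
        self._t = 0

    def reset(self):
        self._t = 0
        return {"t": 0}

    def step(self, action):
        # Placeholder dynamics: a small random reward, fixed episode length.
        self._t += 1
        reward = self._rng.random() * 0.1
        done = self._t >= self._max_steps
        return {"t": self._t}, reward, done


def run_episode(env, rng):
    """Play one episode with uniformly random actions; return (total_reward, steps)."""
    env.reset()
    total, steps, done = 0.0, 0, False
    while not done:
        _, reward, done = env.step(rng.choice(ACTIONS))
        total += reward
        steps += 1
    return total, steps


def run_baseline(n_episodes, seed=0):
    """Run n independent episodes, each against a fresh environment."""
    rng = random.Random(seed)
    return [run_episode(StubEnv(seed=seed + i), rng) for i in range(n_episodes)]


if __name__ == "__main__":
    for ep, (total, steps) in enumerate(run_baseline(3), start=1):
        print(f"{ep:>5}  {total:>12.4f}  {steps:>6}")
```

Creating a fresh environment per episode keeps episodes independent, which is what makes the mean reward usable as a baseline floor.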

---

## Usage

```bash
# Default: 10 episodes, any domain
python scripts/eval.py

# 20 episodes, flight_crisis domain only
python scripts/eval.py --episodes 20 --domain flight_crisis

# Verbose per-step output
python scripts/eval.py --episodes 5 --verbose
```

---

## CLI Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| `--episodes` | `int` | `10` | Number of episodes to run |
| `--domain` | `str` | `None` | Optional domain filter passed to `TaskGenerator.generate()` |
| `--verbose` | flag | `False` | Print per-step action, reward, and done status |

Supported `--domain` values: `flight_crisis`, `code_merge_crisis`. Omit the flag to run episodes from a randomly chosen domain.
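
For reference, a parser matching the table above could look like the sketch below. It uses only the standard `argparse` module; the function name `build_parser` and the help strings are illustrative, and whether the real script validates `--domain` against the supported values is not specified here, so this sketch leaves it unrestricted.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser mirroring the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Random-baseline evaluation runner (sketch)"
    )
    parser.add_argument("--episodes", type=int, default=10,
                        help="Number of episodes to run")
    parser.add_argument("--domain", type=str, default=None,
                        help="Optional domain filter, e.g. flight_crisis")
    parser.add_argument("--verbose", action="store_true",
                        help="Print per-step action, reward, and done status")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.episodes, args.domain, args.verbose)
```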

---

## Output

### Per-episode table

```
   EP   TOTAL REWARD   STEPS  DOMAIN                SUCCESS
  ────  ────────────  ──────  ────────────────────  ───────
     1        0.3120       8  flight_crisis               ✗
     2        1.8450      12  code_merge_crisis           ✓
```

### Aggregate stats

```
  ──────────────────────────────────────────────────────────
  Episodes     : 10
  Mean Reward  : 0.8231
  Success Rate : 30.0%
  Mean Steps   : 10.4
```

---

## Action Space (Random Baseline)

Each step samples uniformly from:
`execute`, `inspect`, `plan`, `wait`, `communicate`, `spend`, `delegate`

- `execute` actions target a real route ID from the active task.
- `inspect` actions target a real hidden-state key from the active task.
- Other actions apply a small random metric nudge and resource cost.
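
The sampling rules above can be sketched as follows. The action dictionary layout, the `nudge` and `cost` fields, and their numeric ranges are assumptions for illustration; the real script may represent actions differently.

```python
import random

ACTION_TYPES = ["execute", "inspect", "plan", "wait", "communicate", "spend", "delegate"]


def sample_action(route_ids, hidden_keys, rng):
    """Sample one action uniformly by type, then fill in a type-specific payload."""
    action_type = rng.choice(ACTION_TYPES)
    action = {"type": action_type, "target": None}
    if action_type == "execute" and route_ids:
        # Target a real route ID from the active task.
        action["target"] = rng.choice(route_ids)
    elif action_type == "inspect" and hidden_keys:
        # Target a real hidden-state key from the active task.
        action["target"] = rng.choice(hidden_keys)
    else:
        # Hypothetical payload: a small metric nudge plus a resource cost.
        action["nudge"] = rng.uniform(-0.05, 0.05)
        action["cost"] = rng.uniform(0.0, 1.0)
    return action


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_action(["route_a", "route_b"], ["hidden_x"], rng))
```

Passing an explicit `random.Random` instance rather than using the module-level functions makes a run reproducible from a seed, which helps when comparing baseline numbers across environment changes.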

---

## Change Log

| Date | Change |
|---|---|
| 2026-04-23 | File created; implements random baseline evaluation runner |