File size: 1,424 Bytes
ddbc1ba | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 | # `scripts/eval.py` — evaluation runner
Standalone **random-policy** baseline against **`LifeStackEnv`**. No trained model, no GPU, no API key.
Use it to:
- Verify the simulator after code changes
- Establish a **reward floor** before GRPO runs
- Quick CI / smoke checks
---
## Usage
```bash
python scripts/eval.py
python scripts/eval.py --episodes 20 --domain flight_crisis
python scripts/eval.py --episodes 5 --verbose
```
Run from the **repo root** so `core` imports resolve.
---
## CLI
| Argument | Default | Description |
|----------|---------|-------------|
| `--episodes` | `10` | Number of episodes |
| `--domain` | `None` | Optional filter for `TaskGenerator` / task domain (e.g. `flight_crisis`, `code_merge_crisis`, or `transport_crisis` if wired) |
| `--verbose` | off | Per-step action, reward, `done` |
---
## Output
- Per-episode table (mean reward, steps, domain)
- Aggregate mean / std across episodes
Interpret **trained** models with `scripts/train_trl.py --full-episode` or app demos — `eval.py` is intentionally **random**.
---
## Relation to GRPO training
| Tool | Policy |
|------|--------|
| `eval.py` | Uniform random actions |
| `train_trl.py` | GRPO-trained LLM completions |
| `train_trl.py --full-episode` | Roll out **multi-step** episodes with a **saved** checkpoint |
---
## See also
- [train_trl.md](train_trl.md)
- [lifestack_env.md](lifestack_env.md)
|