# `scripts/eval.py` — evaluation runner Standalone **random-policy** baseline against **`LifeStackEnv`**. No trained model, no GPU, no API key. Use it to: - Verify the simulator after code changes - Establish a **reward floor** before GRPO runs - Quick CI / smoke checks --- ## Usage ```bash python scripts/eval.py python scripts/eval.py --episodes 20 --domain flight_crisis python scripts/eval.py --episodes 5 --verbose ``` Run from the **repo root** so `core` imports resolve. --- ## CLI | Argument | Default | Description | |----------|---------|-------------| | `--episodes` | `10` | Number of episodes | | `--domain` | `None` | Optional filter for `TaskGenerator` / task domain (e.g. `flight_crisis`, `code_merge_crisis`, or `transport_crisis` if wired) | | `--verbose` | off | Per-step action, reward, `done` | --- ## Output - Per-episode table (mean reward, steps, domain) - Aggregate mean / std across episodes Interpret **trained** models with `scripts/train_trl.py --full-episode` or app demos — `eval.py` is intentionally **random**. --- ## Relation to GRPO training | Tool | Policy | |------|--------| | `eval.py` | Uniform random actions | | `train_trl.py` | GRPO-trained LLM completions | | `train_trl.py --full-episode` | Roll out **multi-step** episodes with a **saved** checkpoint | --- ## See also - [train_trl.md](train_trl.md) - [lifestack_env.md](lifestack_env.md)