scripts/eval.py — evaluation runner
Standalone random-policy baseline against LifeStackEnv. No trained model, no GPU, no API key.
Use it to:
- Verify the simulator after code changes
- Establish a reward floor before GRPO runs
- Run quick CI / smoke checks
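For orientation, a random-policy rollout of the kind described above might look like the following minimal sketch. `StubEnv` is a hypothetical stand-in, not LifeStackEnv itself: its `reset`/`step` signatures, the discrete `n_actions` action space, and the placeholder reward are all assumptions made for illustration, and the real environment may differ.

```python
import random


class StubEnv:
    """Hypothetical stand-in for LifeStackEnv; the real interface may differ."""

    def __init__(self, horizon=5, n_actions=4, seed=0):
        self.horizon = horizon
        self.n_actions = n_actions
        self._rng = random.Random(seed)
        self._t = 0

    def reset(self):
        self._t = 0
        return {"step": 0}

    def step(self, action):
        self._t += 1
        reward = self._rng.uniform(-1.0, 1.0)  # placeholder reward signal
        done = self._t >= self.horizon
        return {"step": self._t}, reward, done


def run_random_episode(env, rng):
    """Roll out one episode with uniformly random actions; return step rewards."""
    env.reset()
    rewards = []
    done = False
    while not done:
        action = rng.randrange(env.n_actions)  # uniform over action indices
        _, reward, done = env.step(action)
        rewards.append(reward)
    return rewards


if __name__ == "__main__":
    rewards = run_random_episode(StubEnv(), random.Random(0))
    print(f"steps={len(rewards)} mean_reward={sum(rewards) / len(rewards):.3f}")
```

Because the policy ignores observations entirely, the mean reward from a loop like this is a floor: a trained policy that scores below it is likely broken rather than merely undertrained.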
Usage
python scripts/eval.py
python scripts/eval.py --episodes 20 --domain flight_crisis
python scripts/eval.py --episodes 5 --verbose
Run from the repo root so core imports resolve.
CLI
| Argument | Default | Description |
|---|---|---|
| --episodes | 10 | Number of episodes |
| --domain | None | Optional filter for TaskGenerator / task domain (e.g. flight_crisis, code_merge_crisis, or transport_crisis if wired) |
| --verbose | off | Per-step action, reward, done |
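The flags in the table could be declared with `argparse` roughly as follows. This is a sketch that mirrors the documented names and defaults; the helper name `build_parser` and the help strings are illustrative, and the actual parser in eval.py may be organized differently.

```python
import argparse


def build_parser():
    """Build a parser matching the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Random-policy baseline (sketch)")
    parser.add_argument("--episodes", type=int, default=10,
                        help="Number of episodes")
    parser.add_argument("--domain", type=str, default=None,
                        help="Optional task-domain filter")
    parser.add_argument("--verbose", action="store_true",
                        help="Print per-step action, reward, done")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--episodes", "20", "--domain", "flight_crisis"])
print(args.episodes, args.domain, args.verbose)  # 20 flight_crisis False
```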
Output
- Per-episode table (mean reward, steps, domain)
- Aggregate mean / std across episodes
To evaluate trained models, use scripts/train_trl.py --full-episode or the app demos; eval.py is intentionally random.
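The per-episode means and the aggregate mean / std in the output could be computed along these lines. This is a sketch under assumptions: the function name `summarize` is hypothetical, and it uses the population standard deviation, whereas eval.py may use the sample form.

```python
import statistics


def summarize(episode_rewards):
    """Summarize rewards given one list of step rewards per episode.

    Returns (per-episode means, mean of those means, their population std).
    """
    per_episode = [statistics.fmean(steps) for steps in episode_rewards]
    mean = statistics.fmean(per_episode)
    std = statistics.pstdev(per_episode)  # assumption: population std
    return per_episode, mean, std


per_episode, mean, std = summarize([[1.0, 3.0], [2.0, 4.0]])
print(per_episode, mean, std)  # [2.0, 3.0] 2.5 0.5
```

With only a handful of episodes the standard deviation is a noisy estimate, so a larger `--episodes` value gives a more stable floor.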
Relation to GRPO training
| Tool | Policy |
|---|---|
| eval.py | Uniform random actions |
| train_trl.py | GRPO-trained LLM completions |
| train_trl.py --full-episode | Roll out multi-step episodes with a saved checkpoint |