scripts/eval.py — evaluation runner

Standalone random-policy baseline against LifeStackEnv. No trained model, no GPU, no API key.

Use it to:

  • Verify the simulator after code changes
  • Establish a reward floor before GRPO runs
  • Run quick CI / smoke checks
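
To make the idea of a random-policy baseline concrete, here is a minimal sketch of one episode rolled out with uniformly random actions. `ToyEnv`, its action names, and its `reset` / `step` signatures are hypothetical stand-ins; the real LifeStackEnv interface may differ.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for LifeStackEnv; the real interface may differ."""

    ACTIONS = ["wait", "rebook", "escalate"]  # made-up action names

    def __init__(self, max_steps=5, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return {"step": self.t}

    def step(self, action):
        self.t += 1
        reward = self.rng.uniform(-1.0, 1.0)  # placeholder reward signal
        done = self.t >= self.max_steps
        return {"step": self.t}, reward, done


def run_random_episode(env, rng):
    """Roll out one episode with uniformly random actions; return per-step rewards."""
    env.reset()
    rewards = []
    done = False
    while not done:
        action = rng.choice(env.ACTIONS)  # no model involved, just a uniform draw
        _, reward, done = env.step(action)
        rewards.append(reward)
    return rewards


rewards = run_random_episode(ToyEnv(max_steps=5), random.Random(0))
print(f"steps={len(rewards)} mean_reward={sum(rewards) / len(rewards):.3f}")
```

Because the policy ignores observations entirely, the mean reward from such a loop serves only as a floor to compare trained policies against.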

Usage

python scripts/eval.py
python scripts/eval.py --episodes 20 --domain flight_crisis
python scripts/eval.py --episodes 5 --verbose

Run from the repo root so core imports resolve.


CLI

| Argument | Default | Description |
|----------|---------|-------------|
| --episodes | 10 | Number of episodes |
| --domain | None | Optional filter for TaskGenerator / task domain (e.g. flight_crisis, code_merge_crisis, or transport_crisis if wired) |
| --verbose | off | Per-step action, reward, done |
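
The arguments above map naturally onto an `argparse` parser. The following is a sketch of how they could be declared, not a copy of what scripts/eval.py actually contains; the `build_parser` helper name and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(description="Random-policy baseline runner (sketch)")
    parser.add_argument("--episodes", type=int, default=10, help="Number of episodes")
    parser.add_argument("--domain", type=str, default=None, help="Optional task-domain filter")
    # store_true makes the flag default to False ("off") when omitted
    parser.add_argument("--verbose", action="store_true", help="Print per-step action, reward, done")
    return parser


args = build_parser().parse_args(["--episodes", "20", "--domain", "flight_crisis"])
print(args.episodes, args.domain, args.verbose)  # 20 flight_crisis False
```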

Output

  • Per-episode table (mean reward, steps, domain)
  • Aggregate mean / std across episodes
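
As an illustration of the aggregate line, the sketch below computes the mean and standard deviation of per-episode mean rewards. The reward values are made up, and the use of the population standard deviation (`statistics.pstdev`) is an assumption; the script may use the sample form instead.

```python
import statistics


def summarize(episode_means):
    """Return (mean, population std) across per-episode mean rewards."""
    mean = statistics.mean(episode_means)
    std = statistics.pstdev(episode_means)  # assumption: population, not sample, std
    return mean, std


episode_means = [0.2, -0.1, 0.4, 0.1]  # made-up values for illustration
mean, std = summarize(episode_means)
print(f"mean={mean:.3f} std={std:.3f}")
```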

To evaluate a trained model, use scripts/train_trl.py --full-episode or the app demos; eval.py is intentionally random.


Relation to GRPO training

| Tool | Policy |
|------|--------|
| eval.py | Uniform random actions |
| train_trl.py | GRPO-trained LLM completions |
| train_trl.py --full-episode | Roll out multi-step episodes with a saved checkpoint |

See also