LifeStack / docs /eval.md
Soham Banerjee
deploy: pure lifestack with partitioned wisdom pool
77da5ce

eval.py β€” Evaluation Runner Reference

scripts/eval.py β€” Standalone LifeStack evaluation runner using a random-action baseline.

No model, no GPU, no API key required.


Overview

Runs N independent episodes against LifeStackEnv using uniformly random actions as a baseline policy. Prints a live per-episode table and aggregate statistics at the end.

Useful for:

  • Verifying environment correctness after changes
  • Establishing a random-baseline reward floor before training
  • CI smoke checks (no external dependencies)

Usage

# Default: 10 episodes, any domain
python scripts/eval.py

# 20 episodes, flight_crisis domain only
python scripts/eval.py --episodes 20 --domain flight_crisis

# Verbose per-step output
python scripts/eval.py --episodes 5 --verbose

CLI Arguments

Argument Type Default Description
--episodes int 10 Number of episodes to run
--domain str None Optional domain filter passed to TaskGenerator.generate()
--verbose flag False Print per-step action, reward, and done status

Supported --domain values: flight_crisis, code_merge_crisis (or omit for random).


Output

Per-episode table

   EP   TOTAL REWARD   STEPS  DOMAIN                SUCCESS
  ────  ────────────  ──────  ────────────────────  ───────
     1        0.3120       8  flight_crisis               βœ—
     2        1.8450      12  code_merge_crisis            βœ“

Aggregate stats

  ──────────────────────────────────────────────────────────
  Episodes     : 10
  Mean Reward  : 0.8231
  Success Rate : 30.0%
  Mean Steps   : 10.4

Action Space (Random Baseline)

Each step samples uniformly from: execute, inspect, plan, wait, communicate, spend, delegate

  • execute actions target a real route ID from the active task.
  • inspect actions target a real hidden-state key from the active task.
  • Other actions apply a small random metric nudge and resource cost.

Change Log

Date Change
2026-04-23 File created β€” implements random baseline evaluation runner