Spaces:

s-b3
/

LifeStack

Sleeping

App Files Files Community

LifeStack / docs /eval.md

Soham Banerjee

deploy: pure lifestack with partitioned wisdom pool

77da5ce about 1 month ago

preview code

raw

history blame contribute delete

2.38 kB

eval.py — Evaluation Runner Reference

scripts/eval.py — Standalone LifeStack evaluation runner using a random-action baseline.

No model, no GPU, no API key required.

Overview

Runs N independent episodes against LifeStackEnv using uniformly random actions as a baseline policy. Prints a live per-episode table and aggregate statistics at the end.

Useful for:

Verifying environment correctness after changes
Establishing a random-baseline reward floor before training
CI smoke checks (no external dependencies)

Usage

# Default: 10 episodes, any domain
python scripts/eval.py

# 20 episodes, flight_crisis domain only
python scripts/eval.py --episodes 20 --domain flight_crisis

# Verbose per-step output
python scripts/eval.py --episodes 5 --verbose

CLI Arguments

Argument	Type	Default	Description
`--episodes`	`int`	`10`	Number of episodes to run
`--domain`	`str`	`None`	Optional domain filter passed to `TaskGenerator.generate()`
`--verbose`	flag	`False`	Print per-step action, reward, and done status

Supported --domain values: flight_crisis, code_merge_crisis (or omit for random).

Output

Per-episode table

   EP   TOTAL REWARD   STEPS  DOMAIN                SUCCESS
  ────  ────────────  ──────  ────────────────────  ───────
     1        0.3120       8  flight_crisis               ✗
     2        1.8450      12  code_merge_crisis            ✓

Aggregate stats

  ──────────────────────────────────────────────────────────
  Episodes     : 10
  Mean Reward  : 0.8231
  Success Rate : 30.0%
  Mean Steps   : 10.4

Action Space (Random Baseline)

Each step samples uniformly from: execute, inspect, plan, wait, communicate, spend, delegate

execute actions target a real route ID from the active task.
inspect actions target a real hidden-state key from the active task.
Other actions apply a small random metric nudge and resource cost.

Change Log

Date	Change
2026-04-23	File created — implements random baseline evaluation runner