Spaces:

s-b3
/

LifeStack

Sleeping

App Files Files Community

LifeStack / docs /eval.md

Soham Banerjee

deploy: pure lifestack with partitioned wisdom pool

77da5ce about 1 month ago

preview code

raw

history blame contribute delete

2.38 kB

	# eval.py — Evaluation Runner Reference

	`scripts/eval.py` — Standalone LifeStack evaluation runner using a random-action baseline.

	No model, no GPU, no API key required.

	---

	## Overview

	Runs N independent episodes against `LifeStackEnv` using uniformly random actions as a
	baseline policy. Prints a live per-episode table and aggregate statistics at the end.

	Useful for:
	- Verifying environment correctness after changes
	- Establishing a random-baseline reward floor before training
	- CI smoke checks (no external dependencies)

	---

	## Usage

	```bash
	# Default: 10 episodes, any domain
	python scripts/eval.py

	# 20 episodes, flight_crisis domain only
	python scripts/eval.py --episodes 20 --domain flight_crisis

	# Verbose per-step output
	python scripts/eval.py --episodes 5 --verbose
	```

	---

	## CLI Arguments

	\| Argument \| Type \| Default \| Description \|
	\|---\|---\|---\|---\|
	\| `--episodes` \| `int` \| `10` \| Number of episodes to run \|
	\| `--domain` \| `str` \| `None` \| Optional domain filter passed to `TaskGenerator.generate()` \|
	\| `--verbose` \| flag \| `False` \| Print per-step action, reward, and done status \|

	Supported `--domain` values: `flight_crisis`, `code_merge_crisis` (or omit for random).

	---

	## Output

	### Per-episode table

	```
	EP TOTAL REWARD STEPS DOMAIN SUCCESS
	──── ──────────── ────── ──────────────────── ───────
	1 0.3120 8 flight_crisis ✗
	2 1.8450 12 code_merge_crisis ✓
	```

	### Aggregate stats

	```
	──────────────────────────────────────────────────────────
	Episodes : 10
	Mean Reward : 0.8231
	Success Rate : 30.0%
	Mean Steps : 10.4
	```

	---

	## Action Space (Random Baseline)

	Each step samples uniformly from:
	`execute`, `inspect`, `plan`, `wait`, `communicate`, `spend`, `delegate`

	- `execute` actions target a real route ID from the active task.
	- `inspect` actions target a real hidden-state key from the active task.
	- Other actions apply a small random metric nudge and resource cost.

	---

	## Change Log

	\| Date \| Change \|
	\|---\|---\|
	\| 2026-04-23 \| File created — implements random baseline evaluation runner \|