LifeStack Long-Horizon Upgrade Plan
Context
LifeStack is a hackathon RL project that simulates life-decision tasks as a gym-style environment. Currently episodes are 5 steps long, use a single linear conflict path, have no hidden state or exogenous events, and reward only step-level metric improvements. Judges expect a proper long-horizon environment with 20+ steps, branching routes, dynamic world changes, partial observability, and task-completion rewards. This plan covers the full upgrade across pre-hackathon, Day 1, and Day 2.
Key discoveries from reading the repo:
- app.py is a Gradio app (not FastAPI). New "endpoints" = new Gradio tabs/functions.
- max_steps = 5 is hardcoded in two places: core/lifestack_env.py:93 AND core/lifestack_gym_env.py:62.
- The current reward is step-local only (no task-completion bonus exists anywhere).
- memory.py stores single decisions keyed by conflict title; no trajectory concept exists.
- run_episode.py orchestrates the loop outside the env (agent loop + env.step in separate code).
- ChromaDB is already persistent (./lifestack_memory/).
- train_trl.py already has a working GRPO loop with Unsloth; it just needs the new env interface.
- app.py imports LongitudinalDemo (not in the file listing; likely missing or in a data file).
Proposed core/task.py Schema (SHARED CONTRACT: agree before writing any logic)
from dataclasses import dataclass, field
from typing import Any


@dataclass
class HiddenStateField:
    key: str                    # e.g. "boss_mood"
    initial_value: Any          # e.g. "neutral"
    inspect_target: str         # e.g. "call_boss"; which inspect action type reveals this
    description: str            # shown to agent after reveal


@dataclass
class ExoEvent:
    step: int                   # inject at this step (inclusive); -1 = probabilistic
    probability: float          # 1.0 = deterministic; <1.0 = random at each step
    id: str                     # e.g. "ticket_price_spike"
    description: str            # what agent sees in next observation
    world_mutation: dict        # e.g. {"ticket_price": 450, "seats_remaining": 1}
    hidden_state_mutation: dict # e.g. {"boss_mood": "angry"}
    closes_routes: list[str] = field(default_factory=list)  # route IDs this event blocks


@dataclass
class Milestone:
    id: str                     # e.g. "flight_rebooked"
    description: str
    condition_key: str          # world/hidden key to check, e.g. "flight_rebooked"
    condition_value: Any        # e.g. True
    reward: float               # milestone reward added to episode total


@dataclass
class Route:
    id: str                     # e.g. "rebook_premium"
    name: str
    description: str
    required_action_types: list[str]  # must use these tool actions to complete
    preconditions: dict         # world/hidden state checks, e.g. {"card_available": True}
    consequences: dict          # world mutations on route completion, e.g. {"flight_rebooked": True}
    closes_routes: list[str]    # route IDs this blocks
    milestones_unlocked: list[str]    # milestone IDs this route can hit
    final_reward: float         # bonus on route completion


@dataclass
class Task:
    id: str
    domain: str                 # "flight_crisis" | "code_merge_crisis"
    goal: str
    constraints: dict           # e.g. {"budget_max": 400, "deadline_step": 18}
    hidden_state: dict          # full truth, agent never sees directly
    mutable_world: dict         # partial truth, some fields revealed by inspect
    visible_world: dict         # agent sees this at each step (subset of mutable_world)
    success_conditions: list[dict]    # e.g. [{"key": "flight_rebooked", "value": True}]
    failure_conditions: list[dict]    # e.g. [{"key": "missed_deadline", "value": True}]
    event_schedule: list[ExoEvent]
    viable_routes: list[Route]
    milestones: list[Milestone]
    horizon: int                # max steps (20–50)
    difficulty: int             # 1–5
    domain_metadata: dict       # domain-specific extra data (story text, etc.)
Agreement required: All three team members must freeze this schema before writing any logic.
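Alongside the frozen schema, a cheap consistency check may be worth agreeing on. The sketch below is one possibility and not part of the plan: it uses trimmed stand-ins for Route and ExoEvent (only the id and closes_routes fields from the schema above), and the helper name find_dangling_route_refs is hypothetical. It reports every closes_routes entry that names a route ID no Route defines.

```python
from dataclasses import dataclass, field


# Trimmed stand-ins for the schema above: only the fields this check needs.
@dataclass
class Route:
    id: str
    closes_routes: list[str] = field(default_factory=list)


@dataclass
class ExoEvent:
    id: str
    closes_routes: list[str] = field(default_factory=list)


def find_dangling_route_refs(routes: list[Route], events: list[ExoEvent]) -> list[str]:
    """Return 'owner -> route_id' strings for closes_routes entries naming no known route."""
    known = {route.id for route in routes}
    problems = []
    for owner in [*routes, *events]:
        for ref in owner.closes_routes:
            if ref not in known:
                problems.append(f"{owner.id} -> {ref}")
    return problems


routes = [Route("rebook_premium"), Route("bus_and_remote", closes_routes=["rebook_premium"])]
# "hotel_stay" is deliberately not a defined route ID, so the check should flag it.
events = [ExoEvent("card_blocked", closes_routes=["rebook_premium", "hotel_stay"])]
dangling = find_dangling_route_refs(routes, events)
```

Running something like this inside the core/task.py unit test would surface route-ID typos before any env logic depends on them.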
Risk Register
| Risk | Severity | Mitigation |
|---|---|---|
| Cascade runaway over 30 steps: DependencyGraph with 0.6 dampening can collapse metrics to 0 after repeated disruptions | HIGH | Add metric_floor = 10.0 in life_state.py; cascade clamps to max(floor, result), not max(0, result). Also add a per-step cascade cap: max 3 metrics affected per step. |
| Resource exhaustion on longer episodes: default 20h/500$/100e depletes in ~5 steps of aggressive action | HIGH | Scale budgets proportionally in reset(): time=20*max_steps/5, etc. Make configurable per-Task via constraints. |
| Reward hacking (inspect spam): agent learns to inspect repeatedly for reward | HIGH | Anti-cheat: same hidden_state key cannot be inspected twice. Inspect has no intrinsic reward. |
| Reward hacking (wait loops): agent waits forever | MEDIUM | Cap: max 3 consecutive wait actions; 4th wait triggers forced escalate. |
| Reward hacking (rollback loops): rollback-execute-rollback cycle | MEDIUM | Rollback is only available once per route; marks action as used_rollback=True in state. |
| Colab T4 session timeout: free Colab sessions time out at ~12h | MEDIUM | Save a checkpoint every 50 steps in train_trl.py (for example via the trainer config's save_strategy="steps" and save_steps=50) rather than relying only on save_pretrained_merged() at the end. |
| ChromaDB trajectory bloat: 30 steps × 23 metrics = ~700 floats per trajectory; 100 trajectories = 70k floats | LOW | Store trajectory summary (start/end state diff + route taken + total reward), not full step-by-step. |
| OpenEnv API version: openenv-core>=0.2.3 in requirements; _EnvBase, Action, Observation, State, Rubric are OpenEnv abstractions. Need to confirm create_app() signature matches. | MEDIUM | Do not change LifeStackAction/LifeStackObservation/LifeStackState class names or fields. Add new fields as Optional to maintain backward compat. |
| Two hardcoded max_steps=5: will break if only one is updated | HIGH | Fix both in Phase 0. Make max_steps a constructor param defaulting to task.horizon or 30. |
| app.py imports LongitudinalDemo: not in file listing; may be a missing class | MEDIUM | Check if it's defined inline or in a missing file. If missing, stub it for Day 1. |
| run_episode.py duplicates env loop: agent loop lives outside env. New long-horizon logic must work in both env.step() and the external runner | MEDIUM | Keep run_episode.py working; it calls env.step(), which now handles world mutation/events internally. |
| TRL GRPO reward function parses prompt: lifestack_reward_fn in train_trl.py reconstructs state from prompt text | MEDIUM | After env upgrade, update build_prompt_for_conflict() to include Task fields and update reward function accordingly. |
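The resource-exhaustion mitigation (time=20*max_steps/5, etc.) can be sketched as a small helper. This is only an illustration under assumptions: the budget key names time, money, and energy are guesses at what 20h/500$/100e refer to, and scale_budgets and its overrides argument are hypothetical names standing in for per-Task constraints.

```python
from typing import Optional

# Baseline budgets quoted in the risk register for the current 5-step episodes.
BASE_STEPS = 5
BASE_BUDGET = {"time": 20.0, "money": 500.0, "energy": 100.0}  # key names are assumed


def scale_budgets(max_steps: int, overrides: Optional[dict] = None) -> dict:
    """Scale each baseline budget linearly with episode length, then apply overrides."""
    factor = max_steps / BASE_STEPS
    budgets = {name: value * factor for name, value in BASE_BUDGET.items()}
    budgets.update(overrides or {})  # e.g. a Task constraint such as budget_max
    return budgets


budgets_30 = scale_budgets(30)
flight_crisis = scale_budgets(25, {"money": 400.0})
```

Linear scaling is the simplest reading of the mitigation; whether budgets should really grow one-for-one with horizon is a tuning question the plan leaves open.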
File-by-File Change Plan
NEW: core/task.py
- All dataclasses from schema above
- FlightCrisisTask() factory function returning a hardcoded Task instance (used for testing)
- CodeMergeCrisisTask() factory (stubbed Day 1, complete Day 2)
- No imports from other project files (pure data)
MODIFIED: core/lifestack_env.py
Existing: max_steps=5, flat step logic, no hidden state, no events
Changes:
- Add WorldEngine inner class:
  - __init__(task: Task): stores event schedule
  - inject_events(step: int, world: dict, hidden: dict) -> list[ExoEvent]: returns events fired this step, mutates world/hidden in-place
  - get_closed_routes() -> set[str]: routes blocked by events
- Add PartialObsFilter:
  - filter(world: dict, revealed_keys: set[str]) -> dict: returns only visible_world + revealed fields
- Change __init__ signature: __init__(task: Task = None, max_steps: int = 30)
- In reset(): initialize world_state, hidden_state, revealed_hidden_keys, current_task, active_route, milestones_achieved, used_rollback
- In step():
  - Run world_engine.inject_events(step) to get fired events
  - Apply ToolAction logic (inspect/plan/execute/wait/rollback/escalate)
  - Check route preconditions; mark routes closed if violated
  - Compute reward via updated compute_reward()
  - Check success/failure conditions from task
  - Build observation with partial_obs_filter
- Add render() update: show task goal, active route, milestones achieved, events log
- Preserve: LifeStackAction, LifeStackObservation, LifeStackState class names and core fields (add Optional new fields)
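The event-injection and partial-observability pieces listed above could look roughly like the following. This is a sketch under assumptions rather than the planned implementation: the constructor takes the event schedule directly instead of a Task, ExoEvent is a trimmed copy of the schema class, filter_observation is a hypothetical free function standing in for PartialObsFilter.filter (with an extra hidden argument), and the rule that scheduled events fire once while step = -1 events may fire repeatedly is one interpretation of the schema comments.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ExoEvent:  # trimmed copy of the schema's ExoEvent
    step: int
    probability: float
    id: str
    world_mutation: dict = field(default_factory=dict)
    hidden_state_mutation: dict = field(default_factory=dict)
    closes_routes: list[str] = field(default_factory=list)


class WorldEngine:
    def __init__(self, event_schedule: list[ExoEvent], seed: int = 0):
        self.pending = list(event_schedule)
        self.closed_routes: set[str] = set()
        self.rng = random.Random(seed)  # seeded so episodes are reproducible

    def inject_events(self, step: int, world: dict, hidden: dict) -> list[ExoEvent]:
        """Fire due events, mutating world/hidden in place; return the fired events."""
        fired = []
        for event in list(self.pending):
            if event.step not in (step, -1):  # -1 marks a probabilistic event
                continue
            if self.rng.random() >= event.probability:
                continue
            world.update(event.world_mutation)
            hidden.update(event.hidden_state_mutation)
            self.closed_routes.update(event.closes_routes)
            if event.step != -1:  # assumption: scheduled events fire only once
                self.pending.remove(event)
            fired.append(event)
        return fired

    def get_closed_routes(self) -> set[str]:
        return set(self.closed_routes)


def filter_observation(world: dict, hidden: dict, visible_keys: set, revealed_keys: set) -> dict:
    """Expose visible world fields plus any hidden fields already revealed by inspect."""
    obs = {key: value for key, value in world.items() if key in visible_keys}
    obs.update({key: hidden[key] for key in revealed_keys if key in hidden})
    return obs


world = {"ticket_price": 280, "seats_remaining": 3}
hidden = {"boss_mood": "neutral"}
engine = WorldEngine([
    ExoEvent(step=5, probability=1.0, id="ticket_price_spike",
             world_mutation={"ticket_price": 450, "seats_remaining": 1},
             closes_routes=["rebook_premium"]),
])
fired_at_4 = engine.inject_events(4, world, hidden)
fired_at_5 = engine.inject_events(5, world, hidden)
obs_before = filter_observation(world, hidden, {"ticket_price"}, set())
obs_after = filter_observation(world, hidden, {"ticket_price"}, {"boss_mood"})
```

Keeping the engine free of any reference to the env object means it can be unit-tested with plain dicts, which matches the WorldEngine check in the verification plan.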
MODIFIED: core/action_space.py
Add ToolActionType enum:
from enum import Enum

class ToolActionType(str, Enum):
    INSPECT = "inspect"
    PLAN = "plan"
    EXECUTE = "execute"
    COMMUNICATE = "communicate"
    WAIT = "wait"
    ROLLBACK = "rollback"
    ESCALATE = "escalate"
Add ToolAction dataclass:
@dataclass
class ToolAction:
    action_type: ToolActionType
    target: str        # inspect target, execute target, communicate recipient, etc.
    parameters: dict   # action-specific params
    reasoning: str
Add validate_tool_action(action: ToolAction, env_state: dict) -> tuple[bool, str]
- Checks: inspect not repeated for same key, wait count ≤ 3, rollback only if not used

Keep: AgentAction, PrimaryAction, CommunicationAction, EXAMPLE_ACTIONS unchanged
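A minimal sketch of validate_tool_action, assuming the three checks listed above. The env_state keys inspected_targets, consecutive_waits, and used_rollback are assumed names (only used_rollback appears elsewhere in this plan), the enum is trimmed to the members the checks need, and the repeat-inspect check keys on the action target as a stand-in for the hidden_state key it reveals.

```python
from dataclasses import dataclass, field
from enum import Enum


class ToolActionType(str, Enum):  # trimmed to the members the checks below need
    INSPECT = "inspect"
    WAIT = "wait"
    ROLLBACK = "rollback"
    EXECUTE = "execute"


@dataclass
class ToolAction:
    action_type: ToolActionType
    target: str
    parameters: dict = field(default_factory=dict)
    reasoning: str = ""


MAX_CONSECUTIVE_WAITS = 3


def validate_tool_action(action: ToolAction, env_state: dict) -> tuple[bool, str]:
    """Return (is_valid, reason); rejects the three exploit patterns named in the plan."""
    if action.action_type is ToolActionType.INSPECT:
        if action.target in env_state.get("inspected_targets", set()):
            return False, f"'{action.target}' was already inspected"
    elif action.action_type is ToolActionType.WAIT:
        if env_state.get("consecutive_waits", 0) >= MAX_CONSECUTIVE_WAITS:
            return False, "wait cap reached; escalate instead"
    elif action.action_type is ToolActionType.ROLLBACK:
        if env_state.get("used_rollback", False):
            return False, "rollback already used"
    return True, "ok"


state = {"inspected_targets": {"call_boss"}, "consecutive_waits": 3, "used_rollback": False}
repeat_inspect = validate_tool_action(ToolAction(ToolActionType.INSPECT, "call_boss"), state)
fresh_inspect = validate_tool_action(ToolAction(ToolActionType.INSPECT, "check_card"), state)
fourth_wait = validate_tool_action(ToolAction(ToolActionType.WAIT, ""), state)
first_rollback = validate_tool_action(ToolAction(ToolActionType.ROLLBACK, "rebook_premium"), state)
```

Returning a reason string alongside the boolean lets the env surface the rejection in the observation, so the agent can learn why an action was refused.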
MODIFIED: core/reward.py
Add functions (do NOT remove compute_reward):
def compute_milestone_reward(milestones_achieved: list[str], task: Task) -> float: ...
def compute_task_completion_reward(success_conditions_met: list[bool], task: Task) -> float: ...
def compute_replan_bonus(exo_events_seen: int, milestones_after_event: int) -> float: ...
def compute_dead_end_penalty(routes_remaining: int) -> float: ...
Add compute_task_reward(...), which orchestrates all components:
- 10% local metric delta (old compute_reward)
- 40% milestone rewards
- 30% task completion
- 10% replan bonus
- 10% efficiency
- Penalties: dead end (-0.5), rollback used (-0.1), cascade collapse (-0.3)
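The weighting above can be written as a plain weighted sum plus flat penalties. The sketch below assumes each component has already been normalized to the range 0 to 1 and that penalties are triggered by boolean flags; the argument shapes and dictionary key names are hypothetical, since the plan leaves the signature as compute_task_reward(...).

```python
# Weights and penalties copied from the list above; key names are illustrative.
WEIGHTS = {
    "local_delta": 0.10,
    "milestones": 0.40,
    "completion": 0.30,
    "replan": 0.10,
    "efficiency": 0.10,
}
PENALTIES = {"dead_end": -0.5, "rollback_used": -0.1, "cascade_collapse": -0.3}


def compute_task_reward(components: dict, flags: dict) -> dict:
    """Combine normalized components (assumed in [0, 1]) and flat penalties."""
    weighted = {name: WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS}
    penalty = sum(value for name, value in PENALTIES.items() if flags.get(name, False))
    return {
        "weighted": weighted,  # per-component breakdown, useful for the Gradio display
        "penalty": penalty,
        "total": sum(weighted.values()) + penalty,
    }


perfect = compute_task_reward({name: 1.0 for name in WEIGHTS}, {})
stuck = compute_task_reward({}, {"dead_end": True, "rollback_used": True})
```

Returning the breakdown rather than a single float makes the Day 1 integration check ("fix any component that returns NaN or 0 always") easier to run.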
MODIFIED: core/life_state.py
- Add METRIC_FLOOR = 10.0 constant
- In DependencyGraph.cascade(): change max(0, ...) to max(METRIC_FLOOR, ...) for cascade-induced changes (not direct actions)
- Add per_step_cascade_cap = 3: BFS stops after affecting 3 nodes per step call
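To make the floor and cap concrete, here is a breadth-first cascade over a toy graph. The real DependencyGraph internals are not shown in this plan, so the edges dict, the metric names, and the standalone cascade function are all assumptions; only the 0.6 dampening, the floor of 10.0, and the cap of 3 come from the text.

```python
from collections import deque

METRIC_FLOOR = 10.0
PER_STEP_CASCADE_CAP = 3
DAMPENING = 0.6  # dampening factor quoted in the risk register


def cascade(metrics: dict, edges: dict, source: str, delta: float) -> list:
    """Propagate delta downstream from source breadth-first; return metrics touched."""
    touched = []
    seen = {source}
    queue = deque([(source, delta)])
    while queue and len(touched) < PER_STEP_CASCADE_CAP:
        node, incoming = queue.popleft()
        for child in edges.get(node, []):
            if child in seen:
                continue
            if len(touched) >= PER_STEP_CASCADE_CAP:
                break  # cap reached: remaining downstream metrics stay untouched
            seen.add(child)
            change = incoming * DAMPENING
            # Cascade-induced changes clamp at the floor rather than at 0.
            metrics[child] = max(METRIC_FLOOR, metrics[child] + change)
            touched.append(child)
            queue.append((child, change))
    return touched


# Hypothetical metrics and dependencies, for illustration only.
metrics = {"sleep": 50.0, "focus": 20.0, "mood": 40.0, "output": 30.0, "health": 60.0}
edges = {"sleep": ["focus", "mood"], "focus": ["output"], "mood": ["health"]}
touched = cascade(metrics, edges, "sleep", -30.0)
```

In this toy run focus would drop below the floor and is clamped to 10.0, and health is never reached because the cap stops the traversal after three metrics.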
MODIFIED: agent/conflict_generator.py
Add TaskGenerator class:
class TaskGenerator:
    def generate(self, domain: str = None, difficulty: int = None) -> Task: ...
    def generate_flight_crisis(self, difficulty: int) -> Task: ...
    def generate_code_merge_crisis(self, difficulty: int) -> Task: ...
Keep: ConflictEvent, TEMPLATES, generate_conflict(), escalate_conflict() fully intact
MODIFIED: agent/memory.py
Add to store_decision(): optional trajectory: list[dict] = None and route_outcome: str = None params
Add store_trajectory(task_id, route_taken, total_reward, trajectory_summary) method:
- trajectory_summary = {start_state_diff, end_state_diff, milestones_hit, events_seen, route_id, total_reward}
- Store in separate ChromaDB collection 'trajectories'

Add retrieve_similar_trajectories(task_domain, current_world) -> list[dict]
Keep: all existing methods unchanged
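One way to build the compact summary before it reaches store_trajectory() is sketched below. The per-step record shape (metrics, reward, milestones, events) and the helper name summarize_trajectory are assumptions, and the sketch collapses start_state_diff and end_state_diff into a single state_diff between the first and last records, which may not be what the final schema intends. Because ChromaDB metadata values are limited to scalars, the summary is JSON-encoded here so it could be stored as the document text.

```python
import json


def summarize_trajectory(steps: list, route_id: str) -> dict:
    """Reduce a step-by-step log to a compact summary (assumed record shape)."""
    first, last = steps[0]["metrics"], steps[-1]["metrics"]
    state_diff = {
        name: last[name] - first[name]
        for name in first
        if name in last and last[name] != first[name]
    }
    milestones_hit, events_seen = [], []
    for step in steps:
        for milestone in step.get("milestones", []):
            if milestone not in milestones_hit:  # keep first occurrence only
                milestones_hit.append(milestone)
        events_seen.extend(step.get("events", []))
    return {
        "route_id": route_id,
        "state_diff": state_diff,
        "milestones_hit": milestones_hit,
        "events_seen": events_seen,
        "total_reward": sum(step.get("reward", 0.0) for step in steps),
    }


# Hypothetical three-step log; metric names are illustrative.
steps = [
    {"metrics": {"stress": 70, "money": 500}, "reward": 0.05, "milestones": ["inspect_boss"], "events": []},
    {"metrics": {"stress": 70, "money": 220}, "reward": 0.0, "milestones": [], "events": ["ticket_price_spike"]},
    {"metrics": {"stress": 55, "money": 220}, "reward": 0.2, "milestones": ["flight_rebooked"], "events": []},
]
summary = summarize_trajectory(steps, "rebook_premium")
document = json.dumps(summary, sort_keys=True)  # text payload for the 'trajectories' collection
```

The summary size stays roughly constant regardless of horizon, which is the point of the "ChromaDB trajectory bloat" mitigation in the risk register.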
MODIFIED: app.py (Gradio)
Add Tab 5: "Task Explorer":
- Shows current Task object (goal, constraints, visible routes, milestones)
- Shows event log for current episode
- Shows route lock status
Add helper functions:
- task_html(task: Task) -> str: renders goal, routes, milestones
- event_log_html(events: list[ExoEvent]) -> str
- route_status_html(routes: list[Route], closed: set[str]) -> str
Keep: All existing tabs and functions unchanged.
MODIFIED: openenv.yaml
metadata:
  max_episode_steps: 50
  task_domains: [flight_crisis, code_merge_crisis]
  # existing fields unchanged
MODIFIED: notebooks/LifeStack_Training.ipynb
- Update env init cell to use Task objects
- Add Colab-ready GRPO cell with pinned versions:
  - unsloth==2024.12.4, trl>=0.9, transformers>=4.45
  - Model: Qwen2.5-1.5B-Instruct (fits T4 with 4-bit)
- Add reward breakdown visualization cell
- Checkpoint every 50 steps cell
Task Domain Specs
Domain 1: Flight Crisis
goal: "Catch the rescheduled flight and submit expense report by Sunday"
constraints: {budget_max: 400, deadline_step: 18, report_deadline_step: 22}
hidden_state:
  boss_mood: "neutral"          # revealed by inspect("call_boss")
  card_limit: 350               # revealed by inspect("check_card")
  partner_flexibility: 0.7      # revealed by inspect("text_partner")
mutable_world:
  ticket_price: 280             # changes at step 5 (spike to 450)
  seats_remaining: 3            # decreases each step probabilistically
  flight_rebooked: false
  report_submitted: false
event_schedule:
  step 5: {ticket_price: 450, seats_remaining: 1} (closes route "rebook_premium" if budget_max=400)
  step 8: {boss_mood: "annoyed"} (hidden_state mutation via msg)
  step 12: {card_blocked: true} (closes routes "rebook_premium", "hotel_next_day")
routes:
  A: rebook_premium (precond: card_available=True, budget>=ticket_price)
  B: bus_and_remote (always open; slower, lower reward)
  C: hotel_next_day (precond: card_available=True; closed at step 12)
  D: family_loan (precond: partner_flexibility>=0.5; revealed after inspect)
  E: negotiate_deadline (precond: boss_mood != "furious"; closed if boss_mood="furious")
milestones:
  - inspect_boss: reward=0.05 (inspected boss_mood)
  - flight_rebooked: reward=0.20
  - report_submitted: reward=0.15
  - under_budget: reward=0.10 (total spend < budget_max)
horizon: 25
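The route preconditions in this spec mix equality checks with comparisons, so the env needs some way to evaluate both. The sketch below is one hedged option: preconditions stay a dict as in the schema, but a value may be either a literal (compared with ==) or a predicate callable. That convention, the helper names, and the use of card_available as the state key toggled by the card block are assumptions for illustration.

```python
def precondition_holds(state: dict, key: str, check) -> bool:
    """A missing key counts as unmet; callables are predicates, other values use ==."""
    if key not in state:
        return False
    value = state[key]
    return bool(check(value)) if callable(check) else value == check


def open_routes(routes: dict, state: dict, closed: set) -> list:
    """Route IDs that are not closed and whose preconditions all hold."""
    return [
        route_id
        for route_id, preconditions in routes.items()
        if route_id not in closed
        and all(precondition_holds(state, key, check) for key, check in preconditions.items())
    ]


BUDGET_MAX = 400
ROUTES = {  # preconditions transcribed from the Flight Crisis spec above
    "rebook_premium": {"card_available": True, "ticket_price": lambda price: price <= BUDGET_MAX},
    "bus_and_remote": {},
    "hotel_next_day": {"card_available": True},
    "family_loan": {"partner_flexibility": lambda flex: flex >= 0.5},
    "negotiate_deadline": {"boss_mood": lambda mood: mood != "furious"},
}

state = {"card_available": True, "ticket_price": 280,
         "partner_flexibility": 0.7, "boss_mood": "neutral"}
before_spike = open_routes(ROUTES, state, set())

state.update({"ticket_price": 450, "seats_remaining": 1})  # step 5 event
after_spike = open_routes(ROUTES, state, set())

state["card_available"] = False                            # step 12 card block
after_block = open_routes(ROUTES, state, {"rebook_premium", "hotel_next_day"})
```

With these values the always-open bus_and_remote route survives every event, which keeps the task solvable even after both card-dependent routes close.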
Domain 2: Code Merge Crisis
goal: "Merge feature branch without breaking main; deploy by Friday"
constraints: {deploy_deadline_step: 30, max_conflicts: 5}
hidden_state:
  reviewer_strictness: "medium" # revealed by inspect("check_pr_history")
  ci_flakiness_score: 0.3       # revealed by inspect("check_ci_logs")
  teammate_available: true      # revealed by inspect("ping_teammate")
mutable_world:
  conflicts_remaining: 4
  ci_passing: false
  pr_approved: false
  deploy_done: false
event_schedule:
  step 3: new commits land (conflicts_remaining += 2)
  step 7: CI fails (ci_passing: false, closes "direct_merge" route)
  step 10: reviewer blocks PR (pr_approved: false, mutates reviewer_strictness based on history)
routes:
  A: rebase (always open; risk of conflict if new commits land)
  B: cherry_pick (precond: conflicts_remaining <= 3)
  C: manual_merge (always open; slower, high reward if careful)
  D: rollback_split_pr (precond: used_rollback=False)
milestones:
  - conflicts_resolved: reward=0.15
  - ci_passing: reward=0.15
  - pr_approved: reward=0.15
  - deployed: reward=0.25
horizon: 30
Hour-by-Hour Task Board
Phase 0: Pre-hackathon (Now → Apr 25, 8 AM)
| Time | Person A (Env) | Person B (Task+Reward) | Person C (Training) |
|---|---|---|---|
| Now | Define core/task.py together; ALL THREE agree on schema | Same | Same |
| +1h | Add ToolActionType enum to action_space.py | Add TaskGenerator stub returning 1 hardcoded FlightCrisis Task | Colab smoke test: TRL+Unsloth GRPO on 5-step env. Confirm GPU, pin versions. |
| +2h | Stub WorldEngine in lifestack_env.py (inject_events returns []) | Define full FlightCrisis mutable_world and hidden_state dicts | Confirm training loop runs 100 steps with non-zero reward |
| +3h | Bump max_steps=30 in both files + openenv.yaml. Run run_episode.py. | Build all 5 Route objects for Flight Crisis | Save Colab checkpoint; verify Unsloth merge path works |
| +4h | Confirm existing tests pass with max_steps=30 | Stub Code Merge task (fields only, no events yet) | Update train_trl.py to accept Task object from env |
| +4h | Sleep | Sleep | Sleep |
Day 1: Apr 25 (8 AM – Midnight)
| Time | Person A (Env) | Person B (Task+Reward) | Person C (Training) |
|---|---|---|---|
| 8–10 AM | Full WorldEngine: inject_events fires at correct steps, mutates world/hidden dicts | Complete event_schedule for Flight Crisis (3 events) | Trajectory memory: add store_trajectory() to memory.py |
| 10 AM–1 PM | PartialObsFilter: filter() hides hidden_state fields until revealed. inspect action reveals one field per call. | Milestone reward: compute_milestone_reward() fires when condition_key/value matches. Test manually. | /task and /routes Gradio tab (task_html, route_status_html) |
| 1–3 PM | Integration test: run_episode.py on 25-step Flight Crisis. Events inject at steps 5/8/12. inspect reveals boss_mood. Milestone fires on flight_rebooked. | Integration test: reward breakdown shows milestone + completion components. Fix any component that always returns NaN or 0. | Integration test: training loop runs on new env; reward curve is not stuck at zero |
| 3–5 PM | Fix cascade runaway: add METRIC_FLOOR=10, per-step cascade cap=3 | Code Merge task: full event_schedule (steps 3/7/10) + all 4 routes | Start Colab training on FlightCrisis. Qwen2.5-1.5B. Log every 50 steps. |
| 5–7 PM | Reward hacking audit: can inspect spam score high? Can wait=30 score? Can rollback-loop? Fix each exploit. | Reward hacking audit: same. Anti-cheat: inspect blocks on repeated key, wait cap=3 consecutive | Monitor training. If reward flatlines at 0, check reward_fn in train_trl.py. |
| 7–9 PM | Smoke test: both task domains, 5 episodes each, no crashes | Smoke test: all milestones + failure conditions fire correctly | Save checkpoint. Run before/after comparison: baseline vs trained on FlightCrisis. |
| 9–11 PM | render() update: show task goal, active route, milestone log, event log | Efficiency penalty tuning: make it punish but not dominate | Push notebook to Colab. Test from cold start. |
| 11 PM | Commit stable checkpoint | Commit | Commit |
Day 2: Apr 26 (8 AM – 8 PM)
| Time | Person A (Env) | Person B (Task+Reward) | Person C (Training) |
|---|---|---|---|
| 8–10 AM | Curriculum variants: easy Flight Crisis (deadline_step=25, no card block event) | Easy/medium/hard difficulty scaling for both tasks | Longer Kaggle (P100) training run. Curriculum: easy → hard. |
| 10 AM–12 PM | Render polish: episode timeline readable by judges | Reward breakdown display in Gradio | Inference test: load merged model, run 5 episodes, compare reward vs baseline |
| 12–2 PM | HF Space setup: test Space endpoint with $200 credits | Code Merge fully working end-to-end | Demo script: baseline → reward output → trained → measurable gain |
| 2–4 PM | README architecture diagram | Reward breakdown chart (matplotlib, per episode) | Record 2-min demo |
| 4–6 PM | Final smoke test of both domains | Final reward hacking audit pass | BLOG.md update |
| 6–8 PM | Submit | Submit | Submit |
Verification Plan
- Unit test core/task.py: instantiate both Task objects, check all fields present and typed correctly
- Unit test WorldEngine: inject step 5 event on FlightCrisis, verify ticket_price updates from 280 to 450
- Unit test PartialObsFilter: hidden field not in output before inspect; in output after inspect("call_boss")
- Unit test compute_milestone_reward: set flight_rebooked=True in world, verify milestone fires with reward=0.20
- Integration test (run_episode.py): 25-step FlightCrisis episode with LifeStackAgent. Check: (a) reward > 0, (b) events fired at correct steps, (c) route closed after card_blocked event, (d) milestones logged in obs.metadata
- Reward hacking test: manually set actions to pure inspect for 25 steps and verify total_reward < 0.1. Pure wait for 25 steps: verify truncation fires and penalty is applied.
- Training test: run train_trl.py for 50 steps on Colab. Verify reward_curve shows non-flat trend.
- Backward compat test: run run_episode.py with the old conflict_generator.generate_conflict() (no Task object). Should not crash.
Critical Files
| File | Status | Owner |
|---|---|---|
| core/task.py | NEW | A+B together first |
| core/lifestack_env.py | MAJOR CHANGE | A |
| core/action_space.py | ADD ToolAction enum | B |
| core/reward.py | ADD task-level functions | B |
| core/life_state.py | ADD floor + cap | A |
| agent/conflict_generator.py | ADD TaskGenerator | B |
| agent/memory.py | ADD trajectory storage | C |
| app.py | ADD Task Explorer tab | C |
| openenv.yaml | UPDATE max_episode_steps | A |
| notebooks/LifeStack_Training.ipynb | UPDATE for new env | C |
| scripts/train_trl.py | UPDATE reward_fn + prompt | C |