# LifeStack Long-Horizon Upgrade Plan
## Context
LifeStack is a hackathon RL project that simulates life-decision tasks as a gym-style environment. Currently episodes are 5 steps long, use a single linear conflict path, have no hidden state or exogenous events, and reward only step-level metric improvements. Judges expect a proper long-horizon environment with 20+ steps, branching routes, dynamic world changes, partial observability, and task-completion rewards. This plan covers the full upgrade across pre-hackathon, Day 1, and Day 2.
**Key discoveries from reading the repo:**
- `app.py` is a **Gradio app** (not FastAPI). New "endpoints" = new Gradio tabs/functions.
- `max_steps = 5` is hardcoded in **two places**: `core/lifestack_env.py:93` AND `core/lifestack_gym_env.py:62`.
- The current reward is step-local only (no task-completion bonus exists anywhere).
- `memory.py` stores single decisions keyed by conflict title; no trajectory concept exists.
- `run_episode.py` orchestrates the loop outside the env (agent loop + env.step in separate code).
- ChromaDB is already persistent (`./lifestack_memory/`).
- `train_trl.py` already has a working GRPO loop with Unsloth; it just needs the new env interface.
- `app.py` imports `LongitudinalDemo`, which is not in the file listing (likely missing or defined in a data file).
---
## Proposed `core/task.py` Schema (SHARED CONTRACT: agree before writing any logic)
```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class HiddenStateField:
    key: str  # e.g. "boss_mood"
    initial_value: Any  # e.g. "neutral"
    inspect_target: str  # e.g. "call_boss": which inspect action type reveals this
    description: str  # shown to agent after reveal


@dataclass
class ExoEvent:
    step: int  # inject at this step (inclusive); -1 = probabilistic
    probability: float  # 1.0 = deterministic; <1.0 = random at each step
    id: str  # e.g. "ticket_price_spike"
    description: str  # what agent sees in next observation
    world_mutation: dict  # e.g. {"ticket_price": 450, "seats_remaining": 1}
    hidden_state_mutation: dict  # e.g. {"boss_mood": "angry"}
    closes_routes: list[str] = field(default_factory=list)  # route IDs this event blocks


@dataclass
class Milestone:
    id: str  # e.g. "flight_rebooked"
    description: str
    condition_key: str  # world/hidden key to check, e.g. "flight_rebooked"
    condition_value: Any  # e.g. True
    reward: float  # milestone reward added to episode total


@dataclass
class Route:
    id: str  # e.g. "rebook_premium"
    name: str
    description: str
    required_action_types: list[str]  # must use these tool actions to complete
    preconditions: dict  # world/hidden state checks, e.g. {"card_available": True}
    consequences: dict  # world mutations on route completion, e.g. {"flight_rebooked": True}
    closes_routes: list[str]  # route IDs this blocks
    milestones_unlocked: list[str]  # milestone IDs this route can hit
    final_reward: float  # bonus on route completion


@dataclass
class Task:
    id: str
    domain: str  # "flight_crisis" | "code_merge_crisis"
    goal: str
    constraints: dict  # e.g. {"budget_max": 400, "deadline_step": 18}
    hidden_state: dict  # full truth, agent never sees directly
    mutable_world: dict  # partial truth, some fields revealed by inspect
    visible_world: dict  # agent sees this at each step (subset of mutable_world)
    success_conditions: list[dict]  # e.g. [{"key": "flight_rebooked", "value": True}]
    failure_conditions: list[dict]  # e.g. [{"key": "missed_deadline", "value": True}]
    event_schedule: list[ExoEvent]
    viable_routes: list[Route]
    milestones: list[Milestone]
    horizon: int  # max steps (20-50)
    difficulty: int  # 1-5
    domain_metadata: dict  # domain-specific extra data (story text, etc.)
```
**Agreement required:** All three team members must freeze this schema before writing any logic.
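
The `success_conditions` and `failure_conditions` lists use a `{"key": ..., "value": ...}` shape, so one possible way to evaluate them is a pair of small helpers like the sketch below. The function names `all_conditions_met` and `any_condition_met` are hypothetical, and the choice of "all" for success and "any" for failure is an assumption the plan does not state.

```python
from typing import Any


def all_conditions_met(conditions: list[dict], world: dict[str, Any]) -> bool:
    """True when every {"key", "value"} condition matches the world dict."""
    return all(world.get(c["key"]) == c["value"] for c in conditions)


def any_condition_met(conditions: list[dict], world: dict[str, Any]) -> bool:
    """True when at least one {"key", "value"} condition matches the world dict."""
    return any(world.get(c["key"]) == c["value"] for c in conditions)


# Hypothetical Flight Crisis state part-way through an episode.
world = {"flight_rebooked": True, "report_submitted": False, "missed_deadline": False}
success = [
    {"key": "flight_rebooked", "value": True},
    {"key": "report_submitted", "value": True},
]
failure = [{"key": "missed_deadline", "value": True}]

episode_succeeded = all_conditions_met(success, world)
episode_failed = any_condition_met(failure, world)
```

Keeping these helpers free of any env imports would let both `env.step()` and the unit tests share them.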
---
## Risk Register
| Risk | Severity | Mitigation |
|------|----------|------------|
| **Cascade runaway over 30 steps**: `DependencyGraph` with 0.6 dampening can collapse metrics to 0 after repeated disruptions | HIGH | Add `METRIC_FLOOR = 10.0` in `life_state.py`; cascade clamps to `max(METRIC_FLOOR, result)`, not `max(0, result)`. Also add a per-step cascade cap: max 3 metrics affected per step. |
| **Resource exhaustion on longer episodes**: default 20h/500$/100e depletes in ~5 steps of aggressive action | HIGH | Scale budgets proportionally in `reset()`: `time=20*max_steps/5`, etc. Make configurable per-Task via `constraints`. |
| **Reward hacking: inspect spam**: agent learns to `inspect` repeatedly for reward | HIGH | Anti-cheat: the same hidden_state key cannot be inspected twice. Inspect has no intrinsic reward. |
| **Reward hacking: wait loops**: agent waits forever | MEDIUM | Cap: max 3 consecutive `wait` actions; a 4th `wait` triggers a forced `escalate`. |
| **Reward hacking: rollback loops**: rollback-execute-rollback cycle | MEDIUM | Rollback is only available once per route; marks action as `used_rollback=True` in state. |
| **Colab T4 session timeout**: free Colab sessions time out at ~12h | MEDIUM | Save a checkpoint every 50 steps in `train_trl.py` (for example via `save_steps=50` in the trainer config) instead of relying only on `save_pretrained_merged()` at the end. |
| **ChromaDB trajectory bloat**: 30 steps × 23 metrics = ~700 floats per trajectory; 100 trajectories = 70k floats | LOW | Store a trajectory summary (start/end state diff + route taken + total reward), not the full step-by-step record. |
| **OpenEnv API version**: `openenv-core>=0.2.3` in requirements; `_EnvBase`, `Action`, `Observation`, `State`, `Rubric` are OpenEnv abstractions. Need to confirm the `create_app()` signature matches. | MEDIUM | Do not change `LifeStackAction`/`LifeStackObservation`/`LifeStackState` class names or fields. Add new fields as `Optional` to maintain backward compat. |
| **Two hardcoded `max_steps=5`**: will break if only one is updated | HIGH | Fix both in Phase 0. Make `max_steps` a constructor param defaulting to `task.horizon` or 30. |
| **`app.py` imports `LongitudinalDemo`**: not in file listing; may be a missing class | MEDIUM | Check if it's defined inline or in a missing file. If missing, stub it for Day 1. |
| **`run_episode.py` duplicates env loop**: agent loop lives outside env. New long-horizon logic must work in both `env.step()` and the external runner | MEDIUM | Keep `run_episode.py` working; it calls `env.step()`, which now handles world mutation/events internally. |
| **TRL GRPO reward function parses prompt**: `lifestack_reward_fn` in `train_trl.py` reconstructs state from prompt text | MEDIUM | After the env upgrade, update `build_prompt_for_conflict()` to include Task fields and update the reward function accordingly. |
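
The resource-exhaustion mitigation (`time=20*max_steps/5`, etc.) could be sketched as below. The function name `scaled_budgets` and the dictionary keys are hypothetical, and reading "100e" as an energy budget is an assumption; only the baseline numbers and the linear scaling rule come from the table.

```python
def scaled_budgets(max_steps: int, base_steps: int = 5) -> dict[str, float]:
    """Scale the 5-step default budgets linearly with episode length."""
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    # Baseline budgets for a 5-step episode (key names are illustrative).
    base = {"time_hours": 20.0, "money": 500.0, "energy": 100.0}
    factor = max_steps / base_steps
    return {name: value * factor for name, value in base.items()}


# A 30-step episode gets six times the 5-step budget.
budgets_30 = scaled_budgets(30)
```

A per-Task override from `constraints` could then replace individual entries of the returned dict.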
---
## File-by-File Change Plan
### NEW: `core/task.py`
- All dataclasses from schema above
- `FlightCrisisTask()` factory function returning a hardcoded Task instance (used for testing)
- `CodeMergeCrisisTask()` factory (stubbed Day 1, complete Day 2)
- No imports from other project files (pure data)
### MODIFIED: `core/lifestack_env.py`
**Existing:** `max_steps=5`, flat step logic, no hidden state, no events
**Changes:**
- Add `WorldEngine` inner class:
  - `__init__(task: Task)`: stores event schedule
  - `inject_events(step: int, world: dict, hidden: dict) -> list[ExoEvent]`: returns events fired this step, mutates world/hidden in-place
  - `get_closed_routes() -> set[str]`: routes blocked by events
- Add `PartialObsFilter`:
  - `filter(world: dict, revealed_keys: set[str]) -> dict`: returns only visible_world + revealed fields
- Change `__init__` signature: `__init__(task: Task = None, max_steps: int = 30)`
- In `reset()`: initialize `world_state`, `hidden_state`, `revealed_hidden_keys`, `current_task`, `active_route`, `milestones_achieved`, `used_rollback`
- In `step()`:
  1. Run `world_engine.inject_events(step, world, hidden)` to get fired events
  2. Apply ToolAction logic (inspect/plan/execute/wait/rollback/escalate)
  3. Check route preconditions; mark routes closed if violated
  4. Compute reward via updated `compute_reward()`
  5. Check success/failure conditions from task
  6. Build observation with `partial_obs_filter`
- Add `render()` update: show task goal, active route, milestones achieved, events log
- **Preserve:** `LifeStackAction`, `LifeStackObservation`, `LifeStackState` class names and core fields (add Optional new fields)
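
A minimal sketch of the `WorldEngine` described above, under a few assumptions: it takes the event list directly instead of a full `Task`, it uses a trimmed copy of `ExoEvent`, each event fires at most once, and a seeded `random.Random` drives the probabilistic events. None of these choices are fixed by the plan.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ExoEvent:
    """Trimmed copy of the schema: only the fields this sketch uses."""
    step: int                 # -1 means "may fire at any step"
    probability: float        # 1.0 = deterministic
    id: str
    world_mutation: dict = field(default_factory=dict)
    hidden_state_mutation: dict = field(default_factory=dict)
    closes_routes: list[str] = field(default_factory=list)


class WorldEngine:
    def __init__(self, events: list[ExoEvent], seed: int = 0):
        self.events = events
        self.fired_ids: set[str] = set()
        self.closed_routes: set[str] = set()
        self.rng = random.Random(seed)

    def inject_events(self, step: int, world: dict, hidden: dict) -> list[ExoEvent]:
        """Fire due events, mutating world/hidden in place; return what fired."""
        fired_now = []
        for event in self.events:
            if event.id in self.fired_ids:
                continue  # assumption: each event fires at most once
            due = event.step == step or event.step == -1
            # rng.random() is in [0, 1), so probability 1.0 always fires.
            if due and self.rng.random() < event.probability:
                world.update(event.world_mutation)
                hidden.update(event.hidden_state_mutation)
                self.closed_routes.update(event.closes_routes)
                self.fired_ids.add(event.id)
                fired_now.append(event)
        return fired_now

    def get_closed_routes(self) -> set[str]:
        return set(self.closed_routes)
```

Calling `inject_events` first in `step()` means the agent's action in that step is already applied to the post-event world, which matches the ordering in the list above.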
### MODIFIED: `core/action_space.py`
**Add** `ToolActionType` enum:
```python
from enum import Enum


class ToolActionType(str, Enum):
    INSPECT = "inspect"
    PLAN = "plan"
    EXECUTE = "execute"
    COMMUNICATE = "communicate"
    WAIT = "wait"
    ROLLBACK = "rollback"
    ESCALATE = "escalate"
```
**Add** `ToolAction` dataclass:
```python
from dataclasses import dataclass


@dataclass
class ToolAction:
    action_type: ToolActionType
    target: str  # inspect target, execute target, communicate recipient, etc.
    parameters: dict  # action-specific params
    reasoning: str
```
**Add** `validate_tool_action(action: ToolAction, env_state: dict) -> tuple[bool, str]`
- Checks: inspect not repeated for same key, wait count ≤ 3, rollback only if not used
**Keep:** `AgentAction`, `PrimaryAction`, `CommunicationAction`, `EXAMPLE_ACTIONS` unchanged
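
The three anti-cheat checks could look roughly like the sketch below. The `env_state` keys (`inspected_targets`, `consecutive_waits`, `used_rollback`) are assumptions about how the env would expose its counters, and the sketch tracks repeated inspects by target instead of by hidden key. Turning a rejected 4th `wait` into a forced `escalate` is left to the env.

```python
from dataclasses import dataclass, field
from enum import Enum


class ToolActionType(str, Enum):
    INSPECT = "inspect"
    PLAN = "plan"
    EXECUTE = "execute"
    COMMUNICATE = "communicate"
    WAIT = "wait"
    ROLLBACK = "rollback"
    ESCALATE = "escalate"


@dataclass
class ToolAction:
    action_type: ToolActionType
    target: str
    parameters: dict = field(default_factory=dict)
    reasoning: str = ""


MAX_CONSECUTIVE_WAITS = 3


def validate_tool_action(action: ToolAction, env_state: dict) -> tuple[bool, str]:
    """Return (is_valid, reason) for the anti-cheat rules in the plan."""
    if action.action_type is ToolActionType.INSPECT:
        if action.target in env_state.get("inspected_targets", set()):
            return False, f"'{action.target}' was already inspected"
    elif action.action_type is ToolActionType.WAIT:
        if env_state.get("consecutive_waits", 0) >= MAX_CONSECUTIVE_WAITS:
            return False, "wait cap reached"
    elif action.action_type is ToolActionType.ROLLBACK:
        if env_state.get("used_rollback", False):
            return False, "rollback already used"
    return True, "ok"
```

Returning a reason string alongside the boolean makes it easy to surface the rejection in the next observation.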
### MODIFIED: `core/reward.py`
**Add** functions (do NOT remove `compute_reward`):
```python
def compute_milestone_reward(milestones_achieved: list[str], task: Task) -> float: ...
def compute_task_completion_reward(success_conditions_met: list[bool], task: Task) -> float: ...
def compute_replan_bonus(exo_events_seen: int, milestones_after_event: int) -> float: ...
def compute_dead_end_penalty(routes_remaining: int) -> float: ...
```
**Add** `compute_task_reward(...)`, which orchestrates all components:
- 10% local metric delta (old `compute_reward`)
- 40% milestone rewards
- 30% task completion
- 10% replan bonus
- 10% efficiency
- Penalties: dead end (-0.5), rollback used (-0.1), cascade collapse (-0.3)
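
The weighting above can be illustrated with the sketch below. It assumes each component has already been normalized to the range 0 to 1 and that penalties are subtracted as flat amounts after weighting; the plan does not pin down either detail, and the argument shapes are hypothetical.

```python
WEIGHTS = {
    "local_delta": 0.10,   # old compute_reward output
    "milestones": 0.40,
    "completion": 0.30,
    "replan": 0.10,
    "efficiency": 0.10,
}
PENALTIES = {
    "dead_end": 0.5,
    "rollback_used": 0.1,
    "cascade_collapse": 0.3,
}


def compute_task_reward(components: dict[str, float],
                        flags: dict[str, bool]) -> tuple[float, dict[str, float]]:
    """Combine normalized components and penalty flags into one scalar.

    Returns the total and a per-term breakdown for logging and display.
    """
    breakdown = {name: weight * components.get(name, 0.0)
                 for name, weight in WEIGHTS.items()}
    for name, size in PENALTIES.items():
        if flags.get(name, False):
            breakdown[f"penalty_{name}"] = -size
    return sum(breakdown.values()), breakdown


total, parts = compute_task_reward(
    {"local_delta": 0.5, "milestones": 0.5, "completion": 1.0,
     "replan": 0.0, "efficiency": 1.0},
    {"rollback_used": True},
)
```

Returning the breakdown alongside the total keeps the per-component values available for the reward breakdown display planned for Day 2.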
### MODIFIED: `core/life_state.py`
- Add `METRIC_FLOOR = 10.0` constant
- In `DependencyGraph.cascade()`: change `max(0, ...)` to `max(METRIC_FLOOR, ...)` for cascade-induced changes (not direct actions)
- Add `per_step_cascade_cap = 3`: BFS stops after affecting 3 nodes per step call
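
A sketch of how the floor and the cap might interact in a breadth-first cascade. This is not the existing `DependencyGraph.cascade()`; the graph shape (a dict of metric to dependents), the standalone function, and the metric names are assumptions for illustration, with the 0.6 dampening taken from the risk register.

```python
from collections import deque

METRIC_FLOOR = 10.0
PER_STEP_CASCADE_CAP = 3
DAMPENING = 0.6


def cascade(metrics: dict[str, float], edges: dict[str, list[str]],
            source: str, delta: float) -> dict[str, float]:
    """Propagate a change from `source` to its dependents, breadth-first.

    Only cascade-induced changes are applied here; the direct change to
    `source` itself is assumed to be handled by the caller.
    """
    updated = dict(metrics)
    queue = deque([(source, delta)])
    visited = {source}
    affected = 0
    while queue and affected < PER_STEP_CASCADE_CAP:
        node, incoming = queue.popleft()
        for child in edges.get(node, []):
            if child in visited or affected >= PER_STEP_CASCADE_CAP:
                continue
            visited.add(child)
            change = incoming * DAMPENING
            # Clamp to the floor so a cascade cannot drive a metric to 0.
            updated[child] = max(METRIC_FLOOR, updated[child] + change)
            affected += 1
            queue.append((child, change))
    return updated


metrics = {"sleep": 50.0, "focus": 15.0, "mood": 40.0, "work": 70.0, "health": 60.0}
edges = {"sleep": ["focus", "mood"], "focus": ["work"], "mood": ["health"]}
after = cascade(metrics, edges, "sleep", -20.0)
```

In this example `focus` is clamped at the floor and `health` is never reached because the cap of three affected metrics is hit first.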
### MODIFIED: `agent/conflict_generator.py`
**Add** `TaskGenerator` class:
```python
class TaskGenerator:
    def generate(self, domain: str = None, difficulty: int = None) -> Task: ...
    def generate_flight_crisis(self, difficulty: int) -> Task: ...
    def generate_code_merge_crisis(self, difficulty: int) -> Task: ...
```
**Keep:** `ConflictEvent`, `TEMPLATES`, `generate_conflict()`, `escalate_conflict()` fully intact
### MODIFIED: `agent/memory.py`
**Add** to `store_decision()`: optional `trajectory: list[dict] = None` and `route_outcome: str = None` params
**Add** `store_trajectory(task_id, route_taken, total_reward, trajectory_summary)` method:
  - `trajectory_summary` = `{start_state_diff, end_state_diff, milestones_hit, events_seen, route_id, total_reward}`
  - Store in separate ChromaDB collection `'trajectories'`
**Add** `retrieve_similar_trajectories(task_domain, current_world) -> list[dict]`
**Keep:** all existing methods unchanged
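
A possible shape for building the compact summary before it is stored. The per-step record format is an assumption, and because the plan does not define `start_state_diff` and `end_state_diff` separately, this sketch collapses them into a single `state_diff` between the first and last step. The helper name `summarize_trajectory` is hypothetical.

```python
def summarize_trajectory(route_id: str, steps: list[dict]) -> dict:
    """Collapse a full episode into a compact summary.

    Each step is assumed to look like:
    {"metrics": {...}, "reward": float, "milestones": [...], "events": [...]}
    """
    if not steps:
        raise ValueError("cannot summarize an empty trajectory")
    first, last = steps[0]["metrics"], steps[-1]["metrics"]
    # Keep only the metrics that actually changed over the episode.
    state_diff = {key: last[key] - first[key]
                  for key in first
                  if key in last and last[key] != first[key]}
    milestones: list[str] = []
    events: list[str] = []
    for step in steps:
        for milestone in step.get("milestones", []):
            if milestone not in milestones:
                milestones.append(milestone)
        for event in step.get("events", []):
            if event not in events:
                events.append(event)
    return {
        "route_id": route_id,
        "state_diff": state_diff,
        "milestones_hit": milestones,
        "events_seen": events,
        "total_reward": sum(step.get("reward", 0.0) for step in steps),
    }
```

Since ChromaDB metadata values are limited to scalar types, a nested summary like this would likely need to be serialized (for example with `json.dumps`) before it goes into the `'trajectories'` collection.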
### MODIFIED: `app.py` (Gradio)
**Add** Tab 5: "Task Explorer":
  - Shows current Task object (goal, constraints, visible routes, milestones)
  - Shows event log for current episode
  - Shows route lock status

**Add** helper functions:
  - `task_html(task: Task) -> str`: renders goal, routes, milestones
  - `event_log_html(events: list[ExoEvent]) -> str`
  - `route_status_html(routes: list[Route], closed: set[str]) -> str`
**Keep:** All existing tabs and functions unchanged.
### MODIFIED: `openenv.yaml`
```yaml
metadata:
  max_episode_steps: 50
  task_domains: [flight_crisis, code_merge_crisis]
  # existing fields unchanged
```
### MODIFIED: `notebooks/LifeStack_Training.ipynb`
- Update env init cell to use `Task` objects
- Add Colab-ready GRPO cell with pinned versions:
  - `unsloth==2024.12.4`, `trl>=0.9`, `transformers>=4.45`
  - Model: `Qwen2.5-1.5B-Instruct` (fits T4 with 4-bit)
- Add reward breakdown visualization cell
- Checkpoint every 50 steps cell
---
## Task Domain Specs
### Domain 1: Flight Crisis
```
goal: "Catch the rescheduled flight and submit expense report by Sunday"
constraints: {budget_max: 400, deadline_step: 18, report_deadline_step: 22}
hidden_state:
  boss_mood: "neutral"        # revealed by inspect("call_boss")
  card_limit: 350             # revealed by inspect("check_card")
  partner_flexibility: 0.7    # revealed by inspect("text_partner")
mutable_world:
  ticket_price: 280           # changes at step 5 (spike to 450)
  seats_remaining: 3          # decreases each step probabilistically
  flight_rebooked: false
  report_submitted: false
event_schedule:
  step 5: {ticket_price: 450, seats_remaining: 1} (closes route "rebook_premium" if budget_max=400)
  step 8: {boss_mood: "annoyed"} (hidden_state mutation via msg)
  step 12: {card_blocked: true} (closes routes "rebook_premium", "hotel_next_day")
routes:
  A: rebook_premium (precond: card_available=True, budget>=ticket_price)
  B: bus_and_remote (always open; slower, lower reward)
  C: hotel_next_day (precond: card_available=True; closed at step 12)
  D: family_loan (precond: partner_flexibility>=0.5; revealed after inspect)
  E: negotiate_deadline (precond: boss_mood != "furious"; closed if boss_mood="furious")
milestones:
  - inspect_boss: reward=0.05 (inspected boss_mood)
  - flight_rebooked: reward=0.20
  - report_submitted: reward=0.15
  - under_budget: reward=0.10 (total spend < budget_max)
horizon: 25
```
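
Several Flight Crisis preconditions are comparisons (`>=`, `!=`) and not plain equality, which the `preconditions: dict` field in the schema does not obviously cover. One hedged option is to store each precondition as an `(operator, value)` pair, as sketched below; this encoding and the `route_open` helper are assumptions, not part of the agreed schema.

```python
import operator
from typing import Any

# Supported comparison operators for route preconditions.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">=": operator.ge,
    "<=": operator.le,
}


def route_open(preconditions: dict[str, tuple[str, Any]], state: dict[str, Any]) -> bool:
    """True when every precondition holds against the merged world/hidden state.

    A key missing from the state counts as a failed precondition.
    """
    for key, (op_name, expected) in preconditions.items():
        if key not in state:
            return False
        if not OPS[op_name](state[key], expected):
            return False
    return True


# Routes D and E from the spec above, in the assumed encoding.
family_loan = {"partner_flexibility": (">=", 0.5)}
negotiate_deadline = {"boss_mood": ("!=", "furious")}
state = {"partner_flexibility": 0.7, "boss_mood": "annoyed"}

family_loan_open = route_open(family_loan, state)
negotiate_open = route_open(negotiate_deadline, state)
```

An empty precondition dict evaluates to open, which fits the "always open" route B.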
### Domain 2: Code Merge Crisis
```
goal: "Merge feature branch without breaking main; deploy by Friday"
constraints: {deploy_deadline_step: 30, max_conflicts: 5}
hidden_state:
  reviewer_strictness: "medium"   # revealed by inspect("check_pr_history")
  ci_flakiness_score: 0.3         # revealed by inspect("check_ci_logs")
  teammate_available: true        # revealed by inspect("ping_teammate")
mutable_world:
  conflicts_remaining: 4
  ci_passing: false
  pr_approved: false
  deploy_done: false
event_schedule:
  step 3: new commits land (conflicts_remaining += 2)
  step 7: CI fails (ci_passing: false, closes "direct_merge" route)
  step 10: reviewer blocks PR (pr_approved: false, mutates reviewer_strictness based on history)
routes:
  A: rebase (always open; risk of conflict if new commits land)
  B: cherry_pick (precond: conflicts_remaining <= 3)
  C: manual_merge (always open; slower, high reward if careful)
  D: rollback_split_pr (precond: used_rollback=False)
milestones:
  - conflicts_resolved: reward=0.15
  - ci_passing: reward=0.15
  - pr_approved: reward=0.15
  - deployed: reward=0.25
horizon: 30
```
---
## Hour-by-Hour Task Board
### Phase 0: Pre-hackathon (Now to Apr 25 8 AM)
| Time | Person A (Env) | Person B (Task+Reward) | Person C (Training) |
|------|----------------|------------------------|---------------------|
| Now | Define `core/task.py` together; ALL THREE agree on schema | Same | Same |
| +1h | Add `ToolActionType` enum to `action_space.py` | Add `TaskGenerator` stub returning 1 hardcoded FlightCrisis Task | Colab smoke test: TRL+Unsloth GRPO on 5-step env. Confirm GPU, pin versions. |
| +2h | Stub `WorldEngine` in `lifestack_env.py` (inject_events returns []) | Define full FlightCrisis `mutable_world` and `hidden_state` dicts | Confirm training loop runs 100 steps with non-zero reward |
| +3h | Bump `max_steps=30` in both files + openenv.yaml. Run `run_episode.py`. | Build all 5 Route objects for Flight Crisis | Save Colab checkpoint; verify Unsloth merge path works |
| +4h | Confirm existing tests pass with max_steps=30 | Stub Code Merge task (fields only, no events yet) | Update `train_trl.py` to accept Task object from env |
| +5h | Sleep | Sleep | Sleep |
### Day 1: Apr 25 (8 AM to Midnight)
| Time | Person A (Env) | Person B (Task+Reward) | Person C (Training) |
|------|----------------|------------------------|---------------------|
| 8–10 AM | Full WorldEngine: inject_events fires at correct steps, mutates world/hidden dicts | Complete event_schedule for Flight Crisis (3 events) | Trajectory memory: add store_trajectory() to memory.py |
| 10 AM–1 PM | PartialObsFilter: filter() hides hidden_state fields until revealed. inspect action reveals one field per call. | Milestone reward: compute_milestone_reward() fires when condition_key/value matches. Test manually. | /task and /routes Gradio tab (task_html, route_status_html) |
| 1–3 PM | **Integration test**: run_episode.py on 25-step Flight Crisis. Events inject at steps 5/8/12. inspect reveals boss_mood. Milestone fires on flight_rebooked. | **Integration test**: reward breakdown shows milestone + completion components. Fix any component that always returns NaN or 0. | **Integration test**: training loop runs on new env; reward curve is not flat at zero |
| 3–5 PM | Fix cascade runaway: add METRIC_FLOOR=10, per-step cascade cap=3 | Code Merge task: full event_schedule (steps 3/7/10) + all 4 routes | Start Colab training on FlightCrisis. Qwen2.5-1.5B. Log every 50 steps. |
| 5–7 PM | Reward hacking audit: can inspect spam score high? Can wait=30 score? Can rollback-loop? Fix each exploit. | Reward hacking audit: same. Anti-cheat: inspect blocks on repeated key, wait cap=3 consecutive | Monitor training. If reward flatlines at 0, check reward_fn in train_trl.py. |
| 7–9 PM | Smoke test: both task domains, 5 episodes each, no crashes | Smoke test: all milestones + failure conditions fire correctly | Save checkpoint. Run before/after comparison: baseline vs trained on FlightCrisis. |
| 9–11 PM | render() update: show task goal, active route, milestone log, event log | Efficiency penalty tuning: make it punish but not dominate | Push notebook to Colab. Test from cold start. |
| 11 PM | Commit stable checkpoint | Commit | Commit |
### Day 2: Apr 26 (8 AM to 8 PM)
| Time | Person A (Env) | Person B (Task+Reward) | Person C (Training) |
|------|----------------|------------------------|---------------------|
| 8–10 AM | Curriculum variants: easy Flight Crisis (deadline_step=25, no card block event) | Easy/medium/hard difficulty scaling for both tasks | Longer Kaggle (P100) training run. Curriculum: easy → hard. |
| 10 AM–12 PM | Render polish: episode timeline readable by judges | Reward breakdown display in Gradio | Inference test: load merged model, run 5 episodes, compare reward vs baseline |
| 12–2 PM | HF Space setup: test Space endpoint with $200 credits | Code Merge fully working end-to-end | Demo script: baseline → reward output → trained → measurable gain |
| 2–4 PM | README architecture diagram | Reward breakdown chart (matplotlib, per episode) | Record 2-min demo |
| 4–6 PM | Final smoke test of both domains | Final reward hacking audit pass | BLOG.md update |
| 6–8 PM | Submit | Submit | Submit |
---
## Verification Plan
1. **Unit test `core/task.py`**: instantiate both Task objects, check all fields present and typed correctly
2. **Unit test `WorldEngine`**: inject step 5 event on FlightCrisis, verify `ticket_price` updates from 280 to 450
3. **Unit test `PartialObsFilter`**: hidden field not in output before inspect; in output after inspect("call_boss")
4. **Unit test `compute_milestone_reward`**: set `flight_rebooked=True` in world, verify milestone fires with reward=0.20
5. **Integration test (run_episode.py)**: 25-step FlightCrisis episode with LifeStackAgent. Check: (a) reward > 0, (b) events fired at correct steps, (c) route closed after card_blocked event, (d) milestones logged in obs.metadata
6. **Reward hacking test**: manually set actions to pure inspect for 25 steps → verify total_reward < 0.1. Pure wait for 25 steps → verify truncation fires and penalty applied.
7. **Training test**: run `train_trl.py` for 50 steps on Colab. Verify reward_curve shows non-flat trend.
8. **Backward compat test**: run `run_episode.py` with the old `conflict_generator.generate_conflict()` (no Task object). Should not crash.
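
Verification item 3 could be exercised with something like the sketch below. It is a standalone stand-in for `PartialObsFilter`, not the planned class: unlike the planned `filter(world, revealed_keys)` signature, it takes the hidden state as an explicit argument, and the `INSPECT_REVEALS` mapping and both function names are hypothetical. The inspect targets and hidden keys come from the Flight Crisis spec.

```python
# Which hidden key each inspect target reveals (from the Flight Crisis spec).
INSPECT_REVEALS = {
    "call_boss": "boss_mood",
    "check_card": "card_limit",
    "text_partner": "partner_flexibility",
}


def filter_observation(visible_world: dict, hidden_state: dict,
                       revealed_keys: set[str]) -> dict:
    """Return the visible world plus any hidden fields revealed so far."""
    observation = dict(visible_world)
    for key in revealed_keys:
        if key in hidden_state:
            observation[key] = hidden_state[key]
    return observation


def apply_inspect(target: str, revealed_keys: set[str]) -> set[str]:
    """Return a new revealed set after inspecting `target` (unknown targets reveal nothing)."""
    key = INSPECT_REVEALS.get(target)
    if key is None:
        return set(revealed_keys)
    return set(revealed_keys) | {key}


def test_hidden_field_revealed_only_after_inspect():
    visible = {"ticket_price": 280, "seats_remaining": 3}
    hidden = {"boss_mood": "neutral", "card_limit": 350}
    revealed: set[str] = set()

    before = filter_observation(visible, hidden, revealed)
    assert "boss_mood" not in before

    revealed = apply_inspect("call_boss", revealed)
    after = filter_observation(visible, hidden, revealed)
    assert after["boss_mood"] == "neutral"
    assert "card_limit" not in after


test_hidden_field_revealed_only_after_inspect()
```

The same pattern extends to the other verification items: build a tiny world by hand, apply one action or event, and assert on the resulting observation.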
---
## Critical Files
| File | Status | Owner |
|------|--------|-------|
| `core/task.py` | NEW | A+B together first |
| `core/lifestack_env.py` | MAJOR CHANGE | A |
| `core/action_space.py` | ADD `ToolActionType` enum + `ToolAction` dataclass | B |
| `core/reward.py` | ADD task-level functions | B |
| `core/life_state.py` | ADD floor + cap | A |
| `agent/conflict_generator.py` | ADD TaskGenerator | B |
| `agent/memory.py` | ADD trajectory storage | C |
| `app.py` | ADD Task Explorer tab | C |
| `openenv.yaml` | UPDATE max_episode_steps | A |
| `notebooks/LifeStack_Training.ipynb` | UPDATE for new env | C |
| `scripts/train_trl.py` | UPDATE reward_fn + prompt | C |