# LifeStack Hackathon Sprint — Implementation Plan

## Context

**Submission deadline:** 26 Apr 5 PM. Offline from 25 Apr 8 AM. ~30 hours of offline build time.

The LifeStack Flask demo (`app_flask.py` + `templates/index.html`) already ships 10 API endpoints, a 6-tab UI, and a working agent/memory/cascade/reward pipeline. This sprint adds **13 additive features** (demo panels, APIs, RLHF loop, multi-step training, real-data connectors, tests, blog) without breaking existing endpoints. All work is additive.

Budget: **$90 HF credits** — T4 Small for the always-on demo Space, A10G for GRPO training runs, HF Inference API for the NLP panel. Target trained checkpoint: **`jdsb06/lifestack-grpo-v2`** (user will push).

Key reusable primitives already in repo (do not rebuild):
- `core/cascade_utils.py:5 animate_cascade()` — returns list of 4 frames with `flat` + `status` dicts
- `agent/counterfactuals.py:10 generate_counterfactuals()` — returns list of alternatives
- `agent/memory.py:74 LifeStackMemory.store_trajectory()` and `:128 store_feedback(OutcomeFeedback)`
- `core/feedback.py OutcomeFeedback` + `compute_human_feedback_reward()`
- `core/life_state.py:61 LifeMetrics.flatten()` — 23 metric paths
- `agent/conflict_generator.py TEMPLATES` (13 scenarios) + `generate_conflict()`
- `core/metric_schema.py VALID_METRIC_PATHS`

Already wired in `app_flask.py`: `/api/feedback/submit` (Feature 9 backend is done — scope of F9 reduces to frontend panel + training integration); `/api/simulation/cascade` (kept intact, new `/api/cascade/frames` added alongside).

---

## Implementation Order (Offline Sprint)

1. F1 Trained-vs-Baseline comparison (impact demo)
2. F5 Domain risk heatmap (sidebar, always visible)
3. F3 "Try Your Own" NLP + HF Inference fallback
4. F2 D3 cascade visualisation
5. F4 Personality comparison with OCEAN radar
6. F6 Counterfactual explorer panel
7. F8 Multi-step GRPO training loop + `push_to_hub`
8. F9 RLHF feedback panel + training integration
9. F7 Cold-vs-warm memory ablation demo
10. F10 Health + calendar uploads
11. F11 BLOG.md (~700 words)
12. F12 Four tests
13. F13 Episode history/replay

Before starting, run smoke tests (`scripts/smoke_test.py`, `scripts/eval.py --episodes 5`, cascade/counterfactual imports). Fix before adding features.

---

## Cross-Cutting Changes

### `requirements.txt` — add
- `huggingface_hub` (for F3 InferenceClient and F8 push_to_hub)
- `icalendar` (F10 calendar upload)

### `intake/intake.py` — LLM fallback chain (F3 dependency)
Refactor `_call_llm()` (~line 44) to cascade: **HF Inference API (`HF_TOKEN`) → Groq (`GROQ_API_KEY`) → empty-string fallback** (existing behaviour). `LifeIntake.__init__` constructs both an `InferenceClient(model="Qwen/Qwen2.5-1.5B-Instruct", token=HF_TOKEN)` when `HF_TOKEN` is present and the existing Groq `OpenAI` client when `GROQ_API_KEY` is present. `extract_conflict()` already returns an empty `ConflictEvent` when the LLM returns empty — keyword fallback below strengthens that path.

**Keyword fallback:** add `_match_template_by_keywords(text: str) -> ConflictEvent | None` that scans `TEMPLATES` for overlap with user text and returns the best match. Called inside `extract_conflict()` when both LLM clients fail.

### `app_flask.py` — shared helpers (used by F1, F4, F5, F7)
- `_run_episode(person, conflict, steps, seed, agent_fn) -> list[step_dict]`: initialises a fresh `LifeStackEnv`, applies the conflict disruption, loops `steps` iterations calling `agent_fn(metrics, budget, conflict, person)` to pick an action, runs `env.step()`, and collects `{step, action_type, target, reward, metrics, cost}`. `agent_fn` is injected so F1 can pass a random-action picker and a `LifeStackAgent.get_action`-wrapped version.
- `_random_action(metrics, budget, conflict, person) -> AgentAction`: samples uniformly from `core.action_space.EXAMPLE_ACTIONS` (line 98–196) and jitters `metric_changes` slightly so the baseline isn't deterministic. Same return shape as `AGENT.get_action()`.
- `compute_domain_health(flat_metrics: dict) -> dict[str, float]`: averages sub-metrics per domain, inverts `INVERTED_METRICS` (line 67, already defined), returns `{career, finances, relationships, physical_health, mental_wellbeing, time}` each in [0,1].

### `templates/index.html` — UI integration pattern
Every new feature adds one new tab button in the nav bar (line 37–44) and one content `<div id="content-X">` in the main section (line 46–202). Reuse existing classes: `.glass`, `.tab-active`, `.metric-bar`, Tailwind (`.rounded-2xl`, `.p-6`, `.space-y-6`, `.grid grid-cols-2 gap-6`, `.text-slate-400`, `.bg-indigo-500/10`). Chart.js is already loaded via CDN (line 8); D3 v7 to be added.

---

## Feature-by-Feature

### F1 — Trained vs Baseline Comparison
**Backend — `app_flask.py`:**
- `POST /api/comparison/run` → body `{conflict, person, steps=5, seed=42}`.
  - Resolve `conflict` via `CONFLICT_CHOICES`, `person` via `PERSONS`.
  - Call `_run_episode(..., agent_fn=_random_action)` → `baseline`.
  - Call `_run_episode(..., agent_fn=lambda m,b,c,p: AGENT.get_action(m,b,c,p))` with identical seed → `trained`.
  - Compute `reward_delta = sum(trained_rewards) - sum(baseline_rewards)`.
  - Return `{baseline: [...], trained: [...], reward_delta}`.

**Frontend:**
- New tab "Comparison". Two side-by-side `.glass` cards titled "Baseline (Random)" and "GRPO-Trained". For each step, render action-type badge + reward bar. Delta banner at the bottom (`bg-indigo-500/10`) showing `+X.XX`.

### F2 — Live Cascade Visualisation (D3)
**Backend:**
- `POST /api/cascade/frames` → body `{primary_disruption: {metric_path: delta}}`. Calls `animate_cascade(primary_disruption, LifeMetrics())` and returns `{frames}`. Keeps existing `/api/simulation/cascade` untouched.

**Frontend:**
- Add D3 v7 CDN line in `<head>`.
- New section inside the "Situational Portal" tab (below the existing cascade timeline at line ~70): `<svg id="cascade-graph" width="720" height="420">`.
- JS module `renderCascade(frames)`: creates 23 nodes from `VALID_METRIC_PATHS`, clusters by domain (6 cluster centres at: career TL, finances TR, relationships ML, physical_health MR, mental_wellbeing BC, time TC), draws edges from a hardcoded copy of the 20+ edges in `DependencyGraph.edges`. Iterates frames with 600ms `setTimeout`, recolouring nodes based on `frames[i].status[metric]`: `unchanged→#334155`, `primary→#ef4444`, `first→#f97316`, `second→#facc15`.
- Called from the existing simulation-action flow after each `/api/simulation/action` response.

### F3 — "Try Your Own Situation" NLP Panel
**Backend:**
- `/api/custom/run` already exists (line 162) and is fully wired. No route changes.
- `intake/intake.py` cross-cutting change above adds HF→Groq→keyword fallback.

**Frontend:**
- Existing "Try Your Case" tab (`#tab-custom`) is currently slider-heavy. Add a prominent textarea + Submit above the sliders. On submit, `fetch('/api/custom/run', {situation: text})` → render a card with detected domain(s), recommended action type/target, metric deltas as coloured badges (green for positive on positive-sense metrics, red otherwise, using `INVERTED_METRICS` set), reward bar.

### F4 — Personality Comparison
**Backend:**
- `POST /api/personality/compare` → body `{conflict_id="d5_friday", person_a, person_b, steps=3}`.
  - Look up persons from `PERSONS`. Run `_run_episode` twice with the trained agent on the same conflict + seed.
  - Return `{person_a: {name, actions, total_reward, ocean: {O,C,E,A,N}}, person_b: {...}, dominant_trait: "neuroticism"}` where `dominant_trait = argmax(|ocean_a[t] - ocean_b[t]|)`.

**Frontend:**
- New tab "Personality". Two `.glass` columns. Each has a Chart.js radar chart (already CDN-loaded) with 5 axes (OCEAN). Below the radar: action sequence + total reward. Banner highlighting the dominant trait.

### F5 — Domain Risk Heatmap
**Backend:** `compute_domain_health()` helper added (cross-cutting section). Every response from `/api/simulation/start`, `/api/simulation/action`, `/api/custom/run` gets an extra `domain_health` field derived from the metrics already in the payload — no new route.

**Frontend:** Persistent top bar above tab nav (inserted at ~line 35): 6 cells (2×3 grid on small, 6×1 on large). Each cell shows the domain emoji from `DOMAIN_EMOJI` and a pill background coloured via `hsl((1 - h) * 120, 70%, 45%)`. Re-rendered from every simulation response.

### F6 — Counterfactual Explorer
**Backend:**
- `POST /api/counterfactuals/generate` → body `{conflict, person, chosen_action: {...}}`. Reconstructs state, calls `generate_counterfactuals(AGENT, metrics, budget, conflict, person, chosen_action)`, returns `{chosen: {...}, alternatives: [3 items from the list]}`. (Counterfactuals already appear inside `/api/simulation/action` response — this route is the on-demand variant Feature 6 wants.)

**Frontend:** "What If?" collapsible panel appended below each step output. 3 alternative cards sorted by predicted reward. Chosen action outlined in indigo, best alt in green, worst in red.

### F7 — Memory Ablation (Cold vs Warm)
**Backend:**
- `POST /api/memory/ablation` → body `{conflict, person, steps=5}`.
  - Episode 1: pass `memory=None` (or a fresh `LifeStackAgent()` with empty `.memory`). Record actions + rewards.
  - `MEMORY.store_trajectory(conflict_title=..., route_taken=..., total_reward=..., reasoning=...)` for episode 1.
  - Episode 2: reuse `AGENT` (global — has ChromaDB via `MEMORY`). Query `MEMORY` for similar trajectories (existing retrieval method) and pass the top-k summary into `get_action`'s `few_shot_context` param.
  - Return `{cold: {actions, reward}, warm: {actions, reward, retrieved_context}, improvement_pct}`.

**Frontend:** Two-column timeline in a new "Memory" tab. Callout box with `💡 Agent recalled: …` when warm has retrieved context. Big percentage banner at the bottom.

### F8 — Multi-Step GRPO Training
**`scripts/train_trl.py` (currently 914 lines, single-prompt per scenario):**
- Add `run_full_episode(task, person, model, tokenizer, max_steps=10) -> tuple[list[step_reward], dict]`:
  - For each step: build prompt from current `LifeMetrics` + `ResourceBudget` + conflict, call `model.generate`, parse JSON action, call `env.step()`, append step reward from existing `compute_task_reward()`.
  - Return per-step rewards and a serialised trajectory.
- New CLI flag `--full-episode`. When set, `generate_dataset()` is replaced by `generate_episodic_dataset()` which calls `run_full_episode` per scenario and uses `sum(step_rewards) / max_steps` as the GRPO reward.
- `--dry-run` compatibility: 1 episode × 2 steps with a mock model (existing dry-run path stays valid).
- After `trainer.save_model()` at line 610, add `if not args.dry_run and args.push_to_hub: model.push_to_hub("jdsb06/lifestack-grpo-v2"); tokenizer.push_to_hub("jdsb06/lifestack-grpo-v2")`. New `--push-to-hub` flag guards it.
- Run on HF A10G once built: `python scripts/train_trl.py --full-episode --stages 5 --push-to-hub` (~$5).

### F9 — RLHF Loop
- **Backend:** `/api/feedback/submit` already fully implemented (line 267). No route changes needed.
- **Frontend:** Post-episode feedback panel (rendered after every completed simulation/custom/comparison episode). Slider 0–10, domain checkboxes (6 domains × improved/worsened), textarea. Submit posts `{episode_id, score, improved[], worsened[], notes, time}` to existing endpoint.
- **Training integration (`scripts/train_trl.py`):** New `--with-human-feedback` flag. When set, a new reward component `reward_human_feedback_fn` (hook already exists around line 379) loads stored feedback via `MEMORY.feedback_collection.query()` keyed by episode_id and blends `compute_human_feedback_reward()` output at weight 0.10, rebalancing existing weights proportionally.

### F10 — Real Data Integrations
**Backend:**
- `POST /api/data/health/upload` (multipart): accepts `.json` (Google Fit) or `.xml` (Apple Health). Parse `steps`, `heart_rate_resting`, `sleep_hours` (approximate parse; tolerate missing fields). Map to `physical_health.fitness`, `physical_health.energy`, `physical_health.sleep_quality`. Store in new module-level dict `USER_HEALTH_OVERRIDES`. Return `{parsed_metrics, events_found}`.
- `POST /api/data/calendar/upload` (multipart): `.ics` via `icalendar.Calendar.from_ical()`. Count events in next 7 days → `time.free_hours_per_week` (inverse), `career.workload`. Keyword match ("gym", "run", "yoga") → bump `physical_health.fitness`. Return same shape.
- `/api/simulation/start` and `/api/custom/run` consult `USER_HEALTH_OVERRIDES` when initialising `LifeMetrics()`.

**Frontend:** New "Connect My Data" subsection at the top of "Try Your Case". Two file inputs. After upload, render a chip list with `📊 From your real data — physical_health.fitness: 78`.

### F11 — BLOG.md (~700 words)
Rewrite the 13-line BLOG.md with 5 sections: Problem, What We Built, Key Results (+125%, +155%, +116% — already in README lines 45–71), What We Learned, What's Next. Inline-cite the 4 papers from README lines 233–241 (Starcke & Brand 2012; Roijers et al. 2013; Mullainathan & Shafir 2013; Wang et al. 2024).

### F12 — Four Tests (tests/)
- `test_env_reset.py`: `LifeStackEnv().reset()` → budget is fresh; reset twice → metrics identical. ~20 lines, pytest.
- `test_cascade.py`: `animate_cascade({"mental_wellbeing.stress_level": 30}, LifeMetrics())` returns 4 frames; frame 0 status all `unchanged`; frame 1 has at least one `primary`.
- `test_task_generator.py` (scoped per user answer): asserts `generate_conflict()` returns a valid `ConflictEvent` for each of the 6 life domains and `TEMPLATES` covers difficulties 1–5.
- `test_reward.py`: `compute_reward()` result in `[-1, 1]`; plausibility component penalises a 0-cost, 50-delta action.

### F13 — Episode History
**Backend:** 
- Maintain ring buffer `EPISODE_HISTORY: deque[dict] = deque(maxlen=5)` module-level in `app_flask.py`. After every episode-producing route, append `{id, conflict, steps[], final_reward, timestamp}`.
- `GET /api/history/list` returns summaries. `GET /api/history/replay/<episode_id>` returns full step log.

**Frontend:** New "History" tab, accordion list, click-to-expand per episode.

---

## Critical Files to Modify

| File | Features touching it |
|------|------|
| `app_flask.py` | F1, F2, F4, F5, F6, F7, F10, F13 (7 new routes, 3 helpers, 1 deque) |
| `intake/intake.py` | F3 (LLM fallback chain, keyword match) |
| `templates/index.html` | F1, F2, F3, F4, F5, F6, F7, F9, F10, F13 (new tabs, heatmap bar, D3 SVG, feedback panel) |
| `scripts/train_trl.py` | F8 (`run_full_episode`, `--full-episode`, `--push-to-hub`), F9 (`--with-human-feedback`) |
| `requirements.txt` | `huggingface_hub`, `icalendar` |
| `BLOG.md` | F11 (full rewrite) |
| `tests/test_env_reset.py`, `test_cascade.py`, `test_task_generator.py`, `test_reward.py` | F12 (new files) |

No other files get edited. No existing route or dataclass is modified.

---

## Verification

**Local (no GPU):**
```bash
python scripts/smoke_test.py
python scripts/eval.py --episodes 5
python -m pytest tests/ -v
python scripts/train_trl.py --full-episode --dry-run   # F8 dry-run
python app_flask.py  # open localhost:7860, click through each new tab
```

**HF Inference API check (F3):**
```python
from huggingface_hub import InferenceClient; import os
c = InferenceClient(model="Qwen/Qwen2.5-1.5B-Instruct", token=os.getenv("HF_TOKEN"))
print(c.chat_completion([{"role":"user","content":"Reply OK"}], max_tokens=5).choices[0].message.content)
```

**HF Space (T4, $0.60/hr, leave running 25 Apr 8 AM → 26 Apr 5 PM ≈ $20):**
1. Space settings → hardware: T4 Small.
2. Secrets: `HF_TOKEN`, `GROQ_API_KEY`.
3. Push branch → confirm Flask app starts on port 7860 → open every tab.

**A10G training run (F8, ~$5, one-off):** 
```bash
python scripts/train_trl.py --full-episode --stages 5 --push-to-hub
```
Afterwards: `https://huggingface.co/jdsb06/lifestack-grpo-v2` should show the checkpoint.

**End-to-end demo walkthrough to rehearse before 26 Apr 5 PM:**
1. Open Situational Portal → run Friday 6PM conflict → cascade SVG animates, heatmap shifts red.
2. Switch to Comparison tab → same conflict → watch delta bar fill positive.
3. Personality tab → Alex vs Chloe → radars + different rewards.
4. Try Your Case → paste "I just got fired and rent is due tomorrow" → plan card renders.
5. Memory tab → cold vs warm ablation → +116% banner.
6. Submit a feedback slider → stats endpoint reflects new feedback count.