LifeStack / Implementation_final.md
Soham Banerjee
deploy: pure lifestack with partitioned wisdom pool
77da5ce

LifeStack Hackathon Sprint β€” Implementation Plan

Context

Submission deadline: 26 Apr 5 PM. Offline from 25 Apr 8 AM. ~30 hours of offline build time.

The LifeStack Flask demo (app_flask.py + templates/index.html) already ships 10 API endpoints, a 6-tab UI, and a working agent/memory/cascade/reward pipeline. This sprint adds 13 additive features (demo panels, APIs, RLHF loop, multi-step training, real-data connectors, tests, blog) without breaking existing endpoints. All work is additive.

Budget: $90 HF credits β€” T4 Small for the always-on demo Space, A10G for GRPO training runs, HF Inference API for the NLP panel. Target trained checkpoint: jdsb06/lifestack-grpo-v2 (user will push).

Key reusable primitives already in repo (do not rebuild):

  • core/cascade_utils.py:5 animate_cascade() β€” returns list of 4 frames with flat + status dicts
  • agent/counterfactuals.py:10 generate_counterfactuals() β€” returns list of alternatives
  • agent/memory.py:74 LifeStackMemory.store_trajectory() and :128 store_feedback(OutcomeFeedback)
  • core/feedback.py OutcomeFeedback + compute_human_feedback_reward()
  • core/life_state.py:61 LifeMetrics.flatten() β€” 23 metric paths
  • agent/conflict_generator.py TEMPLATES (13 scenarios) + generate_conflict()
  • core/metric_schema.py VALID_METRIC_PATHS

Already wired in app_flask.py: /api/feedback/submit (Feature 9 backend is done β€” scope of F9 reduces to frontend panel + training integration); /api/simulation/cascade (kept intact, new /api/cascade/frames added alongside).


Implementation Order (Offline Sprint)

  1. F1 Trained-vs-Baseline comparison (impact demo)
  2. F5 Domain risk heatmap (sidebar, always visible)
  3. F3 "Try Your Own" NLP + HF Inference fallback
  4. F2 D3 cascade visualisation
  5. F4 Personality comparison with OCEAN radar
  6. F6 Counterfactual explorer panel
  7. F8 Multi-step GRPO training loop + push_to_hub
  8. F9 RLHF feedback panel + training integration
  9. F7 Cold-vs-warm memory ablation demo
  10. F10 Health + calendar uploads
  11. F11 BLOG.md (~700 words)
  12. F12 Four tests
  13. F13 Episode history/replay

Before starting, run smoke tests (scripts/smoke_test.py, scripts/eval.py --episodes 5, cascade/counterfactual imports). Fix before adding features.


Cross-Cutting Changes

requirements.txt β€” add

  • huggingface_hub (for F3 InferenceClient and F8 push_to_hub)
  • icalendar (F10 calendar upload)

intake/intake.py β€” LLM fallback chain (F3 dependency)

Refactor _call_llm() (~line 44) to cascade: HF Inference API (HF_TOKEN) β†’ Groq (GROQ_API_KEY) β†’ empty-string fallback (existing behaviour). LifeIntake.__init__ constructs both an InferenceClient(model="Qwen/Qwen2.5-1.5B-Instruct", token=HF_TOKEN) when HF_TOKEN is present and the existing Groq OpenAI client when GROQ_API_KEY is present. extract_conflict() already returns an empty ConflictEvent when the LLM returns empty β€” keyword fallback below strengthens that path.

Keyword fallback: add _match_template_by_keywords(text: str) -> ConflictEvent | None that scans TEMPLATES for overlap with user text and returns the best match. Called inside extract_conflict() when both LLM clients fail.

app_flask.py β€” shared helpers (used by F1, F4, F5, F7)

  • _run_episode(person, conflict, steps, seed, agent_fn) -> list[step_dict]: initialises a fresh LifeStackEnv, applies the conflict disruption, loops steps iterations calling agent_fn(metrics, budget, conflict, person) to pick an action, runs env.step(), and collects {step, action_type, target, reward, metrics, cost}. agent_fn is injected so F1 can pass a random-action picker and a LifeStackAgent.get_action-wrapped version.
  • _random_action(metrics, budget, conflict, person) -> AgentAction: samples uniformly from core.action_space.EXAMPLE_ACTIONS (line 98–196) and jitters metric_changes slightly so the baseline isn't deterministic. Same return shape as AGENT.get_action().
  • compute_domain_health(flat_metrics: dict) -> dict[str, float]: averages sub-metrics per domain, inverts INVERTED_METRICS (line 67, already defined), returns {career, finances, relationships, physical_health, mental_wellbeing, time} each in [0,1].

templates/index.html β€” UI integration pattern

Every new feature adds one new tab button in the nav bar (line 37–44) and one content <div id="content-X"> in the main section (line 46–202). Reuse existing classes: .glass, .tab-active, .metric-bar, Tailwind (.rounded-2xl, .p-6, .space-y-6, .grid grid-cols-2 gap-6, .text-slate-400, .bg-indigo-500/10). Chart.js is already loaded via CDN (line 8); D3 v7 to be added.


Feature-by-Feature

F1 β€” Trained vs Baseline Comparison

Backend β€” app_flask.py:

  • POST /api/comparison/run β†’ body {conflict, person, steps=5, seed=42}.
    • Resolve conflict via CONFLICT_CHOICES, person via PERSONS.
    • Call _run_episode(..., agent_fn=_random_action) β†’ baseline.
    • Call _run_episode(..., agent_fn=lambda m,b,c,p: AGENT.get_action(m,b,c,p)) with identical seed β†’ trained.
    • Compute reward_delta = sum(trained_rewards) - sum(baseline_rewards).
    • Return {baseline: [...], trained: [...], reward_delta}.

Frontend:

  • New tab "Comparison". Two side-by-side .glass cards titled "Baseline (Random)" and "GRPO-Trained". For each step, render action-type badge + reward bar. Delta banner at the bottom (bg-indigo-500/10) showing +X.XX.

F2 β€” Live Cascade Visualisation (D3)

Backend:

  • POST /api/cascade/frames β†’ body {primary_disruption: {metric_path: delta}}. Calls animate_cascade(primary_disruption, LifeMetrics()) and returns {frames}. Keeps existing /api/simulation/cascade untouched.

Frontend:

  • Add D3 v7 CDN line in <head>.
  • New section inside the "Situational Portal" tab (below the existing cascade timeline at line ~70): <svg id="cascade-graph" width="720" height="420">.
  • JS module renderCascade(frames): creates 23 nodes from VALID_METRIC_PATHS, clusters by domain (6 cluster centres at: career TL, finances TR, relationships ML, physical_health MR, mental_wellbeing BC, time TC), draws edges from a hardcoded copy of the 20+ edges in DependencyGraph.edges. Iterates frames with 600ms setTimeout, recolouring nodes based on frames[i].status[metric]: unchangedβ†’#334155, primaryβ†’#ef4444, firstβ†’#f97316, secondβ†’#facc15.
  • Called from the existing simulation-action flow after each /api/simulation/action response.

F3 β€” "Try Your Own Situation" NLP Panel

Backend:

  • /api/custom/run already exists (line 162) and is fully wired. No route changes.
  • intake/intake.py cross-cutting change above adds HFβ†’Groqβ†’keyword fallback.

Frontend:

  • Existing "Try Your Case" tab (#tab-custom) is currently slider-heavy. Add a prominent textarea + Submit above the sliders. On submit, fetch('/api/custom/run', {situation: text}) β†’ render a card with detected domain(s), recommended action type/target, metric deltas as coloured badges (green for positive on positive-sense metrics, red otherwise, using INVERTED_METRICS set), reward bar.

F4 β€” Personality Comparison

Backend:

  • POST /api/personality/compare β†’ body {conflict_id="d5_friday", person_a, person_b, steps=3}.
    • Look up persons from PERSONS. Run _run_episode twice with the trained agent on the same conflict + seed.
    • Return {person_a: {name, actions, total_reward, ocean: {O,C,E,A,N}}, person_b: {...}, dominant_trait: "neuroticism"} where dominant_trait = argmax(|ocean_a[t] - ocean_b[t]|).

Frontend:

  • New tab "Personality". Two .glass columns. Each has a Chart.js radar chart (already CDN-loaded) with 5 axes (OCEAN). Below the radar: action sequence + total reward. Banner highlighting the dominant trait.

F5 β€” Domain Risk Heatmap

Backend: compute_domain_health() helper added (cross-cutting section). Every response from /api/simulation/start, /api/simulation/action, /api/custom/run gets an extra domain_health field derived from the metrics already in the payload β€” no new route.

Frontend: Persistent top bar above tab nav (inserted at ~line 35): 6 cells (2Γ—3 grid on small, 6Γ—1 on large). Each cell shows the domain emoji from DOMAIN_EMOJI and a pill background coloured via hsl((1 - h) * 120, 70%, 45%). Re-rendered from every simulation response.

F6 β€” Counterfactual Explorer

Backend:

  • POST /api/counterfactuals/generate β†’ body {conflict, person, chosen_action: {...}}. Reconstructs state, calls generate_counterfactuals(AGENT, metrics, budget, conflict, person, chosen_action), returns {chosen: {...}, alternatives: [3 items from the list]}. (Counterfactuals already appear inside /api/simulation/action response β€” this route is the on-demand variant Feature 6 wants.)

Frontend: "What If?" collapsible panel appended below each step output. 3 alternative cards sorted by predicted reward. Chosen action outlined in indigo, best alt in green, worst in red.

F7 β€” Memory Ablation (Cold vs Warm)

Backend:

  • POST /api/memory/ablation β†’ body {conflict, person, steps=5}.
    • Episode 1: pass memory=None (or a fresh LifeStackAgent() with empty .memory). Record actions + rewards.
    • MEMORY.store_trajectory(conflict_title=..., route_taken=..., total_reward=..., reasoning=...) for episode 1.
    • Episode 2: reuse AGENT (global β€” has ChromaDB via MEMORY). Query MEMORY for similar trajectories (existing retrieval method) and pass the top-k summary into get_action's few_shot_context param.
    • Return {cold: {actions, reward}, warm: {actions, reward, retrieved_context}, improvement_pct}.

Frontend: Two-column timeline in a new "Memory" tab. Callout box with πŸ’‘ Agent recalled: … when warm has retrieved context. Big percentage banner at the bottom.

F8 β€” Multi-Step GRPO Training

scripts/train_trl.py (currently 914 lines, single-prompt per scenario):

  • Add run_full_episode(task, person, model, tokenizer, max_steps=10) -> tuple[list[step_reward], dict]:
    • For each step: build prompt from current LifeMetrics + ResourceBudget + conflict, call model.generate, parse JSON action, call env.step(), append step reward from existing compute_task_reward().
    • Return per-step rewards and a serialised trajectory.
  • New CLI flag --full-episode. When set, generate_dataset() is replaced by generate_episodic_dataset() which calls run_full_episode per scenario and uses sum(step_rewards) / max_steps as the GRPO reward.
  • --dry-run compatibility: 1 episode Γ— 2 steps with a mock model (existing dry-run path stays valid).
  • After trainer.save_model() at line 610, add if not args.dry_run and args.push_to_hub: model.push_to_hub("jdsb06/lifestack-grpo-v2"); tokenizer.push_to_hub("jdsb06/lifestack-grpo-v2"). New --push-to-hub flag guards it.
  • Run on HF A10G once built: python scripts/train_trl.py --full-episode --stages 5 --push-to-hub (~$5).

F9 β€” RLHF Loop

  • Backend: /api/feedback/submit already fully implemented (line 267). No route changes needed.
  • Frontend: Post-episode feedback panel (rendered after every completed simulation/custom/comparison episode). Slider 0–10, domain checkboxes (6 domains Γ— improved/worsened), textarea. Submit posts {episode_id, score, improved[], worsened[], notes, time} to existing endpoint.
  • Training integration (scripts/train_trl.py): New --with-human-feedback flag. When set, a new reward component reward_human_feedback_fn (hook already exists around line 379) loads stored feedback via MEMORY.feedback_collection.query() keyed by episode_id and blends compute_human_feedback_reward() output at weight 0.10, rebalancing existing weights proportionally.

F10 β€” Real Data Integrations

Backend:

  • POST /api/data/health/upload (multipart): accepts .json (Google Fit) or .xml (Apple Health). Parse steps, heart_rate_resting, sleep_hours (approximate parse; tolerate missing fields). Map to physical_health.fitness, physical_health.energy, physical_health.sleep_quality. Store in new module-level dict USER_HEALTH_OVERRIDES. Return {parsed_metrics, events_found}.
  • POST /api/data/calendar/upload (multipart): .ics via icalendar.Calendar.from_ical(). Count events in next 7 days β†’ time.free_hours_per_week (inverse), career.workload. Keyword match ("gym", "run", "yoga") β†’ bump physical_health.fitness. Return same shape.
  • /api/simulation/start and /api/custom/run consult USER_HEALTH_OVERRIDES when initialising LifeMetrics().

Frontend: New "Connect My Data" subsection at the top of "Try Your Case". Two file inputs. After upload, render a chip list with πŸ“Š From your real data β€” physical_health.fitness: 78.

F11 β€” BLOG.md (~700 words)

Rewrite the 13-line BLOG.md with 5 sections: Problem, What We Built, Key Results (+125%, +155%, +116% β€” already in README lines 45–71), What We Learned, What's Next. Inline-cite the 4 papers from README lines 233–241 (Starcke & Brand 2012; Roijers et al. 2013; Mullainathan & Shafir 2013; Wang et al. 2024).

F12 β€” Four Tests (tests/)

  • test_env_reset.py: LifeStackEnv().reset() β†’ budget is fresh; reset twice β†’ metrics identical. ~20 lines, pytest.
  • test_cascade.py: animate_cascade({"mental_wellbeing.stress_level": 30}, LifeMetrics()) returns 4 frames; frame 0 status all unchanged; frame 1 has at least one primary.
  • test_task_generator.py (scoped per user answer): asserts generate_conflict() returns a valid ConflictEvent for each of the 6 life domains and TEMPLATES covers difficulties 1–5.
  • test_reward.py: compute_reward() result in [-1, 1]; plausibility component penalises a 0-cost, 50-delta action.

F13 β€” Episode History

Backend:

  • Maintain ring buffer EPISODE_HISTORY: deque[dict] = deque(maxlen=5) module-level in app_flask.py. After every episode-producing route, append {id, conflict, steps[], final_reward, timestamp}.
  • GET /api/history/list returns summaries. GET /api/history/replay/<episode_id> returns full step log.

Frontend: New "History" tab, accordion list, click-to-expand per episode.


Critical Files to Modify

File Features touching it
app_flask.py F1, F2, F4, F5, F6, F7, F10, F13 (7 new routes, 3 helpers, 1 deque)
intake/intake.py F3 (LLM fallback chain, keyword match)
templates/index.html F1, F2, F3, F4, F5, F6, F7, F9, F10, F13 (new tabs, heatmap bar, D3 SVG, feedback panel)
scripts/train_trl.py F8 (run_full_episode, --full-episode, --push-to-hub), F9 (--with-human-feedback)
requirements.txt huggingface_hub, icalendar
BLOG.md F11 (full rewrite)
tests/test_env_reset.py, test_cascade.py, test_task_generator.py, test_reward.py F12 (new files)

No other files get edited. No existing route or dataclass is modified.


Verification

Local (no GPU):

python scripts/smoke_test.py
python scripts/eval.py --episodes 5
python -m pytest tests/ -v
python scripts/train_trl.py --full-episode --dry-run   # F8 dry-run
python app_flask.py  # open localhost:7860, click through each new tab

HF Inference API check (F3):

from huggingface_hub import InferenceClient; import os
c = InferenceClient(model="Qwen/Qwen2.5-1.5B-Instruct", token=os.getenv("HF_TOKEN"))
print(c.chat_completion([{"role":"user","content":"Reply OK"}], max_tokens=5).choices[0].message.content)

HF Space (T4, $0.60/hr, leave running 25 Apr 8 AM β†’ 26 Apr 5 PM β‰ˆ $20):

  1. Space settings β†’ hardware: T4 Small.
  2. Secrets: HF_TOKEN, GROQ_API_KEY.
  3. Push branch β†’ confirm Flask app starts on port 7860 β†’ open every tab.

A10G training run (F8, ~$5, one-off):

python scripts/train_trl.py --full-episode --stages 5 --push-to-hub

Afterwards: https://huggingface.co/jdsb06/lifestack-grpo-v2 should show the checkpoint.

End-to-end demo walkthrough to rehearse before 26 Apr 5 PM:

  1. Open Situational Portal β†’ run Friday 6PM conflict β†’ cascade SVG animates, heatmap shifts red.
  2. Switch to Comparison tab β†’ same conflict β†’ watch delta bar fill positive.
  3. Personality tab β†’ Alex vs Chloe β†’ radars + different rewards.
  4. Try Your Case β†’ paste "I just got fired and rent is due tomorrow" β†’ plan card renders.
  5. Memory tab β†’ cold vs warm ablation β†’ +116% banner.
  6. Submit a feedback slider β†’ stats endpoint reflects new feedback count.