| # Mentor Meeting Playbook — LifeStack Engine |
|
|
| ## The Core Framing |
| **Research Question:** "Can a small model (1.5B) learn to navigate multi-domain, causally-coupled crises better than a base LLM, using GRPO with a 7-day horizon reward?" |
|
|
| --- |
|
|
| ## Slide Deck Structure (8 Slides Max) |
|
|
| ### Slide 1 — The Gap (30 sec) |
| * **Current AI:** Single-turn advice, no state, no consequence modeling. |
| * **LifeStack:** Life as a Markov Decision Process — 23 metrics, 6 domains, 40 causal edges. |
| * **Hook:** "We built the environment that lets you train models on the 'ripple effects' of human decisions." |
|
|
| ### Slide 2 — The Environment (1 min) |
| * **Standards-Based:** LifeStackEnv extends `openenv.Environment`. |
| * **Causal Foundation:** 40 edges from Starcke & Brand (2012) — research-grounded, not arbitrary. |
| * **Deterministic World:** `DependencyGraph.propagate()` uses matrix math, not LLM hallucination. |
| * **State Vector:** 26-dim observation space across 23 tracked metrics. |
|
|
| ### Slide 3 — The Cascade (The Visual Hook) |
| * **Visual:** Screenshot/GIF of the 4-frame cascade animation (STABLE → DISRUPTION → 1ST CASCADE → 2ND CASCADE). |
| * **Narrative:** "A $350 flight rebooking cascades into stress (day 1) → sleep loss (day 2) → relationship strain (day 4). Our graph engine computes this propagation." |
|
|
| ### Slide 4 — Training Setup (45 sec) |
| * **Model:** Qwen2.5-1.5B-Instruct, fine-tuned with GRPO via HuggingFace TRL. |
| * **Reward:** 7-signal orchestrator (Milestone, Outcome, Preservation, Replan, Efficiency, Reasoning Coherence). |
| * **Innovation:** **$\gamma=0.9$ discounted 7-day rollout.** Decisions are penalized today if they cause system collapse on day 4. |
|
|
| ### Slide 5 — The Research Result (Comparison) |
| | Feature | Untrained LLM (Base) | GRPO-Trained LifeStack | |
| | :--- | :--- | :--- | |
| | **Logic** | Treats each action independently | Reasons across all 6 domains | |
| | **Budgeting** | Maximizes single metric | Preserves global resource budget | |
| | **Strategy** | Generic advice | Reward-shaped justification | |
| | **Memory** | None | RAG memory flywheel (+116% efficiency) | |
|
|
| ### Slide 6 — Memory Flywheel |
| * **The Numbers:** Cold start 42% success rate → Warm (RAG) 88% success rate. |
| * **The Edge:** ChromaDB retrieval lets the agent reason from past successful precedents. |
|
|
| ### Slide 7 — Current Progress (Status) |
| * **Live:** Flask demo on HuggingFace Spaces. |
| * **Functionality:** 6 working tabs including Comparison, Personality Lab, and What-If Lab. |
| * **Pipeline:** GRPO training backbone complete; model lazy-loads for instant demo reliability. |
|
|
| ### Slide 8 — Next Steps |
| * **Full Multi-Step Evaluation:** Running 30-day episodes (beyond single-action). |
| * **Real Data Ingestion:** OAuth for Gmail/Calendar signals (currently stubbed). |
| * **Quantitative Scaling:** Benchmarking 1000+ synthetic scenarios. |
|
|
| --- |
|
|
| ## Demo Script (The 4-Step Sequence) |
|
|
| 1. **Stage the Crisis:** Open the "Situational Portal". Select Alex (Executive) + Career crisis. |
| 2. **The Cascade:** Hit "Start Simulation". Let the 4-frame animation play. **Silence for 5 seconds.** Then: "Every color change was computed by the graph, zero LLM involvement yet." |
| 3. **The Heatmap:** Point at the Red cells. "Red means crisis. Notice how a work deadline dragged Physical Health into the red. The agent must now resolve this composite state." |
| 4. **The Comparison:** Switch to "Trained vs Untrained". Hit "Run Comparison". "On the left is the raw model. On the right is the model after RL feedback on our 7-day reward signal." |
|
|
| --- |
|
|
| ## Counter-Questions & Defensive Positioning (QA) |
|
|
| | Question | Winning Answer | |
| | :--- | :--- | |
| | **"Is this just prompt engineering?"** | "No. We modified model weights via GRPO. The reward comes from the environment simulator, not a system prompt." | |
| | **"Your environment is hand-coded?"** | "The environment physics are expert-coded (research-based); the policy navigating them is learned. Chess rules are coded, but AlphaZero is a research breakthrough." | |
| | **"How do you prevent reward hacking?"** | "Triple-check: Reasoning audit, resource preservation costs, and discounted 7-day rollouts penalize short-sighted wins." | |
| | **"Why 1.5B parameters?"** | "Intentional. It allows consumer-local deployment (privacy) and makes the RL training signal highly measurable." | |
|
|
| --- |
|
|
| ## The Perfect Hook |
|
|
| ### Opening (30 Seconds) |
| > "Most AI tools give you advice. LifeStack gives you consequences. We built a 6-domain, 23-metric RL environment where a career crisis cascades into sleep loss, relationship strain, and financial pressure—all causally linked. Then we trained a model to navigate that using GRPO. The question we're answering is: can a 1.5B model, trained on life-state rewards, make better long-term decisions than an untrained LLM? We can show you the delta right now." |
|
|
| ### Closing (The Final Word) |
| > "The real contribution isn't the UI—its the environment + training loop. Everything you see in the demo is an artifact of that system working." |
|
|