# LifeStack: implementation summary
**Repo:** [https://github.com/oki-dokii/Meta-R2](https://github.com/oki-dokii/Meta-R2)
**HF Space:** [https://huggingface.co/spaces/jdsb06/meta-r2](https://huggingface.co/spaces/jdsb06/meta-r2)
---
## What LifeStack is
A multi-domain life-management RL environment built on OpenEnv 0.2.3. It models a human life as 23 interdependent metrics across 6 domains with a 32-edge directed dependency graph. Actions are structured JSON: the trained model allocates time, money, and energy across competing priorities without collapsing any metric to zero.
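
To make the dependency-graph idea concrete, here is a minimal sketch of how a change to one metric might cascade along directed edges. The metric names, edge weights, the 0 to 100 range, the damping factor, and the `propagate` helper are all hypothetical; the actual `DependencyGraph` in `core/life_state.py` may work quite differently.

```python
from collections import deque

# Hypothetical edges: (source, target) -> weight. Not the real 32-edge graph.
EDGES = {
    ("sleep_quality", "energy"): 0.5,
    ("energy", "work_output"): 0.4,
    ("stress", "sleep_quality"): -0.3,
}


def propagate(metrics, source, delta, damping=0.5, min_effect=0.1):
    """Apply `delta` to `source`, then push attenuated effects downstream.

    Returns a new dict; the input mapping is left untouched.
    """
    updated = dict(metrics)
    queue = deque([(source, delta)])
    while queue:
        node, change = queue.popleft()
        # Clamp to the assumed 0-100 metric range.
        updated[node] = max(0.0, min(100.0, updated[node] + change))
        for (src, dst), weight in EDGES.items():
            effect = change * weight * damping
            # Drop effects that have decayed below the threshold.
            if src == node and abs(effect) >= min_effect:
                queue.append((dst, effect))
    return updated


baseline = {"sleep_quality": 50.0, "energy": 50.0, "work_output": 50.0, "stress": 50.0}
after_shock = propagate(baseline, "stress", 20.0)
```

Because each hop multiplies the effect by a weight and a damping factor below 1, the cascade dies out after a few edges instead of propagating forever.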
**Policy:** `Qwen2.5-1.5B-Instruct` (QLoRA / bf16) + LoRA adapters via GRPO fine-tuning.
**Output:** `{"action_type": ..., "target_domain": ..., "metric_changes": {...}, "resource_cost": {...}, "reasoning": "..."}` or episodic `{"actions": [...]}`.
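
As a rough illustration of the two output shapes, the sketch below normalizes either form into a list of single-step actions and checks for the five keys named above. The `normalize_actions` helper and the example values (`"rest"`, `"health"`, and so on) are hypothetical, and it assumes each entry of an episodic `actions` list uses the same keys as a single-step action, which the summary does not state.

```python
# Keys taken from the single-step output format described above.
REQUIRED_KEYS = {
    "action_type",
    "target_domain",
    "metric_changes",
    "resource_cost",
    "reasoning",
}


def normalize_actions(payload):
    """Return a list of single-step action dicts, or raise ValueError."""
    if not isinstance(payload, dict):
        raise ValueError("top-level JSON must be an object")
    # Episodic outputs wrap their steps in an "actions" list.
    actions = payload["actions"] if "actions" in payload else [payload]
    if not isinstance(actions, list) or not actions:
        raise ValueError("'actions' must be a non-empty list")
    for i, action in enumerate(actions):
        if not isinstance(action, dict):
            raise ValueError(f"action {i} is not an object")
        missing = REQUIRED_KEYS - action.keys()
        if missing:
            raise ValueError(f"action {i} is missing keys: {sorted(missing)}")
    return actions


# Hypothetical example values, for illustration only.
single_step = {
    "action_type": "rest",
    "target_domain": "health",
    "metric_changes": {"energy": 5},
    "resource_cost": {"time": 1},
    "reasoning": "Recover before the deadline.",
}
episodic = {"actions": [single_step, single_step]}
```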
---
## Training timeline
| Run | Key change | Outcome |
|-----|-----------|---------|
| v1 Run 1 | Initial: `max_completion_length` too large | Parse failures → mean **−0.944** |
| v1 Run 2 | Shorter completions | Minimal learning → **−0.266** |
| v1 Run 3 | JSON extraction fix (`raw_decode` + greedy regex) | First positive mean → **+0.023** |
| v1 Run 4 | Longer single-step run | Plateau → **β0.1** eval |
| v1 Run 5 | Episodic GRPO, horizon=3 | Best → **+0.734**, curriculum 1→2 |
| v3 | Episodic curriculum 1→4 with `reward_compact_fn` | `reward_compact_fn` constant → bad logged mean |
| v4 | Removed compact head; weights `[1, 0.5, 0.5, 2]` on format/EOS/plausibility/return | Peak reward: **+0.856** |
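
The v4 weighting can be read as a weighted sum over four reward heads. The following is a sketch under that assumption only: the `combine_rewards` helper and the head names used as dictionary keys are hypothetical, and the summary does not say whether the real training code normalizes or clips the result.

```python
# Weights from the v4 row, in the order format / EOS / plausibility / return.
V4_WEIGHTS = {"format": 1.0, "eos": 0.5, "plausibility": 0.5, "return": 2.0}


def combine_rewards(scores, weights=V4_WEIGHTS):
    """Weighted sum of per-head scores; every weighted head must be present."""
    missing = weights.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing reward heads: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical per-head scores for one completion.
example_scores = {"format": 1.0, "eos": 1.0, "plausibility": 0.0, "return": 0.5}
total = combine_rewards(example_scores)
```

With these weights the return head counts twice as much as format, so a well-formed completion with a poor episode return still scores low overall.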
**HF Hub:** `jdsb06/lifestack-grpo` (v1), `jdsb06/lifestack-grpo-v3`, `jdsb06/lifestack-grpo-v4`.
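
The Run 3 fix pairs `raw_decode` with a greedy regex. One way the two could be combined is sketched below: the regex narrows the completion to the span between the first `{` and the last `}`, and `json.JSONDecoder.raw_decode` then parses the first valid object inside that span while ignoring any trailing text. The `extract_first_json` name and the exact control flow are assumptions, not the repository's code.

```python
import json
import re

_DECODER = json.JSONDecoder()


def extract_first_json(text):
    """Return the first JSON object embedded in `text`, or None."""
    # Greedy match: from the first '{' to the last '}' in the completion.
    span = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if span is None:
        return None
    candidate = span.group(0)
    # Try each '{' in turn; raw_decode tolerates text after the object.
    for brace in re.finditer(r"\{", candidate):
        try:
            obj, _end = _DECODER.raw_decode(candidate, brace.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None


# Hypothetical model completions, for illustration only.
chatty = 'Sure! Here is the plan: {"action_type": "rest", "metric_changes": {"energy": 5}} Hope that helps'
noisy = 'prefix {not json} then {"a": 1} trailing'
parsed_chatty = extract_first_json(chatty)
parsed_noisy = extract_first_json(noisy)
```

Returning `None` instead of raising lets a reward function score an unparseable completion as a format failure rather than crash the training step.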
---
## Code map
| File | What it does |
|------|-------------|
| `core/lifestack_env.py` | `LifeStackEnv`: reset, step, rollout, WorldEngine |
| `core/life_state.py` | `LifeMetrics` (23 values), `DependencyGraph` (32 edges), `ResourceBudget` |
| `core/reward.py` | `compute_task_reward`, `compute_reward`, 4 standalone scoring functions |
| `core/task.py` | `Task`, `Route`, `Milestone`, `ExoEvent`, `FlightCrisisTask`, `CodeMergeCrisisTask` |
| `core/verifier.py` | `LifeStackVerifier`: success/failure/milestone checks |
| `core/cascade_utils.py` | `animate_cascade()`: frame-by-frame visualization of propagation |
| `core/action_space.py` | `AgentAction`, `PrimaryAction`, `apply_action` |
| `agent/agent.py` | `LifeStackAgent`: GRPO model + Groq fallback, prompt building |
| `agent/memory.py` | `LifeStackMemory`: ChromaDB episodic memory, similarity retrieval |
| `agent/conflict_generator.py` | `TaskGenerator` (8 domains), `ConflictEvent` templates |
| `agent/conflict_predictor.py` | `ConflictPredictor`: pattern matching on episode history |
| `agent/counterfactuals.py` | What-if reasoning over metric snapshots |
| `intake/simperson.py` | `SimPerson`: Big Five personality, action uptake scaling |
| `scripts/train_trl.py` | Full GRPO training: `LifeStackGRPOTrainer`, 5-stage curriculum |
| `scripts/eval.py` | Random-policy baseline for reward floor measurement |
| `app_flask.py` | Flask demo UI: port 7860, 10 tabs, Chart.js, vis-network |
| `server.py` | Crash-safe OpenEnv server entry point: port 8000 |
| `start.sh` | Docker CMD: starts both services |
---
## Known limitations
`reward_timeout_check()` in `core/reward.py` has a known bug: it only fires when `step_count >= max_steps AND done == False`. Since `done` is `True` by the time the check runs, this function returns 0.0 in normal operation. The timeout penalty in `compute_task_reward()` is applied directly rather than through this function.
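
The sketch below reproduces the described condition in isolation to show why it never fires, alongside one possible alternative that keys on task success instead of `done`. Both function names, the `success` flag, and the `-1.0` penalty value are hypothetical; this is not the code in `core/reward.py`.

```python
TIMEOUT_PENALTY = -1.0  # hypothetical magnitude, for illustration only


def timeout_check_as_described(step_count, max_steps, done):
    """Mirrors the condition described above: it requires done == False."""
    if step_count >= max_steps and not done:
        return TIMEOUT_PENALTY
    return 0.0


def timeout_check_relaxed(step_count, max_steps, success):
    """Possible alternative: penalize reaching the step limit without success."""
    if step_count >= max_steps and not success:
        return TIMEOUT_PENALTY
    return 0.0


# At the step limit the episode is already marked done, so the first
# variant returns 0.0 even though the task timed out.
described = timeout_check_as_described(step_count=5, max_steps=5, done=True)
relaxed = timeout_check_relaxed(step_count=5, max_steps=5, success=False)
```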
The holdout evaluation reuses the same `TaskGenerator` as training. This limits how confidently we can claim generalization to novel conflict types.
---
## Related docs
- [lifestack_env.md](lifestack_env.md): environment reference
- [reward.md](reward.md): all reward functions
- [train_trl.md](train_trl.md): training reference
- [DEPLOYMENT.md](DEPLOYMENT.md): deployment guide