# LifeStack: implementation summary
**Repo:** [https://github.com/oki-dokii/Meta-R2](https://github.com/oki-dokii/Meta-R2)
**HF Space:** [https://huggingface.co/spaces/jdsb06/meta-r2](https://huggingface.co/spaces/jdsb06/meta-r2)
---
## What LifeStack is
A multi-domain life-management RL environment built on OpenEnv 0.2.3. It models a human life as 23 interdependent metrics across 6 domains with a 32-edge directed dependency graph. Actions are structured JSON; the trained model allocates time, money, and energy across competing priorities without collapsing any metric to zero.
**Policy:** `Qwen2.5-1.5B-Instruct` (QLoRA / bf16) + LoRA adapters via GRPO fine-tuning.
**Output:** `{"action_type": ..., "target_domain": ..., "metric_changes": {...}, "resource_cost": {...}, "reasoning": "..."}` or episodic `{"actions": [...]}`.
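
As a rough illustration of the output contract above, the sketch below checks that a decoded completion is either a single action carrying the five listed keys or an episodic wrapper with an `actions` list. The helper names (`is_single_action`, `is_valid_action_payload`) and the type checks on the nested fields are assumptions made for this example, not the project's actual validation code.

```python
import json

# Keys taken from the single-step output format described above.
REQUIRED_KEYS = {"action_type", "target_domain", "metric_changes", "resource_cost", "reasoning"}


def is_single_action(obj) -> bool:
    """Return True if obj is a dict carrying every required action key."""
    if not isinstance(obj, dict) or not REQUIRED_KEYS.issubset(obj):
        return False
    # Assumption: the two nested fields are JSON objects.
    return isinstance(obj["metric_changes"], dict) and isinstance(obj["resource_cost"], dict)


def is_valid_action_payload(text: str) -> bool:
    """Accept a single action or an episodic {"actions": [...]} wrapper."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    if isinstance(obj, dict) and "actions" in obj:
        actions = obj["actions"]
        return isinstance(actions, list) and len(actions) > 0 and all(is_single_action(a) for a in actions)
    return is_single_action(obj)
```

A check of this kind could sit in front of a format reward, so that malformed completions are scored without ever reaching the environment.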
---
## Training timeline
| Run | Key change | Outcome |
|-----|-----------|---------|
| v1 Run 1 | Initial run; `max_completion_length` too large | Parse failures → mean **−0.944** |
| v1 Run 2 | Shorter completions | Minimal learning → **−0.266** |
| v1 Run 3 | JSON extraction fix (`raw_decode` + greedy regex) | First positive mean ≈ **+0.023** |
| v1 Run 4 | Longer single-step run | Plateau ≈ **−0.1** eval |
| v1 Run 5 | Episodic GRPO, horizon=3 | Best ≈ **+0.734**, curriculum 1→2 |
| v3 | Episodic curriculum 1→4 with `reward_compact_fn` | `reward_compact_fn` constant → bad logged mean |
| v4 | Removed compact head; weights `[1, 0.5, 0.5, 2]` on format/EOS/plausibility/return | Peak reward ≈ **+0.856** |
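
The v4 weighting can be read as a weighted sum over four reward heads. The sketch below assumes each head returns one scalar per completion and that the heads are combined by a plain weighted sum; the function name `combine_rewards` and the example scores are hypothetical.

```python
# Weights from the v4 row, in the order format / EOS / plausibility / return.
V4_WEIGHTS = {"format": 1.0, "eos": 0.5, "plausibility": 0.5, "return": 2.0}


def combine_rewards(scores: dict, weights: dict) -> float:
    """Weighted sum of per-head scores; a missing head raises KeyError."""
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical per-head scores for a single completion.
example_scores = {"format": 1.0, "eos": 1.0, "plausibility": 0.5, "return": 0.25}
total = combine_rewards(example_scores, V4_WEIGHTS)
print(total)  # 1.0*1.0 + 0.5*1.0 + 0.5*0.5 + 2.0*0.25 = 2.25
```

Under this reading, the return head carries twice the weight of the format head, so a well-formed but low-return action still scores poorly.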
**HF Hub:** `jdsb06/lifestack-grpo` (v1), `jdsb06/lifestack-grpo-v3`, `jdsb06/lifestack-grpo-v4`.
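
The JSON extraction fix in v1 Run 3 combines `raw_decode` with a greedy regex. How the two are combined in `scripts/train_trl.py` is not shown here, so the following is only one plausible arrangement: a greedy regex narrows the completion to the span between the first `{` and the last `}`, and `json.JSONDecoder.raw_decode` then parses the first valid object inside that span while ignoring trailing text. The function name `extract_first_json` is hypothetical.

```python
import json
import re

_DECODER = json.JSONDecoder()


def extract_first_json(text: str):
    """Return the first JSON object found in text, or None if there is none."""
    # Greedy regex: narrow the search to the span from the first "{" to the last "}".
    span = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if span is None:
        return None
    candidate = span.group(0)
    # raw_decode parses one JSON value starting at the given index and ignores
    # anything after it, so trailing prose or stray braces do not break parsing.
    for brace in re.finditer(r"\{", candidate):
        try:
            obj, _end = _DECODER.raw_decode(candidate, brace.start())
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace; try the next one
        if isinstance(obj, dict):
            return obj
    return None
```

Trying every opening brace costs a little extra work on malformed completions, but it tolerates chatty model output that wraps the action in prose.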
---
## Code map
| File | What it does |
|------|-------------|
| `core/lifestack_env.py` | `LifeStackEnv`: reset, step, rollout, WorldEngine |
| `core/life_state.py` | `LifeMetrics` (23 values), `DependencyGraph` (32 edges), `ResourceBudget` |
| `core/reward.py` | `compute_task_reward`, `compute_reward`, 4 standalone scoring functions |
| `core/task.py` | `Task`, `Route`, `Milestone`, `ExoEvent`, `FlightCrisisTask`, `CodeMergeCrisisTask` |
| `core/verifier.py` | `LifeStackVerifier`: success/failure/milestone checks |
| `core/cascade_utils.py` | `animate_cascade()`: frame-by-frame visualization of propagation |
| `core/action_space.py` | `AgentAction`, `PrimaryAction`, `apply_action` |
| `agent/agent.py` | `LifeStackAgent`: GRPO model + Groq fallback, prompt building |
| `agent/memory.py` | `LifeStackMemory`: ChromaDB episodic memory, similarity retrieval |
| `agent/conflict_generator.py` | `TaskGenerator` (8 domains), `ConflictEvent` templates |
| `agent/conflict_predictor.py` | `ConflictPredictor`: pattern matching on episode history |
| `agent/counterfactuals.py` | What-if reasoning over metric snapshots |
| `intake/simperson.py` | `SimPerson`: Big Five personality, action uptake scaling |
| `scripts/train_trl.py` | Full GRPO training: `LifeStackGRPOTrainer`, 5-stage curriculum |
| `scripts/eval.py` | Random-policy baseline for reward floor measurement |
| `app_flask.py` | Flask demo UI: port 7860, 10 tabs, Chart.js, vis-network |
| `server.py` | Crash-safe OpenEnv server entry point: port 8000 |
| `start.sh` | Docker CMD: starts both services |
---
## Known limitations
`reward_timeout_check()` in `core/reward.py` has a known bug: it only fires when `step_count >= max_steps AND done == False`. Since `done` is `True` by the time the check runs, this function returns 0.0 in normal operation. The timeout penalty in `compute_task_reward()` is applied directly rather than through this function.
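
To make the dead branch concrete, here is a toy re-creation of the guard described above. The signature, the penalty value, and the episode loop are all assumptions for illustration; only the condition `step_count >= max_steps and not done` comes from the description.

```python
TIMEOUT_PENALTY = -1.0  # hypothetical value, for illustration only


def reward_timeout_check_sketch(step_count: int, max_steps: int, done: bool) -> float:
    """Mimics the guard described above: fires only if the episode is NOT done."""
    if step_count >= max_steps and not done:
        return TIMEOUT_PENALTY
    return 0.0


def run_episode(max_steps: int) -> float:
    """Toy loop: the episode is marked done once the step budget is reached."""
    step_count, done = 0, False
    while not done:
        step_count += 1
        done = step_count >= max_steps  # done flips to True on the final step
    # The check runs after the loop, so done is already True here.
    return reward_timeout_check_sketch(step_count, max_steps, done)
```

In this toy setup the penalty branch is reachable only when the function is called with `done=False` by hand, which matches the behavior reported for normal operation.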
The holdout evaluation reuses the same `TaskGenerator` as training. This limits how confidently we can claim generalization to novel conflict types.
---
## Related docs
- [lifestack_env.md](lifestack_env.md): environment reference
- [reward.md](reward.md): all reward functions
- [train_trl.md](train_trl.md): training reference
- [DEPLOYMENT.md](DEPLOYMENT.md): deployment guide