# LifeStack — implementation summary

**Repo:** [https://github.com/oki-dokii/Meta-R2](https://github.com/oki-dokii/Meta-R2)  
**HF Space:** [https://huggingface.co/spaces/jdsb06/meta-r2](https://huggingface.co/spaces/jdsb06/meta-r2)

---

## What LifeStack is

A multi-domain life-management RL environment built on OpenEnv 0.2.3. It models a human life as 23 interdependent metrics across 6 domains, linked by a 32-edge directed dependency graph. Actions are structured JSON: the trained model allocates time, money, and energy across competing priorities without collapsing any metric to zero.
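To make the dependency-graph idea concrete, here is a minimal sketch of a one-hop cascade on a toy three-metric graph. Everything in it is an illustrative assumption: the metric names, the edge weights, the 0 to 100 scale, and the `propagate_once` helper are hypothetical and are not taken from `core/life_state.py`.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (source metric, dependent metric)

def clamp(value: float, low: float = 0.0, high: float = 100.0) -> float:
    """Keep a metric inside the assumed 0-100 range."""
    return max(low, min(high, value))

def propagate_once(
    metrics: Dict[str, float],
    edges: Dict[Edge, float],
    source: str,
    delta: float,
) -> Dict[str, float]:
    """Apply `delta` to `source`, then push it one hop along outgoing edges."""
    updated = dict(metrics)
    updated[source] = clamp(updated[source] + delta)
    for (src, dst), weight in edges.items():
        if src == source:
            # Each dependent metric receives a weighted share of the change.
            updated[dst] = clamp(updated[dst] + weight * delta)
    return updated

# Hypothetical toy graph: sleep -> focus -> mood.
toy_metrics = {"sleep": 50.0, "focus": 60.0, "mood": 70.0}
toy_edges = {("sleep", "focus"): 0.5, ("focus", "mood"): 0.3}
after = propagate_once(toy_metrics, toy_edges, "sleep", -20.0)
print(after)
```

In this sketch a drop in one metric only reaches its direct dependents; a full cascade would repeat the step for each affected metric.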

**Policy:** `Qwen2.5-1.5B-Instruct` (QLoRA, bf16) with LoRA adapters trained by GRPO fine-tuning.

**Output:** `{"action_type": ..., "target_domain": ..., "metric_changes": {...}, "resource_cost": {...}, "reasoning": "..."}` or episodic `{"actions": [...]}`.
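The two output shapes can be told apart with a small structural check. The sketch below uses only the key names shown above. The `classify_output` helper and the placeholder values are hypothetical, and it assumes each entry in `actions` is itself a full single-step action, which the format line does not state.

```python
import json

# Keys copied from the single-step format above.
SINGLE_STEP_KEYS = {
    "action_type", "target_domain", "metric_changes", "resource_cost", "reasoning",
}

def classify_output(payload: dict) -> str:
    """Return "single", "episodic", or "invalid" for a decoded model output."""
    actions = payload.get("actions")
    if isinstance(actions, list):
        # Assumption: every entry of an episodic plan is a full single-step action.
        valid = bool(actions) and all(
            isinstance(a, dict) and SINGLE_STEP_KEYS.issubset(a) for a in actions
        )
        return "episodic" if valid else "invalid"
    return "single" if SINGLE_STEP_KEYS.issubset(payload) else "invalid"

# Placeholder values; the real action types and domain names are not listed here.
step = json.loads(
    '{"action_type": "example", "target_domain": "example", '
    '"metric_changes": {}, "resource_cost": {}, "reasoning": "placeholder"}'
)
print(classify_output(step))                 # single
print(classify_output({"actions": [step]}))  # episodic
```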

---

## Training timeline

| Run | Key change | Outcome |
|-----|-----------|---------|
| v1 Run 1 | Initial — `max_completion_length` too large | Parse failures → mean **−0.944** |
| v1 Run 2 | Shorter completions | Minimal learning → **−0.266** |
| v1 Run 3 | JSON extraction fix (`raw_decode` + greedy regex) | First positive mean ≈ **+0.023** |
| v1 Run 4 | Longer single-step run | Plateau ≈ **−0.1** eval |
| v1 Run 5 | Episodic GRPO, horizon=3 | Best ≈ **+0.734**, curriculum 1→2 |
| v3 | Episodic curriculum 1→4 with `reward_compact_fn` | `reward_compact_fn` constant → bad logged mean |
| v4 | Removed compact head; weights `[1, 0.5, 0.5, 2]` on format/EOS/plausibility/return | Peak reward ≈ **+0.856** |
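The JSON extraction fix in v1 Run 3 combines `raw_decode` with a greedy regex. The repository code is not reproduced here, so the following is only one plausible way to combine the two, using the standard library. The `extract_json` name and the fallback order are assumptions.

```python
import json
import re
from typing import Optional

# Greedy match: from the first "{" to the last "}" in the completion.
_BRACE_SPAN = re.compile(r"\{.*\}", re.DOTALL)

def extract_json(text: str) -> Optional[dict]:
    """Return the first decodable JSON object in `text`, or None."""
    match = _BRACE_SPAN.search(text)
    if match is None:
        return None
    candidate = match.group(0)
    decoder = json.JSONDecoder()
    # raw_decode parses one JSON value starting at `start` and ignores
    # whatever follows it, so trailing chatter does not break parsing.
    for brace in re.finditer(r"\{", candidate):
        try:
            obj, _end = decoder.raw_decode(candidate, brace.start())
        except json.JSONDecodeError:
            continue  # this "{" did not open a valid object; try the next one
        if isinstance(obj, dict):
            return obj
    return None

completion = 'Plan: {"action_type": "example", "reasoning": "ok"} trailing text }'
print(extract_json(completion))
```

A plain `json.loads` on the greedy span would fail on this completion because of the stray closing brace. `raw_decode` stops at the end of the first valid object instead.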

**HF Hub:** `jdsb06/lifestack-grpo` (v1), `jdsb06/lifestack-grpo-v3`, `jdsb06/lifestack-grpo-v4`.
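As a worked illustration of the v4 weighting in the timeline above, the sketch below combines four component scores with the weights `[1, 0.5, 0.5, 2]`. It assumes the components are combined as a plain weighted sum. The component values and the `combine_rewards` helper are hypothetical, not measured results.

```python
# Weights from the v4 row: format, EOS, plausibility, return.
WEIGHTS = {"format": 1.0, "eos": 0.5, "plausibility": 0.5, "return": 2.0}

def combine_rewards(components: dict) -> float:
    """Weighted sum of the four reward components (assumed combination rule)."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)

# Made-up component scores for a single completion.
example = {"format": 1.0, "eos": 1.0, "plausibility": 0.5, "return": 0.2}
total = combine_rewards(example)
print(round(total, 3))
```

With these weights, a unit change in episode return moves the total twice as much as a unit change in format correctness and four times as much as EOS or plausibility.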

---

## Code map

| File | What it does |
|------|-------------|
| `core/lifestack_env.py` | `LifeStackEnv` — reset, step, rollout, WorldEngine |
| `core/life_state.py` | `LifeMetrics` (23 values), `DependencyGraph` (32 edges), `ResourceBudget` |
| `core/reward.py` | `compute_task_reward`, `compute_reward`, 4 standalone scoring functions |
| `core/task.py` | `Task`, `Route`, `Milestone`, `ExoEvent`, `FlightCrisisTask`, `CodeMergeCrisisTask` |
| `core/verifier.py` | `LifeStackVerifier` — success/failure/milestone checks |
| `core/cascade_utils.py` | `animate_cascade()` — frame-by-frame visualization of propagation |
| `core/action_space.py` | `AgentAction`, `PrimaryAction`, `apply_action` |
| `agent/agent.py` | `LifeStackAgent` — GRPO model + Groq fallback, prompt building |
| `agent/memory.py` | `LifeStackMemory` — ChromaDB episodic memory, similarity retrieval |
| `agent/conflict_generator.py` | `TaskGenerator` (8 domains), `ConflictEvent` templates |
| `agent/conflict_predictor.py` | `ConflictPredictor` — pattern matching on episode history |
| `agent/counterfactuals.py` | What-if reasoning over metric snapshots |
| `intake/simperson.py` | `SimPerson` — Big Five personality, action uptake scaling |
| `scripts/train_trl.py` | Full GRPO training — `LifeStackGRPOTrainer`, 5-stage curriculum |
| `scripts/eval.py` | Random-policy baseline for reward floor measurement |
| `app_flask.py` | Flask demo UI — port 7860, 10 tabs, Chart.js, vis-network |
| `server.py` | Crash-safe OpenEnv server entry point — port 8000 |
| `start.sh` | Docker CMD — starts both services |

---

## Known limitations

`reward_timeout_check()` in `core/reward.py` has a known bug: it only fires when `step_count >= max_steps` and `done == False`. Since `done` is already `True` by the time the check runs, the condition does not hold and the function returns 0.0 in normal operation. The timeout penalty is instead applied directly inside `compute_task_reward()`.
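The dead condition can be reproduced in isolation. The function below is a hypothetical reconstruction of the guard as described, not the code in `core/reward.py`: the signature, the `-1.0` penalty, and the way `done` is set are all assumptions made for illustration.

```python
def reward_timeout_check(step_count: int, max_steps: int, done: bool) -> float:
    """Hypothetical reconstruction: penalize only if out of steps and not done."""
    if step_count >= max_steps and not done:
        return -1.0  # placeholder penalty; the real magnitude is not stated here
    return 0.0

# Assumed environment behavior: the episode is flagged done as soon as the
# step budget is exhausted, before any reward function is called.
max_steps = 3
penalties = []
for step_count in range(1, max_steps + 1):
    done = step_count >= max_steps
    penalties.append(reward_timeout_check(step_count, max_steps, done))

print(penalties)  # [0.0, 0.0, 0.0]
```

Under these assumptions the two halves of the guard are never true at the same time, which matches the behavior described above.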

The holdout evaluation reuses the same `TaskGenerator` as training. This limits how confidently we can claim generalization to novel conflict types.

---

## Related docs

- [lifestack_env.md](lifestack_env.md) — environment reference
- [reward.md](reward.md) — all reward functions
- [train_trl.md](train_trl.md) — training reference
- [DEPLOYMENT.md](DEPLOYMENT.md) — deployment guide