	# LifeStack — implementation summary

	Repo: [https://github.com/oki-dokii/Meta-R2](https://github.com/oki-dokii/Meta-R2)
	HF Space: [https://huggingface.co/spaces/jdsb06/meta-r2](https://huggingface.co/spaces/jdsb06/meta-r2)

	---

	## What LifeStack is

	LifeStack is a multi-domain life-management RL environment built on OpenEnv 0.2.3. It models a human life as 23 interdependent metrics across 6 domains, linked by a 32-edge directed dependency graph. Actions are structured JSON: the policy must allocate time, money, and energy across competing priorities without letting any metric collapse to zero.
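To illustrate how a change in one metric can cascade through a directed dependency graph, here is a minimal sketch. The metric names, edge weights, the 0 to 100 clamp, and the `propagate` helper are all assumptions made for illustration; they are not taken from `core/life_state.py`.

```python
from collections import defaultdict


def propagate(metrics, edges, changes, max_depth=2):
    """Apply direct changes, then pass a weighted share of each change downstream.

    metrics: dict of metric name -> current value (assumed range 0..100)
    edges:   list of (source, target, weight) tuples (hypothetical weights)
    changes: dict of metric name -> direct delta from the action
    """
    outgoing = defaultdict(list)
    for src, dst, weight in edges:
        outgoing[src].append((dst, weight))

    updated = dict(metrics)
    frontier = dict(changes)
    for _ in range(max_depth + 1):
        next_frontier = defaultdict(float)
        for name, delta in frontier.items():
            # Clamp the stored value; the unclamped delta is what propagates.
            updated[name] = min(100.0, max(0.0, updated[name] + delta))
            for dst, weight in outgoing[name]:
                next_frontier[dst] += delta * weight
        frontier = next_frontier
        if not frontier:
            break
    return updated


# Hypothetical three-metric chain: sleep -> energy -> focus.
metrics = {"sleep": 50, "energy": 50, "focus": 50}
edges = [("sleep", "energy", 0.5), ("energy", "focus", 0.5)]
result = propagate(metrics, edges, {"sleep": -20})
```

In this toy chain a 20-point drop in `sleep` costs `energy` 10 points and `focus` 5, which is the kind of knock-on effect the policy has to account for when it spends resources.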

	Policy: `Qwen2.5-1.5B-Instruct` (QLoRA / bf16) + LoRA adapters via GRPO fine-tuning.

	Output: `{"action_type": ..., "target_domain": ..., "metric_changes": {...}, "resource_cost": {...}, "reasoning": "..."}` or episodic `{"actions": [...]}`.

	---

	## Training timeline

	| Run | Key change | Outcome |
	|-----|------------|---------|
	| v1 Run 1 | Initial run; `max_completion_length` too large | Parse failures → mean −0.944 |
	| v1 Run 2 | Shorter completions | Minimal learning → −0.266 |
	| v1 Run 3 | JSON extraction fix (`raw_decode` + greedy regex) | First positive mean ≈ +0.023 |
	| v1 Run 4 | Longer single-step run | Plateau ≈ −0.1 eval |
	| v1 Run 5 | Episodic GRPO, horizon=3 | Best ≈ +0.734, curriculum 1→2 |
	| v3 | Episodic curriculum 1→4 with `reward_compact_fn` | `reward_compact_fn` constant → bad logged mean |
	| v4 | Removed compact head; weights `[1, 0.5, 0.5, 2]` on format/EOS/plausibility/return | Peak reward ≈ +0.856 |
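As a rough illustration of the v4 weighting, the sketch below combines four per-head scores with the weights `[1, 0.5, 0.5, 2]`. The head names follow the table row; the `combine_rewards` helper, the example scores, and the choice of a plain weighted sum (rather than a normalized one) are assumptions.

```python
# Weights from the v4 row: format, EOS, plausibility, return.
REWARD_WEIGHTS = {"format": 1.0, "eos": 0.5, "plausibility": 0.5, "return": 2.0}


def combine_rewards(components, weights=REWARD_WEIGHTS):
    """Weighted sum of per-head scores; raises if a head is missing."""
    missing = weights.keys() - components.keys()
    if missing:
        raise KeyError(f"missing reward heads: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)


# Hypothetical scores for a single completion.
example = {"format": 1.0, "eos": 1.0, "plausibility": 0.5, "return": 0.25}
total = combine_rewards(example)
```

With these weights the episode return counts twice as much as format compliance, so a well-formed but ineffective action cannot dominate the signal.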

	HF Hub: `jdsb06/lifestack-grpo` (v1), `jdsb06/lifestack-grpo-v3`, `jdsb06/lifestack-grpo-v4`.
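The v3 row in the timeline notes that a constant `reward_compact_fn` produced a bad logged mean. One plausible reading, sketched below under the standard GRPO formulation (advantage = reward minus group mean, divided by group standard deviation), is that a constant head shifts the logged mean while contributing nothing to the advantages. The reward values and the `group_advantages` helper are illustrative, and this is not confirmed as the actual failure mode.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantages: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of the same prompt.
base = [0.2, 0.8, 0.5, 0.5]
# The same rewards with a constant head that always returns -1.0.
shifted = [r - 1.0 for r in base]

adv_base = group_advantages(base)
adv_shifted = group_advantages(shifted)
mean_gap = mean(base) - mean(shifted)
```

The two advantage lists match up to floating-point error, while the logged mean drops by the full constant, which is why removing such a head changes the reported numbers more than it changes the learning signal.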

	---

	## Code map

	| File | What it does |
	|------|--------------|
	| `core/lifestack_env.py` | `LifeStackEnv`: reset, step, rollout, WorldEngine |
	| `core/life_state.py` | `LifeMetrics` (23 values), `DependencyGraph` (32 edges), `ResourceBudget` |
	| `core/reward.py` | `compute_task_reward`, `compute_reward`, 4 standalone scoring functions |
	| `core/task.py` | `Task`, `Route`, `Milestone`, `ExoEvent`, `FlightCrisisTask`, `CodeMergeCrisisTask` |
	| `core/verifier.py` | `LifeStackVerifier`: success/failure/milestone checks |
	| `core/cascade_utils.py` | `animate_cascade()`: frame-by-frame visualization of propagation |
	| `core/action_space.py` | `AgentAction`, `PrimaryAction`, `apply_action` |
	| `agent/agent.py` | `LifeStackAgent`: GRPO model + Groq fallback, prompt building |
	| `agent/memory.py` | `LifeStackMemory`: ChromaDB episodic memory, similarity retrieval |
	| `agent/conflict_generator.py` | `TaskGenerator` (8 domains), `ConflictEvent` templates |
	| `agent/conflict_predictor.py` | `ConflictPredictor`: pattern matching on episode history |
	| `agent/counterfactuals.py` | What-if reasoning over metric snapshots |
	| `intake/simperson.py` | `SimPerson`: Big Five personality, action uptake scaling |
	| `scripts/train_trl.py` | Full GRPO training: `LifeStackGRPOTrainer`, 5-stage curriculum |
	| `scripts/eval.py` | Random-policy baseline for reward floor measurement |
	| `app_flask.py` | Flask demo UI: port 7860, 10 tabs, Chart.js, vis-network |
	| `server.py` | Crash-safe OpenEnv server entry point: port 8000 |
	| `start.sh` | Docker CMD: starts both services |

	---

	## Known limitations

	`reward_timeout_check()` in `core/reward.py` has a known bug: it only fires when `step_count >= max_steps` and `done == False`. Because `done` is already `True` by the time the check runs, the function returns 0.0 in normal operation and never contributes a penalty. The timeout penalty is instead applied directly inside `compute_task_reward()`.
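The condition described above can be reproduced in a few lines. Both functions below are hypothetical stand-ins written for illustration: the real signature of `reward_timeout_check()`, the penalty value, and the `success` flag in the alternative are assumptions, not code from `core/reward.py`.

```python
def timeout_check_as_described(step_count, max_steps, done, penalty=-1.0):
    """Mirrors the documented condition: fires only if the episode is not done."""
    if step_count >= max_steps and not done:
        return penalty
    return 0.0


def timeout_check_on_success(step_count, max_steps, success, penalty=-1.0):
    """Hypothetical alternative: penalize hitting the step limit without success."""
    if step_count >= max_steps and not success:
        return penalty
    return 0.0


# When the step limit is reached the environment has already set done=True,
# so the documented condition can never be satisfied at that point.
at_limit = timeout_check_as_described(step_count=10, max_steps=10, done=True)
alternative = timeout_check_on_success(step_count=10, max_steps=10, success=False)
```

Keying the check on task success rather than on `done` is only one possible fix; applying the penalty directly in `compute_task_reward()`, as the project does, avoids the ordering problem entirely.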

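	The holdout evaluation draws its tasks from the same `TaskGenerator` used for training, so it measures performance on unseen samples rather than on unseen conflict types. This limits how confidently we can claim generalization to novel conflict types.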

	---

	## Related docs

	- [lifestack_env.md](lifestack_env.md) — environment reference
	- [reward.md](reward.md) — all reward functions
	- [train_trl.md](train_trl.md) — training reference
	- [DEPLOYMENT.md](DEPLOYMENT.md) — deployment guide
