LifeStack: implementation summary

Repo: https://github.com/oki-dokii/Meta-R2
HF Space: https://huggingface.co/spaces/jdsb06/meta-r2


What LifeStack is

A multi-domain life-management RL environment built on OpenEnv 0.2.3. It models a human life as 23 interdependent metrics across 6 domains with a 32-edge directed dependency graph. Actions are structured JSON: the trained model allocates time, money, and energy across competing priorities without collapsing any metric to zero.
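To make the dependency-graph idea concrete, here is a minimal sketch of how a change to one metric could cascade along directed, weighted edges. The metric names, edge weights, the 0 to 100 scale, and the `propagate` helper are all illustrative assumptions; they are not taken from the actual `DependencyGraph` in core/life_state.py.

```python
from collections import deque

# Hypothetical edges: (source, target) -> weight. The real graph has 32 edges.
EDGES = {
    ("sleep_quality", "energy"): 0.5,
    ("energy", "work_output"): 0.4,
    ("work_output", "income_stability"): 0.3,
}


def propagate(metrics, source, delta, min_effect=0.01):
    """Apply `delta` to `source`, then push attenuated effects downstream.

    Effects smaller than `min_effect` are dropped, which also guarantees
    termination on cyclic graphs as long as every weight is below 1.
    """
    updated = dict(metrics)  # leave the caller's dict untouched
    queue = deque([(source, delta)])
    while queue:
        node, change = queue.popleft()
        # Assumed metric scale: clamp every value to [0, 100].
        updated[node] = max(0.0, min(100.0, updated[node] + change))
        for (src, dst), weight in EDGES.items():
            effect = change * weight
            if src == node and abs(effect) >= min_effect:
                queue.append((dst, effect))
    return updated


if __name__ == "__main__":
    start = {"sleep_quality": 50, "energy": 50,
             "work_output": 50, "income_stability": 50}
    print(propagate(start, "sleep_quality", -20))
```

In this toy graph a 20-point drop in sleep quality shrinks at each hop, so downstream metrics move by progressively smaller amounts.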

Policy: Qwen2.5-1.5B-Instruct (QLoRA / bf16) + LoRA adapters via GRPO fine-tuning.

Output: {"action_type": ..., "target_domain": ..., "metric_changes": {...}, "resource_cost": {...}, "reasoning": "..."} or episodic {"actions": [...]}.
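Because the policy emits free text, the JSON object has to be pulled out of the completion before it can be scored. The sketch below shows one way to combine a greedy regex with `json.JSONDecoder.raw_decode`, the two ingredients named in the training timeline. How the repository actually combines them is not stated here, so the `extract_action` helper and its exact behavior are assumptions.

```python
import json
import re

_DECODER = json.JSONDecoder()


def extract_action(completion):
    """Return the first JSON object in a model completion, or None.

    Hypothetical helper, not the repository's implementation.
    """
    # Greedy regex: span from the first '{' to the last '}' drops any
    # prose or code-fence text around the JSON.
    match = re.search(r"\{.*\}", completion, flags=re.DOTALL)
    if match is None:
        return None
    candidate = match.group(0)
    # raw_decode parses one JSON value from the start of the string and
    # ignores whatever follows it, so trailing junk does not cause a failure.
    try:
        obj, _end = _DECODER.raw_decode(candidate)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


if __name__ == "__main__":
    text = 'Sure! {"action_type": "rest", "target_domain": "health"} Done {oops}'
    print(extract_action(text))
```

Returning `None` instead of raising keeps the parse failure visible to the caller, which can then assign a format penalty rather than crash the rollout.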


Training timeline

| Run | Key change | Outcome |
|---|---|---|
| v1 Run 1 | Initial; max_completion_length too large | Parse failures → mean −0.944 |
| v1 Run 2 | Shorter completions | Minimal learning → −0.266 |
| v1 Run 3 | JSON extraction fix (raw_decode + greedy regex) | First positive mean ≈ +0.023 |
| v1 Run 4 | Longer single-step run | Plateau ≈ −0.1 eval |
| v1 Run 5 | Episodic GRPO, horizon=3 | Best ≈ +0.734, curriculum 1→2 |
| v3 | Episodic curriculum 1→4 with reward_compact_fn | reward_compact_fn constant → bad logged mean |
| v4 | Removed compact head; weights [1, 0.5, 0.5, 2] on format/EOS/plausibility/return | Peak reward ≈ +0.856 |
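The v4 weights can be read as a weighted sum over four per-completion scores. The sketch below assumes exactly that (a plain weighted sum) and uses hypothetical head names and a hypothetical `combine_rewards` helper; the real combination lives in scripts/train_trl.py and may differ.

```python
# Weights from the v4 run, in the order format / EOS / plausibility / return.
WEIGHTS = {"format": 1.0, "eos": 0.5, "plausibility": 0.5, "return": 2.0}


def combine_rewards(scores):
    """Weighted sum of the four reward heads for one completion.

    `scores` maps each head name to its raw score. Names are illustrative.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing reward heads: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    example = {"format": 1.0, "eos": 1.0, "plausibility": 0.5, "return": 0.25}
    print(combine_rewards(example))  # 1.0 + 0.5 + 0.25 + 0.5 = 2.25
```

Under this reading, a head that returns the same value for every completion only shifts the sum by a constant: it moves the logged mean without helping to rank completions against each other.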

HF Hub: jdsb06/lifestack-grpo (v1), jdsb06/lifestack-grpo-v3, jdsb06/lifestack-grpo-v4.


Code map

| File | What it does |
|---|---|
| core/lifestack_env.py | LifeStackEnv: reset, step, rollout, WorldEngine |
| core/life_state.py | LifeMetrics (23 values), DependencyGraph (32 edges), ResourceBudget |
| core/reward.py | compute_task_reward, compute_reward, 4 standalone scoring functions |
| core/task.py | Task, Route, Milestone, ExoEvent, FlightCrisisTask, CodeMergeCrisisTask |
| core/verifier.py | LifeStackVerifier: success/failure/milestone checks |
| core/cascade_utils.py | animate_cascade(): frame-by-frame visualization of propagation |
| core/action_space.py | AgentAction, PrimaryAction, apply_action |
| agent/agent.py | LifeStackAgent: GRPO model + Groq fallback, prompt building |
| agent/memory.py | LifeStackMemory: ChromaDB episodic memory, similarity retrieval |
| agent/conflict_generator.py | TaskGenerator (8 domains), ConflictEvent templates |
| agent/conflict_predictor.py | ConflictPredictor: pattern matching on episode history |
| agent/counterfactuals.py | What-if reasoning over metric snapshots |
| intake/simperson.py | SimPerson: Big Five personality, action uptake scaling |
| scripts/train_trl.py | Full GRPO training: LifeStackGRPOTrainer, 5-stage curriculum |
| scripts/eval.py | Random-policy baseline for reward floor measurement |
| app_flask.py | Flask demo UI: port 7860, 10 tabs, Chart.js, vis-network |
| server.py | Crash-safe OpenEnv server entry point, port 8000 |
| start.sh | Docker CMD: starts both services |

Known limitations

reward_timeout_check() in core/reward.py has a known bug: it only fires when step_count >= max_steps AND done == False. Because done is already True by the time the check runs, the condition never holds and the function returns 0.0 in normal operation. The timeout penalty in compute_task_reward() is applied directly rather than through this function.
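The effect of that condition is easy to reproduce in isolation. The sketch below mirrors only the condition described above; the signature and the penalty value of -1.0 are assumptions, not the code in core/reward.py.

```python
def reward_timeout_check(step_count, max_steps, done, penalty=-1.0):
    """Illustrative stand-in: fires only when out of steps AND not done."""
    if step_count >= max_steps and not done:
        return penalty
    return 0.0


if __name__ == "__main__":
    # By the time the check runs, the episode has already been marked done,
    # so the penalty branch is unreachable in normal operation.
    print(reward_timeout_check(step_count=10, max_steps=10, done=True))   # 0.0
    # The branch only fires in the state that does not occur in practice.
    print(reward_timeout_check(step_count=10, max_steps=10, done=False))  # -1.0
```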

The holdout evaluation reuses the same TaskGenerator as training. This limits how confidently we can claim generalization to novel conflict types.


Related docs