Reward System Review vs. the Guide
What you have
In core/reward.py: One composite reward function (compute_task_reward) that blends 7 weighted components into a single float:
| Component | Weight | Function |
|---|---|---|
| local metric delta | 5% | compute_reward |
| milestone | 35% | compute_milestone_reward |
| task completion | 25% | compute_task_completion_reward |
| replanning | 10% | compute_replan_bonus |
| resource efficiency | 5% | - |
| reasoning coherence | 10% | reward_reasoning_coherence |
| format compliance | 10% | reward_format_compliance |
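As a rough illustration of how the seven weights in the table combine, here is a minimal sketch. The weights come from the table; the component key names and the assumption that each component score is already on a comparable scale are hypothetical, and the real `compute_task_reward` may differ.

```python
# Weights copied from the table above; the dictionary keys are hypothetical labels.
WEIGHTS = {
    "local_metric_delta": 0.05,
    "milestone": 0.35,
    "task_completion": 0.25,
    "replanning": 0.10,
    "resource_efficiency": 0.05,
    "reasoning_coherence": 0.10,
    "format_compliance": 0.10,
}


def blend_components(components: dict) -> float:
    """Weighted sum of component scores; a missing component counts as 0.0."""
    return sum(weight * components.get(name, 0.0) for name, weight in WEIGHTS.items())


def weighted_breakdown(components: dict) -> dict:
    """Per-component contribution, useful for logging individual columns."""
    return {name: weight * components.get(name, 0.0) for name, weight in WEIGHTS.items()}
```

Returning the breakdown alongside the blended float keeps the individual signals visible even when the environment itself only consumes one number.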
In train_trl.py: 6 separate functions passed to reward_funcs=[] for GRPO:
reward_format_fn, reward_plausibility_fn, reward_task_success_fn, reward_milestone_fn, reward_reasoning_fn, reward_human_feedback_fn
Where you follow the guide
- 6 separate GRPO reward functions: matches the guide's "multiple independent reward functions" recommendation
- Format compliance (`reward_format_compliance`): the guide explicitly lists format compliance
- Timeout penalty (`reward_timeout_check`): the guide says "penalize timeouts"
- Plausibility anti-cheat (`reward_plausibility_check`): catches zero-cost metric hacks (guide: "anti-cheating checks")
- Reasoning coherence: the guide recommends process-aware feedback
- Resource lockout (`lifestack_env.py:431-439`): resource deduction happens before metric changes, with `metric_changes = {}` if the budget is depleted. Good explicit lockdown.
- `CRITICAL_FLOOR_VIOLATION`, `INACTION_PENALTY`, `CASCADE_COLLAPSE` penalties
- Curriculum learning in `train.py` and `train_trl.py`: matches guide section 6
- Component-level logging (`train_trl.py:274-277`): guide section 15 says to watch individual reward columns, not just total reward
Where you don't fully follow the guide (Fixed)
- The 6 GRPO functions are NOT truly independent: they share one environment call
  - Fix applied: Decoupled `reward_format_fn` by explicitly checking JSON format using `core.reward.reward_format_compliance()`, making it fully independent.
- `_REWARD_CACHE` is a global mutable dict, a guide-listed hacking vector
  - Fix applied: Added a size cap of `1000` cache entries to mitigate this vector.
- `reward_human_feedback_fn` silently goes neutral when ChromaDB is unavailable
  - Fix applied: Logs a warning and returns `-0.01` (a small penalty) instead of `0.0`.
- No execution sandboxing
  - Fix applied: Added an `allowed_keys` whitelist in `lifestack_env.step()` constructed from `current_metrics.flatten().keys()`.
- Step-level reward (`compute_task_reward`) is still one blended number for the env itself
  - (For future consideration/rewrite)
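The whitelist fix above can be sketched as follows. This is an illustration under assumptions, not the code in `lifestack_env.step()`: `flatten_keys` is a hypothetical stand-in for `current_metrics.flatten().keys()`, and the nested-dict metric layout and dotted key paths are assumed.

```python
def flatten_keys(metrics: dict, prefix: str = "") -> set:
    """Collect dotted paths for every leaf in a nested metrics dict (assumed layout)."""
    keys = set()
    for name, value in metrics.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            keys |= flatten_keys(value, path)
        else:
            keys.add(path)
    return keys


def filter_metric_changes(metric_changes: dict, current_metrics: dict):
    """Keep only changes that target known metric paths; report the rest."""
    allowed_keys = flatten_keys(current_metrics)
    accepted = {k: v for k, v in metric_changes.items() if k in allowed_keys}
    rejected = sorted(set(metric_changes) - allowed_keys)
    return accepted, rejected
```

Returning the rejected keys separately makes it possible to log or penalize injection attempts instead of dropping them silently.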
Quick priority fixes
| Priority | Fix | Guide reference | Status |
|---|---|---|---|
| High | Add a TTL or size cap to _REWARD_CACHE (or disable it) | Section 8: "caching results" | Fixed |
| High | Add a metric key whitelist in lifestack_env.step() so the model can't inject arbitrary paths | Section 8: "Lock down execution" | Fixed |
| Medium | Make at least 1-2 GRPO functions truly independent (e.g., reward_format_fn can parse JSON without calling get_lifestack_evaluation) | Section 7: "multiple independent checks" | Fixed |
| Low | Log a warning or small penalty when reward_human_feedback_fn falls back to 0.0 | Section 15: monitor individual columns | Fixed |
The biggest structural win is decoupling reward_format_fn from the shared env call: it can check JSON validity entirely on its own, making it genuinely independent from the environment's result.
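A minimal sketch of what an environment-free format check could look like. The function name `format_only_reward`, the required keys, and the score values are assumptions for illustration; the `completions, **kwargs` shape follows the usual pattern for custom GRPO reward functions (one float per completion), but the actual `reward_format_fn` may be structured differently.

```python
import json

# Assumed schema fields; the real action schema may require more keys.
REQUIRED_KEYS = {"action_type", "reasoning"}


def format_only_reward(completions, **kwargs):
    """Score each completion on JSON validity alone, with no environment call."""
    scores = []
    for text in completions:
        try:
            parsed = json.loads(text)
        except (json.JSONDecodeError, TypeError):
            scores.append(-1.0)  # not parseable as JSON at all
            continue
        if isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed):
            scores.append(1.0)   # valid JSON object with the expected fields
        else:
            scores.append(-0.5)  # valid JSON, but the wrong shape
    return scores
```

Because nothing here touches the simulation, a bug in the environment cannot inflate or deflate this signal.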
Secondary Bug Fixes (Fixed)
- Bug 1: `reward_plausibility_fn` inverted/broken output
  - Fix applied: Extracted the parsed completion and invoked `reward_plausibility_check` directly to retrieve the true continuous penalty score (e.g., `-0.1`, `-0.3`) instead of returning a binary `1.0`/`-1.0`.
- Bug 2: `reward_task_success_fn` double-dipping components
  - Fix applied: Narrowed the function to retrieve just the `.get("completion", 0.0)` score from the breakdown, avoiding re-summing milestone, format, and reasoning.
- Bug 3: `reward_reasoning_fn` output range is noise
  - Fix applied: Added a `* 10.0` scalar to stretch the `[-0.10, 0.10]` range to `[-1.0, 1.0]`, so its scale is comparable to the other reward functions and it contributes a usable gradient signal.
- Bug 4: Task reconstruction was non-deterministic
  - Fix applied: Injected a sampled `seed` into `<SYSTEM_METADATA>` and set `random.seed()` around `TaskGenerator.generate()` in the evaluation function. Now the environment evaluates against the exact same routes and milestones the prompt originally described.
- Bug 5: `reward_human_feedback_fn` DB query exploit
  - Fix applied: Switched the ChromaDB lookup to query against the `prompt` string instead of `action.reasoning`. The agent can no longer manipulate the query text to retrieve high scores.
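The seeding fix for Bug 4 can be illustrated with a small sketch. `generate_task` is a hypothetical stand-in for `TaskGenerator.generate()`, and the route names are made up; the point is only that seeding the global RNG around generation reproduces the same task, and restoring the RNG state afterwards avoids side effects on the rest of training.

```python
import random
from contextlib import contextmanager


@contextmanager
def seeded(seed: int):
    """Seed the global RNG for the duration of the block, then restore its state."""
    state = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        random.setstate(state)


def generate_task() -> dict:
    """Hypothetical stand-in for TaskGenerator.generate()."""
    routes = ["negotiate", "sell_asset", "delegate", "defer"]
    return {"routes": random.sample(routes, k=2), "milestones": random.randint(2, 5)}


def rebuild_task(seed: int) -> dict:
    """Rebuild the task that a prompt with this seed originally described."""
    with seeded(seed):
        return generate_task()
```

Calling `rebuild_task` twice with the seed stored in the prompt metadata yields identical routes and milestones.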
Critical Bug Fixes (Fixed)
- Critical Bug 1: Milestone and Completion rewards were dead
  - Fix applied: Populated `success_conditions` for all task domains in `TaskGenerator`.
  - Fix applied: Exposed `viable_routes` in the GRPO prompt so the model knows which IDs to target.
  - Fix applied: Added `execute` to the allowed `action_type` list and updated schema instructions.
Final Structural Hardening (Fixed)
- Critical Bug 3: CodeMergeCrisisTask() was a stub
  - Fix applied: Fully implemented the `CodeMergeCrisisTask` in `core/task.py` with real disruptions and routes.
  - Fix applied: Seeded `mutable_world` and `visible_world` baseline disruptions into ALL domain generators in `TaskGenerator`. No more "phantom crises."
Reward Signal Activations (Fixed)
- Critical Bug 4: replan_bonus was always 0.0
  - Fix applied: Modified `generate_dataset` to sample tasks at steps 0, 2, and 4 instead of only step 0.
  - Fix applied: Capture and display `EXOGENOUS EVENTS ENCOUNTERED` in the prompt context.
  - Fix applied: Synchronized `get_lifestack_evaluation` to fast-forward the environment to the corresponding step before scoring.
Anti-Hacking Hardening (Fixed)
- Critical Bug 5: _REWARD_CACHE contradicted anti-hacking rules
  - Fix applied: Completely removed `_REWARD_CACHE` from `scripts/train_trl.py`. Every reward call now triggers a fresh environment execution.
  - Fix applied: Eliminated a potential memory leak from the unbounded global dictionary.
Ecosystem Integration & Realism (Fixed)
- Bug 4 (Secondary): drift() was hardcoded to career.satisfaction
  - Fix applied: Implemented personality-to-metric mapping in `intake/simperson.py`. Neuroticism now impacts Stress, Conscientiousness impacts Admin Overhead, etc.
- Model Integration: Qwen trained model never used in demo
  - Fix applied: Updated `LifeStackAgent` in `agent/agent.py` to check for `./lifestack_model`. If found, it loads the GRPO-trained policy via Transformers/Unsloth for all demos and episode runs.
  - Fix applied: Documented model switching via the `LIFESTACK_MODEL_PATH` env var.
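The model-switching behaviour described above might be resolved along these lines. This is a sketch under assumptions: `resolve_model_path` is a hypothetical helper, and the precedence (env var first, then the `./lifestack_model` default) is a guess at how the two mechanisms combine; the actual loading in `LifeStackAgent` is not shown.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_MODEL_DIR = "./lifestack_model"


def resolve_model_path(env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the trained-model directory if it exists, else None (use the fallback agent)."""
    env = os.environ if env is None else env
    # Assumed precedence: LIFESTACK_MODEL_PATH overrides the default directory.
    candidate = Path(env.get("LIFESTACK_MODEL_PATH", DEFAULT_MODEL_DIR))
    return candidate if candidate.is_dir() else None
```

Returning `None` rather than raising lets the caller fall back to whatever policy the demo used before the trained model existed.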
Technical Debt & Memory Hardening (Fixed)
- Bug 8: query_texts vs query_embeddings in ChromaDB
  - Fix applied: Switched all memory retrieval to use `memo._embed_text()` explicitly and `query_embeddings` in ChromaDB to ensure semantic consistency.
- Bug 10: hardcoded disruption_baseline=2
  - Fix applied: Updated `compute_reward` to accept an optional `disruption_baseline`. `compute_task_reward` now passes `len(task.mutable_world)` from metadata, ensuring the "cascade spread" penalty scales with the actual complexity of the crisis.
- Bug 11: store_decision drops negative examples
  - Fix applied: Removed reward thresholds (`<0.5` and `<2.0`) from `LifeStackMemory.store_decision` and `store_trajectory`. The system now captures the full longitudinal record, filtering for "successful" examples only at retrieval time for few-shot prompting.
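The store-everything, filter-at-retrieval pattern from Bug 11 can be sketched with a plain in-memory list. `DecisionMemory` is a hypothetical simplification of `LifeStackMemory` (no ChromaDB, no embeddings), and the `min_reward=0.5` default is an assumption that mirrors the threshold previously applied at write time.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionMemory:
    """Hypothetical in-memory stand-in for LifeStackMemory."""
    records: list = field(default_factory=list)

    def store_decision(self, prompt: str, action: str, reward: float) -> None:
        # No reward threshold here: negative examples are kept too.
        self.records.append({"prompt": prompt, "action": action, "reward": reward})

    def retrieve_successes(self, min_reward: float = 0.5, limit: int = 3) -> list:
        """Filter only when building few-shot examples, best rewards first."""
        good = [r for r in self.records if r["reward"] >= min_reward]
        return sorted(good, key=lambda r: r["reward"], reverse=True)[:limit]
```

Keeping the low-reward records means later analysis can still see what failed, while few-shot prompts draw only on the successful ones.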
Final Policy Refinement (Fixed)
- Success Termination Logic: Resolved the "Mutually Exclusive Route" blocker.
  - Fix applied: Changed `is_success` verification from `all()` to `any()` in `core/lifestack_env.py`. This ensures that episodes terminate correctly when one of the valid task goals is met, preventing the agent from being penalized for not achieving impossible combinations of exclusive routes.
- Explicit Replan Signal: Promoted Replan Bonus to a primary training objective.
  - Fix applied: Implemented a dedicated `reward_replan_fn` in `scripts/train_trl.py`. By exposing this as a standalone GRPO reward function, the model now receives a direct gradient for "recovering" (achieving milestones) specifically after exogenous events, rather than it being absorbed into general task success.
GRPO Independence & Judge Separation
- Decoupled Reward Signals:
  - Architecture update: The GRPO training pipeline no longer relies on a single environment evaluation for all rewards.
  - Static Judges: `reward_format_fn`, `reward_plausibility_fn`, and `reward_reasoning_fn` now operate through direct JSON parsing and independent semantic verification. They provide gradients for "logical integrity" without needing the simulation engine.
  - Empirical Judges: `reward_task_success_fn` and `reward_milestone_fn` remain tied to the `LifeStackEnv` simulation. They provide gradients for "causal outcome", ensuring the agent's logic actually works in the simulated world.
  - Outcome: This prevents "signal contamination", where an environment bug or a single gameable path could inflate all reward components simultaneously.
Success Logic Reconciliation
- Alignment of Win States:
  - Fix applied: Updated `compute_task_completion_reward` in `core/reward.py` to use `any()` logic.
  - Reasoning: This reconciles the reward system with the environment's early termination logic. In crises with multiple resolution paths (e.g., selling an asset vs. negotiating a payment plan), the agent now receives full completion credit (1.0) for reaching any valid goal state, where previously it was capped at partial credit.
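To make the difference concrete, here is a small sketch contrasting `any()`-based completion with a fraction-of-conditions score. Both functions are hypothetical: `completion_fraction` is only one plausible form the earlier partial-credit behaviour could have taken, and representing success conditions as callables over a state dict is an assumption.

```python
def completion_any(success_conditions, state) -> float:
    """Full credit as soon as any one valid goal state is reached."""
    if not success_conditions:
        return 0.0
    return 1.0 if any(cond(state) for cond in success_conditions) else 0.0


def completion_fraction(success_conditions, state) -> float:
    """Assumed earlier behaviour: credit proportional to conditions met."""
    if not success_conditions:
        return 0.0
    met = sum(1 for cond in success_conditions if cond(state))
    return met / len(success_conditions)


# Two mutually exclusive resolution paths, as in the example above.
conditions = [lambda s: s["asset_sold"], lambda s: s["payment_plan"]]
state = {"asset_sold": True, "payment_plan": False}
```

With mutually exclusive routes, the fractional version can never reach 1.0, which is the mismatch with early termination that the fix removes.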