| # Reward System Review vs. the Guide |
|
|
| ## What you have |
|
|
| In `core/reward.py`: one composite reward function (`compute_task_reward`) that blends 7 weighted components into a single float: |
|
|
| | Component | Weight | Function | |
| |-----------------------|--------|--------------------------------| |
| | local metric delta | 5% | compute_reward | |
| | milestone | 35% | compute_milestone_reward | |
| | task completion | 25% | compute_task_completion_reward | |
| | replanning | 10% | compute_replan_bonus | |
| | resource efficiency | 5% | - | |
| | reasoning coherence | 10% | reward_reasoning_coherence | |
| | format compliance | 10% | reward_format_compliance | |
|
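The blend in the table can be pictured as a plain weighted sum. The sketch below is only an illustration under that assumption: the weights are copied from the table, but `blend_components` and the dictionary keys are hypothetical names, not the internals of `compute_task_reward`.

```python
# Weights copied from the table above; they sum to 1.0.
# The key names are illustrative, not identifiers from core/reward.py.
WEIGHTS = {
    "local_metric_delta": 0.05,
    "milestone": 0.35,
    "task_completion": 0.25,
    "replanning": 0.10,
    "resource_efficiency": 0.05,
    "reasoning_coherence": 0.10,
    "format_compliance": 0.10,
}


def blend_components(scores: dict[str, float]) -> float:
    """Weighted sum of per-component scores; missing components count as 0.0."""
    unknown = set(scores) - set(WEIGHTS)
    if unknown:
        raise KeyError(f"unexpected reward components: {sorted(unknown)}")
    return sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)
```

Because milestone and task completion together carry 60% of the weight, a change in the smaller components is hard to spot in the blended float, which is why the per-component view matters for debugging.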
|
| In `train_trl.py`: 6 separate functions passed to `reward_funcs=[]` for GRPO: |
| `reward_format_fn`, `reward_plausibility_fn`, `reward_task_success_fn`, `reward_milestone_fn`, `reward_reasoning_fn`, `reward_human_feedback_fn` |
|
|
| --- |
|
|
| ## Where you follow the guide ✅ |
|
|
| - 6 separate GRPO reward functions: matches the guide's "multiple independent reward functions" recommendation |
| - Format compliance (`reward_format_compliance`): the guide explicitly lists format compliance |
| - Timeout penalty (`reward_timeout_check`): the guide says "penalize timeouts" |
| - Plausibility anti-cheat (`reward_plausibility_check`): catches zero-cost metric hacks (guide: "anti-cheating checks") |
| - Reasoning coherence: the guide recommends process-aware feedback |
| - Resource lockout (`lifestack_env.py:431-439`): resource deduction happens before metric changes, with `metric_changes = {}` if the budget is depleted. Good explicit lockdown. |
| - `CRITICAL_FLOOR_VIOLATION`, `INACTION_PENALTY`, `CASCADE_COLLAPSE` penalties |
| - Curriculum learning in `train.py` and `train_trl.py`: matches guide section 6 |
| - Component-level logging (`train_trl.py:274-277`): guide section 15 says to watch individual reward columns, not just total reward |
|
|
| --- |
|
|
| ## Where you don't fully follow the guide ❌ (Fixed ✅) |
|
|
| 1. **The 6 GRPO functions are NOT truly independent β they share one environment call** |
| - *Fix applied*: Decoupled `reward_format_fn` by explicitly checking JSON format using `core.reward.reward_format_compliance()`, making it fully independent. |
|
|
| 2. **`_REWARD_CACHE` is a global mutable dict β a guide-listed hacking vector** |
| - *Fix applied*: Added a size cap of `1000` cache entries to mitigate this vector. The cache was later removed entirely (see "Anti-Hacking Hardening"). |
|
|
| 3. **`reward_human_feedback_fn` silently goes neutral when ChromaDB is unavailable** |
| - *Fix applied*: Logs a warning and returns `-0.01` (a small penalty) instead of `0.0`. |
| |
| 4. **No execution sandboxing** |
| - *Fix applied*: Added an `allowed_keys` whitelist in `lifestack_env.step()` constructed from `current_metrics.flatten().keys()`. |
| |
| 5. **Step-level reward (`compute_task_reward`) is still one blended number for the env itself** |
| - (For future consideration/rewrite) |
|
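The whitelist from item 4 can be sketched as a simple key filter. This is a minimal illustration, not the code in `lifestack_env.step()`: `filter_metric_changes` is a hypothetical helper, and the `finances.liquidity` key and the sample values are made up for the example.

```python
def filter_metric_changes(
    metric_changes: dict[str, float],
    allowed_keys: set[str],
) -> tuple[dict[str, float], list[str]]:
    """Keep only changes whose key is whitelisted; report the rest."""
    accepted = {k: v for k, v in metric_changes.items() if k in allowed_keys}
    rejected = sorted(k for k in metric_changes if k not in allowed_keys)
    return accepted, rejected


# Hypothetical flattened metrics, standing in for current_metrics.flatten().
current = {"career.satisfaction": 0.6, "finances.liquidity": 0.4}

accepted, rejected = filter_metric_changes(
    {"career.satisfaction": 0.1, "__class__.reward": 99.0},
    allowed_keys=set(current.keys()),
)
```

Returning the rejected keys alongside the accepted ones makes it easy to log injection attempts or penalize them, rather than dropping them silently.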
|
| --- |
|
|
| ## Quick priority fixes |
|
|
| Priority | Fix | Guide reference | Status | |
|----------|-----|-----------------|--------| |
| High | Add a TTL or size cap to `_REWARD_CACHE` (or disable it) | Section 8: "caching results" | ✅ Fixed | |
| High | Add a metric key whitelist in `lifestack_env.step()` so the model can't inject arbitrary paths | Section 8: "Lock down execution" | ✅ Fixed | |
| Medium | Make at least 1-2 GRPO functions truly independent (e.g., `reward_format_fn` can parse JSON without calling `get_lifestack_evaluation`) | Section 7: "multiple independent checks" | ✅ Fixed | |
| Low | Log a warning or apply a small penalty when `reward_human_feedback_fn` falls back to 0.0 | Section 15: monitor individual columns | ✅ Fixed | |
|
|
| *The biggest structural win is decoupling `reward_format_fn` from the shared env call: it can check JSON validity entirely on its own, making it genuinely independent of the environment's result.* |
|
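A format reward that never touches the environment might look like the sketch below. It assumes completions arrive as plain strings and that the function returns one float per completion. The required keys and the reward values are placeholders, not the schema or scores used by `reward_format_compliance`.

```python
import json

# Assumed schema for illustration only; the real action schema may differ.
REQUIRED_KEYS = ("action_type", "reasoning")


def reward_format_sketch(completions: list[str], **kwargs) -> list[float]:
    """Score each completion on JSON validity alone, with no env call."""
    rewards = []
    for text in completions:
        try:
            parsed = json.loads(text)
        except (json.JSONDecodeError, TypeError):
            rewards.append(-1.0)  # not parseable as JSON at all
            continue
        has_keys = isinstance(parsed, dict) and all(k in parsed for k in REQUIRED_KEYS)
        rewards.append(1.0 if has_keys else -0.5)  # valid JSON, wrong shape
    return rewards
```

Since nothing here depends on simulation state, an environment bug cannot move this column, which is the property the guide's "multiple independent checks" recommendation is after.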
|
| --- |
|
|
| ## Secondary Bug Fixes ❌ -> ✅ |
|
|
| 1. **Bug 1: `reward_plausibility_fn` inverted/broken output** |
| - *Fix applied*: Extracted the parsed completion and called `reward_plausibility_check` directly to retrieve the true continuous penalty score (e.g., `-0.1`, `-0.3`) instead of returning a binary `1.0`/`-1.0`. |
|
|
| 2. **Bug 2: `reward_task_success_fn` double-dipping components** |
| - *Fix applied*: Narrowed the function to retrieve just the `.get("completion", 0.0)` score from the breakdown, avoiding re-summing milestone, format, and reasoning. |
| |
| 3. **Bug 3: `reward_reasoning_fn` output range is noise** |
| - *Fix applied*: Added a `* 10.0` multiplier that rescales the `[-0.10, 0.10]` range to `[-1.0, 1.0]`, so its variance is comparable to the other reward functions and it contributes a usable training signal. |
| |
| 4. **Bug 4: Task reconstruction was non-deterministic** |
| - *Fix applied*: Injected a sampled `seed` into `<SYSTEM_METADATA>` and set `random.seed()` around `TaskGenerator.generate()` in the evaluation function. Now the environment evaluates against the exact same routes and milestones the prompt originally described. |
| |
| 5. **Bug 5: `reward_human_feedback_fn` DB query exploit** |
| - *Fix applied*: Switched the ChromaDB lookup to query against the `prompt` string instead of `action.reasoning`. The agent can no longer manipulate the query text to retrieve high scores. |
|
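The determinism fix in Bug 4 amounts to carrying a seed in the prompt and regenerating the task from it. The sketch below assumes the metadata holds a `seed=<int>` field, which is a guess at the format. `generate_task_sketch` is a stand-in for `TaskGenerator.generate()`, and it uses a local `random.Random` instance rather than the global `random.seed()` call described above.

```python
import random
import re
from typing import Optional

# Assumed metadata layout: <SYSTEM_METADATA>... seed=123 ...</SYSTEM_METADATA>
SEED_PATTERN = re.compile(
    r"<SYSTEM_METADATA>.*?seed=(\d+).*?</SYSTEM_METADATA>", re.DOTALL
)


def extract_seed(prompt: str) -> Optional[int]:
    """Pull the sampled seed back out of the prompt, if present."""
    match = SEED_PATTERN.search(prompt)
    return int(match.group(1)) if match else None


def generate_task_sketch(seed: int) -> dict:
    """Stand-in task generator: all randomness flows from one seed."""
    rng = random.Random(seed)  # local RNG, leaves global random state untouched
    route_ids = sorted(rng.sample(range(100), k=3))
    return {"seed": seed, "route_ids": route_ids}
```

A local generator avoids perturbing any other code that relies on the global `random` state during training. Seeding the global module works too, as long as nothing else draws from it in between.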
|
| --- |
|
|
| ## Critical Bug Fixes ❌ -> ✅ |
|
|
| 1. **Critical Bug 1: Milestone and Completion rewards were dead** |
| - *Fix applied*: Populated `success_conditions` for all task domains in `TaskGenerator`. |
| - *Fix applied*: Exposed `viable_routes` in the GRPO prompt so the model knows which IDs to target. |
| - *Fix applied*: Added `execute` to the allowed `action_type` list and updated schema instructions. |
|
|
| --- |
|
|
| ## Final Structural Hardening ❌ -> ✅ |
|
|
| 1. **Critical Bug 3: CodeMergeCrisisTask() was a stub** |
| - *Fix applied*: Fully implemented the `CodeMergeCrisisTask` in `core/task.py` with real disruptions and routes. |
| - *Fix applied*: Seeded `mutable_world` and `visible_world` baseline disruptions into ALL domain generators in `TaskGenerator`. No more "phantom crises." |
|
|
| --- |
|
|
| ## Reward Signal Activations ❌ -> ✅ |
|
|
| 1. **Critical Bug 4: replan_bonus was always 0.0** |
| - *Fix applied*: Modified `generate_dataset` to sample tasks at steps 0, 2, and 4 instead of only step 0. |
| - *Fix applied*: Captured and displayed `EXOGENOUS EVENTS ENCOUNTERED` in the prompt context. |
| - *Fix applied*: Synchronized `get_lifestack_evaluation` to fast-forward the environment to the corresponding step before scoring. |
| |
| --- |
| |
| ## Anti-Hacking Hardening ❌ -> ✅ |
| |
| 1. **Critical Bug 5: _REWARD_CACHE contradicted anti-hacking rules** |
| - *Fix applied*: Completely removed `_REWARD_CACHE` from `scripts/train_trl.py`. Every reward call now triggers a fresh environment execution. |
| - *Fix applied*: Eliminated potential memory leak from unbounded global dictionary. |
|
|
| --- |
|
|
| ## Ecosystem Integration & Realism ❌ -> ✅ |
|
|
| 1. **Bug 4 (Secondary): drift() was hardcoded to career.satisfaction** |
| - *Fix applied*: Implemented personality-to-metric mapping in `intake/simperson.py`. Neuroticism now impacts Stress, Conscientiousness impacts Admin Overhead, etc. |
| |
| 2. **Model Integration: Qwen trained model never used in demo** |
| - *Fix applied*: Updated `LifeStackAgent` in `agent/agent.py` to check for `./lifestack_model`. If found, it loads the GRPO-trained policy via Transformers/Unsloth for all demos and episode runs. |
| - *Fix applied*: Documented model switching via `LIFESTACK_MODEL_PATH` env var. |
|
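The model lookup can be sketched as a small path-resolution helper. This assumes `LIFESTACK_MODEL_PATH` overrides the default `./lifestack_model` directory and that a missing directory means "fall back to the non-trained agent". Both the precedence and the helper name `resolve_model_path` are assumptions, and the actual loading via Transformers/Unsloth is left out.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL_DIR = "./lifestack_model"


def resolve_model_path(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the trained-model directory to load, or None if there is none."""
    env = os.environ if env is None else env
    # Assumed precedence: the env var wins over the default directory.
    candidate = env.get("LIFESTACK_MODEL_PATH", DEFAULT_MODEL_DIR)
    return candidate if os.path.isdir(candidate) else None
```

Passing the environment as a parameter keeps the helper easy to test without mutating `os.environ`.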
|
| --- |
|
|
| ## Technical Debt & Memory Hardening ❌ -> ✅ |
|
|
| 1. **Bug 8: query_texts vs query_embeddings in ChromaDB** |
| - *Fix applied*: Switched all memory retrieval to use `memo._embed_text()` explicitly and `query_embeddings` in ChromaDB to ensure semantic consistency. |
| |
| 2. **Bug 10: hardcoded disruption_baseline=2** |
| - *Fix applied*: Updated `compute_reward` to accept an optional `disruption_baseline`. `compute_task_reward` now passes `len(task.mutable_world)` from metadata, ensuring the "cascade spread" penalty scales with the actual complexity of the crisis. |
| |
| 3. **Bug 11: store_decision drops negative examples** |
| - *Fix applied*: Removed reward thresholds (`<0.5` and `<2.0`) from `LifeStackMemory.store_decision` and `store_trajectory`. The system now captures the full longitudinal record and filters for "successful" examples only at retrieval time, when building few-shot prompts. |
| |
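The Bug 11 change, storing everything and filtering only when reading, can be illustrated without ChromaDB. The class below is a simplified in-memory stand-in, not `LifeStackMemory`: the method names mirror the ones mentioned above, but the record layout, `retrieve_few_shot`, and the `min_reward` default are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionMemorySketch:
    """In-memory stand-in: write everything, filter only on read."""

    records: list[dict] = field(default_factory=list)

    def store_decision(self, prompt: str, action: str, reward: float) -> None:
        # No reward threshold at write time, so negative examples are kept.
        self.records.append({"prompt": prompt, "action": action, "reward": reward})

    def retrieve_few_shot(self, min_reward: float = 0.5, k: int = 3) -> list[dict]:
        # The "successful only" filter is applied here, at retrieval time.
        good = [r for r in self.records if r["reward"] >= min_reward]
        return sorted(good, key=lambda r: r["reward"], reverse=True)[:k]
```

Keeping the failures in storage means they remain available for later analysis or contrastive prompting, while few-shot retrieval still sees only the high-reward examples.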
| --- |
| |
| ## Final Policy Refinement ❌ -> ✅ |
| |
| 1. **Success Termination Logic**: Resolved the "Mutually Exclusive Route" blocker. |
| - *Fix applied*: Changed `is_success` verification from `all()` to `any()` in `core/lifestack_env.py`. This ensures that episodes terminate correctly when one of the valid task goals is met, preventing the agent from being penalized for not achieving impossible combinations of exclusive routes. |
| |
| 2. **Explicit Replan Signal**: Promoted Replan Bonus to a primary training objective. |
| - *Fix applied*: Implemented a dedicated `reward_replan_fn` in `scripts/train_trl.py`. By exposing this as a standalone GRPO reward function, the model now receives a direct gradient for "recovering" (achieving milestones) specifically after exogenous events, rather than it being absorbed into general task success. |
|
|
| --- |
|
|
| ## GRPO Independence & Judge Separation ✅ |
|
|
| 1. **Decoupled Reward Signals**: |
| - *Architecture update*: The GRPO training pipeline no longer relies on a single environment evaluation for all rewards. |
| - **Static Judges**: `reward_format_fn`, `reward_plausibility_fn`, and `reward_reasoning_fn` now operate through direct JSON parsing and independent semantic verification. They provide gradients for "logical integrity" without needing the simulation engine. |
| - **Empirical Judges**: `reward_task_success_fn` and `reward_milestone_fn` remain tied to the `LifeStackEnv` simulation. They provide gradients for "causal outcome", ensuring the agent's logic actually works in the simulated world. |
| - **Outcome**: This prevents "signal contamination", where an environment bug or a single gameable path could inflate all reward components simultaneously. |
|
|
| --- |
|
|
| ## Success Logic Reconciliation ✅ |
|
|
| 1. **Alignment of Win States**: |
| - *Fix applied*: Updated `compute_task_completion_reward` in `core/reward.py` to use `any()` logic. |
| - **Reasoning**: This reconciles the reward system with the environment's early-termination logic. In crises with multiple resolution paths (e.g., selling an asset vs. negotiating a payment plan), the agent now receives full completion credit (1.0) for reaching any valid goal state; previously it was capped at partial credit. |
|
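The difference between the two success checks is easiest to see with mutually exclusive routes. The sketch below assumes success conditions reduce to a list of booleans, one per route. The function names are illustrative and do not reproduce `is_success` or `compute_task_completion_reward`.

```python
def is_success_all(conditions_met: list[bool]) -> bool:
    """Strict check: every route's condition must hold."""
    return bool(conditions_met) and all(conditions_met)


def is_success_any(conditions_met: list[bool]) -> bool:
    """Relaxed check: one satisfied route is enough."""
    return any(conditions_met)


def completion_reward_sketch(conditions_met: list[bool]) -> float:
    """Full credit (1.0) for reaching any valid goal state, else 0.0."""
    return 1.0 if is_success_any(conditions_met) else 0.0


# Two mutually exclusive routes, e.g. "sell asset" vs. "negotiate payment plan";
# the agent completed only the second one.
routes = [False, True]
```

With `all()`, the `routes` example can never succeed because the two conditions exclude each other. With `any()`, termination and completion credit agree on the same win state.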
|