| # Hardening Audit & Implementation Plan |
|
|
| Now I have the full picture. Let me give you a complete, honest audit and plan. |
|
|
| --- |
|
|
| ## Part 1: How many reward functions do we have? |
|
|
| Technically 6 functions in `core/reward.py`, but the reality is worse than it looks: |
|
|
| | Function | What it does | Actually wired? | |
| | :--- | :--- | :--- | |
| | `compute_reward()` | 4 components + 4 penalties (local step) | β
called in `lifestack_env.py:step()` | |
| | `compute_milestone_reward()` | milestones hit / total possible | β defined, never called in env | |
| | `compute_task_completion_reward()` | success conditions met | β defined, never called in env | |
| | `compute_replan_bonus()` | recovery after exo-events | β defined, never called in env | |
| | `compute_dead_end_penalty()` | no routes remaining | β defined, never called in env | |
| | `compute_task_reward()` | orchestrator combining all above | β defined, `env.step()` still calls only `compute_reward()` | |
|
|
| **So in practice: 1 reward function is active. 5 are dead code.** |
|
|
| --- |
|
|
| ## Part 2: Gap vs. hackathon guide |
|
|
| The guide explicitly says (Β§7, Β§8, Β§21): |
| > "Use multiple independent reward functions. If you only have one, it's easier to hack. Multiple independent checks reduce that risk." |
| > "Common mistake: using only one reward function" |
|
|
| ### Full Gap Analysis: |
|
|
| | Guide Requirement | Our Status | Implementation Detail | |
| | :--- | :--- | :--- | |
| | **Execution success** (task completed?) | β Missing | `compute_task_completion_reward` exists but unwired | |
| | **Correctness** (metrics actually improved?) | β
Active | `outcome_score` in `compute_reward` | |
| | **Format compliance** (valid JSON?) | β Missing | Completely missing in previous version | |
| | **Timeouts** (step limit hit penalty?) | β Missing | Missing | |
| | **Resource usage** | β
Active | `resource_efficiency_score` | |
| | **Safety constraints** (floor violations) | β
Active | `CRITICAL_FLOOR_VIOLATION` | |
| | **Anti-cheating checks** | β Missing | Model can claim +50 metric change with 0 resource cost | |
| | **Process-aware feedback** (step-level) | β Missing | Missing | |
| | **Multiple independent fns logged** | β Missing | Only one fn running | |
|
|
| **Parameters currently used to compute reward (the one active fn):** |
| - `outcome_score`: delta across all 23 sub-metrics, domain-weighted 1/6 each |
| - `cascade_containment_score`: % of metrics that didn't worsen |
| - `resource_efficiency_score`: 1 - avg(time/20, money/500, energy/100) |
| - `relationship_preservation_score`: sigmoid on relationship domain average delta |
| - **Penalties:** CRITICAL_FLOOR (-0.50), CASCADE_SPREAD (-0.30), INACTION (-0.40), RELATIONSHIP_COLLAPSE (-0.15) |
| |
| **Weights:** 0.40 outcome + 0.25 containment + 0.20 efficiency + 0.15 preservation |
| |
| --- |
| |
| ## Part 3: Delayed Human Outcome Signal |
| |
| This is excellent and has a formal name: **delayed human outcome signal**. The idea: |
| > After the agent gives advice β user acts on it β after N hours/days when the effect resolves β user submits: "did it work? what else changed?" |
| |
| This gives you two things the simulator can't: |
| 1. **Ground truth** on whether advice was correct (human validates predicted changes). |
| 2. **Unmeasured second-order effects** (e.g., trust damage not captured by metrics). |
| |
| --- |
| |
| ## The Plan |
| |
| ### Step 1 β Wire the orchestrator (1 day, critical) |
| `lifestack_env.py:step()` currently calls `compute_reward()`. Change it to call `compute_task_reward()` when a `Task` is present. This instantly activates milestone + completion + replan rewards without writing new code. |
| |
| ### Step 2 β Add the 3 missing independent reward functions (1 day) |
| * **reward_format_compliance**: +1.0 for valid JSON, -1.0 for refusals/text. Prevents the most common GRPO failure mode. |
| * **reward_plausibility_check**: Anti-gaming check. `ratio = sum(abs(metric_changes)) / max(1, sum(resource_costs))`. If ratio > 15, return -0.30. |
| * **reward_timeout_check**: Penalty if `step_count >= max_steps` and not done. |
| |
| ### Step 3 β Process-aware intermediate reward (1 day) |
| Add a reasoning coherence check β does the reasoning field actually mention the conflict domain? insegning the same final reward to every token is inefficient. |
| |
| ### Step 4 β Anti-hacking logging |
| Add "suspicious" flag to logs: `reward > 0.8 and resource_cost == {}`. |
|
|
| ### Step 5 β Human outcome feedback loop (new feature, 2-3 days) |
| Build `core/feedback.py` and Gradio UI for users to submit `OutcomeFeedback`. Store in ChromaDB and wire into retraining loop via `compute_human_feedback_reward`. |
|
|
| --- |
|
|
| ## Priority Order |
| 1. **Wire compute_task_reward into env.step()** β Immediate 4x more reward signal |
| 2. **Add format_compliance reward fn** β Prevents #1 GRPO failure mode |
| 3. **Add plausibility_check reward fn** β Blocks reward hacking |
| 4. **Log each fn independently in breakdown** β Satisfies guide Β§15 |
| 5. **Build OutcomeFeedback dataclass + app UI** β Differentiator |
| 6. **Wire human feedback into ChromaDB + retraining** β Long-term loop |
|
|