| # reward.md — Reward System Reference |
|
|
| `core/reward.py` — Task-aware reward orchestrator. |
|
|
| --- |
|
|
| ## Overview |
|
|
| Two reward functions are available: |
|
|
| | Function | Used when | |
| |---|---| |
| | `compute_reward(...)` | Legacy / no-task episodes | |
| | `compute_task_reward(...)` | All task-driven episodes (v2.0+) | |
|
|
| --- |
|
|
| ## `compute_task_reward` — Components |
|
|
| ``` |
| reward = (0.35 × milestone) # Reaching key progress markers |
| + (0.25 × completion) # Final goal achievement (binary 1.0 if any goal met) |
| + (0.15 × outcome) # Isolated local metric improvement |
| + (0.10 × replan_bonus) # Recovery after ExoEvents |
| + (0.10 × efficiency) # Resource preservation relative to delta |
| + (0.05 × reasoning) # Logical coherence & action alignment |
| + penalties |
| ``` |
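The weighted sum above can be sketched as follows. This is a minimal illustration, not the code in `core/reward.py`: the weights come from the formula, but the name `weighted_task_reward`, the dict-based interface, and the choice to treat a missing component as `0.0` are assumptions made for the example.

```python
# Weights copied from the formula above. Everything else in this block
# (names, interface, defaults) is hypothetical and not taken from core/reward.py.
TASK_REWARD_WEIGHTS = {
    "milestone": 0.35,
    "completion": 0.25,
    "outcome": 0.15,
    "replan_bonus": 0.10,
    "efficiency": 0.10,
    "reasoning": 0.05,
}


def weighted_task_reward(components: dict[str, float], penalties: float = 0.0) -> float:
    """Return the weighted sum of component scores plus the (negative) penalties."""
    # ASSUMPTION: a component missing from the input contributes 0.0.
    base = sum(
        weight * components.get(name, 0.0)
        for name, weight in TASK_REWARD_WEIGHTS.items()
    )
    return base + penalties


# Illustrative scores only; these are not real episode data.
example_components = {
    "milestone": 1.0,
    "completion": 1.0,
    "outcome": 0.5,
    "replan_bonus": 0.0,
    "efficiency": 0.8,
    "reasoning": 1.0,
}

base_only = weighted_task_reward(example_components)
with_inaction = weighted_task_reward(example_components, penalties=-0.40)
print(round(base_only, 3), round(with_inaction, 3))
```

The six weights sum to 1.0, so if every component score lies in `[0, 1]` (an assumption this document does not confirm), the pre-penalty reward also stays in `[0, 1]`.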
|
|
| ### Penalties |
|
|
| | Penalty | Value | Level | Trigger | |
| |---|---|---|---| |
| | `INACTION_PENALTY` | `-0.40` | Step | `actions_taken == 0` | |
| | `TASK_INACTION_PENALTY` | `-0.20` | Task | `actions_taken == 0` (additive to step penalty) | |
| | `CRITICAL_FLOOR_VIOLATION` | `-0.50` | Step | Any metric drops below 20 | |
| | `DEAD_END` | `-0.50` | Task | All viable routes closed without success | |
| | `CASCADE_SPREAD_WIDER` | `-0.30` | Step | Changes spread wider than the disruption baseline | |
| | `RELATIONSHIP_COLLAPSE` | `-0.15` | Step | Relationships drop by more than 20 points in a single step | |
| | `CUMULATIVE_RELATIONSHIP_EROSION` | `-0.15` | Episode | Cumulative relationship drop exceeds 20 points | |
| | `PLAUSIBILITY_VIOLATION` | `-0.10` to `-0.30` | Step | Implausible metric/cost ratio | |
| | `TIMEOUT` | `-0.20` | Task | Max steps reached without resolution | |
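As a rough illustration of how the table's values combine, the sketch below sums the fixed penalties for a list of fired penalty names. The values are copied from the table; the helper `total_penalty`, its signature, and the way the variable `PLAUSIBILITY_VIOLATION` amount is passed in are hypothetical and may not match how `core/reward.py` does it.

```python
# Fixed penalty values copied from the table above.
# PLAUSIBILITY_VIOLATION is omitted because it is a range, not a constant.
PENALTY_VALUES = {
    "INACTION_PENALTY": -0.40,
    "TASK_INACTION_PENALTY": -0.20,
    "CRITICAL_FLOOR_VIOLATION": -0.50,
    "DEAD_END": -0.50,
    "CASCADE_SPREAD_WIDER": -0.30,
    "RELATIONSHIP_COLLAPSE": -0.15,
    "CUMULATIVE_RELATIONSHIP_EROSION": -0.15,
    "TIMEOUT": -0.20,
}


def total_penalty(fired: list[str], plausibility: float = 0.0) -> float:
    """Sum the fixed penalties that fired, plus an optional plausibility penalty.

    Hypothetical helper: `plausibility` is 0.0 when the check passed,
    otherwise a value inside the documented -0.30 to -0.10 range.
    """
    if plausibility != 0.0 and not (-0.30 <= plausibility <= -0.10):
        raise ValueError("plausibility penalty must lie in [-0.30, -0.10]")
    return sum(PENALTY_VALUES[name] for name in fired) + plausibility


# Inaction in a task episode: the task-level penalty is additive
# to the step-level one, so the total is -0.60 up to float rounding.
inaction_total = total_penalty(["INACTION_PENALTY", "TASK_INACTION_PENALTY"])
print(round(inaction_total, 2))
```

Raising on an out-of-range plausibility value is a design choice for the sketch; a real implementation might clamp instead.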
|
|
| --- |
|
|
| ## Return Value |
|
|
| Both functions return a `(reward: float, breakdown: dict)` tuple. The keys under `components` differ slightly between the two functions. |
|
|
| ```python |
| breakdown = { |
| "components": { |
| # compute_reward(...) |
| "outcome": float, |
| "containment": float, |
| "efficiency": float, |
| "preservation": float, |
| "format_compliance": float, |
| "plausibility": float, |
| "reasoning_alignment": float, |
| |
| # compute_task_reward(...) |
| "local_metric_delta": float, |
| "milestone": float, |
| "completion": float, |
| "replan": float, |
| "reasoning": float, |
| "timeout_penalty": float, |
| }, |
| "penalties_fired": list[str], |
| "base_reward": float, |
| "penalties_total": float, |
| } |
| ``` |
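A caller might consume the returned pair along these lines. The sketch only relies on the top-level keys listed above. The helpers `summarize_breakdown` and `is_consistent` are hypothetical, the sample values are made up, and the check `reward == base_reward + penalties_total` is an assumption suggested by the component formula rather than something this document states.

```python
import math


def summarize_breakdown(reward: float, breakdown: dict) -> str:
    """Format a reward and its breakdown as a short multi-line report."""
    lines = [f"reward={reward:+.3f}"]
    for name, value in sorted(breakdown["components"].items()):
        lines.append(f"  {name}: {value:+.3f}")
    fired = ", ".join(breakdown["penalties_fired"]) or "none"
    lines.append(f"  penalties fired: {fired}")
    return "\n".join(lines)


def is_consistent(reward: float, breakdown: dict, tol: float = 1e-6) -> bool:
    """Check reward against base_reward + penalties_total.

    ASSUMPTION: the final reward is the base reward plus the penalty total.
    """
    expected = breakdown["base_reward"] + breakdown["penalties_total"]
    return math.isclose(reward, expected, abs_tol=tol)


# Made-up values shaped like a compute_task_reward(...) breakdown.
sample_breakdown = {
    "components": {
        "local_metric_delta": 0.5,
        "milestone": 1.0,
        "completion": 1.0,
        "replan": 0.0,
        "reasoning": 1.0,
        "timeout_penalty": 0.0,
    },
    "penalties_fired": ["INACTION_PENALTY"],
    "base_reward": 0.805,
    "penalties_total": -0.40,
}

print(summarize_breakdown(0.405, sample_breakdown))
print(is_consistent(0.405, sample_breakdown))
```

Iterating over whatever keys are present in `components` keeps the helper usable with either function's output.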
|
|
| --- |
|
|
| ## Change Log |
|
|
| | Date | Change | |
| |---|---| |
| | 2026-04-23 | Initial doc created | |
|
|