# reward.md — Reward System Reference
`core/reward.py` — Task-aware reward orchestrator.
---
## Overview
Two reward functions are available:
| Function | Used when |
|---|---|
| `compute_reward(...)` | Legacy / no-task episodes |
| `compute_task_reward(...)` | All task-driven episodes (v2.0+) |
---
## `compute_task_reward` — Components
```
reward = (0.35 × milestone)      # Reaching key progress markers
       + (0.25 × completion)     # Final goal achievement (binary 1.0 if any goal met)
       + (0.15 × outcome)        # Isolated local metric improvement
       + (0.10 × replan_bonus)   # Recovery after ExoEvents
       + (0.10 × efficiency)     # Resource preservation relative to delta
       + (0.05 × reasoning)      # Logical coherence & action alignment
       + penalties
```
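The weighted sum above can be sketched in Python as follows. This is a minimal illustration, not the code in `core/reward.py`: the names `WEIGHTS`, `weighted_base_reward`, and `total_reward` are hypothetical, and treating a missing component as `0.0` is an assumption.

```python
# Weights copied from the formula above; they sum to 1.0.
WEIGHTS = {
    "milestone": 0.35,
    "completion": 0.25,
    "outcome": 0.15,
    "replan_bonus": 0.10,
    "efficiency": 0.10,
    "reasoning": 0.05,
}


def weighted_base_reward(components: dict[str, float]) -> float:
    """Sum weight * score over the six components.

    Assumption: a component missing from `components` contributes 0.0.
    """
    return sum(w * components.get(name, 0.0) for name, w in WEIGHTS.items())


def total_reward(components: dict[str, float], penalties: float = 0.0) -> float:
    """Add the (non-positive) penalty total to the weighted base reward."""
    return weighted_base_reward(components) + penalties
```

With every component at `1.0` and no penalties, the sketch returns the maximum reward of `1.0`; penalties can only lower that value.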
### Penalties
| Penalty | Value | Level | Trigger |
|---|---|---|---|
| `INACTION_PENALTY` | `-0.40` | Step | `actions_taken == 0` |
| `TASK_INACTION_PENALTY` | `-0.20` | Task | `actions_taken == 0` (additive to step penalty) |
| `CRITICAL_FLOOR_VIOLATION` | `-0.50` | Step | Any metric drops below 20 |
| `DEAD_END` | `-0.50` | Task | All viable routes closed without success |
| `CASCADE_SPREAD_WIDER` | `-0.30` | Step | Changes spread wider than disruption baseline |
| `RELATIONSHIP_COLLAPSE` | `-0.15` | Step | Relationships drop more than 20 points in one step |
| `CUMULATIVE_RELATIONSHIP_EROSION` | `-0.15` | Episode | Cumulative relationship drop exceeds 20 points |
| `PLAUSIBILITY_VIOLATION` | `-0.10 to -0.30` | Step | Implausible metric/cost ratio |
| `TIMEOUT` | `-0.20` | Task | Max steps reached without resolution |
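
As a rough illustration of how the penalties in this table combine, the sketch below sums the fixed values for whichever penalties fired. The names `FIXED_PENALTIES` and `sum_penalties` are hypothetical, and how `core/reward.py` detects each trigger is not shown here. Because `PLAUSIBILITY_VIOLATION` is a range rather than a fixed value, the sketch assumes the caller passes that amount in separately.

```python
# Fixed penalty values copied from the table above.
FIXED_PENALTIES = {
    "INACTION_PENALTY": -0.40,
    "TASK_INACTION_PENALTY": -0.20,
    "CRITICAL_FLOOR_VIOLATION": -0.50,
    "DEAD_END": -0.50,
    "CASCADE_SPREAD_WIDER": -0.30,
    "RELATIONSHIP_COLLAPSE": -0.15,
    "CUMULATIVE_RELATIONSHIP_EROSION": -0.15,
    "TIMEOUT": -0.20,
}


def sum_penalties(fired: list[str], plausibility_penalty: float = 0.0) -> float:
    """Return the total of all fired penalties (a non-positive number).

    Assumption: `plausibility_penalty` is 0.0 when PLAUSIBILITY_VIOLATION
    did not fire, otherwise a value between -0.30 and -0.10.
    """
    if plausibility_penalty != 0.0 and not (-0.30 <= plausibility_penalty <= -0.10):
        raise ValueError("plausibility_penalty must be 0.0 or within [-0.30, -0.10]")
    return sum(FIXED_PENALTIES[name] for name in fired) + plausibility_penalty
```

Under these assumptions, a step with no actions fires both inaction penalties, which add up to `-0.60`.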
---
## Return Value
Both functions return `(reward: float, breakdown: dict)`, but the keys under `components` depend on which function produced the result.
```python
breakdown = {
    "components": {
        # compute_reward(...)
        "outcome": float,
        "containment": float,
        "efficiency": float,
        "preservation": float,
        "format_compliance": float,
        "plausibility": float,
        "reasoning_alignment": float,
        # compute_task_reward(...)
        "local_metric_delta": float,
        "milestone": float,
        "completion": float,
        "replan": float,
        "reasoning": float,
        "timeout_penalty": float,
    },
    "penalties_fired": list[str],
    "base_reward": float,
    "penalties_total": float,
}
```
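
A caller might inspect the returned pair along these lines. This is only a sketch: `format_breakdown` is a hypothetical helper, the sample values are invented for illustration, and the relation `reward = base_reward + penalties_total` is an assumption inferred from the formula in the Components section rather than something this document states.

```python
def format_breakdown(reward: float, breakdown: dict) -> str:
    """Render a reward and its breakdown as a short multi-line summary."""
    lines = [f"reward={reward:+.2f} (base {breakdown['base_reward']:+.2f})"]
    # List components alphabetically so the output is stable across calls.
    for name, value in sorted(breakdown["components"].items()):
        lines.append(f"  {name}: {value:+.2f}")
    fired = ", ".join(breakdown["penalties_fired"]) or "none"
    lines.append(f"  penalties: {fired} (total {breakdown['penalties_total']:+.2f})")
    return "\n".join(lines)


# Made-up example shaped like a compute_task_reward(...) result.
sample_breakdown = {
    "components": {"milestone": 0.5, "completion": 0.0, "reasoning": 1.0},
    "penalties_fired": ["TIMEOUT"],
    "base_reward": 0.25,
    "penalties_total": -0.20,
}
# Assumed relation between the fields; not confirmed by this document.
sample_reward = sample_breakdown["base_reward"] + sample_breakdown["penalties_total"]

print(format_breakdown(sample_reward, sample_breakdown))
```

Reading `penalties_fired` alongside `penalties_total` makes it easier to see which triggers pulled an episode's reward below its base value.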
---
## Change Log
| Date | Change |
|---|---|
| 2026-04-23 | Initial doc created |