# reward.md — Reward System Reference

`core/reward.py` — Task-aware reward orchestrator.

---

## Overview

Two reward functions are available:

| Function | Used when |
|---|---|
| `compute_reward(...)` | Legacy / no-task episodes |
| `compute_task_reward(...)` | All task-driven episodes (v2.0+) |

---

## `compute_task_reward` — Components

```
reward = (0.35 × milestone)          # Reaching key progress markers
       + (0.25 × completion)         # Final goal achievement (binary 1.0 if any goal met)
       + (0.15 × outcome)            # Isolated local metric improvement
       + (0.10 × replan_bonus)       # Recovery after ExoEvents
       + (0.10 × efficiency)         # Resource preservation relative to delta
       + (0.05 × reasoning)          # Logical coherence & action alignment
       + penalties
```
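The weighted sum above can be sketched in a few lines of Python. This is only an illustration: `weighted_task_reward` and `TASK_REWARD_WEIGHTS` are hypothetical names, not taken from `core/reward.py`, and the sketch assumes each component is already normalized to roughly the 0 to 1 range, which the formula does not state.

```python
# Weights copied from the formula above; they sum to 1.0.
TASK_REWARD_WEIGHTS = {
    "milestone": 0.35,
    "completion": 0.25,
    "outcome": 0.15,
    "replan_bonus": 0.10,
    "efficiency": 0.10,
    "reasoning": 0.05,
}


def weighted_task_reward(components: dict[str, float], penalties: float = 0.0) -> float:
    """Hypothetical helper: weighted sum of components plus (negative) penalties."""
    base = sum(
        weight * components.get(name, 0.0)
        for name, weight in TASK_REWARD_WEIGHTS.items()
    )
    return base + penalties


# Illustrative, hand-picked component values (not real episode data).
example = {
    "milestone": 1.0,
    "completion": 1.0,
    "outcome": 0.5,
    "replan_bonus": 0.0,
    "efficiency": 0.8,
    "reasoning": 1.0,
}
reward = weighted_task_reward(example)
print(round(reward, 3))  # 0.35 + 0.25 + 0.075 + 0.0 + 0.08 + 0.05
```

With no penalties, the maximum of this sum is 1.0 under the normalization assumption, because the six weights add up to exactly 1.0.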

### Penalties

| Penalty | Value | Level | Trigger |
|---|---|---|---|
| `INACTION_PENALTY` | `-0.40` | Step | `actions_taken == 0` |
| `TASK_INACTION_PENALTY` | `-0.20` | Task | `actions_taken == 0` (additive to step penalty) |
| `CRITICAL_FLOOR_VIOLATION` | `-0.50` | Step | Any metric drops below 20 |
| `DEAD_END` | `-0.50` | Task | All viable routes closed without success |
| `CASCADE_SPREAD_WIDER` | `-0.30` | Step | Changes spread wider than the disruption baseline |
| `RELATIONSHIP_COLLAPSE` | `-0.15` | Step | Relationships drop by more than 20 points in one step |
| `CUMULATIVE_RELATIONSHIP_EROSION` | `-0.15` | Episode | Cumulative relationship drop exceeds 20 points |
| `PLAUSIBILITY_VIOLATION` | `-0.10 to -0.30` | Step | Implausible metric/cost ratio |
| `TIMEOUT` | `-0.20` | Task | Max steps reached without resolution |
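As a worked example of how these values combine, the sketch below sums a set of fired penalties. It assumes penalties are simply added together, which the table confirms only for the two inaction penalties; the names `PENALTIES` and `total_penalty` are hypothetical. The variable `PLAUSIBILITY_VIOLATION` is passed separately because the table gives a range rather than a fixed value.

```python
# Fixed-value penalties copied from the table above.
PENALTIES = {
    "INACTION_PENALTY": -0.40,
    "TASK_INACTION_PENALTY": -0.20,
    "CRITICAL_FLOOR_VIOLATION": -0.50,
    "DEAD_END": -0.50,
    "CASCADE_SPREAD_WIDER": -0.30,
    "RELATIONSHIP_COLLAPSE": -0.15,
    "CUMULATIVE_RELATIONSHIP_EROSION": -0.15,
    "TIMEOUT": -0.20,
}


def total_penalty(fired: list[str], plausibility: float = 0.0) -> float:
    """Hypothetical helper: sum fired fixed penalties plus a variable plausibility penalty.

    `plausibility` is either 0.0 (not fired) or a value in the documented
    -0.30 to -0.10 range.
    """
    if plausibility != 0.0 and not (-0.30 <= plausibility <= -0.10):
        raise ValueError("plausibility penalty must be 0.0 or within [-0.30, -0.10]")
    return sum(PENALTIES[name] for name in fired) + plausibility


# Inaction on a task-driven step: step penalty plus the additive task penalty,
# roughly -0.60 in total (up to floating-point rounding).
inaction_total = total_penalty(["INACTION_PENALTY", "TASK_INACTION_PENALTY"])
print(round(inaction_total, 2))
```

Because step, task, and episode penalties fire at different levels, the real orchestrator may apply them at different points in an episode; this sketch only shows the arithmetic.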

---

## Return Value

Both functions return a `(reward: float, breakdown: dict)` tuple. The top-level `breakdown` keys are shared, but the keys under `components` depend on which function produced the breakdown.

```python
breakdown = {
    "components": {
        # compute_reward(...)
        "outcome": float,
        "containment": float,
        "efficiency": float,
        "preservation": float,
        "format_compliance": float,
        "plausibility": float,
        "reasoning_alignment": float,

        # compute_task_reward(...)
        "local_metric_delta": float,
        "milestone": float,
        "completion": float,
        "replan": float,
        "reasoning": float,
        "timeout_penalty": float,
    },
    "penalties_fired": list[str],
    "base_reward": float,
    "penalties_total": float,
}
```
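A caller might consume the returned pair along these lines. This is a minimal sketch: `summarize_breakdown` is a hypothetical helper, and the sample dictionary is hand-written for illustration using the `compute_task_reward(...)` keys, not output captured from a real episode. Whether the stored component values are raw or already weighted is not specified here, so the sketch only compares them to each other.

```python
def summarize_breakdown(reward: float, breakdown: dict) -> str:
    """Hypothetical helper: one-line summary of a (reward, breakdown) pair."""
    components = breakdown["components"]
    # Name of the largest component, or "n/a" when none are present.
    top = max(components, key=components.get) if components else "n/a"
    fired = ", ".join(breakdown["penalties_fired"]) or "none"
    return (
        f"reward={reward:+.2f} "
        f"base={breakdown['base_reward']:+.2f} "
        f"penalties={breakdown['penalties_total']:+.2f} "
        f"top={top} fired={fired}"
    )


# Hand-written sample shaped like the structure above (illustrative values only).
sample_breakdown = {
    "components": {
        "local_metric_delta": 0.4,
        "milestone": 1.0,
        "completion": 0.0,
        "replan": 0.0,
        "reasoning": 0.8,
        "timeout_penalty": -0.2,
    },
    "penalties_fired": ["TIMEOUT"],
    "base_reward": 0.65,
    "penalties_total": -0.20,
}

line = summarize_breakdown(0.45, sample_breakdown)
print(line)
```

Logging a summary like this per step makes it easier to see which penalties fired without dumping the whole dictionary.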

---

## Change Log

| Date | Change |
|---|---|
| 2026-04-23 | Initial doc created |