SystemTruth / docs /REWARD_DESIGN.md
Madhav189's picture
hackathon sprint: grader collapse + coliseum rename + training pipeline
c9baa73

Reward design

Why the rubric collapses to 5 components, why each component is shaped the way it is, and why the [0.65, 0.80] heuristic ceiling β€” distinct from the β‰₯0.90 scripted-expert reference β€” is the load-bearing GRPO invariant.

This document explains the reward decisions for the Triage tier, which is the surface every tier inherits. The Triage rubric is implemented in unified_incident_env/server/grader.py; the Strategy and Operations runners chain Triage episodes and decay their per-phase rewards over the horizon.


1. The 5-component rubric

outcome          0.45    root-cause action correct + recovery confirmed
action_validity  0.20    fraction of step() actions that are well-formed
format           0.10    submit_hypothesis was called before declare_resolved
anticheat        0.15    declare_resolved blocked unless β‰₯1 query-action ran
efficiency       0.10    exp(-current_tick / optimal_ticks_for_template)
                 ----
composite        1.00    public score, clamped to [0.01, 0.99]

Plus a per-tick shaped reward (the change in incident-health potential) computed inside UnifiedIncidentEnvironment.step β€” see Β§3.

The rubric is a Pydantic RubricScore model whose composite property bakes in the weights. Calling compute_breakdown() returns the five new keys plus zeroed legacy keys for any caller that still reads the 7-dim shape β€” see unified_incident_env/server/grader.py.


2. Why this rubric, not the previous 7-dim one

The earlier rubric (recovery 0.25 + containment 0.15 + verification 0.20 + impact 0.05 + efficiency 0.05 + speed_bonus 0.10 + noise_handling 0.05) summed to 0.85, was always quietly clamped, and double-counted causality:

  • recovery and verification both rewarded "the env reached a healthy state" β€” one via service-status weights, the other via the explicit end_to_end check. A successful resolve always lit both. The two dimensions were ~92% correlated across all 60 procgen variants, which flattened the GRPO advantage signal: trajectories that resolved cleanly clustered too tightly.
  • impact (0.05) was a deterministic function of recovery for every Triage scenario β€” the post-rollback impact targets are scenario constants. It contributed zero variance.
  • speed_bonus (0.10) only fired conditional on full verification, so it duplicated the verification gate.

The 5-component rubric strips those overlaps. Outcome (0.45) is the fused recovery+verification+impact dimension β€” a single 0/0.5/1 ladder that asks one question: did the agent fix the right thing and verify it? All the per-component partial credit lives inside that ladder. Efficiency (0.10) stays. The other three components β€” action_validity, format, anticheat β€” are new, and Β§4 explains why each one is necessary.

The composite now spans the full [0, 1] range without clamping; the public clamp to [0.01, 0.99] is just numerical hygiene at the API boundary.


3. The shaped per-tick reward (unchanged)

The grader computes composite from terminal state. Per-tick rewards are still shaped by the change in incident-health potential, which is what gives GRPO dense intermediate signal:

potential = 0.55 * sum(weight[s] * service_status_score[s])  for s in critical_services
          + 0.20 * (1 - user_impact)
          + 0.15 * (1 - slo_burn_rate)
          + 0.10 * containment_applied

per_tick_reward = -step_cost
                 + (potential_after - potential_before)
                 + bonus_from_action
                 - penalty_from_action

A correct rollback raises potential because services move toward healthy and containment_applied flips True. A wrong rollback or premature restart leaves potential flat (and pays an explicit unsafe_action_penalty). The shaping is potential-based (Ng et al. 1999): the optimal policy under shaped rewards is identical to the optimal policy under unshaped rewards, but the gradient finds the optimum much faster β€” necessary for the 12-hour A100 GRPO budget.


4. Why each new component is load-bearing

Outcome (0.45). The single biggest weight, because it's the single thing that matters. A 0/0.5/1 ladder:

  • 1.0 β€” cause_removed AND end_to_end check passed. The incident is genuinely fixed.
  • 0.5 β€” agent submitted a hypothesis with the correct root cause but never remediated. Diagnosis without action is half credit: it shows the agent understands what's wrong without proving it can act.
  • 0.0 β€” neither remediated nor correctly diagnosed.

The 0.5 step is what gives the [0.65, 0.80] heuristic ceiling its top edge (Β§5).

Action validity (0.20). Computed as (step_count - invalid_action_count) / step_count. Pydantic catches malformed actions at the HTTP boundary, so reaching step() already implies a structurally valid action. This dimension fires on the env's unsupported_action failure path β€” actions whose type isn't part of the bounded action set. It rewards "did the agent figure out the protocol" without conflating that with semantic correctness.

Format (0.10). A binary gate: 1.0 iff submit_hypothesis was called before declare_resolved (or declare_resolved was never called). This forces the agent to commit to a diagnosis before claiming victory β€” without it, an agent can shortcut by repeatedly trying declare_resolved and skipping the diagnosis step entirely.

Anticheat (0.15). 1.0 iff at least one query_* action ran before the first declare_resolved attempt. Blocks the cheapest possible cheat: submit_hypothesis(any) β†’ declare_resolved β†’ done. Note that run_check and submit_hypothesis do not count as evidence-gathering for this dimension β€” only the four golden-signal queries (query_logs, query_metrics, query_dependencies, query_deploys).

Efficiency (0.10). min(1.0, exp(-current_tick / optimal_ticks)). Smooth, monotonic, no thresholds. A scripted-optimal solve at exactly optimal_ticks scores e^-1 β‰ˆ 0.37. A trained agent that resolves in 0.5 * optimal_ticks scores e^-0.5 β‰ˆ 0.61. The exponential shape is deliberate: the marginal value of one fewer tick is highest at the start (where the agent must figure out what to do) and lowest at the end (where it's just running out the clock).

The five components are independently testable, and unified_incident_env/tests/test_environment.py has one assertion per component covering the canonical fail/pass paths.


5. The two ceiling bands

The rubric has two reference scores, not one:

5a. Heuristic ceiling: [0.65, 0.80]

A naive heuristic that gathers evidence + submits the correct hypothesis but never remediates lands in this band. Enforced across all 12 templates by test_heuristic_ceiling_is_in_band. Composition:

0.45 * 0.5  (outcome β€” correct hypothesis, no remediation)
0.20 * 1.0  (action_validity β€” all actions well-formed)
0.10 * 1.0  (format β€” hypothesis before any resolve attempt)
0.15 * 1.0  (anticheat β€” queries before any resolve attempt)
0.10 * x    (efficiency β€” depends on tick count when episode evaluated)
= 0.675 + 0.10 * x   β†’   in [0.675, 0.775] for x in [0, 1]

This is the ceiling a non-trained agent can hit by reading the action protocol carefully. The 0.20 gap from 0.80 β†’ 1.00 is the GRPO training target β€” every percentage point above 0.80 represents a remediation step the heuristic doesn't take.

5b. Scripted-expert reference: β‰₯ 0.90

The scripted-optimal baselines in extra_baselines() execute the canonical solve path: queries β†’ hypothesis β†’ rollback β†’ run_check β†’ declare_resolved. Under the new rubric they score in the [0.93, 0.95] band β€” outcome=1.0, action_validity=1.0, format=1.0, anticheat=1.0, and efficiency=e^(-optimal_ticks/optimal_ticks)=0.37. This is the demonstration shape: it shows the env is solvable cleanly, and it's what the SFT seed trajectories are derived from. Enforced by test_round2_baseline_resolves (β‰₯0.90) and test_baseline_resolves_honestly family.

A trained agent that beats the scripted baseline does so by trimming ticks β€” the only dimension with headroom once the other four are saturated.


6. Adversarial verification

The previous rubric had no explicit anti-cheat dimension; the absence was caught by spot-checking trajectories. The 5-component rubric makes adversarial behavior first-class:

Cheat strategy Blocked by Mechanism
declare_resolved immediately anticheat (0.15) requires β‰₯1 query before first resolve attempt
Skip submit_hypothesis to save a tick format (0.10) requires hypothesis before resolve
Spam hypotheses to fish for partial credit hypothesis idempotence second identical hypothesis returns 0 reward
Send malformed actions to flood the trace action_validity (0.20) invalid actions reduce the validity ratio
declare_resolved before checks pass outcome (0.45) outcome=0.0 unless cause_removed AND end_to_end
Restart the right service before rollback shaped reward + premature_restart penalty re-inherits bad state, pays explicit penalty
Rollback the wrong service shaped reward + wrong_remediation_target penalty leaves cause in place, pays penalty, no outcome credit

Three of those (anticheat, format, action_validity) are dimension-level blocks: even if the rest of the rubric saturates, a cheating agent loses 0.45 of weight. The other four are step-penalty blocks inside UnifiedIncidentEnvironment.step.

The unsafe_action_penalty referenced elsewhere is not a rubric component β€” it's an action-level step penalty applied during episode rollout. It surfaces in the per-tick reward but doesn't enter the composite. Conflating the two is a common mistake when reading the env.


7. Penalty structure (per-tick, not rubric)

Source Magnitude Triggers
step_cost 0.01/tick always (efficiency pressure)
unsafe_action_penalty 0.08 (medium) / 0.12 (hard) rollback wrong service, isolate wrong service, unsupported action
premature_resolution_penalty 0.20 / 0.30 declare_resolved before checks pass
low_value_restart (half-strength) 0.04 / 0.06 restart wrong service
premature_restart 0.08 / 0.12 restart correct service before cause removed

The asymmetry between low_value_restart (half-penalty) and premature_restart (full-penalty) reflects the SRE judgment that "restarting the right thing too early" is worse than "restarting the wrong thing" β€” restarting too early re-inherits the bad state and resets progress, while restarting the wrong thing is just wasted action.


8. Hypothesis reward (per-tick, not rubric)

submit_hypothesis pays an in-episode bonus on top of the rubric:

  • 0.04 for correct root_cause (RootCauseType match against scenario truth)
  • 0.03 Γ— overlap for affected_services overlap with truth
  • 0.03 Γ— quality for recommended_next_action quality (bonus if right, -0.4 if wrong)
  • 0.02 Γ— calibration for confidence calibration (bonus if confident-and-right, penalty if confident-and-wrong)

Total: up to ~0.12 per scenario, accruing into cumulative_reward (visible in observations and the live UI). submit_hypothesis is idempotent: a second identical hypothesis returns 0 bonus. An agent that spams hypotheses to harvest partial credit gets one shot per unique hypothesis.

This bonus is separate from the rubric: a correct hypothesis lifts both the per-tick reward (via the bonus) and the terminal outcome dimension (because hypothesis_root_cause_correct flips True). The two pathways are deliberate β€” the per-tick bonus rewards the moment of insight, the rubric dimension rewards the fact of insight.


9. Cross-tier reward shape

The Strategy and Operations tiers reuse the Triage rubric; they don't introduce new rubric dimensions. What they do add is a horizon decay applied to per-phase composite scores:

strategy_final_reward = horizon_decay_factor * mean(per_phase_composite)
operations_final_reward = horizon_decay_factor * mean(per_phase_composite)

horizon_decay_factor ∈ [0, 1] is computed by the runner from unresolved phases (Strategy) and chaos-pattern composition (Operations). The Triage rubric's properties β€” bounded composition, anti-cheat dimensions, smooth efficiency β€” all carry through. That's why the tier structure is a curriculum, not a benchmark substitution: a Triage-trained agent's outcome and format priors transfer directly into Strategy and Operations training.


10. Test discipline

Test Asserts
test_heuristic_ceiling_is_in_band naive heuristic in [0.65, 0.80] across all 12 templates
test_round2_baseline_resolves scripted-optimal β‰₯ 0.90 across the 6 round-2 templates
test_baseline_resolves_honestly family scripted-optimal resolves cleanly (incident_resolved=True, both checks pass)
test_fast_solve_beats_scripted_baseline a 6-step correct solve scores above the 10-step scripted baseline
test_declare_resolved_requires_checks premature resolve fires premature_resolution failure path
test_duplicate_hypothesis_bonus_is_not_farmable hypothesis-spam blocked by idempotence
test_blast_radius_increments_on_mitigations per-tick blast-radius counter is honest
test_fast_solve_beats_scripted_baseline a faster correct solve outscores the 10-tick baseline via the efficiency dim

Each rubric component has an end-to-end assertion; each adversarial cheat has a regression test. Adding a sixth rubric component would require: extending RubricScore, updating composite weight math to re-sum to 1.0, adding a test for both the cap and the floor, and updating the heuristic-ceiling math. That's the cost of the composable-rubric design β€” and it's deliberate.