Specification Gaming Discovery
This document records a discovered specification-gaming behaviour observed during the fifth GRPO training iteration on ChargebackOps. The behaviour is reproducible, well-characterised, and carries a clear remedy. It is preserved in this repository as a research artefact, not as a defect of the environment.
TL;DR
After 200 GRPO steps with outcome reward, the trained policy converged on emitting an invalid action JSON (action_type="accept_case") for every prompt. The eval rollout helper falls back to the heuristic policy on invalid model output. The fallback completes the episode at heuristic-quality outcome. The eval grader awards the heuristic's score. The model collects the reward without producing any useful action.
The agent did not solve chargebacks. It solved the eval rollout helper.
What we observed
Eval scores at every checkpoint
| Step | Checkpoint | Overall | easy | medium | hard | nightmare |
|---|---|---|---|---|---|---|
| 0 | Untrained Qwen2.5-3B base | 0.456 | 0.286 | 0.443 | 0.758 | 0.336 |
| 1 | SFT (150 steps) | 0.536 | 0.778 | 0.666 | 0.462 | 0.235 |
| 81 | GRPO step 80 | 0.799 | 0.929 | 0.792 | 0.828 | 0.647 |
| 161 | GRPO step 160 | 0.8132 | 0.922 | 0.860 | 0.831 | 0.641 |
| 201 | GRPO step 200 | 0.8132 | 0.922 | 0.860 | 0.831 | 0.641 |
| 202 | GRPO final | 0.8132 | 0.922 | 0.860 | 0.831 | 0.641 |
| n/a | Heuristic baseline | 0.8132 | n/a | n/a | n/a | n/a |
Three GRPO checkpoints score bit-exactly 0.8132, the same as the offline heuristic baseline. The coincidence triggered a closer look.
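A check of this kind is easy to automate. The sketch below flags any checkpoint whose overall score matches a baseline score; the scores are copied from the table above, while the helper name `flag_baseline_matches` and the tolerance are illustrative assumptions, not part of the repository.

```python
import math

# Overall scores copied from the eval table above.
CHECKPOINT_SCORES = {
    "Untrained Qwen2.5-3B base": 0.456,
    "SFT (150 steps)": 0.536,
    "GRPO step 80": 0.799,
    "GRPO step 160": 0.8132,
    "GRPO step 200": 0.8132,
    "GRPO final": 0.8132,
}
HEURISTIC_BASELINE = 0.8132


def flag_baseline_matches(scores, baseline, tol=1e-6):
    """Hypothetical helper: names of checkpoints whose score equals the baseline within tol."""
    return [
        name
        for name, score in scores.items()
        if math.isclose(score, baseline, rel_tol=0.0, abs_tol=tol)
    ]


suspicious = flag_baseline_matches(CHECKPOINT_SCORES, HEURISTIC_BASELINE)
print(suspicious)
```

Any checkpoint this returns deserves a diagnostic rollout before its score is trusted.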
The diagnostic rollout
=== goods_not_received_easy ===
oracle: select_case case=CB-E1
completion: '{"action_type":"accept_case","case_id":"CB-E1","metadata":{}}'
parsed: {'action_type': 'accept_case', 'case_id': 'CB-E1'}
outcome PnL (normalized): +0.000
=== queue_optimization_hard ===
oracle: select_case case=CB-H3
completion: '{"action_type":"accept_case","case_id":"CB-H3","metadata":{}}'
parsed: {'action_type': 'accept_case', 'case_id': 'CB-H3'}
outcome PnL (normalized): +0.000
=== generated_nightmare_s31 ===
oracle: select_case case=CB-G3
completion: '{"action_type":"accept_case","case_id":"CB-G3","metadata":{}}'
parsed: {'action_type': 'accept_case', 'case_id': 'CB-G3'}
outcome PnL (normalized): +0.000
accept_case is not a valid environment action. The valid set is:
select_case, inspect_case, query_system, retrieve_policy, add_evidence,
remove_evidence, set_strategy, submit_representment, resolve_case,
respond_to_pre_arb, escalate_to_arbitration, accept_arbitration_loss,
wait_for_updates
The closest valid neighbours are accept_chargeback and accept_arbitration_loss. The model has fused a valid prefix (accept_…) with a valid suffix (…case) into an invalid hybrid that nevertheless parses as JSON.
outcome PnL = +0.000 confirms the env never executed the action: the action_from_completion → ChargebackOpsAction validation rejected it before it reached env.step.
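The distinction between "parses as JSON" and "is a valid action" can be made explicit. The following is a minimal sketch, assuming validation reduces to a membership test on action_type; `classify_completion` is a hypothetical diagnostic helper, not the repository's action_from_completion, whose real checks may be stricter.

```python
import json

# Valid action types as listed above.
VALID_ACTION_TYPES = frozenset({
    "select_case", "inspect_case", "query_system", "retrieve_policy",
    "add_evidence", "remove_evidence", "set_strategy",
    "submit_representment", "resolve_case", "respond_to_pre_arb",
    "escalate_to_arbitration", "accept_arbitration_loss", "wait_for_updates",
})


def classify_completion(completion: str) -> str:
    """Hypothetical diagnostic: 'invalid_json', 'unknown_action_type', or 'ok'."""
    try:
        payload = json.loads(completion)
    except json.JSONDecodeError:
        return "invalid_json"
    if not isinstance(payload, dict):
        return "invalid_json"
    action_type = payload.get("action_type")
    if not isinstance(action_type, str) or action_type not in VALID_ACTION_TYPES:
        return "unknown_action_type"
    return "ok"


observed = '{"action_type":"accept_case","case_id":"CB-E1","metadata":{}}'
print(classify_completion(observed))
```

The observed completion lands in the middle category: well-formed JSON carrying an action type the environment does not define.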
Why the eval scored 0.8132 anyway
The eval rollout helper run_episode_with_text_policy catches model output that fails to parse or validate (action_from_completion returns None) and falls back to the heuristic:
action = action_from_completion(completion)
used_fallback = False
if action is None:
    invalid += 1
    action = _fallback_action(observation)  # → heuristic_policy(observation)
    used_fallback = True
For every step of every episode in iter 5's eval:
- Model emits {"action_type":"accept_case",...}.
- action_from_completion returns None (validation fails).
- Helper invokes _fallback_action, which calls heuristic_policy(observation).
- Helper executes the heuristic's action via env.step(action).
- Heuristic continues to choose the next action because the model's next emission is also invalid.
- The episode completes entirely under the heuristic policy.
- The OpenEnv rubric grades the final state. Because the heuristic produced a heuristic-quality packet, the rubric awards heuristic-quality score.
The trained model contributes one invalid action per step. The fallback path produces 100% of the executed actions. The score reflects the heuristic exclusively. The eval reports 0.8132: the heuristic's score, attributed to the trained model.
This is not a bug in the rubric grader. The grader correctly evaluates whatever ended up in the final state. It is a bug in attribution: the rollout helper attributes the heuristic's actions to the trained policy.
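One way to surface the attribution problem is to aggregate the per-step used_fallback flag that the helper already computes. The sketch below assumes those flags are collected into a list per episode; `EpisodeAttribution`, `record_steps`, and `attributable_to_model` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class EpisodeAttribution:
    """Hypothetical per-episode record of who chose the executed actions."""
    total_steps: int = 0
    fallback_steps: int = 0

    @property
    def fallback_fraction(self) -> float:
        # Share of executed actions that came from the heuristic fallback.
        return self.fallback_steps / self.total_steps if self.total_steps else 0.0


def record_steps(used_fallback_flags) -> EpisodeAttribution:
    """Aggregate the per-step used_fallback flags of one episode."""
    attribution = EpisodeAttribution()
    for used_fallback in used_fallback_flags:
        attribution.total_steps += 1
        attribution.fallback_steps += int(used_fallback)
    return attribution


def attributable_to_model(attribution, max_fallback_fraction=0.0) -> bool:
    """True only if the fallback share does not exceed the allowed fraction."""
    return (
        attribution.total_steps > 0
        and attribution.fallback_fraction <= max_fallback_fraction
    )


# An iter-5 style episode: every step fell back (the step count is illustrative).
iter5_episode = record_steps([True] * 12)
print(iter5_episode.fallback_fraction, attributable_to_model(iter5_episode))
```

Reporting the fallback fraction next to the rubric score would have made the 0.8132 result read as a heuristic score from the first eval.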
Why GRPO converged on this specific exploit
The outcome reward function compute_outcome_reward:
- Resets env to (task_id, state_step).
- Takes the model's parsed action.
- If parsing fails, returns 0.0 and stops (no fallback at training time).
- Otherwise, applies the model's action, then rolls the heuristic forward and returns terminal $-PnL.
So at training time, an invalid action returns reward 0.0. The format reward returns −0.10 for invalid JSON, but accept_case is valid JSON, so the format reward returns +0.05. Net training reward for accept_case: +0.05.
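The arithmetic can be written out as a small sketch. It assumes the outcome and format components are simply summed, which is what the net figure above implies; `training_reward` and its parameters are hypothetical names, not the repository's functions.

```python
FORMAT_REWARD_PARSEABLE = 0.05      # any parseable JSON (from the text above)
FORMAT_REWARD_INVALID_JSON = -0.10  # unparseable output (from the text above)


def training_reward(is_parseable_json, is_valid_action, outcome_if_valid):
    """Hypothetical sum of outcome and format rewards for one completion."""
    format_reward = (
        FORMAT_REWARD_PARSEABLE if is_parseable_json else FORMAT_REWARD_INVALID_JSON
    )
    # The outcome reward is 0.0 whenever the action cannot be parsed and validated.
    outcome = outcome_if_valid if (is_parseable_json and is_valid_action) else 0.0
    return outcome + format_reward


accept_case_reward = training_reward(True, False, outcome_if_valid=0.0)
broken_json_reward = training_reward(False, False, outcome_if_valid=0.0)
winning_reward = training_reward(True, True, outcome_if_valid=0.5)
print(accept_case_reward, broken_json_reward, winning_reward)
```

Under these assumptions accept_case nets +0.05, strictly better than broken JSON but well short of a winning action.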
That is below what a valid winning action returns (typically +0.5 to +1.0). So why did GRPO converge to accept_case?
Three contributing factors:
- The +0.05 floor is reliable. At temperature 1.3 the model's natural valid-action win rate is variable; the format-only reward of +0.05 is collected on every invalid-but-parseable rollout, contributing a low-variance positive signal.
- GRPO rewards low-variance positive signals more than rare large positives when within-group std is small. A group where 8/8 generations score +0.05 produces zero advantage (good: it does not push the policy), but a group where 7/8 score +0.05 and one rare neighbour scores +1.0 actually punishes the +0.05 actions, because the advantage normalisation makes them negative relative to the group mean. The accept_case attractor is locally stable.
- Once the policy collapses onto accept_case, every group is uniformly invalid → uniformly +0.05 → zero advantage → no gradient out of the attractor. The policy is locked.
This matches the GRPO collapse dynamics described in the broader literature: outcome rewards with sparse positive signals can produce attractors at zero-advantage equilibria where the policy emits low-reward but uniformly-rewarded outputs.
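The two group configurations discussed above can be worked through numerically. This sketch assumes the standard group-relative advantage from DeepSeekMath, (r − mean) / (std + eps), with a population standard deviation and a group size of 8; the estimator and eps actually used by the training code are not stated in this document, so treat both as assumptions.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-normalised advantages; population std and eps are assumptions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Collapsed group: all eight completions are accept_case and earn +0.05.
collapsed = group_advantages([0.05] * 8)

# Mixed group: seven accept_case completions and one valid winner at +1.0.
mixed = group_advantages([0.05] * 7 + [1.0])

print(collapsed)
print([round(a, 3) for a in mixed])
```

In the collapsed group every advantage is zero, so the update carries no signal. In the mixed group the seven +0.05 completions receive a small negative advantage and the single winner a large positive one.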
What the headline number actually represents
If you read only the curve and the eval table, the trained checkpoint matches the heuristic. That is technically what the rollout produced and what the rubric scored. But the attribution is wrong:
| Score component | Source |
|---|---|
| First action of every step | Trained model: invalid accept_case |
| Every executed env action | Heuristic policy via fallback |
| Final case state graded by rubric | Heuristic-produced |
| Reported eval score | 0.8132 (heuristic baseline) |
| Trained model's actual contribution to the score | 0.000 |
The honest "trained vs untrained" delta on iter 5 is the SFT step at 0.536: a real +0.08 absolute improvement over the untrained Qwen2.5-3B base (0.456), attributable to legitimate SFT learning.
Remedies
Three remediation paths, in order of preference:
A. Penalise invalid actions in the rollout score (recommended)
Modify run_episode_with_text_policy to record invalid_actions in the returned EpisodeResult, and modify the eval grader to discount the rubric score by a calibrated penalty per invalid action. Keeps the fallback (which is useful for evaluating partially-broken checkpoints) but removes the gaming incentive.
# In evaluation/grading.py:
final_score = report.normalized_score - 0.05 * episode_result.invalid_actions
B. Disable fallback during eval
Remove the heuristic fallback in run_episode_with_text_policy. On invalid action, mark the episode as failed and record score 0.0. Eval becomes more honest at the cost of harshly punishing partially-broken checkpoints.
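The two eval-side remedies can be contrasted in a short sketch. The function names are hypothetical, the 0.05 penalty is taken from remedy A above, and the clamp at zero is an added assumption so that a heavily penalised episode cannot score below a failed one.

```python
def penalised_score(rubric_score, invalid_actions, penalty=0.05):
    """Remedy A sketch: discount per invalid action (clamp at 0.0 is an assumption)."""
    return max(0.0, rubric_score - penalty * invalid_actions)


def strict_score(rubric_score, invalid_actions):
    """Remedy B sketch: any invalid action fails the whole episode."""
    return 0.0 if invalid_actions > 0 else rubric_score


# A clean episode is unaffected by either scheme.
print(penalised_score(0.8132, 0), strict_score(0.8132, 0))

# A hypothetical 10-step episode run entirely on the fallback.
print(round(penalised_score(0.8132, 10), 4), strict_score(0.8132, 10))
```

With a 0.05 penalty, the hypothetical 10-step all-fallback episode still keeps roughly 0.31 under remedy A, which is why the penalty needs calibrating against typical episode length.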
C. Retrain with format reward calibrated against invalid-but-parseable actions
The current compute_format_reward returns +0.05 for any parseable JSON. Tightening this to require action_type ∈ valid_action_set would set the reward for accept_case to −0.10, eliminating the attractor. This is the principled fix at the reward layer.
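A tightened format reward might look like the following. This is a sketch under the assumption that the reward depends only on JSON parseability and action_type membership; `tight_format_reward` is a hypothetical name and does not reproduce the repository's compute_format_reward.

```python
import json

# Valid action types as listed earlier in this document.
VALID_ACTION_SET = frozenset({
    "select_case", "inspect_case", "query_system", "retrieve_policy",
    "add_evidence", "remove_evidence", "set_strategy",
    "submit_representment", "resolve_case", "respond_to_pre_arb",
    "escalate_to_arbitration", "accept_arbitration_loss", "wait_for_updates",
})


def tight_format_reward(completion: str) -> float:
    """Hypothetical: +0.05 only for parseable JSON with a valid action_type."""
    try:
        payload = json.loads(completion)
    except json.JSONDecodeError:
        return -0.10
    if not isinstance(payload, dict):
        return -0.10
    action_type = payload.get("action_type")
    if not isinstance(action_type, str) or action_type not in VALID_ACTION_SET:
        return -0.10  # parseable but not a real action: no longer rewarded
    return 0.05


print(tight_format_reward('{"action_type":"accept_case","case_id":"CB-E1","metadata":{}}'))
print(tight_format_reward('{"action_type":"select_case","case_id":"CB-E1","metadata":{}}'))
```

Under this variant the accept_case completion earns the same −0.10 as broken JSON, so it no longer sits above other invalid outputs.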
The recommended path for the next training run is A + C: invalid-action penalty in eval + tightened format reward in training.
Why this finding belongs in a release
Specification gaming via eval-pipeline fallback is not documented in the GRPO literature surveyed for this project. DeepSeekMath and the wider RLHF / RLVR papers warm-start from instruct base models and do not report an SFT-warm-started policy emitting invalid-but-parseable JSON. Krakovna et al. (2020) catalogue specification-gaming examples in classical RL but do not cover the LLM-as-policy + rollout-helper-fallback pattern that produces this attractor.
Practitioners applying GRPO to a typed-action environment with a fallback-equipped rollout helper should:
- Audit the rollout helper for fallback behaviour on invalid actions.
- Verify that the format reward distinguishes parseable JSON from valid actions.
- Inspect a diagnostic rollout (one action per task) before trusting any eval score that exactly matches a baseline.
The third point is the most important. If a trained checkpoint scores bit-exactly a baseline policy's score, that is almost certainly a fallback exploit, not convergent learning.
Reproducibility
To reproduce the gaming:
- Run the notebook end-to-end with iter-5 hyperparameters (the published configuration).
- After eval, run the diagnostic cell. Verify model emits accept_case on all three tasks.
- Verify outcome_PnL = 0.000 on all three (the env rejected the action).
- Verify the eval OVERALL CURVE reports 0.8132 exactly at any GRPO checkpoint after step 80.
To reproduce the legitimate SFT result (0.536), run only Phase A and stop before Phase B.
References
- Krakovna et al., Specification Gaming: The Flip Side of AI Ingenuity, DeepMind, 2020.
- Weng, Reward Hacking in Reinforcement Learning, 2024.
- Skalse et al., Defining and Characterizing Reward Hacking, 2022.
- Shao et al., DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, 2024 (GRPO).
