Abhinav31122006

Exploit Analysis — 911 Dispatch Supervisor

Known Attack Vectors Considered & Closed

1. Reward Farming via Repeated Dispatch-Cancel Cycles

Vector: The agent dispatches a unit, immediately cancels, and re-dispatches to collect a partial response_time reward on each cycle without ever resolving an incident.

Mitigation: A cancel action returns the unit to AVAILABLE but does NOT reset the incident's survival clock. The clock keeps counting down regardless of agent action, so every cancel-dispatch cycle spends time the incident does not have: the incident escalates, the Safety Gate triggers, and the score collapses to ≤0.2.
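The clock behaviour can be illustrated with a minimal sketch. The names `Incident`, `Unit`, `tick`, and `run_cancel_cycles`, and the starting clock of 6 steps, are hypothetical and chosen only for illustration; the actual environment may model time differently.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    survival_steps_left: int  # hypothetical countdown, in environment steps


@dataclass
class Unit:
    status: str = "AVAILABLE"


def tick(incident: Incident) -> None:
    """Advance time by one step; the clock never goes below zero."""
    incident.survival_steps_left = max(0, incident.survival_steps_left - 1)


def dispatch(unit: Unit, incident: Incident) -> None:
    unit.status = "DISPATCHED"


def cancel(unit: Unit, incident: Incident) -> None:
    # The unit is freed, but the incident clock is deliberately left untouched.
    unit.status = "AVAILABLE"


def run_cancel_cycles(cycles: int, start_clock: int = 6):
    """Simulate repeated dispatch/cancel cycles against a single incident."""
    incident = Incident(start_clock)
    unit = Unit()
    for _ in range(cycles):
        dispatch(unit, incident)
        tick(incident)
        cancel(unit, incident)
        tick(incident)
    return incident.survival_steps_left, unit.status
```

Under these assumptions, each cycle costs two steps of survival time and leaves the unit exactly where it started, so three cycles exhaust a six-step clock without any progress on the incident.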

2. Safety Gate Bypass via P2-Only Dispatching

Vector: The agent ignores all P1 incidents and resolves only P2/P3 incidents, accumulating triage and response_time rewards without triggering the Safety Gate.

Mitigation: The Safety Gate activates if ANY P1 incident that existed during the episode has a survival score of 0.0. The agent cannot prevent P1 incidents from existing, because the scenario fixture spawns them deterministically.
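A minimal sketch of the gate condition follows. The function name `apply_safety_gate` and the constant `SAFETY_GATE_CAP` are hypothetical; the 0.2 cap is taken from the score ceiling stated in this document, and the real scoring code may combine the gate with other terms.

```python
SAFETY_GATE_CAP = 0.2  # ceiling stated in this document for gated episodes


def apply_safety_gate(raw_score: float, p1_survival_scores: list[float]) -> float:
    """Cap the episode score if any P1 incident ended with survival 0.0.

    p1_survival_scores holds one survival score per P1 incident that
    existed during the episode (hypothetical representation).
    """
    if any(score == 0.0 for score in p1_survival_scores):
        return min(raw_score, SAFETY_GATE_CAP)
    return raw_score
```

For example, an agent that earns a raw 0.85 from P2/P3 work while one P1 incident expires would be capped at 0.2, whereas the same raw score with all P1 incidents surviving passes through unchanged.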

3. Coverage Score Farming via Staging

Vector: The agent repeatedly stages all units in one district to maximize that district's coverage score while ignoring active incidents.

Mitigation: Coverage score is computed across ALL districts simultaneously, so concentrating units in one district reduces coverage everywhere else. Staged units also cannot respond to an incident without an explicit dispatch action, so incident survival clocks expire and trigger escalation penalties.
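The effect of scoring all districts at once can be sketched as follows. The formula here (fraction of districts holding at least one available unit) is an assumption made for illustration, not the environment's actual coverage metric, and `coverage_score` is a hypothetical name.

```python
def coverage_score(unit_districts: list[str], all_districts: list[str]) -> float:
    """Fraction of districts that contain at least one available unit.

    unit_districts lists the district of each available unit.
    This formula is an illustrative assumption, not the real metric.
    """
    if not all_districts:
        return 0.0
    covered = set(unit_districts) & set(all_districts)
    return len(covered) / len(all_districts)
```

With four districts and four units, stacking every unit in one district yields 0.25 under this assumed formula, while spreading them one per district yields 1.0.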

4. Phraseology Score Inflation via Notes Stuffing

Vector: The agent fills the notes field with every possible dispatch phrase to maximize token overlap with the canonical phrases.

Mitigation: PhraseologyJudge normalizes token overlap by notes length. Stuffing the notes with irrelevant text lowers precision and keeps the score low; only notes that match the specific action type and incident type score highly.
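The length normalization can be sketched as a token-precision calculation. This is not the PhraseologyJudge implementation: the function name `phraseology_score`, the whitespace tokenization, and the example phrases are all assumptions, and the sketch covers only the normalization step, not the matching against action type and incident type.

```python
from collections import Counter


def phraseology_score(notes: str, canonical_phrase: str) -> float:
    """Token overlap with the canonical phrase, divided by notes length.

    Counts are clipped, so repeating a matching token does not raise the score.
    """
    note_counts = Counter(notes.lower().split())
    total = sum(note_counts.values())
    if total == 0:
        return 0.0
    canon_counts = Counter(canonical_phrase.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((note_counts & canon_counts).values())
    return overlap / total
```

Under these assumptions, a note identical to the canonical phrase scores 1.0, while the same phrase padded with seven unrelated tokens scores 6/13, so every stuffed token dilutes the result.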

5. Determinism Exploitation

Vector: The agent memorizes the exact incident sequence (seed=42) and hardcodes optimal actions rather than learning dispatch reasoning.

Mitigation: The fixed seed is intentional, for reproducibility. However, the wave spawn system introduces timing-dependent incident locations with small perturbations, so hardcoded action sequences fail when unit positions vary. The environment is designed for evaluation, not for training-time generalization.
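One way a spawn can be both reproducible and timing-dependent is to derive the perturbation from the seed together with the step at which the wave fires. The sketch below illustrates that idea only; `spawn_location`, the `jitter` parameter, and the way seed and step are combined are hypothetical and not taken from the environment's wave spawn code.

```python
import random


def spawn_location(base_xy: tuple[float, float], seed: int, spawn_step: int,
                   jitter: float = 0.5) -> tuple[float, float]:
    """Return a base location plus a small perturbation.

    The perturbation is reproducible for a given (seed, spawn_step) pair,
    but changes when the spawn step changes. The mixing constant is arbitrary.
    """
    rng = random.Random(seed * 1_000_003 + spawn_step)
    dx = rng.uniform(-jitter, jitter)
    dy = rng.uniform(-jitter, jitter)
    return (base_xy[0] + dx, base_xy[1] + dy)
```

In this sketch, rerunning with seed 42 and the same spawn step reproduces the location exactly, while a wave that fires at a different step lands somewhere slightly different, which is enough to invalidate a memorized route.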

Conclusion

No reward-farming exploit was found that allows an agent to score >0.6 without genuinely resolving Priority-1 incidents with the correct unit types within their survival clock windows.