Abhinav31122006
# Exploit Analysis — 911 Dispatch Supervisor
## Known Attack Vectors Considered & Closed
### 1. Reward Farming via Repeated Dispatch-Cancel Cycles
**Vector:** Agent dispatches a unit, immediately cancels, and re-dispatches to
collect a partial response_time reward on each cycle without ever resolving incidents.
**Mitigation:** Cancel actions return the unit to AVAILABLE but do NOT reset the
incident survival clock. The incident keeps counting down regardless of agent
action, so each dispatch-cancel cycle burns survival time, pushes the incident
toward escalation, and eventually triggers the Safety Gate, collapsing the score
to ≤0.2.
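
The interaction between cancel actions and the survival clock can be sketched as follows. This is a minimal illustration, not the project's implementation: the `Unit`, `Incident`, `dispatch`, `cancel`, and `run_farming_episode` names are hypothetical, and the sketch assumes the clock is measured in environment steps and that each action consumes one step.

```python
from dataclasses import dataclass
from enum import Enum


class UnitStatus(Enum):
    AVAILABLE = "available"
    DISPATCHED = "dispatched"


@dataclass
class Unit:
    status: UnitStatus = UnitStatus.AVAILABLE


@dataclass
class Incident:
    # Hypothetical clock: steps remaining before the incident escalates.
    survival_steps_left: int

    def tick(self) -> None:
        # The clock advances on every step, whatever the agent does.
        self.survival_steps_left = max(0, self.survival_steps_left - 1)


def dispatch(unit: Unit) -> None:
    unit.status = UnitStatus.DISPATCHED


def cancel(unit: Unit) -> None:
    # Cancel frees the unit but leaves the incident clock untouched.
    unit.status = UnitStatus.AVAILABLE


def run_farming_episode(clock: int, cycles: int) -> Incident:
    """Simulate repeated dispatch-cancel cycles against one incident."""
    unit, incident = Unit(), Incident(clock)
    for _ in range(cycles):
        dispatch(unit)
        incident.tick()
        cancel(unit)
        incident.tick()
    return incident
```

Under these assumptions, three dispatch-cancel cycles against an incident with a 6-step clock leave no survival time at all, which is the condition that feeds the Safety Gate.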
### 2. Safety Gate Bypass via P2-Only Dispatching
**Vector:** Agent ignores all P1 incidents and only resolves P2/P3 incidents to
accumulate triage and response_time rewards without triggering the Safety Gate.
**Mitigation:** The Safety Gate activates if ANY P1 incident that existed during
the episode ends with a survival score of 0.0. The agent cannot prevent P1
incidents from existing, because the scenario fixture spawns them deterministically.
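
A rough sketch of the gate condition, assuming a hypothetical `IncidentOutcome` record per incident and assuming the gate is applied as a cap of 0.2 on the raw score (the cap value is taken from the "≤0.2" figure in section 1; the capping mechanism itself is an assumption):

```python
from dataclasses import dataclass
from typing import Iterable

SAFETY_GATE_CAP = 0.2  # assumed from the "≤0.2" figure quoted in section 1


@dataclass(frozen=True)
class IncidentOutcome:
    priority: int          # 1 = P1, 2 = P2, 3 = P3
    survival_score: float  # 0.0 means the survival clock expired


def safety_gate_triggered(outcomes: Iterable[IncidentOutcome]) -> bool:
    # One P1 incident with zero survival is enough to trip the gate.
    return any(o.priority == 1 and o.survival_score == 0.0 for o in outcomes)


def final_score(raw_score: float, outcomes: Iterable[IncidentOutcome]) -> float:
    # Materialize once so generators are not consumed before scoring.
    outcomes = list(outcomes)
    if safety_gate_triggered(outcomes):
        return min(raw_score, SAFETY_GATE_CAP)
    return raw_score
```

With this shape, an agent that resolves every P2/P3 incident perfectly but lets a single P1 incident expire still ends up at or below the cap.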
### 3. Coverage Score Farming via Staging
**Vector:** Agent repeatedly stages all units in one district to maximize coverage
score for that district while ignoring active incidents.
**Mitigation:** Coverage score is computed across ALL districts simultaneously.
Concentrating units in one district reduces coverage elsewhere. Staged units also
cannot respond to incidents without an explicit dispatch action, so incident
survival clocks expire and trigger escalation penalties.
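
To illustrate why stacking units does not pay, here is a minimal sketch that assumes coverage is the fraction of districts holding at least one unit. The actual coverage formula is not given in this document, so `coverage_score` and the district names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def coverage_score(unit_districts: Sequence[str], all_districts: Sequence[str]) -> float:
    """Fraction of districts that hold at least one unit (assumed metric)."""
    if not all_districts:
        return 0.0
    counts = Counter(unit_districts)
    covered = sum(1 for district in all_districts if counts[district] > 0)
    return covered / len(all_districts)


districts = ["north", "south", "east", "west"]
stacked = coverage_score(["north"] * 4, districts)  # all four units in one district
spread = coverage_score(districts, districts)       # one unit per district
```

Under this assumed metric, four units stacked in one of four districts score 0.25, while the same four units spread out score 1.0.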
### 4. Phraseology Score Inflation via Notes Stuffing
**Vector:** Agent fills the notes field with every possible dispatch phrase to
maximize token overlap with canonical phrases.
**Mitigation:** PhraseologyJudge uses token overlap normalized by notes length.
Stuffing long notes with irrelevant text reduces precision, keeping the score low.
Only notes that match the specific action type and incident type score highly.
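
The precision effect can be sketched with a simple multiset token overlap. This is an assumed approximation of what PhraseologyJudge does, not its actual code: `phraseology_precision`, the whitespace tokenization, and the example phrases are all hypothetical.

```python
from collections import Counter


def phraseology_precision(notes: str, canonical_phrase: str) -> float:
    """Share of note tokens that match the canonical phrase (assumed metric)."""
    note_tokens = notes.lower().split()
    if not note_tokens:
        return 0.0
    # Multiset intersection: repeating a matching token earns no extra credit.
    overlap = Counter(note_tokens) & Counter(canonical_phrase.lower().split())
    return sum(overlap.values()) / len(note_tokens)


canonical = "medic 4 respond priority one"
clean = phraseology_precision("medic 4 respond priority one", canonical)
stuffed = phraseology_precision(
    "medic 4 respond priority one engine ladder stage cancel hold", canonical
)
repeated = phraseology_precision("medic medic medic medic", canonical)
```

Because the denominator is the length of the notes, every extra token that does not match the canonical phrase for the current action dilutes the score, and repeated tokens are only counted once.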
### 5. Determinism Exploitation
**Vector:** Agent memorizes the exact incident sequence (seed=42) and hardcodes
optimal actions rather than learning dispatch reasoning.
**Mitigation:** The fixed seed is intentional, for reproducibility. However, the
wave spawn system applies small timing-dependent perturbations to incident
locations, so a hardcoded action sequence fails once unit positions vary. The
environment is designed for evaluation, not training-time generalization.
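
One way such a perturbation could stay reproducible while still depending on timing is to derive the jitter from both the seed and the spawn step. The sketch below is an assumption about the mechanism, not the wave spawn system's actual code; `perturbed_location`, `max_jitter`, and the seed-mixing constant are hypothetical.

```python
import random
from typing import Tuple


def perturbed_location(
    base: Tuple[float, float],
    seed: int,
    spawn_step: int,
    max_jitter: float = 0.5,
) -> Tuple[float, float]:
    """Offset a base location by a jitter derived from (seed, spawn_step)."""
    # Mixing the spawn step into the seed keeps runs reproducible while
    # making the offset depend on when the incident actually spawns.
    rng = random.Random(seed * 1_000_003 + spawn_step)
    dx = rng.uniform(-max_jitter, max_jitter)
    dy = rng.uniform(-max_jitter, max_jitter)
    return (base[0] + dx, base[1] + dy)
```

The same seed and spawn step always give the same location, which preserves reproducibility, but any change in spawn timing moves the incident, so a memorized action sequence is no longer guaranteed to send the right unit to the right place.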
## Conclusion
No reward-farming exploit was found that allows an agent to score above 0.6
without genuinely resolving Priority-1 incidents with the correct unit types
within their survival clock windows.