Abhinav31122006

feat: exploit analysis, architecture docs, observation depth, citation

0b2675d about 2 months ago

	# Reward Design — 911 Dispatch Supervisor

	## Philosophy

	The reward function is built on two ordering principles: life before property, and speed before coverage. The component weights are chosen to reflect real-world dispatch priority doctrine.

	## Components

	| Component | Weight | What it measures |
	|---|---|---|
	| Response Time | 30% | How fast the correct unit reaches the incident |
	| Triage | 25% | Whether unit type matches incident type (MEDIC→medical, ENGINE→fire) |
	| Survival | 25% | Whether P1 patients survive to resolution |
	| Coverage | 12% | Whether city districts have available units nearby |
	| Protocol | 8% | Whether dispatch notes use realistic radio phraseology |
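
The weights in the table sum to 100%, so one natural reading is a weighted sum of per-component scores. The following is a minimal sketch under that assumption, not the environment's actual implementation: the names `WEIGHTS` and `weighted_reward`, the dictionary keys, and the requirement that each component score is already normalized to [0, 1] are all illustrative.

```python
# Weights taken from the component table; the key names are hypothetical.
WEIGHTS = {
    "response_time": 0.30,
    "triage": 0.25,
    "survival": 0.25,
    "coverage": 0.12,
    "protocol": 0.08,
}


def weighted_reward(components: dict[str, float]) -> float:
    """Combine per-component scores, each assumed to lie in [0, 1]."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    total = 0.0
    for name, weight in WEIGHTS.items():
        score = components[name]
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} score {score} is outside [0, 1]")
        total += weight * score
    return total
```

Under this reading, an agent that is perfect on everything except radio phraseology would score about 0.92, since Protocol carries only 8% of the weight.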

	## Safety Gate

	If any Priority-1 (P1) incident ends with zero survival (the patient died, or no unit ever arrived), the total episode reward is hard-capped at 0.2, regardless of how well the agent performed on all other incidents.

	This is not a bug. It reflects real dispatch accountability: no amount of good coverage or fast response on secondary incidents excuses a preventable P1 death.
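
The gate can be sketched as a cap applied after the episode reward is computed. This is an illustration only: `apply_safety_gate` and `P1_FAILURE_CAP` are hypothetical names, and the sketch assumes each P1 incident reports a survival score where 0.0 means the patient died or no unit arrived.

```python
P1_FAILURE_CAP = 0.2  # cap value stated in this section


def apply_safety_gate(raw_reward: float, p1_survival: list[float]) -> float:
    """Cap the episode reward if any P1 incident has zero survival.

    `p1_survival` holds one assumed survival score per P1 incident.
    """
    if any(score == 0.0 for score in p1_survival):
        # A cap, not an override: rewards already below 0.2 stay as they are.
        return min(raw_reward, P1_FAILURE_CAP)
    return raw_reward
```

For example, an episode that would otherwise score 0.9 is reduced to 0.2 if one of its P1 incidents ends with zero survival, while an episode with no P1 failures is left unchanged.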

	## Partial Progress

	Rewards are non-sparse. An agent receives signal every step for:
	- Units moving toward incidents (ETA decreasing)
	- Correct unit types being dispatched
	- Districts maintaining coverage

	This means even a weak agent that dispatches randomly receives an informative learning signal at every step, which makes the environment suitable for both RL training and LLM evaluation.
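
To make the three per-step signals concrete, here is a rough sketch of how they might be derived from two consecutive snapshots of the simulation. Everything in it is an assumption for illustration: the `StepSnapshot` fields, the `step_signals` function, and the choice to express each signal as a fraction in [0, 1] are not taken from the environment's code, and the sketch does not say how the signals are weighted.

```python
from dataclasses import dataclass


@dataclass
class StepSnapshot:
    """Hypothetical per-step summary; field names are illustrative only."""
    etas: dict[str, float]      # unit id -> ETA to its assigned incident
    correct_dispatches: int     # dispatches this step with a matching unit type
    total_dispatches: int       # all dispatches this step
    covered_districts: int      # districts with an available unit nearby
    total_districts: int


def step_signals(prev: StepSnapshot, curr: StepSnapshot) -> dict[str, float]:
    """Return the three partial-progress signals, each as a fraction in [0, 1]."""
    # Only units present in both snapshots can show a decreasing ETA.
    shared = prev.etas.keys() & curr.etas.keys()
    closing = sum(1 for unit in shared if curr.etas[unit] < prev.etas[unit])
    return {
        "eta_progress": closing / len(shared) if shared else 0.0,
        "triage": (curr.correct_dispatches / curr.total_dispatches
                   if curr.total_dispatches else 0.0),
        "coverage": (curr.covered_districts / curr.total_districts
                     if curr.total_districts else 0.0),
    }
```

Because each signal is computed every step, a random policy still sees values that vary with its actions, which is the property the section relies on.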

	## Difficulty Gradient

	| Task | Random Score | LLM Expected | Design Intent |
	|---|---|---|---|
	| single_incident | 0.20 | 0.55–0.75 | One decision, one unit — tests basic triage |
	| multi_incident | 0.31 | 0.40–0.60 | Competing P1s — tests priority ordering |
	| mass_casualty | ~0.28 | 0.30–0.50 | Surprise waves — tests adaptability |
	| shift_surge | ~0.25 | 0.25–0.40 | Resource scarcity — tests planning under constraint |

	The gap between random and LLM scores is the signal this benchmark measures.
	A model that scores 0.70 on single_incident but 0.25 on shift_surge is demonstrating
	exactly the capability boundary the environment is designed to expose.
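
The gap can be read directly off the table. The short sketch below works through that arithmetic; `SCORES` and `gap_over_random` are illustrative names, and the approximate ("~") random scores are treated as exact values purely for the calculation.

```python
# (random score, LLM expected low, LLM expected high) from the table above.
# The "~" approximations are treated as exact for this illustration.
SCORES = {
    "single_incident": (0.20, 0.55, 0.75),
    "multi_incident": (0.31, 0.40, 0.60),
    "mass_casualty": (0.28, 0.30, 0.50),
    "shift_surge": (0.25, 0.25, 0.40),
}


def gap_over_random(task: str, model_score: float) -> float:
    """How far a model's score sits above the random baseline for a task."""
    random_score, _, _ = SCORES[task]
    return model_score - random_score


# Print the expected gap range implied by each row of the table.
for task, (rand, low, high) in SCORES.items():
    print(f"{task}: expected gap {low - rand:.2f} to {high - rand:.2f}")
```

For the model described above, a score of 0.70 on single_incident sits about 0.50 above random, while 0.25 on shift_surge sits exactly at the random baseline, so the harder task shows no measurable lift.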