garvitsachdeva / 911 / _patcher.py

SayedZahur786

feat: phase3 improvements - reward clarity, survival clocks, MCP endpoint, phraseology docs

1d762f3 about 1 month ago


	import re

	# Load the current README so the patches below are applied in memory.
	with open('README.md', 'r', encoding='utf-8') as f:
	    readme = f.read()

	# A3 replacements: append a scoring breakdown after each task description.
	readme = readme.replace(
	    "What a good agent does: Immediately dispatches `MED-1 → INC-001`.",
	    "What a good agent does: Immediately dispatches `MED-1 → INC-001`.\n\nScoring: 50% resolution + 30% correct unit type used + 20% response speed."
	)

	readme = readme.replace(
	    "What a good agent does: Immediately dispatches MEDIC to cardiac arrest and patrol to shooting, then handles the fire with ENGINE/LADDER.",
	    "What a good agent does: Immediately dispatches MEDIC to cardiac arrest and patrol to shooting, then handles the fire with ENGINE/LADDER.\n\nScoring: 50% P1 resolution + 30% overall resolution − 20% escalation penalty."
	)

	readme = readme.replace(
	    "What a good agent does: Dispatches immediately to initial collapse, stages additional units near expected wave arrival zones, requests mutual aid for later waves.",
	    "What a good agent does: Dispatches immediately to initial collapse, stages additional units near expected wave arrival zones, requests mutual aid for later waves.\n\nScoring: 60% P1 survival + 30% mean step reward − failure penalty if building collapse unresponded."
	)

	readme = readme.replace(
	    "Why it's hard: No single optimal strategy — agents must continuously rebalance between throughput and coverage as available resources shrink and incident demand grows.",
	    "Why it's hard: No single optimal strategy — agents must continuously rebalance between throughput and coverage as available resources shrink and incident demand grows.\n\nScoring: 35% resolution + 25% P1 survival + 15% coverage + 15% backlog management + 10% step reward − 25% escalation penalty."
	)

	# A4 replacements
	a4_addition = """
	### What the scores mean

	A random agent scoring 0.20 on the easiest task confirms the environment is not trivially solvable — there is no reward for random dispatching. The gradient from 0.20 → 0.46 across tasks reflects genuine increasing complexity, not just more steps.

	A well-prompted frontier LLM (GPT-4o, Llama-3.1-70B) is expected to score 0.55–0.75 on single_incident and 0.30–0.45 on shift_surge, demonstrating the environment meaningfully differentiates agent capability.
	"""

	# We'll insert A4 right after the NOTE blockquote below the baseline score table.
	# Existing note text: > Note: Earlier README versions showed higher scores (~0.30–0.74) from a different scoring path (`observation.score`). These figures use the canonical competition normalization: `sum(step_rewards) / max_steps`, clamped to `[0.0, 1.0]`.

	readme = readme.replace(
	    "clamped to `[0.0, 1.0]`.\n",
	    f"clamped to `[0.0, 1.0]`.\n\n{a4_addition.strip()}\n"
	)

	# D1 replacements (Phraseology examples)
	d1_addition = """
	#### Dispatch Phraseology (bonus scoring)

	The `notes` field is scored for realistic radio communication language. Agents that use proper dispatch phraseology receive up to 8% bonus on their protocol score.

	| Action | Example notes value |
	|---|---|
	| Dispatch MEDIC to cardiac | `"Medic 1 en route to cardiac arrest, Code 3, ETA 4 minutes"` |
	| Dispatch ENGINE to fire | `"Engine 2 responding to structure fire, Code 3, all units advised"` |
	| Mutual aid request | `"Requesting mutual aid, all local MEDICs committed, Priority 1 cardiac at grid 45-72"` |
	| Stage unit | `"Engine 1 staging at District 3 perimeter, awaiting scene clear"` |
	"""
	readme = readme.replace(
	    "| `DOWNGRADE` | Decrease incident severity | New severity must be strictly lower than current |\n",
	    "| `DOWNGRADE` | Decrease incident severity | New severity must be strictly lower than current |\n\n" + d1_addition.strip() + "\n"
	)

	# Write the patched README back to disk.
	with open('README.md', 'w', encoding='utf-8') as f:
	    f.write(readme)
	print("Finished A3 A4 D1.")
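
Every patch above goes through `str.replace`, which returns the text unchanged when the anchor string is missing and replaces every occurrence when it appears more than once. Because each replacement also contains its own anchor, running the script a second time would append the same block again. A minimal sketch of a stricter helper is shown below. The name `replace_once`, the exactly-one-match rule, and the skip-if-already-patched check are assumptions for illustration and are not part of the original script.

```python
def replace_once(text: str, anchor: str, replacement: str) -> str:
    """Swap `anchor` for `replacement`, failing loudly on drift.

    Hypothetical helper, not part of the original _patcher.py.
    """
    # Idempotency: if the patched text is already present, do nothing.
    if replacement in text:
        return text
    # Assumption: every anchor should match exactly once in the README.
    count = text.count(anchor)
    if count != 1:
        raise ValueError(
            f"expected exactly 1 match, found {count}: {anchor[:40]!r}"
        )
    return text.replace(anchor, replacement, 1)


# Small in-memory demonstration using the A4 anchor from the script.
anchor = "clamped to `[0.0, 1.0]`.\n"
replacement = "clamped to `[0.0, 1.0]`.\n\n### What the scores mean\n"

patched = replace_once("Scores are " + anchor, anchor, replacement)
patched_again = replace_once(patched, anchor, replacement)  # no-op on rerun
```

Swapping each `readme.replace(...)` call for `readme = replace_once(readme, ...)` would make a skipped patch raise a `ValueError` rather than pass silently, at the cost of aborting the whole run on the first README mismatch.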