	# Agent Cost Optimizer — Living README

	Version: 2026-05-11 (post-critique, post-cleanup)

	## What This Is

	A static cascade router for SWE-bench coding agents. It tries cheap models first and escalates to the frontier model only when needed. On 500 instances it costs about 56% less than frontier-with-retry at statistically equivalent quality. No machine learning, just a for-loop.
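The "just a for-loop" description above can be sketched roughly as follows. This is a minimal illustration, not the code in `aco/aco_live.py`: the `Attempt` type, the `run_cascade` function, the `accept` check, and the stub tiers with their per-attempt costs are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Attempt:
    tier: str              # which tier produced this attempt
    patch: Optional[str]   # candidate patch, or None if the tier gave up
    cost_usd: float        # what this single attempt cost


def run_cascade(
    task: str,
    tiers: Sequence[Callable[[str], Attempt]],
    accept: Callable[[Attempt], bool],
) -> tuple[Optional[Attempt], float]:
    """Try tiers cheapest-first; stop at the first attempt that `accept` approves."""
    total_cost = 0.0
    for solve in tiers:
        attempt = solve(task)
        total_cost += attempt.cost_usd  # failed attempts still cost money
        if accept(attempt):
            return attempt, total_cost
    return None, total_cost  # every tier failed


# Hypothetical stub tiers with made-up costs, standing in for real model calls.
def t1(task: str) -> Attempt:
    return Attempt("T1", None, 0.02)


def t2(task: str) -> Attempt:
    return Attempt("T2", "diff --git a/utils.py b/utils.py", 0.10)


def t4(task: str) -> Attempt:
    return Attempt("T4", "diff --git a/utils.py b/utils.py", 0.60)


result, cost = run_cascade(
    "Fix the null pointer in utils.py",
    [t1, t2, t4],
    accept=lambda a: a.patch is not None,
)
print(result.tier if result else "unsolved", f"${cost:.2f}")
```

The point of the sketch is the accounting: the cost of a solved task includes every cheaper attempt that failed before it, which is why the cascade only pays off when the cheap tiers solve a large share of tasks.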

	## What This Is NOT

	- A trained ML router (we tried, it failed — oracle gap too narrow)
	- A doom detector that terminates runs (58-72% recover, rescue instead)
	- A cache-aware prompt optimizer (only 1.5% static prefix in SWE-bench prompts)
	- A complete Agent Cost Optimizer framework (many components are analysis, not implementation)

	## The One True Result

	| Strategy | Solved | Cost | $/Solved |
	|---|---|---|---|
	| Cascade T1→T2→T4 | 416/500 (83.2%) | $86.33 | $0.21 |
	| Frontier single | 391/500 (78.2%) | $158.34 | $0.40 |
	| Frontier retry | 420/500 (84.0%) | $196.77 | $0.47 |

	The cascade is statistically tied with frontier-retry on quality (95% CI on the solve-rate difference: [-2.8pp, +1.0pp]) but costs 56% less ($86.33 vs. $196.77).
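The derived columns can be rechecked from the raw counts and costs in the table. The snippet below is only a small arithmetic sketch; the dictionary keys and helper names are illustrative, and the numbers are copied from the table.

```python
# (solved, total cost in USD) copied from the table; each strategy ran 500 instances.
results = {
    "cascade": (416, 86.33),
    "frontier_single": (391, 158.34),
    "frontier_retry": (420, 196.77),
}
N_INSTANCES = 500


def cost_per_solved(name: str) -> float:
    solved, cost = results[name]
    return cost / solved


def savings_vs(baseline: str, candidate: str = "cascade") -> float:
    """Fraction of the baseline's total cost that the candidate avoids."""
    return 1.0 - results[candidate][1] / results[baseline][1]


for name, (solved, cost) in results.items():
    print(f"{name}: {solved / N_INSTANCES:.1%} solved, ${cost_per_solved(name):.2f} per solved")

print(f"savings vs frontier_retry:  {savings_vs('frontier_retry'):.1%}")
print(f"savings vs frontier_single: {savings_vs('frontier_single'):.1%}")
```

The 56% figure is relative to frontier-retry. Against a single frontier attempt the saving is smaller, about 45%, while the cascade solves more instances.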

	## Files That Matter

	### Production
	- `aco/aco_live.py` — Drop-in cascade wrapper. Three strategies: cascade, safe_proposal, safe_proposal_t2. Use this.
	- `config.yaml` — Configuration for models, tiers, thresholds.

	### Validation (NEW — this session)
	- `validate_cascade.py` — Full validation pipeline: clone repo → set up conda env → run cascade agent → verify patch via pytest. No Docker needed.
	- `quick_validate.py` — Single-instance quick validation. Fast feedback loop.
	- `dockerless_verify.py` — Docker-less verification utility. Replicates SWE-bench conda environments.
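As a rough picture of the clone → conda env → agent → pytest flow listed above, the sketch below only plans the shell commands for one instance and executes nothing. It is not the contents of `validate_cascade.py`: the function name, the `model.patch` filename, the default Python version, and the example arguments are assumptions, and real SWE-bench instances also need per-repo install steps that are omitted here.

```python
from typing import List, Optional, Tuple

# Each step is (working directory or None, argv), suitable for subprocess.run(argv, cwd=cwd).
Step = Tuple[Optional[str], List[str]]


def plan_validation(
    repo_url: str,
    base_commit: str,
    env_name: str,
    test_paths: List[str],
    python_version: str = "3.9",   # assumed default; the real version is per-instance
    workdir: str = "repo",
) -> List[Step]:
    """Return the ordered commands for one instance without running them."""
    return [
        (None, ["git", "clone", repo_url, workdir]),
        (workdir, ["git", "checkout", base_commit]),
        (None, ["conda", "create", "-y", "-n", env_name, f"python={python_version}"]),
        # The cascade agent would run at this point and write model.patch (not shown).
        (workdir, ["git", "apply", "model.patch"]),
        (workdir, ["conda", "run", "-n", env_name, "python", "-m", "pytest", *test_paths]),
    ]


plan = plan_validation(
    repo_url="https://github.com/django/django.git",
    base_commit="<base_commit>",          # placeholder, taken from the instance metadata
    env_name="swe-django",                # hypothetical environment name
    test_paths=["tests/"],                # placeholder test selection
)
for cwd, argv in plan:
    print(cwd or ".", "$", " ".join(argv))
```

Keeping the plan as data makes it easy to log or dry-run the steps before spending any model budget on an instance.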

	### Analysis
	- `CORRECTED_REPORT.md` — The corrected analysis with all 5 fixes.
	- `TRUTH.md` — Honest assessment of what works, what doesn't, and what's blocking.
	- `FIXES_COMPLETE.md` — Summary of the 5 critical-review fixes.

	### Dead Ends (documented, not deleted)
	- `docs/trained_router_final_report.md` — Why BERT and XGBoost routing failed.
	- `docs/pareto_frontier_report.md` — Pareto analysis of cost-quality tradeoffs.

	### Supporting Code
	- `aco/router.py` — Model cascade router.
	- `aco/telemetry.py` — Cost telemetry collector.
	- `aco/classifier.py` — Task cost classifier.
	- `aco/context_budgeter.py` — Context budget policy.
	- `aco/tool_gate.py` — Tool-use cost gate.
	- `aco/verifier_budgeter.py` — Verifier budget policy.
	- `aco/retry_optimizer.py` — Retry/recovery optimizer.
	- `aco/doom_detector.py` — Early termination (uses rescue, not terminate).
	- `aco/meta_tool_miner.py` — Macro tool pattern miner.
	- `aco/cache_layout.py` — Cache-aware prompt layout.
	- `aco/per_step_router.py` — Safe proposal routing.

	## Quick Start

	```python
	# Drop-in cascade wrapper
	from aco.aco_live import CascadeOptimizer

	opt = CascadeOptimizer(strategy="cascade")
	response = opt.run("Fix the null pointer in utils.py", context={...})
	```

	```bash
	# Full validation (single instance)
	python quick_validate.py django__django-14315

	# Full validation (batch)
	python validate_cascade.py --batch 5 --target cascade-only
	```

	## The Blocking Issue

	There is no Docker daemon on HF infrastructure, so we can't run SWE-bench test verification in the exact Docker environments.

	Current workaround and its limits:
	- `dockerless_verify.py` and `validate_cascade.py` use conda environments instead
	- Conda provides the same packages but at different filesystem paths, so test patches may not apply cleanly

	To unblock, run this on any host with a Docker daemon and API keys:
	```bash
	python cascade_agent.py --batch 50
	```
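One way to make the Docker-or-conda choice explicit is to probe for a reachable daemon before picking a verifier. This is a hedged sketch, not code from this repo: `docker_daemon_available` and `pick_verifier` are hypothetical helpers, and the probe relies on `docker info` exiting non-zero when no daemon is reachable.

```python
import shutil
import subprocess


def docker_daemon_available(which=shutil.which, run=subprocess.run) -> bool:
    """True if the docker CLI is on PATH and `docker info` can reach a daemon."""
    if which("docker") is None:
        return False
    try:
        proc = run(["docker", "info"], capture_output=True, timeout=20)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return proc.returncode == 0


def pick_verifier(docker_ok: bool) -> str:
    """Prefer the exact Docker environments; fall back to the conda workaround."""
    return "docker" if docker_ok else "conda"


print("verifier:", pick_verifier(docker_daemon_available()))
```

The `which` and `run` parameters exist only so the probe can be exercised without Docker installed.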

	## What Was Spent

	- All training: Wasted (~$0, free tier HF Jobs)
	- This session: CPU sandbox (free) + a few GPU inference calls (free via HF_TOKEN)
	- Total compute cost across all sessions: < $20 (mostly unused GPU sandbox time)

	## Next Steps (Priority Order)

	1. Validate cascade on 5-10 instances with conda ← WE ARE HERE
	2. Inspect the cascade-only T2 patches manually
	3. Run Docker agent on 50 instances (needs Docker daemon)
	4. Frontier with equal retries — give frontier 3 attempts, compare
	5. Add gpt-5.2-medium as tier — catches 14 frontier-retry-only instances
	6. Macro tool integration — deterministic subprocess calls
	7. Doom rescue — replace termination with targeted recovery

	## Bottom Line

	The cascade works on paper. The data is solid. The code is functional. Now we need:
	1. Validation — prove the conda-based pipeline works on 3-5 instances
	2. Docker — run the real thing at scale
	3. Inspection — human review of cascade-only patches
