# Agent Cost Optimizer: Living README

**Version:** 2026-05-11 (post-critique, post-cleanup)

## What This Is

A **static cascade router** for SWE-bench coding agents. It tries cheap models first and escalates to a frontier model only when needed. It costs ~56% less than frontier-retry at statistically equivalent quality. No machine learning. Just a for-loop.

## What This Is NOT

- A trained ML router (we tried, it failed: oracle gap too narrow)
- A doom detector that terminates runs (58-72% recover, so we rescue instead)
- A cache-aware prompt optimizer (SWE-bench prompts have only a 1.5% static prefix)
- A complete Agent Cost Optimizer framework (many components are analysis, not implementation)

## The One True Result

| Strategy | Solved | Cost | $/Solved |
|---|---|---|---|
| Cascade T1→T2→T4 | 416/500 (83.2%) | $86.33 | $0.21 |
| Frontier single | 391/500 (78.2%) | $158.34 | $0.40 |
| Frontier retry | 420/500 (84.0%) | $196.77 | $0.47 |

The cascade is statistically tied with frontier-retry on quality (95% CI on the solve-rate difference: [-2.8pp, +1.0pp]) but **56% cheaper** ($86.33 vs. $196.77).

## Files That Matter

### Production

- **`aco/aco_live.py`**: Drop-in cascade wrapper. Three strategies: `cascade`, `safe_proposal`, `safe_proposal_t2`. Use this.
- **`config.yaml`**: Configuration for models, tiers, thresholds.

### Validation (new this session)

- **`validate_cascade.py`**: Full validation pipeline: clone repo → set up conda env → run cascade agent → verify patch via pytest. No Docker needed.
- **`quick_validate.py`**: Single-instance quick validation. Fast feedback loop.
- **`dockerless_verify.py`**: Docker-less verification utility. Replicates SWE-bench conda environments.

### Analysis

- **`CORRECTED_REPORT.md`**: The corrected analysis with all 5 fixes.
- **`TRUTH.md`**: Honest assessment of what works, what doesn't, and what's blocking.
- **`FIXES_COMPLETE.md`**: Summary of the 5 critical-review fixes.

### Dead Ends (documented, not deleted)

- **`docs/trained_router_final_report.md`**: Why BERT and XGBoost routing failed.
- **`docs/pareto_frontier_report.md`**: Pareto analysis of cost-quality tradeoffs.

### Supporting Code

- **`aco/router.py`**: Model cascade router.
- **`aco/telemetry.py`**: Cost telemetry collector.
- **`aco/classifier.py`**: Task cost classifier.
- **`aco/context_budgeter.py`**: Context budget policy.
- **`aco/tool_gate.py`**: Tool-use cost gate.
- **`aco/verifier_budgeter.py`**: Verifier budget policy.
- **`aco/retry_optimizer.py`**: Retry/recovery optimizer.
- **`aco/doom_detector.py`**: Early termination (uses rescue, not terminate).
- **`aco/meta_tool_miner.py`**: Macro tool pattern miner.
- **`aco/cache_layout.py`**: Cache-aware prompt layout.
- **`aco/per_step_router.py`**: Safe proposal routing.

## Quick Start

```python
# Drop-in cascade wrapper
from aco.aco_live import CascadeOptimizer

opt = CascadeOptimizer(strategy="cascade")
response = opt.run("Fix the null pointer in utils.py", context={...})
```

```bash
# Full validation (single instance)
python quick_validate.py django__django-14315

# Full validation (batch)
python validate_cascade.py --batch 5 --target cascade-only
```

## The Blocking Issue

**No Docker daemon on HF infrastructure.** Without Docker, we can't run SWE-bench test verification in the exact Docker environments.

- The `dockerless_verify.py` + `validate_cascade.py` scripts use conda as a workaround.
- Conda provides the same packages but at different filesystem paths, so test patches may not apply cleanly.

**To unblock:** run the following on any host with a Docker daemon and API keys:

```bash
python cascade_agent.py --batch 50
```

## What Was Spent

- **All training:** Wasted (~$0, free tier HF Jobs)
- **This session:** CPU sandbox (free) + a few GPU inference calls (free via HF_TOKEN)
- **Total compute cost across all sessions:** < $20 (mostly unused GPU sandbox time)

## Next Steps (Priority Order)

1. **Validate cascade on 5-10 instances with conda** ← **WE ARE HERE**
2. **Inspect the cascade-only T2 patches manually**
3. **Run Docker agent on 50 instances** (needs Docker daemon)
4. **Frontier with equal retries**: give frontier 3 attempts, compare
5. **Add gpt-5.2-medium as tier**: catches 14 frontier-retry-only instances
6. **Macro tool integration**: deterministic subprocess calls
7. **Doom rescue**: replace termination with targeted recovery

## Bottom Line

The cascade works on paper. The data is solid. The code is functional. Now we need:

1. **Validation**: prove the conda-based pipeline works on 3-5 instances
2. **Docker**: run the real thing at scale
3. **Inspection**: human review of cascade-only patches
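
## Illustrative Sketch: The Cascade Loop

To make the "just a for-loop" description under What This Is concrete, here is a minimal sketch of a static cascade. It is not the code in `aco/aco_live.py`. The names `Tier`, `Attempt`, `run_cascade`, and the `accept` callback are hypothetical, the stub tiers stand in for real model calls, and the per-attempt costs are placeholder numbers rather than measured values. How the real wrapper decides that escalation is needed is not shown here, so `accept` is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Attempt:
    patch: Optional[str]  # candidate patch, or None if the tier produced nothing
    cost_usd: float       # cost of this single attempt


@dataclass
class Tier:
    name: str
    solve: Callable[[str], Attempt]  # hypothetical hook that calls this tier's model


@dataclass
class CascadeResult:
    solved_by: Optional[str]  # name of the tier whose patch was accepted, if any
    patch: Optional[str]
    total_cost_usd: float     # sum over every tier that was tried


def run_cascade(
    task: str,
    tiers: Sequence[Tier],
    accept: Callable[[str], bool],
) -> CascadeResult:
    """Try tiers from cheapest to most expensive; stop at the first accepted patch."""
    total = 0.0
    for tier in tiers:
        attempt = tier.solve(task)
        total += attempt.cost_usd  # failed cheap attempts still cost money
        if attempt.patch is not None and accept(attempt.patch):
            return CascadeResult(tier.name, attempt.patch, total)
    return CascadeResult(None, None, total)


def _stub(patch: Optional[str], cost: float) -> Callable[[str], Attempt]:
    """Build a fake tier that always returns the same attempt (for illustration)."""
    return lambda task: Attempt(patch=patch, cost_usd=cost)


# Placeholder tiers: T1 fails, T2 succeeds, so T4 is never called.
tiers = [
    Tier("T1", _stub(None, 0.02)),
    Tier("T2", _stub("diff --git a/utils.py b/utils.py", 0.10)),
    Tier("T4", _stub("diff --git a/other.py b/other.py", 0.60)),
]

result = run_cascade(
    "Fix the null pointer in utils.py",
    tiers,
    accept=lambda patch: patch.startswith("diff"),  # assumed acceptance check
)
print(result.solved_by, round(result.total_cost_usd, 2))
```

The design point is that the cost of a solved task includes every failed cheaper attempt before it, which is why the cascade only pays off when the cheap tiers solve a large share of instances on their own.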