| # Agent Cost Optimizer: Living README |
|
|
| **Version:** 2026-05-11 (post-critique, post-cleanup) |
|
|
| ## What This Is |
|
|
| A **static cascade router** for SWE-bench coding agents. It tries cheap models first and escalates to the frontier model only when needed. It costs ~56% less than frontier-with-retry at statistically equivalent quality. No machine learning. Just a for-loop. |
|
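To make "just a for-loop" concrete, here is a minimal sketch of a static cascade. It is not the code in `aco/aco_live.py`: the `Tier`, `CascadeResult`, and `run_cascade` names, the flat per-attempt costs, and the `accept` check are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Tier:
    name: str                               # hypothetical label, e.g. "T1"
    solve: Callable[[str], Optional[str]]   # returns a patch, or None on failure
    cost: float                             # hypothetical flat cost per attempt ($)


@dataclass
class CascadeResult:
    patch: Optional[str]
    tier: Optional[str]
    cost: float


def run_cascade(task: str, tiers: Sequence[Tier],
                accept: Callable[[str], bool]) -> CascadeResult:
    """Try tiers in fixed order; stop at the first patch that `accept` approves."""
    spent = 0.0
    for tier in tiers:
        spent += tier.cost  # every attempt is paid for, even a failed one
        patch = tier.solve(task)
        if patch is not None and accept(patch):
            return CascadeResult(patch, tier.name, spent)
    return CascadeResult(None, None, spent)


# Toy tiers: T1 fails, T2 produces an acceptable patch, so T4 is never called.
tiers = [
    Tier("T1", lambda task: None, 0.02),
    Tier("T2", lambda task: "diff --git a/utils.py b/utils.py", 0.10),
    Tier("T4", lambda task: "diff --git a/frontier b/frontier", 0.40),
]
result = run_cascade("Fix the null pointer in utils.py", tiers,
                     accept=lambda patch: patch.startswith("diff"))
print(result.tier, round(result.cost, 2))
```

The cost of failed cheap attempts is added to the total, which is why a cascade only pays off when the cheap tiers solve a large share of instances.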
|
| ## What This Is NOT |
|
|
| - A trained ML router (we tried, it failed: the oracle gap was too narrow) |
| - A doom detector that terminates runs (58-72% of runs recover, so we rescue instead of terminating) |
| - A cache-aware prompt optimizer (only 1.5% static prefix in SWE-bench prompts) |
| - A complete Agent Cost Optimizer framework (many components are analysis, not implementation) |
|
|
| ## The One True Result |
|
|
| | Strategy | Solved | Cost | $/Solved | |
| |---|---|---|---| |
| | Cascade T1→T2→T4 | 416/500 (83.2%) | $86.33 | $0.21 | |
| | Frontier single | 391/500 (78.2%) | $158.34 | $0.40 | |
| | Frontier retry | 420/500 (84.0%) | $196.77 | $0.47 | |
|
|
| The cascade is statistically tied with frontier-retry on quality (95% CI for the solve-rate difference: [-2.8pp, +1.0pp]) but **56% cheaper** ($86.33 vs. $196.77). |
|
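The derived columns can be rechecked from the raw counts and costs in the table. The sketch below uses only those numbers; the helper names `cost_per_solved` and `savings` are illustrative.

```python
# (solved, total cost in dollars) per strategy, copied from the table above
results = {
    "cascade": (416, 86.33),
    "frontier_single": (391, 158.34),
    "frontier_retry": (420, 196.77),
}
TOTAL = 500  # instances evaluated


def cost_per_solved(solved: int, cost: float) -> float:
    """Dollars spent per solved instance."""
    return cost / solved


def savings(cost: float, baseline: float) -> float:
    """Fractional cost reduction relative to a baseline."""
    return 1 - cost / baseline


for name, (solved, cost) in results.items():
    print(f"{name}: {solved / TOTAL:.1%} solved, "
          f"${cost_per_solved(solved, cost):.2f}/solved")

cascade_cost = results["cascade"][1]
vs_retry = savings(cascade_cost, results["frontier_retry"][1])
vs_single = savings(cascade_cost, results["frontier_single"][1])
print(f"vs frontier retry:  {vs_retry:.1%} cheaper")
print(f"vs frontier single: {vs_single:.1%} cheaper")
```

The 56% figure is relative to frontier retry. Against frontier single, the same arithmetic gives roughly 45%.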
|
| ## Files That Matter |
|
|
| ### Production |
| - **`aco/aco_live.py`**: Drop-in cascade wrapper. Three strategies: cascade, safe_proposal, safe_proposal_t2. Use this. |
| - **`config.yaml`**: Configuration for models, tiers, thresholds. |
|
|
| ### Validation (NEW, this session) |
| - **`validate_cascade.py`**: Full validation pipeline: clone repo → set up conda env → run cascade agent → verify patch via pytest. No Docker needed. |
| - **`quick_validate.py`**: Single-instance quick validation. Fast feedback loop. |
| - **`dockerless_verify.py`**: Docker-less verification utility. Replicates SWE-bench conda environments. |
| |
| ### Analysis |
| - **`CORRECTED_REPORT.md`**: The corrected analysis with all 5 fixes. |
| - **`TRUTH.md`**: Honest assessment of what works, what doesn't, and what's blocking. |
| - **`FIXES_COMPLETE.md`**: Summary of the 5 critical-review fixes. |
| |
| ### Dead Ends (documented, not deleted) |
| - **`docs/trained_router_final_report.md`**: Why BERT and XGBoost routing failed. |
| - **`docs/pareto_frontier_report.md`**: Pareto analysis of cost-quality tradeoffs. |
|
|
| ### Supporting Code |
| - **`aco/router.py`**: Model cascade router. |
| - **`aco/telemetry.py`**: Cost telemetry collector. |
| - **`aco/classifier.py`**: Task cost classifier. |
| - **`aco/context_budgeter.py`**: Context budget policy. |
| - **`aco/tool_gate.py`**: Tool-use cost gate. |
| - **`aco/verifier_budgeter.py`**: Verifier budget policy. |
| - **`aco/retry_optimizer.py`**: Retry/recovery optimizer. |
| - **`aco/doom_detector.py`**: Early termination (uses rescue, not terminate). |
| - **`aco/meta_tool_miner.py`**: Macro tool pattern miner. |
| - **`aco/cache_layout.py`**: Cache-aware prompt layout. |
| - **`aco/per_step_router.py`**: Safe proposal routing. |
|
|
| ## Quick Start |
|
|
| ```python |
| # Drop-in cascade wrapper |
| from aco.aco_live import CascadeOptimizer |
| |
| opt = CascadeOptimizer(strategy="cascade") |
| response = opt.run("Fix the null pointer in utils.py", context={...}) |
| ``` |
| |
| ```bash |
| # Quick validation (single instance) |
| python quick_validate.py django__django-14315 |
| |
| # Full validation (batch) |
| python validate_cascade.py --batch 5 --target cascade-only |
| ``` |
|
|
| ## The Blocking Issue |
|
|
| **No Docker daemon on HF infrastructure.** Without Docker, we can't run SWE-bench test verification in the exact Docker environments. The current workaround and its limits: |
| - The `dockerless_verify.py` + `validate_cascade.py` scripts use conda in place of Docker. |
| - Conda provides the same packages but at different filesystem paths, so test patches may not apply cleanly. |
|
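As a rough illustration of the conda workaround, the sketch below builds the commands for a Docker-less check: test whether the patch applies, apply it, then run the tests inside a conda environment. It is not the logic of `dockerless_verify.py`; the function name, environment name, patch path, and test ID are hypothetical. Only `git apply --check` and `conda run -n` are real CLI features.

```python
import shlex
from typing import List


def build_verify_commands(env_name: str, patch_path: str,
                          test_ids: List[str]) -> List[List[str]]:
    """Return argv lists for a Docker-less verification run (sketch only)."""
    return [
        # Dry run: reports whether the patch applies, without touching files.
        ["git", "apply", "--check", patch_path],
        # Apply the patch to the working tree.
        ["git", "apply", patch_path],
        # Run the selected tests inside the named conda environment.
        ["conda", "run", "-n", env_name, "python", "-m", "pytest", *test_ids],
    ]


# Hypothetical environment, patch file, and test ID, for illustration only.
commands = build_verify_commands(
    "swe-env", "model.patch", ["tests/test_utils.py::test_null"]
)
for argv in commands:
    print(shlex.join(argv))
```

Running the `--check` step first surfaces the path-mismatch problem early: if the patch does not apply cleanly, there is no point running the tests.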
|
| **To unblock:** Any host with Docker daemon + API keys runs: |
| ```bash |
| python cascade_agent.py --batch 50 |
| ``` |
|
|
| ## What Was Spent |
|
|
| - **All training:** Wasted (~$0, free tier HF Jobs) |
| - **This session:** CPU sandbox (free) + a few GPU inference calls (free via HF_TOKEN) |
| - **Total compute cost across all sessions:** < $20 (mostly unused GPU sandbox time) |
| |
| ## Next Steps (Priority Order) |
| |
| 1. **Validate cascade on 5-10 instances with conda** (**WE ARE HERE**) |
| 2. **Inspect the cascade-only T2 patches manually** |
| 3. **Run Docker agent on 50 instances** (needs Docker daemon) |
| 4. **Frontier with equal retries**: give frontier 3 attempts, compare |
| 5. **Add gpt-5.2-medium as a tier**: catches 14 frontier-retry-only instances |
| 6. **Macro tool integration**: deterministic subprocess calls |
| 7. **Doom rescue**: replace termination with targeted recovery |
| |
| ## Bottom Line |
| |
| The cascade works on paper. The data is solid. The code is functional. Now we need: |
| 1. **Validation**: prove the conda-based pipeline works on 3-5 instances |
| 2. **Docker**: run the real thing at scale |
| 3. **Inspection**: human review of cascade-only patches |
| |