Agent Cost Optimizer — Living README

Version: 2026-05-11 (post-critique, post-cleanup)

What This Is

A static cascade router for SWE-bench coding agents. It tries cheap models first and escalates to a frontier model only when needed. On 500 instances it costs about 56% less than frontier-retry at statistically equivalent quality. No machine learning, just a for-loop.
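The for-loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in aco/aco_live.py: the names `run_cascade`, `CascadeResult`, `solve`, and `accept` are hypothetical, and because this README does not say how the router decides that escalation is needed, `accept` is left as a caller-supplied placeholder.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class CascadeResult:
    patch: Optional[str]   # accepted patch, or None if every tier failed
    total_cost: float      # cost summed over every tier that actually ran
    tiers_tried: List[str] = field(default_factory=list)


def run_cascade(
    task: str,
    tiers: Sequence[str],
    solve: Callable[[str, str], Tuple[Optional[str], float]],
    accept: Callable[[str], bool],
) -> CascadeResult:
    """Try tiers in order (cheapest first) and stop at the first accepted patch.

    `solve(tier, task)` is a hypothetical callable returning (patch, cost).
    `accept(patch)` is a hypothetical check that decides whether to stop.
    """
    result = CascadeResult(patch=None, total_cost=0.0)
    for tier in tiers:
        patch, cost = solve(tier, task)
        result.total_cost += cost          # failed attempts still cost money
        result.tiers_tried.append(tier)
        if patch is not None and accept(patch):
            result.patch = patch
            break                          # no escalation needed
    return result


if __name__ == "__main__":
    # Toy solver: the cheapest tier fails, the middle tier succeeds.
    toy_patches = {"T1": None, "T2": "diff --git a/utils.py b/utils.py", "T4": "unused"}
    toy_costs = {"T1": 0.25, "T2": 0.5, "T4": 2.0}

    def toy_solve(tier: str, task: str) -> Tuple[Optional[str], float]:
        return toy_patches[tier], toy_costs[tier]

    outcome = run_cascade("Fix the null pointer in utils.py",
                          ["T1", "T2", "T4"], toy_solve, lambda patch: True)
    print(outcome.tiers_tried, outcome.total_cost)
```

The point of the sketch is that the cost of a failed cheap attempt is added to the total, so the cascade only pays off when the cheap tiers succeed often enough.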

What This Is NOT

A trained ML router (we tried, and it failed: the oracle gap was too narrow)
A doom detector that terminates runs (58-72% of such runs recover, so we rescue instead of terminating)
A cache-aware prompt optimizer (SWE-bench prompts have only a 1.5% static prefix, so there is little to cache)
A complete Agent Cost Optimizer framework (many components are analysis, not implementation)

The One True Result

Strategy	Solved	Cost	$/Solved
Cascade T1→T2→T4	416/500 (83.2%)	$86.33	$0.21
Frontier single	391/500 (78.2%)	$158.34	$0.40
Frontier retry	420/500 (84.0%)	$196.77	$0.47

The cascade is statistically tied with frontier-retry on quality (95% CI for the difference in solve rate: [-2.8pp, +1.0pp], which includes zero) but 56% cheaper ($86.33 vs. $196.77).
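The derived figures can be re-computed from the table with a few lines of Python. This is only a worked check of the arithmetic; the function names are illustrative. The confidence interval is not reproduced because it needs the per-instance outcomes, which this README does not include.

```python
from decimal import Decimal, ROUND_HALF_UP

TOTAL_INSTANCES = 500

# (instances solved, total cost in USD), copied from the table above.
RESULTS = {
    "cascade": (416, Decimal("86.33")),
    "frontier_single": (391, Decimal("158.34")),
    "frontier_retry": (420, Decimal("196.77")),
}


def solve_rate_pct(name: str) -> Decimal:
    """Solve rate as a percentage of all instances."""
    solved, _ = RESULTS[name]
    return Decimal(solved) * 100 / TOTAL_INSTANCES


def cost_per_solved(name: str) -> Decimal:
    """Total cost divided by solved instances, rounded to cents."""
    solved, cost = RESULTS[name]
    return (cost / solved).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def savings_pct(candidate: str, baseline: str) -> Decimal:
    """How much cheaper `candidate` is than `baseline`, in percent."""
    _, candidate_cost = RESULTS[candidate]
    _, baseline_cost = RESULTS[baseline]
    saved = (1 - candidate_cost / baseline_cost) * 100
    return saved.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    for name in RESULTS:
        print(name, solve_rate_pct(name), cost_per_solved(name))
    print("vs frontier_retry:", savings_pct("cascade", "frontier_retry"))
    print("vs frontier_single:", savings_pct("cascade", "frontier_single"))
```

Note that the 56% figure is relative to frontier-retry. Against a single frontier attempt the same calculation gives a smaller saving, with the cascade also solving more instances.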

Files That Matter

Production

aco/aco_live.py — Drop-in cascade wrapper. Three strategies: cascade, safe_proposal, safe_proposal_t2. Use this.
config.yaml — Configuration for models, tiers, thresholds.

Validation (NEW — this session)

validate_cascade.py — Full validation pipeline: clone repo → set up conda env → run cascade agent → verify patch via pytest. No Docker needed.
quick_validate.py — Single-instance quick validation. Fast feedback loop.
dockerless_verify.py — Docker-less verification utility. Replicates SWE-bench conda environments.

Analysis

CORRECTED_REPORT.md — The corrected analysis with all 5 fixes.
TRUTH.md — Honest assessment of what works, what doesn't, and what's blocking.
FIXES_COMPLETE.md — Summary of the 5 critical-review fixes.

Dead Ends (documented, not deleted)

docs/trained_router_final_report.md — Why BERT and XGBoost routing failed.
docs/pareto_frontier_report.md — Pareto analysis of cost-quality tradeoffs.

Supporting Code

aco/router.py — Model cascade router.
aco/telemetry.py — Cost telemetry collector.
aco/classifier.py — Task cost classifier.
aco/context_budgeter.py — Context budget policy.
aco/tool_gate.py — Tool-use cost gate.
aco/verifier_budgeter.py — Verifier budget policy.
aco/retry_optimizer.py — Retry/recovery optimizer.
aco/doom_detector.py — Early termination (uses rescue, not terminate).
aco/meta_tool_miner.py — Macro tool pattern miner.
aco/cache_layout.py — Cache-aware prompt layout.
aco/per_step_router.py — Safe proposal routing.

Quick Start

# Drop-in cascade wrapper (Python)
from aco.aco_live import CascadeOptimizer

opt = CascadeOptimizer(strategy="cascade")
response = opt.run("Fix the null pointer in utils.py", context={...})

# Quick validation, single instance (shell)
python quick_validate.py django__django-14315

# Full validation, batch (shell)
python validate_cascade.py --batch 5 --target cascade-only

The Blocking Issue

No Docker daemon on HF infrastructure. Without Docker, we can't run SWE-bench test verification in the exact Docker environments. The current workaround and its limitation:

The dockerless_verify.py + validate_cascade.py scripts use conda instead of Docker
Conda provides the same packages but at different filesystem paths, so test patches may not apply cleanly
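A conda-based check along these lines might look like the sketch below. It is an assumption about the general shape of such a step, not the contents of dockerless_verify.py: `build_verify_commands` and `verify_patch` are hypothetical names, and the environment name, patch path, and test IDs are supplied by the caller. Running `git apply --check` first surfaces the "patch does not apply cleanly" failure before any tests run.

```python
import subprocess
from typing import Callable, List, Sequence


def build_verify_commands(
    env_name: str, patch_path: str, test_ids: Sequence[str]
) -> List[List[str]]:
    """Return the commands to run, in order, inside the cloned repo.

    `patch_path` should be absolute, since the commands run with the
    repository as the working directory.
    """
    return [
        # Dry run: fails without touching the tree if the patch does not apply.
        ["git", "apply", "--check", patch_path],
        ["git", "apply", patch_path],
        # Run the selected tests inside the named conda environment.
        ["conda", "run", "-n", env_name, "python", "-m", "pytest", *test_ids],
    ]


def verify_patch(
    repo_dir: str,
    env_name: str,
    patch_path: str,
    test_ids: Sequence[str],
    run: Callable = subprocess.run,
) -> bool:
    """Apply the patch and run the tests; stop at the first failing command."""
    for cmd in build_verify_commands(env_name, patch_path, test_ids):
        completed = run(cmd, cwd=repo_dir, capture_output=True, text=True)
        if completed.returncode != 0:
            return False
    return True
```

The `run` parameter exists only so the function can be exercised with a stub when conda or git is unavailable. A passing result here is weaker evidence than a pass in the official Docker image, for the path-mismatch reason given above.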

To unblock: on any host with a Docker daemon and API keys, run:

python cascade_agent.py --batch 50

What Was Spent

All training: Wasted (~$0, free tier HF Jobs)
This session: CPU sandbox (free) + a few GPU inference calls (free via HF_TOKEN)
Total compute cost across all sessions: < $20 (mostly unused GPU sandbox time)

Next Steps (Priority Order)

Validate cascade on 5-10 instances with conda ← WE ARE HERE
Inspect the cascade-only T2 patches manually
Run Docker agent on 50 instances (needs Docker daemon)
Frontier with equal retries — give frontier 3 attempts, compare
Add gpt-5.2-medium as tier — catches 14 frontier-retry-only instances
Macro tool integration — deterministic subprocess calls
Doom rescue — replace termination with targeted recovery

Bottom Line

The cascade works on paper. The data is solid. The code is functional. Now we need:

Validation — prove the conda-based pipeline works on 3-5 instances
Docker — run the real thing at scale
Inspection — human review of cascade-only patches
