
Agent Cost Optimizer – Living README

Version: 2026-05-11 (post-critique, post-cleanup)

What This Is

A static cascade router for SWE-bench coding agents. Tries cheap models first, only escalates to frontier when needed. Saves ~56% cost at statistically equivalent quality. No machine learning. Just a for-loop.
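That for-loop can be sketched in a few lines. Tier names, per-call costs, and the `verify` callback below are illustrative placeholders, not the actual `aco/aco_live.py` API:

```python
# Minimal sketch of the static cascade: try tiers cheapest-first and
# return the first result that passes verification. All names here
# are illustrative, not the real aco/aco_live.py interface.

def run_cascade(task, tiers, verify):
    """tiers: list of (name, solve_fn) ordered cheapest-first.
    solve_fn(task) -> (patch, cost); verify(patch) -> bool."""
    total_cost = 0.0
    for name, solve in tiers:
        patch, cost = solve(task)
        total_cost += cost
        if verify(patch):
            return {"tier": name, "patch": patch, "cost": total_cost}
    # No tier produced a verified patch; surface the last attempt.
    return {"tier": None, "patch": patch, "cost": total_cost}
```

Escalation only happens when verification fails, so easy instances never pay frontier prices.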

What This Is NOT

  • A trained ML router (we tried, it failed – the oracle gap was too narrow)
  • A doom detector that terminates runs (58-72% of "doomed" runs recover, so we rescue instead)
  • A cache-aware prompt optimizer (SWE-bench prompts have only ~1.5% static prefix)
  • A complete Agent Cost Optimizer framework (many components are analysis, not implementation)

The One True Result

Strategy            Solved             Cost      $/Solved
Cascade T1→T2→T4    416/500 (83.2%)    $86.33    $0.21
Frontier single     391/500 (78.2%)    $158.34   $0.40
Frontier retry      420/500 (84.0%)    $196.77   $0.47

The cascade is statistically tied with frontier-retry on quality (95% CI: [-2.8pp, +1.0pp]) but 56% cheaper.
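For intuition, a normal-approximation confidence interval for the difference of two proportions can be computed directly. Note this unpaired sketch produces a wider interval than the reported [-2.8pp, +1.0pp], which presumably comes from a paired per-instance comparison on the same 500 tasks; the sketch is not the report's method:

```python
import math

def diff_ci(k1, n1, k2, n2, z=1.96):
    """Unpaired 95% normal-approximation CI for p1 - p2,
    returned in percentage points."""
    p1, p2 = k1 / n1, k2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return (100 * (d - z * se), 100 * (d + z * se))

# Cascade (416/500) vs. frontier-retry (420/500)
lo, hi = diff_ci(416, 500, 420, 500)
```

Since the interval straddles zero, the 0.8pp solve-rate gap is not distinguishable from noise at this sample size.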

Files That Matter

Production

  • aco/aco_live.py – Drop-in cascade wrapper. Three strategies: cascade, safe_proposal, safe_proposal_t2. Use this.
  • config.yaml – Configuration for models, tiers, thresholds.
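The exact schema of config.yaml isn't reproduced here; the fragment below is a hypothetical sketch of the shape such a file might take, and every field name in it is an assumption:

```yaml
# Hypothetical config.yaml shape — illustrative only, not the real schema.
tiers:
  - name: T1
    model: cheap-model-id
    max_attempts: 1
  - name: T2
    model: mid-model-id
    max_attempts: 1
  - name: T4
    model: frontier-model-id
    max_attempts: 2
thresholds:
  escalate_on: verification_failure
budget:
  max_cost_per_instance_usd: 1.00
```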

Validation (NEW – this session)

  • validate_cascade.py – Full validation pipeline: clone repo → set up conda env → run cascade agent → verify patch via pytest. No Docker needed.
  • quick_validate.py – Single-instance quick validation. Fast feedback loop.
  • dockerless_verify.py – Docker-less verification utility. Replicates SWE-bench conda environments.
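The apply-then-test step of that pipeline can be sketched with subprocess. The function below is illustrative and far simpler than what dockerless_verify.py actually handles (per-instance test subsets, conda env activation); it is not that script's API:

```python
import subprocess

def verify_patch(repo_dir, patch_file,
                 test_cmd=("python", "-m", "pytest", "-x")):
    """Apply a patch to a git checkout and run its tests.
    Returns True iff the patch applies cleanly AND the tests pass.
    Illustrative sketch: real SWE-bench verification also needs the
    instance's specific FAIL_TO_PASS tests and conda environment."""
    # Dry-run first: a patch that doesn't apply is an instant failure.
    check = subprocess.run(["git", "apply", "--check", patch_file],
                           cwd=repo_dir, capture_output=True)
    if check.returncode != 0:
        return False
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
    result = subprocess.run(list(test_cmd), cwd=repo_dir)
    return result.returncode == 0
```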

Analysis

  • CORRECTED_REPORT.md – The corrected analysis with all 5 fixes.
  • TRUTH.md – Honest assessment of what works, what doesn't, and what's blocking.
  • FIXES_COMPLETE.md – Summary of the 5 critical-review fixes.

Dead Ends (documented, not deleted)

  • docs/trained_router_final_report.md – Why BERT and XGBoost routing failed.
  • docs/pareto_frontier_report.md – Pareto analysis of cost-quality tradeoffs.

Supporting Code

  • aco/router.py – Model cascade router.
  • aco/telemetry.py – Cost telemetry collector.
  • aco/classifier.py – Task cost classifier.
  • aco/context_budgeter.py – Context budget policy.
  • aco/tool_gate.py – Tool-use cost gate.
  • aco/verifier_budgeter.py – Verifier budget policy.
  • aco/retry_optimizer.py – Retry/recovery optimizer.
  • aco/doom_detector.py – Doom detection (rescues rather than terminating).
  • aco/meta_tool_miner.py – Macro tool pattern miner.
  • aco/cache_layout.py – Cache-aware prompt layout.
  • aco/per_step_router.py – Safe proposal routing.

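The "rescue, not terminate" policy behind aco/doom_detector.py can be illustrated with a toy signal. The real heuristics live in that module; everything below (the signal, the rescue action) is an assumption for illustration:

```python
# Toy "rescue, not terminate" policy: when a run looks doomed
# (here: the last few actions are identical failures), inject a
# recovery step instead of killing the run. Since 58-72% of
# "doomed" runs recover, termination throws away solvable work.

def looks_doomed(history, window=3):
    """Toy doom signal: last `window` actions are the same error."""
    tail = history[-window:]
    return (len(tail) == window
            and len(set(tail)) == 1
            and tail[0].startswith("ERROR"))

def step_with_rescue(history, next_action):
    """Replace the next action with a corrective hint when doomed."""
    if looks_doomed(history):
        return "RESCUE: re-read failing test output and replan"
    return next_action
```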
Quick Start

# Drop-in cascade wrapper
from aco.aco_live import CascadeOptimizer

opt = CascadeOptimizer(strategy="cascade")
response = opt.run("Fix the null pointer in utils.py", context={...})

# Full validation (single instance)
python quick_validate.py django__django-14315

# Full validation (batch)
python validate_cascade.py --batch 5 --target cascade-only

The Blocking Issue

No Docker daemon on HF infrastructure. Without Docker:

  • SWE-bench test verification can't run in the exact Docker environments the benchmark specifies.
  • dockerless_verify.py and validate_cascade.py fall back to conda as a workaround.
  • Conda provides the same packages but at different filesystem paths, so test patches may not apply cleanly.

To unblock: any host with a Docker daemon and API keys can run:

python cascade_agent.py --batch 50

What Was Spent

  • All training experiments: wasted effort, but ~$0 spent (free-tier HF Jobs)
  • This session: CPU sandbox (free) + a few GPU inference calls (free via HF_TOKEN)
  • Total compute cost across all sessions: < $20 (mostly unused GPU sandbox time)

Next Steps (Priority Order)

  1. Validate cascade on 5-10 instances with conda ← WE ARE HERE
  2. Inspect the cascade-only T2 patches manually
  3. Run Docker agent on 50 instances (needs Docker daemon)
  4. Frontier with equal retries – give frontier 3 attempts, compare
  5. Add gpt-5.2-medium as a tier – catches 14 frontier-retry-only instances
  6. Macro tool integration – deterministic subprocess calls
  7. Doom rescue – replace termination with targeted recovery

Bottom Line

The cascade works on paper. The data is solid. The code is functional. Now we need:

  1. Validation – prove the conda-based pipeline works on 3-5 instances
  2. Docker – run the real thing at scale
  3. Inspection – human review of cascade-only patches