# Agent Cost Optimizer — Living README

**Version:** 2026-05-11 (post-critique, post-cleanup)

## What This Is

A **static cascade router** for SWE-bench coding agents. It tries cheap models first and escalates to the frontier model only when needed. Compared with frontier-with-retry, it cuts cost by ~56% at statistically equivalent quality. No machine learning. Just a for-loop.
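The "just a for-loop" idea can be sketched as below. This is a minimal illustration under stated assumptions, not the code in `aco/aco_live.py`: `Attempt`, `run_cascade`, the `accept` callback, and the stub tiers are all hypothetical names, and `accept` stands in for whatever escalation signal the real wrapper uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Attempt:
    tier: str        # name of the tier that produced this attempt
    patch: str       # candidate patch text
    cost_usd: float  # cost of this single model call


def run_cascade(
    task: str,
    tiers: Sequence[tuple[str, Callable[[str], Attempt]]],
    accept: Callable[[Attempt], bool],
) -> tuple[Optional[Attempt], float]:
    """Try tiers in order; return the first accepted attempt and total spend."""
    total_cost = 0.0
    for _name, call_model in tiers:
        attempt = call_model(task)
        total_cost += attempt.cost_usd  # failed attempts still cost money
        if accept(attempt):
            return attempt, total_cost
    return None, total_cost  # every tier was tried and none was accepted


# Usage with stub models (hypothetical): T1 fails, T2 succeeds, T4 is never called.
def stub(tier: str, ok: bool, cost: float) -> Callable[[str], Attempt]:
    return lambda task: Attempt(tier=tier, patch="ok" if ok else "", cost_usd=cost)


demo_tiers = [
    ("T1", stub("T1", False, 0.01)),
    ("T2", stub("T2", True, 0.05)),
    ("T4", stub("T4", True, 0.50)),
]
winner, spent = run_cascade("demo task", demo_tiers, accept=lambda a: a.patch == "ok")
print(winner.tier, f"{spent:.2f}")  # prints: T2 0.06
```

The total includes the cost of rejected attempts, which is why a cascade only pays off when the cheap tiers solve a large share of tasks.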

## What This Is NOT

- A trained ML router (we tried, it failed — oracle gap too narrow)
- A doom detector that terminates runs (58-72% of such runs recover, so we rescue instead)
- A cache-aware prompt optimizer (only 1.5% static prefix in SWE-bench prompts)
- A complete Agent Cost Optimizer framework (many components are analysis, not implementation)
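To make the "static prefix" figure above concrete, here is one way such a share could be measured. It is a rough sketch, not the project's measurement: `static_prefix_fraction` is a hypothetical helper, and it counts characters rather than tokens, which is an assumption.

```python
import os
from typing import Sequence


def static_prefix_fraction(prompts: Sequence[str]) -> float:
    """Share of the average prompt taken up by the prefix common to all prompts."""
    if not prompts:
        return 0.0
    # os.path.commonprefix compares character by character, so it works on any strings.
    shared = len(os.path.commonprefix(list(prompts)))
    mean_len = sum(len(p) for p in prompts) / len(prompts)
    return shared / mean_len if mean_len else 0.0


# Toy prompts (hypothetical): only the leading "SYSTEM: fix bug " is shared.
example = ["SYSTEM: fix bug A in foo", "SYSTEM: fix bug B in bar"]
print(f"{static_prefix_fraction(example):.3f}")
```

A low value means most of each prompt is instance-specific, so prefix caching has little to reuse.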

## The One True Result

| Strategy | Solved | Cost | $/Solved |
|---|---|---|---|
| Cascade T1→T2→T4 | 416/500 (83.2%) | $86.33 | $0.21 |
| Frontier single | 391/500 (78.2%) | $158.34 | $0.40 |
| Frontier retry | 420/500 (84.0%) | $196.77 | $0.47 |

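The cascade is statistically tied with frontier retry on quality (95% CI on the solve-rate difference: [-2.8pp, +1.0pp]) but **56% cheaper** ($86.33 vs. $196.77). Compared with a single frontier attempt, it solves 5.0pp more instances at about 45% lower cost.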
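The headline percentages follow directly from the table. The short check below recomputes them; the dictionary layout and helper names are illustrative only, and the numbers are copied from the table.

```python
# (solved instances, total cost in USD), copied from the results table
RESULTS = {
    "cascade": (416, 86.33),
    "frontier_single": (391, 158.34),
    "frontier_retry": (420, 196.77),
}
TOTAL_INSTANCES = 500


def cost_per_solved(name: str) -> float:
    solved, cost = RESULTS[name]
    return cost / solved


def solve_rate(name: str) -> float:
    return RESULTS[name][0] / TOTAL_INSTANCES


def savings_vs(baseline: str, candidate: str = "cascade") -> float:
    """Fractional cost reduction of `candidate` relative to `baseline`."""
    return 1.0 - RESULTS[candidate][1] / RESULTS[baseline][1]


for name in RESULTS:
    print(f"{name}: {solve_rate(name):.1%} solved, ${cost_per_solved(name):.4f} per solved")

print(f"savings vs frontier_retry:  {savings_vs('frontier_retry'):.1%}")   # 56.1%
print(f"savings vs frontier_single: {savings_vs('frontier_single'):.1%}")  # 45.5%
```

Note that this only reproduces the cost arithmetic. It does not reproduce the confidence interval, which depends on per-instance outcomes not shown here.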

## Files That Matter

### Production
- **`aco/aco_live.py`** — Drop-in cascade wrapper. Three strategies: cascade, safe_proposal, safe_proposal_t2. Use this.
- **`config.yaml`** — Configuration for models, tiers, thresholds.

### Validation (NEW — this session)
- **`validate_cascade.py`** — Full validation pipeline: clone repo → set up conda env → run cascade agent → verify patch via pytest. No Docker needed.
- **`quick_validate.py`** — Single-instance quick validation. Fast feedback loop.
- **`dockerless_verify.py`** — Docker-less verification utility. Replicates SWE-bench conda environments.

### Analysis
- **`CORRECTED_REPORT.md`** — The corrected analysis with all 5 fixes.
- **`TRUTH.md`** — Honest assessment of what works, what doesn't, and what's blocking.
- **`FIXES_COMPLETE.md`** — Summary of the 5 critical-review fixes.

### Dead Ends (documented, not deleted)
- **`docs/trained_router_final_report.md`** — Why BERT and XGBoost routing failed.
- **`docs/pareto_frontier_report.md`** — Pareto analysis of cost-quality tradeoffs.

### Supporting Code
- **`aco/router.py`** — Model cascade router.
- **`aco/telemetry.py`** — Cost telemetry collector.
- **`aco/classifier.py`** — Task cost classifier.
- **`aco/context_budgeter.py`** — Context budget policy.
- **`aco/tool_gate.py`** — Tool-use cost gate.
- **`aco/verifier_budgeter.py`** — Verifier budget policy.
- **`aco/retry_optimizer.py`** — Retry/recovery optimizer.
- **`aco/doom_detector.py`** — Early termination (uses rescue, not terminate).
- **`aco/meta_tool_miner.py`** — Macro tool pattern miner.
- **`aco/cache_layout.py`** — Cache-aware prompt layout.
- **`aco/per_step_router.py`** — Safe proposal routing.

## Quick Start

```python
# Drop-in cascade wrapper
from aco.aco_live import CascadeOptimizer

opt = CascadeOptimizer(strategy="cascade")
response = opt.run("Fix the null pointer in utils.py", context={...})
```

```bash
# Full validation (single instance)
python quick_validate.py django__django-14315

# Full validation (batch)
python validate_cascade.py --batch 5 --target cascade-only
```

## The Blocking Issue

**No Docker daemon on HF infrastructure.** Without Docker, we can't run SWE-bench test verification in the exact Docker environments. The current workaround and its limitation:
- The `dockerless_verify.py` + `validate_cascade.py` scripts use conda in place of Docker
- Conda provides the same packages but at different filesystem paths → test patches may not apply cleanly
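Since patches may not apply cleanly in the conda setup, one cheap guard is to dry-run the patch before spending time on tests. The sketch below uses `git apply --check`, which reports whether a patch would apply without modifying the working tree. `patch_applies_cleanly` is a hypothetical helper and is not taken from `dockerless_verify.py`. The `run` parameter exists only so the function can be exercised without a real repository.

```python
import subprocess
from pathlib import Path
from typing import Callable, Union

PathLike = Union[str, Path]


def patch_applies_cleanly(
    repo_dir: PathLike,
    patch_file: PathLike,
    run: Callable = subprocess.run,
) -> bool:
    """Return True if `git apply --check` accepts the patch inside repo_dir."""
    result = run(
        # Resolve the patch path first, because git runs with cwd=repo_dir.
        ["git", "apply", "--check", str(Path(patch_file).resolve())],
        cwd=str(repo_dir),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

A validation script could call this before the pytest step and record "patch did not apply" separately from "tests failed", so environment problems are not counted as model failures.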

**To unblock:** On any host with a Docker daemon and API keys, run:
```bash
python cascade_agent.py --batch 50
```

## What Was Spent

- **All training:** Wasted (~$0, free tier HF Jobs)
- **This session:** CPU sandbox (free) + a few GPU inference calls (free via HF_TOKEN)
- **Total compute cost across all sessions:** < $20 (mostly unused GPU sandbox time)

## Next Steps (Priority Order)

1. **Validate cascade on 5-10 instances with conda** ← **WE ARE HERE**
2. **Inspect the cascade-only T2 patches manually**
3. **Run Docker agent on 50 instances** (needs Docker daemon)
4. **Frontier with equal retries** — give frontier 3 attempts, compare
5. **Add gpt-5.2-medium as tier** — catches 14 frontier-retry-only instances
6. **Macro tool integration** — deterministic subprocess calls
7. **Doom rescue** — replace termination with targeted recovery

## Bottom Line

The cascade works on paper. The data is solid. The code is functional. Now we need:
1. **Validation** — prove the conda-based pipeline works on 3-5 instances
2. **Docker** — run the real thing at scale
3. **Inspection** — human review of cascade-only patches