# Agent Cost Optimizer: Living README
**Version:** 2026-05-11 (post-critique, post-cleanup)
## What This Is
A **static cascade router** for SWE-bench coding agents. It tries cheap models first and escalates to a frontier model only when needed, saving ~56% of the cost relative to frontier-retry at statistically equivalent quality. No machine learning. Just a for-loop.
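The "just a for-loop" idea can be sketched as follows. This is a minimal illustration, not the code in `aco/aco_live.py`: the `Tier` class, the flat per-attempt `cost`, and the `accept` callback (standing in for whatever check decides that escalation is needed) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Tier:
    name: str                              # e.g. "T1", "T2", "T4"
    cost: float                            # hypothetical flat dollars per attempt
    solve: Callable[[str], Optional[str]]  # returns a candidate patch, or None


def run_cascade(task, tiers, accept):
    """Try tiers cheapest-first; stop at the first patch that `accept` approves."""
    spent = 0.0
    for tier in tiers:
        patch = tier.solve(task)
        spent += tier.cost  # every attempt is paid for, accepted or not
        if patch is not None and accept(patch):
            return {"tier": tier.name, "patch": patch, "cost": spent}
    return {"tier": None, "patch": None, "cost": spent}


# Stub tiers standing in for real model calls (hypothetical costs).
tiers = [
    Tier("T1", 0.05, lambda task: None),
    Tier("T2", 0.20, lambda task: "patch-from-T2"),
    Tier("T4", 1.00, lambda task: "patch-from-T4"),
]
result = run_cascade("Fix the null pointer in utils.py", tiers, accept=lambda p: True)
print(result["tier"], round(result["cost"], 2))
```

The tier order is fixed ahead of time, which is what makes the router static: nothing is learned, and the only decision made at run time is whether the current tier's output is accepted.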
## What This Is NOT
- A trained ML router (we tried, and it failed: the oracle gap is too narrow)
- A doom detector that terminates runs (58-72% of such runs recover, so we rescue instead)
- A cache-aware prompt optimizer (only 1.5% of SWE-bench prompts is a static prefix)
- A complete Agent Cost Optimizer framework (many components are analysis, not implementation)
## The One True Result
| Strategy | Solved | Cost | $/Solved |
|---|---|---|---|
| Cascade T1→T2→T4 | 416/500 (83.2%) | $86.33 | $0.21 |
| Frontier single | 391/500 (78.2%) | $158.34 | $0.40 |
| Frontier retry | 420/500 (84.0%) | $196.77 | $0.47 |
The cascade is statistically tied with frontier-retry on quality (95% CI: [-2.8pp, +1.0pp]) but **56% cheaper**.
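The derived figures can be reproduced from the table with a few lines of arithmetic. The sketch below uses only the solved counts and total costs reported above; the dictionary keys and function names are illustrative.

```python
TOTAL = 500

# (solved instances, total cost in dollars), copied from the table above.
results = {
    "cascade": (416, 86.33),
    "frontier_single": (391, 158.34),
    "frontier_retry": (420, 196.77),
}


def cost_per_solved(solved, cost):
    return cost / solved


def savings(baseline_cost, cost):
    """Fraction of the baseline cost that is saved."""
    return 1 - cost / baseline_cost


for name, (solved, cost) in results.items():
    print(f"{name}: {solved / TOTAL:.1%} solved, ${cost_per_solved(solved, cost):.2f} per solved")

cascade_solved, cascade_cost = results["cascade"]
retry_solved, retry_cost = results["frontier_retry"]

# Point estimate of the quality gap in percentage points (cascade minus frontier-retry).
gap_pp = (cascade_solved - retry_solved) / TOTAL * 100
print(f"quality gap: {gap_pp:+.1f}pp")
print(f"cost saved vs frontier-retry: {savings(retry_cost, cascade_cost):.0%}")
```

The point estimate of the gap sits inside the reported confidence interval, and the 56% figure is measured against frontier-retry, not against the single frontier run.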
## Files That Matter
### Production
- **`aco/aco_live.py`**: Drop-in cascade wrapper. Three strategies: cascade, safe_proposal, safe_proposal_t2. Use this.
- **`config.yaml`**: Configuration for models, tiers, thresholds.
### Validation (new this session)
- **`validate_cascade.py`**: Full validation pipeline: clone repo → set up conda env → run cascade agent → verify patch via pytest. No Docker needed.
- **`quick_validate.py`**: Single-instance quick validation. Fast feedback loop.
- **`dockerless_verify.py`**: Docker-less verification utility. Replicates SWE-bench conda environments.
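The pipeline stages listed for `validate_cascade.py` can be pictured as an ordered command plan. The sketch below is an assumption about how such a plan might look, not the script's actual contents: the `Instance` fields, the environment naming, and the example values are hypothetical, and the commands are only assembled, never executed.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    # All fields are hypothetical; real SWE-bench records carry more metadata.
    instance_id: str
    repo_url: str
    base_commit: str
    python_version: str
    test_files: list = field(default_factory=list)


def plan_validation(inst, patch_path, workdir="work"):
    """Return the ordered (step, command, cwd) triples for one instance."""
    repo = f"{workdir}/{inst.instance_id}"
    env = f"val-{inst.instance_id}"
    return [
        ("clone", ["git", "clone", inst.repo_url, repo], None),
        ("checkout", ["git", "checkout", inst.base_commit], repo),
        ("env", ["conda", "create", "-y", "-n", env, f"python={inst.python_version}"], None),
        # The cascade agent would run here and write its patch to `patch_path`.
        ("apply", ["git", "apply", patch_path], repo),
        ("test", ["conda", "run", "-n", env, "python", "-m", "pytest", *inst.test_files], repo),
    ]


inst = Instance(
    "example__example-1",            # placeholder instance id
    "https://example.com/repo.git",  # placeholder URL
    "abc123",                        # placeholder commit
    "3.9",
    ["tests/test_utils.py"],
)
for step, cmd, cwd in plan_validation(inst, "/tmp/agent.patch"):
    print(f"{step:9s} cwd={cwd} :: {' '.join(cmd)}")
```

Each triple could be handed to `subprocess.run(cmd, cwd=cwd, check=True)` to execute the plan; keeping planning separate from execution makes the sequence easy to inspect before spending any compute.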
### Analysis
- **`CORRECTED_REPORT.md`**: The corrected analysis with all 5 fixes.
- **`TRUTH.md`**: Honest assessment of what works, what doesn't, and what's blocking.
- **`FIXES_COMPLETE.md`**: Summary of the 5 critical-review fixes.
### Dead Ends (documented, not deleted)
- **`docs/trained_router_final_report.md`**: Why BERT and XGBoost routing failed.
- **`docs/pareto_frontier_report.md`**: Pareto analysis of cost-quality tradeoffs.
### Supporting Code
- **`aco/router.py`**: Model cascade router.
- **`aco/telemetry.py`**: Cost telemetry collector.
- **`aco/classifier.py`**: Task cost classifier.
- **`aco/context_budgeter.py`**: Context budget policy.
- **`aco/tool_gate.py`**: Tool-use cost gate.
- **`aco/verifier_budgeter.py`**: Verifier budget policy.
- **`aco/retry_optimizer.py`**: Retry/recovery optimizer.
- **`aco/doom_detector.py`**: Early termination (uses rescue, not terminate).
- **`aco/meta_tool_miner.py`**: Macro tool pattern miner.
- **`aco/cache_layout.py`**: Cache-aware prompt layout.
- **`aco/per_step_router.py`**: Safe proposal routing.
## Quick Start
```python
# Drop-in cascade wrapper
from aco.aco_live import CascadeOptimizer

opt = CascadeOptimizer(strategy="cascade")
response = opt.run("Fix the null pointer in utils.py", context={...})
```

```bash
# Quick validation (single instance)
python quick_validate.py django__django-14315

# Full validation (batch)
python validate_cascade.py --batch 5 --target cascade-only
```
## The Blocking Issue
**No Docker daemon on HF infrastructure.** Without Docker, we can't run SWE-bench test verification in the exact Docker environments. As a workaround:
- The `dockerless_verify.py` and `validate_cascade.py` scripts use conda instead.
- Conda provides the same packages but at different filesystem paths, so test patches may not apply cleanly.

**To unblock:** On any host with a Docker daemon and API keys, run:
```bash
python cascade_agent.py --batch 50
```
## What Was Spent
- **All training:** Wasted (~$0, free tier HF Jobs)
- **This session:** CPU sandbox (free) + a few GPU inference calls (free via HF_TOKEN)
- **Total compute cost across all sessions:** < $20 (mostly unused GPU sandbox time)
## Next Steps (Priority Order)
1. **Validate cascade on 5-10 instances with conda** ← **WE ARE HERE**
2. **Inspect the cascade-only T2 patches manually**
3. **Run Docker agent on 50 instances** (needs Docker daemon)
4. **Frontier with equal retries**: give frontier 3 attempts, compare
5. **Add gpt-5.2-medium as a tier**: catches 14 frontier-retry-only instances
6. **Macro tool integration**: deterministic subprocess calls
7. **Doom rescue**: replace termination with targeted recovery
## Bottom Line
The cascade works on paper. The data is solid. The code is functional. Now we need:
1. **Validation**: prove the conda-based pipeline works on 3-5 instances
2. **Docker**: run the real thing at scale
3. **Inspection**: human review of cascade-only patches