Agent Cost Optimizer (ACO)
Lightweight cost control layer for AI agent harnesses. Reduces agent run cost while preserving or improving task quality.
Status (verified 2026-06-15): v1 DistilBERT specialists are the production models. The v2 ModernBERT experiment did not improve results and regressed the tool-gater (-19.6pp F1), so use v1. See AUDIT.md for the independent verification and TRUTH.md for the honest project state. The headline "56% cost reduction" is a trace-based simulation result, not a result verified by live Docker test runs (see caveats below).
What ACO Does
| Module | Problem | Solution | Impact |
|---|---|---|---|
| Model Cascade Router | Overusing frontier models | Routes to cheapest capable model | ~56% cost reduction (simulated, iso-quality) |
| Tool-Use Cost Gate | Unnecessary tool calls | Predicts whether tool call is worth cost | gate F1=0.92 (verified) |
| Verifier Budgeter | Verifying everything | Gates verification selectively | gate F1=0.65 (verified, weak) |
| Context Compressor | Growing context windows | ACON-pattern state-preserving compression | 30-54% token reduction (literature) |
| Cache-Aware Layout | Cache-unfriendly prompts | Static-at-prefix optimization | 50-90% input cost reduction (literature) |
| Intervention Threshold | Verifier-caused regressions | Only verify when safe (d/(d+r) threshold) | Prevents collapse (literature) |
| Token Budget Estimator | Verbose outputs | Per-query optimal length prediction | 15-25% output token reduction (literature) |
| Doom Detector | Continuing doomed runs | Terminate/rescue on failure signals | Prevents wasted cost |
| Meta-Tool Miner | Repeated workflows | Compress successful traces into macros | LLM calls saved on repeats |
Evidence levels: "verified" = independently re-evaluated on held-out data in this repo. "simulated" = derived from trace simulation (aco/benchmark.py, SWE-Router traces), not live test execution. "literature" = reported by cited papers, not independently measured here.
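The "static-at-prefix" idea in the Cache-Aware Layout row can be illustrated with a minimal sketch. It assumes a provider-side prompt cache that can only reuse the leading part of a prompt when it is identical across requests. The helpers `static_prefix`, `build_prompt`, and `shared_prefix_len` are hypothetical names for illustration and are not part of the ACO API.

```python
import json


def static_prefix(system_prompt, tool_definitions):
    """Hypothetical helper: the part of the prompt that never changes."""
    # sort_keys gives a stable serialization, so the same tool definitions
    # always produce the same text regardless of dict insertion order.
    tools_text = json.dumps(tool_definitions, sort_keys=True)
    return f"{system_prompt}\n\nTOOLS:\n{tools_text}\n\n"


def build_prompt(system_prompt, tool_definitions, history, user_request):
    """Static content first, per-request content (history, request) last."""
    dynamic_suffix = "\n".join(history + [f"USER: {user_request}"])
    return static_prefix(system_prompt, tool_definitions) + dynamic_suffix


def shared_prefix_len(a, b):
    """Number of leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


system = "You are a coding agent."
tools_a = {"edit_file": {"args": ["path"]}, "run_tests": {"args": []}}
# Same tools, different insertion order.
tools_b = {"run_tests": {"args": []}, "edit_file": {"args": ["path"]}}

p1 = build_prompt(system, tools_a, [], "Fix the timeout bug")
p2 = build_prompt(system, tools_b, ["USER: hi"], "Run the tests")

prefix_len = len(static_prefix(system, tools_a))
# The two requests differ only after the static prefix.
print(shared_prefix_len(p1, p2) >= prefix_len)
```

Placing anything request-specific (timestamps, retrieved context, the user message) ahead of the static block would shorten the shared prefix and, under the assumption above, reduce what the cache can reuse.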
Verified vs. Simulated: what is actually proven
| Claim | Status | Evidence |
|---|---|---|
| Tool-gater F1 = 0.92, acc = 0.98 | Verified | Independent re-eval, N=1695 held-out |
| Tier-router F1 = 0.67, acc = 0.70 | Verified | Independent re-eval, N=1260 held-out |
| Verifier-gater F1 = 0.65, acc = 0.65 | Verified (weak model) | Independent re-eval, N=863 held-out |
| v2 ModernBERT beats v1 | False | v2 regressed tool-gater by 19.6pp F1; flat elsewhere |
| Cascade ~56% cost reduction at iso-quality | Simulated only | Trace simulation; frontier-retry actually solves +4 instances |
| "4/4 Django instances solved by T1" | Misleading | Patches passed git apply --check only, not test-verified |
| Live SWE-bench resolution by cascade agent | Not demonstrated | No Docker/test-verified results file exists in repo |
See AUDIT.md for full detail.
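To make the "simulated only" label concrete, the sketch below shows what a trace-based cascade cost estimate looks like in general. It is not the code in aco/benchmark.py: the trace format, tier names, costs, and outcomes are made-up illustrative values, and the function names are hypothetical.

```python
# Hypothetical per-instance traces: for each tier, the recorded cost of one
# attempt (in dollars) and whether that attempt was recorded as a success.
TRACES = [
    {"cheap": (0.02, True), "frontier": (0.40, True)},
    {"cheap": (0.03, False), "frontier": (0.50, True)},
    {"cheap": (0.02, True), "frontier": (0.35, True)},
    {"cheap": (0.04, False), "frontier": (0.45, False)},
]
TIERS = ["cheap", "frontier"]  # ordered from least to most expensive


def simulate_cascade(traces, tiers):
    """Replay traces: try tiers in order, stop at the first recorded success."""
    total_cost, solved = 0.0, 0
    for trace in traces:
        for tier in tiers:
            cost, ok = trace[tier]
            total_cost += cost  # every attempted tier is paid for
            if ok:
                solved += 1
                break
    return total_cost, solved


def simulate_single_tier(traces, tier):
    """Baseline: send every instance to one tier."""
    total_cost = sum(t[tier][0] for t in traces)
    solved = sum(1 for t in traces if t[tier][1])
    return total_cost, solved


cascade_cost, cascade_solved = simulate_cascade(TRACES, TIERS)
frontier_cost, frontier_solved = simulate_single_tier(TRACES, "frontier")
reduction = 1 - cascade_cost / frontier_cost

print(f"cascade: ${cascade_cost:.2f}, solved {cascade_solved}")
print(f"frontier only: ${frontier_cost:.2f}, solved {frontier_solved}")
print(f"simulated cost reduction: {reduction:.0%}")
```

A replay like this is only as trustworthy as the recorded success flags. If a flag means "the patch passed git apply --check" rather than "the tests passed", the simulated solve count overstates what a live run would resolve, which is why the table separates simulated from verified claims.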
Architecture
             Agent Request
                   |
                   v
+-----------------------------------------+
|            ACO Control Layer            |
|                                         |
|  +------------+     +-------------+     |
|  | Classifier |---->| Tier Router |--+  |
|  +------------+     +-------------+  |  |
|                                      v  |
|  +------------+     +--------------+    |
|  | Compressor |     | Cache Layout |    |
|  +------------+     +--------------+    |
|        |                   |            |
|        +---------+---------+            |
|                  v                      |
|  +------------+     +-------------+     |
|  | Tool Gate  |     | Ver. Budget |     |
|  +------------+     +-------------+     |
|        |                   |            |
|        v                   v            |
|  +---------------------------------+    |
|  |     Retry + Recovery + Doom     |    |
|  +---------------------------------+    |
|                  |                      |
+------------------+----------------------+
                   v
             Agent API Call
Quick Start
from aco import ACOPipeline
# v1 DistilBERT specialists are the production models (v2 regressed; do not use).
aco = ACOPipeline(use_v2=False)
result = aco.optimize(
    request="Fix the connection pool timeout in database.py and run the tests",
    available_tools=["edit_file", "run_tests", "git_diff"],
    context={"system_prompt": "...", "tool_definitions": "..."},
)
print(result.routing) # Which model to use
print(result.tool_decisions) # Which tools to call
print(result.needs_verify) # Whether to verify output
print(result.context_budget) # How much context to include
print(result.cost_estimate) # Expected cost
Install
pip install transformers torch datasets
git clone https://huggingface.co/narcolepticchicken/agent-cost-optimizer
cd agent-cost-optimizer
pip install -e .
Specialist Models (v1, production)
Metrics below are independently verified on held-out test splits (not training logs):
| Model | Base | Params | Accuracy | F1 Macro | Test N |
|---|---|---|---|---|---|
| Tier Router | DistilBERT | 67M | 0.70 | 0.67 | 1260 |
| Tool Gater | DistilBERT | 67M | 0.98 | 0.92 | 1695 |
| Verifier Gater | DistilBERT | 67M | 0.65 | 0.65 | 863 |
v2 models (ModernBERT-base, 149M): trained but not recommended. The architecture upgrade
plus focal loss did not help: tier-router flat (+0.7pp), verifier-gater flat (+1.3pp), and tool-gater
regressed by 19.6pp F1 (ModernBERT overfits the 5.6% minority class). The raw comparison is in
v2_verification_results.json. Use the v1 models above.
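The gap between the tool-gater's accuracy (0.98) and its macro F1 (0.92) is the kind of gap a 5.6% minority class produces. The sketch below works through the arithmetic on made-up confusion matrices (not the repo's actual predictions) to show why macro F1 is the more informative number on imbalanced data. The function names are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 for one class; returns 0.0 if the class has no true or predicted members."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def accuracy_and_macro_f1(tp, fp, fn, tn):
    """Binary confusion matrix -> (accuracy, macro-averaged F1)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    f1_pos = f1(tp, fp, fn)
    # For the negative class the roles swap: TN are its hits,
    # FN are its false positives and FP are its false negatives.
    f1_neg = f1(tn, fn, fp)
    return accuracy, (f1_pos + f1_neg) / 2


# Illustrative numbers only: 1000 examples, 56 of them (5.6%) positive.
gater = accuracy_and_macro_f1(tp=45, fp=9, fn=11, tn=935)
# A degenerate model that always predicts the majority class.
majority = accuracy_and_macro_f1(tp=0, fp=0, fn=56, tn=944)

print(f"reasonable gater: acc={gater[0]:.3f}, macro F1={gater[1]:.3f}")
print(f"always-negative:  acc={majority[0]:.3f}, macro F1={majority[1]:.3f}")
```

In this toy setup the always-negative model still scores above 0.94 accuracy while its macro F1 falls below 0.5, so a regression on the minority class shows up in macro F1 long before it moves accuracy.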
Key Papers
| Finding | Paper |
|---|---|
| BERT classifier routing → 2x cost reduction | RouteLLM (2406.18665) |
| Guideline compression → 26-54% token reduction | ACON (2510.00615) |
| Critics cause collapse when overused | Intervention Paradox |
| Systematic component optimization → lower cost | Efficient Agents (2508.02694) |
| Cascade + self-consistency → cheaper | In-Context Distillation (2512.02543) |
Documentation
- AUDIT.md: independent verification of all metric and cost claims
- TRUTH.md: honest state of the project
- CORRECTED_REPORT.md: corrected cascade cost analysis
- Literature Review
- Context Compression: ACON + TALE + cache layout
License
MIT
Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern