| # Model-Routing Loop |
|
|
| ## Objective |
|
|
| Keep agent work on the right model for each task by routing on measured quality, latency, privacy, and cost, instead of pinning everything to one model and hoping. |
|
|
| ## Trigger |
|
|
| - Schedule: daily or weekly review of routing decisions against outcomes. |
| - Event: a new model ships, a price or latency change lands, or a quality or cost threshold is crossed. |
| - Manual bootstrap/debug command: "review model routing for <workflow> and propose a safer or cheaper split." |
|
|
| ## Intake |
|
|
| - Per-task telemetry: model used, success rate, latency, token cost, and retries. |
| - The routing policy: task classes, model options, privacy tiers, and fallbacks. |
| - Eval results and known-sensitive task types that must stay on approved models. |
|
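The intake items above could be represented roughly as follows. This is a minimal sketch, not the schema used by `examples/model-routing-loop.json`: the class names `TaskRecord` and `RoutePolicy`, every field name, and the tier label `"public"` are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TaskRecord:
    """One row of per-task telemetry (field names are hypothetical)."""
    task_class: str
    model: str
    success: bool
    latency_ms: float
    cost_usd: float
    retries: int = 0


@dataclass
class RoutePolicy:
    """Routing policy entry for one task class (hypothetical shape)."""
    task_class: str
    primary: str
    fallbacks: list[str] = field(default_factory=list)
    privacy_tier: str = "public"          # assumed tier label
    approved_models: frozenset[str] = frozenset()

    def candidates(self) -> list[str]:
        """Primary first, then fallbacks, keeping only approved models."""
        ordered = [self.primary, *self.fallbacks]
        return [m for m in ordered if m in self.approved_models]
```

Keeping the approved-model set on the policy entry means a fallback that is not approved for the class is never returned as a candidate.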
|
| ## Context |
|
|
| - Required files: routing policy, privacy and data-residency rules, eval baselines. |
| - Runtime sources: recent traces, cost and latency dashboards, model availability and pricing. |
|
|
| ## Agents |
|
|
| - Analyst: clusters tasks by class and measures quality, latency, and cost per model. |
| - Proposer: suggests routing changes such as cheaper models for easy classes or fallbacks for hard ones. |
| - Verifier: replays a representative sample on the proposed routing to confirm quality holds. |
| - Reporter: records the proposed policy, the evidence, and the privacy constraints checked. |
|
|
| ## Workspace And Permissions |
|
|
| - Read access to telemetry, eval results, pricing, and the current routing policy. |
| - Allowed to run offline replays and open a routing-policy change proposal. |
| - Disallowed from changing production routing, moving a task to a non-approved model, or crossing a privacy tier without human approval. |
| - Production routing changes and any privacy-tier change require human approval. |
|
|
| ## Durable State |
|
|
| - Task-class definitions, per-model metrics, proposed routes, replay results, and privacy checks. |
| - A routing decision log so changes are auditable and reversible. |
|
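One way to keep the decision log auditable and reversible is an append-only list of entries that each record the previous model. The sketch below is an assumption about shape only: `DecisionLog`, its method names, and the entry fields are hypothetical, and a real log would be persisted rather than held in memory.

```python
import json
from datetime import datetime, timezone


class DecisionLog:
    """Append-only routing decision log, held in memory for illustration."""

    def __init__(self):
        self.entries = []  # one JSON string per decision, never edited

    def record(self, task_class, old_model, new_model, evidence):
        """Append a decision and return the entry that was written."""
        entry = {
            "at": datetime.now(timezone.utc).isoformat(),
            "task_class": task_class,
            "old_model": old_model,
            "new_model": new_model,
            "evidence": evidence,
        }
        self.entries.append(json.dumps(entry, sort_keys=True))
        return entry

    def rollback_target(self, task_class):
        """Model to restore for a class: old_model of its latest entry."""
        for line in reversed(self.entries):
            entry = json.loads(line)
            if entry["task_class"] == task_class:
                return entry["old_model"]
        return None
```

Because each entry stores `old_model`, reverting a class needs only its most recent entry, not a replay of the whole history.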
|
| ## Loop Steps |
|
|
| 1. Load telemetry, the routing policy, and eval baselines. |
| 1. Cluster tasks by class and compute quality, latency, and cost per model. |
| 1. Identify classes that are over-served (too expensive) or under-served (too weak). |
| 1. Propose the smallest routing change that preserves quality and privacy. |
| 1. Replay a representative sample on the proposed routing and compare against baseline. |
| 1. Persist the proposal, evidence, and privacy checks; open a change proposal. |
| 1. Stop when a safe proposal is ready, no change is warranted, or a tradeoff needs an owner. |
|
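Steps 2 and 3 can be sketched as a group-by followed by a simple labeling rule. This is an illustration under stated assumptions: the record keys (`task_class`, `model`, `success`, `latency_ms`, `cost_usd`), the function names, and the `min_quality` and `tolerance` parameters are hypothetical, and success rate stands in for whatever quality measure the evals actually define.

```python
from collections import defaultdict
from statistics import mean


def per_model_metrics(records):
    """Group telemetry rows by (task_class, model) and average each metric."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["task_class"], r["model"])].append(r)
    return {
        key: {
            "n": len(rows),
            "success_rate": mean(1.0 if r["success"] else 0.0 for r in rows),
            "latency_ms": mean(r["latency_ms"] for r in rows),
            "cost_usd": mean(r["cost_usd"] for r in rows),
        }
        for key, rows in groups.items()
    }


def classify(current, cheaper, min_quality, tolerance):
    """Label a class from its current model's metrics and a cheaper candidate's.

    `cheaper` may be None when no cheaper candidate has been measured.
    """
    if current["success_rate"] < min_quality:
        return "under-served"  # too weak: needs a stronger model or fallback
    if (
        cheaper is not None
        and cheaper["cost_usd"] < current["cost_usd"]
        and cheaper["success_rate"] >= current["success_rate"] - tolerance
    ):
        return "over-served"   # too expensive: cheaper model holds quality
    return "ok"
```

A label of `over-served` here is only a candidate for step 4; it still has to pass the replay in step 5 before it becomes a proposal.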
|
| ## Verification Gates |
|
|
| - Proposed routes are replayed on a representative sample, not argued from price alone. |
| - Quality on each affected class stays within the agreed tolerance of baseline. |
| - Privacy and data-residency constraints are checked for every rerouted class. |
| - Cost and latency deltas are reported with sample size and variance. |
|
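The quality and cost gates above could be checked on paired replay results along these lines. The function name `replay_report`, the result keys, and the use of success rate as the quality measure are assumptions for illustration; latency deltas can be reported the same way as the cost deltas shown here.

```python
from statistics import mean, variance


def replay_report(baseline, proposed, tolerance):
    """Compare paired replay results for one task class.

    `baseline` and `proposed` are lists of dicts with "success" (bool) and
    "cost_usd" (float), one entry per replayed task, in the same order.
    """
    if len(baseline) != len(proposed) or len(baseline) < 2:
        raise ValueError("need at least two paired samples")
    base_q = mean(float(r["success"]) for r in baseline)
    prop_q = mean(float(r["success"]) for r in proposed)
    deltas = [p["cost_usd"] - b["cost_usd"] for b, p in zip(baseline, proposed)]
    return {
        "sample_size": len(deltas),
        "quality_delta": prop_q - base_q,
        "quality_ok": prop_q >= base_q - tolerance,  # within agreed tolerance
        "cost_delta_mean": mean(deltas),
        "cost_delta_variance": variance(deltas),     # sample variance (n - 1)
    }
```

Reporting the variance next to the mean makes it visible when an apparent saving comes from a handful of outlier tasks rather than the class as a whole.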
|
| ## Budget And Exit |
|
|
| - Max retries: 2 replay-and-adjust passes per task class. |
| - Max runtime: 60-120 minutes per routing review. |
| - Stop when a safe proposal is ready, no change is warranted, or a tradeoff needs owner approval. |
|
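The budget above can be enforced with a small bounded loop. This sketch assumes hypothetical `replay` and `adjust` callables supplied by the caller, takes the upper end of the runtime window as the ceiling, and uses status strings that are illustrative only.

```python
import time

MAX_PASSES = 2            # replay-and-adjust passes per task class
MAX_RUNTIME_S = 120 * 60  # upper end of the 60-120 minute window (assumed ceiling)


def review_class(proposal, replay, adjust, deadline, clock=time.monotonic):
    """Run up to MAX_PASSES replay-and-adjust passes; return (status, proposal).

    replay(proposal) -> bool   True when quality holds on the sample
    adjust(proposal) -> new proposal, or None if no change is warranted
    """
    for _ in range(MAX_PASSES):
        if clock() >= deadline:
            return "escalate: runtime budget exhausted", proposal
        if replay(proposal):
            return "ready", proposal
        proposal = adjust(proposal)
        if proposal is None:
            return "no change warranted", None
    # Budget spent: the last adjusted proposal is unverified, so hand it to an owner.
    return "escalate: tradeoff needs owner", proposal
```

A caller would set `deadline = time.monotonic() + MAX_RUNTIME_S` once per routing review and pass the same deadline to every class, so the runtime limit applies to the review as a whole.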
|
| ## Escalation |
|
|
| Escalate for quality-versus-cost tradeoffs, privacy-tier changes, data-residency questions, a model deprecation that forces migration, or customer-impacting latency changes. |
|
|
| ## Loop Instruction |
|
|
| ```text |
| Review model routing for <workflow>. |
| Cluster tasks by class and measure quality, latency, and cost per model. |
| Propose the smallest routing change that preserves quality and privacy, then replay a representative sample to confirm. |
| Report cost and latency deltas with sample size and variance; check privacy and residency for every rerouted class. |
| Do not change production routing or cross a privacy tier without human approval. |
| ``` |
|
|
| Example automation: run weekly, or trigger when a new model ships or a cost or latency threshold is crossed, then open a routing-policy proposal for review. |
|
|
| ## Failure Modes |
|
|
| - Routing on price alone and quietly degrading quality on hard task classes. |
| - Optimizing on an unrepresentative sample that misses the long tail. |
| - Moving a sensitive task to a cheaper model that violates a privacy tier. |
| - Flapping between models as short-term metrics wobble. |
|
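One common guard against the flapping failure mode is hysteresis: only switch when the candidate wins by a margin across several consecutive review windows. This document does not prescribe that mitigation, so treat the sketch as one possible approach; `should_switch`, `margin`, and `windows` are hypothetical names and defaults.

```python
def should_switch(current_scores, candidate_scores, margin=0.02, windows=3):
    """True only if the candidate beat the current model by at least `margin`
    in each of the last `windows` review windows.

    Both arguments are per-window quality scores, oldest first.
    """
    if len(current_scores) < windows or len(candidate_scores) < windows:
        return False  # not enough history to justify a change
    recent = zip(current_scores[-windows:], candidate_scores[-windows:])
    return all(cand - cur >= margin for cur, cand in recent)
```

A single good window for the candidate is not enough to trigger a proposal, which damps reactions to short-term metric noise.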
|
| ## Safety Notes |
|
|
| - Privacy tier and data residency are hard constraints, never traded for cost. |
| - Keep a fast rollback to the prior routing if a change regresses in production. |
| - Verify quality with replays and evals before any production routing change. |
|
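Treating privacy tier and residency as hard constraints means filtering on them before any cost comparison, rather than scoring them alongside cost. A minimal sketch, assuming a hypothetical model catalog of dicts with `approved_tiers`, `regions`, `quality`, and `cost_usd` keys:

```python
def cheapest_allowed(models, required_tier, required_region, min_quality):
    """Pick the cheapest model that satisfies the hard constraints, or None."""
    allowed = [
        m for m in models
        if required_tier in m["approved_tiers"]   # hard constraint: privacy tier
        and required_region in m["regions"]       # hard constraint: data residency
        and m["quality"] >= min_quality           # quality floor from evals
    ]
    # Cost is compared only among models that already passed every constraint.
    return min(allowed, key=lambda m: m["cost_usd"], default=None)
```

Returning `None` when nothing qualifies forces an escalation instead of silently falling back to a cheaper model that breaks a constraint.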
|
| ## Example Contract |
|
|
| - [`examples/model-routing-loop.json`](../examples/model-routing-loop.json) |
|
|
| ## References |
|
|
| - [Integrations and observability](https://developers.openai.com/api/docs/guides/agents/integrations-observability) - Traces as the basis for measuring and routing agent work. |
| - [Better Harness: A Recipe for Harness Hill-Climbing with Evals](https://www.langchain.com/blog/better-harness-a-recipe-for-harness-hill-climbing-with-evals) - Using evals as the signal for changing how agents run. |
|
|