---
license: apache-2.0
language:
- en
library_name: onnxruntime
pipeline_tag: text-classification
base_model: microsoft/deberta-v3-small
tags:
- coding-agent
- routing
- multi-head-classifier
- onnx
- deberta-v3
- model-router
metrics:
- accuracy
- f1
---

# spawn-router

A compact, fast, **local-first** multi-head classifier for **coding-agent task routing**. Given a task prompt at kickoff, it predicts stable task properties; a downstream policy/config then maps those properties to a model, provider, and execution behavior. The classifier predicts the *ontology*; your config owns the *orchestration*.

This is the model component of **spawn**; see [`Afterbuild/spawn-router`](https://github.com/Afterbuild/spawn-router) for the training code and [`spawn-gateway`](https://github.com/Afterbuild) for the local gateway that wraps Claude Code / Codex and routes with these weights.

- **Backbone:** `microsoft/deberta-v3-small` (multi-head)
- **Checkpoint:** v6 (final text-only training run)
- **Inference:** torch-free ONNX path, ~7 ms/prompt CPU, ~140 MB deps
- **Input:** text only (`current_text`), 256-token max

## What it predicts

```
complexity: easy | medium | hard
  + sub-dims (0..1 regression): reasoning_depth, scope_breadth, domain_knowledge,
    spec_completeness (inverted: low spec = harder)
task_type: bugfix | feature | refactor | test | design | docs | migration | exploration
risk: low | medium | high
  + sub-dims (0..1 regression): security_surface, data_sensitivity,
    production_exposure, reversal_cost
+ per-head confidences (post-hoc temperature-scaled) and overall_confidence
```

- `complexity` → capability tier (small / mid / large model)
- `task_type` → model specialty (e.g. design → Claude, systems → GPT, docs → small)
- `risk` → tier bumper (easy + high-risk still routes capable) and confirmation gate

The ONNX graphs emit the three classification logits plus all eight regression sub-dimension scores.
Output names are:

- `complexity_logits`, `task_type_logits`, `risk_logits`
- `complexity_sub_reasoning_depth`, `complexity_sub_scope_breadth`, `complexity_sub_domain_knowledge`, `complexity_sub_spec_completeness`
- `risk_sub_security_surface`, `risk_sub_data_sensitivity`, `risk_sub_production_exposure`, `risk_sub_reversal_cost`

Routing is **kickoff-only**: classify once at task start and lock the model for the whole task cycle (no per-turn re-routing → no context thrash).

## Files

| File | What |
|---|---|
| `spawn_router.int8.onnx` | int8-quantized graph, **recommended for serving** (~164 MB) |
| `spawn_router.onnx` | fp32 graph (~540 MB) |
| `model.pt` | PyTorch state dict, for fine-tuning / sub-dim outputs (~565 MB) |
| `spm.model` + `*tokenizer*.json` | SentencePiece (DeBERTa-v2/spm) tokenizer |
| `model_config.json` | architecture + label maps |
| `temperature_scaling.json` | per-head calibration temperatures |
| `*_metrics.json`, `battery_results.json` | evaluation results |

## Usage (ONNX, torch-free)

Needs only `onnxruntime`, `numpy`, and `sentencepiece`: no torch, no transformers.

```python
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

MODEL_DIR = "."  # dir containing spawn_router.int8.onnx, spm.model, *.json
MAX_LEN = 256

LABELS = {
    "complexity_logits": ["easy", "medium", "hard"],
    "task_type_logits": ["bugfix", "feature", "refactor", "test",
                         "design", "docs", "migration", "exploration"],
    "risk_logits": ["low", "medium", "high"],
}
TEMPS = {  # from temperature_scaling.json; output name -> head temperature
    "complexity_logits": 0.891251,
    "task_type_logits": 0.707946,
    "risk_logits": 1.059254,
}

sp = spm.SentencePieceProcessor(model_file=f"{MODEL_DIR}/spm.model")
sess = ort.InferenceSession(f"{MODEL_DIR}/spawn_router.int8.onnx",
                            providers=["CPUExecutionProvider"])


def classify(text: str) -> dict:
    # DeBERTa-v3 spm tokenizer: [CLS]=1 + pieces (truncated) + [SEP]=2
    pieces = sp.encode(f"Current: {text}", out_type=int)[: MAX_LEN - 2]
    ids = [1, *pieces, 2]
    feeds = {
        # structured inputs are text-only sentinels (no interaction context)
        "input_ids": np.array([ids], dtype=np.int64),
        "attention_mask": np.ones((1, len(ids)), dtype=np.int64),
        "previous_action_id": np.array([0], dtype=np.int64),   # "none"
        "previous_outcome_id": np.array([4], dtype=np.int64),  # "unknown"
        "log_recency_seconds": np.array([0.0], dtype=np.float32),
        "has_interaction": np.array([0], dtype=np.int64),
        "has_recency": np.array([0], dtype=np.int64),
    }
    # Pair each output name with its array (sess.run returns them in graph order).
    out = {o.name: v for o, v in zip(sess.get_outputs(), sess.run(None, feeds))}

    result = {}
    for name, labels in LABELS.items():
        # Temperature-scale the logits, then apply a numerically stable softmax.
        logits = out[name][0] / TEMPS[name]
        p = np.exp(logits - logits.max())
        p /= p.sum()
        i = int(p.argmax())
        result[name.replace("_logits", "")] = {
            "label": labels[i],
            "confidence": round(float(p[i]), 4),
        }
    # np.ravel(...)[0] reads the single score whether the output is (1,) or (1, 1).
    result["complexity_sub"] = {
        name.replace("complexity_sub_", ""): round(float(np.ravel(out[name])[0]), 4)
        for name in out if name.startswith("complexity_sub_")
    }
    result["risk_sub"] = {
        name.replace("risk_sub_", ""): round(float(np.ravel(out[name])[0]), 4)
        for name in out if name.startswith("risk_sub_")
    }
    return result


print(classify("refactor JWT key rotation in prod"))
# {'complexity': {'label': ...}, 'task_type': {'label': ...}, 'risk': {'label': ...},
#  'complexity_sub': {'reasoning_depth': ...}, 'risk_sub': {'security_surface': ...}}
```

## Evaluation

Two complementary measures (eval scripts in [`Afterbuild/spawn-router`](https://github.com/Afterbuild/spawn-router): `scripts/eval_battery.py`, `eval.py`):

**Locked kickoff battery** (83 hand-labeled probes, never in training; the canonical cross-version benchmark):

| Metric | v6 |
|---|---|
| Unified kickoff score | **69.5%** |
| Exact match (all 3 heads) | 37.4% |
| complexity | 65.1% |
| task_type | 78.3% |
| risk | 65.1% |

**Held-out test split** (n=174, mirrors the training distribution):

| Head | Accuracy | Macro F1 |
|---|---|---|
| complexity | 67.8% | 68.1% |
| task_type | 86.8% | 87.2% |
| risk | 66.7% | 62.2% |
| **Exact match** | **39.1%** | n/a |

Sub-dimension regression R² (PyTorch model): reasoning_depth 0.51, scope_breadth 0.47, spec_completeness 0.34, domain_knowledge 0.30; reversal_cost 0.55, production_exposure 0.51, data_sensitivity 0.25, security_surface 0.19.

Calibration: per-head temperature scaling fit on validation. ECE on the held-out split is ~0.37 at the 0.8 automation threshold; **confidence is not yet well-calibrated for aggressive automation**, so gate on it conservatively.

## Intended use

- Pick a capability tier / provider for a coding task **at kickoff**, before the first expensive agent call.
- Drive a confirmation gate for high-blast-radius work (risk/security/prod).
- Spread work across tiers to reduce rate-limit pressure.

**Out of scope:** per-turn routing; non-coding prompts; high-stakes autonomous action without a human gate; languages other than English (trained on English).

## Limitations

- **Cold-start ceiling.** Effort/blast-radius isn't fully derivable from prompt text: `complexity=medium` and `risk=high` are the weakest bands, especially on short imperatives.
  Production signals (overrides, retries, session duration) are the intended path past this; this checkpoint predates that loop.
- **Synthetic-label ceiling.** Much training data is LLM-labeled; expect a ~75–80% ceiling per head until real disagreement signals are mixed in.
- **Quantized serving tradeoff.** The fp32 ONNX graph matches the PyTorch model on the locked battery, including sub-dimension scores. The int8 graph is the recommended low-dependency serving artifact and preserves the established v6 serving behavior, but dynamic quantization can move borderline labels and regression values.

## Training

- Backbone `microsoft/deberta-v3-small`, attention pooling, head dependencies, 3 softmax heads + 8 regression heads; `current_text_only` feature mode.
- 5 epochs, batch 16, encoder LR 2e-5, head LR 1e-4, weight decay 0.01, warmup 0.1, seed 13; post-hoc per-head temperature scaling on validation.
- Data: v6 mixed set (train 1141 / val 174 / test 174), a mix of synthetic coding-task prompts and real coding-agent kickoff prompts. **The merged training set is not distributed** (it embeds third-party trace text and personal usage traces); the synthetic seed data and the full data pipeline are in the code repo. See "Training data provenance" below.

## Training data provenance

Disclosed in full so downstream users can do their own diligence:

- **Synthetic coding-task prompts** (majority of the mix): written by Claude sub-agents and hand-labeled; included in the code repo.
- **SWE-bench problem statements**: used only as Claude-paraphrased short-imperative prompts (no code, patches, or full issue text). The SWE-bench benchmark code is MIT; the aggregated issue text is owned by its authors and the HF dataset card carries no license tag.
- **Public coding-agent trace datasets** (`badlogicgames/pi-mono`, `armand0e/gpt-5.5-agent`, `lewtun/ml-intern-sessions`): kickoff prompts extracted and labeled.
  These carry `license: other` or no license; their raw text is **not redistributed** here.
- **The author's own local agent traces**: first-task prompts only; not redistributed.
- Labels and paraphrases were produced with **Anthropic Claude**; per Anthropic's Commercial Terms, outputs are customer-owned. No other provider's models were used for generation or labeling.

The model is a non-generative classifier (three softmax heads and eight regression heads over 256-token inputs); it emits logits and scalar scores, not text, and cannot reproduce training data.

## Credits & provenance

- Scaffolding began as a fork of **[tiny-router](https://github.com/UdaraJay/tiny-router) by Udara Jay** (MIT); the ontology, data, heads, and serving path were rebuilt for coding-agent routing.
- Backbone: **DeBERTa-v3** (He et al.; `microsoft/deberta-v3-small`, MIT).
- Related prior art: Vercel v0 Auto and NVIDIA's prompt-task-and-complexity-classifier.
- **License: Apache-2.0** (weights), with the training-data provenance disclosed above; the training/serving code repo is MIT.
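
## Example routing policy (sketch)

The card describes `complexity` as choosing a capability tier and `risk` as a tier bumper and confirmation gate, but leaves the actual policy to your config. The snippet below is one minimal way such a policy could look, not the policy shipped with spawn. The names `TIERS`, `Route`, `route`, and `MIN_CONFIDENCE` are hypothetical, and the low-confidence bump is an assumption motivated by the calibration caveat under "Evaluation". The input is a dict shaped like the return value of `classify` in the usage example.

```python
from dataclasses import dataclass, field

# Hypothetical tier names, ordered from least to most capable.
TIERS = ["small", "mid", "large"]
COMPLEXITY_TO_LEVEL = {"easy": 0, "medium": 1, "hard": 2}
# Assumption: reuse the 0.8 automation threshold mentioned under "Evaluation".
MIN_CONFIDENCE = 0.8


@dataclass
class Route:
    tier: str
    needs_confirmation: bool
    reasons: list = field(default_factory=list)


def route(pred: dict, min_confidence: float = MIN_CONFIDENCE) -> Route:
    """Map one classify()-style prediction to a tier and a confirmation flag."""
    reasons = []
    level = COMPLEXITY_TO_LEVEL[pred["complexity"]["label"]]
    risk = pred["risk"]["label"]

    # Risk as tier bumper: in this sketch, high-risk work never uses the smallest tier.
    if risk == "high" and level < 1:
        level = 1
        reasons.append("bumped: high risk")

    # Confidence is not well calibrated, so a shaky prediction is handled
    # conservatively: move up one tier (capped at the largest).
    weakest = min(pred[h]["confidence"] for h in ("complexity", "task_type", "risk"))
    if weakest < min_confidence and level < len(TIERS) - 1:
        level += 1
        reasons.append("bumped: low confidence")

    # Risk as confirmation gate: high-risk tasks ask a human before running.
    needs_confirmation = risk == "high"
    if needs_confirmation:
        reasons.append("confirm: high risk")
    return Route(TIERS[level], needs_confirmation, reasons)


example = {
    "complexity": {"label": "easy", "confidence": 0.91},
    "task_type": {"label": "refactor", "confidence": 0.88},
    "risk": {"label": "high", "confidence": 0.84},
}
decision = route(example)
print(decision.tier, decision.needs_confirmation)  # mid True
```

Because routing is kickoff-only, a function like this would run once per task and its result would be held for the whole task cycle. A `task_type` → specialty lookup could be layered on top in the same way; it is left out here because the mapping depends on which providers you have configured.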