---
license: apache-2.0
language:
- en
library_name: onnxruntime
pipeline_tag: text-classification
base_model: microsoft/deberta-v3-small
tags:
- coding-agent
- routing
- multi-head-classifier
- onnx
- deberta-v3
- model-router
metrics:
- accuracy
- f1
---

<!-- This file is the Hugging Face model card: published to
huggingface.co/afterbuild/spawn-router as README.md. Frontmatter must stay
at the very top of the file or HF won't parse the metadata. -->

# spawn-router

A compact, fast, **local-first** multi-head classifier for **coding-agent task
routing**. Given a task prompt at kickoff, it predicts stable task properties; a
downstream policy/config then maps those properties to a model, provider, and
execution behavior. The classifier predicts the *ontology*; your config owns the
*orchestration*.

This is the model component of **spawn** — see
[`Afterbuild/spawn-router`](https://github.com/Afterbuild/spawn-router) for the
training code and [`spawn-gateway`](https://github.com/Afterbuild) for the local
gateway that wraps Claude Code / Codex and routes with these weights.

- **Backbone:** `microsoft/deberta-v3-small` (multi-head)
- **Checkpoint:** v6 (final text-only training run)
- **Inference:** torch-free ONNX path, ~7 ms/prompt CPU, ~140 MB deps
- **Input:** text only (`current_text`), 256-token max

## What it predicts

```
complexity: easy | medium | hard
  + sub-dims (0..1 regression): reasoning_depth, scope_breadth,
    domain_knowledge, spec_completeness (inverted: low spec = harder)
task_type:  bugfix | feature | refactor | test | design | docs | migration | exploration
risk:       low | medium | high
  + sub-dims (0..1 regression): security_surface, data_sensitivity,
    production_exposure, reversal_cost
+ per-head confidences (post-hoc temperature-scaled) and overall_confidence
```

- `complexity` → capability tier (small / mid / large model)
- `task_type` → model specialty (e.g. design → Claude, systems → GPT, docs → small)
- `risk` → tier bumper (easy + high-risk still routes capable) and confirmation gate

The ONNX graphs emit the three classification logits plus all eight regression
sub-dimension scores. Output names are:

- `complexity_logits`, `task_type_logits`, `risk_logits`
- `complexity_sub_reasoning_depth`, `complexity_sub_scope_breadth`,
  `complexity_sub_domain_knowledge`, `complexity_sub_spec_completeness`
- `risk_sub_security_surface`, `risk_sub_data_sensitivity`,
  `risk_sub_production_exposure`, `risk_sub_reversal_cost`

Routing is **kickoff-only**: classify once at task start and lock the model for
the whole task cycle (no per-turn re-routing → no context thrash).

## Files

| File | What |
|---|---|
| `spawn_router.int8.onnx` | int8-quantized graph — **recommended for serving** (~164 MB) |
| `spawn_router.onnx` | fp32 graph (~540 MB) |
| `model.pt` | PyTorch state dict — for fine-tuning / sub-dim outputs (~565 MB) |
| `spm.model` + `*tokenizer*.json` | SentencePiece (DeBERTa-v2/spm) tokenizer |
| `model_config.json` | architecture + label maps |
| `temperature_scaling.json` | per-head calibration temperatures |
| `*_metrics.json`, `battery_results.json` | evaluation results |

## Usage (ONNX, torch-free)

Needs only `onnxruntime`, `numpy`, and `sentencepiece` — no torch, no transformers.

```python
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

MODEL_DIR = "."  # dir containing spawn_router.int8.onnx, spm.model, *.json
MAX_LEN = 256
LABELS = {
    "complexity_logits": ["easy", "medium", "hard"],
    "task_type_logits": ["bugfix", "feature", "refactor", "test",
                          "design", "docs", "migration", "exploration"],
    "risk_logits": ["low", "medium", "high"],
}
TEMPS = {  # from temperature_scaling.json; output name -> head temperature
    "complexity_logits": 0.891251,
    "task_type_logits": 0.707946,
    "risk_logits": 1.059254,
}

sp = spm.SentencePieceProcessor(model_file=f"{MODEL_DIR}/spm.model")
sess = ort.InferenceSession(f"{MODEL_DIR}/spawn_router.int8.onnx",
                            providers=["CPUExecutionProvider"])

def classify(text: str) -> dict:
    # DeBERTa-v3 spm tokenizer: [CLS]=1 + pieces (truncated) + [SEP]=2
    pieces = sp.encode(f"Current: {text}", out_type=int)[: MAX_LEN - 2]
    ids = [1, *pieces, 2]
    feeds = {  # structured inputs are text-only sentinels (no interaction context)
        "input_ids": np.array([ids], dtype=np.int64),
        "attention_mask": np.ones((1, len(ids)), dtype=np.int64),
        "previous_action_id": np.array([0], dtype=np.int64),    # "none"
        "previous_outcome_id": np.array([4], dtype=np.int64),   # "unknown"
        "log_recency_seconds": np.array([0.0], dtype=np.float32),
        "has_interaction": np.array([0], dtype=np.int64),
        "has_recency": np.array([0], dtype=np.int64),
    }
    out = {o.name: v for o, v in zip(sess.get_outputs(), sess.run(None, feeds))}
    result = {}
    for name, labels in LABELS.items():
        logits = out[name][0] / TEMPS[name]
        p = np.exp(logits - logits.max()); p /= p.sum()
        i = int(p.argmax())
        result[name.replace("_logits", "")] = {
            "label": labels[i], "confidence": round(float(p[i]), 4),
        }
    result["complexity_sub"] = {
        name.replace("complexity_sub_", ""): round(float(out[name][0]), 4)
        for name in out if name.startswith("complexity_sub_")
    }
    result["risk_sub"] = {
        name.replace("risk_sub_", ""): round(float(out[name][0]), 4)
        for name in out if name.startswith("risk_sub_")
    }
    return result

print(classify("refactor JWT key rotation in prod"))
# {'complexity': {'label': ...}, 'task_type': {'label': ...}, 'risk': {'label': ...},
#  'complexity_sub': {'reasoning_depth': ...}, 'risk_sub': {'security_surface': ...}}
```

## Evaluation

Two complementary measures (eval scripts in
[`Afterbuild/spawn-router`](https://github.com/Afterbuild/spawn-router):
`scripts/eval_battery.py`, `eval.py`):

**Locked kickoff battery** (83 hand-labeled probes, never in training — the
canonical cross-version benchmark):

| Metric | v6 |
|---|---|
| Unified kickoff score | **69.5%** |
| Exact match (all 3 heads) | 37.4% |
| complexity | 65.1% |
| task_type | 78.3% |
| risk | 65.1% |

**Held-out test split** (n=174, mirrors the training distribution):

| Head | Accuracy | Macro F1 |
|---|---|---|
| complexity | 67.8% | 68.1% |
| task_type | 86.8% | 87.2% |
| risk | 66.7% | 62.2% |
| **Exact match** | **39.1%** | — |

Sub-dimension regression R² (PyTorch model): reasoning_depth 0.51, scope_breadth
0.47, spec_completeness 0.34, domain_knowledge 0.30; reversal_cost 0.55,
production_exposure 0.51, data_sensitivity 0.25, security_surface 0.19.

Calibration: per-head temperature scaling fit on validation. ECE on the held-out
split is ~0.37 at the 0.8 automation threshold — **confidence is not yet
well-calibrated for aggressive automation**; gate on it conservatively.

## Intended use

- Pick a capability tier / provider for a coding task **at kickoff**, before the
  first expensive agent call.
- Drive a confirmation gate for high-blast-radius work (risk/security/prod).
- Spread work across tiers to reduce rate-limit pressure.

**Out of scope:** per-turn routing; non-coding prompts; high-stakes autonomous
action without a human gate; languages other than English (trained on English).

## Limitations

- **Cold-start ceiling.** Effort/blast-radius isn't fully derivable from prompt
  text — `complexity=medium` and `risk=high` are the weakest bands, especially on
  short imperatives. Production signals (overrides, retries, session duration) are
  the intended path past this; this checkpoint predates that loop.
- **Synthetic-label ceiling.** Much training data is LLM-labeled; expect a
  ~75–80% ceiling per head until real disagreement signals are mixed in.
- **Quantized serving tradeoff.** The fp32 ONNX graph matches the PyTorch model
  on the locked battery, including sub-dimension scores. The int8 graph is the
  recommended low-dependency serving artifact and preserves the established v6
  serving behavior, but dynamic quantization can move borderline labels and
  regression values.

## Training

- Backbone `microsoft/deberta-v3-small`, attention pooling, head dependencies,
  3 softmax heads + 8 regression heads; `current_text_only` feature mode.
- 5 epochs, batch 16, encoder LR 2e-5, head LR 1e-4, weight decay 0.01, warmup
  0.1, seed 13; post-hoc per-head temperature scaling on validation.
- Data: v6 mixed set (train 1141 / val 174 / test 174) — a mix of synthetic
  coding-task prompts and real coding-agent kickoff prompts. **The merged
  training set is not distributed** (it embeds third-party trace text and
  personal usage traces); the synthetic seed data and the full data pipeline
  are in the code repo. See "Training data provenance" below.

## Training data provenance

Disclosed in full so downstream users can do their own diligence:

- **Synthetic coding-task prompts** (majority of the mix) — written by Claude
  sub-agents and hand-labeled; included in the code repo.
- **SWE-bench problem statements** — used only as Claude-paraphrased
  short-imperative prompts (no code, patches, or full issue text). The
  SWE-bench benchmark code is MIT; the aggregated issue text is owned by its
  authors and the HF dataset card carries no license tag.
- **Public coding-agent trace datasets** (`badlogicgames/pi-mono`,
  `armand0e/gpt-5.5-agent`, `lewtun/ml-intern-sessions`) — kickoff prompts
  extracted and labeled. These carry `license: other` or no license; their raw
  text is **not redistributed** here.
- **The author's own local agent traces** — first-task prompts only; not
  redistributed.
- Labels and paraphrases were produced with **Anthropic Claude**; per
  Anthropic's Commercial Terms, outputs are customer-owned. No other
  provider's models were used for generation or labeling.

The model is a non-generative classifier (three softmax heads over 256-token
inputs); it emits logits, not text, and cannot reproduce training data.

## Credits & provenance

- Scaffolding began as a fork of **[tiny-router](https://github.com/UdaraJay/tiny-router)
  by Udara Jay** (MIT); the ontology, data, heads, and serving path were rebuilt
  for coding-agent routing.
- Backbone: **DeBERTa-v3** (He et al.; `microsoft/deberta-v3-small`, MIT).
- Related prior art: Vercel v0 Auto and NVIDIA's
  prompt-task-and-complexity-classifier.
- **License: Apache-2.0** (weights), with the training-data provenance
  disclosed above; the training/serving code repo is MIT.