| --- |
| license: apache-2.0 |
| base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct |
| language: |
| - en |
| pipeline_tag: text-generation |
| tags: |
| - router |
| - prompt-router |
| - structured-output |
| - json |
| - code |
| - qwen2.5-coder |
| - gguf |
| --- |
| |
| # DevRouter-1.5B |
|
|
| **A tiny, fast router that turns a raw developer prompt into a single structured JSON decision.** |
|
|
| DevRouter-1.5B reads a raw coding prompt and returns one JSON object that (1) rewrites the prompt |
| into a cleaner version, (2) classifies its **intent**, **complexity**, and a suggested **route** |
| (which model tier to send it to), and (3) flags **missing context** the developer should have |
| included. It is meant to sit *in front of* your expensive models and make a cheap, deterministic |
| triage call in ~1–3 seconds on a single consumer GPU. |
|
|
| It is a fine-tune of **Qwen2.5-Coder-1.5B-Instruct** (Apache 2.0), distilled on a dataset of |
| developer prompts labelled by a stronger teacher model. |
|
|
| ## Output schema |
|
|
| Every response is a single JSON object with exactly these five fields: |
|
|
| | field | type | values | |
| |---|---|---| |
| | `rewrite` | string | a clearer version of the prompt, preserving the original intent | |
| | `intent` | enum | `debug` · `refactor` · `feature` · `explain` · `documentation` · `boilerplate` · `architecture` · `review` · `optimize` · `other` | |
| | `complexity` | enum | `low` · `medium` · `high` | |
| | `route` | enum | `small_local` · `medium_api` · `large_api` | |
| | `missing` | array of strings | context the prompt should have included (empty if nothing) | |
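The table above can be turned into a strict parser on the caller's side. The following is a minimal sketch, assuming the enum values listed in the table are exhaustive; `parse_decision` is a hypothetical helper name and is not shipped with the model.

```python
import json

# Allowed values, copied from the schema table above.
INTENTS = {"debug", "refactor", "feature", "explain", "documentation",
           "boilerplate", "architecture", "review", "optimize", "other"}
COMPLEXITIES = {"low", "medium", "high"}
ROUTES = {"small_local", "medium_api", "large_api"}
FIELDS = {"rewrite", "intent", "complexity", "route", "missing"}


def _check_enum(obj: dict, field: str, allowed: set) -> None:
    value = obj[field]
    if not isinstance(value, str) or value not in allowed:
        raise ValueError(f"{field} must be one of {sorted(allowed)}")


def parse_decision(text: str) -> dict:
    """Parse router output; raise ValueError unless it matches the schema."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(obj, dict) or set(obj) != FIELDS:
        raise ValueError("expected an object with exactly the five schema fields")
    if not isinstance(obj["rewrite"], str):
        raise ValueError("rewrite must be a string")
    _check_enum(obj, "intent", INTENTS)
    _check_enum(obj, "complexity", COMPLEXITIES)
    _check_enum(obj, "route", ROUTES)
    missing = obj["missing"]
    if not isinstance(missing, list) or not all(isinstance(m, str) for m in missing):
        raise ValueError("missing must be an array of strings")
    return obj
```

Rejecting extra or absent keys, rather than ignoring them, is a deliberate choice here: it makes a degraded checkpoint or quant fail loudly instead of silently.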
|
|
| ### Example |
|
|
| **Input (user message):** |
| > My Flask app 500s on POST /upload with RequestEntityTooLarge, how do I fix it? |
|
|
| **Output:** |
| ```json |
| { |
| "rewrite": "I'm encountering a 500 Internal Server Error on my Flask app when handling POST /upload. The error is 'RequestEntityTooLarge'. How can I resolve this issue?", |
| "intent": "debug", |
| "complexity": "low", |
| "route": "small_local", |
| "missing": ["Flask version", "Python version"] |
| } |
| ``` |
|
|
| ## Quick start |
|
|
| The router system prompt is **baked into the model's chat template**, so you do not need to supply a |
| system prompt — just send the raw developer prompt as the user message. |
|
|
| ### Ollama |
| ```bash |
| # from the -GGUF repo (Modelfile + DevRouter-1.5B-Q8_0.gguf) |
| ollama create devrouter -f Modelfile |
| ollama run devrouter "refactor this giant function into smaller ones" |
| ``` |
|
|
| ### llama.cpp |
| ```bash |
| llama-server -m DevRouter-1.5B-Q8_0.gguf -c 8192 -ngl 99 --parallel 1 |
| # then POST to /v1/chat/completions with just a user message, temperature 0 |
| ``` |
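The POST mentioned in the comment above could look roughly like this. It is a sketch using only the Python standard library, and it assumes `llama-server` is listening on its default local address (`127.0.0.1:8080`); `build_payload` and `route_prompt` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: llama-server was started locally with its default host and port.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"


def build_payload(prompt: str) -> dict:
    """Build the request body: a single user message and temperature 0."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    }


def route_prompt(prompt: str, url: str = SERVER_URL, timeout: float = 30.0) -> dict:
    """Send one prompt to the server and return the router's JSON decision."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    # The OpenAI-compatible reply carries the model text in the first choice.
    return json.loads(reply["choices"][0]["message"]["content"])
```

With the server running, `route_prompt("refactor this giant function into smaller ones")` should return the decision as a dict. If the model emits invalid JSON, the call raises `json.JSONDecodeError`.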
|
|
| ### Transformers |
| ```python |
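| from transformers import AutoModelForCausalLM, AutoTokenizer |
| |
| tok = AutoTokenizer.from_pretrained("aipster/DevRouter-1.5B") |
| model = AutoModelForCausalLM.from_pretrained("aipster/DevRouter-1.5B") |
| |
| # The router system prompt comes from the chat template, so only the user turn is needed. |
| msgs = [{"role": "user", "content": "write a FastAPI POST /items endpoint with a Pydantic model"}] |
| # return_dict=True yields input_ids plus attention_mask, independent of the library's default. |
| inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt") |
| out = model.generate(**inputs, max_new_tokens=1408, do_sample=False)  # greedy decoding |
| # Decode only the newly generated tokens, not the echoed prompt. |
| prompt_len = inputs["input_ids"].shape[1] |
| print(tok.decode(out[0][prompt_len:], skip_special_tokens=True)) |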
| ``` |
|
|
| Use **greedy decoding** for stable, parseable JSON: `do_sample=False` in Transformers, or |
| `temperature=0` when calling llama.cpp or Ollama. |
|
|
| ## Evaluation |
|
|
| Evaluated on a held-out, intent-stratified validation split (`val`, in-distribution) and an |
| out-of-distribution split (`holdout`, no GitHub-issue sources). Metrics: rate of valid JSON |
| (strict schema parse) and accuracy of `intent` / `route` / `complexity` against the teacher labels. |
|
|
| | metric | fp16 (val / holdout) | Q8_0 GGUF (val / holdout) | |
| |---|---|---| |
| | JSON validity | 0.973 / 0.955 | 0.965 / 0.946 | |
| | intent accuracy | 0.708 / 0.613 | 0.665 / 0.586 | |
| | route accuracy | 0.739 / 0.604 | 0.719 / 0.631 | |
| | complexity accuracy | 0.719 / 0.685 | 0.708 / 0.676 | |
| |
| Per-intent accuracy (Q8_0, val): |
|
|
| | intent | acc | intent | acc | |
| |---|---|---|---| |
| | debug | 0.82 | architecture | 0.72 | |
| | refactor | 0.72 | documentation | 0.56 | |
| | explain | 0.73 | boilerplate | 0.64 | |
| | feature | 0.58 | review | 0.43 | |
| | optimize | 0.50 | | | |
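For readers who want to reproduce this kind of scoring on their own prompts, the metrics can be computed along these lines. This is a sketch, not the evaluation script behind the numbers above. It makes two assumptions: invalid outputs count as wrong on every field, and schema validity is reduced to a key check. `score` is a hypothetical function name.

```python
import json
from collections import defaultdict

FIELDS = {"rewrite", "intent", "complexity", "route", "missing"}
SCORED = ("intent", "route", "complexity")


def score(outputs: list, labels: list) -> dict:
    """Compare raw model outputs (strings) against teacher labels (dicts)."""
    valid = 0
    hits = dict.fromkeys(SCORED, 0)
    per_intent = defaultdict(lambda: [0, 0])  # gold intent -> [correct, total]
    for raw, gold in zip(outputs, labels):
        per_intent[gold["intent"]][1] += 1
        try:
            pred = json.loads(raw)
        except json.JSONDecodeError:
            continue  # assumption: unparseable output is wrong on every field
        if not isinstance(pred, dict) or set(pred) != FIELDS:
            continue
        valid += 1
        for field in SCORED:
            hits[field] += pred[field] == gold[field]
        per_intent[gold["intent"]][0] += pred["intent"] == gold["intent"]
    n = max(len(labels), 1)  # avoid dividing by zero on an empty split
    result = {"json_validity": valid / n}
    result.update({field: hits[field] / n for field in SCORED})
    result["per_intent"] = {k: c / t for k, (c, t) in per_intent.items()}
    return result
```

Because the labels come from a teacher model, the scores measure agreement with the teacher rather than accuracy against human ground truth.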
|
|
| ## Performance |
|
|
| Single RTX 3090, Q8_0 GGUF via llama.cpp (single stream): |
| |
| - **Generation: ~280 tokens/s** (~3.6 ms/token, constant across output lengths) |
| - **Prompt eval (prefill): ~10,000–13,000 tokens/s** |
| - **Latency per routing call: ~1–3 s** (up to ~5 s for the longest outputs) |
| |
| Throughput scales further with batching/concurrency. |
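The latency range follows from the two throughput figures. The back-of-the-envelope sketch below uses the generation rate and the lower prefill rate quoted above. The token counts in the example are made up for illustration, and server or network overhead is ignored.

```python
GEN_TOKENS_PER_S = 280.0         # generation rate quoted above
PREFILL_TOKENS_PER_S = 10_000.0  # lower end of the quoted prefill range


def estimate_latency_s(prompt_tokens: int, output_tokens: int) -> float:
    """Single-stream estimate: prefill time plus generation time, in seconds."""
    return prompt_tokens / PREFILL_TOKENS_PER_S + output_tokens / GEN_TOKENS_PER_S


# Hypothetical call: 500 prompt tokens and a 400-token JSON answer.
print(round(estimate_latency_s(500, 400), 2))   # 1.48
# Same prompt, output running to the 1408-token cap from the Transformers example.
print(round(estimate_latency_s(500, 1408), 2))  # 5.08
```

Generation dominates: at these rates the prefill term stays in the tens of milliseconds unless the prompt is very long.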
| |
| ## Limitations |
| |
| - **No PII detection.** An earlier schema included a PII flag; it was removed in v1.1 because the |
| training data had too few PII-positive examples (~2.4%) to learn it reliably. Do **not** use this |
| model for privacy/safety filtering. (PII detection is planned for a future data-focused release.) |
| - **Weaker on rare intents.** `review` and `documentation` are under-represented and noisier in the |
| training data, and accuracy on them is lower (0.43 and 0.56 on `val` with Q8_0). |
| - **Source bias / OOD gap.** Training prompts skew heavily toward GitHub-issue-style text, so intent |
| accuracy drops roughly 8–10 points from `val` to `holdout` (0.708 → 0.613 at fp16, 0.665 → 0.586 |
| at Q8_0). Treat `route`/`complexity` as advisory. |
| - **Quantization sensitivity.** This is a small model doing strict structured output. **Q6_K and |
| below break the JSON** (validity collapses); ship **Q8_0** or **F16** only. Always validate a |
| quant before relying on it. |
| |
| ## Training |
| |
| - **Base:** Qwen2.5-Coder-1.5B-Instruct (Apache 2.0) |
| - **Method:** QLoRA (4-bit), rank 16, 2 epochs, effective batch 16, `train_on_responses_only` |
| - **Data:** ~2.6k developer prompts, each labelled by a stronger teacher model into the 5-field |
| schema, filtered by a judge model and capped per intent. Distillation sources permit training and |
| open release of derived weights. |
|
|
| ## Intended use |
|
|
| Pre-routing / triage of developer prompts in an LLM application: rewriting, intent/complexity |
| classification, and model-tier selection. **Not** intended for safety filtering, PII detection, or |
| as a general assistant. |
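Since the `route` field is advisory and a small share of outputs are not valid JSON, an application will usually want a fallback when selecting the tier. The sketch below shows one hypothetical policy, which is not part of the model: any unusable decision goes to the largest tier. `choose_route` and `FALLBACK_ROUTE` are illustrative names.

```python
import json

ROUTES = ("small_local", "medium_api", "large_api")
# Assumption: when the decision is unusable, prefer the most capable tier.
FALLBACK_ROUTE = "large_api"


def choose_route(raw_output: str) -> str:
    """Return the router's tier if it is usable, otherwise the fallback tier."""
    try:
        decision = json.loads(raw_output)
    except json.JSONDecodeError:
        return FALLBACK_ROUTE
    route = decision.get("route") if isinstance(decision, dict) else None
    return route if route in ROUTES else FALLBACK_ROUTE
```

Falling back to the largest tier trades cost for answer quality. A cost-sensitive deployment could equally retry the router once or default to `medium_api`.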
|
|
| ## License |
|
|
| Apache 2.0, inherited from the base model. You are free to use, modify, and redistribute, including |
| commercially. |
|
|