---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct
language:
- en
pipeline_tag: text-generation
tags:
- router
- prompt-router
- structured-output
- json
- code
- qwen2.5-coder
- gguf
---

# DevRouter-1.5B

**A tiny, fast router that turns a raw developer prompt into a single structured JSON decision.**

DevRouter-1.5B reads a raw coding prompt and returns one JSON object that (1) rewrites the prompt into a cleaner version, (2) classifies its **intent** and **complexity** and suggests a **route** (which model tier to send it to), and (3) flags **missing context** the developer should have included. It is meant to sit *in front of* your expensive models and make a cheap, deterministic triage call in ~1–3 seconds on a single consumer GPU.

It is a fine-tune of **Qwen2.5-Coder-1.5B-Instruct** (Apache 2.0), distilled on a dataset of developer prompts labelled by a stronger teacher model.

## Output schema

The model is trained to return a single JSON object with exactly these five fields:

| field | type | values |
|---|---|---|
| `rewrite` | string | a clearer version of the prompt, preserving the original intent |
| `intent` | enum | `debug` · `refactor` · `feature` · `explain` · `documentation` · `boilerplate` · `architecture` · `review` · `optimize` · `other` |
| `complexity` | enum | `low` · `medium` · `high` |
| `route` | enum | `small_local` · `medium_api` · `large_api` |
| `missing` | array of strings | context the prompt should have included (empty if nothing) |

### Example

**Input (user message):**

> My Flask app 500s on POST /upload with RequestEntityTooLarge, how do I fix it?

**Output:**

```json
{
  "rewrite": "I'm encountering a 500 Internal Server Error on my Flask app when handling POST /upload. The error is 'RequestEntityTooLarge'. How can I resolve this issue?",
  "intent": "debug",
  "complexity": "low",
  "route": "small_local",
  "missing": ["Flask version", "Python version"]
}
```

## Quick start

The router system prompt is **baked into the model's chat template**, so you do not need to supply a system prompt; just send the raw developer prompt as the user message.

### Ollama

```bash
# from the -GGUF repo (Modelfile + DevRouter-1.5B-Q8_0.gguf)
ollama create devrouter -f Modelfile
ollama run devrouter "refactor this giant function into smaller ones"
```

### llama.cpp

```bash
llama-server -m DevRouter-1.5B-Q8_0.gguf -c 8192 -ngl 99 --parallel 1
# then POST to /v1/chat/completions with just a user message, temperature 0
```

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("aipster/DevRouter-1.5B")
model = AutoModelForCausalLM.from_pretrained("aipster/DevRouter-1.5B")

# Only a user message is needed; the chat template supplies the router system prompt.
msgs = [{"role": "user", "content": "write a FastAPI POST /items endpoint with a Pydantic model"}]
inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt")

# Greedy decoding (do_sample=False) keeps the JSON output stable.
out = model.generate(inputs, max_new_tokens=1408, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```

Use **greedy decoding** (`do_sample=False` in Transformers, `temperature=0` when calling a server) for stable, parseable JSON.

## Evaluation

Evaluated on a held-out, intent-stratified validation split (`val`, in-distribution) and an out-of-distribution split (`holdout`, no GitHub-issue sources). Metrics: rate of valid JSON (strict schema parse) and accuracy of `intent` / `route` / `complexity` against the teacher labels.

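A strict schema parse of the kind the validity metric refers to could look like the sketch below. The exact checks used in the evaluation are not stated in this card, so the function name `is_valid_decision`, the constant names, and the decision to reject extra or missing keys are assumptions for illustration only. The enum values are taken from the Output schema table.

```python
import json

# Allowed values, copied from the Output schema table.
INTENTS = {"debug", "refactor", "feature", "explain", "documentation",
           "boilerplate", "architecture", "review", "optimize", "other"}
COMPLEXITIES = {"low", "medium", "high"}
ROUTES = {"small_local", "medium_api", "large_api"}
FIELDS = {"rewrite", "intent", "complexity", "route", "missing"}


def _is_enum(value, allowed):
    """True if value is a string and one of the allowed enum values."""
    return isinstance(value, str) and value in allowed


def is_valid_decision(raw):
    """Hypothetical strict check: raw must be JSON with exactly the five fields."""
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False
    # Assumption: extra or missing keys count as invalid.
    if not isinstance(obj, dict) or set(obj) != FIELDS:
        return False
    return (
        isinstance(obj["rewrite"], str)
        and _is_enum(obj["intent"], INTENTS)
        and _is_enum(obj["complexity"], COMPLEXITIES)
        and _is_enum(obj["route"], ROUTES)
        and isinstance(obj["missing"], list)
        and all(isinstance(item, str) for item in obj["missing"])
    )
```

Running a check like this over a batch of model outputs and taking the fraction that pass gives a validity rate comparable in spirit to the one reported here.
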
| metric | fp16 (val / holdout) | Q8_0 GGUF (val / holdout) |
|---|---|---|
| JSON validity | 0.973 / 0.955 | 0.965 / 0.946 |
| intent accuracy | 0.708 / 0.613 | 0.665 / 0.586 |
| route accuracy | 0.739 / 0.604 | 0.719 / 0.631 |
| complexity accuracy | 0.719 / 0.685 | 0.708 / 0.676 |

Per-intent accuracy (Q8_0, val):

| intent | acc | intent | acc |
|---|---|---|---|
| debug | 0.82 | architecture | 0.72 |
| refactor | 0.72 | documentation | 0.56 |
| explain | 0.73 | boilerplate | 0.64 |
| feature | 0.58 | review | 0.43 |
| optimize | 0.50 | | |

## Performance

Single RTX 3090, Q8_0 GGUF via llama.cpp (single stream):

- **Generation: ~280 tokens/s** (~3.6 ms/token, constant across output lengths)
- **Prompt eval (prefill): ~10,000–13,000 tokens/s**
- **Latency per routing call: ~1–3 s** (up to ~5 s for the longest outputs)

Throughput scales further with batching/concurrency.

## Limitations

- **No PII detection.** An earlier schema included a PII flag; it was removed in v1.1 because the training data had too few PII-positive examples (~2.4%) to learn it reliably. Do **not** use this model for privacy/safety filtering. (Planned for a future data-focused release.)
- **Weaker on rare intents.** `review` and `documentation` are under-represented and noisier in the training data, and accuracy on them is lower (0.43 and 0.56 on `val` at Q8_0).
- **Source bias / OOD gap.** Training prompts skew heavily toward GitHub-issue-style text, so intent accuracy drops by roughly 8–10 points on out-of-distribution prompts (0.708 → 0.613 at fp16, 0.665 → 0.586 at Q8_0). Treat `route`/`complexity` as advisory.
- **Quantization sensitivity.** This is a small model doing strict structured output. **Q6_K and below break the JSON** (validity collapses); ship **Q8_0** or **F16** only. Always validate a quant before relying on it.

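Because JSON validity is below 1.0 even at fp16 and the `route` label is advisory, a caller may want a guard around the model output. The sketch below is one possible shape for such a guard, not part of this release: the function name `route_or_fallback` is hypothetical, and sending unparseable or unknown outputs to `large_api` is an assumed policy that an application may well want to change.

```python
import json

# Route values from the Output schema table.
ROUTES = ("small_local", "medium_api", "large_api")


def route_or_fallback(raw, fallback="large_api"):
    """Return the suggested route, or `fallback` when the output cannot be trusted.

    Assumption: when in doubt, escalate to the strongest tier.
    """
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Not JSON at all (for example, output from a broken quant).
        return fallback
    route = obj.get("route") if isinstance(obj, dict) else None
    # Accept only the three documented route values.
    return route if route in ROUTES else fallback
```

For example, `route_or_fallback('{"route": "small_local"}')` returns `"small_local"`, while a truncated or malformed response returns the fallback tier.
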
## Training

- **Base:** Qwen2.5-Coder-1.5B-Instruct (Apache 2.0)
- **Method:** QLoRA (4-bit), rank 16, 2 epochs, effective batch 16, `train_on_responses_only`
- **Data:** ~2.6k developer prompts, each labelled by a stronger teacher model into the 5-field schema, filtered by a judge model and capped per intent. Distillation sources permit training and open release of derived weights.

## Intended use

Pre-routing / triage of developer prompts in an LLM application: rewriting, intent/complexity classification, and model-tier selection. **Not** intended for safety filtering, PII detection, or as a general assistant.

## License

Apache 2.0, inherited from the base model. You are free to use, modify, and redistribute, including commercially.