---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct
language:
- en
pipeline_tag: text-generation
tags:
- router
- prompt-router
- structured-output
- json
- code
- qwen2.5-coder
- gguf
---
# DevRouter-1.5B
**A tiny, fast router that turns a raw developer prompt into a single structured JSON decision.**
DevRouter-1.5B reads a raw coding prompt and returns one JSON object that (1) rewrites the prompt
into a cleaner version, (2) classifies its **intent**, **complexity**, and a suggested **route**
(which model tier to send it to), and (3) flags **missing context** the developer should have
included. It is meant to sit *in front of* your expensive models and make a cheap, deterministic
triage call in ~1–3 seconds on a single consumer GPU.
It is a fine-tune of **Qwen2.5-Coder-1.5B-Instruct** (Apache 2.0), distilled on a dataset of
developer prompts labelled by a stronger teacher model.
## Output schema
Every response is a single JSON object with exactly these five fields:
| field | type | values |
|---|---|---|
| `rewrite` | string | a clearer version of the prompt, preserving the original intent |
| `intent` | enum | `debug` · `refactor` · `feature` · `explain` · `documentation` · `boilerplate` · `architecture` · `review` · `optimize` · `other` |
| `complexity` | enum | `low` · `medium` · `high` |
| `route` | enum | `small_local` · `medium_api` · `large_api` |
| `missing` | array of strings | context the prompt should have included (empty if nothing) |
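
Because downstream code depends on this shape, it can help to validate every response before acting on it. The following is a minimal sketch of such a check using only the standard library; `parse_decision` and the constant names are hypothetical helpers, not part of the model or its repo, and the enum values are copied from the table above.

```python
import json

# Allowed values, copied from the schema table.
INTENTS = {"debug", "refactor", "feature", "explain", "documentation",
           "boilerplate", "architecture", "review", "optimize", "other"}
COMPLEXITIES = {"low", "medium", "high"}
ROUTES = {"small_local", "medium_api", "large_api"}
FIELDS = {"rewrite", "intent", "complexity", "route", "missing"}


def parse_decision(raw: str) -> dict:
    """Parse a router response; raise ValueError if it violates the schema."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(obj, dict) or set(obj) != FIELDS:
        raise ValueError("expected an object with exactly the five schema fields")
    if not isinstance(obj["rewrite"], str):
        raise ValueError("rewrite must be a string")
    for field, allowed in (("intent", INTENTS),
                           ("complexity", COMPLEXITIES),
                           ("route", ROUTES)):
        value = obj[field]
        # Check the type first so unhashable values (lists, dicts) are rejected cleanly.
        if not isinstance(value, str) or value not in allowed:
            raise ValueError(f"{field} has an unexpected value: {value!r}")
    missing = obj["missing"]
    if not isinstance(missing, list) or not all(isinstance(m, str) for m in missing):
        raise ValueError("missing must be an array of strings")
    return obj
```

Callers can catch `ValueError` and fall back to a default route when a response does not parse.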
### Example
**Input (user message):**
> My Flask app 500s on POST /upload with RequestEntityTooLarge, how do I fix it?
**Output:**
```json
{
"rewrite": "I'm encountering a 500 Internal Server Error on my Flask app when handling POST /upload. The error is 'RequestEntityTooLarge'. How can I resolve this issue?",
"intent": "debug",
"complexity": "low",
"route": "small_local",
"missing": ["Flask version", "Python version"]
}
```
## Quick start
The router system prompt is **baked into the model's chat template**, so you do not need to supply a
system prompt — just send the raw developer prompt as the user message.
### Ollama
```bash
# from the -GGUF repo (Modelfile + DevRouter-1.5B-Q8_0.gguf)
ollama create devrouter -f Modelfile
ollama run devrouter "refactor this giant function into smaller ones"
```
### llama.cpp
```bash
llama-server -m DevRouter-1.5B-Q8_0.gguf -c 8192 -ngl 99 --parallel 1
# then POST to /v1/chat/completions with just a user message, temperature 0
```
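The POST mentioned in the comment above could look like the sketch below. It assumes `llama-server` is listening on its default address (`127.0.0.1:8080`) and reuses the 1408-token cap from the Transformers example; `build_request` and `route_prompt` are hypothetical helper names, and the response is assumed to follow the OpenAI-style chat-completions shape.

```python
import json
import urllib.request

# Assumption: llama-server is running locally on its default host and port.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"


def build_request(prompt: str, max_tokens: int = 1408) -> dict:
    """Build the payload: a single user message, temperature 0, no system prompt."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": max_tokens,
    }


def route_prompt(prompt: str, url: str = SERVER_URL, timeout: float = 30.0) -> dict:
    """POST the prompt to the server and return the parsed routing decision."""
    body = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        reply = json.load(resp)
    # The assistant message content is itself the JSON decision.
    return json.loads(reply["choices"][0]["message"]["content"])
```

With the server running, `route_prompt("refactor this giant function into smaller ones")` should return the five-field decision as a dict. The final `json.loads` raises if the model emitted invalid JSON, so wrap the call accordingly.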
### Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("aipster/DevRouter-1.5B")
model = AutoModelForCausalLM.from_pretrained("aipster/DevRouter-1.5B")
# Only a user message is needed: the router system prompt lives in the chat template.
msgs = [{"role": "user", "content": "write a FastAPI POST /items endpoint with a Pydantic model"}]
inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt")
# Greedy decoding; passing the dict also supplies the attention mask.
out = model.generate(**inputs, max_new_tokens=1408, do_sample=False)
prompt_len = inputs["input_ids"].shape[1]  # decode only the newly generated tokens
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))
```
Use **greedy decoding** (`temperature=0`, or `do_sample=False` in Transformers) for stable, parseable JSON.
## Evaluation
Evaluated on a held-out, intent-stratified validation split (`val`, in-distribution) and an
out-of-distribution split (`holdout`, no GitHub-issue sources). Metrics: rate of valid JSON
(strict schema parse) and accuracy of `intent` / `route` / `complexity` against the teacher labels.
| metric | fp16 (val / holdout) | Q8_0 GGUF (val / holdout) |
|---|---|---|
| JSON validity | 0.973 / 0.955 | 0.965 / 0.946 |
| intent accuracy | 0.708 / 0.613 | 0.665 / 0.586 |
| route accuracy | 0.739 / 0.604 | 0.719 / 0.631 |
| complexity accuracy | 0.719 / 0.685 | 0.708 / 0.676 |
Per-intent accuracy (Q8_0, val):
| intent | acc | intent | acc |
|---|---|---|---|
| debug | 0.82 | architecture | 0.72 |
| refactor | 0.72 | documentation | 0.56 |
| explain | 0.73 | boilerplate | 0.64 |
| feature | 0.58 | review | 0.43 |
| optimize | 0.50 | | |
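
To reproduce metrics of this kind on your own prompts, or to check a new quant, something like the sketch below may be enough. It is not the evaluation script used for this card: `score` is a hypothetical helper, the validity check only requires a JSON object with the five schema keys, and it assumes that an invalid response counts as wrong for every field, which the card does not specify.

```python
import json

FIELDS = {"rewrite", "intent", "complexity", "route", "missing"}
LABEL_FIELDS = ("intent", "route", "complexity")


def score(outputs: list[str], teacher: list[dict]) -> dict:
    """Return JSON validity and per-field accuracy against teacher labels."""
    valid = 0
    correct = dict.fromkeys(LABEL_FIELDS, 0)
    for raw, gold in zip(outputs, teacher):
        try:
            pred = json.loads(raw)
        except json.JSONDecodeError:
            continue  # assumption: invalid output is wrong for every field
        if not isinstance(pred, dict) or set(pred) != FIELDS:
            continue
        valid += 1
        for field in LABEL_FIELDS:
            correct[field] += pred[field] == gold[field]
    n = len(outputs)
    result = {"json_validity": valid / n}
    for field in LABEL_FIELDS:
        result[f"{field}_accuracy"] = correct[field] / n
    return result
```

Running the same prompts through two builds (for example F16 and a candidate quant) and comparing the two result dicts gives a quick regression check.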
## Performance
Single RTX 3090, Q8_0 GGUF via llama.cpp (single stream):
- **Generation: ~280 tokens/s** (~3.6 ms/token, constant across output lengths)
- **Prompt eval (prefill): ~10,000–13,000 tokens/s**
- **Latency per routing call: ~1–3 s** (up to ~5 s for the longest outputs)
Throughput scales further with batching/concurrency.
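As a rough sanity check, the latency figures follow from the two throughput numbers. The sketch below assumes latency is just prefill time plus generation time and ignores HTTP, tokenization, and scheduling overhead; the token counts in the example calls are made up for illustration.

```python
GEN_TOKENS_PER_S = 280.0         # generation rate quoted above
PREFILL_TOKENS_PER_S = 10_000.0  # lower end of the quoted prefill range


def estimate_latency_s(prompt_tokens: int, output_tokens: int) -> float:
    """Approximate single-stream latency as prefill time plus generation time."""
    return prompt_tokens / PREFILL_TOKENS_PER_S + output_tokens / GEN_TOKENS_PER_S


# Hypothetical 500-token prompt with a 300-token response.
print(round(estimate_latency_s(500, 300), 2))   # 1.12
# Same prompt with a 1408-token response (the cap used in the Transformers example).
print(round(estimate_latency_s(500, 1408), 2))  # 5.08
```

Generation dominates: prefill contributes about 0.05 s here, so output length is what moves a call from ~1 s toward ~5 s.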
## Limitations
- **No PII detection.** An earlier schema included a PII flag; it was removed in v1.1 because the
training data had too few PII-positive examples (~2.4%) to learn it reliably. Do **not** use this
model for privacy/safety filtering. (PII detection is planned for a future data-focused release.)
- **Weaker on rare intents.** `review` and `documentation` are under-represented and noisier in the
training data, and accuracy on them is lower (0.43 and 0.56 on Q8_0 `val`).
- **Source bias / OOD gap.** Training prompts skew heavily toward GitHub-issue-style text, so intent
accuracy drops by roughly 8–10 points from `val` to `holdout` (0.708 → 0.613 at fp16, 0.665 → 0.586
at Q8_0). Treat `route`/`complexity` as advisory.
- **Quantization sensitivity.** This is a small model doing strict structured output. **Q6_K and
below break the JSON** (validity collapses); ship **Q8_0** or **F16** only. Always validate a
quant before relying on it.
## Training
- **Base:** Qwen2.5-Coder-1.5B-Instruct (Apache 2.0)
- **Method:** QLoRA (4-bit), rank 16, 2 epochs, effective batch 16, `train_on_responses_only`
- **Data:** ~2.6k developer prompts, each labelled by a stronger teacher model into the 5-field
schema, filtered by a judge model and capped per intent. Distillation sources permit training and
open release of derived weights.
## Intended use
Pre-routing / triage of developer prompts in an LLM application: rewriting, intent/complexity
classification, and model-tier selection. **Not** intended for safety filtering, PII detection, or
as a general assistant.
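One way to wire the router into an application is sketched below. Since a few percent of responses are not valid JSON and `route` is advisory, the sketch falls back to a default tier and the original prompt whenever the decision is unusable. The tier names, `FALLBACK_ROUTE`, and `choose_target` are all hypothetical; substitute the models you actually run.

```python
import json

# Hypothetical backend names for each route value.
TIER_FOR_ROUTE = {
    "small_local": "local-small-model",
    "medium_api": "hosted-medium-model",
    "large_api": "hosted-large-model",
}
# Assumption: the middle tier is an acceptable default when routing fails.
FALLBACK_ROUTE = "medium_api"


def choose_target(raw_prompt: str, router_output: str) -> tuple[str, str]:
    """Return (model_name, prompt_to_send) based on the router's raw output."""
    fallback = (TIER_FOR_ROUTE[FALLBACK_ROUTE], raw_prompt)
    try:
        decision = json.loads(router_output)
        route = decision["route"]
        rewrite = decision["rewrite"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return fallback  # invalid JSON, wrong top-level type, or missing fields
    if not isinstance(route, str) or route not in TIER_FOR_ROUTE:
        return fallback
    if not isinstance(rewrite, str) or not rewrite.strip():
        return fallback
    return TIER_FOR_ROUTE[route], rewrite
```

Sending the rewritten prompt is a design choice; forwarding `raw_prompt` unchanged and using only `route` is equally reasonable if you do not want the router to alter user text.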
## License
Apache 2.0, inherited from the base model. You are free to use, modify, and redistribute the model,
including commercially, under the terms of that license.