---
license: apache-2.0
language:
- en
tags:
- prompt-routing
- complexity-classifier
- deberta-v3
- llm-router
- cost-optimization
datasets:
- RowRed/ComplexityRouter
- OpenAssistant/oasst2
base_model:
- microsoft/deberta-v3-base
---

# ComplexityRouter: A Complexity based LLM Router

Introducing ComplexityRouter, a lightweight prompt complexity classifier finetuned from **microsoft/deberta-v3-base**. Using prompts from [Open Assistant Conversations Dataset Release 2 (OASST2)](https://huggingface.co/datasets/OpenAssistant/oasst2), some from myself, and some made more complex by Qwen3.5-4B (non-thinking mode), I created a synthetic dataset classifying 4,400 of the prompts using [Qwen3.5-4B Non Thinking Mode](https://huggingface.co/Qwen/Qwen3.5-4B). 

It assigns prompts to one of 4 complexity levels, making it useful for routing queries to the appropriate LLM tier.

## Model Details

### Model Description

- **Model type:** Text Classification (multi‑class)
- **Language:** English
- **Backbone:** microsoft/deberta-v3-base
- **License:** Apache‑2.0
- **Finetuned from model:** microsoft/deberta-v3-base
- **Training data:** OASST2 + synthetic augmentations + manually created prompts  
  Labels generated by **Qwen3.5‑4B** (non‑thinking mode).

### Model Sources

- **Dataset repository:** https://huggingface.co/datasets/RowRed/ComplexityRouter

## Uses

### Direct Use

Route prompts to appropriate LLM tiers based on predicted complexity:

| Level | Meaning | Suggested LLM Tier |
|-------|---------|-------------------|
| 0 (Trivial) | Simple lookups, basic Q&A (e.g., “What is 2+2?”) | Fast/cheap local model |
| 1 (Simple) | Moderate reasoning, basic domain knowledge | Mid‑tier model |
| 2 (Moderate) | Complex reasoning, deep knowledge required | Strong model |
| 3 (Complex) | Very complex reasoning, niche expertise | Frontier API model |

**Recommended routing strategy:** Group levels **0 and 1** together (fast/cheap tier), level 2 as standard, level 3 as premium. The model achieves **93.0% adjacent accuracy** on my test, meaning it rarely misroutes by more than one tier.

### Out‑of‑Scope Use

- Multi‑turn conversation routing (single prompts only).
- Non‑English prompts (training data was English‑only).
- Prompts requiring image or multimodal understanding.

## Bias, Risks, and Limitations

- Training data is synthetic and may not represent all real‑world prompt distributions.
- Level 1 (Simple) and Level 2 (Moderate) have lower per‑class F1 scores – boundary cases are inherently ambiguous.
- The model may struggle with very domain‑specific technical jargon.
- Performance may degrade on prompts that are very different from the training distribution.

## Notice
This is my first attempt making a widespread finetune. There are probably lots of issues, but thought the idea was sound. I might make a second (hopefully better) version eventually, but am not sure where to get lots of high-quality open source data.

## Training Details
### Training Data

| Split       | Samples | Source File          | Notes |
|-------------|---------|----------------------|-------|
| Training    | 2,800   | TRAINING.jsonl       | Used for model training |
| Validation  | 600     | TRAINING.jsonl       | Used for early stopping / hyperparameter tuning |
| Test (internal) | 600 | TRAINING.jsonl      | Used for in‑distribution evaluation |
| Test (held‑out) | 400 | TEST.jsonl   | Fully independent test set (reported results) |

**Total unique prompts:** 4,400

Class distribution (training):
Level 0: 762 (27.2%) • Level 1: 674 (24.1%) • Level 2: 795 (28.4%) • Level 3: 569 (20.3%)

## Training Procedure
- Hardware: NVIDIA T4 (16 GB VRAM, Google Colab)
- Framework: PyTorch 2.11 + Hugging Face Transformers
- Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
- Scheduler: Linear warmup (10% of steps) → linear decay
- Loss: Weighted Cross‑Entropy (classification) + MSE (regression)
- Batch size: 16 (effective 32 with gradient accumulation)
- Epochs: 7 (early stopping patience = 3)
- Training time: ~18 minutes
- Class balancing: sqrt‑scaled class weights + weighted random sampler

## Evaluation Results

Reported on 600 held‑out samples from TRAINING.jsonl (internal test).
|Metric|Value|
|----|----|
|Exact Match Accuracy|64.5%|
|Adjacent (±1) Accuracy|93.0%|
|Macro F1|0.663|
|Weighted F1|0.653|

Per‑Class Performance (internal test, 600 samples)
|Level|Precision|Recall|F1|Support|
|----|----|----|----|----|
|L0 (Trivial)|0.658|0.626|0.642|163|
|L1 (Simple)|0.457|0.628|0.529|145|
|L2 (Moderate)|0.683|0.571|0.622|170|
|L3 (Complex)|0.933|0.795|0.858|122|

Confusion Matrix (internal test, 600 samples)

|    |Pred L0|Pred L1|Pred L2|Pred L3|
|----|----|----|----|----|
|True L0|102|46|13|2|
|True L1|35|91|18|1|
|True L2|15|54|97|4|
|True L3|3|8|14|97|

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn as nn

class PromptComplexityRouter(nn.Module):
    def __init__(self, backbone="microsoft/deberta-v3-base", num_labels=4):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone)
        hidden_size = self.backbone.config.hidden_size
        self.classifier = nn.Sequential(
            nn.Dropout(0.1),
            nn.Linear(hidden_size, 256),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_labels),
        )
    
    def forward(self, input_ids, attention_mask):
        outputs = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        cls_output = outputs.last_hidden_state[:, 0, :]
        return self.classifier(cls_output)

# Load
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("RowRed/ComplexityRouter")
model = PromptComplexityRouter()
model.load_state_dict(
    torch.load("pytorch_model.bin", map_location=device),
    strict=False
)
model.to(device)
model.eval()

# Predict
prompts = ["What is 2+2?", "Explain quantum entanglement in detail."]
encoded = tokenizer(prompts, padding=True, truncation=True, return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(encoded["input_ids"], encoded["attention_mask"])
    probs = torch.softmax(logits, dim=-1)
    predictions = torch.argmax(probs, dim=-1)

for prompt, level in zip(prompts, predictions):
    print(f"Level {level.item()}: {prompt}")
```

## Citation
If you use this model, please cite:

```bibtex
@software{ComplexityRouter,
  author = {RowRed},
  title = {ComplexityRouter},
  year = {2026},
  url = {https://huggingface.co/RowRed/ComplexityRouter}
}
```

Additionally, acknowledge the base dataset and labeling model:

```bibtex
@dataset{oasst2,
  author = {OpenAssistant Contributors},
  title = {Open Assistant Conversations Dataset Release 2},
  year = {2023},
  url = {https://huggingface.co/datasets/OpenAssistant/oasst2}
}

@software{qwen3.5-4b,
  author = {Qwen Team},
  title = {Qwen3.5-4B},
  year = {2026},
  url = {https://huggingface.co/Qwen/Qwen3.5-4B}
}
```

## License
This model is released under Apache‑2.0.
The backbone (microsoft/deberta-v3-base) is MIT‑licensed.
The training dataset is derived from OASST2 (Apache‑2.0) and Qwen3.5‑4B outputs (Apache‑2.0).