# Model Router: Intelligent AI Gateway Router

An autonomous AI gateway router that routes incoming API requests to the most appropriate backend model. Built with LoRA fine-tuning on Qwen2.5-0.5B-Instruct plus a classification head, achieving 100% routing accuracy on the held-out test set with 1.44 ms average latency.
## Highlights
| Metric | Value |
|---|---|
| Routing Accuracy | 100% |
| Macro F1 | 1.0 |
| Avg Latency | 1.44ms |
| P50 Latency | 0.62ms |
| Base Model | Qwen2.5-0.5B-Instruct |
| Training | 8x NVIDIA H200 GPUs (DDP) |
## Architecture

```
Input: "Analyze this research paper..."
                     │
                     ▼
┌───────────────────────────────────────────┐
│  Qwen2.5-0.5B-Instruct (LoRA-adapted)     │
│  Target modules: q/k/v/o/gate/up/down     │
│  LoRA rank: 64, alpha: 64                 │
│  Output: last-token hidden state [896]    │
└───────────────────────────────────────────┘
                     │
                     ▼
┌───────────────────────────────────────────┐
│  Classification Head                      │
│  Dropout(0.1) → Linear(896 → 6)           │
└───────────────────────────────────────────┘
                     │
                     ▼
Output: "gpt-4-turbo" (probability: 0.92)
```
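The classification head in the diagram can be sketched as a minimal PyTorch module (the class name `RoutingHead` is illustrative; the released weights live in `classifier.pt`):

```python
import torch
import torch.nn as nn

class RoutingHead(nn.Module):
    """Dropout + linear projection from the 896-dim last-token
    hidden state to 6 route logits, as in the diagram above."""

    def __init__(self, hidden_size: int = 896, num_routes: int = 6, dropout: float = 0.1):
        super().__init__()
        self.head = nn.Sequential(nn.Dropout(dropout), nn.Linear(hidden_size, num_routes))

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # last_hidden_state: [batch, seq_len, 896]; take the final token's state
        return self.head(last_hidden_state[:, -1, :])

head = RoutingHead().eval()  # eval() disables dropout at inference
logits = head(torch.randn(2, 16, 896))
print(logits.shape)  # torch.Size([2, 6])
```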
## Supported Routes

| Route | Use Case |
|---|---|
| `gpt-4-turbo` | Complex reasoning, advanced coding, creative writing, long-context analysis |
| `gpt-3.5-turbo` | Simple QA, basic summarization, casual conversation, quick translation |
| `claude-3-opus` | Deep research synthesis, long document analysis, nuanced analysis |
| `claude-3-sonnet` | Balanced analysis, code assistance, general writing, data interpretation |
| `gemini-pro` | Multimodal content, factual QA, web-grounded generation, visual reasoning |
| `mixtral-8x7b` | Fast inference, code generation, roleplay, instruction following |
## Evaluation Results

### Per-Class Performance (Test Set: 1,001 samples)
| Backend Model | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| gpt-4-turbo | 1.00 | 1.00 | 1.00 | 149 |
| gpt-3.5-turbo | 1.00 | 1.00 | 1.00 | 711 |
| claude-3-opus | 1.00 | 1.00 | 1.00 | 49 |
| claude-3-sonnet | 1.00 | 1.00 | 1.00 | 56 |
| gemini-pro | 1.00 | 1.00 | 1.00 | 13 |
| mixtral-8x7b | 1.00 | 1.00 | 1.00 | 23 |
### Training Convergence
| Epoch | Train Loss | Eval Accuracy |
|---|---|---|
| 1 | 1.0108 | 76.8% |
| 2 | 0.2813 | 100.0% |
| 3 | 0.0602 | 100.0% |
| 10 | ~0.0 | 100.0% |
## Quick Start

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the LoRA-adapted backbone
base_model = AutoModelForCausalLM.from_pretrained("unsloth/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base_model, "dknguyen2304/model-router")
model.eval()
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-0.5B-Instruct")

# Load the classifier head (Dropout is a no-op in eval mode)
classifier = torch.nn.Sequential(
    torch.nn.Dropout(0.1),
    torch.nn.Linear(896, 6),
)
classifier.load_state_dict(torch.load("classifier.pt", map_location="cpu"))
classifier.eval()

# Label mapping (see label_mapping.json)
labels = ["gpt-4-turbo", "gpt-3.5-turbo", "claude-3-opus",
          "claude-3-sonnet", "gemini-pro", "mixtral-8x7b"]

# Inference
prompt = "Write a complex recursive algorithm to solve the Tower of Hanoi"
inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[-1][:, -1, :]  # last-token hidden state
    logits = classifier(hidden)  # keep dtypes consistent (see Limitations)
prediction = labels[logits.argmax(dim=-1).item()]
print(f"Route to: {prediction}")
```
## Model Files

```
├── adapter_model.safetensors   # LoRA adapter weights
├── adapter_config.json         # PEFT/LoRA configuration
├── classifier.pt               # Classification head weights
├── router_config.json          # Router configuration
├── label_mapping.json          # Label ↔ ID mappings
└── config/
    ├── training_config.yaml    # Training hyperparameters
    └── deepspeed_config.json   # DeepSpeed config
```
## Training Details
| Parameter | Value |
|---|---|
| Base Model | unsloth/Qwen2.5-0.5B-Instruct |
| LoRA Rank (r) | 64 |
| LoRA Alpha | 64 |
| LoRA Dropout | 0.1 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning Rate | 1e-3 |
| Batch Size | 8 per GPU × 8 GPUs × 4 grad accum = 256 effective |
| Epochs | 10 |
| Max Seq Length | 512 |
| Optimizer | AdamW |
| Scheduler | Cosine with warmup (5%) |
| Precision | BF16 |
| Hardware | 8x NVIDIA H200 (143 GB each) |
| Training Data | 10,000 synthetic samples (80/10/10 split) |
| Total Steps | 350 |
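The LoRA settings in the table correspond to PEFT adapter fields along these lines (an illustrative sketch, not the exact contents of the released `adapter_config.json`):

```python
# Sketch of the LoRA settings from the table above, in the shape of
# the fields PEFT stores in adapter_config.json (illustrative only).
lora_settings = {
    "r": 64,                # LoRA rank
    "lora_alpha": 64,       # scaling factor: alpha / r = 1.0
    "lora_dropout": 0.1,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    "base_model_name_or_path": "unsloth/Qwen2.5-0.5B-Instruct",
}

scaling = lora_settings["lora_alpha"] / lora_settings["r"]
print(scaling)  # 1.0
```

With alpha equal to the rank, the adapter updates are applied at unit scale, a common choice when the learning rate (here 1e-3) is tuned directly.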
## Pipeline

The model was trained via a fully autonomous five-stage pipeline:

1. **Data Generation**: 10,000 synthetic requests with controlled class balance
2. **LLM-as-Judge Labeling**: keyword matching (60%) + semantic scoring (40%)
3. **Distributed Fine-tuning**: DDP training on 8x H200 GPUs
4. **Evaluation**: batch inference with latency measurement
5. **Export**: production-ready artifacts
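The labeling stage's score blend can be sketched as a weighted sum (the weights come from the pipeline description above; the helper name and normalized inputs are assumptions):

```python
def judge_score(keyword_score: float, semantic_score: float) -> float:
    """Blend keyword matching (60%) with semantic scoring (40%),
    per the labeling stage above. Inputs assumed normalized to [0, 1];
    the function name is illustrative, not from this repo."""
    return 0.6 * keyword_score + 0.4 * semantic_score

print(judge_score(1.0, 0.5))
```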
## Limitations & Production Notes

### Current Limitations

- Trained on synthetic data, so real-world request distributions may differ
- Fixed label set: routes only to the six predefined models
- No confidence calibration: consider adding uncertainty thresholds for production
- Sensitive to tensor formatting (FP32 vs. BFloat16, pad-token position)
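Since the router has no confidence calibration, one simple mitigation is a softmax threshold with a fallback route. A minimal sketch, where the 0.7 threshold and the fallback choice are illustrative, not part of the released router:

```python
import torch

def route_with_threshold(logits: torch.Tensor, labels: list,
                         threshold: float = 0.7,
                         fallback: str = "gpt-3.5-turbo") -> str:
    """Return the predicted route, falling back to a default
    when the top softmax probability is below the threshold.
    Threshold and fallback values here are illustrative."""
    probs = torch.softmax(logits, dim=-1)
    conf, idx = probs.max(dim=-1)
    return labels[idx.item()] if conf.item() >= threshold else fallback

labels = ["gpt-4-turbo", "gpt-3.5-turbo", "claude-3-opus",
          "claude-3-sonnet", "gemini-pro", "mixtral-8x7b"]

# Confident prediction: routed normally
print(route_with_threshold(torch.tensor([8.0, 0.0, 0.0, 0.0, 0.0, 0.0]), labels))  # gpt-4-turbo
# Uniform logits (max prob = 1/6): falls back
print(route_with_threshold(torch.tensor([1.0, 1.0, 1.0, 1.0, 1.0, 1.0]), labels))  # gpt-3.5-turbo
```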
### Production Recommendations

**Pin tensor formatting**
- Confirm and pin the BFloat16 dtype at inference
- Fix padding rules to prevent the classification head from biasing toward label index 0

**Train on real data**
- Continue training on real production user prompts
- Synthetic data does not cover natural user typing patterns

**Add async support**
- Add SSE/streaming support for non-blocking responses
- Handle timeouts gracefully when routing to large LLMs

**Timeout handling**
- Large upstream models (DeepSeek, Kimi) may time out (>30-60 s)
- The router must not block synchronously on upstream calls
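The async and timeout recommendations above can be sketched with `asyncio.wait_for`; the function names, the mock backend call, and the 30 s default are all illustrative, not part of this repo:

```python
import asyncio

async def call_backend(route: str, prompt: str) -> str:
    """Placeholder for an upstream model call; a real client
    would use an async HTTP library."""
    await asyncio.sleep(0.01)
    return f"{route}: ok"

async def route_request(route: str, prompt: str, timeout_s: float = 30.0) -> str:
    """Dispatch to the chosen backend without blocking, with a hard
    timeout so slow upstream models cannot stall the router."""
    try:
        return await asyncio.wait_for(call_backend(route, prompt), timeout=timeout_s)
    except asyncio.TimeoutError:
        return "error: upstream model timed out"

print(asyncio.run(route_request("mixtral-8x7b", "hello")))  # mixtral-8x7b: ok
```

In a real deployment the timeout branch would surface a retry or fallback route rather than a bare error string.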
## License
Apache 2.0
## Citation

```bibtex
@misc{model-router-2026,
  title={Model Router: Intelligent AI Gateway Request Routing via LoRA Fine-tuning},
  author={dknguyen2304},
  year={2026},
  url={https://huggingface.co/dknguyen2304/model-router}
}
```