# Model Router: Intelligent AI Gateway Router

An autonomous AI gateway router that routes incoming API requests to the most appropriate backend model. Built with LoRA fine-tuning on Qwen2.5-0.5B-Instruct plus a classification head, achieving 100% routing accuracy on the held-out test set with 1.44 ms average latency.
## Highlights
| Metric | Value |
|---|---|
| Routing Accuracy | 100% |
| Macro F1 | 1.0 |
| Avg Latency | 1.44ms |
| P50 Latency | 0.62ms |
| Base Model | Qwen2.5-0.5B-Instruct |
| Training | 8x NVIDIA H200 GPUs (DDP) |
## Architecture

```
Input: "Analyze this research paper..."
                     │
                     ▼
┌───────────────────────────────────────────┐
│  Qwen2.5-0.5B-Instruct (LoRA-adapted)     │
│  Target modules: q/k/v/o/gate/up/down     │
│  LoRA rank: 64, alpha: 64                 │
│  Output: last-token hidden state [896]    │
└───────────────────────────────────────────┘
                     │
                     ▼
┌───────────────────────────────────────────┐
│  Classification Head                      │
│  Dropout(0.1) → Linear(896 → 6)           │
└───────────────────────────────────────────┘
                     │
                     ▼
Output: "gpt-4-turbo" (probability: 0.92)
```
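The classification head in the diagram can be sketched as a minimal PyTorch module (the class name `RoutingHead` is illustrative; the released weights live in `classifier.pt`):

```python
import torch
import torch.nn as nn

class RoutingHead(nn.Module):
    """Dropout + linear projection from the 896-dim last-token
    hidden state to 6 route logits, as in the diagram above."""

    def __init__(self, hidden_size: int = 896, num_routes: int = 6, dropout: float = 0.1):
        super().__init__()
        self.head = nn.Sequential(nn.Dropout(dropout), nn.Linear(hidden_size, num_routes))

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # last_hidden_state: [batch, seq_len, 896]; take the final token's state
        return self.head(last_hidden_state[:, -1, :])

head = RoutingHead().eval()  # eval() disables dropout at inference
logits = head(torch.randn(2, 16, 896))
print(logits.shape)  # torch.Size([2, 6])
```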
## Supported Routes

| Route | Use Case |
|---|---|
| `gpt-4-turbo` | Complex reasoning, advanced coding, creative writing, long-context analysis |
| `gpt-3.5-turbo` | Simple QA, basic summarization, casual conversation, quick translation |
| `claude-3-opus` | Deep research synthesis, long document analysis, nuanced analysis |
| `claude-3-sonnet` | Balanced analysis, code assistance, general writing, data interpretation |
| `gemini-pro` | Multimodal content, factual QA, web-grounded generation, visual reasoning |
| `mixtral-8x7b` | Fast inference, code generation, roleplay, instruction following |
## Evaluation Results

### Per-Class Performance (Test Set: 1,001 samples)
| Backend Model | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| gpt-4-turbo | 1.00 | 1.00 | 1.00 | 149 |
| gpt-3.5-turbo | 1.00 | 1.00 | 1.00 | 711 |
| claude-3-opus | 1.00 | 1.00 | 1.00 | 49 |
| claude-3-sonnet | 1.00 | 1.00 | 1.00 | 56 |
| gemini-pro | 1.00 | 1.00 | 1.00 | 13 |
| mixtral-8x7b | 1.00 | 1.00 | 1.00 | 23 |
### Training Convergence
| Epoch | Train Loss | Eval Accuracy |
|---|---|---|
| 1 | 1.0108 | 76.8% |
| 2 | 0.2813 | 100.0% |
| 3 | 0.0602 | 100.0% |
| 10 | ~0.0 | 100.0% |
## Quick Start

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the LoRA-adapted backbone
base_model = AutoModelForCausalLM.from_pretrained("unsloth/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base_model, "dknguyen2304/model-router")
model.eval()
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-0.5B-Instruct")

# Load the classifier head (Dropout is a no-op in eval mode)
classifier = torch.nn.Sequential(
    torch.nn.Dropout(0.1),
    torch.nn.Linear(896, 6),
)
classifier.load_state_dict(torch.load("classifier.pt", map_location="cpu"))
classifier.eval()

# Label mapping (see label_mapping.json)
labels = ["gpt-4-turbo", "gpt-3.5-turbo", "claude-3-opus",
          "claude-3-sonnet", "gemini-pro", "mixtral-8x7b"]

# Inference
prompt = "Write a complex recursive algorithm to solve the Tower of Hanoi"
inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[-1][:, -1, :]  # last-token hidden state
    logits = classifier(hidden)  # keep dtypes consistent (see Limitations)
prediction = labels[logits.argmax(dim=-1).item()]
print(f"Route to: {prediction}")
```
## Model Files

```
├── adapter_model.safetensors   # LoRA adapter weights
├── adapter_config.json         # PEFT/LoRA configuration
├── classifier.pt               # Classification head weights
├── router_config.json          # Router configuration
├── label_mapping.json          # Label ↔ ID mappings
└── config/
    ├── training_config.yaml    # Training hyperparameters
    └── deepspeed_config.json   # DeepSpeed config
```
## Training Details
| Parameter | Value |
|---|---|
| Base Model | unsloth/Qwen2.5-0.5B-Instruct |
| LoRA Rank (r) | 64 |
| LoRA Alpha | 64 |
| LoRA Dropout | 0.1 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning Rate | 1e-3 |
| Batch Size | 8 per GPU × 8 GPUs × 4 grad accum = 256 effective |
| Epochs | 10 |
| Max Seq Length | 512 |
| Optimizer | AdamW |
| Scheduler | Cosine with warmup (5%) |
| Precision | BF16 |
| Hardware | 8x NVIDIA H200 (143 GB each) |
| Training Data | 10,000 synthetic samples (80/10/10 split) |
| Total Steps | 350 |
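The LoRA settings in the table correspond to PEFT adapter fields along these lines (an illustrative sketch, not the exact contents of the released `adapter_config.json`):

```python
# Sketch of the LoRA settings from the table above, in the shape of
# the fields PEFT stores in adapter_config.json (illustrative only).
lora_settings = {
    "r": 64,                # LoRA rank
    "lora_alpha": 64,       # scaling factor: alpha / r = 1.0
    "lora_dropout": 0.1,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    "base_model_name_or_path": "unsloth/Qwen2.5-0.5B-Instruct",
}

scaling = lora_settings["lora_alpha"] / lora_settings["r"]
print(scaling)  # 1.0
```

With alpha equal to the rank, the adapter updates are applied at unit scale, a common choice when the learning rate (here 1e-3) is tuned directly.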
## Pipeline

The model was trained via a fully autonomous five-stage pipeline:

1. **Data Generation**: 10,000 synthetic requests with controlled class balance
2. **LLM-as-Judge Labeling**: keyword matching (60%) + semantic scoring (40%)
3. **Distributed Fine-tuning**: DDP training on 8x H200 GPUs
4. **Evaluation**: batch inference with latency measurement
5. **Export**: production-ready artifacts
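The labeling stage's score blend can be sketched as a weighted sum (the weights come from the pipeline description above; the helper name and normalized inputs are assumptions):

```python
def judge_score(keyword_score: float, semantic_score: float) -> float:
    """Blend keyword matching (60%) with semantic scoring (40%),
    per the labeling stage above. Inputs assumed normalized to [0, 1];
    the function name is illustrative, not from this repo."""
    return 0.6 * keyword_score + 0.4 * semantic_score

print(judge_score(1.0, 0.5))
```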
## Limitations & Production Notes

### Current Limitations

- Trained on synthetic data, so real-world request distributions may differ
- Fixed label set: routes only to the six predefined models
- No confidence calibration: consider adding uncertainty thresholds for production
- Sensitive to tensor formatting (FP32 vs. BFloat16, pad-token position)
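Since the router has no confidence calibration, one simple mitigation is a softmax threshold with a fallback route. A minimal sketch, where the 0.7 threshold and the fallback choice are illustrative, not part of the released router:

```python
import torch

def route_with_threshold(logits: torch.Tensor, labels: list,
                         threshold: float = 0.7,
                         fallback: str = "gpt-3.5-turbo") -> str:
    """Return the predicted route, falling back to a default
    when the top softmax probability is below the threshold.
    Threshold and fallback values here are illustrative."""
    probs = torch.softmax(logits, dim=-1)
    conf, idx = probs.max(dim=-1)
    return labels[idx.item()] if conf.item() >= threshold else fallback

labels = ["gpt-4-turbo", "gpt-3.5-turbo", "claude-3-opus",
          "claude-3-sonnet", "gemini-pro", "mixtral-8x7b"]

# Confident prediction: routed normally
print(route_with_threshold(torch.tensor([8.0, 0.0, 0.0, 0.0, 0.0, 0.0]), labels))  # gpt-4-turbo
# Uniform logits (max prob = 1/6): falls back
print(route_with_threshold(torch.tensor([1.0, 1.0, 1.0, 1.0, 1.0, 1.0]), labels))  # gpt-3.5-turbo
```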
### Production Recommendations

**Pin tensor formatting**
- Confirm and pin the BFloat16 dtype at inference
- Fix padding rules to prevent the classification head from biasing toward label index 0

**Train on real data**
- Continue training on real production user prompts
- Synthetic data does not cover natural user typing patterns

**Add async support**
- Add SSE/streaming support for non-blocking responses
- Handle timeouts gracefully when routing to large LLMs

**Timeout handling**
- Large upstream models (DeepSeek, Kimi) may time out (>30-60 s)
- The router must not block synchronously on upstream calls
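The async and timeout recommendations above can be sketched with `asyncio.wait_for`; the function names, the mock backend call, and the 30 s default are all illustrative, not part of this repo:

```python
import asyncio

async def call_backend(route: str, prompt: str) -> str:
    """Placeholder for an upstream model call; a real client
    would use an async HTTP library."""
    await asyncio.sleep(0.01)
    return f"{route}: ok"

async def route_request(route: str, prompt: str, timeout_s: float = 30.0) -> str:
    """Dispatch to the chosen backend without blocking, with a hard
    timeout so slow upstream models cannot stall the router."""
    try:
        return await asyncio.wait_for(call_backend(route, prompt), timeout=timeout_s)
    except asyncio.TimeoutError:
        return "error: upstream model timed out"

print(asyncio.run(route_request("mixtral-8x7b", "hello")))  # mixtral-8x7b: ok
```

In a real deployment the timeout branch would surface a retry or fallback route rather than a bare error string.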
## License
Apache 2.0
## Citation

```bibtex
@misc{model-router-2026,
  title={Model Router: Intelligent AI Gateway Request Routing via LoRA Fine-tuning},
  author={dknguyen2304},
  year={2026},
  url={https://huggingface.co/dknguyen2304/model-router}
}
```