πŸš€ Model Router β€” Intelligent AI Gateway Router

An autonomous AI gateway router that routes incoming API requests to the most appropriate backend model. Built by attaching a classification head to a LoRA fine-tuned Qwen2.5-0.5B-Instruct, it achieves 100% routing accuracy on a held-out synthetic test set with 1.44 ms average latency.

✨ Highlights

| Metric | Value |
|---|---|
| Routing Accuracy | 100% |
| Macro F1 | 1.0 |
| Avg Latency | 1.44 ms |
| P50 Latency | 0.62 ms |
| Base Model | Qwen2.5-0.5B-Instruct |
| Training | 8Γ— NVIDIA H200 GPUs (DDP) |

πŸ—οΈ Architecture

```
Input: "Analyze this research paper..."
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Qwen2.5-0.5B-Instruct (LoRA-adapted)   β”‚
β”‚ Target modules: q/k/v/o/gate/up/down   β”‚
β”‚ LoRA rank: 64, alpha: 64               β”‚
β”‚ Output: last-token hidden state [896]  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Classification Head                    β”‚
β”‚ Dropout(0.1) β†’ Linear(896 β†’ 6)         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
Output: "gpt-4-turbo" (probability: 0.92)
```
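The classification head in the diagram is just dropout plus one linear layer over the 896-dimensional last-token state. A shape-level sketch, with a random tensor standing in for the real hidden state (the weights here are untrained stand-ins, not the shipped `classifier.pt`):

```python
import torch

torch.manual_seed(0)

# Stand-in for the last-token hidden state from Qwen2.5-0.5B (hidden size 896)
hidden = torch.randn(1, 896)

# Classification head exactly as described: Dropout(0.1) -> Linear(896 -> 6)
head = torch.nn.Sequential(torch.nn.Dropout(0.1), torch.nn.Linear(896, 6))
head.eval()  # dropout must be off at inference time

with torch.no_grad():
    probs = torch.softmax(head(hidden), dim=-1)  # one probability per route

print(probs.shape)  # torch.Size([1, 6])
```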

🎯 Supported Routes

| Route | Use Case |
|---|---|
| `gpt-4-turbo` | Complex reasoning, advanced coding, creative writing, long-context analysis |
| `gpt-3.5-turbo` | Simple QA, basic summarization, casual conversation, quick translation |
| `claude-3-opus` | Deep research synthesis, long-document analysis, nuanced analysis |
| `claude-3-sonnet` | Balanced analysis, code assistance, general writing, data interpretation |
| `gemini-pro` | Multimodal content, factual QA, web-grounded generation, visual reasoning |
| `mixtral-8x7b` | Fast inference, code generation, roleplay, instruction following |
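These route names resolve from classifier output indices via `label_mapping.json`. A hypothetical sketch of that mapping; the shipped file is authoritative and its actual index order may differ:

```python
# Hypothetical label <-> id mapping; the shipped label_mapping.json is authoritative
label2id = {
    "gpt-4-turbo": 0,
    "gpt-3.5-turbo": 1,
    "claude-3-opus": 2,
    "claude-3-sonnet": 3,
    "gemini-pro": 4,
    "mixtral-8x7b": 5,
}
id2label = {i: name for name, i in label2id.items()}

# The classifier's argmax index resolves to a route name
print(id2label[0])  # gpt-4-turbo
```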

πŸ“Š Evaluation Results

Per-Class Performance (Test Set: 1,001 samples)

| Backend Model | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| gpt-4-turbo | 1.00 | 1.00 | 1.00 | 149 |
| gpt-3.5-turbo | 1.00 | 1.00 | 1.00 | 711 |
| claude-3-opus | 1.00 | 1.00 | 1.00 | 49 |
| claude-3-sonnet | 1.00 | 1.00 | 1.00 | 56 |
| gemini-pro | 1.00 | 1.00 | 1.00 | 13 |
| mixtral-8x7b | 1.00 | 1.00 | 1.00 | 23 |

Training Convergence

| Epoch | Train Loss | Eval Accuracy |
|---|---|---|
| 1 | 1.0108 | 76.8% |
| 2 | 0.2813 | 100.0% |
| 3 | 0.0602 | 100.0% |
| 10 | ~0.0 | 100.0% |

πŸš€ Quick Start

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the LoRA-adapted base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("unsloth/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base_model, "dknguyen2304/model-router")
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-0.5B-Instruct")
model.eval()

# Load the classification head
classifier = torch.nn.Sequential(
    torch.nn.Dropout(0.1),
    torch.nn.Linear(896, 6)
)
classifier.load_state_dict(torch.load("classifier.pt", map_location="cpu"))
classifier.eval()  # disable dropout at inference time

# Label mapping (index order must match label_mapping.json)
labels = ["gpt-4-turbo", "gpt-3.5-turbo", "claude-3-opus",
          "claude-3-sonnet", "gemini-pro", "mixtral-8x7b"]

# Inference
prompt = "Write a complex recursive algorithm to solve the Tower of Hanoi"
inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[-1][:, -1, :]  # last-token hidden state [1, 896]
    logits = classifier(hidden)
    probs = torch.softmax(logits, dim=-1)
    idx = probs.argmax(dim=-1).item()

print(f"Route to: {labels[idx]} (probability: {probs[0, idx]:.2f})")
```
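The snippet above handles one prompt at a time, so position `-1` is always a real token. With batched, padded inputs, position `-1` can land on a pad token (the padding sensitivity noted under Limitations below). A sketch of mask-aware last-token pooling, with random tensors standing in for real hidden states:

```python
import torch

torch.manual_seed(0)

# Stand-ins for a batch of 3 right-padded sequences, max length 5, hidden size 896
hidden_states = torch.randn(3, 5, 896)
attention_mask = torch.tensor([
    [1, 1, 1, 1, 1],   # full-length sequence
    [1, 1, 1, 0, 0],   # 2 pad positions
    [1, 1, 0, 0, 0],   # 3 pad positions
])

# Index of the last *real* token in each sequence
last_idx = attention_mask.sum(dim=1) - 1          # tensor([4, 2, 1])
batch_idx = torch.arange(hidden_states.size(0))
pooled = hidden_states[batch_idx, last_idx]       # shape [3, 896]

# Feed the pooled states through a head with the same shape as the shipped one
classifier = torch.nn.Linear(896, 6)
logits = classifier(pooled)
print(logits.shape)  # torch.Size([3, 6])
```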

πŸ“ Model Files

```
β”œβ”€β”€ adapter_model.safetensors   # LoRA adapter weights
β”œβ”€β”€ adapter_config.json         # PEFT/LoRA configuration
β”œβ”€β”€ classifier.pt               # Classification head weights
β”œβ”€β”€ router_config.json          # Router configuration
β”œβ”€β”€ label_mapping.json          # Label ↔ ID mappings
└── config/
    β”œβ”€β”€ training_config.yaml    # Training hyperparameters
    └── deepspeed_config.json   # DeepSpeed config
```

βš™οΈ Training Details

| Parameter | Value |
|---|---|
| Base Model | unsloth/Qwen2.5-0.5B-Instruct |
| LoRA Rank (r) | 64 |
| LoRA Alpha | 64 |
| LoRA Dropout | 0.1 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning Rate | 1e-3 |
| Batch Size | 8 per GPU Γ— 8 GPUs Γ— 4 grad accum = 256 effective |
| Epochs | 10 |
| Max Seq Length | 512 |
| Optimizer | AdamW |
| Scheduler | Cosine with warmup (5%) |
| Precision | BF16 |
| Hardware | 8Γ— NVIDIA H200 (143 GB each) |
| Training Data | 10,000 synthetic samples (80/10/10 split) |
| Total Steps | 350 |
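The LoRA hyperparameters above are the same fields PEFT stores in `adapter_config.json`. An illustrative sketch (not the verbatim shipped file) of the key fields, plus the standard LoRA scaling they imply:

```python
# Illustrative reconstruction of the LoRA settings from the table above;
# the shipped adapter_config.json is the authoritative source.
lora_config = {
    "r": 64,
    "lora_alpha": 64,
    "lora_dropout": 0.1,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

# Standard LoRA scales adapter updates by alpha / r; with r == alpha this is 1.0
scaling = lora_config["lora_alpha"] / lora_config["r"]
print(scaling)  # 1.0
```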

πŸ”„ Pipeline

The model was trained via a fully autonomous 5-stage pipeline:

  1. Data Generation β€” 10,000 synthetic requests with controlled class balance
  2. LLM-as-Judge Labeling β€” Keyword matching (60%) + semantic scoring (40%)
  3. Distributed Fine-tuning β€” DDP training on 8x H200 GPUs
  4. Evaluation β€” Batch inference with latency measurement
  5. Export β€” Production-ready artifacts
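Stage 2's labeling rule can be sketched as a weighted blend of the two signals. A minimal illustration with made-up scores; the actual keyword lists and semantic scorer are not part of this release:

```python
# Hypothetical per-route scores for one request, each in [0, 1]
keyword_scores = {"gpt-4-turbo": 0.9, "gpt-3.5-turbo": 0.2, "mixtral-8x7b": 0.4}
semantic_scores = {"gpt-4-turbo": 0.7, "gpt-3.5-turbo": 0.5, "mixtral-8x7b": 0.6}

def judge_label(keyword, semantic, w_keyword=0.6, w_semantic=0.4):
    """Blend keyword matching (60%) and semantic scoring (40%), pick the argmax."""
    combined = {
        route: w_keyword * keyword[route] + w_semantic * semantic[route]
        for route in keyword
    }
    return max(combined, key=combined.get), combined

label, scores = judge_label(keyword_scores, semantic_scores)
print(label)  # gpt-4-turbo (0.6 * 0.9 + 0.4 * 0.7 = 0.82, the highest blend)
```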

⚠️ Limitations & Production Notes

Current Limitations

  • Trained on synthetic data β€” real-world distribution may differ
  • Fixed label set β€” only routes to 6 predefined models
  • No confidence calibration β€” consider adding uncertainty thresholds for production
  • Model sensitive to tensor formatting (FP32 vs BFloat16, pad token position)

Production Recommendations

  1. Pin Tensor Formatting

    β€’ Confirm and pin the BF16 dtype at inference; FP32/BF16 mismatches can shift predictions
    β€’ Fix the padding side and pad-token position so the last-token hidden state never falls on padding, which otherwise biases the classification head toward label index 0
  2. Train on Real Data

    β€’ Continue training on real production user prompts
    β€’ Synthetic data does not cover natural user typing patterns
  3. Implement Async Support

    β€’ Add SSE/streaming support for non-blocking responses
    β€’ Handle timeouts gracefully when routing to large LLMs
  4. Timeout Handling

    β€’ Large upstream models (e.g. DeepSeek, Kimi) may run past 30–60 s timeouts
    β€’ The router must not block synchronously while waiting on them

πŸ“œ License

Apache 2.0

πŸ“– Citation

```bibtex
@misc{model-router-2026,
  title={Model Router: Intelligent AI Gateway Request Routing via LoRA Fine-tuning},
  author={dknguyen2304},
  year={2026},
  url={https://huggingface.co/dknguyen2304/model-router}
}
```