# Vibe Router v3 – ModernBERT
A tiny LLM router that decides whether a chat request should run locally (on-device) or in the cloud, built on ModernBERT-base.
## How it works

Given a user prompt, the model outputs a single logit. After a sigmoid, values above the routing threshold route to cloud; values below it route to device.
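The decision rule itself is tiny. As a minimal sketch in plain Python (the logit values below are made up for illustration; in practice the logit comes from the ModernBERT classification head):

```python
import math

def route(logit: float, threshold: float = 0.95) -> str:
    """Map the router's single logit to a routing decision."""
    p_cloud = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return "cloud" if p_cloud > threshold else "device"

# Hypothetical logits, for illustration only:
print(route(5.0))  # sigmoid(5.0) ~ 0.993 > 0.95, prints "cloud"
print(route(2.0))  # sigmoid(2.0) ~ 0.881 <= 0.95, prints "device"
```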
## Recommended thresholds
The optimal threshold depends on your use case. Higher thresholds send more traffic to the device model, saving cost and latency at the expense of quality.
| Threshold | Cloud % | Use case |
|-----------|---------|----------|
| 0.526 | ~100% | Maximum quality – only trivially easy prompts go to device |
| 0.90 | ~85% | Conservative – most traffic still goes to cloud |
| 0.95 | ~65% | Balanced (recommended) – simple queries go to device, complex to cloud |
| 0.97 | ~55% | Cost-saving – more device routing, slight quality tradeoff |
| 0.99 | ~78% cloud on test set | Aggressive device routing |
Start with `threshold=0.95` for a good balance between quality and cost savings. Adjust based on your device model's capabilities.
## Training
Fine-tuned end-to-end from answerdotai/ModernBERT-base using Privileged Information Distillation (PID) loss on 49,700 labeled prompt pairs with soft teacher labels derived from dual-judge pairwise comparison (GPT-4o + Claude Sonnet 4).
| Hyperparameter | Value |
|----------------|-------|
| Learning rate | 5e-5 |
| β_kl | 0.05 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Epochs | 7 (early stopping, patience=3) |
| Batch size | 128 |
| Precision | bf16 |
| Hardware | NVIDIA H200 141GB |
| Training time | ~16 min (best config) |
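The exact form of the PID loss is not spelled out here. A plausible reading, given the β_kl hyperparameter and the soft teacher labels, is a soft-label binary cross-entropy plus a β_kl-weighted Bernoulli KL term against the teacher distribution; the sketch below is an assumption about that combination, not the repository's confirmed implementation:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def pid_loss(student_logit: float, teacher_p: float, beta_kl: float = 0.05) -> float:
    """Soft-label BCE plus beta_kl-weighted Bernoulli KL(teacher || student).

    This pairing is an assumption based on the beta_kl hyperparameter;
    the actual PID objective may differ.
    """
    eps = 1e-7
    p = min(max(sigmoid(student_logit), eps), 1 - eps)  # student p(cloud)
    t = min(max(teacher_p, eps), 1 - eps)               # soft teacher label
    bce = -(t * math.log(p) + (1 - t) * math.log(1 - p))
    kl = t * math.log(t / p) + (1 - t) * math.log((1 - t) / (1 - p))
    return bce + beta_kl * kl
```

When the student matches the teacher exactly, the KL term vanishes and the loss reduces to the teacher's entropy.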
## HP sweep results

| Config | Learning rate | Val loss | Time |
|--------|---------------|----------|------|
| 1 | 1e-5 | 0.08041 | 23 min |
| 2 | 2e-5 | 0.07781 | 23 min |
| 3 (best) | 5e-5 | 0.07019 | 16 min |
## Performance

| Metric | Value |
|--------|-------|
| Utility | 0.9721 |
| Cloud rate (t=0.526) | 99.97% |
| Regret | 0.0121 |
| Catastrophic miss rate | 0.0% |
| ECE (uncalibrated) | 0.0049 |
| ECE (calibrated) | 0.0024 |
| Temperature (calibration) | 1.083 |
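The calibration row refers to standard temperature scaling: divide the raw logit by T = 1.083 before the sigmoid, which slightly softens the probabilities (T > 1 pulls them toward 0.5). A minimal sketch:

```python
import math

def calibrated_p_cloud(logit: float, temperature: float = 1.083) -> float:
    """Apply temperature scaling (divide the logit by T) before the sigmoid."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))

# For a raw logit of 3.0 (hypothetical value, for illustration):
raw = 1.0 / (1.0 + math.exp(-3.0))       # ~ 0.953 uncalibrated
calibrated = calibrated_p_cloud(3.0)     # ~ 0.941 after scaling
```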
## Baselines

| Model | Utility | Cloud % | Regret | Cat. miss |
|-------|---------|---------|--------|-----------|
| Always device | 0.028 | 0% | 0.956 | 95.9% |
| Always cloud | 0.972 | 100% | 0.012 | 0.0% |
| ModernBERT v3 (PID) | 0.972 | 100% | 0.012 | 0.0% |
## Threshold sweep (test set)

| Threshold | Utility | Cloud % | Regret | Cat. miss |
|-----------|---------|---------|--------|-----------|
| 0.53 | 0.9721 | 100.0% | 0.0121 | 0.0% |
| 0.63 | 0.9718 | 99.9% | 0.0123 | 0.0% |
| 0.78 | 0.9712 | 99.8% | 0.0130 | 0.1% |
| 0.89 | 0.9693 | 99.4% | 0.0149 | 0.4% |
| 0.94 | 0.9609 | 98.3% | 0.0233 | 1.3% |
| 0.99 | 0.7849 | 78.3% | 0.1992 | 19.6% |
## Latency

| Prompt length | H200 GPU | Apple Silicon MPS |
|---------------|----------|-------------------|
| Short (1-5 tokens) | ~8 ms | ~11 ms |
| Medium (10-20 tokens) | ~8.5 ms | ~35 ms |
| Long (30+ tokens) | ~8.8 ms | ~45 ms |
## Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model_id = "darkolorin/vibe-router-modernbert-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
model.eval()

prompt = "Write a Python B-tree implementation"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

# The sigmoid of the single logit is p(cloud); route by threshold.
p_cloud = torch.sigmoid(logits).item()
threshold = 0.95
decision = "cloud" if p_cloud > threshold else "device"
print(f"p(cloud)={p_cloud:.3f} -> {decision}")
```
## Routing examples (threshold=0.95)

| Prompt | p(cloud) | Decision |
|--------|----------|----------|
| hi | 0.932 | device |
| hello | 0.847 | device |
| what is 2+2? | 0.938 | device |
| how are you? | 0.897 | device |
| tell me a joke | 0.907 | device |
| what day is it today? | 0.844 | device |
| translate hello to spanish | 0.963 | cloud |
| define photosynthesis | 0.990 | cloud |
| explain recursion | 0.993 | cloud |
| Write a thread-safe LRU cache in Python | 0.997 | cloud |
| Explain quantum entanglement | 0.996 | cloud |
| Design a distributed consensus algorithm | 0.937 | device |
| Implement a transformer attention mechanism | 0.998 | cloud |
| Explain quantum error correction codes | 0.999 | cloud |
## Dataset

- 49,700 samples from diverse HuggingFace conversation datasets
- Both models (LFM2.5-1.2B-Instruct and GPT-5.2) generate responses for each prompt
- Dual-judge pairwise comparison: GPT-4o and Claude Sonnet 4 compare outputs side-by-side
- Soft teacher labels via win-rate aggregation with temperature τ=0.2
- 96.1% cloud-preferred, reflecting genuine capability gap between 1.2B and GPT-5.2
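The card does not spell out how the aggregated win rate and τ combine into a soft label. One common formulation is temperature sharpening of the win rate, sketched below; treat the formula (and the function name `sharpen`) as an assumption, not the confirmed recipe:

```python
def sharpen(win_rate: float, tau: float = 0.2) -> float:
    """Temperature-sharpen a cloud win rate into a soft label in (0, 1).

    tau < 1 pushes values toward 0 or 1; tau = 1 leaves them unchanged.
    This exact formula is an assumption, not the repo's confirmed recipe.
    """
    w = min(max(win_rate, 1e-7), 1 - 1e-7)
    a = w ** (1.0 / tau)
    b = (1.0 - w) ** (1.0 / tau)
    return a / (a + b)
```

With τ=0.2, a mild 0.75 win rate becomes a near-certain soft label, while an even 0.5 split stays at 0.5.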
## Changes from v1

| | v1 | v3 |
|---|----|----|
| Training samples | 5,318 | 49,700 |
| Judge | GPT-4o only | GPT-4o + Claude Sonnet 4 |
| Cloud responses | 19% empty | 0.04% empty |
| Precision | fp32 | bf16 |
| ECE | 0.173 | 0.005 |
| Calibration | None | Temperature scaling (T=1.083) |
## License

Apache 2.0
## Citation

```bibtex
@misc{vibe-router-2026,
  title={Vibe Router: On-Device LLM Routing with Privileged Information Distillation},
  author={Mirai},
  year={2026},
  url={https://github.com/trymirai/vibe_router}
}
```