Vibe Router v3 - ModernBERT

A tiny LLM router that decides whether a chat request should run locally (on-device) or in the cloud, built on ModernBERT-base.

How it works

Given a user prompt, the model outputs a single logit. After a sigmoid, probabilities above the threshold route to cloud; those at or below route to device.
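
The decision rule is just a sigmoid and a comparison. A minimal sketch (`route` is an illustrative helper, not part of the released code):

```python
import math

def route(logit: float, threshold: float = 0.95) -> str:
    """Map the router's single logit to a routing decision."""
    p_cloud = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return "cloud" if p_cloud > threshold else "device"
```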

Recommended thresholds

The optimal threshold depends on your use case. Higher thresholds send more traffic to the device model, saving cost and latency at the expense of quality.

| Threshold | Cloud % | Use case |
|-----------|---------|----------|
| 0.526 | ~100% | Maximum quality - only trivially easy prompts go to device |
| 0.90 | ~85% | Conservative - most traffic still goes to cloud |
| 0.95 | ~65% | Balanced (recommended) - simple queries go to device, complex to cloud |
| 0.97 | ~55% | Cost-saving - more device routing, slight quality tradeoff |
| 0.99 | ~78% (test set) | Aggressive device routing |

Start with threshold=0.95 for a good balance between quality and cost savings. Adjust based on your device model's capabilities.

Training

Fine-tuned end-to-end from answerdotai/ModernBERT-base using Privileged Information Distillation (PID) loss on 49,700 labeled prompt pairs with soft teacher labels derived from dual-judge pairwise comparison (GPT-4o + Claude Sonnet 4).
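
The exact PID loss is not spelled out here; a common shape for distillation with soft teacher labels is a hard-label BCE term plus a KL term weighted by β_kl, which matches the β_kl = 0.05 hyperparameter below. A sketch under that assumption (`pid_style_loss` is hypothetical, not the released training code):

```python
import torch
import torch.nn.functional as F

def pid_style_loss(student_logits, teacher_probs, hard_labels, beta_kl=0.05):
    """Sketch of a soft-label distillation loss.

    student_logits: (B,) raw logits from the router head
    teacher_probs:  (B,) soft p(cloud) labels from the judge pipeline
    hard_labels:    (B,) binarized labels in {0., 1.}
    """
    # Supervised term on hard labels.
    bce = F.binary_cross_entropy_with_logits(student_logits, hard_labels)
    # Match the teacher's soft distribution: binary KL(teacher || student).
    p = torch.sigmoid(student_logits).clamp(1e-6, 1 - 1e-6)
    t = teacher_probs.clamp(1e-6, 1 - 1e-6)
    kl = (t * (t / p).log() + (1 - t) * ((1 - t) / (1 - p)).log()).mean()
    return bce + beta_kl * kl
```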

| Hyperparameter | Value |
|----------------|-------|
| Learning rate | 5e-5 |
| β_kl | 0.05 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Epochs | 7 (early stopping, patience=3) |
| Batch size | 128 |
| Precision | bf16 |
| Hardware | NVIDIA H200 141GB |
| Training time | ~16 min (best config) |

HP sweep results

| Config | Learning rate | Val loss | Time |
|--------|---------------|----------|------|
| 1 | 1e-5 | 0.08041 | 23 min |
| 2 | 2e-5 | 0.07781 | 23 min |
| 3 (best) | 5e-5 | 0.07019 | 16 min |

Performance

| Metric | Value |
|--------|-------|
| Utility | 0.9721 |
| Cloud rate (t=0.526) | 99.97% |
| Regret | 0.0121 |
| Catastrophic miss rate | 0.0% |
| ECE (uncalibrated) | 0.0049 |
| ECE (calibrated) | 0.0024 |
| Temperature (calibration) | 1.083 |
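
Temperature scaling divides the logit by a fitted scalar T before the sigmoid; with T = 1.083 > 1 the probabilities are softened slightly. A minimal sketch of applying it at inference (the checkpoint itself outputs raw logits, so you apply T yourself):

```python
import math

def calibrated_p_cloud(logit: float, temperature: float = 1.083) -> float:
    """Apply temperature scaling before the sigmoid (T > 1 softens)."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))
```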

Baselines

| Model | Utility | Cloud % | Regret | Cat. miss |
|-------|---------|---------|--------|-----------|
| Always device | 0.028 | 0% | 0.956 | 95.9% |
| Always cloud | 0.972 | 100% | 0.012 | 0.0% |
| ModernBERT v3 (PID) | 0.972 | 100% | 0.012 | 0.0% |

Threshold sweep (test set)

| Threshold | Utility | Cloud % | Regret | Cat. miss |
|-----------|---------|---------|--------|-----------|
| 0.53 | 0.9721 | 100.0% | 0.0121 | 0.0% |
| 0.63 | 0.9718 | 99.9% | 0.0123 | 0.0% |
| 0.78 | 0.9712 | 99.8% | 0.0130 | 0.1% |
| 0.89 | 0.9693 | 99.4% | 0.0149 | 0.4% |
| 0.94 | 0.9609 | 98.3% | 0.0233 | 1.3% |
| 0.99 | 0.7849 | 78.3% | 0.1992 | 19.6% |
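
One way to use the sweep is to pick the highest threshold that still meets a cloud-rate target. An illustrative helper over the rows above (`pick_threshold` and the `sweep` list are not part of the released code):

```python
# (threshold, cloud rate) pairs from the test-set sweep.
sweep = [
    (0.53, 1.000), (0.63, 0.999), (0.78, 0.998),
    (0.89, 0.994), (0.94, 0.983), (0.99, 0.783),
]

def pick_threshold(min_cloud_rate: float) -> float:
    """Highest threshold that still routes >= min_cloud_rate to cloud."""
    eligible = [t for t, cloud in sweep if cloud >= min_cloud_rate]
    return max(eligible)
```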

Latency

| Prompt length | H200 GPU | Apple Silicon (MPS) |
|---------------|----------|---------------------|
| Short (1-5 tokens) | ~8ms | ~11ms |
| Medium (10-20 tokens) | ~8.5ms | ~35ms |
| Long (30+ tokens) | ~8.8ms | ~45ms |

Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model_id = "darkolorin/vibe-router-modernbert-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
model.eval()

prompt = "Write a Python B-tree implementation"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits
    p_cloud = torch.sigmoid(logits).item()

threshold = 0.95  # recommended balanced threshold
decision = "cloud" if p_cloud > threshold else "device"
print(f"p(cloud)={p_cloud:.3f} -> {decision}")
```

Routing examples (threshold=0.95)

| Prompt | p(cloud) | Decision |
|--------|----------|----------|
| hi | 0.932 | device |
| hello | 0.847 | device |
| what is 2+2? | 0.938 | device |
| how are you? | 0.897 | device |
| tell me a joke | 0.907 | device |
| what day is it today? | 0.844 | device |
| translate hello to spanish | 0.963 | cloud |
| define photosynthesis | 0.990 | cloud |
| explain recursion | 0.993 | cloud |
| Write a thread-safe LRU cache in Python | 0.997 | cloud |
| Explain quantum entanglement | 0.996 | cloud |
| Design a distributed consensus algorithm | 0.937 | device |
| Implement a transformer attention mechanism | 0.998 | cloud |
| Explain quantum error correction codes | 0.999 | cloud |

Dataset

  • 49,700 samples from diverse HuggingFace conversation datasets
  • Both models (LFM2.5-1.2B-Instruct and GPT-5.2) generate responses for each prompt
  • Dual-judge pairwise comparison: GPT-4o and Claude Sonnet 4 compare outputs side-by-side
  • Soft teacher labels via win-rate aggregation with temperature τ=0.2
  • 96.1% cloud-preferred, reflecting genuine capability gap between 1.2B and GPT-5.2
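
The card does not specify the exact win-rate aggregation. One plausible form of tempered aggregation (an assumption, not the actual pipeline) divides the log-odds of the cloud win rate by τ before mapping back through a sigmoid, so τ = 0.2 < 1 sharpens labels toward 0 or 1:

```python
import math

def soft_label(cloud_win_rate: float, tau: float = 0.2) -> float:
    """Hypothetical sharpening of a judge win rate into a soft target.

    Divides the log-odds of the win rate by tau (tau < 1 sharpens),
    then maps back through a sigmoid. Illustrative only.
    """
    w = min(max(cloud_win_rate, 1e-6), 1 - 1e-6)
    log_odds = math.log(w / (1 - w))
    return 1.0 / (1.0 + math.exp(-log_odds / tau))
```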

Changes from v1

| | v1 | v3 |
|---|----|----|
| Training samples | 5,318 | 49,700 |
| Judge | GPT-4o only | GPT-4o + Claude Sonnet 4 |
| Cloud responses | 19% empty | 0.04% empty |
| Precision | fp32 | bf16 |
| ECE | 0.173 | 0.005 |
| Calibration | None | Temperature scaling (T=1.083) |

License

Apache 2.0

Citation

```bibtex
@misc{vibe-router-2026,
  title={Vibe Router: On-Device LLM Routing with Privileged Information Distillation},
  author={Mirai},
  year={2026},
  url={https://github.com/trymirai/vibe_router}
}
```