Vibe Router — ModernBERT v2

A tiny LLM router that decides whether a chat request should run locally (on-device) or in the cloud, built on ModernBERT-base.

How it works

Given a user prompt, the model outputs a single logit. After a sigmoid, probabilities above the threshold (0.371) route the request to the cloud; values at or below route it to the device.
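As a minimal sketch, the thresholding step looks like this (pure Python; the threshold value is taken from the card, the function name is illustrative):

```python
import math

THRESHOLD = 0.371  # decision threshold reported in the card

def route(logit: float, threshold: float = THRESHOLD) -> str:
    """Map the model's single logit to a routing decision."""
    p_cloud = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return "cloud" if p_cloud > threshold else "device"
```

Note that a logit of 0 (p = 0.5) already routes to the cloud, since the threshold sits below 0.5.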

Changes from v1

  • 10x more data: 50K prompts from LMSYS-Chat-1M, WildChat-1M, UltraChat, OpenAssistant, Alpaca, No Robots (v1 used 5.3K)
  • Pairwise judging: Dual-judge system (GPT-4o + Claude Sonnet 4) with randomized presentation order, replacing single-judge absolute scoring
  • Temperature scaling: Post-hoc calibration so the output probabilities are well calibrated
  • GPU device inference: The device model (LFM2.5-1.2B) now runs on an H100 GPU via Hugging Face transformers instead of through a local API
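The dual-judge pairwise setup above ultimately has to produce a single soft label per prompt. The exact aggregation rule is not documented in the card; averaging the per-judge, per-order outcomes is one plausible choice, sketched below (function name and vote encoding are illustrative):

```python
def soft_label(judge_votes):
    """
    Combine pairwise judgments into a soft 'cloud wins' label.

    judge_votes: one outcome per (judge, presentation order) pair,
    encoded as 1.0 if the cloud response won, 0.5 for a tie,
    0.0 if the device response won. Averaging is an assumption,
    not the documented training recipe.
    """
    return sum(judge_votes) / len(judge_votes)
```

Randomizing presentation order and averaging over both orders is a standard way to cancel position bias in LLM judging.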

Training

Fine-tuned end-to-end from answerdotai/ModernBERT-base using a Privileged Information Distillation (PID) loss on 50K labeled prompt pairs, with soft teacher labels derived from the pairwise dual-judge comparisons.
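The card does not spell out the PID loss. A plausible sketch, consistent with the β_kl hyperparameter below, combines binary cross-entropy on hard labels with a KL term pulling the student toward the teacher's soft "cloud wins" probabilities:

```python
import torch
import torch.nn.functional as F

BETA_KL = 0.05  # β_kl from the hyperparameter table

def pid_loss(student_logits, hard_labels, teacher_probs, beta_kl=BETA_KL):
    """
    Sketch of a distillation loss with privileged teacher labels.
    This is an assumed formulation, not the documented one:
    BCE on hard labels plus β_kl-weighted binary KL(teacher || student).
    """
    bce = F.binary_cross_entropy_with_logits(student_logits, hard_labels)
    eps = 1e-7
    p = torch.sigmoid(student_logits).clamp(eps, 1 - eps)  # student probs
    t = teacher_probs.clamp(eps, 1 - eps)                  # teacher soft labels
    kl = (t * (t / p).log() + (1 - t) * ((1 - t) / (1 - p)).log()).mean()
    return bce + beta_kl * kl
```

When the student already matches the teacher's soft labels, the KL term vanishes and only the hard-label BCE remains.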

Hyperparameter   Value
Learning rate    5e-5
β_kl             0.05
Weight decay     0.01
Warmup ratio     0.1
Epochs           3 (early stopping)
Batch size       32
Hardware         NVIDIA H100 80GB

Performance

Metric                   v2       v1
Utility                  0.8734   0.9762
Cloud rate               80.3%    79.4%
Regret                   0.1119   0.0064
Catastrophic miss rate   5.1%     0.0%
ECE (uncalibrated)       0.026    0.173
ECE (calibrated)         0.028    —
Temperature (T)          1.083    —
Best threshold           0.371    0.371

Note: v1 and v2 utility/regret metrics are not directly comparable: v1 used absolute quality scores (0-1) while v2 uses pairwise win rates, which puts the two on different scales. ECE improved dramatically (0.173 → 0.026).
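Temperature scaling with the fitted T from the table simply divides the logit by T before the sigmoid; a minimal sketch (function name is illustrative):

```python
import math

T = 1.083  # temperature fitted post-training (from the table above)

def calibrated_p_cloud(logit: float, temperature: float = T) -> float:
    """Apply temperature scaling: divide the logit by T before sigmoid."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))
```

Because T > 1, scaling pulls probabilities slightly toward 0.5, softening overconfident predictions without changing the ranking of prompts.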

Known issue

~19% of cloud-model (GPT-5.2) responses in the training data were empty, causing those prompts to be incorrectly labeled as "device-preferred" and biasing the router toward the device. A v3 retraining on filtered data is planned.
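The planned fix amounts to dropping pairs whose cloud response is empty before labeling. A sketch, with illustrative field names (the actual data schema is not documented):

```python
def filter_pairs(rows):
    """
    Drop training pairs whose cloud response is empty or whitespace-only,
    so they cannot be mislabeled as device-preferred.
    'cloud_response' is a hypothetical field name.
    """
    return [r for r in rows if r.get("cloud_response", "").strip()]
```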

Latency

~7ms per inference on GPU, ~10ms on CPU (Apple Silicon MPS).

Usage

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model_id = "darkolorin/vibe-router-modernbert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
model.eval()

prompt = "Write a Python B-tree implementation"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits
    p_cloud = torch.sigmoid(logits).item()  # probability the cloud model is preferred

threshold = 0.371
decision = "cloud" if p_cloud > threshold else "device"
print(f"p(cloud)={p_cloud:.3f} -> {decision}")

License

Apache 2.0

Citation

@misc{vibe-router-2026,
  title={Vibe Router: On-Device LLM Routing with Privileged Information Distillation},
  author={Mirai},
  year={2026},
  url={https://github.com/trymirai/vibe_router}
}