# Vibe Router – ModernBERT v2

A tiny LLM router, built on ModernBERT-base, that decides whether a chat request should run locally (on-device) or in the cloud.
## How it works

Given a user prompt, the model outputs a single logit. After a sigmoid, values above the threshold (0.371) route to the cloud model; values at or below route to the device model.

- Device model: LiquidAI/LFM2.5-1.2B-Instruct (runs locally via MLX)
- Cloud model: GPT-5.2
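The decision rule above can be sketched in a few lines (a standalone illustration: the `route` helper and the example logits are hypothetical; only the single-logit design and the 0.371 threshold come from this card):

```python
import math

THRESHOLD = 0.371  # tuned threshold from this card

def route(logit: float) -> str:
    """Map the router's single output logit to a destination."""
    p_cloud = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return "cloud" if p_cloud > THRESHOLD else "device"

print(route(2.0))   # sigmoid(2.0) ≈ 0.88 > 0.371 -> "cloud"
print(route(-1.5))  # sigmoid(-1.5) ≈ 0.18 <= 0.371 -> "device"
```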
## Changes from v1

- 10x more data: 50K prompts from LMSYS-Chat-1M, WildChat-1M, UltraChat, OpenAssistant, Alpaca, and No Robots (v1 used 5.3K)
- Pairwise judging: a dual-judge system (GPT-4o + Claude Sonnet 4) with randomized presentation order, replacing single-judge absolute scoring
- Temperature scaling: post-training calibration for well-calibrated probabilities
- GPU device inference: device-model (LFM2.5-1.2B) responses were generated on an H100 GPU via Hugging Face Transformers instead of through a local API
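Temperature scaling divides the logit by a fitted temperature T before the sigmoid. A minimal sketch using the T = 1.083 reported in the Performance table (the fitting procedure itself is not described in this card):

```python
import math

T = 1.083  # fitted temperature, from the Performance table

def calibrated_p_cloud(logit: float) -> float:
    """Temperature scaling: divide the logit by T before the sigmoid."""
    return 1.0 / (1.0 + math.exp(-logit / T))

# T > 1 softens probabilities slightly toward 0.5 without moving p = 0.5
raw = 1.0 / (1.0 + math.exp(-2.0))
print(f"raw={raw:.3f} calibrated={calibrated_p_cloud(2.0):.3f}")
```

Because temperature scaling is a monotone transform, it changes the calibration of the probabilities but not the ranking of prompts.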
## Training

Fine-tuned end-to-end from answerdotai/ModernBERT-base using a Privileged Information Distillation (PID) loss on 50K labeled prompt pairs, with soft teacher labels derived from pairwise dual-judge comparisons.
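The exact form of the PID loss is not given in this card. One plausible sketch, assuming a binary cross-entropy term on hard routing labels plus a β_kl-weighted KL term pulling the student toward the soft teacher probability (both the decomposition and the helper functions are assumptions):

```python
import math

BETA_KL = 0.05  # β_kl from the hyperparameter table

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bernoulli_kl(t: float, p: float) -> float:
    """KL(teacher || student) between two Bernoulli distributions."""
    eps = 1e-6
    t = min(max(t, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return t * math.log(t / p) + (1 - t) * math.log((1 - t) / (1 - p))

def pid_loss(logit: float, hard_label: float, soft_teacher_p: float) -> float:
    """Hypothetical PID-style objective for one example: BCE on the hard
    routing label plus a β_kl-weighted KL term toward the soft teacher
    probability. The actual training loss is not specified in this card."""
    p = sigmoid(logit)
    bce = -(hard_label * math.log(p) + (1 - hard_label) * math.log(1 - p))
    return bce + BETA_KL * bernoulli_kl(soft_teacher_p, p)

# Loss shrinks as the student's probability approaches both targets
print(pid_loss(2.0, 1.0, 0.9))
```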
| Hyperparameter | Value |
|---|---|
| Learning rate | 5e-5 |
| β_kl | 0.05 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Epochs | 3 (early stopping) |
| Batch size | 32 |
| Hardware | NVIDIA H100 80GB |
## Performance
| Metric | v2 | v1 |
|---|---|---|
| Utility | 0.8734 | 0.9762 |
| Cloud rate | 80.3% | 79.4% |
| Regret | 0.1119 | 0.0064 |
| Catastrophic miss rate | 5.1% | 0.0% |
| ECE (uncalibrated) | 0.026 | 0.173 |
| ECE (calibrated) | 0.028 | – |
| Temperature (T) | 1.083 | – |
| Best threshold | 0.371 | 0.371 |
Note: v1 and v2 utility/regret metrics are not directly comparable: v1 used absolute quality scores (0-1), while v2 uses pairwise win rates, which are on a different scale. ECE improved dramatically (0.173 → 0.026).
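ECE here is presumably the standard binned expected calibration error. A minimal sketch of that metric (the bin count and equal-width binning scheme are assumptions, not details from this card):

```python
def ece(probs, labels, n_bins=10):
    """Binned expected calibration error: the weighted mean of
    |observed positive rate - mean predicted probability| per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # equal-width bins over [0, 1]
        bins[idx].append((p, y))
    total = len(probs)
    err = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)  # mean predicted probability
        acc = sum(y for _, y in b) / len(b)   # observed positive rate
        err += len(b) / total * abs(acc - conf)
    return err

# Predictions that match outcomes exactly give zero calibration error
print(ece([0.0, 1.0], [0, 1]))
```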
## Known issue

~19% of cloud-model (GPT-5.2) responses in the training data were empty, causing those prompts to be incorrectly labeled as "device-preferred" and biasing the routing logic. A v3 retraining on filtered data is planned.
## Latency

~7 ms per inference on GPU, ~10 ms on Apple Silicon (MPS).
## Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model_id = "darkolorin/vibe-router-modernbert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
model.eval()

prompt = "Write a Python B-tree implementation"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

p_cloud = torch.sigmoid(logits).item()
threshold = 0.371
decision = "cloud" if p_cloud > threshold else "device"
print(f"p(cloud)={p_cloud:.3f} → {decision}")
```
## License
Apache 2.0
## Citation

```bibtex
@misc{vibe-router-2026,
  title={Vibe Router: On-Device LLM Routing with Privileged Information Distillation},
  author={Mirai},
  year={2026},
  url={https://github.com/trymirai/vibe_router}
}
```