# Vibe Router – ModernBERT v2

A tiny LLM router, built on ModernBERT-base, that decides whether a chat request should run locally (on-device) or in the cloud.
## How it works

Given a user prompt, the model outputs a single logit. After a sigmoid, values above the threshold (0.371) route to the cloud model; values at or below route to the device model.

- Device model: LiquidAI/LFM2.5-1.2B-Instruct (runs locally via MLX)
- Cloud model: GPT-5.2
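The decision rule above can be sketched in a few lines (a standalone illustration: the `route` helper and the example logits are hypothetical; only the single-logit design and the 0.371 threshold come from this card):

```python
import math

THRESHOLD = 0.371  # tuned threshold from this card

def route(logit: float) -> str:
    """Map the router's single output logit to a destination."""
    p_cloud = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return "cloud" if p_cloud > THRESHOLD else "device"

print(route(2.0))   # sigmoid(2.0) ≈ 0.88 > 0.371 -> "cloud"
print(route(-1.5))  # sigmoid(-1.5) ≈ 0.18 <= 0.371 -> "device"
```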
## Changes from v1

- 10x more data: 50K prompts from LMSYS-Chat-1M, WildChat-1M, UltraChat, OpenAssistant, Alpaca, and No Robots (v1 used 5.3K)
- Pairwise judging: a dual-judge system (GPT-4o + Claude Sonnet 4) with randomized presentation order, replacing single-judge absolute scoring
- Temperature scaling: post-training calibration for well-calibrated probabilities
- GPU device inference: device-model (LFM2.5-1.2B) responses were generated on an H100 GPU via Hugging Face Transformers instead of through a local API
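Temperature scaling divides the logit by a fitted temperature T before the sigmoid. A minimal sketch using the T = 1.083 reported in the Performance table (the fitting procedure itself is not described in this card):

```python
import math

T = 1.083  # fitted temperature, from the Performance table

def calibrated_p_cloud(logit: float) -> float:
    """Temperature scaling: divide the logit by T before the sigmoid."""
    return 1.0 / (1.0 + math.exp(-logit / T))

# T > 1 softens probabilities slightly toward 0.5 without moving p = 0.5
raw = 1.0 / (1.0 + math.exp(-2.0))
print(f"raw={raw:.3f} calibrated={calibrated_p_cloud(2.0):.3f}")
```

Because temperature scaling is a monotone transform, it changes the calibration of the probabilities but not the ranking of prompts.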
## Training

Fine-tuned end-to-end from answerdotai/ModernBERT-base using a Privileged Information Distillation (PID) loss on 50K labeled prompt pairs, with soft teacher labels derived from pairwise dual-judge comparisons.
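The exact form of the PID loss is not given in this card. One plausible sketch, assuming a binary cross-entropy term on hard routing labels plus a β_kl-weighted KL term pulling the student toward the soft teacher probability (both the decomposition and the helper functions are assumptions):

```python
import math

BETA_KL = 0.05  # β_kl from the hyperparameter table

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bernoulli_kl(t: float, p: float) -> float:
    """KL(teacher || student) between two Bernoulli distributions."""
    eps = 1e-6
    t = min(max(t, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return t * math.log(t / p) + (1 - t) * math.log((1 - t) / (1 - p))

def pid_loss(logit: float, hard_label: float, soft_teacher_p: float) -> float:
    """Hypothetical PID-style objective for one example: BCE on the hard
    routing label plus a β_kl-weighted KL term toward the soft teacher
    probability. The actual training loss is not specified in this card."""
    p = sigmoid(logit)
    bce = -(hard_label * math.log(p) + (1 - hard_label) * math.log(1 - p))
    return bce + BETA_KL * bernoulli_kl(soft_teacher_p, p)

# Loss shrinks as the student's probability approaches both targets
print(pid_loss(2.0, 1.0, 0.9))
```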
| Hyperparameter | Value |
|---|---|
| Learning rate | 5e-5 |
| β_kl | 0.05 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Epochs | 3 (early stopping) |
| Batch size | 32 |
| Hardware | NVIDIA H100 80GB |
## Performance
| Metric | v2 | v1 |
|---|---|---|
| Utility | 0.8734 | 0.9762 |
| Cloud rate | 80.3% | 79.4% |
| Regret | 0.1119 | 0.0064 |
| Catastrophic miss rate | 5.1% | 0.0% |
| ECE (uncalibrated) | 0.026 | 0.173 |
| ECE (calibrated) | 0.028 | – |
| Temperature (T) | 1.083 | – |
| Best threshold | 0.371 | 0.371 |
Note: v1 and v2 utility/regret metrics are not directly comparable: v1 used absolute quality scores (0-1), while v2 uses pairwise win rates, which are on a different scale. ECE improved dramatically (0.173 → 0.026).
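ECE here is presumably the standard binned expected calibration error. A minimal sketch of that metric (the bin count and equal-width binning scheme are assumptions, not details from this card):

```python
def ece(probs, labels, n_bins=10):
    """Binned expected calibration error: the weighted mean of
    |observed positive rate - mean predicted probability| per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # equal-width bins over [0, 1]
        bins[idx].append((p, y))
    total = len(probs)
    err = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)  # mean predicted probability
        acc = sum(y for _, y in b) / len(b)   # observed positive rate
        err += len(b) / total * abs(acc - conf)
    return err

# Predictions that match outcomes exactly give zero calibration error
print(ece([0.0, 1.0], [0, 1]))
```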
## Known issue

~19% of cloud-model (GPT-5.2) responses in the training data were empty, causing those prompts to be incorrectly labeled as "device-preferred" and biasing the routing logic. A v3 retraining on filtered data is planned.
## Latency

~7 ms per inference on GPU, ~10 ms on Apple Silicon (MPS).
## Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model_id = "darkolorin/vibe-router-modernbert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
model.eval()

prompt = "Write a Python B-tree implementation"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

p_cloud = torch.sigmoid(logits).item()
threshold = 0.371
decision = "cloud" if p_cloud > threshold else "device"
print(f"p(cloud)={p_cloud:.3f} → {decision}")
```
## License
Apache 2.0
## Citation

```bibtex
@misc{vibe-router-2026,
  title={Vibe Router: On-Device LLM Routing with Privileged Information Distillation},
  author={Mirai},
  year={2026},
  url={https://github.com/trymirai/vibe_router}
}
```