# Vibe Router v3 – ModernBERT
A tiny LLM router that decides whether a chat request should run locally (on-device) or in the cloud, built on ModernBERT-base.
## How it works

Given a user prompt, the model outputs a single logit. After a sigmoid, values above the routing threshold route to cloud; values below it route to device.
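The decision rule itself is tiny. As a minimal sketch in plain Python (the logit values below are made up for illustration; in practice the logit comes from the ModernBERT classification head):

```python
import math

def route(logit: float, threshold: float = 0.95) -> str:
    """Map the router's single logit to a routing decision."""
    p_cloud = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return "cloud" if p_cloud > threshold else "device"

# Hypothetical logits, for illustration only:
print(route(5.0))  # sigmoid(5.0) ~ 0.993 > 0.95, prints "cloud"
print(route(2.0))  # sigmoid(2.0) ~ 0.881 <= 0.95, prints "device"
```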
## Recommended thresholds
The optimal threshold depends on your use case. Higher thresholds send more traffic to the device model, saving cost and latency at the expense of quality.
| Threshold | Cloud % | Use case |
|-----------|---------|----------|
| 0.526 | ~100% | Maximum quality – only trivially easy prompts go to device |
| 0.90 | ~85% | Conservative – most traffic still goes to cloud |
| 0.95 | ~65% | Balanced (recommended) – simple queries go to device, complex to cloud |
| 0.97 | ~55% | Cost-saving – more device routing, slight quality tradeoff |
| 0.99 | ~78% cloud on test set | Aggressive device routing |
Start with `threshold=0.95` for a good balance between quality and cost savings. Adjust based on your device model's capabilities.
## Training
Fine-tuned end-to-end from answerdotai/ModernBERT-base using Privileged Information Distillation (PID) loss on 49,700 labeled prompt pairs with soft teacher labels derived from dual-judge pairwise comparison (GPT-4o + Claude Sonnet 4).
| Hyperparameter | Value |
|----------------|-------|
| Learning rate | 5e-5 |
| β_kl | 0.05 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Epochs | 7 (early stopping, patience=3) |
| Batch size | 128 |
| Precision | bf16 |
| Hardware | NVIDIA H200 141GB |
| Training time | ~16 min (best config) |
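The exact form of the PID loss is not spelled out here. A plausible reading, given the β_kl hyperparameter and the soft teacher labels, is a soft-label binary cross-entropy plus a β_kl-weighted Bernoulli KL term against the teacher distribution; the sketch below is an assumption about that combination, not the repository's confirmed implementation:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def pid_loss(student_logit: float, teacher_p: float, beta_kl: float = 0.05) -> float:
    """Soft-label BCE plus beta_kl-weighted Bernoulli KL(teacher || student).

    This pairing is an assumption based on the beta_kl hyperparameter;
    the actual PID objective may differ.
    """
    eps = 1e-7
    p = min(max(sigmoid(student_logit), eps), 1 - eps)  # student p(cloud)
    t = min(max(teacher_p, eps), 1 - eps)               # soft teacher label
    bce = -(t * math.log(p) + (1 - t) * math.log(1 - p))
    kl = t * math.log(t / p) + (1 - t) * math.log((1 - t) / (1 - p))
    return bce + beta_kl * kl
```

When the student matches the teacher exactly, the KL term vanishes and the loss reduces to the teacher's entropy.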
## HP sweep results

| Config | Learning rate | Val loss | Time |
|--------|---------------|----------|------|
| 1 | 1e-5 | 0.08041 | 23 min |
| 2 | 2e-5 | 0.07781 | 23 min |
| 3 (best) | 5e-5 | 0.07019 | 16 min |
## Performance

| Metric | Value |
|--------|-------|
| Utility | 0.9721 |
| Cloud rate (t=0.526) | 99.97% |
| Regret | 0.0121 |
| Catastrophic miss rate | 0.0% |
| ECE (uncalibrated) | 0.0049 |
| ECE (calibrated) | 0.0024 |
| Temperature (calibration) | 1.083 |
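The calibration row refers to standard temperature scaling: divide the raw logit by T = 1.083 before the sigmoid, which slightly softens the probabilities (T > 1 pulls them toward 0.5). A minimal sketch:

```python
import math

def calibrated_p_cloud(logit: float, temperature: float = 1.083) -> float:
    """Apply temperature scaling (divide the logit by T) before the sigmoid."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))

# For a raw logit of 3.0 (hypothetical value, for illustration):
raw = 1.0 / (1.0 + math.exp(-3.0))       # ~ 0.953 uncalibrated
calibrated = calibrated_p_cloud(3.0)     # ~ 0.941 after scaling
```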
## Baselines

| Model | Utility | Cloud % | Regret | Cat. miss |
|-------|---------|---------|--------|-----------|
| Always device | 0.028 | 0% | 0.956 | 95.9% |
| Always cloud | 0.972 | 100% | 0.012 | 0.0% |
| ModernBERT v3 (PID) | 0.972 | 100% | 0.012 | 0.0% |
## Threshold sweep (test set)

| Threshold | Utility | Cloud % | Regret | Cat. miss |
|-----------|---------|---------|--------|-----------|
| 0.53 | 0.9721 | 100.0% | 0.0121 | 0.0% |
| 0.63 | 0.9718 | 99.9% | 0.0123 | 0.0% |
| 0.78 | 0.9712 | 99.8% | 0.0130 | 0.1% |
| 0.89 | 0.9693 | 99.4% | 0.0149 | 0.4% |
| 0.94 | 0.9609 | 98.3% | 0.0233 | 1.3% |
| 0.99 | 0.7849 | 78.3% | 0.1992 | 19.6% |
## Latency

| Prompt length | H200 GPU | Apple Silicon MPS |
|---------------|----------|-------------------|
| Short (1-5 tokens) | ~8 ms | ~11 ms |
| Medium (10-20 tokens) | ~8.5 ms | ~35 ms |
| Long (30+ tokens) | ~8.8 ms | ~45 ms |
## Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model_id = "darkolorin/vibe-router-modernbert-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
model.eval()

prompt = "Write a Python B-tree implementation"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

# The sigmoid of the single logit is p(cloud); route by threshold.
p_cloud = torch.sigmoid(logits).item()
threshold = 0.95
decision = "cloud" if p_cloud > threshold else "device"
print(f"p(cloud)={p_cloud:.3f} -> {decision}")
```
## Routing examples (threshold=0.95)

| Prompt | p(cloud) | Decision |
|--------|----------|----------|
| hi | 0.932 | device |
| hello | 0.847 | device |
| what is 2+2? | 0.938 | device |
| how are you? | 0.897 | device |
| tell me a joke | 0.907 | device |
| what day is it today? | 0.844 | device |
| translate hello to spanish | 0.963 | cloud |
| define photosynthesis | 0.990 | cloud |
| explain recursion | 0.993 | cloud |
| Write a thread-safe LRU cache in Python | 0.997 | cloud |
| Explain quantum entanglement | 0.996 | cloud |
| Design a distributed consensus algorithm | 0.937 | device |
| Implement a transformer attention mechanism | 0.998 | cloud |
| Explain quantum error correction codes | 0.999 | cloud |
## Dataset

- 49,700 samples from diverse HuggingFace conversation datasets
- Both models (LFM2.5-1.2B-Instruct and GPT-5.2) generate responses for each prompt
- Dual-judge pairwise comparison: GPT-4o and Claude Sonnet 4 compare outputs side-by-side
- Soft teacher labels via win-rate aggregation with temperature τ=0.2
- 96.1% cloud-preferred, reflecting genuine capability gap between 1.2B and GPT-5.2
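The card does not spell out how the aggregated win rate and τ combine into a soft label. One common formulation is temperature sharpening of the win rate, sketched below; treat the formula (and the function name `sharpen`) as an assumption, not the confirmed recipe:

```python
def sharpen(win_rate: float, tau: float = 0.2) -> float:
    """Temperature-sharpen a cloud win rate into a soft label in (0, 1).

    tau < 1 pushes values toward 0 or 1; tau = 1 leaves them unchanged.
    This exact formula is an assumption, not the repo's confirmed recipe.
    """
    w = min(max(win_rate, 1e-7), 1 - 1e-7)
    a = w ** (1.0 / tau)
    b = (1.0 - w) ** (1.0 / tau)
    return a / (a + b)
```

With τ=0.2, a mild 0.75 win rate becomes a near-certain soft label, while an even 0.5 split stays at 0.5.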
## Changes from v1

| | v1 | v3 |
|---|----|----|
| Training samples | 5,318 | 49,700 |
| Judge | GPT-4o only | GPT-4o + Claude Sonnet 4 |
| Cloud responses | 19% empty | 0.04% empty |
| Precision | fp32 | bf16 |
| ECE | 0.173 | 0.005 |
| Calibration | None | Temperature scaling (T=1.083) |
## License

Apache 2.0
## Citation

```bibtex
@misc{vibe-router-2026,
  title={Vibe Router: On-Device LLM Routing with Privileged Information Distillation},
  author={Mirai},
  year={2026},
  url={https://github.com/trymirai/vibe_router}
}
```