darkolorin committed
Commit 09aaf73 · verified · 1 Parent(s): 2d9cea5

Update to v2: 50K pairwise-labeled dataset, dual-judge, temperature scaling

Files changed (4)
  1. README.md +25 -31
  2. model.safetensors +1 -1
  3. router_config.json +9 -5
  4. sweep_results.json +12 -12
README.md CHANGED
@@ -17,7 +17,7 @@ language:
 - en
 ---
 
-# Vibe Router — ModernBERT
+# Vibe Router — ModernBERT v2
 
 A tiny LLM router that decides whether a chat request should run **locally** (on-device) or in the **cloud**, built on [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base).
 
@@ -28,13 +28,20 @@ Given a user prompt, the model outputs a single logit. After sigmoid, values abo
 - **Device model**: [LiquidAI/LFM2.5-1.2B-Instruct](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct) (runs locally via MLX)
 - **Cloud model**: GPT-5.2
 
+## Changes from v1
+
+- **10x more data**: 50K prompts from LMSYS-Chat-1M, WildChat-1M, UltraChat, OpenAssistant, Alpaca, No Robots (v1 used 5.3K)
+- **Pairwise judging**: dual-judge system (GPT-4o + Claude Sonnet 4) with randomized presentation order, replacing single-judge absolute scoring
+- **Temperature scaling**: post-training calibration for well-calibrated probabilities
+- **GPU device inference**: the device model (LFM2.5-1.2B) was run on an H100 GPU via HuggingFace transformers instead of a local API
+
 ## Training
 
-Fine-tuned end-to-end from `answerdotai/ModernBERT-base` using **Privileged Information Distillation (PID)** loss on 5,318 labeled prompt pairs with soft teacher labels derived from a GPT-4o judge.
+Fine-tuned end-to-end from `answerdotai/ModernBERT-base` using **Privileged Information Distillation (PID)** loss on 50K labeled prompt pairs with soft teacher labels derived from pairwise dual-judge comparison.
 
 | Hyperparameter | Value |
 |----------------|-------|
-| Learning rate | 2e-5 |
+| Learning rate | 5e-5 |
 | β_kl | 0.05 |
 | Weight decay | 0.01 |
 | Warmup ratio | 0.1 |
@@ -44,22 +51,22 @@ Fine-tuned end-to-end from `answerdotai/ModernBERT-base` using **Privileged Info
 
 ## Performance
 
-| Metric | Value |
-|--------|-------|
-| Utility | 0.9762 |
-| Cloud rate | 79.4% |
-| Regret | 0.0064 |
-| Catastrophic miss rate | 0.0% |
-| ECE | 0.173 |
-| Best threshold | 0.371 |
+| Metric | v2 | v1 |
+|--------|-----|-----|
+| Utility | 0.8734 | 0.9762 |
+| Cloud rate | 80.3% | 79.4% |
+| Regret | 0.1119 | 0.0064 |
+| Catastrophic miss rate | 5.1% | 0.0% |
+| ECE (uncalibrated) | 0.026 | 0.173 |
+| ECE (calibrated) | 0.028 | — |
+| Temperature (T) | 1.083 | — |
+| Best threshold | 0.371 | 0.371 |
 
-### Baselines
+**Note**: v1 and v2 utility/regret metrics are not directly comparable — v1 used absolute quality scores (0-1), while v2 uses pairwise win rates, which produce a different scale. ECE improved dramatically (0.173 → 0.026).
 
-| Model | Utility | Cloud% | Regret |
-|-------|---------|--------|--------|
-| Always device | 0.879 | 0% | 0.104 |
-| Always cloud | 0.894 | 100% | 0.089 |
-| **ModernBERT (PID)** | **0.976** | **79.4%** | **0.006** |
+### Known issue
+
+~19% of cloud model (GPT-5.2) responses were empty in the training data, which caused those prompts to be incorrectly labeled as "device-preferred". This introduced a bias into the routing logic. A v3 retraining with filtered data is planned.
 
 ## Latency
 
@@ -71,7 +78,7 @@ Fine-tuned end-to-end from `answerdotai/ModernBERT-base` using **Privileged Info
 from transformers import AutoModelForSequenceClassification, AutoTokenizer
 import torch
 
-model_id = "trymirai/vibe-router-modernbert"
+model_id = "darkolorin/vibe-router-modernbert"
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
 model.eval()
@@ -88,19 +95,6 @@ decision = "cloud" if p_cloud > threshold else "device"
 print(f"p(cloud)={p_cloud:.3f} → {decision}")
 ```
 
-## Routing examples
-
-| Prompt | p(cloud) | Decision |
-|--------|----------|----------|
-| hi | 0.011 | device |
-| 2+2 | 0.009 | device |
-| tell me a joke | 0.012 | device |
-| hello | 0.011 | device |
-| Write a Python B-tree with insert, delete, search | 0.911 | cloud |
-| Implement a REST API with auth and rate limiting | 0.762 | cloud |
-| Derive the volume of a sphere using integration | 0.900 | cloud |
-| Who was the first host of Top Chef? | 0.946 | cloud |
-
 ## License
 
 Apache 2.0
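The dual-judge pairwise labeling described in the new README can be sketched as follows. This is a hypothetical illustration, not the repo's actual pipeline: the `judge` callables stand in for the GPT-4o and Claude Sonnet 4 judges, and it assumes the soft teacher label is the cloud response's win rate averaged over judges, with presentation order randomized to control position bias.

```python
import random

def pairwise_soft_label(judges, prompt, device_resp, cloud_resp, rng=random):
    """Average dual-judge votes, randomizing presentation order per judge.

    Each judge is a callable (prompt, response_a, response_b) -> "A" or "B"
    naming the preferred response. Returns the cloud win rate in [0, 1],
    usable as a soft teacher label.
    """
    wins = 0
    for judge in judges:
        # Randomize which response is shown as "A" to mitigate position bias.
        if rng.random() < 0.5:
            a, b, cloud_slot = device_resp, cloud_resp, "B"
        else:
            a, b, cloud_slot = cloud_resp, device_resp, "A"
        if judge(prompt, a, b) == cloud_slot:
            wins += 1
    return wins / len(judges)
```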
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:bb128103dab9e2938e447b4079d4f0bb3034e2f26cfd2668159f37aeaa54f67f
+oid sha256:463d9c4384baa384785ea5b4545410ed6f1fe1a116748717263adccf7d6022a1
 size 598436708
router_config.json CHANGED
@@ -1,9 +1,11 @@
 {
 "base_model": "answerdotai/ModernBERT-base",
 "best_threshold": 0.37105263157894736,
+"temperature": 1.0832775919732442,
+"calibrated_threshold": 0.37105263157894736,
 "loss": "PID",
 "hp": {
-"lr": 2e-05,
+"lr": 5e-05,
 "beta_kl": 0.05,
 "weight_decay": 0.01,
 "warmup_ratio": 0.1
@@ -11,9 +13,11 @@
 "device_model": "LiquidAI/LFM2.5-1.2B-Instruct",
 "cloud_model": "gpt-5.2",
 "test_results": {
-"utility": 0.9762406349182129,
-"cloud_rate": 0.7944862155388471,
-"regret": 0.006434837356209755,
-"cat_miss": 0.0
+"utility": 0.8734409213066101,
+"cloud_rate": 0.8029333333333334,
+"regret": 0.1118897870182991,
+"cat_miss": 0.0508,
+"ece_uncalibrated": 0.025725143598516808,
+"ece_calibrated": 0.027517356761029774
 }
 }
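A minimal sketch of how the new `temperature` and `calibrated_threshold` fields could be consumed at inference time, assuming standard temperature scaling (the raw logit divided by T before the sigmoid); the `route` helper and the example logit values are illustrative, not part of the repo.

```python
import math

# Values from router_config.json
TEMPERATURE = 1.0832775919732442
THRESHOLD = 0.37105263157894736  # calibrated_threshold

def route(logit: float) -> str:
    """Temperature-scale the raw logit, apply sigmoid, then threshold."""
    p_cloud = 1.0 / (1.0 + math.exp(-logit / TEMPERATURE))
    return "cloud" if p_cloud > THRESHOLD else "device"

print(route(2.3))   # high logit routes to cloud
print(route(-3.1))  # low logit stays on device
```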
sweep_results.json CHANGED
@@ -6,8 +6,8 @@
 "weight_decay": 0.01,
 "warmup_ratio": 0.1
 },
-"val_loss": 0.05074503788000751,
-"time_s": 95.0573191291187
+"val_loss": 0.359864952657737,
+"time_s": 2119.8089495899912
 },
 {
 "hp": {
@@ -16,8 +16,8 @@
 "weight_decay": 0.01,
 "warmup_ratio": 0.1
 },
-"val_loss": 0.0569811669310373,
-"time_s": 107.19165365281515
+"val_loss": 0.39046762092440734,
+"time_s": 2135.0269561820023
 },
 {
 "hp": {
@@ -26,8 +26,8 @@
 "weight_decay": 0.01,
 "warmup_ratio": 0.1
 },
-"val_loss": 0.04958628546137836,
-"time_s": 106.77600225992501
+"val_loss": 0.3433239631793078,
+"time_s": 1772.8355041669856
 },
 {
 "hp": {
@@ -36,8 +36,8 @@
 "weight_decay": 0.01,
 "warmup_ratio": 0.1
 },
-"val_loss": 0.05651537539578055,
-"time_s": 145.05425760895014
+"val_loss": 0.3631987929577921,
+"time_s": 1771.8289752060082
 },
 {
 "hp": {
@@ -46,8 +46,8 @@
 "weight_decay": 0.01,
 "warmup_ratio": 0.1
 },
-"val_loss": 0.04995061208804449,
-"time_s": 89.70805354882032
+"val_loss": 0.33285403746015885,
+"time_s": 1769.0101560780022
 },
 {
 "hp": {
@@ -56,7 +56,7 @@
 "weight_decay": 0.01,
 "warmup_ratio": 0.1
 },
-"val_loss": 0.05411159153170129,
-"time_s": 125.99459161888808
+"val_loss": 0.36301534577911976,
+"time_s": 1770.7788193659944
 }
 ]
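The updated sweep is consumed by picking the configuration with the lowest validation loss. A sketch using the new `val_loss`/`time_s` values from the diff; the `hp` fields are omitted here because the varying hyperparameters are not shown in these hunks.

```python
import json

# Illustrative subset of sweep_results.json: the six updated runs,
# with hp fields omitted (not visible in the diff hunks above).
sweep = json.loads("""[
  {"val_loss": 0.359864952657737,   "time_s": 2119.8089495899912},
  {"val_loss": 0.39046762092440734, "time_s": 2135.0269561820023},
  {"val_loss": 0.3433239631793078,  "time_s": 1772.8355041669856},
  {"val_loss": 0.3631987929577921,  "time_s": 1771.8289752060082},
  {"val_loss": 0.33285403746015885, "time_s": 1769.0101560780022},
  {"val_loss": 0.36301534577911976, "time_s": 1770.7788193659944}
]""")

# Select the run with the lowest validation loss.
best = min(sweep, key=lambda run: run["val_loss"])
print(f"best val_loss: {best['val_loss']:.4f}")
```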