---
license: apache-2.0
tags:
- routing
- code
- mlx
- pid
- cascade
library_name: mlx
---

# Vibe Coding Router v5

A three-tier cascaded router for coding tasks that routes prompts between:

- **Local**: Qwen3-Coder-Next (80B-parameter MoE with 3B active, on-device via MLX)
- **Sonnet**: Claude Sonnet 4.6 (medium-complexity cloud)
- **Opus**: Claude Opus 4.6 (max-capability cloud)

## What's New in v5

v4 suffered from **inverted routing**: simple queries went to cloud while complex ones stayed local. The root cause was a length-quality anti-correlation in the training data, amplified by the PID loss reward weighting. v5 fixes this with:

1. **7 new complexity features** (45 handcrafted total): `is_coding_task`, `junk_score`, `scope_breadth`, `imperative_verb_density`, `noun_phrase_density`, `interaction_complexity`, `requirement_clause_count`
2. **Centered complexity premium**: adjusts training margins by `premium * (complexity_score - center)`, so complex tasks push toward cloud and simple tasks push toward local
3. **Junk prompt clamping**: 75 junk/greeting prompts neutralized (`p_teacher=0.5`, `margin=0.0`)
4. **Reward weight cap**: the PID loss reward weight is capped at 0.5 to prevent outlier margins from dominating

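Fixes 2 and 3 compose into a small target-preparation step. The sketch below is illustrative (the helper name `prepare_targets` is hypothetical, not from the repo); it uses the premium=2.0 and center=0.3 values given in the Training section:

```python
def prepare_targets(p_teacher, margin, complexity_score, is_junk,
                    premium=2.0, center=0.3):
    """Junk clamping + centered complexity premium (illustrative helper).

    Junk/greeting prompts are neutralized (p_teacher=0.5, margin=0.0).
    Otherwise the margin is shifted by premium * (complexity - center):
    scores above the center push toward cloud, scores below it toward local.
    """
    if is_junk:
        return 0.5, 0.0  # junk prompt clamping
    return p_teacher, margin + premium * (complexity_score - center)
```

For example, a prompt with complexity 0.8 gains a +1.0 margin toward cloud, while one at 0.1 is nudged 0.4 toward local.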
## Architecture

Two cascaded binary MLP routers trained with **Privileged Information Distillation (PID)**:

- **Router A** (local vs cloud): 77-dim → [32, 16] → 1, dropout=0.2, LayerNorm+ReLU
- **Router B** (sonnet vs opus): 77-dim → [128, 64] → 1, dropout=0.0, LayerNorm+ReLU

Features: 45 handcrafted code features concatenated with 32 PCA-reduced sentence-embedding dimensions (all-MiniLM-L6-v2), 77 dimensions total.

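Router A's inference pass can be sketched in plain NumPy. This is an illustrative reimplementation, not the shipped MLX code: a Linear → LayerNorm → ReLU ordering is assumed, LayerNorm's affine parameters are omitted, and dropout is inactive at inference.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature axis (affine scale/shift omitted for brevity).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def router_a_forward(features, params):
    """Inference pass of Router A: 77 -> 32 -> 16 -> 1.

    `features` is the 77-dim vector (45 handcrafted + 32 PCA embedding dims);
    `params` is a list of (W, b) pairs, one per linear layer.
    """
    h = features
    for W, b in params[:-1]:
        h = np.maximum(layer_norm(h @ W + b), 0.0)  # Linear -> LayerNorm -> ReLU
    W, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(h @ W + b)))  # sigmoid -> p_cloud
```

Router B has the same shape with [128, 64] hidden widths; its sigmoid output plays the role of p_opus.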
## Training

- **Data**: 1,644 coding prompts with real quality scores from all three models
- **Judge**: GPT-5.4 scoring correctness, completeness, code quality, and explanation
- **Loss**: PID (reward-weighted CE + KL divergence), β_kl=0.02, reward_cap=0.5
- **Label smoothing**: ε=0.05, with a cost-aware margin for Router B (cost_premium=0.03)
- **Complexity premium**: 2.0, centered at 0.3
- **HP sweep**: 108 configurations, 3-way split (1,150 train / 247 val / 247 test)
- **Threshold A**: 0.60 (manually tuned for routing behavior — see note below)
- **Threshold B**: 0.474 (calibrated on the validation set)

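The PID objective above can be sketched as reward-weighted cross-entropy toward the teacher's hard label plus a KL term toward its soft probability. This is an illustrative NumPy version; the actual training code and the exact form of the reward weighting may differ.

```python
import numpy as np

def pid_loss(p_student, p_teacher, reward,
             beta_kl=0.02, reward_cap=0.5, eps=1e-7):
    """Sketch of the PID loss: reward-weighted CE + beta_kl * KL(teacher || student)."""
    p_s = np.clip(p_student, eps, 1 - eps)
    p_t = np.clip(p_teacher, eps, 1 - eps)
    w = np.minimum(reward, reward_cap)       # cap stops outlier rewards dominating
    hard = (p_t > 0.5).astype(float)         # teacher's hard routing label
    ce = -(hard * np.log(p_s) + (1 - hard) * np.log(1 - p_s))
    kl = p_t * np.log(p_t / p_s) + (1 - p_t) * np.log((1 - p_t) / (1 - p_s))
    return float(np.mean(w * ce + beta_kl * kl))
```

With `p_teacher=0.5` and `margin=0.0` (the junk-prompt clamp), the hard label carries no signal and the KL term pulls the student toward indifference.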
### Threshold Note

The utility-optimal Router A threshold (0.01) routes almost nothing to local, because cloud quality is genuinely equal or better on nearly all prompts. The manual threshold of 0.60 trades ~1.4% utility for the intended routing behavior: simple/fast tasks run locally with no network latency, while complex tasks go to cloud.
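Putting the two thresholds together, the cascade reduces to a two-step decision. This is an illustrative sketch (the strict-vs-inclusive comparisons at each threshold are an assumption):

```python
def cascade_decision(p_cloud, p_opus, threshold_a=0.60, threshold_b=0.474):
    """Tier selection: Router A gates local vs cloud at threshold_a;
    Router B splits cloud traffic between Sonnet and Opus at threshold_b."""
    if p_cloud < threshold_a:
        return "local"
    return "opus" if p_opus >= threshold_b else "sonnet"
```

Raising `threshold_a` sends more traffic local; the 0.60 setting produces the distribution in the table below.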
## Real-World Routing (28 test queries, threshold_a=0.60)

| Category | Local | Sonnet | Opus |
|----------|-------|--------|------|
| Simple (8) | 5 (62%) | 0 | 3 (38%) |
| Medium (8) | 3 (38%) | 0 | 5 (62%) |
| Complex (6) | 1 (17%) | 1 (17%) | 4 (67%) |

v4 comparison: simple→local was 0/8 (now 5/8); complex→local was 6/6 (now 1/6).

## Test Set Results (calibrated thresholds)

| Metric | Value |
|--------|-------|
| Utility | 0.6205 |
| Oracle Utility | 0.7179 |
| Regret | 0.0973 |

## Files

- `router_a.safetensors` — Router A weights (32×16 MLP, 13 KB)
- `router_b.safetensors` — Router B weights (128×64 MLP, 76 KB)
- `config.json` — model config, thresholds, hyperparameters, training results
- `scaler.pkl` — StandardScaler for feature normalization
- `embedding_extractor.pkl` — PCA-reduced sentence-transformers extractor
- `sweep_results.json` — full 108-config HP sweep results

## Usage

```python
from router.three_tier_inference import ThreeTierRouter

router = ThreeTierRouter("models/three_tier_v5")
result = router.route("Write a Python function to sort a list")
# result.decision: "local", "sonnet", or "opus"
# result.p_cloud: probability of cloud routing
# result.p_opus: probability of opus (if routed to cloud)
```