R2-Router: A New Paradigm for LLM Routing with Reasoning

R2-Router routes each query to the best (LLM, token budget) pair, jointly optimizing accuracy and inference cost. It is ranked #1 on the RouterArena leaderboard.

Paper: R2-Router (arXiv:2602.02823)

RouterArena Performance

RouterArena Leaderboard

Official leaderboard results on 8,400 queries:

| Metric | Value |
|---|---|
| Accuracy | 71.23% |
| Cost per 1K Queries | $0.061 |
| Arena Score (beta=0.1) | 71.60 |
| Robustness Score | 45.71% |
| Rank | #1 |

Quick Start

Installation

We recommend using uv for fast, reliable environment setup:

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create environment and install dependencies
uv venv .venv && source .venv/bin/activate
uv pip install scikit-learn numpy joblib huggingface_hub vllm

With vLLM Server (Recommended)

Start the embedding server once, then route from any process without reloading the model:

# Terminal 1: Start vLLM embedding server (runs once, stays alive)
uv pip install vllm
vllm serve Qwen/Qwen3-0.6B --runner pooling --port 8000

# Terminal 2: Route queries (connects to the running server)
from huggingface_hub import snapshot_download
import sys

path = snapshot_download("JiaqiXue/r2-router")
sys.path.insert(0, path)

from router import R2Router

router = R2Router.from_pretrained(path, embed_url="http://localhost:8000")
result = router.route_text("Solve this integral")
print(f"Model: {result['model_full_name']}, Budget: {result['token_limit']}")
print(f"Estimated Quality: {result['predicted_quality']:.3f}, Estimated Cost: ${result['predicted_cost']:.6f}")

Adjusting Lambda (Cost-Accuracy Tradeoff)

The lambda parameter controls the tradeoff between accuracy and cost:

  • lambda → 1.0: Minimize cost (routes to cheaper models)
  • lambda → 0.0: Maximize accuracy (routes to the best model regardless of cost)
  • Default: 0.999 (strongly cost-sensitive, as used in our RouterArena submission)

# Cost-sensitive (default, as submitted to RouterArena)
router = R2Router.from_pretrained(path, lambda_val=0.999)

# Balanced accuracy vs cost
router = R2Router.from_pretrained(path, lambda_val=0.5)

# Accuracy-first (ignores cost, always picks the highest predicted quality)
router = R2Router.from_pretrained(path, lambda_val=0.0)

# Override lambda per query
result = router.route_text("Solve this integral", lambda_val=0.5)
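
To make the effect of lambda concrete, here is a minimal sketch that applies the objective from the Routing Formula section to three made-up candidates. The candidate labels, qualities, and costs are hypothetical and are not R2-Router outputs; `score` and `pick` are illustrative helpers, not part of `router.py`.

```python
# Hypothetical candidates: (label, predicted_quality, predicted_cost_usd).
# The numbers are made up for illustration; they are not R2-Router outputs.
CANDIDATES = [
    ("small-model @ 100 tokens", 0.45, 0.00003),
    ("mid-model @ 400 tokens", 0.70, 0.00020),
    ("large-model @ 800 tokens", 0.80, 0.00200),
]


def score(quality: float, cost: float, lambda_val: float) -> float:
    """Objective from the Routing Formula section: higher is better."""
    return (1.0 - lambda_val) * quality - lambda_val * cost


def pick(lambda_val: float) -> str:
    """Return the label of the candidate with the highest score."""
    best = max(CANDIDATES, key=lambda c: score(c[1], c[2], lambda_val))
    return best[0]


for lam in (0.0, 0.5, 0.999, 1.0):
    print(f"lambda={lam}: {pick(lam)}")
```

With these toy numbers, lambda=0.0 and lambda=0.5 both select the large model, lambda=0.999 selects the mid model, and lambda=1.0 selects the small one. Because per-query costs in dollars are several orders of magnitude smaller than quality scores, the cost term only starts to matter when lambda is very close to 1.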

Train from Scratch

from huggingface_hub import snapshot_download
import sys

path = snapshot_download("JiaqiXue/r2-router")
sys.path.insert(0, path)

from router import R2Router

# Train predictors with custom hyperparameters
router = R2Router.from_training_data(path, k=80, lambda_val=0.999)

Architecture

R2-Router jointly optimizes which model to use and how many tokens to allocate per query.

Routing Formula

risk(M, b) = (1 - lambda) * predicted_quality(query, M, b) - lambda * predicted_tokens(query, M) * price_M / 1e6
(M*, b*) = argmax risk
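
The same objective in standard notation, followed by one worked example. The symbols are shorthand introduced here: x is the query, q-hat the predicted quality, t-hat the predicted output token count, and p_M the output price of model M in dollars per million tokens. The example uses the Qwen3-30B-A3B price from the Model Pool table ($0.33) and the default lambda of 0.999; the predicted quality (0.70) and predicted token count (400) are hypothetical values chosen for illustration.

```latex
\[
  \mathrm{risk}(M, b)
    = (1-\lambda)\,\hat{q}(x, M, b)
    \;-\; \lambda\,\frac{\hat{t}(x, M)\, p_M}{10^{6}},
  \qquad
  (M^{*}, b^{*}) = \arg\max_{(M,\,b)} \mathrm{risk}(M, b)
\]

% Worked example (hypothetical predictions): \hat{q} = 0.70, \hat{t} = 400,
% p_M = 0.33, \lambda = 0.999.
\[
  \text{cost} = \frac{400 \times 0.33}{10^{6}} = 1.32 \times 10^{-4}
\]
\[
  \mathrm{risk} = 0.001 \times 0.70 \;-\; 0.999 \times 1.32 \times 10^{-4}
    \approx 7.00 \times 10^{-4} - 1.32 \times 10^{-4}
    = 5.68 \times 10^{-4}
\]
```

Note that the quantity called risk here is maximized, so it behaves like a utility: a higher value means a better tradeoff.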

Pipeline

Input Query
    |
[1] Embed with Qwen3-0.6B -> 1024-dim vector
    |
[2] For each (model, budget) pair:
      - Predict quality (accuracy)
      - Predict output token count
      - Compute risk = (1-lambda) * quality - lambda * cost
    |
[3] Select (model, budget) with highest risk
    |
Output: (model_name, token_budget)
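
Steps [2] and [3] can be sketched as a loop over all (model, budget) pairs. This is a minimal illustration, not the code in `router.py`: the function `route` below, its signature, and the stub predictors `toy_quality` and `toy_tokens` are assumptions standing in for the fitted quality and token predictors, and the "cheap" and "strong" models are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Embedding = List[float]


def route(
    embedding: Embedding,
    prices_per_million: Dict[str, float],
    budgets: List[int],
    predict_quality: Callable[[Embedding, str, int], float],
    predict_tokens: Callable[[Embedding, str], float],
    lambda_val: float,
) -> Tuple[str, int]:
    """Score every (model, budget) pair and return the best one."""
    best_pair, best_score = None, float("-inf")
    for model, budget in product(prices_per_million, budgets):
        quality = predict_quality(embedding, model, budget)
        cost = predict_tokens(embedding, model) * prices_per_million[model] / 1e6
        score = (1.0 - lambda_val) * quality - lambda_val * cost
        if score > best_score:
            best_pair, best_score = (model, budget), score
    if best_pair is None:
        raise ValueError("no (model, budget) candidates to score")
    return best_pair


# Stub predictors (hypothetical) standing in for the fitted predictors.
def toy_quality(emb: Embedding, model: str, budget: int) -> float:
    base = {"cheap": 0.50, "strong": 0.70}[model]
    return min(1.0, base + 0.0002 * budget)  # larger budget, slightly higher quality


def toy_tokens(emb: Embedding, model: str) -> float:
    return {"cheap": 150.0, "strong": 600.0}[model]


prices = {"cheap": 0.30, "strong": 2.50}  # USD per 1M output tokens (made up)
budgets = [100, 200, 400, 800]
embedding = [0.0] * 1024  # step [1] would produce this from the query text

print(route(embedding, prices, budgets, toy_quality, toy_tokens, lambda_val=0.0))
print(route(embedding, prices, budgets, toy_quality, toy_tokens, lambda_val=0.999))
```

In this toy setup the first call returns the strong model and the second returns the cheap one, both at the 800-token budget. The budget is decided by predicted quality alone here because, as in the formula above, the cost term depends on the model but not on the budget.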

Model Pool (6 LLMs)

| Model | Output price ($ per 1M tokens) |
|---|---|
| Qwen3-235B-A22B | $0.463 |
| Qwen3-Next-80B-A3B | $1.10 |
| Qwen3-30B-A3B | $0.33 |
| Qwen3-Coder-Next | $0.30 |
| Gemini 2.5 Flash | $2.50 |
| Claude 3 Haiku | $1.25 |

Token Budgets

Four output token limits: 100, 200, 400, and 800 tokens.
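
To get a feel for the size of the search space and the cost range, the sketch below enumerates the candidate grid and computes an upper bound on output cost per query. It assumes every model is paired with every budget and that the budget is a hard cap on output tokens; `max_output_cost` is a hypothetical helper, and input-token cost is ignored.

```python
from itertools import product

# Output prices in USD per million tokens, copied from the Model Pool table.
PRICES = {
    "Qwen3-235B-A22B": 0.463,
    "Qwen3-Next-80B-A3B": 1.10,
    "Qwen3-30B-A3B": 0.33,
    "Qwen3-Coder-Next": 0.30,
    "Gemini 2.5 Flash": 2.50,
    "Claude 3 Haiku": 1.25,
}
BUDGETS = [100, 200, 400, 800]

# Every (model, budget) pair under the all-pairs assumption.
CANDIDATES = list(product(PRICES, BUDGETS))


def max_output_cost(model: str, budget: int) -> float:
    """Upper bound on output cost in USD, assuming the budget is a hard cap."""
    return budget * PRICES[model] / 1e6


cheapest = min(CANDIDATES, key=lambda mb: max_output_cost(*mb))
priciest = max(CANDIDATES, key=lambda mb: max_output_cost(*mb))

print(len(CANDIDATES))  # 6 models x 4 budgets
print(cheapest, f"${max_output_cost(*cheapest):.6f}")
print(priciest, f"${max_output_cost(*priciest):.6f}")
```

Under these assumptions the cheapest pair is Qwen3-Coder-Next at 100 tokens and the most expensive is Gemini 2.5 Flash at 800 tokens, so the worst-case output cost of a single query stays at or below a fraction of a cent.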

Key Parameters

| Parameter | Value |
|---|---|
| K (neighbors) | 80 |
| Lambda | 0.999 |
| Distance Metric | Cosine |
| Weights | Distance-weighted |
| Embedding Dim | 1024 |

Repository Contents

config.json             # Router configuration (models, budgets, prices, hyperparams)
router.py               # Self-contained inference code (embed + route)
training_data/
  embeddings.npy        # sub_10 training embeddings (809 x 1024)
  labels.json           # Per-(model, budget) accuracy & token labels
checkpoints/
  quality_knn_*.joblib  # Pre-fitted quality predictors (18 total)
  token_knn_*.joblib    # Pre-fitted token predictors (6 total)

Ways to Use

| Method | GPU? | Description |
|---|---|---|
| route_text() + vLLM server | Yes (server) | Start vllm serve once, route from anywhere via HTTP |
| route_text() + local vLLM | Yes (local) | Auto-loads Qwen3-0.6B on first call, caches it |
| route(embedding) | No | Route from pre-computed 1024-dim embedding |
| from_training_data(path) | No | Train your own predictors with custom hyperparameters |

Training Details

Following chayan, we use only the official sub_10 split (809 queries, roughly 10% of the full 8,400) for training. No full-set data is used during training or hyperparameter tuning.

  • Training Data: RouterArena sub_10 split (809 queries)
  • Method: Nearest-neighbor regression with cosine distance, distance-weighted
  • Evaluation: Full 8,400 RouterArena queries (no data leakage)
  • Training Time: < 1 second
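
For readers unfamiliar with the method, here is a from-scratch sketch of distance-weighted nearest-neighbor regression with cosine distance, written with the standard library only. It is an illustration of the technique, not the repository's implementation; the shipped `.joblib` checkpoints presumably hold fitted estimators, and in scikit-learn the closest equivalent would be `KNeighborsRegressor(n_neighbors=80, metric="cosine", weights="distance")`. The function names and the 2-dimensional toy data are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def knn_predict(
    query: Sequence[float],
    train_x: List[Sequence[float]],
    train_y: List[float],
    k: int,
) -> float:
    """Distance-weighted average of the labels of the k nearest neighbors."""
    nearest = sorted(
        (cosine_distance(query, x), y) for x, y in zip(train_x, train_y)
    )[:k]
    # Exact matches take all the weight, which also avoids dividing by zero.
    exact = [y for d, y in nearest if d <= 1e-12]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


# Toy training set: 2-dim "embeddings" with per-query accuracy labels.
train_x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
train_y = [1.0, 0.0, 0.5]

prediction = knn_predict([3.0, 1.0], train_x, train_y, k=2)
print(f"{prediction:.3f}")
```

The query points mostly along the first axis, so its two nearest neighbors are `[1.0, 0.0]` (label 1.0) and `[1.0, 1.0]` (label 0.5), and the prediction lands between the two labels but closer to 1.0. "Training" for this kind of model amounts to storing the embeddings and labels, which is consistent with the sub-second training time reported above.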

Citation

@article{xue2026r2,
  title={R2-Router: A New Paradigm for LLM Routing with Reasoning},
  author={Xue, Jiaqi and Lou, Qian and Xing, Jiarong and Huang, Heng},
  journal={arXiv preprint arXiv:2602.02823},
  year={2026}
}

License

Apache 2.0
