---
license: apache-2.0
tags:
- llm-routing
- model-selection
- budget-optimization
- knn
language:
- en
library_name: sklearn
pipeline_tag: text-classification
---
# R2-Router: LLM Router with Joint Model-Budget Optimization

**R2-Router** intelligently routes each query to the optimal (LLM, token budget) pair, jointly optimizing accuracy and inference cost. It ranks **#1** on the [RouterArena](https://routerarena.github.io/) leaderboard.

**Paper**: [R2-Router (arXiv)](https://arxiv.org/abs/TODO)
## RouterArena Performance

Official leaderboard results on 8,400 queries:

| Metric | Value |
|--------|-------|
| Accuracy | 71.23% |
| Cost per 1K Queries | $0.061 |
| Arena Score (beta=0.1) | **71.60** |
| Robustness Score | 45.71% |
| Rank | **#1** |
## Quick Start

### Installation

```bash
pip install scikit-learn numpy joblib huggingface_hub sentence-transformers
```
### Complete Example

```python
from huggingface_hub import snapshot_download
from sentence_transformers import SentenceTransformer
import sys

# 1. Download the router
path = snapshot_download("JiaqiXue/r2-router")
sys.path.insert(0, path)
from router import R2Router

# 2. Load the pre-trained KNN checkpoints
router = R2Router.from_pretrained(path)

# 3. Embed your query with Qwen3-0.6B (1024-dim)
embedder = SentenceTransformer("Qwen/Qwen3-0.6B")
embedding = embedder.encode("What is the capital of France?")

# 4. Route!
result = router.route(embedding)
print(f"Model: {result['model_full_name']}")
print(f"Token Budget: {result['token_limit']}")
print(f"Predicted Quality: {result['predicted_quality']:.3f}")
```
### Train from Scratch

```python
from huggingface_hub import snapshot_download
import sys

path = snapshot_download("JiaqiXue/r2-router")
sys.path.insert(0, path)
from router import R2Router

# Fit the KNN from the provided sub_10 training data (custom hyperparameters)
router = R2Router.from_training_data(path, k=80)
```
### Alternative: vLLM Embeddings (Faster for Batches)

```python
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-0.6B", runner="pooling")
outputs = llm.embed(["What is the capital of France?"])
embedding = outputs[0].outputs.embedding
```
## Architecture

R2-Router jointly optimizes **which model** to use and **how many tokens** to allocate per query.

### Routing Formula
```
risk(M, b) = (1 - lambda) * predicted_quality(query, M, b) - lambda * predicted_tokens(query, M) * price_M / 1e6

(M*, b*) = argmax_{M, b} risk(M, b)
```
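To make the selection rule concrete, here is a minimal sketch of that argmax in plain Python. The predictor callables (`predict_quality`, `predict_tokens`) and the `prices_per_m` table are stand-ins for the router's fitted KNN models and the prices in `config.json`, not the actual `router.py` API.

```python
def route(embedding, models, budgets, prices_per_m,
          predict_quality, predict_tokens, lam=0.999):
    """Sketch of the (model, budget) selection rule above."""
    best, best_score = None, float("-inf")
    for m in models:
        # Token count is predicted per model, independent of budget.
        expected_cost = predict_tokens(embedding, m) * prices_per_m[m] / 1e6
        for b in budgets:
            quality = predict_quality(embedding, m, b)
            score = (1 - lam) * quality - lam * expected_cost
            if score > best_score:
                best, best_score = (m, b), score
    return best  # (model_name, token_budget) with the highest score
```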
### Pipeline

```
Input Query
     |
[1] Embed with Qwen3-0.6B -> 1024-dim vector
     |
[2] For each (model, budget) pair:
      - KNN predicts quality (accuracy)
      - KNN predicts output token count
      - Compute risk = (1 - lambda) * quality - lambda * cost
     |
[3] Select the (model, budget) pair with the highest risk
     |
Output: (model_name, token_budget)
```
### Model Pool (6 LLMs)

| Model | Output $/M tokens |
|-------|-------------------|
| Qwen3-235B-A22B | $0.463 |
| Qwen3-Next-80B-A3B | $1.10 |
| Qwen3-30B-A3B | $0.33 |
| Qwen3-Coder-Next | $0.30 |
| Gemini 2.5 Flash | $2.50 |
| Claude 3 Haiku | $1.25 |
### Token Budgets

4 output token limits: **100, 200, 400, 800** tokens.
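For example, at the prices above, a query routed to Qwen3-30B-A3B with a 400-token budget costs at most 400 × $0.33 / 1e6 ≈ $0.00013, while the same budget on Gemini 2.5 Flash can cost up to 400 × $2.50 / 1e6 = $0.001.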
### Key Parameters

| Parameter | Value |
|-----------|-------|
| KNN K | 80 |
| Lambda | 0.999 |
| Distance Metric | Cosine |
| KNN Weights | Distance-weighted |
| Embedding Dim | 1024 |
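These settings map onto scikit-learn roughly as follows; this is a sketch of the configuration only, not the exact construction inside `router.py`:

```python
from sklearn.neighbors import KNeighborsRegressor

# One regressor per prediction target, configured with the
# parameters from the table above.
knn = KNeighborsRegressor(
    n_neighbors=80,      # KNN K
    metric="cosine",     # distance metric
    weights="distance",  # closer neighbors count more
)
# knn.fit(embeddings, targets)  # embeddings: (n_queries, 1024)
```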
## Repository Contents

```
config.json              # Router configuration (models, budgets, prices, hyperparams)
router.py                # Self-contained inference code
training_data/
  embeddings.npy         # Sub_10 training embeddings (809 x 1024)
  labels.json            # Per-(model, budget) accuracy & token labels
checkpoints/
  quality_knn_*.joblib   # Pre-fitted KNN quality predictors (18 total)
  token_knn_*.joblib     # Pre-fitted KNN token predictors (6 total)
```
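To inspect the shipped configuration without loading the full router, `config.json` can be fetched and read on its own. The snippet below only lists the file's top-level keys rather than assuming a particular schema:

```python
import json
from huggingface_hub import hf_hub_download

# Fetch just config.json (models, budgets, prices, hyperparams)
config_path = hf_hub_download("JiaqiXue/r2-router", "config.json")
with open(config_path) as f:
    config = json.load(f)
print(list(config.keys()))
```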
### Two Ways to Use

1. **Load checkpoints** (`from_pretrained`): Directly load the pre-fitted KNN models. No training needed.
2. **Train from data** (`from_training_data`): Use the provided training embeddings and labels to fit your own KNN with custom hyperparameters (e.g., different K, distance metric).
## Training Details

- **Training Data**: RouterArena sub_10 split (809 queries, ~10% of the full 8,400); see the loading sketch below
- **Method**: KNeighborsRegressor with cosine distance, distance-weighted
- **Evaluation**: Full 8,400 RouterArena queries (no data leakage)
- **Training Time**: < 1 second (KNN fitting)
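For readers who want to fit predictors outside `router.py`, the raw sub_10 artifacts listed under Repository Contents can be loaded directly. The sketch stops at loading, since the label schema is whatever `labels.json` defines:

```python
import json
import numpy as np
from huggingface_hub import snapshot_download

# Download the repo and load the raw sub_10 training artifacts.
path = snapshot_download("JiaqiXue/r2-router")
X = np.load(f"{path}/training_data/embeddings.npy")
print(X.shape)  # expected: (809, 1024)

with open(f"{path}/training_data/labels.json") as f:
    labels = json.load(f)  # per-(model, budget) accuracy & token labels
```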
## Citation

```bibtex
@article{r2router2026,
  title={R2-Router: A New Paradigm for LLM Routing with Reasoning},
  author={TODO},
  year={2026},
  url={https://arxiv.org/abs/TODO}
}
```
## License

Apache 2.0