SCOPE: Scalable and Controllable Outcome Performance Estimator
This repository accompanies the paper “Models Under SCOPE: Scalable and Controllable Routing via Pre-hoc Reasoning”, which introduces SCOPE (Scalable and Controllable Outcome Performance Estimator), a framework for large language model (LLM) routing. SCOPE reframes model routing as a pre-hoc estimation problem: instead of directly selecting a model from a fixed candidate set, it predicts each model’s expected performance (correctness) and inference cost (output token length) before execution, based on the model’s historical behavior on similar queries. This enables training-free generalization to unseen models and lets users control the trade-off between accuracy and cost through a budget-aware utility function. Overall, SCOPE provides a scalable, explainable, and controllable solution for allocating test-time compute across heterogeneous model portfolios.
Model Description
- Task: Performance prediction for LLMs
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Training: Supervised Fine-Tuning (SFT) + Reinforcement Learning (GRPO)
- Input: Target question + k anchor questions with performance data
- Output: Predicted length (tokens) and correctness (yes/no)
Intended Use
SCOPE is designed to:
- Predict whether an LLM will answer a question correctly before running expensive inference
- Estimate the output token length for resource planning
- Enable efficient LLM routing and selection
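The routing use case can be illustrated with a minimal sketch. The utility form used here (estimated correctness minus a weighted token cost), the names `Prediction`, `utility`, `route`, `cost_weight`, and `price_per_token`, and all numbers are illustrative assumptions; this is not the utility function defined in the paper.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """One SCOPE-style estimate for a candidate model (hypothetical container)."""
    model: str
    p_correct: float        # estimated probability of a correct answer, in [0, 1]
    length: int             # predicted output length in tokens
    price_per_token: float  # assumed per-token price for this model


def utility(pred: Prediction, cost_weight: float) -> float:
    """Assumed utility: expected correctness minus weighted predicted cost."""
    return pred.p_correct - cost_weight * pred.length * pred.price_per_token


def route(preds: list[Prediction], cost_weight: float) -> Prediction:
    """Pick the candidate with the highest utility."""
    return max(preds, key=lambda p: utility(p, cost_weight))


# Made-up candidates for illustration only
candidates = [
    Prediction("small-model", p_correct=0.60, length=200, price_per_token=0.001),
    Prediction("large-model", p_correct=0.90, length=400, price_per_token=0.002),
]
print(route(candidates, cost_weight=0.0).model)  # accuracy is all that matters
print(route(candidates, cost_weight=1.0).model)  # cost is penalized
```

Raising `cost_weight` shifts the choice toward cheaper models, which is one simple way to expose the accuracy-cost trade-off as a single user-facing knob.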
Quick Start
Installation
pip install "transformers>=4.51.0" torch datasets
# For vLLM inference (optional but recommended)
pip install vllm
Input Format
SCOPE uses the following prompt format:
### Task
You are a performance prediction expert. Given a target question, 5 anchor questions with their performance results, and a target AI model, predict how the model will perform on the target question, specifically the output length and correctness after related reasoning analysis.
### Target Model
{model_name}
Example 1:
Question: {anchor_question_1}
Performance: {len: {length}, correct: {yes/no}}
Example 2:
Question: {anchor_question_2}
Performance: {len: {length}, correct: {yes/no}}
...
### Target Question
{your_target_question}
### Output Format (STRICT)
Analysis: [Your comprehensive analysis covering anchor patterns, target question characteristics, and reasoning.]
Predicted Performance: {len: [integer], correct: [yes/no]}
### Output:
Output Format
The model outputs:
Analysis: [Reasoning about the question difficulty based on anchor patterns...]
Predicted Performance: {len: 256, correct: yes}
Inference Methods
Method 1: Using Transformers (Recommended for Single Inference)
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model
model_name = "Cooolder/SCOPE"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
# Prepare the prompt (see the "Anchor and Prompt Examples" section below)
prompt = """### Task
You are a performance prediction expert. Given a target question, 5 anchor questions with their performance results, and a target AI model, predict how the model will perform on the target question, specifically the output length and correctness after related reasoning analysis.
### Target Model
Qwen/Qwen3-8B-Instruct
Example 1:
Question: What is the capital of France?
Performance: {len: 45, correct: yes}
Example 2:
Question: Solve: 2 + 2 = ?
Performance: {len: 32, correct: yes}
Example 3:
Question: Explain quantum entanglement in simple terms.
Performance: {len: 512, correct: yes}
Example 4:
Question: What is the 50th prime number?
Performance: {len: 128, correct: no}
Example 5:
Question: Write a haiku about programming.
Performance: {len: 78, correct: yes}
### Target Question
What is the derivative of x^3 + 2x^2 - 5x + 7?
### Output Format (STRICT)
Analysis: [Your comprehensive analysis covering anchor patterns, target question characteristics, and reasoning.]
Predicted Performance: {len: [integer], correct: [yes/no]}
### Output:"""
# Format as chat message
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
# Generate
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1536,
    do_sample=True,  # sampling must be enabled for temperature/top_p/top_k to apply
    temperature=0.7,
    top_p=0.8,
    top_k=20,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
response = tokenizer.decode(output_ids, skip_special_tokens=True)
print(response)
Method 2: Using vLLM (Recommended for Batch Inference)
from vllm import LLM, SamplingParams
# Load model with vLLM
model_name = "Cooolder/SCOPE"
llm = LLM(
    model=model_name,
    dtype="bfloat16",
    gpu_memory_utilization=0.90,
    max_model_len=8192,
    trust_remote_code=True,
)
# Prepare prompts (batch processing)
prompts = []
raw_prompt = """### Task
You are a performance prediction expert. Given a target question, 5 anchor questions with their performance results, and a target AI model, predict how the model will perform on the target question, specifically the output length and correctness after related reasoning analysis.
### Target Model
Qwen/Qwen3-8B-Instruct
Example 1:
Question: What is the capital of France?
Performance: {len: 45, correct: yes}
Example 2:
Question: Solve: 2 + 2 = ?
Performance: {len: 32, correct: yes}
Example 3:
Question: Explain quantum entanglement in simple terms.
Performance: {len: 512, correct: yes}
Example 4:
Question: What is the 50th prime number?
Performance: {len: 128, correct: no}
Example 5:
Question: Write a haiku about programming.
Performance: {len: 78, correct: yes}
### Target Question
What is the derivative of x^3 + 2x^2 - 5x + 7?
### Output Format (STRICT)
Analysis: [Your comprehensive analysis covering anchor patterns, target question characteristics, and reasoning.]
Predicted Performance: {len: [integer], correct: [yes/no]}
### Output:"""
# Wrap in Qwen3 chat template
chat_prompt = f"<|im_start|>user\n{raw_prompt}<|im_end|>\n<|im_start|>assistant\n"
prompts.append(chat_prompt)
# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.6,
    max_tokens=1536,
    top_p=0.95,
    top_k=20,
    n=8,  # Generate multiple samples for better confidence estimation
    stop=["<|im_end|>", "<|endoftext|>"],
    stop_token_ids=[151645, 151643]
)
# Run inference
outputs = llm.generate(prompts, sampling_params)
# Parse results
for output in outputs:
    for single_output in output.outputs:
        response = single_output.text.strip()
        print(response)
        print("-" * 50)
Parsing the Output
import re
def parse_prediction(response: str):
    """Parse SCOPE model output to extract predictions.

    Returns a dict, or None if the length or correctness cannot be found.
    """
    # Normalize Markdown-bold variants of the two field labels
    response = re.sub(r'\*\*Analysis:?\*\*:?', 'Analysis:', response)
    response = re.sub(r'\*\*Predicted Performance:?\*\*:?', 'Predicted Performance:', response)
    # Extract the analysis text between the two labels
    analysis = ""
    perf_start = response.rfind('Predicted Performance:')
    if 'Analysis:' in response:
        analysis_start = response.find('Analysis:') + len('Analysis:')
        if perf_start > analysis_start:
            analysis = response[analysis_start:perf_start].strip()
    # Parse len and correct from the final prediction line only, so that
    # anchor values quoted inside the analysis are not picked up by mistake
    prediction = response[perf_start:] if perf_start != -1 else response
    len_match = re.search(r'len:\s*(\d+)', prediction)
    correct_match = re.search(r'correct:\s*(yes|no)', prediction, re.IGNORECASE)
    if not len_match or not correct_match:
        return None
    return {
        'analysis': analysis,
        'predicted_length': int(len_match.group(1)),
        'predicted_correct': correct_match.group(1).lower()
    }
# Example usage (`response` comes from one of the inference methods above)
result = parse_prediction(response)
if result is None:
    print("Could not parse a prediction from the response.")
else:
    print(f"Predicted Length: {result['predicted_length']}")
    print(f"Predicted Correct: {result['predicted_correct']}")
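When several samples are generated per prompt (as with `n=8` in the vLLM example), the parsed results can be combined into a single estimate. The following is a minimal sketch that assumes a majority vote for correctness and the median for length; the function name `aggregate_predictions` and this aggregation rule are illustrative choices, not necessarily the procedure used in the paper.

```python
from collections import Counter
from statistics import median


def aggregate_predictions(parsed):
    """Combine several parsed samples into one estimate.

    `parsed` is a list of dicts with 'predicted_length' and
    'predicted_correct' keys, or None for samples that failed to parse.
    """
    valid = [p for p in parsed if p is not None]
    if not valid:
        return None
    votes = Counter(p["predicted_correct"] for p in valid)
    p_yes = votes["yes"] / len(valid)  # fraction of samples predicting "yes"
    return {
        "p_correct": p_yes,
        "predicted_correct": "yes" if p_yes >= 0.5 else "no",
        "predicted_length": int(median(p["predicted_length"] for p in valid)),
        "num_valid": len(valid),
    }


# Made-up parsed samples; None stands for an unparseable response
samples = [
    {"predicted_length": 120, "predicted_correct": "yes"},
    {"predicted_length": 150, "predicted_correct": "yes"},
    None,
    {"predicted_length": 400, "predicted_correct": "no"},
]
print(aggregate_predictions(samples))
```

The median is used for length because a single outlier sample would otherwise skew a mean-based estimate; the vote fraction `p_correct` doubles as a rough confidence score.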
Anchor and Prompt Examples
Example 1: Math Question Prediction
anchor_text = """Example 1:
Question: What is 15 + 27?
Performance: {len: 28, correct: yes}
Example 2:
Question: Calculate the area of a circle with radius 5.
Performance: {len: 156, correct: yes}
Example 3:
Question: Solve the quadratic equation x^2 - 5x + 6 = 0.
Performance: {len: 245, correct: yes}
Example 4:
Question: What is the integral of sin(x)?
Performance: {len: 89, correct: yes}
Example 5:
Question: Prove that the square root of 2 is irrational.
Performance: {len: 478, correct: no}
"""
target_question = "Find the limit of (x^2 - 1)/(x - 1) as x approaches 1."
model_name = "Qwen/Qwen3-8B-Instruct"
Example 2: Coding Question Prediction
anchor_text = """Example 1:
Question: Write a Python function to check if a number is even.
Performance: {len: 67, correct: yes}
Example 2:
Question: Implement binary search in Python.
Performance: {len: 234, correct: yes}
Example 3:
Question: Write a function to reverse a linked list.
Performance: {len: 312, correct: yes}
Example 4:
Question: Implement a LRU cache in Python.
Performance: {len: 456, correct: no}
Example 5:
Question: Write a recursive function to compute Fibonacci numbers.
Performance: {len: 178, correct: yes}
"""
target_question = "Write a Python function to find the longest palindromic substring."
model_name = "deepseek-ai/DeepSeek-V2-Chat"
Example 3: General Knowledge Prediction
anchor_text = """Example 1:
Question: Who wrote "Romeo and Juliet"?
Performance: {len: 34, correct: yes}
Example 2:
Question: What is the chemical formula for water?
Performance: {len: 42, correct: yes}
Example 3:
Question: Explain the theory of relativity.
Performance: {len: 687, correct: yes}
Example 4:
Question: What year did World War II end?
Performance: {len: 51, correct: yes}
Example 5:
Question: Who was the 23rd President of the United States?
Performance: {len: 89, correct: no}
"""
target_question = "What is the speed of light in a vacuum?"
model_name = "meta-llama/Llama-3-70B-Instruct"
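The `anchor_text`, `target_question`, and `model_name` values in these examples still need to be assembled into a full prompt. A minimal sketch of one way to do that, following the layout shown under "Input Format"; the helper name `build_prompt` and the exact line breaks between sections are assumptions and may need adjusting to match the format used in training.

```python
TASK = (
    "You are a performance prediction expert. Given a target question, "
    "5 anchor questions with their performance results, and a target AI model, "
    "predict how the model will perform on the target question, specifically "
    "the output length and correctness after related reasoning analysis."
)
OUTPUT_FORMAT = (
    "Analysis: [Your comprehensive analysis covering anchor patterns, "
    "target question characteristics, and reasoning.]\n"
    "Predicted Performance: {len: [integer], correct: [yes/no]}"
)


def build_prompt(model_name: str, anchor_text: str, target_question: str) -> str:
    """Assemble a SCOPE prompt from its parts (hypothetical helper)."""
    parts = [
        "### Task", TASK,
        "### Target Model", model_name,
        anchor_text.strip(),
        "### Target Question", target_question,
        "### Output Format (STRICT)", OUTPUT_FORMAT,
        "### Output:",
    ]
    return "\n".join(parts)


# Shortened anchor text for illustration; real prompts use 5 anchors
demo_anchor_text = """Example 1:
Question: What is 15 + 27?
Performance: {len: 28, correct: yes}
"""
demo_prompt = build_prompt(
    "Qwen/Qwen3-8B-Instruct",
    demo_anchor_text,
    "Find the limit of (x^2 - 1)/(x - 1) as x approaches 1.",
)
print(demo_prompt)
```

Plain string joining is used instead of `str.format` because the prompt itself contains literal braces (`{len: ..., correct: ...}`) that would otherwise need escaping.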
Using with Cooolder/kshot_inference Dataset
The model is designed to work with the Cooolder/kshot_inference dataset:
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("Cooolder/kshot_inference", split="train")
# Each sample contains:
# - id: unique identifier
# - prompt: pre-formatted prompt with anchors and target question
# - gt_is_correct: ground truth correctness
# - gt_token_count: ground truth token count
# - source_model: the target model being predicted
# - retrieved_anchors: the anchor questions used
# Example: Run inference on the dataset
for sample in dataset:
    prompt = sample['prompt']
    # Wrap in the chat template, then run inference as in Method 1 or Method 2
    messages = [{"role": "user", "content": prompt}]
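Once predictions have been parsed for the dataset samples, they can be compared against the `gt_is_correct` and `gt_token_count` fields. The sketch below is a minimal example; the function name `score_predictions`, the choice of metrics (accuracy and mean absolute length error), and the assumption that `gt_is_correct` is a boolean-like value are not taken from the paper or the dataset card.

```python
def score_predictions(predictions, samples):
    """Compare parsed predictions with ground-truth dataset fields.

    predictions: list of dicts with 'predicted_correct' ('yes'/'no') and
                 'predicted_length' (int), or None when parsing failed.
    samples:     list of dicts with 'gt_is_correct' and 'gt_token_count'.
    """
    hits, abs_errors = 0, []
    for pred, sample in zip(predictions, samples):
        if pred is None:
            continue  # skip unparseable outputs
        # Assumption: gt_is_correct is truthy/falsy (bool or 0/1), not a string
        gt_correct = "yes" if sample["gt_is_correct"] else "no"
        hits += pred["predicted_correct"] == gt_correct
        abs_errors.append(abs(pred["predicted_length"] - sample["gt_token_count"]))
    n = len(abs_errors)
    if n == 0:
        return None
    return {"n": n, "accuracy": hits / n, "length_mae": sum(abs_errors) / n}


# Made-up predictions and ground truth for illustration
demo_preds = [
    {"predicted_correct": "yes", "predicted_length": 100},
    {"predicted_correct": "no", "predicted_length": 300},
    None,
]
demo_samples = [
    {"gt_is_correct": True, "gt_token_count": 120},
    {"gt_is_correct": True, "gt_token_count": 250},
    {"gt_is_correct": False, "gt_token_count": 50},
]
print(score_predictions(demo_preds, demo_samples))
```

If `gt_is_correct` turns out to be stored as a string, the truthiness check should be replaced with an explicit comparison.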
Performance Tips
- Multiple Sampling: Generate 8+ samples and aggregate predictions for better accuracy
- Temperature: Use 0.6-0.7 for balanced diversity
- Batch Processing: Use vLLM for high-throughput batch inference
- Anchor Selection: Choose anchors similar to your target question domain
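The anchor-selection tip can be sketched as a simple retrieval step. This example ranks a pool of previously evaluated questions by word-overlap (Jaccard) similarity to the target, which is only a stand-in for whatever retriever is used in practice; the pool schema (`question`, `length`, `correct`) and the helper names `select_anchors` and `format_anchors` are assumptions.

```python
import re


def tokenize(text):
    """Lowercase word/number tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_anchors(target_question, pool, k=5):
    """Return the k pool entries most similar to the target question."""
    target_tokens = tokenize(target_question)
    ranked = sorted(
        pool,
        key=lambda item: jaccard(target_tokens, tokenize(item["question"])),
        reverse=True,
    )
    return ranked[:k]


def format_anchors(anchors):
    """Render anchors in the 'Example i / Question / Performance' layout."""
    lines = []
    for i, a in enumerate(anchors, start=1):
        correct = "yes" if a["correct"] else "no"
        lines.append(f"Example {i}:")
        lines.append(f"Question: {a['question']}")
        lines.append(f"Performance: {{len: {a['length']}, correct: {correct}}}")
    return "\n".join(lines)


# Made-up history for one target model
pool = [
    {"question": "What is the integral of sin(x)?", "length": 89, "correct": True},
    {"question": "Who wrote Romeo and Juliet?", "length": 34, "correct": True},
    {"question": "What is the derivative of cos(x)?", "length": 95, "correct": True},
]
top = select_anchors("What is the derivative of x^3?", pool, k=2)
print(format_anchors(top))
```

An embedding-based retriever would normally capture topical similarity better than word overlap; the overlap measure is used here only to keep the example dependency-free.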
Citation
@misc{cao2026modelsscopescalablecontrollable,
title={Models Under SCOPE: Scalable and Controllable Routing via Pre-hoc Reasoning},
author={Qi Cao and Shuhao Zhang and Ruizhe Zhou and Ruiyi Zhang and Peijia Qin and Pengtao Xie},
year={2026},
eprint={2601.22323},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2601.22323},
}
License
Apache 2.0