---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-4B
pipeline_tag: text-generation
tags:
- performance-prediction
- llm-evaluation
- meta-learning
---

# SCOPE: LLM Performance Prediction Model

SCOPE is a specialized model that predicts how a target LLM will perform on a given question. Given a target question and a set of anchor questions with known performance results, SCOPE predicts the **output length** and **correctness** of the target model's response.

## Model Description

- **Task**: Performance prediction for LLMs
- **Base Model**: Qwen3-4B
- **Training**: Supervised Fine-Tuning (SFT) + Reinforcement Learning with Chain-of-Thought reasoning
- **Input**: Target question + 5 anchor questions with performance data
- **Output**: Predicted length (tokens) and correctness (yes/no)

## Intended Use

SCOPE is designed to:
- Predict whether an LLM will answer a question correctly before running expensive inference
- Estimate the output token length for resource planning
- Enable efficient LLM routing and selection

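As a sketch of the routing use case, the helper below picks the cheapest model that SCOPE predicts will answer correctly. This is a minimal illustration, not part of a released API: the `route` helper and the model names are hypothetical, and `predictions` stands in for parsed SCOPE outputs.

```python
def route(question: str, predictions: dict) -> str:
    """Pick the cheapest model that SCOPE predicts will answer correctly.

    `predictions` maps model name -> parsed SCOPE output and is assumed
    to be ordered from cheapest to most expensive model.
    """
    for model_name, pred in predictions.items():
        if pred["predicted_correct"] == "yes":
            return model_name
    # Fall back to the last (most capable) model if none is predicted correct
    return list(predictions)[-1]

# Hypothetical predictions for one question, cheapest model first
preds = {
    "small-model": {"predicted_correct": "no", "predicted_length": 300},
    "large-model": {"predicted_correct": "yes", "predicted_length": 180},
}
print(route("What is the derivative of x^3?", preds))  # -> large-model
```
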
## Quick Start

### Installation

```bash
pip install "transformers>=4.51.0" torch datasets
# For vLLM inference (optional but recommended)
pip install vllm
```

### Input Format

SCOPE uses the following prompt format:

```
### Task
You are a performance prediction expert. Given a target question, 5 anchor questions with their performance results, and a target AI model, predict how the model will perform on the target question, specifically the output length and correctness after related reasoning analysis.

### Target Model
{model_name}

Example 1:
Question: {anchor_question_1}
Performance: {len: {length}, correct: {yes/no}}

Example 2:
Question: {anchor_question_2}
Performance: {len: {length}, correct: {yes/no}}

... (5 anchor examples total)

### Target Question
{your_target_question}

### Output Format (STRICT)
Analysis: [Your comprehensive analysis covering anchor patterns, target question characteristics, and reasoning.]
Predicted Performance: {len: [integer], correct: [yes/no]}

### Output:
```
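
The prompt above can be assembled programmatically. The helper below is an illustrative sketch; the `build_scope_prompt` function and its anchor-tuple format are our own, not part of a released API.

```python
def build_scope_prompt(model_name, anchors, target_question):
    """Build a SCOPE prompt from (question, length, correct) anchor tuples."""
    lines = [
        "### Task",
        "You are a performance prediction expert. Given a target question, "
        "5 anchor questions with their performance results, and a target AI model, "
        "predict how the model will perform on the target question, specifically "
        "the output length and correctness after related reasoning analysis.",
        "",
        "### Target Model",
        model_name,
        "",
    ]
    for i, (question, length, correct) in enumerate(anchors, start=1):
        lines += [
            f"Example {i}:",
            f"Question: {question}",
            f"Performance: {{len: {length}, correct: {correct}}}",
            "",
        ]
    lines += [
        "### Target Question",
        target_question,
        "",
        "### Output Format (STRICT)",
        "Analysis: [Your comprehensive analysis covering anchor patterns, "
        "target question characteristics, and reasoning.]",
        "Predicted Performance: {len: [integer], correct: [yes/no]}",
        "",
        "### Output:",
    ]
    return "\n".join(lines)

prompt = build_scope_prompt(
    "Qwen/Qwen3-8B-Instruct",
    [("What is 2 + 2?", 32, "yes")] * 5,
    "What is the derivative of x^3?",
)
```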

### Output Format

The model outputs:
```
Analysis: [Reasoning about the question difficulty based on anchor patterns...]
Predicted Performance: {len: 256, correct: yes}
```

---

## Inference Methods

### Method 1: Using Transformers (Recommended for Single Inference)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model_name = "Cooolder/SCOPE"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Prepare the prompt (see "Anchor and Prompt Examples" below)
prompt = """### Task
You are a performance prediction expert. Given a target question, 5 anchor questions with their performance results, and a target AI model, predict how the model will perform on the target question, specifically the output length and correctness after related reasoning analysis.

### Target Model
Qwen/Qwen3-8B-Instruct

Example 1:
Question: What is the capital of France?
Performance: {len: 45, correct: yes}

Example 2:
Question: Solve: 2 + 2 = ?
Performance: {len: 32, correct: yes}

Example 3:
Question: Explain quantum entanglement in simple terms.
Performance: {len: 512, correct: yes}

Example 4:
Question: What is the 50th prime number?
Performance: {len: 128, correct: no}

Example 5:
Question: Write a haiku about programming.
Performance: {len: 78, correct: yes}

### Target Question
What is the derivative of x^3 + 2x^2 - 5x + 7?

### Output Format (STRICT)
Analysis: [Your comprehensive analysis covering anchor patterns, target question characteristics, and reasoning.]
Predicted Performance: {len: [integer], correct: [yes/no]}

### Output:"""

# Format as a chat message
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Generate (do_sample=True so the sampling parameters take effect)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1536,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
)

output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
response = tokenizer.decode(output_ids, skip_special_tokens=True)
print(response)
```

### Method 2: Using vLLM (Recommended for Batch Inference)

```python
from vllm import LLM, SamplingParams

# Load model with vLLM
model_name = "Cooolder/SCOPE"
llm = LLM(
    model=model_name,
    dtype="bfloat16",
    gpu_memory_utilization=0.90,
    max_model_len=8192,
    trust_remote_code=True,
)

# Prepare prompts (batch processing)
prompts = []
raw_prompt = """### Task
You are a performance prediction expert. Given a target question, 5 anchor questions with their performance results, and a target AI model, predict how the model will perform on the target question, specifically the output length and correctness after related reasoning analysis.

### Target Model
Qwen/Qwen3-8B-Instruct

Example 1:
Question: What is the capital of France?
Performance: {len: 45, correct: yes}

Example 2:
Question: Solve: 2 + 2 = ?
Performance: {len: 32, correct: yes}

Example 3:
Question: Explain quantum entanglement in simple terms.
Performance: {len: 512, correct: yes}

Example 4:
Question: What is the 50th prime number?
Performance: {len: 128, correct: no}

Example 5:
Question: Write a haiku about programming.
Performance: {len: 78, correct: yes}

### Target Question
What is the derivative of x^3 + 2x^2 - 5x + 7?

### Output Format (STRICT)
Analysis: [Your comprehensive analysis covering anchor patterns, target question characteristics, and reasoning.]
Predicted Performance: {len: [integer], correct: [yes/no]}

### Output:"""

# Wrap in the Qwen3 chat template
chat_prompt = f"<|im_start|>user\n{raw_prompt}<|im_end|>\n<|im_start|>assistant\n"
prompts.append(chat_prompt)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.6,
    max_tokens=1536,
    top_p=0.95,
    top_k=20,
    n=8,  # Generate multiple samples for better confidence estimation
    stop=["<|im_end|>", "<|endoftext|>"],
    stop_token_ids=[151645, 151643],
)

# Run inference
outputs = llm.generate(prompts, sampling_params)

# Parse results
for output in outputs:
    for single_output in output.outputs:
        response = single_output.text.strip()
        print(response)
        print("-" * 50)
```

### Parsing the Output

```python
import re

def parse_prediction(response: str):
    """Parse SCOPE model output to extract predictions."""
    # Clean up formatting variations (with and without a bold colon)
    response = response.replace('**Analysis:**', 'Analysis:')
    response = response.replace('**Analysis**', 'Analysis:')
    response = response.replace('**Predicted Performance:**', 'Predicted Performance:')

    # Extract analysis
    analysis = ""
    if 'Analysis:' in response:
        analysis_start = response.find('Analysis:') + len('Analysis:')
        perf_start = response.find('Predicted Performance:')
        if perf_start > analysis_start:
            analysis = response[analysis_start:perf_start].strip()

    # Parse len and correct
    len_match = re.search(r'len:\s*(\d+)', response)
    correct_match = re.search(r'correct:\s*(yes|no)', response, re.IGNORECASE)

    if not len_match or not correct_match:
        return None

    return {
        'analysis': analysis,
        'predicted_length': int(len_match.group(1)),
        'predicted_correct': correct_match.group(1).lower()
    }

# Example usage (guard against unparseable outputs)
result = parse_prediction(response)
if result is not None:
    print(f"Predicted Length: {result['predicted_length']}")
    print(f"Predicted Correct: {result['predicted_correct']}")
```
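
When sampling several completions per prompt (e.g. `n=8` in the vLLM example above), the parsed samples can be aggregated into a single prediction. The sketch below uses a simple heuristic, majority vote on correctness and the median of the predicted lengths; the aggregation actually used during training or evaluation may differ.

```python
from statistics import median

def aggregate_predictions(parsed_samples):
    """Aggregate several parsed SCOPE samples into one prediction.

    Majority vote on correctness, median on length. Samples that failed
    to parse (None) are skipped.
    """
    samples = [s for s in parsed_samples if s is not None]
    if not samples:
        return None
    yes_votes = sum(1 for s in samples if s["predicted_correct"] == "yes")
    return {
        "predicted_correct": "yes" if yes_votes * 2 >= len(samples) else "no",
        "predicted_length": int(median(s["predicted_length"] for s in samples)),
    }

samples = [
    {"predicted_correct": "yes", "predicted_length": 240},
    {"predicted_correct": "yes", "predicted_length": 260},
    {"predicted_correct": "no", "predicted_length": 400},
    None,  # a sample that failed to parse
]
print(aggregate_predictions(samples))
# {'predicted_correct': 'yes', 'predicted_length': 260}
```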

---

## Anchor and Prompt Examples

### Example 1: Math Question Prediction

```python
anchor_text = """Example 1:
Question: What is 15 + 27?
Performance: {len: 28, correct: yes}

Example 2:
Question: Calculate the area of a circle with radius 5.
Performance: {len: 156, correct: yes}

Example 3:
Question: Solve the quadratic equation x^2 - 5x + 6 = 0.
Performance: {len: 245, correct: yes}

Example 4:
Question: What is the integral of sin(x)?
Performance: {len: 89, correct: yes}

Example 5:
Question: Prove that the square root of 2 is irrational.
Performance: {len: 478, correct: no}
"""

target_question = "Find the limit of (x^2 - 1)/(x - 1) as x approaches 1."
model_name = "Qwen/Qwen3-8B-Instruct"
```

### Example 2: Coding Question Prediction

```python
anchor_text = """Example 1:
Question: Write a Python function to check if a number is even.
Performance: {len: 67, correct: yes}

Example 2:
Question: Implement binary search in Python.
Performance: {len: 234, correct: yes}

Example 3:
Question: Write a function to reverse a linked list.
Performance: {len: 312, correct: yes}

Example 4:
Question: Implement an LRU cache in Python.
Performance: {len: 456, correct: no}

Example 5:
Question: Write a recursive function to compute Fibonacci numbers.
Performance: {len: 178, correct: yes}
"""

target_question = "Write a Python function to find the longest palindromic substring."
model_name = "deepseek-ai/DeepSeek-V2-Chat"
```

### Example 3: General Knowledge Prediction

```python
anchor_text = """Example 1:
Question: Who wrote "Romeo and Juliet"?
Performance: {len: 34, correct: yes}

Example 2:
Question: What is the chemical formula for water?
Performance: {len: 42, correct: yes}

Example 3:
Question: Explain the theory of relativity.
Performance: {len: 687, correct: yes}

Example 4:
Question: What year did World War II end?
Performance: {len: 51, correct: yes}

Example 5:
Question: Who was the 23rd President of the United States?
Performance: {len: 89, correct: no}
"""

target_question = "What is the speed of light in a vacuum?"
model_name = "meta-llama/Llama-3-70B-Instruct"
```

---

## Using with Cooolder/kshot_inference Dataset

The model is designed to work with the [Cooolder/kshot_inference](https://huggingface.co/datasets/Cooolder/kshot_inference) dataset:

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("Cooolder/kshot_inference", split="train")

# Each sample contains:
# - id: unique identifier
# - prompt: pre-formatted prompt with anchors and target question
# - gt_is_correct: ground truth correctness
# - gt_token_count: ground truth token count
# - source_model: the target model being predicted
# - retrieved_anchors: the anchor questions used

# Example: Run inference on the dataset
for sample in dataset:
    prompt = sample['prompt']
    # Wrap in chat template and run inference...
```
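
Given the ground-truth fields above, predictions can be scored against the dataset. The sketch below assumes a `predict_fn` stand-in for a full SCOPE inference-plus-parsing call; the metric names are our own choices, not an official benchmark.

```python
def evaluate(samples, predict_fn):
    """Score predictions against the dataset's ground-truth fields.

    `predict_fn(prompt)` stands in for SCOPE inference + parsing and returns
    {"predicted_correct": "yes"/"no", "predicted_length": int}.
    """
    correct_hits, abs_len_errors = 0, []
    for sample in samples:
        pred = predict_fn(sample["prompt"])
        gt_correct = "yes" if sample["gt_is_correct"] else "no"
        correct_hits += pred["predicted_correct"] == gt_correct
        abs_len_errors.append(abs(pred["predicted_length"] - sample["gt_token_count"]))
    return {
        "correctness_accuracy": correct_hits / len(samples),
        "mean_abs_length_error": sum(abs_len_errors) / len(abs_len_errors),
    }

# Toy check with a dummy predictor
dummy = [{"prompt": "p", "gt_is_correct": True, "gt_token_count": 100}]
print(evaluate(dummy, lambda p: {"predicted_correct": "yes", "predicted_length": 90}))
# {'correctness_accuracy': 1.0, 'mean_abs_length_error': 10.0}
```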

---

## Performance Tips

1. **Multiple Sampling**: Generate 8+ samples and aggregate predictions for better accuracy
2. **Temperature**: Use 0.6-0.7 for balanced diversity
3. **Batch Processing**: Use vLLM for high-throughput batch inference
4. **Anchor Selection**: Choose anchors similar to your target question domain
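
Tip 4 can be implemented with any retrieval heuristic. The sketch below uses word-level Jaccard overlap purely for illustration; an embedding-based retriever is the more typical choice, and these helper names are our own.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two questions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def select_anchors(target_question, pool, k=5):
    """Pick the k pool questions most lexically similar to the target.

    `pool` is a list of (question, length, correct) tuples with known results.
    """
    return sorted(pool, key=lambda item: jaccard(item[0], target_question),
                  reverse=True)[:k]

pool = [
    ("What is 15 + 27?", 28, "yes"),
    ("Write a haiku about programming.", 78, "yes"),
    ("What is the integral of sin(x)?", 89, "yes"),
]
best = select_anchors("What is the integral of cos(x)?", pool, k=1)
print(best[0][0])  # -> What is the integral of sin(x)?
```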

## Limitations

- Performance predictions are estimates based on anchor patterns
- Accuracy depends on the quality and relevance of anchor questions
- Works best when anchors are from the same domain as the target question

## Citation

```bibtex
@misc{scope2025,
  title={SCOPE: LLM Performance Prediction Model},
  author={Cooolder},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/Cooolder/SCOPE}
}
```

## License

Apache 2.0