---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-4B
pipeline_tag: text-generation
tags:
- performance-prediction
- llm-evaluation
- meta-learning
---

# SCOPE: LLM Performance Prediction Model

SCOPE is a specialized model that predicts how a target LLM will perform on a given question. Given a target question and a set of anchor questions with known performance results, SCOPE predicts the **output length** and **correctness** of the target model's response.

## Model Description

- **Task**: Performance prediction for LLMs
- **Base Model**: Qwen3-4B
- **Training**: Supervised Fine-Tuning (SFT) + Reinforcement Learning with Chain-of-Thought reasoning
- **Input**: Target question + 5 anchor questions with performance data
- **Output**: Predicted length (tokens) and correctness (yes/no)

## Intended Use

SCOPE is designed to:
- Predict whether an LLM will answer a question correctly before running expensive inference
- Estimate the output token length for resource planning
- Enable efficient LLM routing and selection

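As a sketch of the routing use case, the helper below picks the cheapest model that SCOPE predicts will answer correctly. This is a minimal illustration, not part of a released API: the `route` helper and the model names are hypothetical, and `predictions` stands in for parsed SCOPE outputs.

```python
def route(question: str, predictions: dict) -> str:
    """Pick the cheapest model that SCOPE predicts will answer correctly.

    `predictions` maps model name -> parsed SCOPE output and is assumed
    to be ordered from cheapest to most expensive model.
    """
    for model_name, pred in predictions.items():
        if pred["predicted_correct"] == "yes":
            return model_name
    # Fall back to the last (most capable) model if none is predicted correct
    return list(predictions)[-1]

# Hypothetical predictions for one question, cheapest model first
preds = {
    "small-model": {"predicted_correct": "no", "predicted_length": 300},
    "large-model": {"predicted_correct": "yes", "predicted_length": 180},
}
print(route("What is the derivative of x^3?", preds))  # -> large-model
```
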
## Quick Start

### Installation

```bash
pip install "transformers>=4.51.0" torch datasets
# For vLLM inference (optional but recommended)
pip install vllm
```

### Input Format

SCOPE uses the following prompt format:

```
### Task
You are a performance prediction expert. Given a target question, 5 anchor questions with their performance results, and a target AI model, predict how the model will perform on the target question, specifically the output length and correctness after related reasoning analysis.

### Target Model
{model_name}

Example 1:
Question: {anchor_question_1}
Performance: {len: {length}, correct: {yes/no}}

Example 2:
Question: {anchor_question_2}
Performance: {len: {length}, correct: {yes/no}}

... (5 anchor examples total)

### Target Question
{your_target_question}

### Output Format (STRICT)
Analysis: [Your comprehensive analysis covering anchor patterns, target question characteristics, and reasoning.]
Predicted Performance: {len: [integer], correct: [yes/no]}

### Output:
```
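
The prompt above can be assembled programmatically. The helper below is an illustrative sketch; the `build_scope_prompt` function and its anchor-tuple format are our own, not part of a released API.

```python
def build_scope_prompt(model_name, anchors, target_question):
    """Build a SCOPE prompt from (question, length, correct) anchor tuples."""
    lines = [
        "### Task",
        "You are a performance prediction expert. Given a target question, "
        "5 anchor questions with their performance results, and a target AI model, "
        "predict how the model will perform on the target question, specifically "
        "the output length and correctness after related reasoning analysis.",
        "",
        "### Target Model",
        model_name,
        "",
    ]
    for i, (question, length, correct) in enumerate(anchors, start=1):
        lines += [
            f"Example {i}:",
            f"Question: {question}",
            f"Performance: {{len: {length}, correct: {correct}}}",
            "",
        ]
    lines += [
        "### Target Question",
        target_question,
        "",
        "### Output Format (STRICT)",
        "Analysis: [Your comprehensive analysis covering anchor patterns, "
        "target question characteristics, and reasoning.]",
        "Predicted Performance: {len: [integer], correct: [yes/no]}",
        "",
        "### Output:",
    ]
    return "\n".join(lines)

prompt = build_scope_prompt(
    "Qwen/Qwen3-8B-Instruct",
    [("What is 2 + 2?", 32, "yes")] * 5,
    "What is the derivative of x^3?",
)
```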

### Output Format

The model outputs:
```
Analysis: [Reasoning about the question difficulty based on anchor patterns...]
Predicted Performance: {len: 256, correct: yes}
```

---

## Inference Methods

### Method 1: Using Transformers (Recommended for Single Inference)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model_name = "Cooolder/SCOPE"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Prepare the prompt (see "Anchor and Prompt Examples" below)
prompt = """### Task
You are a performance prediction expert. Given a target question, 5 anchor questions with their performance results, and a target AI model, predict how the model will perform on the target question, specifically the output length and correctness after related reasoning analysis.

### Target Model
Qwen/Qwen3-8B-Instruct

Example 1:
Question: What is the capital of France?
Performance: {len: 45, correct: yes}

Example 2:
Question: Solve: 2 + 2 = ?
Performance: {len: 32, correct: yes}

Example 3:
Question: Explain quantum entanglement in simple terms.
Performance: {len: 512, correct: yes}

Example 4:
Question: What is the 50th prime number?
Performance: {len: 128, correct: no}

Example 5:
Question: Write a haiku about programming.
Performance: {len: 78, correct: yes}

### Target Question
What is the derivative of x^3 + 2x^2 - 5x + 7?

### Output Format (STRICT)
Analysis: [Your comprehensive analysis covering anchor patterns, target question characteristics, and reasoning.]
Predicted Performance: {len: [integer], correct: [yes/no]}

### Output:"""

# Format as a chat message
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Generate (do_sample=True so the sampling parameters take effect)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1536,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
)

output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
response = tokenizer.decode(output_ids, skip_special_tokens=True)
print(response)
```

### Method 2: Using vLLM (Recommended for Batch Inference)

```python
from vllm import LLM, SamplingParams

# Load model with vLLM
model_name = "Cooolder/SCOPE"
llm = LLM(
    model=model_name,
    dtype="bfloat16",
    gpu_memory_utilization=0.90,
    max_model_len=8192,
    trust_remote_code=True,
)

# Prepare prompts (batch processing)
prompts = []
raw_prompt = """### Task
You are a performance prediction expert. Given a target question, 5 anchor questions with their performance results, and a target AI model, predict how the model will perform on the target question, specifically the output length and correctness after related reasoning analysis.

### Target Model
Qwen/Qwen3-8B-Instruct

Example 1:
Question: What is the capital of France?
Performance: {len: 45, correct: yes}

Example 2:
Question: Solve: 2 + 2 = ?
Performance: {len: 32, correct: yes}

Example 3:
Question: Explain quantum entanglement in simple terms.
Performance: {len: 512, correct: yes}

Example 4:
Question: What is the 50th prime number?
Performance: {len: 128, correct: no}

Example 5:
Question: Write a haiku about programming.
Performance: {len: 78, correct: yes}

### Target Question
What is the derivative of x^3 + 2x^2 - 5x + 7?

### Output Format (STRICT)
Analysis: [Your comprehensive analysis covering anchor patterns, target question characteristics, and reasoning.]
Predicted Performance: {len: [integer], correct: [yes/no]}

### Output:"""

# Wrap in the Qwen3 chat template
chat_prompt = f"<|im_start|>user\n{raw_prompt}<|im_end|>\n<|im_start|>assistant\n"
prompts.append(chat_prompt)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.6,
    max_tokens=1536,
    top_p=0.95,
    top_k=20,
    n=8,  # Generate multiple samples for better confidence estimation
    stop=["<|im_end|>", "<|endoftext|>"],
    stop_token_ids=[151645, 151643],
)

# Run inference
outputs = llm.generate(prompts, sampling_params)

# Parse results
for output in outputs:
    for single_output in output.outputs:
        response = single_output.text.strip()
        print(response)
        print("-" * 50)
```

### Parsing the Output

```python
import re

def parse_prediction(response: str):
    """Parse SCOPE model output to extract predictions."""
    # Clean up formatting variations (with and without a bold colon)
    response = response.replace('**Analysis:**', 'Analysis:')
    response = response.replace('**Analysis**', 'Analysis:')
    response = response.replace('**Predicted Performance:**', 'Predicted Performance:')

    # Extract analysis
    analysis = ""
    if 'Analysis:' in response:
        analysis_start = response.find('Analysis:') + len('Analysis:')
        perf_start = response.find('Predicted Performance:')
        if perf_start > analysis_start:
            analysis = response[analysis_start:perf_start].strip()

    # Parse len and correct
    len_match = re.search(r'len:\s*(\d+)', response)
    correct_match = re.search(r'correct:\s*(yes|no)', response, re.IGNORECASE)

    if not len_match or not correct_match:
        return None

    return {
        'analysis': analysis,
        'predicted_length': int(len_match.group(1)),
        'predicted_correct': correct_match.group(1).lower()
    }

# Example usage (guard against unparseable outputs)
result = parse_prediction(response)
if result is not None:
    print(f"Predicted Length: {result['predicted_length']}")
    print(f"Predicted Correct: {result['predicted_correct']}")
```
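
When sampling several completions per prompt (e.g. `n=8` in the vLLM example above), the parsed samples can be aggregated into a single prediction. The sketch below uses a simple heuristic, majority vote on correctness and the median of the predicted lengths; the aggregation actually used during training or evaluation may differ.

```python
from statistics import median

def aggregate_predictions(parsed_samples):
    """Aggregate several parsed SCOPE samples into one prediction.

    Majority vote on correctness, median on length. Samples that failed
    to parse (None) are skipped.
    """
    samples = [s for s in parsed_samples if s is not None]
    if not samples:
        return None
    yes_votes = sum(1 for s in samples if s["predicted_correct"] == "yes")
    return {
        "predicted_correct": "yes" if yes_votes * 2 >= len(samples) else "no",
        "predicted_length": int(median(s["predicted_length"] for s in samples)),
    }

samples = [
    {"predicted_correct": "yes", "predicted_length": 240},
    {"predicted_correct": "yes", "predicted_length": 260},
    {"predicted_correct": "no", "predicted_length": 400},
    None,  # a sample that failed to parse
]
print(aggregate_predictions(samples))
# {'predicted_correct': 'yes', 'predicted_length': 260}
```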

---

## Anchor and Prompt Examples

### Example 1: Math Question Prediction

```python
anchor_text = """Example 1:
Question: What is 15 + 27?
Performance: {len: 28, correct: yes}

Example 2:
Question: Calculate the area of a circle with radius 5.
Performance: {len: 156, correct: yes}

Example 3:
Question: Solve the quadratic equation x^2 - 5x + 6 = 0.
Performance: {len: 245, correct: yes}

Example 4:
Question: What is the integral of sin(x)?
Performance: {len: 89, correct: yes}

Example 5:
Question: Prove that the square root of 2 is irrational.
Performance: {len: 478, correct: no}
"""

target_question = "Find the limit of (x^2 - 1)/(x - 1) as x approaches 1."
model_name = "Qwen/Qwen3-8B-Instruct"
```

### Example 2: Coding Question Prediction

```python
anchor_text = """Example 1:
Question: Write a Python function to check if a number is even.
Performance: {len: 67, correct: yes}

Example 2:
Question: Implement binary search in Python.
Performance: {len: 234, correct: yes}

Example 3:
Question: Write a function to reverse a linked list.
Performance: {len: 312, correct: yes}

Example 4:
Question: Implement an LRU cache in Python.
Performance: {len: 456, correct: no}

Example 5:
Question: Write a recursive function to compute Fibonacci numbers.
Performance: {len: 178, correct: yes}
"""

target_question = "Write a Python function to find the longest palindromic substring."
model_name = "deepseek-ai/DeepSeek-V2-Chat"
```

### Example 3: General Knowledge Prediction

```python
anchor_text = """Example 1:
Question: Who wrote "Romeo and Juliet"?
Performance: {len: 34, correct: yes}

Example 2:
Question: What is the chemical formula for water?
Performance: {len: 42, correct: yes}

Example 3:
Question: Explain the theory of relativity.
Performance: {len: 687, correct: yes}

Example 4:
Question: What year did World War II end?
Performance: {len: 51, correct: yes}

Example 5:
Question: Who was the 23rd President of the United States?
Performance: {len: 89, correct: no}
"""

target_question = "What is the speed of light in a vacuum?"
model_name = "meta-llama/Llama-3-70B-Instruct"
```

---

## Using with Cooolder/kshot_inference Dataset

The model is designed to work with the [Cooolder/kshot_inference](https://huggingface.co/datasets/Cooolder/kshot_inference) dataset:

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("Cooolder/kshot_inference", split="train")

# Each sample contains:
# - id: unique identifier
# - prompt: pre-formatted prompt with anchors and target question
# - gt_is_correct: ground truth correctness
# - gt_token_count: ground truth token count
# - source_model: the target model being predicted
# - retrieved_anchors: the anchor questions used

# Example: Run inference on the dataset
for sample in dataset:
    prompt = sample['prompt']
    # Wrap in chat template and run inference...
```
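
Given the ground-truth fields above, predictions can be scored against the dataset. The sketch below assumes a `predict_fn` stand-in for a full SCOPE inference-plus-parsing call; the metric names are our own choices, not an official benchmark.

```python
def evaluate(samples, predict_fn):
    """Score predictions against the dataset's ground-truth fields.

    `predict_fn(prompt)` stands in for SCOPE inference + parsing and returns
    {"predicted_correct": "yes"/"no", "predicted_length": int}.
    """
    correct_hits, abs_len_errors = 0, []
    for sample in samples:
        pred = predict_fn(sample["prompt"])
        gt_correct = "yes" if sample["gt_is_correct"] else "no"
        correct_hits += pred["predicted_correct"] == gt_correct
        abs_len_errors.append(abs(pred["predicted_length"] - sample["gt_token_count"]))
    return {
        "correctness_accuracy": correct_hits / len(samples),
        "mean_abs_length_error": sum(abs_len_errors) / len(abs_len_errors),
    }

# Toy check with a dummy predictor
dummy = [{"prompt": "p", "gt_is_correct": True, "gt_token_count": 100}]
print(evaluate(dummy, lambda p: {"predicted_correct": "yes", "predicted_length": 90}))
# {'correctness_accuracy': 1.0, 'mean_abs_length_error': 10.0}
```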

---

## Performance Tips

1. **Multiple Sampling**: Generate 8+ samples and aggregate predictions for better accuracy
2. **Temperature**: Use 0.6-0.7 for balanced diversity
3. **Batch Processing**: Use vLLM for high-throughput batch inference
4. **Anchor Selection**: Choose anchors similar to your target question domain
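
Tip 4 can be implemented with any retrieval heuristic. The sketch below uses word-level Jaccard overlap purely for illustration; an embedding-based retriever is the more typical choice, and these helper names are our own.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two questions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def select_anchors(target_question, pool, k=5):
    """Pick the k pool questions most lexically similar to the target.

    `pool` is a list of (question, length, correct) tuples with known results.
    """
    return sorted(pool, key=lambda item: jaccard(item[0], target_question),
                  reverse=True)[:k]

pool = [
    ("What is 15 + 27?", 28, "yes"),
    ("Write a haiku about programming.", 78, "yes"),
    ("What is the integral of sin(x)?", 89, "yes"),
]
best = select_anchors("What is the integral of cos(x)?", pool, k=1)
print(best[0][0])  # -> What is the integral of sin(x)?
```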

## Limitations

- Performance predictions are estimates based on anchor patterns
- Accuracy depends on the quality and relevance of anchor questions
- Works best when anchors are from the same domain as the target question

## Citation

```bibtex
@misc{scope2025,
  title={SCOPE: LLM Performance Prediction Model},
  author={Cooolder},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/Cooolder/SCOPE}
}
```

## License

Apache 2.0