rzvn committed on
Commit f6b4b31 · verified · 1 Parent(s): fb98e5e

Upload folder using huggingface_hub

Files changed (6):
  1. README.md +131 -12
  2. app.py +181 -0
  3. benchmarks/__init__.py +6 -0
  4. benchmarks/benchmark_suite.py +861 -0
  5. requirements.txt +7 -0
  6. test_benchmark.py +138 -0
README.md CHANGED
@@ -1,12 +1,131 @@
- ---
- title: LLM Benchmark Model Vs Judge
- emoji: 👀
- colorFrom: green
- colorTo: indigo
- sdk: gradio
- sdk_version: 5.42.0
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # LLM-Bench1: Language Model Benchmarking Suite
+
+ A comprehensive benchmarking tool for comparing LLM models served by Ollama, measuring accuracy, speed, and reasoning capability.
+
+ ## Features
+
+ - 🔄 Compare different Ollama models head-to-head
+ - 📊 Comprehensive benchmarking across multiple categories:
+   - Logical Reasoning
+   - Code Generation
+   - Mathematical Problem Solving
+   - Context Understanding
+   - Performance Metrics
+ - 📈 Interactive visualization of results
+ - 💾 Automatic saving of benchmark results
+ - 🎯 Customizable number of test iterations
+ - 🤖 Uses a separate judge model for more impartial evaluation
+
+ ## Requirements
+
+ - Python 3.8+
+ - Ollama installed and running
+ - Required Python packages (listed in requirements.txt):
+   - gradio
+   - ollama
+   - pandas
+   - plotly
+   - python-dotenv
+   - tqdm
+   - rich
+
+ ## Installation
+
+ 1. Clone the repository:
+ ```bash
+ git clone https://github.com/yourusername/LLM-Bench1.git
+ cd LLM-Bench1
+ ```
+
+ 2. Create and activate a virtual environment:
+ ```bash
+ python -m venv .venv
+ source .venv/bin/activate  # On Windows, use: .venv\Scripts\activate
+ ```
+
+ 3. Install dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ## Usage
+
+ 1. Ensure Ollama is running and that you have pulled the models you want:
+ ```bash
+ ollama pull codellama
+ ollama pull llama2
+ # Pull any other models you want to benchmark
+ ```
+
+ 2. Run the application:
+ ```bash
+ python app.py
+ ```
+
+ 3. Open the provided URL in your browser to access the Gradio interface.
+
+ 4. Select:
+    - The model to benchmark
+    - The judge model (can be the same or a different model)
+    - The number of test iterations
+
+ 5. Click "Run Benchmark" and wait for the results.
+
+ ## Benchmark Categories
+
+ ### 1. Logical Reasoning
+ Tests the model's ability to solve complex logical problems and puzzles, evaluating step-by-step reasoning.
+
+ ### 2. Code Generation
+ Evaluates the model's capability to:
+ - Write functional code
+ - Implement algorithms
+ - Handle edge cases
+ - Provide proper documentation
+
+ ### 3. Mathematical Problem Solving
+ Tests mathematical reasoning across:
+ - Calculus
+ - Probability
+ - Proof writing
+ - Problem-solving strategies
+
+ ### 4. Context Understanding
+ Assesses the model's ability to:
+ - Comprehend complex passages
+ - Analyze code snippets
+ - Evaluate business scenarios
+ - Provide structured analysis
+
+ ### 5. Performance Metrics
+ Measures (see the sketch below):
+ - Response time
+ - Tokens per second
+ - Consistency across iterations
+ - Resource efficiency
+
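+ Note that "tokens per second" is approximated by whitespace-splitting the streamed text, so it counts words rather than true model tokens. A hypothetical standalone sketch of the same measurement (`tokens_per_second` is illustrative, not part of the suite; assumes a running Ollama server and an already-pulled model):
+
+ ```python
+ import time
+ import ollama
+
+ def tokens_per_second(model: str, prompt: str) -> float:
+     """Whitespace-token throughput, mirroring the suite's test_performance."""
+     start = time.time()
+     text = ""
+     for chunk in ollama.generate(model=model, prompt=prompt, stream=True):
+         text += chunk.get("response", "")
+     elapsed = time.time() - start
+     return len(text.split()) / elapsed if elapsed > 0 else 0.0
+
+ # e.g. tokens_per_second("llama2", "Explain binary search briefly.")
+ ```
+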
+ ## Results
+
+ Benchmark results are automatically saved in the `benchmark_results` directory with the following naming format:
+ ```
+ benchmark_results/[model_name]_vs_[judge_model]_[timestamp].json
+ ```
+
+ Each result file is plain JSON containing (see the loading sketch below):
+ - Model details
+ - Timestamp
+ - Detailed scores for each category
+ - Performance metrics
+ - Raw responses and evaluations
+
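+ A hypothetical helper for inspecting a saved run (`summarize_run` is illustrative; the field names match `final_results` in app.py):
+
+ ```python
+ import json
+ from pathlib import Path
+
+ def summarize_run(path: str) -> None:
+     """Print per-category scores from a saved benchmark results file."""
+     data = json.loads(Path(path).read_text())
+     print(f"{data['model_name']} judged by {data['judge_model']} at {data['timestamp']}")
+     for category, result in data["results"].items():
+         print(f"  {category}: {result.get('score', 0):.2f}/10")
+     print(f"Overall: {data['overall_score']:.2f}/10")
+ ```
+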
+ ## Contributing
+
+ Feel free to open issues or submit pull requests with improvements to:
+ - Test cases
+ - Evaluation metrics
+ - UI/UX enhancements
+ - Documentation
+
+ ## License
+
+ MIT License
app.py ADDED
@@ -0,0 +1,181 @@
+ import gradio as gr
+ import ollama
+ import pandas as pd
+ import plotly.express as px
+ from rich.console import Console
+ from rich.progress import track
+ from datetime import datetime
+ import json
+ import time
+ import os
+ from benchmarks.benchmark_suite import BenchmarkSuite
+
+ console = Console()
+
+ def get_available_models():
+     try:
+         models = ollama.list()
+         if 'models' in models and models['models']:
+             # Extract model names, handling cases where the 'name' key might not exist
+             model_names = []
+             for model in models['models']:
+                 if 'name' in model:
+                     model_names.append(model['name'])
+                 else:
+                     # Fall back to the 'model' key if 'name' doesn't exist
+                     model_names.append(model.get('model', 'unknown_model'))
+             return model_names
+         else:
+             console.print("[yellow]No models found in Ollama[/yellow]")
+             return ["codellama", "llama2", "mistral"]  # Fallback default models
+     except Exception as e:
+         console.print(f"[red]Error fetching models: {e}[/red]")
+         return ["codellama", "llama2", "mistral"]  # Fallback default models
+
+ class BenchmarkApp:
+     def __init__(self):
+         self.available_models = get_available_models()
+
+     def create_interface(self):
+         with gr.Blocks(theme=gr.themes.Soft()) as interface:
+             gr.Markdown("""
+             # 🚀 LLM Benchmark Suite
+             Compare different LLM models using various benchmarking metrics
+             """)
+
+             with gr.Row():
+                 with gr.Column():
+                     model_name = gr.Dropdown(
+                         choices=self.available_models,
+                         label="Select Model to Benchmark",
+                         value=self.available_models[0] if self.available_models else None
+                     )
+                     judge_model = gr.Dropdown(
+                         choices=self.available_models,
+                         label="Select Judge Model",
+                         value=self.available_models[0] if self.available_models else None
+                     )
+                     num_iterations = gr.Slider(
+                         minimum=1,
+                         maximum=20,
+                         value=5,
+                         step=1,
+                         label="Number of Test Iterations"
+                     )
+                     run_button = gr.Button("🎯 Run Benchmark", variant="primary")
+
+                 with gr.Column():
+                     progress_output = gr.Textbox(
+                         label="Benchmark Progress",
+                         lines=10,
+                         max_lines=10
+                     )
+
+             with gr.Row():
+                 chat_output = gr.Chatbot(
+                     label="Q&A Chat During Benchmark",
+                     height=300
+                 )
+
+             with gr.Row():
+                 with gr.Column():
+                     results_json = gr.JSON(label="Detailed Results")
+                 with gr.Column():
+                     plot_output = gr.Plot(label="Performance Visualization")
+
+             run_button.click(
+                 fn=self.run_benchmark,
+                 inputs=[model_name, judge_model, num_iterations],
+                 outputs=[progress_output, chat_output, results_json, plot_output]
+             )
+
+         return interface
+
+     def run_benchmark(self, model_name, judge_model, num_iterations):
+         if not model_name or not judge_model:
+             return "Please select both a model and a judge model.", None, None, None
+
+         console.print(f"[bold blue]Starting benchmark for {model_name} with {num_iterations} iterations[/bold blue]")
+         console.print(f"[bold blue]Judge model: {judge_model}[/bold blue]")
+
+         try:
+             benchmark_suite = BenchmarkSuite(model_name, judge_model)
+
+             # Run benchmarks with a rich progress bar
+             results = {}
+             start_time = time.time()
+
+             from rich.progress import Progress, SpinnerColumn, TextColumn, BarColumn, TaskProgressColumn, TimeRemainingColumn
+
+             with Progress(
+                 SpinnerColumn(),
+                 TextColumn("[progress.description]{task.description}"),
+                 BarColumn(),
+                 TaskProgressColumn(),
+                 TimeRemainingColumn(),
+                 console=console
+             ) as progress:
+                 # Create a single task for overall progress
+                 overall_task = progress.add_task("[cyan]Running benchmarks...", total=5)  # 5 test categories
+
+                 # Run benchmarks
+                 for test_name, result in benchmark_suite.run_all_tests(num_iterations):
+                     results[test_name] = result
+                     progress.update(overall_task, advance=1, description=f"[green]Completed {test_name}[/green]")
+                     console.print(f"[green]✓ {test_name}: Score {result.get('score', 0):.2f}[/green]")
+
+             total_time = time.time() - start_time
+
+             # Calculate final scores
+             final_results = {
+                 "model_name": model_name,
+                 "judge_model": judge_model,
+                 "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
+                 "total_time": f"{total_time:.2f}s",
+                 "results": results,
+                 "overall_score": sum(r.get("score", 0) for r in results.values()) / len(results) if results else 0
+             }
+
+             # Print summary
+             console.print("\n[bold green]Benchmark Results Summary:[/bold green]")
+             for test_name, result in results.items():
+                 console.print(f"  {test_name}: {result.get('score', 0):.2f}/10")
+             console.print(f"[bold blue]Overall Score: {final_results['overall_score']:.2f}/10[/bold blue]")
+             console.print(f"[bold blue]Total Time: {total_time:.2f} seconds[/bold blue]")
+
+             # Create visualization
+             df = pd.DataFrame([
+                 {"Metric": k, "Score": v.get("score", 0)}
+                 for k, v in results.items()
+             ])
+
+             fig = px.bar(
+                 df,
+                 x="Metric",
+                 y="Score",
+                 title=f"Benchmark Results: {model_name}",
+                 color="Score",
+                 color_continuous_scale="viridis"
+             )
+
+             # Save results
+             os.makedirs("benchmark_results", exist_ok=True)
+             result_file = f"benchmark_results/{model_name}_vs_{judge_model}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
+             with open(result_file, "w") as f:
+                 json.dump(final_results, f, indent=2)
+
+             progress_text = f"✨ Benchmark completed! Results saved to {result_file}"
+             console.print(f"[green]{progress_text}[/green]")
+
+             return progress_text, None, final_results, fig
+
+         except Exception as e:
+             error_msg = f"Error during benchmark: {str(e)}"
+             console.print(f"[red]{error_msg}[/red]")
+             console.print_exception()
+             return error_msg, None, None, None
+
+ if __name__ == "__main__":
+     app = BenchmarkApp()
+     interface = app.create_interface()
+     interface.launch(share=True)
benchmarks/__init__.py ADDED
@@ -0,0 +1,6 @@
+ import sys
+ from pathlib import Path
+
+ # Add the project root directory to the Python path
+ project_root = Path(__file__).parent.parent
+ sys.path.append(str(project_root))
benchmarks/benchmark_suite.py ADDED
@@ -0,0 +1,861 @@
+ import ollama
+ import time
+ from typing import Dict, Any, List, Tuple, Callable, Optional, Generator
+ import json
+ from rich.console import Console
+ from rich.panel import Panel
+ from rich.table import Table
+ from rich.progress import Progress, SpinnerColumn, TextColumn, BarColumn, TimeElapsedColumn
+ from rich.style import Style
+ import threading
+ import queue
+
+
+ class BenchmarkSuite:
+     def __init__(self, model_name: str, judge_model: str):
+         self.model_name = model_name
+         self.judge_model = judge_model
+         self.console = Console()
+         self.progress_callback = None
+
+     def set_progress_callback(self, callback: Callable[[str], None]):
+         self.progress_callback = callback
+
+     def log_progress(self, message: str, style: str = ""):
+         self.console.print(message, style=style)
+         if self.progress_callback:
+             # Strip rich formatting for UI display
+             clean_message = message.replace("[cyan]", "").replace("[/cyan]", "") \
+                 .replace("[green]", "").replace("[/green]", "") \
+                 .replace("[yellow]", "").replace("[/yellow]", "") \
+                 .replace("[red]", "").replace("[/red]", "")
+             self.progress_callback(clean_message)
+
+     def run_all_tests(self, num_iterations: int, interaction_callback=None) -> Generator[Tuple[str, Dict[str, Any]], None, None]:
+         tests = [
+             ("Logical Reasoning", self.test_logical_reasoning),
+             ("Code Generation", self.test_code_generation),
+             ("Mathematical Problem Solving", self.test_math_solving),
+             ("Context Understanding", self.test_context_understanding),
+             ("Performance Metrics", self.test_performance)
+         ]
+
+         for test_name, test_func in tests:
+             result = test_func(num_iterations, interaction_callback)
+             yield test_name, result
+
+     def evaluate_response(self, prompt: str, expected_elements: List[str], interaction_callback=None) -> Dict[str, Any]:
+         start_time = time.time()
+
+         try:
+             # Get the model response with streaming. TimeoutError is handled
+             # before the generic Exception; the reverse order would make the
+             # TimeoutError branch unreachable.
+             try:
+                 response_stream = self._ollama_generate_with_timeout(
+                     model=self.model_name,
+                     prompt=prompt,
+                     stream=True,
+                     timeout=60
+                 )
+             except TimeoutError as e:
+                 # If the initial request times out, return an error result
+                 error_msg = f"Model {self.model_name} timed out: {str(e)}"
+                 if interaction_callback:
+                     interaction_callback(
+                         prompt=prompt,
+                         model_response=f"[ERROR] {error_msg}",
+                         judge_response="N/A",
+                         model_name=self.model_name,
+                         judge_model_name=self.judge_model
+                     )
+                 return {
+                     "score": 0,
+                     "response_time": 0,
+                     "error": error_msg,
+                     "evaluation": None,
+                     "response": None,
+                     "prompt": prompt,
+                     "judge_response_raw": None
+                 }
+             except Exception as e:
+                 # If the initial request fails, return an error result
+                 error_msg = f"Failed to connect to model {self.model_name}: {str(e)}"
+                 if interaction_callback:
+                     interaction_callback(
+                         prompt=prompt,
+                         model_response=f"[ERROR] {error_msg}",
+                         judge_response="N/A",
+                         model_name=self.model_name,
+                         judge_model_name=self.judge_model
+                     )
+                 return {
+                     "score": 0,
+                     "response_time": 0,
+                     "error": error_msg,
+                     "evaluation": None,
+                     "response": None,
+                     "prompt": prompt,
+                     "judge_response_raw": None
+                 }
+
+             # Process the streaming response
+             response_text = ""
+             model_update_counter = 0
+             model_stream_start_time = time.time()
+             model_stream_timeout = 60  # Time out if the model makes no progress for 60 seconds
+             last_response_length = 0
+             progress_check_count = 0
+             max_progress_checks = 30  # Max number of checks without progress
+             last_progress_time = time.time()  # Track when we last saw progress
+             absolute_timeout = model_stream_start_time + model_stream_timeout + 60  # Hard deadline (60s buffer)
+
+             try:
+                 for chunk in response_stream:
+                     # Check for the absolute timeout
+                     current_time = time.time()
+                     if current_time > absolute_timeout:
+                         response_text += " [Timeout: Model response took too long (absolute timeout)]"
+                         break
+
+                     # Only time out if there has been no progress for a while
+                     if current_time - last_progress_time > model_stream_timeout:
+                         response_text += " [Timeout: Model response took too long]"
+                         break
+
+                     # Stop on empty chunks
+                     if not chunk:
+                         break
+
+                     # Stop when the stream reports completion
+                     if chunk.get('done'):
+                         break
+
+                     if 'response' in chunk and chunk['response']:
+                         response_text += chunk['response']
+                         model_update_counter += 1
+                     # Chunks without a 'response' key are control messages; keep processing.
+
+                     # Update the UI with the streaming response (every 5 chunks to reduce UI updates)
+                     if interaction_callback and model_update_counter % 5 == 0:
+                         interaction_callback(
+                             prompt=prompt,
+                             model_response=response_text,
+                             judge_response="",  # Empty for now; updated when the judge responds
+                             model_name=self.model_name,
+                             judge_model_name=self.judge_model
+                         )
+
+                     # Also print to the terminal for real-time streaming display
+                     if model_update_counter % 10 == 0:  # Print every 10 chunks
+                         print(f"\rModel response (chunk {model_update_counter}): {response_text[-100:]}", end="", flush=True)
+
+                     # Track progress; if nothing new arrives for too many checks, give up
+                     if len(response_text) > last_response_length:
+                         last_response_length = len(response_text)
+                         last_progress_time = time.time()
+                         progress_check_count = 0
+                     else:
+                         progress_check_count += 1
+                         if progress_check_count > max_progress_checks:
+                             response_text += " [Stuck: No progress in response]"
+                             break
+
+                     # Safety check: prevent an unbounded loop
+                     if model_update_counter > 5000:  # Maximum 5000 chunks
+                         response_text += " [Error: Too many response chunks]"
+                         break
+
+                     # Safety check: cap the response length
+                     if len(response_text) > 10000:
+                         response_text += " [Truncated: Response too long]"
+                         break
+
+                     # Safety check: cap total streaming time
+                     if time.time() - model_stream_start_time > 120:  # Maximum 2 minutes total
+                         response_text += " [Timeout: Maximum time exceeded]"
+                         break
+             except Exception as e:
+                 # If streaming fails, report the error as the response
+                 response_text = "Error during streaming: " + str(e)
+                 if interaction_callback:
+                     interaction_callback(
+                         prompt=prompt,
+                         model_response=response_text,
+                         judge_response="",
+                         model_name=self.model_name,
+                         judge_model_name=self.judge_model
+                     )
+
+             print()  # Newline after model response streaming
+
+             # Calculate metrics
+             response_time = time.time() - start_time
+
+             # Ask the judge model to evaluate, also with streaming
+             judge_prompt = f"""
+             Evaluate the following response based on accuracy, completeness, and correctness.
+             The response should contain or address these elements: {', '.join(expected_elements)}
+
+             Response to evaluate:
+             {response_text}
+
+             Rate each criterion from 0-10 and provide a brief explanation.
+             Return your evaluation in JSON format:
+             {{
+                 "accuracy": {{"score": "X", "reason": "..."}},
+                 "completeness": {{"score": "X", "reason": "..."}},
+                 "correctness": {{"score": "X", "reason": "..."}}
+             }}
+             """
+
+             try:
+                 judge_response_stream = self._ollama_generate_with_timeout(
+                     model=self.judge_model,
+                     prompt=judge_prompt,
+                     stream=True,
+                     timeout=60
+                 )
+             except TimeoutError as e:
+                 # If the judge request times out, return an error result
+                 error_msg = f"Judge model {self.judge_model} timed out: {str(e)}"
+                 judge_response_text = f"[ERROR] {error_msg}"
+                 if interaction_callback:
+                     interaction_callback(
+                         prompt=prompt,
+                         model_response=response_text,
+                         judge_response=judge_response_text,
+                         model_name=self.model_name,
+                         judge_model_name=self.judge_model
+                     )
+
+                 evaluation = {
+                     "accuracy": {"score": 0, "reason": error_msg},
+                     "completeness": {"score": 0, "reason": error_msg},
+                     "correctness": {"score": 0, "reason": error_msg}
+                 }
+
+                 return {
+                     "score": 0,
+                     "response_time": time.time() - start_time,
+                     "evaluation": evaluation,
+                     "response": response_text,
+                     "prompt": prompt,
+                     "judge_response_raw": judge_response_text
+                 }
+             except Exception as e:
+                 # If the judge request fails, return an error result
+                 error_msg = f"Failed to connect to judge model {self.judge_model}: {str(e)}"
+                 judge_response_text = f"[ERROR] {error_msg}"
+                 if interaction_callback:
+                     interaction_callback(
+                         prompt=prompt,
+                         model_response=response_text,
+                         judge_response=judge_response_text,
+                         model_name=self.model_name,
+                         judge_model_name=self.judge_model
+                     )
+
+                 evaluation = {
+                     "accuracy": {"score": 0, "reason": error_msg},
+                     "completeness": {"score": 0, "reason": error_msg},
+                     "correctness": {"score": 0, "reason": error_msg}
+                 }
+
+                 return {
+                     "score": 0,
+                     "response_time": time.time() - start_time,
+                     "evaluation": evaluation,
+                     "response": response_text,
+                     "prompt": prompt,
+                     "judge_response_raw": judge_response_text
+                 }
+
+             # Process the streaming judge response
+             judge_response_text = ""
+             judge_update_counter = 0
+             judge_stream_start_time = time.time()
+             judge_stream_timeout = 300  # Time out if the judge makes no progress for 5 minutes
+             last_judge_response_length = 0
+             judge_progress_check_count = 0
+             max_judge_progress_checks = 20  # Max number of checks without progress
+             last_judge_progress_time = time.time()  # Track when we last saw progress
+             judge_absolute_timeout = judge_stream_start_time + judge_stream_timeout + 300  # Hard deadline (5 min buffer)
+
+             try:
+                 for chunk in judge_response_stream:
+                     # Check for the absolute timeout
+                     current_time = time.time()
+                     if current_time > judge_absolute_timeout:
+                         judge_response_text += " [Timeout: Judge response took too long (absolute timeout)]"
+                         break
+
+                     # Only time out if there has been no progress for a while
+                     if current_time - last_judge_progress_time > judge_stream_timeout:
+                         judge_response_text += " [Timeout: Judge response took too long]"
+                         break
+
+                     # Stop on empty chunks
+                     if not chunk:
+                         break
+
+                     # Stop when the stream reports completion
+                     if chunk.get('done'):
+                         break
+
+                     if 'response' in chunk and chunk['response']:
+                         judge_response_text += chunk['response']
+                         judge_update_counter += 1
+                     # Chunks without a 'response' key are control messages; keep processing.
+
+                     # Update the UI with the streaming judge response (every 5 chunks)
+                     if interaction_callback and judge_update_counter % 5 == 0:
+                         interaction_callback(
+                             prompt=prompt,
+                             model_response=response_text,
+                             judge_response=judge_response_text,
+                             model_name=self.model_name,
+                             judge_model_name=self.judge_model
+                         )
+
+                     # Also print to the terminal for real-time streaming display
+                     if judge_update_counter % 10 == 0:  # Print every 10 chunks
+                         print(f"\rJudge response (chunk {judge_update_counter}): {judge_response_text[-100:]}", end="", flush=True)
+
+                     # Track progress; if nothing new arrives for too many checks, give up
+                     if len(judge_response_text) > last_judge_response_length:
+                         last_judge_response_length = len(judge_response_text)
+                         last_judge_progress_time = time.time()
+                         judge_progress_check_count = 0
+                     else:
+                         judge_progress_check_count += 1
+                         if judge_progress_check_count > max_judge_progress_checks:
+                             judge_response_text += " [Stuck: No progress in response]"
+                             break
+
+                     # Safety check: prevent an unbounded loop
+                     if judge_update_counter > 10000:  # Maximum 10000 chunks
+                         judge_response_text += " [Error: Too many response chunks]"
+                         break
+
+                     # Safety check: cap the response length
+                     if len(judge_response_text) > 50000:
+                         judge_response_text += " [Truncated: Response too long]"
+                         break
+
+                     # Safety check: cap total streaming time
+                     if time.time() - judge_stream_start_time > 600:  # Maximum 10 minutes total
+                         judge_response_text += " [Timeout: Maximum time exceeded]"
+                         break
+
+             except Exception as e:
+                 # If streaming fails, report the error as the judge response
+                 judge_response_text = "Error during judge streaming: " + str(e)
+                 if interaction_callback:
+                     interaction_callback(
+                         prompt=prompt,
+                         model_response=response_text,
+                         judge_response=judge_response_text,
+                         model_name=self.model_name,
+                         judge_model_name=self.judge_model
+                     )
+
+             print()  # Newline after judge response streaming
+
+             # Parse the judge response
+             evaluation = None
+             try:
+                 # Try to extract JSON from code blocks first
+                 json_text = judge_response_text
+                 if "```json" in judge_response_text:
+                     start = judge_response_text.find("```json") + 7
+                     end = judge_response_text.find("```", start)
+                     if end != -1:
+                         json_text = judge_response_text[start:end].strip()
+                 elif "```" in judge_response_text:
+                     # Handle generic code blocks
+                     start = judge_response_text.find("```") + 3
+                     end = judge_response_text.find("```", start)
+                     if end != -1:
+                         json_text = judge_response_text[start:end].strip()
+
+                 # Try to parse the JSON
+                 try:
+                     evaluation = json.loads(json_text)
+                 except json.JSONDecodeError:
+                     # If parsing fails, try to fix unescaped backslashes from LaTeX expressions
+                     fixed_json_text = json_text.replace(r'\\', r'\\\\').replace(r'\(', r'\\(').replace(r'\)', r'\\)').replace(r'\frac', r'\\frac')
+                     try:
+                         evaluation = json.loads(fixed_json_text)
+                     except json.JSONDecodeError:
+                         # If still failing, escape every backslash that isn't already escaped
+                         import re
+                         fixed_json_text = re.sub(r'(?<!\\)\\(?!["\\/bfnrt]|u[0-9a-fA-F]{4})', r'\\\\', json_text)
+                         evaluation = json.loads(fixed_json_text)  # Raises on failure; handled below
+             except json.JSONDecodeError:
+                 # If the judge response is a timeout message, don't zero the score
+                 if "[Timeout:" in judge_response_text or "[Error:" in judge_response_text:
+                     evaluation = {
+                         "accuracy": {"score": 5, "reason": "Judge response timed out or had error"},
+                         "completeness": {"score": 5, "reason": "Judge response timed out or had error"},
+                         "correctness": {"score": 5, "reason": "Judge response timed out or had error"}
+                     }
+                 else:
+                     # Try to find JSON-like content in the response
+                     import re
+                     json_match = re.search(r'\{[^{]*"accuracy"[^}]*\}', judge_response_text)
+                     if json_match:
+                         try:
+                             evaluation = json.loads(json_match.group())
+                         except json.JSONDecodeError:
+                             evaluation = self._extract_scores_from_text(judge_response_text)
+                     else:
+                         evaluation = self._extract_scores_from_text(judge_response_text)
+             except Exception as e:
+                 # If the judge response is a timeout message, don't zero the score
+                 if "[Timeout:" in judge_response_text or "[Error:" in judge_response_text:
+                     evaluation = {
+                         "accuracy": {"score": 5, "reason": "Judge response timed out or had error"},
+                         "completeness": {"score": 5, "reason": "Judge response timed out or had error"},
+                         "correctness": {"score": 5, "reason": "Judge response timed out or had error"}
+                     }
+                 else:
+                     evaluation = {
+                         "accuracy": {"score": 0, "reason": f"Error parsing judge response: {str(e)}"},
+                         "completeness": {"score": 0, "reason": f"Error parsing judge response: {str(e)}"},
+                         "correctness": {"score": 0, "reason": f"Error parsing judge response: {str(e)}"}
+                     }
+
+             # Call the callback with the final Q&A data
+             if interaction_callback:
+                 interaction_callback(
+                     prompt=prompt,
+                     model_response=response_text,
+                     judge_response=judge_response_text,
+                     model_name=self.model_name,
+                     judge_model_name=self.judge_model
+                 )
+
+             return {
+                 "score": sum(self._normalize_score(e["score"]) for e in evaluation.values()) / len(evaluation) if evaluation else 0,
+                 "response_time": response_time,
+                 "evaluation": evaluation,
+                 "response": response_text,
+                 "prompt": prompt,
+                 "judge_response_raw": judge_response_text
+             }
+
+         except Exception as e:
+             if interaction_callback:
+                 interaction_callback(
+                     prompt=prompt,
+                     model_response=f"[ERROR] {str(e)}",
+                     judge_response="N/A",
+                     model_name=self.model_name,
+                     judge_model_name=self.judge_model
+                 )
+             return {
+                 "score": 0,
+                 "response_time": 0,
+                 "error": str(e),
+                 "evaluation": None,
+                 "response": None,
+                 "prompt": prompt,
+                 "judge_response_raw": None
+             }
+
+     def _ollama_generate_with_timeout(self, model: str, prompt: str, stream: bool = True, timeout: int = 60):
+         """
+         Wrapper around ollama.generate that enforces a timeout.
+         """
+         result_queue = queue.Queue()
+         exception_queue = queue.Queue()
+
+         def generate_wrapper():
+             try:
+                 result = ollama.generate(
+                     model=model,
+                     prompt=prompt,
+                     options={"temperature": 0.7},
+                     stream=stream
+                 )
+                 result_queue.put(result)
+             except Exception as e:
+                 exception_queue.put(e)
+
+         # Start the generation in a separate thread
+         thread = threading.Thread(target=generate_wrapper)
+         thread.daemon = True
+         thread.start()
+
+         # Wait for the thread to complete or time out
+         thread.join(timeout)
+
+         if thread.is_alive():
+             # The thread is still running, which means it timed out
+             raise TimeoutError(f"Request to model {model} timed out after {timeout} seconds")
+
+         # Check if there was an exception
+         if not exception_queue.empty():
+             raise exception_queue.get()
+
+         # Return the result
+         if not result_queue.empty():
+             return result_queue.get()
+         else:
+             raise TimeoutError(f"Request to model {model} timed out after {timeout} seconds")
+
+     def _extract_scores_from_text(self, text: str) -> dict:
+         """
+         Extract scores from text-based judge responses by looking for
+         patterns like "accuracy: 10" or "completeness: 9".
+         """
+         import re
+
+         scores = {
+             "accuracy": {"score": 0, "reason": "Score extracted from text evaluation"},
+             "completeness": {"score": 0, "reason": "Score extracted from text evaluation"},
+             "correctness": {"score": 0, "reason": "Score extracted from text evaluation"}
+         }
+
+         # Look for labelled score patterns such as "accuracy: 10"
+         patterns = {
+             "accuracy": r"[aA]ccuracy[^\d]{0,20}(\d+)",
+             "completeness": r"[cC]ompleteness[^\d]{0,20}(\d+)",
+             "correctness": r"[cC]orrectness[^\d]{0,20}(\d+)"
+         }
+
+         for key, pattern in patterns.items():
+             match = re.search(pattern, text)
+             if match:
+                 try:
+                     scores[key]["score"] = int(match.group(1))
+                 except ValueError:
+                     pass
+
+         # If no labelled scores were found, fall back to the first few bare numbers
+         if all(scores[key]["score"] == 0 for key in scores):
+             numbers = re.findall(r'\b\d+\b', text)
+             for i, key in enumerate(scores.keys()):
+                 if i < len(numbers):
+                     try:
+                         score = int(numbers[i])
+                         if 0 <= score <= 10:  # Valid score range
+                             scores[key]["score"] = score
+                     except ValueError:
+                         pass
+
+         return scores
+
+     def _normalize_score(self, score) -> float:
+         """
+         Normalize a score to the 0-10 scale, handling various formats
+         (numbers, numeric strings, fractions, letter grades, 0-100 values).
+         """
+ try:
594
+ # Convert string scores to numbers
595
+ if isinstance(score, str):
596
+ # Handle letter grades
597
+ letter_grades = {
598
+ 'A+': 10.0, 'A': 9.5, 'A-': 9.0,
599
+ 'B+': 8.5, 'B': 8.0, 'B-': 7.5,
600
+ 'C+': 7.0, 'C': 6.5, 'C-': 6.0,
601
+ 'D+': 5.5, 'D': 5.0, 'D-': 4.5,
602
+ 'F': 0.0
603
+ }
604
+ score_upper = score.strip().upper()
605
+ if score_upper in letter_grades:
606
+ return letter_grades[score_upper]
607
+
608
+ # Handle fractional scores like "4/5"
609
+ if "/" in score:
610
+ parts = score.split("/")
611
+ if len(parts) == 2:
612
+ numerator = float(parts[0].strip())
613
+ denominator = float(parts[1].strip())
614
+ if denominator != 0:
615
+ score = (numerator / denominator) * 10 # Convert to 0-10 scale
616
+ else:
617
+ score = 0
618
+ else:
619
+ score = float(score)
620
+ else:
621
+ score = float(score)
622
+ else:
623
+ score = float(score)
624
+
625
+ # If score is greater than 10, assume it's out of 100 and normalize
626
+ if score > 10:
627
+ return score / 10.0
628
+ return score
629
+ except (ValueError, TypeError):
630
+ # If we can't parse the score, return 0
631
+ return 0.0
632
+
633
+ def test_logical_reasoning(self, num_iterations: int, interaction_callback=None) -> Dict[str, Any]:
634
+ prompts = [
635
+ ("""
636
+ Three people - Alice, Bob, and Charlie - are standing in a line.
637
+ We know that:
638
+ 1. Alice is not first in line
639
+ 2. Charlie is not last in line
640
+ 3. Bob is not second in line
641
+ What is the correct order of people in the line?
642
+ Explain your reasoning step by step.
643
+ """, ["logical steps", "final answer", "explanation"]),
644
+
645
+ ("""
646
+ In a bag, there are red, blue, and green marbles.
647
+ If you pick two marbles at random:
648
+ - The probability of getting two red marbles is 1/6
649
+ - The probability of getting two blue marbles is 1/15
650
+ How many marbles of each color are in the bag?
651
+ Show your work.
652
+ """, ["equation setup", "calculation", "final answer"]),
653
+
654
+ ("""
655
+ You have 8 coins that look identical, but one is slightly heavier than the others.
656
+ Using a balance scale only twice, how can you identify the heavier coin?
657
+ Provide a detailed strategy.
658
+ """, ["strategy", "steps", "explanation"])
659
+ ]
660
+
661
+ results = []
662
+ for _ in range(num_iterations):
663
+ prompt, expected = prompts[_ % len(prompts)]
664
+ results.append(self.evaluate_response(prompt, expected, interaction_callback))
665
+
666
+ # Aggregate results
667
+ avg_score = sum(r["score"] for r in results) / len(results)
668
+ avg_time = sum(r["response_time"] for r in results) / len(results)
669
+
670
+ return {
671
+ "score": avg_score,
672
+ "average_response_time": avg_time,
673
+ "iterations": len(results),
674
+ "individual_results": results
675
+ }
676
+
677
+ def test_code_generation(self, num_iterations: int, interaction_callback=None) -> Dict[str, Any]:
678
+ prompts = [
679
+ ("""
680
+ Write a Python function that implements a binary search algorithm.
681
+ The function should:
682
+ 1. Take a sorted list and target value as input
683
+ 2. Return the index if found, or -1 if not found
684
+ 3. Include type hints
685
+ 4. Include docstring with examples
686
+ 5. Include error handling
687
+ """, ["function signature", "implementation", "type hints", "docstring", "error handling"]),
688
+
689
+ ("""
690
+ Create a class representing a Queue data structure using two stacks.
691
+ Implement the following methods:
692
+ 1. enqueue(item)
693
+ 2. dequeue()
694
+ 3. peek()
695
+ 4. is_empty()
696
+ Include proper error handling and type hints.
697
+ """, ["class definition", "method implementations", "error handling", "type hints"]),
698
+
699
+ ("""
700
+ Write a function that finds all prime numbers up to a given number using
701
+ the Sieve of Eratosthenes algorithm. The function should:
702
+ 1. Take a positive integer n as input
703
+ 2. Return a list of all prime numbers up to n
704
+ 3. Include time complexity analysis in comments
705
+ 4. Include memory optimization techniques
706
+ """, ["function implementation", "algorithm correctness", "optimization", "complexity analysis"])
707
+ ]
708
+
709
+ results = []
710
+ for _ in range(num_iterations):
711
+ prompt, expected = prompts[_ % len(prompts)]
712
+ results.append(self.evaluate_response(prompt, expected, interaction_callback))
713
+
714
+ avg_score = sum(r["score"] for r in results) / len(results)
715
+ avg_time = sum(r["response_time"] for r in results) / len(results)
716
+
717
+ return {
718
+ "score": avg_score,
719
+ "average_response_time": avg_time,
720
+ "iterations": len(results),
721
+ "individual_results": results
722
+ }
723
+
724
+ def test_math_solving(self, num_iterations: int, interaction_callback=None) -> Dict[str, Any]:
725
+ prompts = [
726
+ ("""
727
+ Solve the following calculus problem:
728
+ Find the volume of the solid obtained by rotating the region bounded by
729
+ y = x², y = 2x, and the y-axis about the x-axis.
730
+ Show all steps and explain your reasoning.
731
+ """, ["setup", "integration", "calculation", "final answer"]),
732
+
733
+ ("""
734
+ Prove that the sum of two odd numbers is even.
735
+ Provide a formal mathematical proof using algebraic notation.
736
+ """, ["definition", "algebraic representation", "logical steps", "conclusion"]),
737
+
738
+ ("""
739
+ Solve the following probability problem:
740
+ A box contains 3 red balls, 4 blue balls, and 5 green balls.
741
+ Two balls are drawn without replacement.
742
+ What is the probability that both balls are the same color?
743
+ Show detailed calculations.
744
+ """, ["probability theory", "calculations", "final answer"])
745
+ ]
746
+
747
+ results = []
748
+ for _ in range(num_iterations):
749
+ prompt, expected = prompts[_ % len(prompts)]
750
+ results.append(self.evaluate_response(prompt, expected, interaction_callback))
751
+
752
+ avg_score = sum(r["score"] for r in results) / len(results)
753
+ avg_time = sum(r["response_time"] for r in results) / len(results)
754
+
755
+ return {
756
+ "score": avg_score,
757
+ "average_response_time": avg_time,
758
+ "iterations": len(results),
759
+ "individual_results": results
760
+ }
761
+
762
+ def test_context_understanding(self, num_iterations: int, interaction_callback=None) -> Dict[str, Any]:
763
+ prompts = [
764
+ ("""
765
+ Read the following passage and answer the questions:
766
+
767
+ The Antikythera mechanism is an ancient Greek hand-powered orrery, described as the first analog computer, used to predict astronomical positions and eclipses for calendar and astrological purposes decades in advance. It was recovered in 1901 from the Antikythera wreck, a shipwreck off the Greek island of Antikythera. The instrument has been dated to about 100 BCE.
768
+
769
+ Questions:
770
+ 1. What was the main purpose of the Antikythera mechanism?
771
+ 2. When and where was it discovered?
772
+ 3. Why is it considered significant in the history of technology?
773
+
774
+ Provide detailed answers with supporting evidence from the text.
775
+ """, ["accurate answers", "text evidence", "comprehension"]),
776
+
777
+ ("""
778
+ Analyze the following code snippet and explain its implications:
779
+
780
+ ```python
781
+ def process_data(items: List[Dict[str, Any]]) -> Generator[Dict[str, Any], None, None]:
782
+ seen = set()
783
+ for item in items:
784
+ if item['id'] not in seen:
785
+ seen.add(item['id'])
786
+ yield item
787
+ ```
788
+
789
+ Explain:
790
+ 1. What does this code do?
791
+ 2. What are potential performance implications?
792
+ 3. What are possible use cases?
793
+ 4. Are there any potential improvements?
794
+ """, ["functionality", "performance analysis", "use cases", "improvements"]),
795
+
796
+ ("""
797
+ Consider this business scenario:
798
+
799
+ A startup is experiencing rapid growth but facing scalability issues with their current monolithic architecture. They need to decide between:
800
+ 1. Gradually refactoring to microservices
801
+ 2. Complete rewrite with modern architecture
802
+ 3. Optimizing current monolith
803
+
804
+ Provide a recommendation with justification.
805
+ """, ["analysis", "trade-offs", "recommendation", "justification"])
806
+ ]
807
+
808
+ results = []
809
+ for _ in range(num_iterations):
810
+ prompt, expected = prompts[_ % len(prompts)]
811
+ results.append(self.evaluate_response(prompt, expected, interaction_callback))
812
+
813
+ avg_score = sum(r["score"] for r in results) / len(results)
814
+ avg_time = sum(r["response_time"] for r in results) / len(results)
815
+
816
+ return {
817
+ "score": avg_score,
818
+ "average_response_time": avg_time,
819
+ "iterations": len(results),
820
+ "individual_results": results
821
+ }
822
+
823
+ def test_performance(self, num_iterations: int, interaction_callback=None) -> Dict[str, Any]:
824
+ # Test various performance metrics
825
+ start_time = time.time()
826
+ total_tokens = 0
827
+ response_times = []
828
+
829
+ prompt = "Generate a detailed technical explanation of how quantum computers work."
830
+
831
+ for _ in range(num_iterations):
832
+ iteration_start = time.time()
833
+ response_stream = ollama.generate(
834
+ model=self.model_name,
835
+ prompt=prompt,
836
+ options={"temperature": 0.7},
837
+ stream=True
838
+ )
839
+
840
+ # Collect full response for token counting
841
+ response_text = ""
842
+ for chunk in response_stream:
843
+ if 'response' in chunk:
844
+ response_text += chunk['response']
845
+
846
+ response_time = time.time() - iteration_start
847
+ response_times.append(response_time)
848
+ total_tokens += len(response_text.split())
849
+
850
+ total_time = time.time() - start_time
851
+ avg_response_time = sum(response_times) / len(response_times)
852
+ tokens_per_second = total_tokens / total_time
853
+
854
+ return {
855
+ "score": min(10, 10 * (1 / avg_response_time)) if avg_response_time > 0 else 0,
856
+ "average_response_time": avg_response_time,
857
+ "tokens_per_second": tokens_per_second,
858
+ "total_tokens": total_tokens,
859
+ "total_time": total_time,
860
+ "iterations": num_iterations
861
+ }
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ gradio==4.*
+ ollama==0.1.*
+ pandas==2.*
+ plotly==5.*
+ python-dotenv==1.*
+ tqdm==4.*
+ rich==13.*
test_benchmark.py ADDED
@@ -0,0 +1,138 @@
+ import ollama
+ import time
+ import json
+ from rich.console import Console
+ from rich.panel import Panel
+ from benchmarks.benchmark_suite import BenchmarkSuite
+ from typing import Dict, Any, List, Tuple, Generator
+
+ console = Console()
+
+ def run_full_benchmark(model_name: str, judge_model: str, num_iterations: int = 1):
+     """
+     Runs the full benchmark with real-time progress and Q&A output using rich.
+     """
+     console.print(Panel("[bold magenta]LLM Full Benchmark Test[/bold magenta]", expand=False))
+     console.print(f"[bold blue]Running full benchmark: {model_name} vs {judge_model}[/bold blue]")
+
+     benchmark_suite = BenchmarkSuite(model_name, judge_model)
+     results = {}
+
+     # Test categories with their display names
+     test_categories = [
+         ("Logical Reasoning", "test_logical_reasoning"),
+         ("Code Generation", "test_code_generation"),
+         ("Mathematical Problem Solving", "test_math_solving"),
+         ("Context Understanding", "test_context_understanding"),
+         ("Performance Metrics", "test_performance")
+     ]
+
+     # Store test case data
+     test_cases = []
+     max_test_cases = 5  # Limit the number of test cases displayed
+
+     # Track the last update time for each test case to throttle updates
+     last_updates = {}
+     update_interval = 0.5  # Minimum seconds between updates per test case
+
+     # Callback function to output Q&A information
+     def update_qa_output_callback(prompt: str, model_response: str, judge_response: str, model_name: str, judge_model_name: str):
+         # Create a key for this test case
+         row_key = f"{model_name}_{hash(prompt) % 1000}"
+
+         # Throttle updates per test case
+         current_time = time.time()
+         if row_key in last_updates:
+             if current_time - last_updates[row_key] < update_interval:
+                 return  # Not enough time has passed; skip this update
+         last_updates[row_key] = current_time
+
+         # Check whether this test case already exists
+         existing_case = None
+         for case in test_cases:
+             if case.get('key') == row_key:
+                 existing_case = case
+                 break
+
+         if existing_case is None:
+             # Evict the oldest test case once the display limit is reached
+             if len(test_cases) >= max_test_cases:
+                 test_cases.pop(0)
+             test_cases.append({
+                 'key': row_key,
+                 'model_name': model_name,
+                 'prompt': prompt,
+                 'model_response': model_response,
+                 'judge_response': judge_response
+             })
+         else:
+             # Update the existing test case
+             existing_case['model_response'] = model_response
+             existing_case['judge_response'] = judge_response
+
+         # Output the Q&A information as rich text
+         console.print(f"[bold blue]Model:[/bold blue] {model_name}")
+         console.print(f"[bold cyan]Prompt:[/bold cyan] {prompt}")
+         console.print(f"[bold green]Response:[/bold green] {model_response}")
+         console.print(f"[bold yellow]Judge:[/bold yellow] {judge_response}")
+         console.print("-" * 50)  # Separator line
+
+     for category_name, method_name in test_categories:
+         console.print(f"[bold green]Running {category_name} Benchmark...[/bold green]")
+
+         # Show model loading/processing
+         console.print(f"[magenta]  Loading models for {category_name}...[/magenta]")
+         time.sleep(0.5)  # Simulate loading time
+
+         try:
+             # Run the actual test, passing the callback
+             test_func = getattr(benchmark_suite, method_name)
+             result = test_func(num_iterations, update_qa_output_callback)
+             results[category_name] = result
+
+             console.print(f"[green]✓ {category_name} completed: {result.get('score', 0):.1f}/10[/green]")
+         except Exception as e:
+             console.print(f"[red]✗ {category_name} failed: {str(e)}[/red]")
+             results[category_name] = {"score": 0, "error": str(e)}
+
+     # Calculate final scores
+     overall_score = sum(r.get("score", 0) for r in results.values()) / len(results) if results else 0
+
+     # Print summary
+     console.print(Panel("[bold magenta]Benchmark Results Summary[/bold magenta]", expand=False))
+     for test_name, result in results.items():
+         score = result.get('score', 0)
+         if 'error' in result:
+             console.print(f"  {test_name}: [red]Error - {result['error']}[/red]")
+         else:
+             console.print(f"  {test_name}: {score:.1f}/10")
+     console.print(f"[bold blue]Overall Score: {overall_score:.1f}/10[/bold blue]")
+
+     return results
+
+ if __name__ == "__main__":
+     import sys
+
+     model_to_test = "qwen3:8b"
+     judge_model = "deepscaler:latest"
+     iterations = 1
+
+     if len(sys.argv) > 1:
+         if sys.argv[1] == "detailed":
+             run_full_benchmark(model_to_test, judge_model, iterations)
+         else:
+             console.print("[red]Invalid argument. Use 'python test_benchmark.py detailed'[/red]")
+     else:
+         run_full_benchmark(model_to_test, judge_model, iterations)