ThomasTheMaker committed
Commit d2602b9 · verified · 1 parent: 94ef616

Upload folder using huggingface_hub

Files changed (7)
  1. .gitignore +2 -0
  2. README.md +113 -3
  3. eval_prompts.json +81 -0
  4. evaluation_results.md +0 -0
  5. inference.py +136 -0
  6. requirements.txt +4 -0
  7. run_eval.py +325 -0
.gitignore ADDED
@@ -0,0 +1,2 @@
+ myenv/
+ __pycache__/
README.md CHANGED
@@ -1,3 +1,113 @@
- ---
- license: apache-2.0
- ---
+ # Model Evaluation System
+ 
+ A comprehensive system for evaluating local language models using standardized prompts and generating detailed markdown reports.
+ 
+ ## Files
+ 
+ - **`inference.py`** - Simple inference function: text in, text out
+ - **`eval_prompts.json`** - A set of prompts used to evaluate the models
+ - **`run_eval.py`** - Uses inference.py and eval_prompts.json to run the evaluation on all local models and save responses to markdown
+ - **`requirements.txt`** - Python dependencies
+ - **`myenv/`** - Python virtual environment
+ 
+ ## Quick Start
+ 
+ 1. **Activate the virtual environment:**
+    ```bash
+    source myenv/bin/activate
+    ```
+ 
+ 2. **Install dependencies (if not already installed):**
+    ```bash
+    pip install -r requirements.txt
+    ```
+ 
+ 3. **Run evaluation on all models:**
+    ```bash
+    python run_eval.py
+    ```
+ 
+ 4. **Test with a single model:**
+    ```bash
+    python inference.py
+    ```
+ 
+ ## Features
+ 
+ ### Inference System (`inference.py`)
+ - **ModelInference class**: Load and run inference on local Hugging Face models
+ - **Memory management**: Automatic model loading/unloading
+ - **GPU/CPU support**: Automatically uses the GPU if available, falls back to CPU
+ - **Model discovery**: Automatically finds all local model directories
+ 
+ ### Evaluation Prompts (`eval_prompts.json`)
+ - **12 diverse prompts** covering:
+   - Reasoning & logic
+   - Mathematics & algebra
+   - Coding & technical explanations
+   - General knowledge & facts
+   - Creative writing
+   - Instruction following
+   - Common sense reasoning
+   - Text summarization
+ 
+ ### Evaluation Runner (`run_eval.py`)
+ - **Batch processing**: Evaluates all local models automatically
+ - **Progress tracking**: Shows real-time progress with emojis and timing
+ - **Error handling**: Gracefully handles model loading failures
+ - **Markdown reports**: Generates comprehensive evaluation reports
+ - **Memory efficient**: Unloads models between evaluations
+ 
+ ## Model Requirements
+ 
+ Models should be in Hugging Face format with these files:
+ - `config.json`
+ - `model.safetensors`
+ - `tokenizer.json`
+ - `vocab.json`
+ - Other standard HF files
+ 
+ ## Example Usage
+ 
+ ```python
+ from inference import ModelInference, get_local_models
+ 
+ # Find all models
+ models = get_local_models()
+ print(f"Found {len(models)} models")
+ 
+ # Quick inference
+ from inference import simple_inference
+ result = simple_inference(models[0], "What is AI?", max_length=256)
+ print(result)
+ 
+ # Advanced usage
+ inference = ModelInference(models[0])
+ if inference.load_model():
+     response = inference.generate_text("Explain Python", max_length=512, temperature=0.7)
+     print(response)
+     inference.unload_model()
+ ```
+ 
+ ## Output
+ 
+ The evaluation generates a markdown report (`evaluation_results.md`) with:
+ - **Summary table**: Model performance overview
+ - **Detailed results**: Full responses organized by category
+ - **Timing information**: Evaluation duration per model
+ - **Error reporting**: Any issues encountered
+ 
+ ## System Requirements
+ 
+ - Python 3.8+
+ - PyTorch 2.0+
+ - Transformers 4.30+
+ - 4GB+ RAM (varies by model size)
+ - Optional: CUDA-compatible GPU for faster inference
+ 
+ ## Notes
+ 
+ - Evaluation can take significant time with many models (136 models detected)
+ - Models are evaluated sequentially to manage memory usage
+ - Small delays between prompts prevent overheating
+ - Progress is shown with real-time updates
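The sequential flow the Notes describe (load a model, run every prompt, record timing, move on) can be sketched with a stub generator. `fake_generate` and the model names below are placeholders for illustration, not part of this repo:

```python
import time

def fake_generate(model_name, prompt):
    """Stand-in for ModelInference.generate_text (placeholder)."""
    return f"[{model_name}] response to: {prompt}"

def evaluate_all(models, prompts):
    """Evaluate models one at a time, mirroring the sequential loop
    used to bound memory usage."""
    results = []
    for model_name in models:
        start = time.time()
        responses = {p["id"]: fake_generate(model_name, p["prompt"]) for p in prompts}
        results.append({
            "model_name": model_name,
            "status": "completed",
            "responses": responses,
            "evaluation_time": time.time() - start,
        })
    return results

results = evaluate_all(
    ["model-a", "model-b"],
    [{"id": "math_1", "prompt": "What is 15% of 240?"}],
)
print(len(results))  # 2
```

Each result dict carries the same keys (`model_name`, `status`, `responses`, `evaluation_time`) that `run_eval.py` writes into the markdown report.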
eval_prompts.json ADDED
@@ -0,0 +1,81 @@
+ {
+   "evaluation_prompts": [
+     {
+       "id": "reasoning_1",
+       "category": "reasoning",
+       "prompt": "If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?",
+       "expected_type": "logical_reasoning"
+     },
+     {
+       "id": "math_1",
+       "category": "mathematics",
+       "prompt": "What is 15% of 240?",
+       "expected_type": "arithmetic"
+     },
+     {
+       "id": "coding_1",
+       "category": "coding",
+       "prompt": "Write a Python function to check if a string is a palindrome.",
+       "expected_type": "code_generation"
+     },
+     {
+       "id": "general_1",
+       "category": "general_knowledge",
+       "prompt": "What is the capital of France?",
+       "expected_type": "factual_recall"
+     },
+     {
+       "id": "creative_1",
+       "category": "creative_writing",
+       "prompt": "Write a short story about a robot learning to paint.",
+       "expected_type": "creative_generation"
+     },
+     {
+       "id": "reasoning_2",
+       "category": "reasoning",
+       "prompt": "A farmer has 17 sheep. All but 9 die. How many sheep are left?",
+       "expected_type": "logical_reasoning"
+     },
+     {
+       "id": "math_2",
+       "category": "mathematics",
+       "prompt": "If x + 5 = 12, what is x?",
+       "expected_type": "algebra"
+     },
+     {
+       "id": "coding_2",
+       "category": "coding",
+       "prompt": "Explain the difference between a list and a tuple in Python.",
+       "expected_type": "technical_explanation"
+     },
+     {
+       "id": "general_2",
+       "category": "general_knowledge",
+       "prompt": "Who wrote the novel '1984'?",
+       "expected_type": "factual_recall"
+     },
+     {
+       "id": "instruction_following",
+       "category": "instruction_following",
+       "prompt": "Please respond with exactly three words, no more and no less.",
+       "expected_type": "constraint_following"
+     },
+     {
+       "id": "common_sense",
+       "category": "common_sense",
+       "prompt": "Why do people use umbrellas when it rains?",
+       "expected_type": "common_sense_reasoning"
+     },
+     {
+       "id": "summarization",
+       "category": "summarization",
+       "prompt": "Summarize this in one sentence: Artificial intelligence is a field of computer science that aims to create machines capable of intelligent behavior. It includes machine learning, natural language processing, and robotics.",
+       "expected_type": "text_summarization"
+     }
+   ],
+   "evaluation_settings": {
+     "max_length": 512,
+     "temperature": 0.7,
+     "description": "Standard evaluation prompts covering reasoning, math, coding, knowledge, and creative tasks"
+   }
+ }
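A quick way to sanity-check the prompt file's schema before a long run (a sketch; the inline JSON below is a trimmed stand-in for the real `eval_prompts.json`):

```python
import json

# Trimmed stand-in for eval_prompts.json (one prompt instead of twelve)
raw = '''
{
  "evaluation_prompts": [
    {"id": "math_1", "category": "mathematics",
     "prompt": "What is 15% of 240?", "expected_type": "arithmetic"}
  ],
  "evaluation_settings": {"max_length": 512, "temperature": 0.7}
}
'''

data = json.loads(raw)

# Every prompt entry should carry the four keys run_eval.py reads
required = {"id", "category", "prompt", "expected_type"}
for entry in data["evaluation_prompts"]:
    missing = required - entry.keys()
    assert not missing, f"{entry.get('id')} is missing {missing}"

settings = data["evaluation_settings"]
print(settings["max_length"], settings["temperature"])  # 512 0.7
```

Running this against the real file (via `json.load(open("eval_prompts.json"))`) catches a missing key before a model is ever loaded.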
evaluation_results.md ADDED
The diff for this file is too large to render. See raw diff
 
inference.py ADDED
@@ -0,0 +1,136 @@
+ """
+ Simple inference function: text in, text out
+ Supports loading and running inference on local Hugging Face models
+ """
+ 
+ import os
+ import json
+ from typing import List, Dict, Any
+ from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
+ import torch
+ 
+ 
+ class ModelInference:
+     def __init__(self, model_path: str):
+         """Initialize inference with a local model path"""
+         self.model_path = model_path
+         self.model_name = os.path.basename(model_path)
+         self.tokenizer = None
+         self.model = None
+         self.pipeline = None
+ 
+     def load_model(self):
+         """Load the model and tokenizer"""
+         try:
+             print(f"Loading model from {self.model_path}")
+ 
+             # Load tokenizer and model
+             self.tokenizer = AutoTokenizer.from_pretrained(self.model_path)
+             self.model = AutoModelForCausalLM.from_pretrained(
+                 self.model_path,
+                 torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
+                 device_map="auto" if torch.cuda.is_available() else None
+             )
+ 
+             # Create text generation pipeline; the model is already placed
+             # by device_map above, so no device argument is passed here
+             self.pipeline = pipeline(
+                 "text-generation",
+                 model=self.model,
+                 tokenizer=self.tokenizer
+             )
+ 
+             print(f"✓ Model loaded successfully: {self.model_name}")
+             return True
+ 
+         except Exception as e:
+             print(f"✗ Failed to load model {self.model_name}: {e}")
+             return False
+ 
+     def generate_text(self, prompt: str, max_length: int = 512, temperature: float = 0.7) -> str:
+         """Generate text from a prompt"""
+         if not self.pipeline:
+             raise RuntimeError("Model not loaded. Call load_model() first.")
+ 
+         try:
+             # Generate text
+             outputs = self.pipeline(
+                 prompt,
+                 max_length=max_length,
+                 temperature=temperature,
+                 do_sample=True,
+                 pad_token_id=self.tokenizer.eos_token_id,
+                 return_full_text=False  # Only return generated text, not the prompt
+             )
+ 
+             generated_text = outputs[0]['generated_text']
+             return generated_text.strip()
+ 
+         except Exception as e:
+             print(f"Error generating text: {e}")
+             return f"[Error: {str(e)}]"
+ 
+     def unload_model(self):
+         """Unload model to free memory"""
+         if self.model:
+             del self.model
+             del self.tokenizer
+             del self.pipeline
+             self.model = None
+             self.tokenizer = None
+             self.pipeline = None
+ 
+             # Clear GPU cache if available
+             if torch.cuda.is_available():
+                 torch.cuda.empty_cache()
+ 
+             print(f"✓ Model {self.model_name} unloaded")
+ 
+ 
+ def get_local_models(base_path: str = ".") -> List[str]:
+     """Find all local model directories"""
+     model_dirs = []
+ 
+     for item in os.listdir(base_path):
+         item_path = os.path.join(base_path, item)
+ 
+         # Check if directory contains model files
+         if os.path.isdir(item_path):
+             required_files = ['config.json', 'model.safetensors']
+             if all(os.path.exists(os.path.join(item_path, f)) for f in required_files):
+                 model_dirs.append(item_path)
+ 
+     return sorted(model_dirs)
+ 
+ 
+ def simple_inference(model_path: str, prompt: str, max_length: int = 512) -> str:
+     """Simple one-shot inference function"""
+     inference = ModelInference(model_path)
+ 
+     if not inference.load_model():
+         return "[Error: Failed to load model]"
+ 
+     try:
+         result = inference.generate_text(prompt, max_length)
+         return result
+     finally:
+         inference.unload_model()
+ 
+ 
+ if __name__ == "__main__":
+     # Example usage
+     models = get_local_models()
+ 
+     if models:
+         print(f"Found {len(models)} local models:")
+         for model in models[:3]:  # Show first 3
+             print(f"  - {os.path.basename(model)}")
+ 
+         # Test inference on first model
+         test_prompt = "What is artificial intelligence?"
+         print(f"\nTesting inference with prompt: '{test_prompt}'")
+ 
+         result = simple_inference(models[0], test_prompt, max_length=256)
+         print(f"Response: {result}")
+     else:
+         print("No local models found in current directory")
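The directory check used by `get_local_models` can be exercised without any real model weights by creating empty marker files in a temporary directory (a self-contained sketch of the same `config.json` + `model.safetensors` test):

```python
import os
import tempfile

def find_model_dirs(base_path):
    """Same rule as get_local_models: a subdirectory counts as a model
    if it contains both config.json and model.safetensors."""
    required = ["config.json", "model.safetensors"]
    found = []
    for item in sorted(os.listdir(base_path)):
        path = os.path.join(base_path, item)
        if os.path.isdir(path) and all(
            os.path.exists(os.path.join(path, f)) for f in required
        ):
            found.append(item)
    return found

with tempfile.TemporaryDirectory() as base:
    # "good" has both marker files, "partial" has only one
    os.makedirs(os.path.join(base, "good"))
    os.makedirs(os.path.join(base, "partial"))
    for name in ("config.json", "model.safetensors"):
        open(os.path.join(base, "good", name), "w").close()
    open(os.path.join(base, "partial", "config.json"), "w").close()
    print(find_model_dirs(base))  # ['good']
```

Directories missing either file (e.g. a checkpoint sharded into multiple `.safetensors` files without a plain `model.safetensors`) are silently skipped by this rule.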
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ torch>=2.0.0
+ transformers>=4.30.0
+ accelerate>=0.20.0
+ safetensors>=0.3.0
run_eval.py ADDED
@@ -0,0 +1,325 @@
+ #!/usr/bin/env python3
+ """
+ Model Evaluation Runner
+ Uses inference.py & eval_prompts.json to run evaluation on all local models
+ present in this folder, and saves the responses to a concise markdown file
+ """
+ 
+ import os
+ import json
+ import time
+ from datetime import datetime
+ from typing import List, Dict, Any
+ from inference import ModelInference, get_local_models
+ 
+ 
+ def load_eval_prompts(prompts_file: str = "eval_prompts.json") -> Dict[str, Any]:
+     """Load evaluation prompts from JSON file"""
+     try:
+         with open(prompts_file, 'r', encoding='utf-8') as f:
+             return json.load(f)
+     except FileNotFoundError:
+         print(f"Error: {prompts_file} not found")
+         return {}
+     except json.JSONDecodeError as e:
+         print(f"Error parsing {prompts_file}: {e}")
+         return {}
+ 
+ 
+ def run_single_evaluation(model_path: str, prompts_data: Dict[str, Any]) -> Dict[str, Any]:
+     """Run evaluation on a single model"""
+     model_name = os.path.basename(model_path)
+     print(f"\n{'='*60}")
+     print(f"Evaluating: {model_name}")
+     print(f"{'='*60}")
+ 
+     # Initialize model
+     inference = ModelInference(model_path)
+     if not inference.load_model():
+         return {
+             "model_name": model_name,
+             "model_path": model_path,
+             "status": "failed_to_load",
+             "responses": {},
+             "evaluation_time": 0
+         }
+ 
+     # Get settings
+     settings = prompts_data.get("evaluation_settings", {})
+     max_length = settings.get("max_length", 512)
+     temperature = settings.get("temperature", 0.7)
+ 
+     # Run evaluation
+     start_time = time.time()
+     responses = {}
+ 
+     try:
+         for prompt_data in prompts_data["evaluation_prompts"]:
+             prompt_id = prompt_data["id"]
+             prompt_text = prompt_data["prompt"]
+             category = prompt_data["category"]
+ 
+             print(f"  Running prompt: {prompt_id} ({category})")
+ 
+             # Generate response
+             response = inference.generate_text(
+                 prompt_text,
+                 max_length=max_length,
+                 temperature=temperature
+             )
+ 
+             responses[prompt_id] = {
+                 "prompt": prompt_text,
+                 "response": response,
+                 "category": category,
+                 "expected_type": prompt_data.get("expected_type", ""),
+             }
+ 
+             # Small delay to prevent overheating
+             time.sleep(0.5)
+ 
+     except Exception as e:
+         print(f"Error during evaluation: {e}")
+         responses["error"] = str(e)
+ 
+     finally:
+         # Clean up model
+         inference.unload_model()
+ 
+     evaluation_time = time.time() - start_time
+ 
+     return {
+         "model_name": model_name,
+         "model_path": model_path,
+         "status": "completed",
+         "responses": responses,
+         "evaluation_time": evaluation_time,
+         "settings": {
+             "max_length": max_length,
+             "temperature": temperature
+         }
+     }
+ 
+ 
+ def generate_markdown_report(results: List[Dict[str, Any]], output_file: str = "evaluation_results.md", is_incremental: bool = False):
+     """Generate a markdown report from evaluation results"""
+ 
+     # Get timestamp
+     timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+ 
+     # Start markdown content
+     md_content = f"""# Model Evaluation Results
+ 
+ **Generated:** {timestamp}
+ **Models Evaluated:** {len(results)}
+ **Status:** {'In Progress' if is_incremental else 'Complete'}
+ 
+ ## Summary
+ 
+ | Model | Status | Prompts Completed | Evaluation Time |
+ |-------|--------|-------------------|-----------------|
+ """
+ 
+     # Add summary table
+     for result in results:
+         model_name = result["model_name"]
+         status = result["status"]
+         num_prompts = len([r for r in result["responses"].keys() if r != "error"])
+         eval_time = f"{result['evaluation_time']:.1f}s"
+ 
+         md_content += f"| {model_name} | {status} | {num_prompts} | {eval_time} |\n"
+ 
+     # Add detailed results
+     md_content += "\n## Detailed Results\n\n"
+ 
+     for result in results:
+         model_name = result["model_name"]
+         md_content += f"### {model_name}\n\n"
+ 
+         if result["status"] == "failed_to_load":
+             md_content += "❌ **Failed to load model**\n\n"
+             continue
+ 
+         # Add model info
+         settings = result.get("settings", {})
+         md_content += f"- **Evaluation Time:** {result['evaluation_time']:.1f} seconds\n"
+         md_content += f"- **Max Length:** {settings.get('max_length', 'N/A')}\n"
+         md_content += f"- **Temperature:** {settings.get('temperature', 'N/A')}\n\n"
+ 
+         # Add responses by category
+         responses = result["responses"]
+         categories = {}
+ 
+         for prompt_id, data in responses.items():
+             if prompt_id == "error":
+                 continue
+             category = data.get("category", "other")
+             if category not in categories:
+                 categories[category] = []
+             categories[category].append((prompt_id, data))
+ 
+         # Display by category
+         for category, prompts in categories.items():
+             md_content += f"#### {category.title()}\n\n"
+ 
+             for prompt_id, data in prompts:
+                 md_content += f"**Prompt ({prompt_id}):** {data['prompt']}\n\n"
+                 md_content += f"**Response:**\n```\n{data['response']}\n```\n\n"
+                 md_content += "---\n\n"
+ 
+         # Add error if present
+         if "error" in responses:
+             md_content += f"⚠️ **Error:** {responses['error']}\n\n"
+ 
+     # Add progress indicator if incremental
+     if is_incremental:
+         md_content += f"\n---\n\n**Last Updated:** {timestamp}\n"
+         md_content += f"**Progress:** {len(results)} models completed\n"
+ 
+     # Write to file
+     try:
+         with open(output_file, 'w', encoding='utf-8') as f:
+             f.write(md_content)
+         if is_incremental:
+             print(f"📝 Progress saved to: {output_file}")
+         else:
+             print(f"\n✅ Evaluation report saved to: {output_file}")
+     except Exception as e:
+         print(f"❌ Failed to save report: {e}")
+ 
+ 
+ def save_incremental_results(results: List[Dict[str, Any]], output_file: str = "evaluation_results.md"):
+     """Save results incrementally as evaluation progresses"""
+     generate_markdown_report(results, output_file, is_incremental=True)
+ 
+ 
+ def load_existing_results(output_file: str = "evaluation_results.md") -> List[str]:
+     """Load list of already evaluated models from existing results file"""
+     if not os.path.exists(output_file):
+         return []
+ 
+     try:
+         with open(output_file, 'r', encoding='utf-8') as f:
+             content = f.read()
+ 
+         # Take the first cell of each summary-table row, skipping the
+         # header row and the dashed separator row
+         model_names = []
+         for line in content.splitlines():
+             if not line.startswith('|'):
+                 continue
+             first_cell = line.strip('|').split('|')[0].strip()
+             if first_cell and first_cell != 'Model' and first_cell.strip('-'):
+                 model_names.append(first_cell)
+         return model_names
+     except Exception as e:
+         print(f"⚠️ Could not load existing results: {e}")
+         return []
+ 
+ 
+ def main():
+     """Main evaluation function"""
+     print("🚀 Starting Model Evaluation")
+     print(f"Working directory: {os.getcwd()}")
+ 
+     # Load prompts
+     print("\n📋 Loading evaluation prompts...")
+     prompts_data = load_eval_prompts()
+     if not prompts_data:
+         print("❌ Failed to load prompts. Exiting.")
+         return
+ 
+     num_prompts = len(prompts_data.get("evaluation_prompts", []))
+     print(f"✅ Loaded {num_prompts} evaluation prompts")
+ 
+     # Find models
+     print("\n🔍 Scanning for local models...")
+     models = get_local_models()
+ 
+     if not models:
+         print("❌ No local models found. Make sure model directories are present.")
+         return
+ 
+     # Check for existing results to resume
+     output_file = "evaluation_results.md"
+     existing_models = load_existing_results(output_file)
+ 
+     if existing_models:
+         print(f"📄 Found existing results with {len(existing_models)} models")
+         print("🔄 Resuming evaluation from where it left off...")
+ 
+         # Filter out already evaluated models
+         models = [model for model in models if os.path.basename(model) not in existing_models]
+         print(f"📊 {len(existing_models)} models already evaluated, {len(models)} remaining")
+     else:
+         print(f"✅ Found {len(models)} models to evaluate")
+ 
+     if not models:
+         print("🎉 All models already evaluated!")
+         return
+ 
+     print("📋 Models to evaluate:")
+     for i, model in enumerate(models, 1):
+         print(f"  {i}. {os.path.basename(model)}")
+ 
+     # Load existing results if resuming
+     results = []
+     if existing_models:
+         # For now, start fresh but skip already evaluated models;
+         # a more advanced version could parse the existing markdown to reload results
+         print("📝 Note: Starting fresh evaluation (existing results will be overwritten)")
+ 
+     # Run evaluations
+     print(f"\n🧪 Starting evaluation on {len(models)} models...")
+ 
+     for i, model_path in enumerate(models, 1):
+         print(f"\n[{i}/{len(models)}] Processing: {os.path.basename(model_path)}")
+ 
+         try:
+             result = run_single_evaluation(model_path, prompts_data)
+             results.append(result)
+ 
+             # Show progress
+             if result["status"] == "completed":
+                 num_responses = len([r for r in result["responses"].keys() if r != "error"])
+                 print(f"✅ Completed {num_responses}/{num_prompts} prompts in {result['evaluation_time']:.1f}s")
+             else:
+                 print("❌ Failed to evaluate model")
+ 
+             # Save progress incrementally after each model
+             save_incremental_results(results)
+ 
+         except KeyboardInterrupt:
+             print("\n⚠️ Evaluation interrupted by user")
+             print("💾 Saving current progress...")
+             save_incremental_results(results)
+             break
+         except Exception as e:
+             print(f"❌ Unexpected error: {e}")
+             # Add error result and continue
+             error_result = {
+                 "model_name": os.path.basename(model_path),
+                 "model_path": model_path,
+                 "status": "error",
+                 "responses": {"error": str(e)},
+                 "evaluation_time": 0
+             }
+             results.append(error_result)
+             save_incremental_results(results)
+             # Continue with next model
+ 
+     # Generate report
+     if results:
+         print("\n📊 Generating evaluation report...")
+         generate_markdown_report(results)
+ 
+         # Summary stats
+         successful = len([r for r in results if r["status"] == "completed"])
+         total_time = sum(r["evaluation_time"] for r in results)
+ 
+         print("\n🎉 Evaluation Complete!")
+         print(f"  - Models evaluated: {successful}/{len(results)}")
+         print(f"  - Total time: {total_time:.1f} seconds")
+         print(f"  - Average time per model: {total_time/len(results):.1f} seconds")
+     else:
+         print("❌ No results to report")
+ 
+ 
+ if __name__ == "__main__":
+     main()
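The resume logic above hinges on recovering model names from the report's summary table: the first cell of each data row is the model name, while the header and dashed separator rows must be skipped. A self-contained sketch of that parsing (the model names `tiny-gpt` and `mini-llama` are invented for the example, not models from this repo):

```python
def parse_summary_models(markdown):
    """Collect the first cell of each summary-table data row,
    skipping the header row and the dashed separator row."""
    names = []
    for line in markdown.splitlines():
        if not line.startswith("|"):
            continue
        first_cell = line.strip("|").split("|")[0].strip()
        # Skip the "Model" header and cells made only of dashes
        if first_cell and first_cell != "Model" and first_cell.strip("-"):
            names.append(first_cell)
    return names

report = """\
| Model | Status | Prompts Completed | Evaluation Time |
|-------|--------|-------------------|-----------------|
| tiny-gpt | completed | 12 | 84.3s |
| mini-llama | failed_to_load | 0 | 0.0s |
"""
print(parse_summary_models(report))  # ['tiny-gpt', 'mini-llama']
```

Splitting each row on `|` and keeping only the first cell avoids the pitfall of a global regex over the whole file, which would also match the other table columns.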