Upload folder using huggingface_hub
- .gitignore +2 -0
- README.md +113 -3
- eval_prompts.json +81 -0
- evaluation_results.md +0 -0
- inference.py +136 -0
- requirements.txt +4 -0
- run_eval.py +325 -0
.gitignore
ADDED
@@ -0,0 +1,2 @@
myenv/
__pycache__/
README.md
CHANGED
@@ -1,3 +1,113 @@
# Model Evaluation System

A comprehensive system for evaluating local language models using standardized prompts and generating detailed markdown reports.

## Files

- **`inference.py`** - Simple inference function: text in, text out
- **`eval_prompts.json`** - A standardized set of prompts for evaluating models
- **`run_eval.py`** - Uses `inference.py` and `eval_prompts.json` to evaluate every local model and save the responses to markdown
- **`requirements.txt`** - Python dependencies
- **`myenv/`** - Python virtual environment

## Quick Start

1. **Activate the virtual environment:**
   ```bash
   source myenv/bin/activate
   ```

2. **Install dependencies (if not already installed):**
   ```bash
   pip install -r requirements.txt
   ```

3. **Run evaluation on all models:**
   ```bash
   python run_eval.py
   ```

4. **Test with a single model:**
   ```bash
   python inference.py
   ```

## Features

### Inference System (`inference.py`)
- **ModelInference class**: Load and run inference on local Hugging Face models
- **Memory management**: Automatic model loading/unloading
- **GPU/CPU support**: Uses the GPU automatically when available, falls back to CPU
- **Model discovery**: Automatically finds all local model directories

### Evaluation Prompts (`eval_prompts.json`)
- **12 diverse prompts** covering:
  - Reasoning & logic
  - Mathematics & algebra
  - Coding & technical explanations
  - General knowledge & facts
  - Creative writing
  - Instruction following
  - Common sense reasoning
  - Text summarization

### Evaluation Runner (`run_eval.py`)
- **Batch processing**: Evaluates all local models automatically
- **Progress tracking**: Shows real-time progress with emojis and timing
- **Error handling**: Gracefully handles model loading failures
- **Markdown reports**: Generates comprehensive evaluation reports
- **Memory efficient**: Unloads models between evaluations

## Model Requirements

Models should be in Hugging Face format with these files (model discovery checks only the first two; see the sketch below):
- `config.json`
- `model.safetensors`
- `tokenizer.json`
- `vocab.json`
- Other standard HF files
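
To test whether a single directory qualifies, the same check that `get_local_models` applies can be written as a small helper (a minimal sketch; `looks_like_model` is an illustrative name, not part of the codebase):

```python
import os

def looks_like_model(path: str) -> bool:
    # Mirrors the discovery check in inference.get_local_models:
    # a directory counts as a model if config + weights are present.
    required = ["config.json", "model.safetensors"]
    return os.path.isdir(path) and all(
        os.path.exists(os.path.join(path, f)) for f in required
    )
```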

## Example Usage

```python
from inference import ModelInference, get_local_models

# Find all models
models = get_local_models()
print(f"Found {len(models)} models")

# Quick inference
from inference import simple_inference
result = simple_inference(models[0], "What is AI?", max_length=256)
print(result)

# Advanced usage
inference = ModelInference(models[0])
if inference.load_model():
    response = inference.generate_text("Explain Python", max_length=512, temperature=0.7)
    print(response)
    inference.unload_model()
```

## Output

The evaluation generates a markdown report (`evaluation_results.md`) with:
- **Summary table**: Model performance overview
- **Detailed results**: Full responses organized by category
- **Timing information**: Evaluation duration per model
- **Error reporting**: Any issues encountered
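
The layout mirrors `generate_markdown_report` in `run_eval.py`; roughly (a trimmed skeleton, all values placeholders):

```markdown
# Model Evaluation Results

**Generated:** <timestamp>
**Models Evaluated:** <count>
**Status:** Complete

## Summary

| Model | Status | Prompts Completed | Evaluation Time |
|-------|--------|-------------------|-----------------|
| <model-name> | completed | 12 | <seconds>s |

## Detailed Results

### <model-name>

#### Reasoning

**Prompt (reasoning_1):** ...

**Response:** ...
```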

## System Requirements

- Python 3.8+
- PyTorch 2.0+
- Transformers 4.30+
- 4GB+ RAM (varies by model size)
- Optional: CUDA-compatible GPU for faster inference

## Notes

- Evaluation can take significant time with many models (136 models detected)
- Models are evaluated sequentially to manage memory usage
- Small delays between prompts prevent overheating
- Progress is shown with real-time updates
eval_prompts.json
ADDED
@@ -0,0 +1,81 @@
{
  "evaluation_prompts": [
    {
      "id": "reasoning_1",
      "category": "reasoning",
      "prompt": "If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?",
      "expected_type": "logical_reasoning"
    },
    {
      "id": "math_1",
      "category": "mathematics",
      "prompt": "What is 15% of 240?",
      "expected_type": "arithmetic"
    },
    {
      "id": "coding_1",
      "category": "coding",
      "prompt": "Write a Python function to check if a string is a palindrome.",
      "expected_type": "code_generation"
    },
    {
      "id": "general_1",
      "category": "general_knowledge",
      "prompt": "What is the capital of France?",
      "expected_type": "factual_recall"
    },
    {
      "id": "creative_1",
      "category": "creative_writing",
      "prompt": "Write a short story about a robot learning to paint.",
      "expected_type": "creative_generation"
    },
    {
      "id": "reasoning_2",
      "category": "reasoning",
      "prompt": "A farmer has 17 sheep. All but 9 die. How many sheep are left?",
      "expected_type": "logical_reasoning"
    },
    {
      "id": "math_2",
      "category": "mathematics",
      "prompt": "If x + 5 = 12, what is x?",
      "expected_type": "algebra"
    },
    {
      "id": "coding_2",
      "category": "coding",
      "prompt": "Explain the difference between a list and a tuple in Python.",
      "expected_type": "technical_explanation"
    },
    {
      "id": "general_2",
      "category": "general_knowledge",
      "prompt": "Who wrote the novel '1984'?",
      "expected_type": "factual_recall"
    },
    {
      "id": "instruction_following",
      "category": "instruction_following",
      "prompt": "Please respond with exactly three words, no more and no less.",
      "expected_type": "constraint_following"
    },
    {
      "id": "common_sense",
      "category": "common_sense",
      "prompt": "Why do people use umbrellas when it rains?",
      "expected_type": "common_sense_reasoning"
    },
    {
      "id": "summarization",
      "category": "summarization",
      "prompt": "Summarize this in one sentence: Artificial intelligence is a field of computer science that aims to create machines capable of intelligent behavior. It includes machine learning, natural language processing, and robotics.",
      "expected_type": "text_summarization"
    }
  ],
  "evaluation_settings": {
    "max_length": 512,
    "temperature": 0.7,
    "description": "Standard evaluation prompts covering reasoning, math, coding, knowledge, and creative tasks"
  }
}
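
For quick inspection, the prompt file can be loaded and listed with a few lines of Python (a minimal sketch; `load_eval_prompts` in `run_eval.py` below does the same with error handling):

```python
import json

# Load the prompt set and print one line per prompt
with open("eval_prompts.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print(data["evaluation_settings"]["description"])
for p in data["evaluation_prompts"]:
    print(f"{p['id']:24} [{p['category']}] {p['prompt'][:50]}")
```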
evaluation_results.md
ADDED
The diff for this file is too large to render.
inference.py
ADDED
@@ -0,0 +1,136 @@
"""
Simple inference function: text in, text out
Supports loading and running inference on local Hugging Face models
"""

import os
import json
from typing import List, Dict, Any
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch


class ModelInference:
    def __init__(self, model_path: str):
        """Initialize inference with a local model path"""
        self.model_path = model_path
        self.model_name = os.path.basename(model_path)
        self.tokenizer = None
        self.model = None
        self.pipeline = None

    def load_model(self):
        """Load the model and tokenizer"""
        try:
            print(f"Loading model from {self.model_path}")

            # Load tokenizer and model (fp16 on GPU, fp32 on CPU)
            self.tokenizer = AutoTokenizer.from_pretrained(self.model_path)
            self.model = AutoModelForCausalLM.from_pretrained(
                self.model_path,
                torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
                device_map="auto" if torch.cuda.is_available() else None
            )

            # Create text generation pipeline
            self.pipeline = pipeline(
                "text-generation",
                model=self.model,
                tokenizer=self.tokenizer,
                device_map="auto" if torch.cuda.is_available() else None
            )

            print(f"✅ Model loaded successfully: {self.model_name}")
            return True

        except Exception as e:
            print(f"❌ Failed to load model {self.model_name}: {e}")
            return False

    def generate_text(self, prompt: str, max_length: int = 512, temperature: float = 0.7) -> str:
        """Generate text from a prompt"""
        if not self.pipeline:
            raise RuntimeError("Model not loaded. Call load_model() first.")

        try:
            # Generate text (note: max_length counts prompt tokens plus generated tokens)
            outputs = self.pipeline(
                prompt,
                max_length=max_length,
                temperature=temperature,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id,
                return_full_text=False  # Only return generated text, not the prompt
            )

            generated_text = outputs[0]['generated_text']
            return generated_text.strip()

        except Exception as e:
            print(f"Error generating text: {e}")
            return f"[Error: {str(e)}]"

    def unload_model(self):
        """Unload model to free memory"""
        if self.model:
            del self.model
            del self.tokenizer
            del self.pipeline
            self.model = None
            self.tokenizer = None
            self.pipeline = None

            # Clear GPU cache if available
            if torch.cuda.is_available():
                torch.cuda.empty_cache()

            print(f"✅ Model {self.model_name} unloaded")


def get_local_models(base_path: str = ".") -> List[str]:
    """Find all local model directories"""
    model_dirs = []

    for item in os.listdir(base_path):
        item_path = os.path.join(base_path, item)

        # Check if directory contains model files
        if os.path.isdir(item_path):
            required_files = ['config.json', 'model.safetensors']
            if all(os.path.exists(os.path.join(item_path, f)) for f in required_files):
                model_dirs.append(item_path)

    return sorted(model_dirs)


def simple_inference(model_path: str, prompt: str, max_length: int = 512) -> str:
    """Simple one-shot inference function"""
    inference = ModelInference(model_path)

    if not inference.load_model():
        return "[Error: Failed to load model]"

    try:
        result = inference.generate_text(prompt, max_length)
        return result
    finally:
        inference.unload_model()


if __name__ == "__main__":
    # Example usage
    models = get_local_models()

    if models:
        print(f"Found {len(models)} local models:")
        for model in models[:3]:  # Show first 3
            print(f"  - {os.path.basename(model)}")

        # Test inference on first model
        test_prompt = "What is artificial intelligence?"
        print(f"\nTesting inference with prompt: '{test_prompt}'")

        result = simple_inference(models[0], test_prompt, max_length=256)
        print(f"Response: {result}")
    else:
        print("No local models found in current directory")
requirements.txt
ADDED
@@ -0,0 +1,4 @@
torch>=2.0.0
transformers>=4.30.0
accelerate>=0.20.0
safetensors>=0.3.0
run_eval.py
ADDED
@@ -0,0 +1,325 @@
#!/usr/bin/env python3
"""
Model Evaluation Runner
Uses inference.py & eval_prompts.json to run evaluation on all local models
present in this folder, and saves the responses to a concise markdown file
"""

import os
import json
import time
from datetime import datetime
from typing import List, Dict, Any
from inference import ModelInference, get_local_models


def load_eval_prompts(prompts_file: str = "eval_prompts.json") -> Dict[str, Any]:
    """Load evaluation prompts from JSON file"""
    try:
        with open(prompts_file, 'r', encoding='utf-8') as f:
            return json.load(f)
    except FileNotFoundError:
        print(f"Error: {prompts_file} not found")
        return {}
    except json.JSONDecodeError as e:
        print(f"Error parsing {prompts_file}: {e}")
        return {}


def run_single_evaluation(model_path: str, prompts_data: Dict[str, Any]) -> Dict[str, Any]:
    """Run evaluation on a single model"""
    model_name = os.path.basename(model_path)
    print(f"\n{'='*60}")
    print(f"Evaluating: {model_name}")
    print(f"{'='*60}")

    # Initialize model
    inference = ModelInference(model_path)
    if not inference.load_model():
        return {
            "model_name": model_name,
            "model_path": model_path,
            "status": "failed_to_load",
            "responses": {},
            "evaluation_time": 0
        }

    # Get settings
    settings = prompts_data.get("evaluation_settings", {})
    max_length = settings.get("max_length", 512)
    temperature = settings.get("temperature", 0.7)

    # Run evaluation
    start_time = time.time()
    responses = {}

    try:
        for prompt_data in prompts_data["evaluation_prompts"]:
            prompt_id = prompt_data["id"]
            prompt_text = prompt_data["prompt"]
            category = prompt_data["category"]

            print(f"  Running prompt: {prompt_id} ({category})")

            # Generate response
            response = inference.generate_text(
                prompt_text,
                max_length=max_length,
                temperature=temperature
            )

            responses[prompt_id] = {
                "prompt": prompt_text,
                "response": response,
                "category": category,
                "expected_type": prompt_data.get("expected_type", ""),
            }

            # Small delay to prevent overheating
            time.sleep(0.5)

    except Exception as e:
        print(f"Error during evaluation: {e}")
        responses["error"] = str(e)

    finally:
        # Clean up model
        inference.unload_model()

    evaluation_time = time.time() - start_time

    return {
        "model_name": model_name,
        "model_path": model_path,
        "status": "completed",
        "responses": responses,
        "evaluation_time": evaluation_time,
        "settings": {
            "max_length": max_length,
            "temperature": temperature
        }
    }


def generate_markdown_report(results: List[Dict[str, Any]], output_file: str = "evaluation_results.md", is_incremental: bool = False):
    """Generate a markdown report from evaluation results"""

    # Get timestamp
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    # Start markdown content
    md_content = f"""# Model Evaluation Results

**Generated:** {timestamp}
**Models Evaluated:** {len(results)}
**Status:** {'In Progress' if is_incremental else 'Complete'}

## Summary

| Model | Status | Prompts Completed | Evaluation Time |
|-------|--------|-------------------|-----------------|
"""

    # Add summary table
    for result in results:
        model_name = result["model_name"]
        status = result["status"]
        num_prompts = len([r for r in result["responses"].keys() if r != "error"])
        eval_time = f"{result['evaluation_time']:.1f}s"

        md_content += f"| {model_name} | {status} | {num_prompts} | {eval_time} |\n"

    # Add detailed results
    md_content += "\n## Detailed Results\n\n"

    for result in results:
        model_name = result["model_name"]
        md_content += f"### {model_name}\n\n"

        if result["status"] == "failed_to_load":
            md_content += "❌ **Failed to load model**\n\n"
            continue

        # Add model info
        settings = result.get("settings", {})
        md_content += f"- **Evaluation Time:** {result['evaluation_time']:.1f} seconds\n"
        md_content += f"- **Max Length:** {settings.get('max_length', 'N/A')}\n"
        md_content += f"- **Temperature:** {settings.get('temperature', 'N/A')}\n\n"

        # Add responses by category
        responses = result["responses"]
        categories = {}

        for prompt_id, data in responses.items():
            if prompt_id == "error":
                continue
            category = data.get("category", "other")
            if category not in categories:
                categories[category] = []
            categories[category].append((prompt_id, data))

        # Display by category
        for category, prompts in categories.items():
            md_content += f"#### {category.title()}\n\n"

            for prompt_id, data in prompts:
                md_content += f"**Prompt ({prompt_id}):** {data['prompt']}\n\n"
                md_content += f"**Response:**\n```\n{data['response']}\n```\n\n"
                md_content += "---\n\n"

        # Add error if present
        if "error" in responses:
            md_content += f"⚠️ **Error:** {responses['error']}\n\n"

    # Add progress indicator if incremental
    if is_incremental:
        md_content += f"\n---\n\n**Last Updated:** {timestamp}\n"
        md_content += f"**Progress:** {len(results)} models completed\n"

    # Write to file
    try:
        with open(output_file, 'w', encoding='utf-8') as f:
            f.write(md_content)
        if is_incremental:
            print(f"📊 Progress saved to: {output_file}")
        else:
            print(f"\n✅ Evaluation report saved to: {output_file}")
    except Exception as e:
        print(f"❌ Failed to save report: {e}")


def save_incremental_results(results: List[Dict[str, Any]], output_file: str = "evaluation_results.md"):
    """Save results incrementally as evaluation progresses"""
    generate_markdown_report(results, output_file, is_incremental=True)


def load_existing_results(output_file: str = "evaluation_results.md") -> List[str]:
    """Load list of already evaluated models from existing results file"""
    if not os.path.exists(output_file):
        return []

    try:
        with open(output_file, 'r', encoding='utf-8') as f:
            content = f.read()

        # Extract model names from the first column of the summary table
        import re
        model_names = re.findall(r'^\|\s*([^|]+?)\s*\|', content, re.MULTILINE)
        # Drop the header row, the separator row, and empty cells
        model_names = [name for name in model_names
                       if name.strip() and name != 'Model' and set(name) != {'-'}]
        return model_names
    except Exception as e:
        print(f"⚠️ Could not load existing results: {e}")
        return []


def main():
    """Main evaluation function"""
    print("🚀 Starting Model Evaluation")
    print(f"Working directory: {os.getcwd()}")

    # Load prompts
    print("\n📋 Loading evaluation prompts...")
    prompts_data = load_eval_prompts()
    if not prompts_data:
        print("❌ Failed to load prompts. Exiting.")
        return

    num_prompts = len(prompts_data.get("evaluation_prompts", []))
    print(f"✅ Loaded {num_prompts} evaluation prompts")

    # Find models
    print("\n🔍 Scanning for local models...")
    models = get_local_models()

    if not models:
        print("❌ No local models found. Make sure model directories are present.")
        return

    # Check for existing results to resume
    output_file = "evaluation_results.md"
    existing_models = load_existing_results(output_file)

    if existing_models:
        print(f"📄 Found existing results with {len(existing_models)} models")
        print("🔄 Resuming evaluation from where it left off...")

        # Filter out already evaluated models
        models = [model for model in models if os.path.basename(model) not in existing_models]
        print(f"📊 {len(existing_models)} models already evaluated, {len(models)} remaining")
    else:
        print(f"✅ Found {len(models)} models to evaluate")

    if not models:
        print("🎉 All models already evaluated!")
        return

    print("📋 Models to evaluate:")
    for i, model in enumerate(models, 1):
        print(f"  {i}. {os.path.basename(model)}")

    # Load existing results if resuming
    results = []
    if existing_models:
        # For now, we'll start fresh but skip already evaluated models
        # In a more advanced version, we could parse the existing markdown to load results
        print("📝 Note: Starting fresh evaluation (existing results will be overwritten)")

    # Run evaluations
    print(f"\n🧪 Starting evaluation on {len(models)} models...")

    for i, model_path in enumerate(models, 1):
        print(f"\n[{i}/{len(models)}] Processing: {os.path.basename(model_path)}")

        try:
            result = run_single_evaluation(model_path, prompts_data)
            results.append(result)

            # Show progress
            if result["status"] == "completed":
                num_responses = len([r for r in result["responses"].keys() if r != "error"])
                print(f"✅ Completed {num_responses}/{num_prompts} prompts in {result['evaluation_time']:.1f}s")
            else:
                print("❌ Failed to evaluate model")

            # Save progress incrementally after each model
            save_incremental_results(results)

        except KeyboardInterrupt:
            print("\n⚠️ Evaluation interrupted by user")
            print("💾 Saving current progress...")
            save_incremental_results(results)
            break
        except Exception as e:
            print(f"❌ Unexpected error: {e}")
            # Add error result and continue
            error_result = {
                "model_name": os.path.basename(model_path),
                "model_path": model_path,
                "status": "error",
                "responses": {"error": str(e)},
                "evaluation_time": 0
            }
            results.append(error_result)
            save_incremental_results(results)
            # Continue with next model

    # Generate report
    if results:
        print("\n📊 Generating evaluation report...")
        generate_markdown_report(results)

        # Summary stats
        successful = len([r for r in results if r["status"] == "completed"])
        total_time = sum(r["evaluation_time"] for r in results)

        print("\n🎉 Evaluation Complete!")
        print(f"  - Models evaluated: {successful}/{len(results)}")
        print(f"  - Total time: {total_time:.1f} seconds")
        print(f"  - Average time per model: {total_time/len(results):.1f} seconds")
    else:
        print("❌ No results to report")


if __name__ == "__main__":
    main()