---
title: LLM Benchmark Model vs Judge
emoji: 🎯
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "4.28.3"
app_file: app.py
pinned: false
---

# LLM-Bench1: Language Model Benchmarking Suite

A comprehensive benchmarking tool for comparing language models served through Ollama, with a focus on multiple aspects of model performance including accuracy, speed, and reasoning capabilities.

## Features

- 🔄 Compare different Ollama models head-to-head
- 📊 Comprehensive benchmarking across multiple categories:
  - Logical Reasoning
  - Code Generation
  - Mathematical Problem Solving
  - Context Understanding
  - Performance Metrics
- 📈 Interactive visualization of results
- 💾 Automatic saving of benchmark results
- 🎯 Customizable number of test iterations
- 🤖 Uses a separate judge model for unbiased evaluation

## Requirements

- Python 3.8+
- Ollama installed and running
- Required Python packages (installed via `requirements.txt`):
  - gradio
  - ollama
  - pandas
  - plotly
  - python-dotenv
  - tqdm
  - rich

## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/LLM-Bench1.git
   cd LLM-Bench1
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows, use: .venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## Usage

1. Ensure Ollama is running and you have pulled the models you want to test:

   ```bash
   ollama pull codellama
   ollama pull llama2
   # Pull any other models you want to benchmark
   ```

2. Run the application:

   ```bash
   python app.py
   ```

3. Open the provided URL in your browser to access the Gradio interface.

4. Select:
   - The model to benchmark
   - The judge model (can be the same or a different model)
   - The number of test iterations

5. Click "Run Benchmark" and wait for the results.

## Benchmark Categories

### 1. Logical Reasoning

Tests the model's ability to solve complex logical problems and puzzles, evaluating step-by-step reasoning.

### 2. Code Generation

Evaluates the model's capability to:

- Write functional code
- Implement algorithms
- Handle edge cases
- Provide proper documentation

### 3. Mathematical Problem Solving

Tests mathematical reasoning across:

- Calculus
- Probability
- Proof writing
- Problem-solving strategies

### 4. Context Understanding

Assesses the model's ability to:

- Comprehend complex passages
- Analyze code snippets
- Evaluate business scenarios
- Provide structured analysis

### 5. Performance Metrics

Measures:

- Response time
- Tokens per second
- Consistency across iterations
- Resource efficiency

(A minimal sketch of timing a single model call and scoring it with a judge model appears at the end of this README.)

## Results

Benchmark results are automatically saved in the `benchmark_results` directory with the following naming format:

```
benchmark_results/[model_name]_vs_[judge_model]_[timestamp].json
```

Each result file contains:

- Model details
- Timestamp
- Detailed scores for each category
- Performance metrics
- Raw responses and evaluations

(See the loading example at the end of this README for a quick way to inspect a saved result file.)

## Contributing

Feel free to open issues or submit pull requests with improvements to:

- Test cases
- Evaluation metrics
- UI/UX enhancements
- Documentation

## License

MIT License
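
## Example: Timing and Judging a Single Response

The snippet below is a minimal sketch of the model-vs-judge pattern this suite is built around, written against the `ollama` Python package. The model names, prompt, and score-parsing logic are illustrative placeholders rather than the suite's actual test cases or evaluation code, and the `eval_count`/`eval_duration` fields are Ollama's reported generation statistics (their availability can vary by Ollama version).

```python
import re
import time

import ollama  # requires the Ollama daemon to be running locally

MODEL = "llama2"      # model being benchmarked (placeholder)
JUDGE = "codellama"   # judge model (placeholder)
PROMPT = "Explain, step by step, why the sum of two odd numbers is even."

# 1. Time a single generation with the benchmarked model.
start = time.perf_counter()
resp = ollama.generate(model=MODEL, prompt=PROMPT)
wall_clock = time.perf_counter() - start

answer = resp["response"]
tokens = resp["eval_count"]                 # tokens generated
gen_seconds = resp["eval_duration"] / 1e9   # reported in nanoseconds
print(f"{tokens} tokens in {gen_seconds:.2f}s "
      f"({tokens / gen_seconds:.1f} tok/s, {wall_clock:.2f}s wall clock)")

# 2. Ask the judge model to score the answer (illustrative prompt and parsing).
judge_prompt = (
    "Rate the following answer from 1 to 10 for correctness and clarity. "
    "Reply with only the number.\n\n"
    f"Question: {PROMPT}\n\nAnswer: {answer}"
)
verdict = ollama.generate(model=JUDGE, prompt=judge_prompt)["response"]
match = re.search(r"\d+", verdict)
score = int(match.group()) if match else None
print(f"Judge score: {score}")
```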
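
## Example: Inspecting a Saved Result File

A short sketch for loading the most recent file from `benchmark_results` for quick inspection. The `scores` and `performance` keys are assumptions about the JSON layout; adjust them to match the files actually produced by `app.py`.

```python
import json
from pathlib import Path

# Pick the most recently written result file.
results_dir = Path("benchmark_results")
latest = max(results_dir.glob("*.json"), key=lambda p: p.stat().st_mtime)

with latest.open() as f:
    result = json.load(f)

print(f"Loaded {latest.name}")
# NOTE: the key names below are assumed, not taken from app.py.
for category, score in result.get("scores", {}).items():
    print(f"  {category}: {score}")
print("Performance:", result.get("performance", {}))
```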