---
title: LLM Benchmark Model vs Judge
emoji: 🎯
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "4.28.3"
app_file: app.py
pinned: false
---
# LLM-Bench1: Language Model Benchmarking Suite

A comprehensive benchmarking tool for comparing LLMs served through Ollama, covering multiple aspects of model performance including accuracy, speed, and reasoning ability.
## Features

- 🔄 Compare different Ollama models head-to-head
- 📊 Comprehensive benchmarking across multiple categories:
  - Logical Reasoning
  - Code Generation
  - Mathematical Problem Solving
  - Context Understanding
  - Performance Metrics
- 📈 Interactive visualization of results
- 💾 Automatic saving of benchmark results
- 🎯 Customizable number of test iterations
- 🤖 Uses a separate judge model for unbiased evaluation (see the sketch below)
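
The judge model is queried separately from the model under test. As a rough illustration, a judge call might look like the sketch below; the prompt wording, scoring scale, and default model name are assumptions for this sketch, not the app's actual internals.

```python
import ollama

# Illustrative judge call: ask a separate judge model to score a candidate answer.
# Prompt, scale, and default model name are assumptions, not the app's real logic.
JUDGE_PROMPT = """You are an impartial judge. Score the answer below from 1 to 10
for correctness and clarity, and reply with the score only.

Question: {question}
Answer: {answer}"""

def judge_answer(question: str, answer: str, judge_model: str = "llama2") -> str:
    response = ollama.chat(
        model=judge_model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    return response["message"]["content"]
```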
## Requirements

- Python 3.8+
- Ollama installed and running
- Required Python packages (installed via `requirements.txt`; see the example below):
  - gradio
  - ollama
  - pandas
  - plotly
  - python-dotenv
  - tqdm
  - rich
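
For reference, a `requirements.txt` covering the packages above could be as simple as the following (version pins are not specified here):

```
gradio
ollama
pandas
plotly
python-dotenv
tqdm
rich
```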
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/LLM-Bench1.git
   cd LLM-Bench1
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows, use: .venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
## Usage

1. Ensure Ollama is running and you have the desired models pulled:

   ```bash
   ollama pull codellama
   ollama pull llama2
   # Pull any other models you want to benchmark
   ```

2. Run the application:

   ```bash
   python app.py
   ```

3. Open the provided URL in your browser to access the Gradio interface.
4. Select:
   - The model to benchmark
   - The judge model (can be the same or different)
   - The number of test iterations
5. Click "Run Benchmark" and wait for the results.
## Benchmark Categories

### 1. Logical Reasoning

Tests the model's ability to solve complex logical problems and puzzles, evaluating step-by-step reasoning.

### 2. Code Generation

Evaluates the model's capability to:

- Write functional code
- Implement algorithms
- Handle edge cases
- Provide proper documentation

### 3. Mathematical Problem Solving

Tests mathematical reasoning across:

- Calculus
- Probability
- Proof writing
- Problem-solving strategies

### 4. Context Understanding

Assesses the model's ability to:

- Comprehend complex passages
- Analyze code snippets
- Evaluate business scenarios
- Provide structured analysis

### 5. Performance Metrics

Measures:

- Response time
- Tokens per second (see the timing sketch below)
- Consistency across iterations
- Resource efficiency
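
As a rough illustration of how response time and tokens per second can be derived, here is a minimal sketch. It assumes the Ollama response exposes `eval_count` and `eval_duration` (in nanoseconds) as in the Ollama API metadata; if those fields are missing, it falls back to wall-clock timing.

```python
import time
import ollama

def measure_performance(model: str, prompt: str) -> dict:
    """Sketch only: time a single generation and estimate tokens per second."""
    start = time.perf_counter()
    response = ollama.generate(model=model, prompt=prompt)
    elapsed = time.perf_counter() - start  # wall-clock response time in seconds

    # eval_count / eval_duration are Ollama's generated-token count and time (ns).
    eval_count = response.get("eval_count") or 0
    eval_duration_s = (response.get("eval_duration") or 0) / 1e9
    tokens_per_second = (
        eval_count / eval_duration_s if eval_duration_s else eval_count / elapsed
    )

    return {"response_time_s": elapsed, "tokens_per_second": tokens_per_second}
```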
## Results

Benchmark results are automatically saved in the `benchmark_results` directory with the following naming format:

```
benchmark_results/[model_name]_vs_[judge_model]_[timestamp].json
```

Each result file contains (see the loading sketch below):

- Model details
- Timestamp
- Detailed scores for each category
- Performance metrics
- Raw responses and evaluations
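
To browse saved runs programmatically, a minimal loading sketch follows; the exact JSON schema is defined by the app, so this only collects the raw dictionaries for inspection.

```python
import json
from pathlib import Path

# Collect every saved benchmark run for ad-hoc inspection.
results = []
for path in sorted(Path("benchmark_results").glob("*.json")):
    with path.open() as f:
        results.append({"file": path.name, "data": json.load(f)})

print(f"Loaded {len(results)} benchmark result file(s)")
```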
## Contributing

Feel free to open issues or submit pull requests with improvements to:

- Test cases
- Evaluation metrics
- UI/UX enhancements
- Documentation

## License

MIT License