---
title: LLM Benchmark Model vs Judge
emoji: 🎯
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.28.3
app_file: app.py
pinned: false
---
# LLM-Bench1: Language Model Benchmarking Suite
A benchmarking tool for comparing LLMs served through Ollama, measuring model performance across accuracy, speed, and reasoning.
## Features
- Compare different Ollama models head-to-head
- Comprehensive benchmarking across multiple categories:
  - Logical Reasoning
  - Code Generation
  - Mathematical Problem Solving
  - Context Understanding
  - Performance Metrics
- Interactive visualization of results
- Automatic saving of benchmark results
- Customizable number of test iterations
- Uses a separate judge model for unbiased evaluation (see the sketch after this list)
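The model-vs-judge flow can be approximated with the `ollama` Python package. Below is a minimal sketch, not the code from `app.py`; the prompt wording, the 1-10 scoring scale, and the `evaluate_with_judge` helper name are illustrative assumptions.

```python
import ollama  # official Ollama Python client


def evaluate_with_judge(candidate_model: str, judge_model: str, task_prompt: str) -> str:
    """Hypothetical sketch of the model-vs-judge flow; prompt wording and the
    1-10 scale are illustrative, not taken from app.py."""
    # 1. Ask the candidate model to answer the benchmark task.
    answer = ollama.chat(
        model=candidate_model,
        messages=[{"role": "user", "content": task_prompt}],
    )["message"]["content"]

    # 2. Ask the judge model to score that answer.
    judge_prompt = (
        f"Task:\n{task_prompt}\n\n"
        f"Candidate answer:\n{answer}\n\n"
        "Rate the answer from 1 to 10 and briefly justify the score."
    )
    return ollama.chat(
        model=judge_model,
        messages=[{"role": "user", "content": judge_prompt}],
    )["message"]["content"]
```

Calling `evaluate_with_judge("llama2", "codellama", prompt)` would return the judge's free-text verdict, which the app can then parse into a numeric score.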
## Requirements
- Python 3.8+
- Ollama installed and running
- Required Python packages (installed from requirements.txt):
  - gradio
  - ollama
  - pandas
  - plotly
  - python-dotenv
  - tqdm
  - rich
## Installation

1. Clone the repository:

```bash
git clone https://github.com/yourusername/LLM-Bench1.git
cd LLM-Bench1
```

2. Create and activate a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows, use: .venv\Scripts\activate
```

3. Install dependencies:

```bash
pip install -r requirements.txt
```
## Usage

1. Ensure Ollama is running and you have the desired models pulled:

```bash
ollama pull codellama
ollama pull llama2
# Pull any other models you want to benchmark
```

2. Run the application:

```bash
python app.py
```

3. Open the provided URL in your browser to access the Gradio interface.

4. Select:
   - The model to benchmark
   - The judge model (can be the same or different)
   - The number of test iterations

5. Click "Run Benchmark" and wait for the results. (A rough sketch of how such an interface might be wired up follows.)
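For orientation, an interface like the one described above can be assembled from standard Gradio components. This is a simplified sketch under assumptions about the layout; the `MODELS` list, the `run_benchmark` placeholder, and the component labels are hypothetical, and the actual `app.py` may be structured differently.

```python
import gradio as gr

# Illustrative choices; replace with whatever models you have pulled locally.
MODELS = ["llama2", "codellama"]


def run_benchmark(model: str, judge: str, iterations: int) -> str:
    # Placeholder callback: the real app runs the full benchmark suite here.
    return f"Would benchmark {model}, judged by {judge}, over {int(iterations)} iterations."


with gr.Blocks(title="LLM Benchmark Model vs Judge") as demo:
    model_dd = gr.Dropdown(MODELS, label="Model to benchmark")
    judge_dd = gr.Dropdown(MODELS, label="Judge model")
    iters = gr.Slider(1, 10, value=3, step=1, label="Number of test iterations")
    run_btn = gr.Button("Run Benchmark")
    output = gr.Textbox(label="Results")
    run_btn.click(run_benchmark, inputs=[model_dd, judge_dd, iters], outputs=output)

if __name__ == "__main__":
    demo.launch()
```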
## Benchmark Categories
### 1. Logical Reasoning
Tests the model's ability to solve complex logical problems and puzzles, evaluating step-by-step reasoning.
### 2. Code Generation
Evaluates the model's capability to:
- Write functional code
- Implement algorithms
- Handle edge cases
- Provide proper documentation
### 3. Mathematical Problem Solving
Tests mathematical reasoning across:
- Calculus
- Probability
- Proof writing
- Problem-solving strategies
### 4. Context Understanding
Assesses the model's ability to:
- Comprehend complex passages
- Analyze code snippets
- Evaluate business scenarios
- Provide structured analysis
### 5. Performance Metrics
Measures (see the sketch after this list):
- Response time
- Tokens per second
- Consistency across iterations
- Resource efficiency
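As an illustration of where these numbers can come from, a non-streaming Ollama chat response reports the number of generated tokens (`eval_count`) and the generation time in nanoseconds (`eval_duration`), from which tokens per second can be derived. A minimal sketch follows; it is not necessarily how `app.py` computes its metrics.

```python
import time

import ollama


def timed_generation(model: str, prompt: str) -> dict:
    """Sketch: wall-clock response time and tokens/sec for a single prompt,
    assuming the non-streaming response exposes Ollama's standard
    eval_count (tokens generated) and eval_duration (nanoseconds) fields."""
    start = time.perf_counter()
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    elapsed = time.perf_counter() - start

    tokens = response["eval_count"]
    gen_seconds = max(response["eval_duration"], 1) / 1e9  # ns -> s, avoid divide-by-zero
    return {
        "response_time_s": round(elapsed, 3),
        "tokens_per_second": round(tokens / gen_seconds, 2),
    }
```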
## Results
Benchmark results are automatically saved in the `benchmark_results` directory with the following naming format:

```
benchmark_results/[model_name]_vs_[judge_model]_[timestamp].json
```
Each result file contains (see the sketch after this list):
- Model details
- Timestamp
- Detailed scores for each category
- Performance metrics
- Raw responses and evaluations
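For reference, saving a run under the naming scheme above could look like the following sketch; the field names mirror the list above, but the exact JSON schema written by `app.py` may differ.

```python
import json
from datetime import datetime
from pathlib import Path


def save_results(model_name: str, judge_model: str, results: dict) -> Path:
    """Sketch: write one benchmark run to benchmark_results/ using the
    [model_name]_vs_[judge_model]_[timestamp].json naming scheme."""
    out_dir = Path("benchmark_results")
    out_dir.mkdir(exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    out_file = out_dir / f"{model_name}_vs_{judge_model}_{timestamp}.json"
    payload = {
        "model": model_name,
        "judge_model": judge_model,
        "timestamp": timestamp,
        "scores": results.get("scores", {}),             # per-category scores
        "performance": results.get("performance", {}),   # speed/throughput metrics
        "raw": results.get("raw", []),                    # raw responses and evaluations
    }
    out_file.write_text(json.dumps(payload, indent=2))
    return out_file
```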
## Contributing
Feel free to open issues or submit pull requests with improvements to:
- Test cases
- Evaluation metrics
- UI/UX enhancements
- Documentation
## License
MIT License