---
title: LLM Benchmark Model vs Judge
emoji: 🎯
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "4.28.3"
app_file: app.py
pinned: false
---
# LLM-Bench1: Language Model Benchmarking Suite
A comprehensive benchmarking tool for comparing LLMs served by Ollama, covering several aspects of model performance including accuracy, speed, and reasoning capability.
## Features
- πŸ”„ Compare different Ollama models head-to-head
- πŸ“Š Comprehensive benchmarking across multiple categories:
- Logical Reasoning
- Code Generation
- Mathematical Problem Solving
- Context Understanding
- Performance Metrics
- πŸ“ˆ Interactive visualization of results
- πŸ’Ύ Automatic saving of benchmark results
- 🎯 Customizable number of test iterations
- πŸ€– Uses a separate judge model for unbiased evaluation
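The judge-model idea above can be sketched as follows. The prompt wording and the `extract_score` helper are illustrative assumptions, not the app's actual implementation; in the real app the prompt would be sent to the judge model through the `ollama` client.

```python
import re
from typing import Optional

def build_judge_prompt(question: str, answer: str) -> str:
    """Ask a judge model to grade an answer on a 0-10 scale.
    (Hypothetical prompt format -- the app may phrase this differently.)"""
    return (
        "You are an impartial judge. Rate the following answer from 0 to 10.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a single line: SCORE: <number>"
    )

def extract_score(judge_reply: str) -> Optional[float]:
    """Pull the numeric score out of the judge's reply, if present."""
    match = re.search(r"SCORE:\s*([0-9]+(?:\.[0-9]+)?)", judge_reply)
    return float(match.group(1)) if match else None

# Demonstrate only the parsing step on a canned reply:
print(extract_score("SCORE: 8.5"))  # → 8.5
```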
## Requirements
- Python 3.8+
- Ollama installed and running
- Required Python packages (installed via `requirements.txt`):
- gradio
- ollama
- pandas
- plotly
- python-dotenv
- tqdm
- rich
## Installation
1. Clone the repository:
```bash
git clone https://github.com/yourusername/LLM-Bench1.git
cd LLM-Bench1
```
2. Create and activate a virtual environment:
```bash
python -m venv .venv
source .venv/bin/activate # On Windows, use: .venv\Scripts\activate
```
3. Install dependencies:
```bash
pip install -r requirements.txt
```
## Usage
1. Ensure Ollama is running and you have the desired models pulled:
```bash
ollama pull codellama
ollama pull llama2
# Pull any other models you want to benchmark
```
2. Run the application:
```bash
python app.py
```
3. Open the provided URL in your browser to access the Gradio interface.
4. Select:
- The model to benchmark
- The judge model (can be the same or different)
- Number of test iterations
5. Click "Run Benchmark" and wait for the results.
## Benchmark Categories
### 1. Logical Reasoning
Tests the model's ability to solve complex logical problems and puzzles, evaluating step-by-step reasoning.
### 2. Code Generation
Evaluates the model's capability to:
- Write functional code
- Implement algorithms
- Handle edge cases
- Provide proper documentation
### 3. Mathematical Problem Solving
Tests mathematical reasoning across:
- Calculus
- Probability
- Proof writing
- Problem-solving strategies
### 4. Context Understanding
Assesses the model's ability to:
- Comprehend complex passages
- Analyze code snippets
- Evaluate business scenarios
- Provide structured analysis
### 5. Performance Metrics
Measures:
- Response time
- Tokens per second
- Consistency across iterations
- Resource efficiency
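A tokens-per-second figure like the one above can be derived from the metadata Ollama returns with each generation (`eval_count`, the number of output tokens, and `eval_duration`, the generation time in nanoseconds). This is a minimal sketch of that calculation, not necessarily how the app computes it:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Throughput from Ollama response metadata.

    `eval_count` is the number of generated tokens; `eval_duration`
    is the time spent generating them, in nanoseconds."""
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)

# Example: 120 tokens generated in 2 seconds
print(tokens_per_second(120, 2_000_000_000))  # → 60.0
```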
## Results
Benchmark results are automatically saved in the `benchmark_results` directory with the following naming format:
```
benchmark_results/[model_name]_vs_[judge_model]_[timestamp].json
```
Each result file contains:
- Model details
- Timestamp
- Detailed scores for each category
- Performance metrics
- Raw responses and evaluations
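The naming convention above could be produced and consumed with something like the following sketch. The timestamp format is an assumption; the actual app may use a different one.

```python
import json
from datetime import datetime
from pathlib import Path

def result_path(model: str, judge: str,
                results_dir: str = "benchmark_results") -> Path:
    """Build a result filename following the documented
    [model_name]_vs_[judge_model]_[timestamp].json convention.
    (Timestamp format is a hypothetical choice.)"""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return Path(results_dir) / f"{model}_vs_{judge}_{timestamp}.json"

def load_results(path: Path) -> dict:
    """Load a saved benchmark result for further analysis."""
    with open(path) as f:
        return json.load(f)

print(result_path("codellama", "llama2").name)
```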
## Contributing
Feel free to open issues or submit pull requests with improvements to:
- Test cases
- Evaluation metrics
- UI/UX enhancements
- Documentation
## License
MIT License