---
title: LLM Benchmark Model vs Judge
emoji: 🎯
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.28.3
app_file: app.py
pinned: false
---

# LLM-Bench1: Language Model Benchmarking Suite

A comprehensive benchmarking tool for comparing LLMs served through Ollama, focused on several aspects of model performance, including accuracy, speed, and reasoning capabilities.

## Features

- 🔄 Compare different Ollama models head-to-head
- 📊 Comprehensive benchmarking across multiple categories:
  - Logical Reasoning
  - Code Generation
  - Mathematical Problem Solving
  - Context Understanding
  - Performance Metrics
- 📈 Interactive visualization of results
- 💾 Automatic saving of benchmark results
- 🎯 Customizable number of test iterations
- 🤖 Uses a separate judge model for unbiased evaluation (see the sketch below)
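
The judge-model idea is that the model under test answers each task, and a second model grades that answer against a rubric instead of relying on exact-match checks. A minimal sketch of that loop using the `ollama` Python package; the function name and rubric text here are illustrative, not the actual prompts used in `app.py`:

```python
import ollama

def judge_answer(judge_model: str, question: str, answer: str) -> str:
    """Ask a separate judge model to score a candidate answer (illustrative rubric)."""
    rubric = (
        "You are an impartial judge. Score the following answer from 0 to 10 "
        "and briefly justify the score.\n\n"
        f"Question: {question}\n\nAnswer: {answer}"
    )
    reply = ollama.chat(
        model=judge_model,
        messages=[{"role": "user", "content": rubric}],
    )
    return reply["message"]["content"]

# Example: benchmark codellama and let llama2 act as the judge.
question = "Write a Python function that reverses a string."
candidate = ollama.chat(model="codellama", messages=[{"role": "user", "content": question}])
print(judge_answer("llama2", question, candidate["message"]["content"]))
```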

## Requirements

- Python 3.8+
- Ollama installed and running
- Required Python packages (installed automatically):
  - gradio
  - ollama
  - pandas
  - plotly
  - python-dotenv
  - tqdm
  - rich

## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/LLM-Bench1.git
   cd LLM-Bench1
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows, use: .venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## Usage

1. Ensure Ollama is running and you have the desired models pulled:

   ```bash
   ollama pull codellama
   ollama pull llama2
   # Pull any other models you want to benchmark
   ```

2. Run the application:

   ```bash
   python app.py
   ```

3. Open the provided URL in your browser to access the Gradio interface.

4. Select:
   - The model to benchmark
   - The judge model (can be the same or different)
   - The number of test iterations

5. Click "Run Benchmark" and wait for the results.

## Benchmark Categories

### 1. Logical Reasoning

Tests the model's ability to solve complex logical problems and puzzles, evaluating step-by-step reasoning.

### 2. Code Generation

Evaluates the model's capability to:

- Write functional code
- Implement algorithms
- Handle edge cases
- Provide proper documentation

### 3. Mathematical Problem Solving

Tests mathematical reasoning across:

- Calculus
- Probability
- Proof writing
- Problem-solving strategies

### 4. Context Understanding

Assesses the model's ability to:

- Comprehend complex passages
- Analyze code snippets
- Evaluate business scenarios
- Provide structured analysis

### 5. Performance Metrics

Measures:

- Response time
- Tokens per second
- Consistency across iterations
- Resource efficiency
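
For the throughput numbers, Ollama reports token counts and timings alongside each generation, so tokens per second can be derived directly from the response. A minimal sketch with the `ollama` Python package (the field names come from Ollama's API; the exact computation in `app.py` may differ):

```python
import ollama

# Run a single generation and derive timing metrics from Ollama's own counters.
response = ollama.generate(model="llama2", prompt="Explain recursion in one paragraph.")

eval_count = response["eval_count"]              # tokens generated
eval_seconds = response["eval_duration"] / 1e9   # durations are reported in nanoseconds
total_seconds = response["total_duration"] / 1e9

print(f"Response time:     {total_seconds:.2f} s")
print(f"Tokens per second: {eval_count / eval_seconds:.1f}")
```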

## Results

Benchmark results are automatically saved in the `benchmark_results` directory with the following naming format:

```
benchmark_results/[model_name]_vs_[judge_model]_[timestamp].json
```

Each result file contains:

- Model details
- Timestamp
- Detailed scores for each category
- Performance metrics
- Raw responses and evaluations
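
Because the results are plain JSON, they can also be inspected outside the Gradio UI. A minimal sketch for loading the most recent run; the `scores` key below is an assumption for illustration, since the real schema is defined in `app.py`:

```python
import json
from pathlib import Path

# Pick the most recently written result file.
results_dir = Path("benchmark_results")
latest = max(results_dir.glob("*.json"), key=lambda p: p.stat().st_mtime)

with latest.open() as f:
    result = json.load(f)

print(f"Loaded {latest.name}")
# Adjust the key below to match the actual schema produced by app.py.
for category, score in result.get("scores", {}).items():
    print(f"{category}: {score}")
```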

## Contributing

Feel free to open issues or submit pull requests with improvements to:

- Test cases
- Evaluation metrics
- UI/UX enhancements
- Documentation

## License

MIT License