---
title: LLM Benchmark Model vs Judge
emoji: 🎯
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.28.3
app_file: app.py
pinned: false
---

# LLM-Bench1: Language Model Benchmarking Suite

A comprehensive benchmarking tool for comparing LLMs served through Ollama, focused on several aspects of model performance, including accuracy, speed, and reasoning capabilities.

## Features

- 🔄 Compare different Ollama models head-to-head
- 📊 Comprehensive benchmarking across multiple categories:
  - Logical Reasoning
  - Code Generation
  - Mathematical Problem Solving
  - Context Understanding
  - Performance Metrics
- 📈 Interactive visualization of results
- 💾 Automatic saving of benchmark results
- 🎯 Customizable number of test iterations
- 🤖 Uses a separate judge model for unbiased evaluation (see the sketch below)
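
The judge-model idea is that the model under test answers each task, and a second model grades that answer against a rubric instead of relying on exact-match checks. A minimal sketch of that loop using the `ollama` Python package; the function name and rubric text here are illustrative, not the actual prompts used in `app.py`:

```python
import ollama

def judge_answer(judge_model: str, question: str, answer: str) -> str:
    """Ask a separate judge model to score a candidate answer (illustrative rubric)."""
    rubric = (
        "You are an impartial judge. Score the following answer from 0 to 10 "
        "and briefly justify the score.\n\n"
        f"Question: {question}\n\nAnswer: {answer}"
    )
    reply = ollama.chat(
        model=judge_model,
        messages=[{"role": "user", "content": rubric}],
    )
    return reply["message"]["content"]

# Example: benchmark codellama and let llama2 act as the judge.
question = "Write a Python function that reverses a string."
candidate = ollama.chat(model="codellama", messages=[{"role": "user", "content": question}])
print(judge_answer("llama2", question, candidate["message"]["content"]))
```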

## Requirements

- Python 3.8+
- Ollama installed and running
- Required Python packages (installed automatically):
  - gradio
  - ollama
  - pandas
  - plotly
  - python-dotenv
  - tqdm
  - rich

## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/LLM-Bench1.git
   cd LLM-Bench1
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows, use: .venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## Usage

1. Ensure Ollama is running and you have the desired models pulled:

   ```bash
   ollama pull codellama
   ollama pull llama2
   # Pull any other models you want to benchmark
   ```

2. Run the application:

   ```bash
   python app.py
   ```

3. Open the provided URL in your browser to access the Gradio interface.

4. Select:
   - The model to benchmark
   - The judge model (can be the same or different)
   - The number of test iterations

5. Click "Run Benchmark" and wait for the results.

## Benchmark Categories

### 1. Logical Reasoning

Tests the model's ability to solve complex logical problems and puzzles, evaluating step-by-step reasoning.

### 2. Code Generation

Evaluates the model's capability to:

- Write functional code
- Implement algorithms
- Handle edge cases
- Provide proper documentation

### 3. Mathematical Problem Solving

Tests mathematical reasoning across:

- Calculus
- Probability
- Proof writing
- Problem-solving strategies

### 4. Context Understanding

Assesses the model's ability to:

- Comprehend complex passages
- Analyze code snippets
- Evaluate business scenarios
- Provide structured analysis

### 5. Performance Metrics

Measures:

- Response time
- Tokens per second
- Consistency across iterations
- Resource efficiency
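
For the throughput numbers, Ollama reports token counts and timings alongside each generation, so tokens per second can be derived directly from the response. A minimal sketch with the `ollama` Python package (the field names come from Ollama's API; the exact computation in `app.py` may differ):

```python
import ollama

# Run a single generation and derive timing metrics from Ollama's own counters.
response = ollama.generate(model="llama2", prompt="Explain recursion in one paragraph.")

eval_count = response["eval_count"]              # tokens generated
eval_seconds = response["eval_duration"] / 1e9   # durations are reported in nanoseconds
total_seconds = response["total_duration"] / 1e9

print(f"Response time:     {total_seconds:.2f} s")
print(f"Tokens per second: {eval_count / eval_seconds:.1f}")
```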

## Results

Benchmark results are automatically saved in the `benchmark_results` directory with the following naming format:

```
benchmark_results/[model_name]_vs_[judge_model]_[timestamp].json
```

Each result file contains:

- Model details
- Timestamp
- Detailed scores for each category
- Performance metrics
- Raw responses and evaluations
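
Because the results are plain JSON, they can also be inspected outside the Gradio UI. A minimal sketch for loading the most recent run; the `scores` key below is an assumption for illustration, since the real schema is defined in `app.py`:

```python
import json
from pathlib import Path

# Pick the most recently written result file.
results_dir = Path("benchmark_results")
latest = max(results_dir.glob("*.json"), key=lambda p: p.stat().st_mtime)

with latest.open() as f:
    result = json.load(f)

print(f"Loaded {latest.name}")
# Adjust the key below to match the actual schema produced by app.py.
for category, score in result.get("scores", {}).items():
    print(f"{category}: {score}")
```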

## Contributing

Feel free to open issues or submit pull requests with improvements to:

- Test cases
- Evaluation metrics
- UI/UX enhancements
- Documentation

## License

MIT License