---
title: LLM Benchmark Model vs Judge
emoji: 🎯
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "4.28.3"
app_file: app.py
pinned: false
---

# LLM-Bench1: Language Model Benchmarking Suite

A comprehensive benchmarking tool for comparing language models served through Ollama, with a focus on multiple aspects of model performance including accuracy, speed, and reasoning capabilities.

## Features

- 🔄 Compare different Ollama models head-to-head
- 📊 Comprehensive benchmarking across multiple categories:
  - Logical Reasoning
  - Code Generation
  - Mathematical Problem Solving
  - Context Understanding
  - Performance Metrics
- 📈 Interactive visualization of results
- 💾 Automatic saving of benchmark results
- 🎯 Customizable number of test iterations
- 🤖 Uses a separate judge model for unbiased evaluation

## Requirements

- Python 3.8+
- Ollama installed and running
- Required Python packages (installed via `requirements.txt`):
  - gradio
  - ollama
  - pandas
  - plotly
  - python-dotenv
  - tqdm
  - rich

## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/LLM-Bench1.git
   cd LLM-Bench1
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows, use: .venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## Usage

1. Ensure Ollama is running and you have pulled the models you want to test:

   ```bash
   ollama pull codellama
   ollama pull llama2
   # Pull any other models you want to benchmark
   ```

2. Run the application:

   ```bash
   python app.py
   ```

3. Open the provided URL in your browser to access the Gradio interface.

4. Select:
   - The model to benchmark
   - The judge model (can be the same or a different model)
   - The number of test iterations

5. Click "Run Benchmark" and wait for the results.

## Benchmark Categories

### 1. Logical Reasoning

Tests the model's ability to solve complex logical problems and puzzles, evaluating step-by-step reasoning.

### 2. Code Generation

Evaluates the model's capability to:

- Write functional code
- Implement algorithms
- Handle edge cases
- Provide proper documentation

### 3. Mathematical Problem Solving

Tests mathematical reasoning across:

- Calculus
- Probability
- Proof writing
- Problem-solving strategies

### 4. Context Understanding

Assesses the model's ability to:

- Comprehend complex passages
- Analyze code snippets
- Evaluate business scenarios
- Provide structured analysis

### 5. Performance Metrics

Measures:

- Response time
- Tokens per second
- Consistency across iterations
- Resource efficiency

(A minimal sketch of timing a single model call and scoring it with a judge model appears at the end of this README.)

## Results

Benchmark results are automatically saved in the `benchmark_results` directory with the following naming format:

```
benchmark_results/[model_name]_vs_[judge_model]_[timestamp].json
```

Each result file contains:

- Model details
- Timestamp
- Detailed scores for each category
- Performance metrics
- Raw responses and evaluations

(See the loading example at the end of this README for a quick way to inspect a saved result file.)

## Contributing

Feel free to open issues or submit pull requests with improvements to:

- Test cases
- Evaluation metrics
- UI/UX enhancements
- Documentation

## License

MIT License
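
## Example: Timing and Judging a Single Response

The snippet below is a minimal sketch of the model-vs-judge pattern this suite is built around, written against the `ollama` Python package. The model names, prompt, and score-parsing logic are illustrative placeholders rather than the suite's actual test cases or evaluation code, and the `eval_count`/`eval_duration` fields are Ollama's reported generation statistics (their availability can vary by Ollama version).

```python
import re
import time

import ollama  # requires the Ollama daemon to be running locally

MODEL = "llama2"      # model being benchmarked (placeholder)
JUDGE = "codellama"   # judge model (placeholder)
PROMPT = "Explain, step by step, why the sum of two odd numbers is even."

# 1. Time a single generation with the benchmarked model.
start = time.perf_counter()
resp = ollama.generate(model=MODEL, prompt=PROMPT)
wall_clock = time.perf_counter() - start

answer = resp["response"]
tokens = resp["eval_count"]                 # tokens generated
gen_seconds = resp["eval_duration"] / 1e9   # reported in nanoseconds
print(f"{tokens} tokens in {gen_seconds:.2f}s "
      f"({tokens / gen_seconds:.1f} tok/s, {wall_clock:.2f}s wall clock)")

# 2. Ask the judge model to score the answer (illustrative prompt and parsing).
judge_prompt = (
    "Rate the following answer from 1 to 10 for correctness and clarity. "
    "Reply with only the number.\n\n"
    f"Question: {PROMPT}\n\nAnswer: {answer}"
)
verdict = ollama.generate(model=JUDGE, prompt=judge_prompt)["response"]
match = re.search(r"\d+", verdict)
score = int(match.group()) if match else None
print(f"Judge score: {score}")
```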
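
## Example: Inspecting a Saved Result File

A short sketch for loading the most recent file from `benchmark_results` for quick inspection. The `scores` and `performance` keys are assumptions about the JSON layout; adjust them to match the files actually produced by `app.py`.

```python
import json
from pathlib import Path

# Pick the most recently written result file.
results_dir = Path("benchmark_results")
latest = max(results_dir.glob("*.json"), key=lambda p: p.stat().st_mtime)

with latest.open() as f:
    result = json.load(f)

print(f"Loaded {latest.name}")
# NOTE: the key names below are assumed, not taken from app.py.
for category, score in result.get("scores", {}).items():
    print(f"  {category}: {score}")
print("Performance:", result.get("performance", {}))
```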