---
title: LLM Benchmark Model vs Judge
emoji: 🎯
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "4.28.3"
app_file: app.py
pinned: false
---
# LLM-Bench1: Language Model Benchmarking Suite

A comprehensive benchmarking tool for comparing LLMs served through Ollama, covering multiple aspects of model performance including accuracy, speed, and reasoning ability.
## Features

- 🔄 Compare different Ollama models head-to-head
- 📊 Comprehensive benchmarking across multiple categories:
  - Logical Reasoning
  - Code Generation
  - Mathematical Problem Solving
  - Context Understanding
  - Performance Metrics
- 📈 Interactive visualization of results
- 💾 Automatic saving of benchmark results
- 🎯 Customizable number of test iterations
- 🤖 Uses a separate judge model for unbiased evaluation (see the sketch below)
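
The judge model is queried separately from the model under test. As a rough illustration, a judge call might look like the sketch below; the prompt wording, scoring scale, and default model name are assumptions for this sketch, not the app's actual internals.

```python
import ollama

# Illustrative judge call: ask a separate judge model to score a candidate answer.
# Prompt, scale, and default model name are assumptions, not the app's real logic.
JUDGE_PROMPT = """You are an impartial judge. Score the answer below from 1 to 10
for correctness and clarity, and reply with the score only.

Question: {question}
Answer: {answer}"""

def judge_answer(question: str, answer: str, judge_model: str = "llama2") -> str:
    response = ollama.chat(
        model=judge_model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    return response["message"]["content"]
```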
## Requirements

- Python 3.8+
- Ollama installed and running
- Required Python packages (installed via `requirements.txt`; see the example below):
  - gradio
  - ollama
  - pandas
  - plotly
  - python-dotenv
  - tqdm
  - rich
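
For reference, a `requirements.txt` covering the packages above could be as simple as the following (version pins are not specified here):

```
gradio
ollama
pandas
plotly
python-dotenv
tqdm
rich
```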
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/LLM-Bench1.git
   cd LLM-Bench1
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows, use: .venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
## Usage

1. Ensure Ollama is running and you have the desired models pulled:

   ```bash
   ollama pull codellama
   ollama pull llama2
   # Pull any other models you want to benchmark
   ```

2. Run the application:

   ```bash
   python app.py
   ```

3. Open the provided URL in your browser to access the Gradio interface.
4. Select:
   - The model to benchmark
   - The judge model (can be the same or different)
   - The number of test iterations
5. Click "Run Benchmark" and wait for the results.
## Benchmark Categories

### 1. Logical Reasoning

Tests the model's ability to solve complex logical problems and puzzles, evaluating step-by-step reasoning.

### 2. Code Generation

Evaluates the model's capability to:

- Write functional code
- Implement algorithms
- Handle edge cases
- Provide proper documentation

### 3. Mathematical Problem Solving

Tests mathematical reasoning across:

- Calculus
- Probability
- Proof writing
- Problem-solving strategies

### 4. Context Understanding

Assesses the model's ability to:

- Comprehend complex passages
- Analyze code snippets
- Evaluate business scenarios
- Provide structured analysis

### 5. Performance Metrics

Measures:

- Response time
- Tokens per second (see the timing sketch below)
- Consistency across iterations
- Resource efficiency
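
As a rough illustration of how response time and tokens per second can be derived, here is a minimal sketch. It assumes the Ollama response exposes `eval_count` and `eval_duration` (in nanoseconds) as in the Ollama API metadata; if those fields are missing, it falls back to wall-clock timing.

```python
import time
import ollama

def measure_performance(model: str, prompt: str) -> dict:
    """Sketch only: time a single generation and estimate tokens per second."""
    start = time.perf_counter()
    response = ollama.generate(model=model, prompt=prompt)
    elapsed = time.perf_counter() - start  # wall-clock response time in seconds

    # eval_count / eval_duration are Ollama's generated-token count and time (ns).
    eval_count = response.get("eval_count") or 0
    eval_duration_s = (response.get("eval_duration") or 0) / 1e9
    tokens_per_second = (
        eval_count / eval_duration_s if eval_duration_s else eval_count / elapsed
    )

    return {"response_time_s": elapsed, "tokens_per_second": tokens_per_second}
```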
## Results

Benchmark results are automatically saved in the `benchmark_results` directory with the following naming format:

```
benchmark_results/[model_name]_vs_[judge_model]_[timestamp].json
```

Each result file contains (see the loading sketch below):

- Model details
- Timestamp
- Detailed scores for each category
- Performance metrics
- Raw responses and evaluations
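
To browse saved runs programmatically, a minimal loading sketch follows; the exact JSON schema is defined by the app, so this only collects the raw dictionaries for inspection.

```python
import json
from pathlib import Path

# Collect every saved benchmark run for ad-hoc inspection.
results = []
for path in sorted(Path("benchmark_results").glob("*.json")):
    with path.open() as f:
        results.append({"file": path.name, "data": json.load(f)})

print(f"Loaded {len(results)} benchmark result file(s)")
```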
## Contributing

Feel free to open issues or submit pull requests with improvements to:

- Test cases
- Evaluation metrics
- UI/UX enhancements
- Documentation

## License

MIT License