# Multi-Model Evaluator (2_lab2.py)
A Python script that evaluates and compares the performance of multiple AI language models by generating a challenging question, collecting responses from various providers, and ranking them using a judge model.
## Overview
This script performs the following steps:
1. **Question Generation**: Uses an OpenAI model to generate a challenging, real-world question
2. **Multi-Model Evaluation**: Sends the question to multiple AI models from different providers
3. **Response Collection**: Gathers and displays all responses with timing information
4. **Judging**: Uses a judge model to rank the responses based on correctness, depth, clarity, and helpfulness
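The four steps above can be sketched as a simple pipeline. The function names here are illustrative stand-ins, not the actual identifiers in `2_lab2.py`:

```python
def run_evaluation(generate_question, query_models, judge_responses):
    """Illustrative pipeline: generate a question, collect answers, rank them.

    The three callables are hypothetical placeholders for the script's
    question-generation, multi-model query, and judging stages.
    """
    question = generate_question()
    answers = query_models(question)           # e.g. {model_name: answer_text}
    rankings = judge_responses(question, answers)
    return question, answers, rankings
```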
## Prerequisites
- Python 3.7 or higher
- API keys for the AI providers you want to test (at minimum, OpenAI API key is required)
## Installation
1. **Install required Python packages:**
```bash
pip install openai anthropic python-dotenv
```
Or if you have a requirements file:
```bash
pip install -r requirements.txt
```
Required packages:
- `openai` - For OpenAI API calls and OpenAI-compatible APIs
- `anthropic` - For Anthropic/Claude API calls
- `python-dotenv` - For loading environment variables from `.env` file
## Environment Setup
1. **Create a `.env` file** in the same directory as `2_lab2.py` (or in the project root)
2. **Add your API keys** to the `.env` file:
```env
# Required
OPENAI_API_KEY=your_openai_api_key_here
# Optional (add only if you want to test these providers)
ANTHROPIC_API_KEY=your_anthropic_api_key_here
GOOGLE_API_KEY=your_google_api_key_here
DEEPSEEK_API_KEY=your_deepseek_api_key_here
GROQ_API_KEY=your_groq_api_key_here
OLLAMA_BASE_URL=http://localhost:11434
```
**Note:** Only `OPENAI_API_KEY` is strictly required. The script will skip providers for which API keys are missing.
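The skip-when-missing behavior can be sketched with a small helper (the function name is hypothetical; `2_lab2.py` may structure this differently):

```python
import os

try:
    from dotenv import load_dotenv  # provided by the python-dotenv package
    load_dotenv()  # merge .env entries into os.environ
except ImportError:
    pass  # fall back to whatever is already set in the environment

def provider_available(env_var: str) -> bool:
    """Return True if the given API key (or base URL) is set and non-empty."""
    return bool(os.getenv(env_var, "").strip())
```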
## Supported Models
The script is configured to test the following models (you can modify the `COMPETITORS` list in the script):
- **Claude Sonnet 4.5** (Anthropic) - Requires `ANTHROPIC_API_KEY`
- **GPT-5 Nano** (OpenAI) - Requires `OPENAI_API_KEY`
- **Gemini 2.0 Flash** (Google) - Requires `GOOGLE_API_KEY`
- **Llama 3.2** (via Ollama) - Requires `OLLAMA_BASE_URL` pointing to a local Ollama instance
- **DeepSeek Chat** (DeepSeek) - Requires `DEEPSEEK_API_KEY`
- **GPT-OSS-120B** (via Groq) - Requires `GROQ_API_KEY`
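One way the `COMPETITORS` list can drive the skip logic is shown below; the exact field names and entries in `2_lab2.py` may differ, so treat this as a hedged sketch:

```python
import os

# Illustrative shape for the COMPETITORS list: each entry pairs a model
# with the environment variable that must be set for it to run.
COMPETITORS = [
    {"name": "Claude Sonnet 4.5", "env_var": "ANTHROPIC_API_KEY"},
    {"name": "GPT-5 Nano", "env_var": "OPENAI_API_KEY"},
    {"name": "Gemini 2.0 Flash", "env_var": "GOOGLE_API_KEY"},
    {"name": "Llama 3.2", "env_var": "OLLAMA_BASE_URL"},
    {"name": "DeepSeek Chat", "env_var": "DEEPSEEK_API_KEY"},
    {"name": "GPT-OSS-120B", "env_var": "GROQ_API_KEY"},
]

def runnable_competitors(env=os.environ):
    """Keep only the models whose required key or URL is present."""
    return [c for c in COMPETITORS if env.get(c["env_var"])]
```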
## Usage
1. **Ensure your `.env` file is set up** with at least the `OPENAI_API_KEY`
2. **Run the script:**
```bash
python 2_lab2.py
```
The script will:
- Generate a challenging question
- Display the question
- Query each configured model (skipping those without API keys)
- Display each response with timing information
- Use a judge model to rank all responses
- Display the final rankings with scores and justifications
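Per-model timing of the kind described above can be captured with a small wrapper (illustrative, not the script's actual code):

```python
import time

def timed_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

Wrapping each model query this way gives the response and its wall-clock latency in one call.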
## Customization
You can customize the script by modifying:
- **`QUESTION_GENERATOR_MODEL`** (line 167): The model used to generate questions (default: `"gpt-4.1-mini"`)
- **`JUDGE_MODEL`** (line 319): The model used to judge responses (default: `"o3-mini"`)
- **`COMPETITORS`** list (lines 196-227): Add, remove, or modify the models to test
## Notes
- Models without corresponding API keys will be skipped gracefully
- The script uses OpenAI's Responses API for some models and standard Chat Completions API for others
- Ollama requires a local instance running and accessible at the `OLLAMA_BASE_URL`
- Response times are measured and displayed for each model
- The judge model outputs JSON-formatted rankings with scores (0-10) and justifications
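If the judge returns JSON rankings with 0-10 scores and justifications as described above, parsing and ordering them might look like this (the JSON shape here is an assumption, not the script's exact format):

```python
import json

def parse_rankings(judge_output: str):
    """Parse the judge's JSON and sort entries by score, best first.

    Assumes a top-level "rankings" list of {"model", "score", "justification"}
    objects; adapt the keys to whatever schema the judge prompt requests.
    """
    data = json.loads(judge_output)
    return sorted(data["rankings"], key=lambda r: r["score"], reverse=True)
```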
## Troubleshooting
- **"OPENAI_API_KEY is required"**: Make sure your `.env` file contains a valid OpenAI API key
- **"ANTHROPIC_API_KEY missing"**: This is expected if you don't have an Anthropic key. The script will skip Anthropic models
- **Ollama connection errors**: Ensure Ollama is running locally and accessible at the configured `OLLAMA_BASE_URL`
- **Import errors**: Make sure all required packages are installed: `pip install openai anthropic python-dotenv`