File size: 4,116 Bytes
18eeaee
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
# Multi-Model Evaluator (2_lab2.py)



A Python script that evaluates and compares the performance of multiple AI language models by generating a challenging question, collecting responses from various providers, and ranking them using a judge model.



## Overview



This script performs the following steps:

1. **Question Generation**: Uses an OpenAI model to generate a challenging, real-world question

2. **Multi-Model Evaluation**: Sends the question to multiple AI models from different providers

3. **Response Collection**: Gathers and displays all responses with timing information

4. **Judging**: Uses a judge model to rank the responses based on correctness, depth, clarity, and helpfulness



## Prerequisites



- Python 3.7 or higher

- API keys for the AI providers you want to test (at minimum, OpenAI API key is required)



## Installation



1. **Install required Python packages:**



```bash

pip install openai anthropic python-dotenv

```



Or if you have a requirements file:



```bash

pip install -r requirements.txt

```



Required packages:

- `openai` - For OpenAI API calls and OpenAI-compatible APIs

- `anthropic` - For Anthropic/Claude API calls

- `python-dotenv` - For loading environment variables from `.env` file



## Environment Setup



1. **Create a `.env` file** in the same directory as `2_lab2.py` (or in the project root)

2. **Add your API keys** to the `.env` file:

```env

# Required

OPENAI_API_KEY=your_openai_api_key_here



# Optional (add only if you want to test these providers)

ANTHROPIC_API_KEY=your_anthropic_api_key_here

GOOGLE_API_KEY=your_google_api_key_here

DEEPSEEK_API_KEY=your_deepseek_api_key_here

GROQ_API_KEY=your_groq_api_key_here

OLLAMA_BASE_URL=http://localhost:11434

```

**Note:** Only `OPENAI_API_KEY` is strictly required. The script will skip providers for which API keys are missing.

## Supported Models

The script is configured to test the following models (you can modify the `COMPETITORS` list in the script):

- **Claude Sonnet 4.5** (Anthropic) - Requires `ANTHROPIC_API_KEY`
- **GPT-5 Nano** (OpenAI) - Requires `OPENAI_API_KEY`
- **Gemini 2.0 Flash** (Google) - Requires `GOOGLE_API_KEY`
- **Llama 3.2** (via Ollama) - Requires `OLLAMA_BASE_URL` pointing to local Ollama instance
- **DeepSeek Chat** (DeepSeek) - Requires `DEEPSEEK_API_KEY`
- **GPT-OSS-120B** (via Groq) - Requires `GROQ_API_KEY`

## Usage

1. **Ensure your `.env` file is set up** with at least the `OPENAI_API_KEY`

2. **Run the script:**

```bash

python 2_lab2.py

```

The script will:
- Generate a challenging question
- Display the question
- Query each configured model (skipping those without API keys)
- Display each response with timing information
- Use a judge model to rank all responses
- Display the final rankings with scores and justifications

## Customization

You can customize the script by modifying:

- **`QUESTION_GENERATOR_MODEL`** (line 167): The model used to generate questions (default: `"gpt-4.1-mini"`)
- **`JUDGE_MODEL`** (line 319): The model used to judge responses (default: `"o3-mini"`)

- **`COMPETITORS`** list (lines 196-227): Add, remove, or modify the models to test



## Notes



- Models without corresponding API keys will be skipped gracefully

- The script uses OpenAI's Responses API for some models and standard Chat Completions API for others

- Ollama requires a local instance running and accessible at the `OLLAMA_BASE_URL`

- Response times are measured and displayed for each model

- The judge model outputs JSON-formatted rankings with scores (0-10) and justifications



## Troubleshooting



- **"OPENAI_API_KEY is required"**: Make sure your `.env` file contains a valid OpenAI API key

- **"ANTHROPIC_API_KEY missing"**: This is expected if you don't have an Anthropic key. The script will skip Anthropic models

- **Ollama connection errors**: Ensure Ollama is running locally and accessible at the configured `OLLAMA_BASE_URL`

- **Import errors**: Make sure all required packages are installed: `pip install openai anthropic python-dotenv`