# Multi-Model Evaluator (2_lab2.py)

A Python script that evaluates and compares the performance of multiple AI language models by generating a challenging question, collecting responses from various providers, and ranking them using a judge model.

## Overview

This script performs the following steps:
1. **Question Generation**: Uses an OpenAI model to generate a challenging, real-world question
2. **Multi-Model Evaluation**: Sends the question to multiple AI models from different providers
3. **Response Collection**: Gathers and displays all responses with timing information
4. **Judging**: Uses a judge model to rank the responses based on correctness, depth, clarity, and helpfulness
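The four steps above can be sketched as follows. The helper names (`generate_question`, `collect_responses`, `judge`) are illustrative, not the script's actual functions; each `ask` callable stands in for a provider API call.

```python
# Minimal sketch of the four-step flow. Function and variable names are
# hypothetical; the actual script's internals differ.

def generate_question(ask):
    # Step 1: ask an OpenAI model for a challenging question.
    return ask("Formulate one challenging, nuanced real-world question.")

def collect_responses(question, competitors):
    # Steps 2-3: query each configured model and record its answer.
    return {name: ask(question) for name, ask in competitors.items()}

def judge(question, responses, ask_judge):
    # Step 4: have a judge model rank the collected answers.
    transcript = "\n\n".join(
        f"[{name}]\n{answer}" for name, answer in responses.items()
    )
    return ask_judge(
        "Rank these answers by correctness, depth, clarity, and "
        f"helpfulness.\n\nQuestion: {question}\n\n{transcript}"
    )
```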

## Prerequisites

- Python 3.7 or higher
- API keys for the AI providers you want to test (at minimum, OpenAI API key is required)

## Installation

1. **Install required Python packages:**

```bash
pip install openai anthropic python-dotenv
```

Or, if the project provides a `requirements.txt`:

```bash
pip install -r requirements.txt
```

Required packages:
- `openai` - For OpenAI API calls and OpenAI-compatible APIs
- `anthropic` - For Anthropic/Claude API calls
- `python-dotenv` - For loading environment variables from `.env` file

## Environment Setup

1. **Create a `.env` file** in the same directory as `2_lab2.py` (or in the project root)

2. **Add your API keys** to the `.env` file:

```env
# Required
OPENAI_API_KEY=your_openai_api_key_here

# Optional (add only if you want to test these providers)
ANTHROPIC_API_KEY=your_anthropic_api_key_here
GOOGLE_API_KEY=your_google_api_key_here
DEEPSEEK_API_KEY=your_deepseek_api_key_here
GROQ_API_KEY=your_groq_api_key_here
OLLAMA_BASE_URL=http://localhost:11434
```

**Note:** Only `OPENAI_API_KEY` is strictly required. The script will skip providers for which API keys are missing.
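The skip behavior can be reproduced with a small check after the keys are loaded. In the script, `python-dotenv`'s `load_dotenv()` reads the `.env` file into `os.environ` first; the helper below is an illustrative sketch (not the script's actual code) that takes the environment as a plain mapping so it is easy to test.

```python
# Illustrative check for which providers are usable. In 2_lab2.py the
# environment is populated by python-dotenv's load_dotenv() beforehand;
# pass os.environ as `env` in real use.

def available_providers(env,
                        required="OPENAI_API_KEY",
                        optional=("ANTHROPIC_API_KEY", "GOOGLE_API_KEY",
                                  "DEEPSEEK_API_KEY", "GROQ_API_KEY")):
    """Return the optional keys present in env; fail fast if the required one is missing."""
    if not env.get(required):
        raise RuntimeError(f"{required} is required")
    return [key for key in optional if env.get(key)]
```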

## Supported Models

The script is configured to test the following models (you can modify the `COMPETITORS` list in the script):

- **Claude Sonnet 4.5** (Anthropic) - Requires `ANTHROPIC_API_KEY`
- **GPT-5 Nano** (OpenAI) - Requires `OPENAI_API_KEY`
- **Gemini 2.0 Flash** (Google) - Requires `GOOGLE_API_KEY`
- **Llama 3.2** (via Ollama) - Requires `OLLAMA_BASE_URL` pointing to a local Ollama instance
- **DeepSeek Chat** (DeepSeek) - Requires `DEEPSEEK_API_KEY`
- **GPT-OSS-120B** (via Groq) - Requires `GROQ_API_KEY`
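Entries in `COMPETITORS` presumably pair each model with its provider configuration. A plausible shape is sketched below; the field names, model identifiers, and base URL are assumptions for illustration, not the script's actual definitions (check lines 196-227 of `2_lab2.py` for the real list).

```python
# Illustrative shape for the COMPETITORS list. Field names and values
# are assumptions; consult the script for the actual configuration.
COMPETITORS = [
    {"name": "claude-sonnet-4-5", "provider": "anthropic",
     "key_env": "ANTHROPIC_API_KEY", "base_url": None},
    {"name": "gpt-5-nano", "provider": "openai",
     "key_env": "OPENAI_API_KEY", "base_url": None},
    {"name": "llama3.2", "provider": "openai_compatible",
     "key_env": None, "base_url": "http://localhost:11434/v1"},
]

def runnable(entry, env):
    # A model is runnable if it needs no key or its key is set.
    return entry["key_env"] is None or bool(env.get(entry["key_env"]))
```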

## Usage

1. **Ensure your `.env` file is set up** with at least the `OPENAI_API_KEY`

2. **Run the script:**

```bash
python 2_lab2.py
```

The script will:
- Generate a challenging question
- Display the question
- Query each configured model (skipping those without API keys)
- Display each response with timing information
- Use a judge model to rank all responses
- Display the final rankings with scores and justifications

## Customization

You can customize the script by modifying:

- **`QUESTION_GENERATOR_MODEL`** (line 167): The model used to generate questions (default: `"gpt-4.1-mini"`)
- **`JUDGE_MODEL`** (line 319): The model used to judge responses (default: `"o3-mini"`)
- **`COMPETITORS`** list (lines 196-227): Add, remove, or modify the models to test

## Notes

- Models without corresponding API keys will be skipped gracefully
- The script uses OpenAI's Responses API for some models and the standard Chat Completions API for others
- Ollama requires a local instance running and accessible at the `OLLAMA_BASE_URL`
- Response times are measured and displayed for each model
- The judge model outputs JSON-formatted rankings with scores (0-10) and justifications
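The judge's JSON output might be consumed like this. The schema below (`rankings` with `model`, `score`, `justification` fields) is a hypothetical example consistent with the 0-10 scoring described above, not the script's actual format.

```python
import json

# Hypothetical judge reply; the actual schema in 2_lab2.py may differ.
judge_reply = """
{"rankings": [
  {"model": "model-a", "score": 9, "justification": "Accurate and thorough."},
  {"model": "model-b", "score": 7, "justification": "Correct but shallow."}
]}
"""

rankings = json.loads(judge_reply)["rankings"]
for rank, entry in enumerate(sorted(rankings, key=lambda e: -e["score"]), 1):
    print(f'{rank}. {entry["model"]} ({entry["score"]}/10): {entry["justification"]}')
```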

## Troubleshooting

- **"OPENAI_API_KEY is required"**: Make sure your `.env` file contains a valid OpenAI API key
- **"ANTHROPIC_API_KEY missing"**: This is expected if you don't have an Anthropic key. The script will skip Anthropic models
- **Ollama connection errors**: Ensure Ollama is running locally and accessible at the configured `OLLAMA_BASE_URL`
- **Import errors**: Make sure all required packages are installed: `pip install openai anthropic python-dotenv`