Multi-Model Evaluator (2_lab2.py)
A Python script that evaluates and compares the performance of multiple AI language models by generating a challenging question, collecting responses from various providers, and ranking them using a judge model.
Overview
This script performs the following steps:
- Question Generation: Uses an OpenAI model to generate a challenging, real-world question
- Multi-Model Evaluation: Sends the question to multiple AI models from different providers
- Response Collection: Gathers and displays all responses with timing information
- Judging: Uses a judge model to rank the responses based on correctness, depth, clarity, and helpfulness
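The four steps above can be sketched as a single loop. This is an illustrative pipeline only; the function and variable names here are assumptions, not identifiers from 2_lab2.py:

```python
import time

def evaluate_models(generate_question, competitors, judge):
    # Illustrative pipeline; the real script's structure may differ.
    question = generate_question()                      # 1. question generation
    results = []
    for name, ask in competitors:                       # 2. multi-model evaluation
        start = time.perf_counter()
        answer = ask(question)                          # 3. response collection (timed)
        elapsed = time.perf_counter() - start
        results.append({"model": name, "answer": answer, "seconds": elapsed})
    return results, judge(question, results)            # 4. judging

# Stub callables let the flow run without any API keys:
results, rankings = evaluate_models(
    lambda: "What is 2 + 2?",
    [("stub-a", lambda q: "4"), ("stub-b", lambda q: "four")],
    lambda q, rs: [r["model"] for r in rs],
)
```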
Prerequisites
- Python 3.7 or higher
- API keys for the AI providers you want to test (at minimum, OpenAI API key is required)
Installation
- Install required Python packages:
pip install openai anthropic python-dotenv
Or if you have a requirements file:
pip install -r requirements.txt
Required packages:
- openai - For OpenAI API calls and OpenAI-compatible APIs
- anthropic - For Anthropic/Claude API calls
- python-dotenv - For loading environment variables from a .env file
Environment Setup
- Create a .env file in the same directory as 2_lab2.py (or in the project root)
- Add your API keys to the .env file:
# Required
OPENAI_API_KEY=your_openai_api_key_here
# Optional (add only if you want to test these providers)
ANTHROPIC_API_KEY=your_anthropic_api_key_here
GOOGLE_API_KEY=your_google_api_key_here
DEEPSEEK_API_KEY=your_deepseek_api_key_here
GROQ_API_KEY=your_groq_api_key_here
OLLAMA_BASE_URL=http://localhost:11434
Note: Only OPENAI_API_KEY is strictly required. The script will skip providers for which API keys are missing.
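The skip behavior can be approximated with a small helper. The environment variable names mirror the .env entries above, but the function itself is a sketch, not code from the script (which loads the .env file via python-dotenv's load_dotenv()):

```python
import os

# Optional providers and the env var each one needs (from the .env template above).
OPTIONAL_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
    "groq": "GROQ_API_KEY",
}

def available_providers(env=os.environ):
    # OpenAI is mandatory; every other provider is skipped when its key is absent.
    if not env.get("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY is required")
    return ["openai"] + [p for p, var in OPTIONAL_KEYS.items() if env.get(var)]
```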
Supported Models
The script is configured to test the following models (you can modify the COMPETITORS list in the script):
- Claude Sonnet 4.5 (Anthropic) - Requires ANTHROPIC_API_KEY
- GPT-5 Nano (OpenAI) - Requires OPENAI_API_KEY
- Gemini 2.0 Flash (Google) - Requires GOOGLE_API_KEY
- Llama 3.2 (via Ollama) - Requires OLLAMA_BASE_URL pointing to a local Ollama instance
- DeepSeek Chat (DeepSeek) - Requires DEEPSEEK_API_KEY
- GPT-OSS-120B (via Groq) - Requires GROQ_API_KEY
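Several of these providers expose OpenAI-compatible endpoints, which is how a single openai client class can reach them by swapping base_url. The URLs below are the providers' commonly documented endpoints, listed here as assumptions to verify against each provider's docs rather than values taken from the script:

```python
# Commonly documented OpenAI-compatible base URLs (verify before relying on them).
OPENAI_COMPATIBLE_BASE_URLS = {
    "ollama": "http://localhost:11434/v1",
    "deepseek": "https://api.deepseek.com/v1",
    "groq": "https://api.groq.com/openai/v1",
    "gemini": "https://generativelanguage.googleapis.com/v1beta/openai/",
}

# Usage sketch (requires the openai package and the matching key in the environment):
# client = OpenAI(base_url=OPENAI_COMPATIBLE_BASE_URLS["groq"],
#                 api_key=os.environ["GROQ_API_KEY"])
```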
Usage
- Ensure your .env file is set up with at least the OPENAI_API_KEY
- Run the script:
python 2_lab2.py
The script will:
- Generate a challenging question
- Display the question
- Query each configured model (skipping those without API keys)
- Display each response with timing information
- Use a judge model to rank all responses
- Display the final rankings with scores and justifications
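The per-response display step might look like the following. Both the result shape and the output format are illustrative, not taken from the script:

```python
# Hypothetical per-model results as collected by the evaluation loop.
results = [
    {"model": "stub-a", "answer": "4", "seconds": 1.2345},
    {"model": "stub-b", "answer": "four", "seconds": 0.5},
]

# One display line per model: name, elapsed time, then the response.
lines = [f'{r["model"]} ({r["seconds"]:.2f}s): {r["answer"]}' for r in results]
print("\n".join(lines))
```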
Customization
You can customize the script by modifying:
- QUESTION_GENERATOR_MODEL (line 167): The model used to generate questions (default: "gpt-4.1-mini")
- JUDGE_MODEL (line 319): The model used to judge responses (default: "o3-mini")
- COMPETITORS list (lines 196-227): Add, remove, or modify the models to test
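Assuming these are plain module-level constants, a customization can be as small as reassigning a name or filtering the list. The entry shape below is hypothetical; check lines 196-227 of 2_lab2.py for the real structure:

```python
QUESTION_GENERATOR_MODEL = "gpt-4.1-mini"   # default noted above
JUDGE_MODEL = "o3-mini"                     # default noted above

# Hypothetical entry shape for illustration only.
COMPETITORS = [
    {"name": "Claude Sonnet 4.5", "provider": "anthropic"},
    {"name": "GPT-5 Nano", "provider": "openai"},
    {"name": "GPT-OSS-120B", "provider": "groq"},
]

# Example tweak: drop Groq-hosted models without touching the rest.
COMPETITORS = [c for c in COMPETITORS if c["provider"] != "groq"]
```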
Notes
- Models without corresponding API keys will be skipped gracefully
- The script uses OpenAI's Responses API for some models and standard Chat Completions API for others
- Ollama requires a local instance to be running and accessible at the OLLAMA_BASE_URL
- Response times are measured and displayed for each model
- The judge model outputs JSON-formatted rankings with scores (0-10) and justifications
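Consuming that judge output might look like the sketch below. The JSON schema shown is an assumption for illustration, not necessarily the exact shape 2_lab2.py emits:

```python
import json

# Hypothetical judge reply: a JSON object with per-model scores (0-10).
judge_reply = '''{"rankings": [
  {"model": "model-b", "score": 6, "justification": "Correct but shallow."},
  {"model": "model-a", "score": 9, "justification": "Accurate and thorough."}
]}'''

# Parse, sort best-first, and print a numbered leaderboard.
rankings = sorted(json.loads(judge_reply)["rankings"],
                  key=lambda r: r["score"], reverse=True)
for rank, r in enumerate(rankings, start=1):
    print(f'{rank}. {r["model"]} ({r["score"]}/10) - {r["justification"]}')
```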
Troubleshooting
- "OPENAI_API_KEY is required": Make sure your
.envfile contains a valid OpenAI API key - "ANTHROPIC_API_KEY missing": This is expected if you don't have an Anthropic key. The script will skip Anthropic models
- Ollama connection errors: Ensure Ollama is running locally and accessible at the configured
OLLAMA_BASE_URL - Import errors: Make sure all required packages are installed:
pip install openai anthropic python-dotenv