---
title: EvalArena
emoji: 🥇
colorFrom: pink
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: cc-by-nc-4.0
short_description: "An AI Judge Evaluation Platform"
sdk_version: 5.19.0
---

# EvalArena

An AI Judge Evaluation Platform

## About

EvalArena is a platform that allows users to compare and rate different AI evaluation models (judges). The platform uses a competitive ELO rating system to rank judge models based on human preferences.

## Project Structure

After refactoring, the project has a cleaner structure:

```
EvalArena/
│
├── src/                  # Source code
│   ├── app.py            # Application logic
│   ├── config.py         # Constants and configuration
│   ├── data_manager.py   # Dataset loading and management
│   ├── judge.py          # Judge evaluation functionality
│   └── ui.py             # Gradio UI components
│
├── data/                 # Data directory for CSV files
├── models.jsonl          # Model definitions
├── main.py               # Entry point
└── requirements.txt      # Dependencies
```

## Setup

1. Clone the repository
2. Install dependencies:
   ```
   pip install -r requirements.txt
   ```
3. Create a `.env` file with the API keys you need:
   ```
   OPENAI_API_KEY=your_key_here
   ANTHROPIC_API_KEY=your_key_here
   QUALIFIRE_API_KEY=your_qualifire_key_here
   ```

## Running

Run the application using:

```
python main.py
```

This will start the Gradio web interface, where you can:

- Select test types (grounding, hallucinations, safety, etc.)
- Get random examples
- See evaluations from two random judge models
- Select which judge provided the better evaluation
- View the leaderboard of judges ranked by ELO score

## Features

- Multiple test types (prompt injections, safety, grounding, hallucinations, policy)
- ELO-based competitive rating system
- Support for various model providers (OpenAI, Anthropic, Together AI)
- Detailed evaluations with scoring criteria
- Persistent leaderboard

## Overview

This application allows users to:

1. View AI-generated outputs based on input prompts
2. Compare evaluations from two different AI judges
3. Select the better evaluation
4. Build a leaderboard of judges ranked by ELO score

## Features

- **Blind Comparison**: Judge identities are hidden until after selection
- **ELO Rating System**: Calculates judge rankings based on user preferences
- **Leaderboard**: Track performance of different evaluation models
- **Sample Examples**: Includes pre-loaded examples for immediate testing

## Setup

### Prerequisites

- Python 3.6+
- Required packages: gradio, pandas, numpy

### Installation

1. Clone this repository:
   ```
   git clone https://github.com/yourusername/eval-arena.git
   cd eval-arena
   ```
2. Install dependencies:
   ```
   pip install -r requirements.txt
   ```
3. Run the application:
   ```
   python app.py
   ```
4. Open your browser and navigate to the URL displayed in the terminal (typically http://127.0.0.1:7860)

## Usage

1. **Get Random Example**: Click to load a random input/output pair
2. **Get Judge Evaluations**: View two anonymous evaluations of the output
3. **Select Better Evaluation**: Choose which evaluation you prefer
4. **See Results**: Learn which judges you compared and update the leaderboard
5. **Leaderboard Tab**: View current rankings of all judges
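The leaderboard ratings are adjusted after each vote in step 4. The repository does not document its exact parameters, so the snippet below is only a minimal sketch of a standard ELO update, assuming a K-factor of 32 and a 1200 starting rating, rather than the platform's actual implementation:

```python
def expected_score(rating_a, rating_b):
    """Probability that judge A beats judge B under the ELO model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(winner, loser, k=32.0):
    """Return updated (winner, loser) ratings after a single comparison."""
    expected_win = expected_score(winner, loser)
    new_winner = winner + k * (1.0 - expected_win)
    new_loser = loser - k * (1.0 - expected_win)
    return new_winner, new_loser


# Example: two judges start at 1200 and the first one wins the comparison.
print(update_elo(1200.0, 1200.0))  # (1216.0, 1184.0)
```

Under this scheme an upset win (a low-rated judge beating a high-rated one) moves the ratings more than an expected win, which is what lets the ranking converge as votes accumulate.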
## Extending the Application

### Adding New Examples

Add new examples in JSON format to the `data/examples` directory:

```json
{
  "id": "example_id",
  "input": "Your input prompt",
  "output": "AI-generated output to evaluate"
}
```

### Adding New Judges

Add new judges in JSON format to the `data/judges` directory:

```json
{
  "id": "judge_id",
  "name": "Judge Name",
  "description": "Description of the judge's evaluation approach"
}
```

### Integrating Real Models

For production use, modify the `get_random_judges_evaluations` function to call actual AI evaluation models instead of using the simulated evaluations.

## License

MIT

## Citation

If you use this platform in your research, please cite:

```
@software{ai_eval_arena,
  author = {Your Name},
  title = {AI Evaluation Judge Arena},
  year = {2023},
  url = {https://github.com/yourusername/eval-arena}
}
```

## Start the configuration

Most of the variables to change for a default leaderboard are in `src/env.py` (replace the path for your leaderboard) and `src/about.py` (for tasks).

Results files should have the following format and be stored as JSON files:

```json
{
    "config": {
        "model_dtype": "torch.float16", # or torch.bfloat16 or 8bit or 4bit
        "model_name": "path of the model on the hub: org/model",
        "model_sha": "revision on the hub",
    },
    "results": {
        "task_name": {
            "metric_name": score,
        },
        "task_name2": {
            "metric_name": score,
        }
    }
}
```

Request files are created automatically by this tool.

If you encounter problems on the Space, don't hesitate to restart it to remove the eval-queue, eval-queue-bk, eval-results, and eval-results-bk folders it creates.

## Code logic for more complex edits

You'll find:

- the main table's column names and properties in `src/display/utils.py`
- the logic to read all results and request files and convert them into dataframe lines in `src/leaderboard/read_evals.py` and `src/populate.py`
- the logic to allow or filter submissions in `src/submission/submit.py` and `src/submission/check_validity.py`
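As a companion to the results-file format above, here is a hedged sketch of how such files could be gathered into dataframe lines for the leaderboard. This is not the code in `src/leaderboard/read_evals.py`; the `results/` directory name and the `task.metric` column naming are assumptions made for illustration:

```python
import json
from pathlib import Path

import pandas as pd


def load_results(results_dir="results"):
    """Collect every results JSON under `results_dir` into one dataframe line per model."""
    rows = []
    for path in Path(results_dir).glob("**/*.json"):
        data = json.loads(path.read_text())
        row = {
            "model_name": data["config"]["model_name"],
            "model_sha": data["config"].get("model_sha", ""),
        }
        # Flatten {"task_name": {"metric_name": score}} into one column per task metric.
        for task, metrics in data["results"].items():
            for metric, score in metrics.items():
                row[f"{task}.{metric}"] = score
        rows.append(row)
    return pd.DataFrame(rows)
```

A dataframe built this way can then be sorted or averaged per task before being handed to the Gradio leaderboard table.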