---
title: LLM Comparison Hub
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "5.34.2"
app_file: app.py
pinned: false
---

# LLM Comparison Hub

A comprehensive tool for comparing responses from multiple Large Language Models (GPT-4, Claude 3, Gemini 1.5) with built-in evaluation, analysis, and visualization capabilities.

**Live Demo:** [https://huggingface.co/spaces/chunchu-08/LLM-Comparison-Hub](https://huggingface.co/spaces/chunchu-08/LLM-Comparison-Hub)

## Overview

This application provides a complete LLM comparison and evaluation system that generates responses from multiple models, performs round-robin evaluations where each model evaluates all others, and provides comprehensive analysis with interactive visualizations.

## Key Features

- **Multi-Model Response Generation**: Dynamically generate responses from any combination of GPT-4, Claude 3, and Gemini 1.5 using a simple model selector.
- **Dynamic Round-Robin Evaluation**: A robust evaluation system where selected models evaluate each other. If a model is deselected, the evaluation logic adapts automatically.
- **Real-time Query Detection**: Automatically detects if a prompt requires current information and fetches it using a Google search fallback.
- **ATS Scoring**: Performs detailed resume vs. job description matching and scoring.
- **Interactive Data Analysis & Visualization**: Generates consistent, professionally styled charts (Heatmap, Radar, Bar) for all prompt types.
- **Batch Processing**: Handles multiple prompts from CSV files.
- **Modular Architecture**: A clean, production-ready codebase with a new `universal_model_wrapper.py` that centralizes core logic.
- **Gradio Web Interface**: A user-friendly web UI with a model selector to easily choose which LLMs to run.
- **Export Capabilities**: Download a ZIP bundle with all evaluation results and interactive HTML charts.
- **Automated Deployment**: GitHub Actions for continuous deployment to Hugging Face Spaces.
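
The round-robin evaluation described above, where each selected model evaluates all others, can be sketched as a list of (evaluator, target) pairs. This is only an illustration of the idea under stated assumptions: the function name `round_robin_pairs` is hypothetical and is not taken from `round_robin_evaluator.py`.

```python
from itertools import permutations
from typing import List, Tuple


def round_robin_pairs(selected_models: List[str]) -> List[Tuple[str, str]]:
    """Return (evaluator, target) pairs so each model judges every other one.

    Hypothetical helper: the project's real evaluator may be organized differently.
    """
    # Drop duplicate selections while keeping the order chosen in the UI.
    unique = list(dict.fromkeys(selected_models))
    # Ordered pairs of distinct models: no model evaluates itself.
    return list(permutations(unique, 2))


if __name__ == "__main__":
    for evaluator, target in round_robin_pairs(["GPT-4", "Claude 3", "Gemini 1.5"]):
        print(f"{evaluator} evaluates {target}")
```

Because the pairs are derived from whatever list is passed in, deselecting a model simply shrinks the list: three models give six evaluations, two give two, and a single model gives none.
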
## Project Architecture

The architecture has been refactored for simplicity and robustness.

### Core Application Files

- **`app.py`** - Main Gradio web interface, including UI logic and the model selector.
- **`universal_model_wrapper.py`** - **New core module!** Centralizes all LLM API calls, real-time detection, search fallback, and ATS/general prompt logic.
- **`response_generator.py`** - A simplified wrapper that interfaces between the app and the `universal_model_wrapper`.
- **`round_robin_evaluator.py`** - A dynamic evaluation engine that adapts to the models selected in the UI.
- **`llm_prompt_eval_analysis.py`** - Data analysis and visualization engine.
- **`llm_response_logger.py`** - Quick testing and logging tool.

### Supporting Modules

- **`search_fallback.py`**: This file is kept for reference, but its core functionality has been integrated into `universal_model_wrapper.py` for a more robust, self-contained architecture.

## Usage

### Web Interface (Recommended)

Launch the Gradio web interface:

```bash
python app.py
```

The interface provides:

- **Input Section**: Enter prompts, upload files, and use the **Model Selector** checkboxes to choose which LLMs to run.
- **Results Tabs**: View responses, evaluations, search results, and interactive visualizations.
- **Export Options**: Download results as ZIP bundles with interactive HTML charts.
- **Real-time Features**: Automatic query detection and search enhancement.

### Model Selection

The UI now includes a set of checkboxes allowing you to select any combination of models (GPT-4, Claude 3, Gemini 1.5) for a given query. The application, including the round-robin evaluation, will dynamically adapt to your selection.

## Technical Architecture

### Design Principles

- **Centralized Logic**: The new `universal_model_wrapper.py` acts as a single source of truth for model interaction.
- **Dynamic & Robust**: The evaluation system is no longer static and adapts to user input, preventing crashes when models are deselected.
- **Separation of Concerns**: Each file has a clear, specific responsibility.
- **Clean Code**: Production-ready and easy to maintain.
- **Hugging Face Compatible**: No external browser dependencies for chart generation.

### Module Responsibilities

| Module | Responsibility |
|--------|----------------|
| `app.py` | UI orchestration, including the model selector and deployment. |
| `universal_model_wrapper.py` | Handles all LLM calls, prompt logic, and search. |
| `response_generator.py` | Connects the UI to the universal wrapper. |
| `round_robin_evaluator.py` | Dynamically evaluates the currently selected models. |
| `llm_prompt_eval_analysis.py` | Data analysis and visualization. |

## Installation

### Prerequisites

- Python 3.8 or higher
- API keys for OpenAI, Anthropic, and Google Generative AI

### Setup Instructions

1. **Clone the repository**:

   ```bash
   git clone <repository-url>
   cd LLM-Compare-Hub
   ```

2. **Install dependencies**:

   ```bash
   pip install -r requirements.txt
   ```

3. **Configure API keys**: Create a `.env` file in the project root with your API keys:

   ```
   OPENAI_API_KEY=your_openai_key_here
   CLAUDE_API_KEY=your_claude_key_here
   GEMINI_API_KEY=your_gemini_key_here
   GOOGLE_API_KEY=your_google_key_here
   GOOGLE_CSE_ID=your_google_cse_id_here
   ```

## API Requirements

### Required APIs

- **OpenAI API**: For GPT-4 responses and ATS scoring
- **Anthropic API**: For Claude 3 responses
- **Google Generative AI**: For Gemini 1.5 responses

### Optional APIs

- **Google Custom Search**: For real-time query enhancement

## Evaluation Metrics

The system evaluates responses on eight criteria:

- **Helpfulness**: How useful and informative is the response?
- **Correctness**: How accurate and factually correct is the response?
- **Coherence**: How well-structured and logical is the response?
- **Tone Score**: How appropriate and professional is the tone?
- **Accuracy**: How precise and detailed is the information?
- **Relevance**: How well does the response address the prompt?
- **Completeness**: How comprehensive is the response?
- **Clarity**: How clear and easy to understand is the response?

## ATS Scoring System

When a resume and job description are provided, the system performs ATS (Applicant Tracking System) scoring:

- **Keyword Matching**: Identifies relevant skills and qualifications
- **Section Weighting**: Prioritizes important sections
- **Semantic Similarity**: Analyzes meaning and context
- **Recency/Frequency**: Considers experience relevance
- **Penalty Detection**: Identifies potential issues
- **Aggregation**: Provides overall match score

## Output and Results

### Generated Files

- **CSV Files**: Comprehensive evaluation results with timestamps
- **Analysis Reports**: Detailed analysis and insights
- **Interactive Visualizations**: Interactive HTML charts and graphs
- **Export Bundles**: ZIP files containing all results and interactive charts

### File Naming Convention

- `evaluation_YYYYMMDD_HHMMSS.csv` - Evaluation results
- `batch_YYYYMMDD_HHMMSS/` - Results directory
- `heatmap.html`, `radar.html`, `barchart.html` - Interactive visualization files
- `bundle.zip` - Complete export package
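
The timestamped names in the File Naming Convention can be produced with `datetime.strftime`. The following is a minimal sketch, not the project's actual code: the helper names `evaluation_filename` and `batch_dirname` are hypothetical, and it assumes the timestamp is taken from local time at the moment results are written.

```python
from datetime import datetime
from typing import Optional

# Format codes matching the YYYYMMDD_HHMMSS pattern used in the file names.
TIMESTAMP_FORMAT = "%Y%m%d_%H%M%S"


def evaluation_filename(now: Optional[datetime] = None) -> str:
    """Build a name like evaluation_YYYYMMDD_HHMMSS.csv (hypothetical helper)."""
    stamp = (now or datetime.now()).strftime(TIMESTAMP_FORMAT)
    return f"evaluation_{stamp}.csv"


def batch_dirname(now: Optional[datetime] = None) -> str:
    """Build a directory name like batch_YYYYMMDD_HHMMSS/ (hypothetical helper)."""
    stamp = (now or datetime.now()).strftime(TIMESTAMP_FORMAT)
    return f"batch_{stamp}/"


if __name__ == "__main__":
    print(evaluation_filename())
    print(batch_dirname())
```

Zero-padded fields keep the names a fixed width, so sorting them alphabetically also sorts them chronologically.
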