---
title: LLM Comparison Hub
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "5.34.2"
app_file: app.py
pinned: false
---
# LLM Comparison Hub
A comprehensive tool for comparing responses from multiple Large Language Models (GPT-4, Claude Sonnet 4, Gemini 3 Flash) with built-in evaluation, analysis, and visualization capabilities.
**Live Demo:** [https://huggingface.co/spaces/chunchu-08/LLM-Comparison-Hub](https://huggingface.co/spaces/chunchu-08/LLM-Comparison-Hub)
## Overview
This application provides a complete LLM comparison and evaluation system that generates responses from multiple models, performs round-robin evaluations where each model evaluates all others, and provides comprehensive analysis with interactive visualizations.
## Key Features
- **Multi-Model Response Generation**: Dynamically generate responses from any combination of GPT-4, Claude Sonnet 4, and Gemini 3 Flash using a simple model selector.
- **Dynamic Round-Robin Evaluation**: A robust evaluation system where selected models evaluate each other. If a model is deselected, the evaluation logic adapts automatically.
- **Real-time Query Detection**: Automatically detects if a prompt requires current information and fetches it using a Google search fallback.
- **ATS Scoring**: Performs detailed resume vs. job description matching and scoring.
- **Interactive Data Analysis & Visualization**: Generates consistent, professionally styled charts (Heatmap, Radar, Bar) for all prompt types.
- **Batch Processing**: Handles multiple prompts from CSV files.
- **Modular Architecture**: A clean, production-ready codebase with a new `universal_model_wrapper.py` that centralizes core logic.
- **Gradio Web Interface**: A user-friendly web UI with a model selector to easily choose which LLMs to run.
- **Export Capabilities**: Download a ZIP bundle with all evaluation results and interactive HTML charts.
- **Automated Deployment**: GitHub Actions for continuous deployment to Hugging Face Spaces.
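The round-robin evaluation among the features above can be sketched as a pairing function: every selected model evaluates every other selected model, and deselecting a model simply shrinks the set of pairs. The function name below is illustrative, not the actual API of `round_robin_evaluator.py`.

```python
from itertools import permutations

def round_robin_pairs(selected_models):
    """Yield (evaluator, subject) pairs so that each selected model
    evaluates every other selected model's response."""
    return list(permutations(selected_models, 2))

# With two models selected, each evaluates the other:
pairs = round_robin_pairs(["GPT-4", "Claude Sonnet 4"])
# [('GPT-4', 'Claude Sonnet 4'), ('Claude Sonnet 4', 'GPT-4')]
```

Because the pairs are derived from whatever list the UI passes in, a deselected model never appears as an evaluator or a subject, which is what keeps the evaluation from crashing when the selection changes.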
## Project Architecture
The architecture has been refactored for simplicity and robustness.
### Core Application Files
- **`app.py`** - Main Gradio web interface, including UI logic and the model selector.
- **`model_config.py`** - API model IDs and UI/routing keys (GPT-4, Claude Sonnet 4, Gemini 3 Flash).
- **`universal_model_wrapper.py`** - Core module that centralizes all LLM API calls, real-time detection, search fallback, and ATS/general prompt logic.
- **`response_generator.py`** - A simplified wrapper that interfaces between the app and the `universal_model_wrapper`.
- **`round_robin_evaluator.py`** - A dynamic evaluation engine that adapts to the models selected in the UI.
- **`llm_prompt_eval_analysis.py`** - Data analysis and visualization engine.
- **`llm_response_logger.py`** - Quick testing and logging tool.
### Supporting Modules
- **`search_fallback.py`**: This file is kept for reference, but its core functionality has been integrated into `universal_model_wrapper.py` for a more robust, self-contained architecture.
## Usage
### Web Interface (Recommended)
Launch the Gradio web interface:
```bash
python app.py
```
The interface provides:
- **Input Section**: Enter prompts, upload files, and use the **Model Selector** checkboxes to choose which LLMs to run.
- **Results Tabs**: View responses, evaluations, search results, and interactive visualizations.
- **Export Options**: Download results as ZIP bundles with interactive HTML charts.
- **Real-time Features**: Automatic query detection and search enhancement.
### Model Selection
The UI now includes a set of checkboxes allowing you to select any combination of models (GPT-4, Claude Sonnet 4, Gemini 3 Flash) for a given query. The application, including the round-robin evaluation, will dynamically adapt to your selection.
## Technical Architecture
### Design Principles
- **Centralized Logic**: The new `universal_model_wrapper.py` acts as a single source of truth for model interaction.
- **Dynamic & Robust**: The evaluation system is no longer static and adapts to user input, preventing crashes when models are deselected.
- **Separation of Concerns**: Each file has a clear, specific responsibility.
- **Clean Code**: Production-ready and easy to maintain.
- **Hugging Face Compatible**: No external browser dependencies for chart generation.
### Module Responsibilities
| Module | Responsibility |
|--------|---------------|
| `model_config.py` | API model IDs and UI/routing keys for each provider. |
| `app.py` | UI orchestration, including the model selector and deployment. |
| `universal_model_wrapper.py` | Handles all LLM calls, prompt logic, and search. |
| `response_generator.py` | Connects the UI to the universal wrapper. |
| `round_robin_evaluator.py` | Dynamically evaluates the currently selected models. |
| `llm_prompt_eval_analysis.py` | Data analysis and visualization. |
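The "single source of truth" role of the universal wrapper can be pictured as a dispatch table keyed by routing key. The keys and stub bodies below are illustrative only; the actual module calls the provider SDKs.

```python
def call_model(model_key: str, prompt: str) -> str:
    """Route a prompt to the handler for one provider (sketch)."""
    dispatch = {
        "gpt-4": lambda p: f"[openai] {p}",          # OpenAI call would go here
        "claude-sonnet-4": lambda p: f"[anthropic] {p}",
        "gemini-3-flash": lambda p: f"[google] {p}",
    }
    try:
        return dispatch[model_key](prompt)
    except KeyError:
        raise ValueError(f"Unknown model key: {model_key}")
```

Centralizing the routing this way means the UI, the response generator, and the evaluator all share one set of model keys (the ones defined in `model_config.py`) instead of each hard-coding provider logic.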
## Installation
### Prerequisites
- Python 3.8 or higher
- API keys for OpenAI, Anthropic, and Google Generative AI
### Setup Instructions
1. **Clone the repository**:
```bash
git clone <repository-url>
cd LLM-Compare-Hub
```
2. **Install dependencies**:
```bash
pip install -r requirements.txt
```
3. **Configure API keys**:
Create a `.env` file in the project root with your API keys:
```
OPENAI_API_KEY=your_openai_key_here
CLAUDE_API_KEY=your_claude_key_here
GEMINI_API_KEY=your_gemini_key_here
GOOGLE_API_KEY=your_google_key_here
GOOGLE_CSE_ID=your_google_cse_id_here
```
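A small startup check along these lines can catch a missing or empty key before any API call is made. The helper name is hypothetical; only the required/optional split from the API Requirements section below is taken from this README.

```python
import os

REQUIRED_KEYS = ["OPENAI_API_KEY", "CLAUDE_API_KEY", "GEMINI_API_KEY"]
OPTIONAL_KEYS = ["GOOGLE_API_KEY", "GOOGLE_CSE_ID"]  # search enhancement only

def missing_keys(env=os.environ):
    """Return the required keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]
```

Calling `missing_keys()` at startup and failing fast with the returned list gives a clearer error than letting the first provider call raise an authentication exception.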
## API Requirements
### Required APIs
- **OpenAI API**: For GPT-4-class responses and ATS scoring
- **Anthropic API**: For Claude Sonnet 4 responses
- **Google Generative AI**: For Gemini 3 Flash responses
### Optional APIs
- **Google Custom Search**: For real-time query enhancement
## Evaluation Metrics
The system evaluates responses on eight comprehensive criteria:
- **Helpfulness**: How useful and informative is the response?
- **Correctness**: How accurate and factually correct is the response?
- **Coherence**: How well-structured and logical is the response?
- **Tone Score**: How appropriate and professional is the tone?
- **Accuracy**: How precise and detailed is the information?
- **Relevance**: How well does the response address the prompt?
- **Completeness**: How comprehensive is the response?
- **Clarity**: How clear and easy to understand is the response?
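One plausible way to roll the eight criteria into a single score is an unweighted mean; the aggregation below is a sketch under that assumption, not the app's actual formula.

```python
METRICS = ["helpfulness", "correctness", "coherence", "tone_score",
           "accuracy", "relevance", "completeness", "clarity"]

def overall_score(scores: dict) -> float:
    """Unweighted mean of the eight criteria (hypothetical aggregation)."""
    missing = [m for m in METRICS if m not in scores]
    if missing:
        raise ValueError(f"Missing metrics: {missing}")
    return sum(scores[m] for m in METRICS) / len(METRICS)
```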
## ATS Scoring System
When a resume and job description are provided, the system performs ATS (Applicant Tracking System) scoring:
- **Keyword Matching**: Identifies relevant skills and qualifications
- **Section Weighting**: Prioritizes important sections
- **Semantic Similarity**: Analyzes meaning and context
- **Recency/Frequency**: Considers experience relevance
- **Penalty Detection**: Identifies potential issues
- **Aggregation**: Provides overall match score
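The aggregation step can be sketched as a weighted sum over the component scores. The weights below are purely illustrative placeholders; the weighting actually used by the app may differ.

```python
# Illustrative weights (sum to 1.0); not the app's real values.
ATS_WEIGHTS = {
    "keyword_match": 0.30,
    "section_weighting": 0.15,
    "semantic_similarity": 0.25,
    "recency_frequency": 0.15,
    "penalties": 0.15,
}

def ats_score(components: dict) -> float:
    """Weighted sum of component scores in [0, 1], scaled to 0-100."""
    total = sum(ATS_WEIGHTS[k] * components.get(k, 0.0) for k in ATS_WEIGHTS)
    return round(total * 100, 1)
```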
## Output and Results
### Generated Files
- **CSV Files**: Comprehensive evaluation results with timestamps
- **Analysis Reports**: Detailed analysis and insights
- **Interactive Visualizations**: Interactive HTML charts and graphs
- **Export Bundles**: ZIP files containing all results and interactive charts
### File Naming Convention
- `evaluation_YYYYMMDD_HHMMSS.csv` - Evaluation results
- `batch_YYYYMMDD_HHMMSS/` - Results directory
- `heatmap.html`, `radar.html`, `barchart.html` - Interactive visualization files
- `bundle.zip` - Complete export package
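Names following the `evaluation_YYYYMMDD_HHMMSS.csv` convention can be generated with a one-line `strftime` call; the helper below is a sketch, not a function from the codebase.

```python
from datetime import datetime

def timestamped_name(prefix: str, ext: str = "csv") -> str:
    """Build a name like 'evaluation_20240101_120000.csv'."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"{prefix}_{stamp}.{ext}"
```

Dropping the extension argument and treating the result as a directory name gives the matching `batch_YYYYMMDD_HHMMSS/` convention.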