---
title: LLM Comparison Hub
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.34.2
app_file: app.py
pinned: false
---

# LLM Comparison Hub

A comprehensive tool for comparing responses from multiple Large Language Models (GPT-4, Claude 3, Gemini 1.5) with built-in evaluation, analysis, and visualization capabilities.

**Live Demo:** https://huggingface.co/spaces/chunchu-08/LLM-Comparison-Hub

## Overview

This application provides a complete LLM comparison and evaluation system: it generates responses from multiple models, runs round-robin evaluations in which each model evaluates all the others, and produces comprehensive analysis with interactive visualizations.
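
To make the round-robin idea concrete, here is a minimal sketch of the pattern. It is illustrative only: `query_model` is a stand-in for whatever function actually calls each provider's API, and the rubric wording is assumed, not the app's real evaluation prompt.

```python
from itertools import permutations
from typing import Callable, Dict, List

def build_eval_prompt(prompt: str, answer: str) -> str:
    # Hypothetical rubric; the app defines its own evaluation prompt.
    return (f"Rate the following answer to the prompt on a 1-10 scale.\n"
            f"Prompt: {prompt}\nAnswer: {answer}\nReply with a number only.")

def round_robin(responses: Dict[str, str], prompt: str,
                query_model: Callable[[str, str], str]) -> List[dict]:
    """Each selected model evaluates every *other* model's response.

    `responses` maps model name -> its answer to `prompt`;
    `query_model(model, text)` is whatever function calls the model's API.
    """
    results = []
    for evaluator, evaluatee in permutations(responses, 2):
        verdict = query_model(evaluator,
                              build_eval_prompt(prompt, responses[evaluatee]))
        results.append({"evaluator": evaluator, "evaluatee": evaluatee,
                        "verdict": verdict})
    return results
```

With three models selected this yields 3 × 2 = 6 evaluations; deselecting one drops the count to 2 with no special-casing.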

## Key Features

- **Multi-Model Response Generation**: Dynamically generate responses from any combination of GPT-4, Claude 3, and Gemini 1.5 using a simple model selector.
- **Dynamic Round-Robin Evaluation**: A robust evaluation system where the selected models evaluate each other. If a model is deselected, the evaluation logic adapts automatically.
- **Real-Time Query Detection**: Automatically detects whether a prompt requires current information and fetches it using a Google search fallback (a possible detection heuristic is sketched after this list).
- **ATS Scoring**: Performs detailed resume vs. job description matching and scoring.
- **Interactive Data Analysis & Visualization**: Generates consistent, professionally styled charts (heatmap, radar, bar) for all prompt types.
- **Batch Processing**: Handles multiple prompts from CSV files.
- **Modular Architecture**: A clean, production-ready codebase with a new `universal_model_wrapper.py` that centralizes core logic.
- **Gradio Web Interface**: A user-friendly web UI with a model selector to choose which LLMs to run.
- **Export Capabilities**: Download a ZIP bundle with all evaluation results and interactive HTML charts.
- **Automated Deployment**: GitHub Actions for continuous deployment to Hugging Face Spaces.
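
One plausible shape for the real-time detection mentioned above is a keyword heuristic like the following; the actual logic in `universal_model_wrapper.py` may use different or additional cues:

```python
import re

# Hypothetical keyword heuristic for flagging prompts that need current
# information; the shipped detector may work differently.
REALTIME_CUES = re.compile(
    r"\b(today|latest|current|right now|this week|breaking news|price of)\b",
    re.IGNORECASE,
)

def needs_realtime_data(prompt: str) -> bool:
    return bool(REALTIME_CUES.search(prompt))

assert needs_realtime_data("What is the latest Gradio release?")
assert not needs_realtime_data("Explain how binary search works.")
```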

## Project Architecture

The architecture has been refactored for simplicity and robustness.

### Core Application Files

- `app.py` - Main Gradio web interface, including UI logic and the model selector.
- `universal_model_wrapper.py` - The new core module: centralizes all LLM API calls, real-time detection, search fallback, and ATS/general prompt logic (see the dispatch sketch after this list).
- `response_generator.py` - A simplified wrapper that interfaces between the app and the universal model wrapper.
- `round_robin_evaluator.py` - A dynamic evaluation engine that adapts to the models selected in the UI.
- `llm_prompt_eval_analysis.py` - Data analysis and visualization engine.
- `llm_response_logger.py` - Quick testing and logging tool.
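
As a rough illustration of how one wrapper can centralize model access, consider a registry pattern like the sketch below. The decorator, model IDs, and stubbed bodies are assumptions made for the example, not `universal_model_wrapper.py`'s actual API.

```python
from typing import Callable, Dict, List

# Illustrative registry for centralizing model calls. In the real module,
# each registered function would call the OpenAI, Anthropic, or Google SDK.
MODEL_REGISTRY: Dict[str, Callable[[str], str]] = {}

def register(name: str):
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        MODEL_REGISTRY[name] = fn
        return fn
    return wrap

@register("GPT-4")
def call_gpt4(prompt: str) -> str:
    ...  # would call the OpenAI chat completions API

@register("Claude 3")
def call_claude(prompt: str) -> str:
    ...  # would call the Anthropic messages API

def generate_all(prompt: str, selected: List[str]) -> Dict[str, str]:
    """Run the prompt through every selected, registered model."""
    return {m: MODEL_REGISTRY[m](prompt)
            for m in selected if m in MODEL_REGISTRY}
```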

### Supporting Modules

- `search_fallback.py`: This file is kept for reference, but its core functionality has been integrated into `universal_model_wrapper.py` for a more robust, self-contained architecture.

## Usage

### Web Interface (Recommended)

Launch the Gradio web interface:

```bash
python app.py
```

The interface provides:

- **Input Section**: Enter prompts, upload files, and use the model selector checkboxes to choose which LLMs to run.
- **Results Tabs**: View responses, evaluations, search results, and interactive visualizations.
- **Export Options**: Download results as ZIP bundles with interactive HTML charts.
- **Real-Time Features**: Automatic query detection and search enhancement.

### Model Selection

The UI now includes a set of checkboxes allowing you to select any combination of models (GPT-4, Claude 3, Gemini 1.5) for a given query. The application, including the round-robin evaluation, will dynamically adapt to your selection.
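
A minimal sketch of how such a selector can be wired up in Gradio follows; the real `app.py` layout is more elaborate, and `run_comparison` here is a hypothetical handler:

```python
import gradio as gr

def run_comparison(prompt: str, models: list) -> str:
    # Hypothetical handler; the real app dispatches to the universal wrapper.
    return f"Would query {', '.join(models) or 'no models'} with: {prompt}"

with gr.Blocks() as demo:
    prompt = gr.Textbox(label="Prompt")
    models = gr.CheckboxGroup(
        choices=["GPT-4", "Claude 3", "Gemini 1.5"],
        value=["GPT-4", "Claude 3", "Gemini 1.5"],  # all selected by default
        label="Models to run",
    )
    output = gr.Textbox(label="Result")
    gr.Button("Compare").click(run_comparison, [prompt, models], output)

demo.launch()
```

Because `CheckboxGroup` passes the handler a plain list of selected names, downstream code (including the round-robin evaluator) only ever sees the models the user actually chose.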

## Technical Architecture

### Design Principles

- **Centralized Logic**: The new `universal_model_wrapper.py` acts as a single source of truth for model interaction.
- **Dynamic & Robust**: The evaluation system is no longer static; it adapts to user input, preventing crashes when models are deselected.
- **Separation of Concerns**: Each file has a clear, specific responsibility.
- **Clean Code**: Production-ready and easy to maintain.
- **Hugging Face Compatible**: No external browser dependencies for chart generation.

### Module Responsibilities

| Module | Responsibility |
| --- | --- |
| `app.py` | UI orchestration, including the model selector and deployment |
| `universal_model_wrapper.py` | All LLM calls, prompt logic, and search |
| `response_generator.py` | Connects the UI to the universal wrapper |
| `round_robin_evaluator.py` | Dynamically evaluates the currently selected models |
| `llm_prompt_eval_analysis.py` | Data analysis and visualization |

## Installation

### Prerequisites

- Python 3.8 or higher
- API keys for OpenAI, Anthropic, and Google Generative AI

### Setup Instructions

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd LLM-Comparison-Hub
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Configure API keys: create a `.env` file in the project root with your API keys:

   ```
   OPENAI_API_KEY=your_openai_key_here
   CLAUDE_API_KEY=your_claude_key_here
   GEMINI_API_KEY=your_gemini_key_here
   GOOGLE_API_KEY=your_google_key_here
   GOOGLE_CSE_ID=your_google_cse_id_here
   ```
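
At runtime the app presumably loads these with `python-dotenv`; a minimal sketch of that pattern:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

openai_key = os.getenv("OPENAI_API_KEY")
if not openai_key:
    raise RuntimeError("OPENAI_API_KEY is not set; check your .env file")
```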
    

## API Requirements

### Required APIs

- **OpenAI API**: For GPT-4 responses and ATS scoring
- **Anthropic API**: For Claude 3 responses
- **Google Generative AI**: For Gemini 1.5 responses

### Optional APIs

- **Google Custom Search**: For real-time query enhancement

## Evaluation Metrics

The system evaluates responses on eight criteria:

- **Helpfulness**: How useful and informative is the response?
- **Correctness**: How accurate and factually correct is the response?
- **Coherence**: How well-structured and logical is the response?
- **Tone Score**: How appropriate and professional is the tone?
- **Accuracy**: How precise and detailed is the information?
- **Relevance**: How well does the response address the prompt?
- **Completeness**: How comprehensive is the response?
- **Clarity**: How clear and easy to understand is the response?
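
A judge prompt over these criteria could be assembled roughly as follows; the wording and key names are a sketch, not the prompt `round_robin_evaluator.py` actually uses:

```python
# Sketch of a rubric-based judge prompt covering the eight criteria.
CRITERIA = ["helpfulness", "correctness", "coherence", "tone",
            "accuracy", "relevance", "completeness", "clarity"]

def build_rubric_prompt(prompt: str, response: str) -> str:
    keys = "\n".join(f'- "{c}": integer 1-10' for c in CRITERIA)
    return (
        "Score the response to the prompt below. Return a JSON object "
        f"with exactly these keys:\n{keys}\n\n"
        f"Prompt:\n{prompt}\n\nResponse:\n{response}"
    )
```

Asking the judge model for JSON keeps the verdicts machine-parseable, which is what makes the aggregated charts possible.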

## ATS Scoring System

When a resume and job description are provided, the system performs ATS (Applicant Tracking System) scoring:

- **Keyword Matching**: Identifies relevant skills and qualifications
- **Section Weighting**: Prioritizes important sections
- **Semantic Similarity**: Analyzes meaning and context
- **Recency/Frequency**: Considers experience relevance
- **Penalty Detection**: Identifies potential issues
- **Aggregation**: Provides an overall match score
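
For intuition, here is a toy version of the keyword-matching component alone; the full pipeline above also weights sections, measures semantic similarity, and applies penalties:

```python
import re

def keyword_match_score(resume: str, job_description: str) -> float:
    """Fraction of distinct job-description terms found in the resume."""
    tokenize = lambda text: set(re.findall(r"[a-z0-9+#.]{3,}", text.lower()))
    jd_terms, resume_terms = tokenize(job_description), tokenize(resume)
    return len(jd_terms & resume_terms) / max(len(jd_terms), 1)

print(keyword_match_score("Python and SQL developer", "Seeking Python, SQL, AWS"))
# 0.5 -- "python" and "sql" match out of 4 distinct JD terms
```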

## Output and Results

### Generated Files

- **CSV Files**: Comprehensive evaluation results with timestamps
- **Analysis Reports**: Detailed analysis and insights
- **Interactive Visualizations**: HTML charts and graphs
- **Export Bundles**: ZIP files containing all results and interactive charts

### File Naming Convention

- `evaluation_YYYYMMDD_HHMMSS.csv` - Evaluation results
- `batch_YYYYMMDD_HHMMSS/` - Results directory
- `heatmap.html`, `radar.html`, `barchart.html` - Interactive visualization files
- `bundle.zip` - Complete export package
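
These names can be produced and bundled as sketched below; the app's actual export code may organize files differently:

```python
import zipfile
from datetime import datetime
from pathlib import Path

stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # YYYYMMDD_HHMMSS
out_dir = Path(f"batch_{stamp}")
out_dir.mkdir(exist_ok=True)

csv_path = out_dir / f"evaluation_{stamp}.csv"
csv_path.write_text("evaluator,evaluatee,score\n")  # placeholder content

with zipfile.ZipFile(out_dir / "bundle.zip", "w") as bundle:
    bundle.write(csv_path, arcname=csv_path.name)
```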