---
title: LLM Comparison Hub
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.34.2
app_file: app.py
pinned: false
---

# LLM Comparison Hub

A comprehensive tool for comparing responses from multiple Large Language Models (GPT-4, Claude 3, Gemini 1.5) with built-in evaluation, analysis, and visualization capabilities.

**Live Demo:** https://huggingface.co/spaces/chunchu-08/LLM-Comparison-Hub

## Overview

This application provides a complete LLM comparison and evaluation system: it generates responses from multiple models, runs round-robin evaluations in which each model evaluates all the others, and produces comprehensive analysis with interactive visualizations.
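
To make the round-robin idea concrete, here is a minimal sketch of the pattern. It is illustrative only: `query_model` is a stand-in for whatever function actually calls each provider's API, and the rubric wording is assumed, not the app's real evaluation prompt.

```python
from itertools import permutations
from typing import Callable, Dict, List

def build_eval_prompt(prompt: str, answer: str) -> str:
    # Hypothetical rubric; the app defines its own evaluation prompt.
    return (f"Rate the following answer to the prompt on a 1-10 scale.\n"
            f"Prompt: {prompt}\nAnswer: {answer}\nReply with a number only.")

def round_robin(responses: Dict[str, str], prompt: str,
                query_model: Callable[[str, str], str]) -> List[dict]:
    """Each selected model evaluates every *other* model's response.

    `responses` maps model name -> its answer to `prompt`;
    `query_model(model, text)` is whatever function calls the model's API.
    """
    results = []
    for evaluator, evaluatee in permutations(responses, 2):
        verdict = query_model(evaluator,
                              build_eval_prompt(prompt, responses[evaluatee]))
        results.append({"evaluator": evaluator, "evaluatee": evaluatee,
                        "verdict": verdict})
    return results
```

With three models selected this yields 3 × 2 = 6 evaluations; deselecting one drops the count to 2 with no special-casing.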

## Key Features

- **Multi-Model Response Generation**: Dynamically generate responses from any combination of GPT-4, Claude 3, and Gemini 1.5 using a simple model selector.
- **Dynamic Round-Robin Evaluation**: A robust evaluation system where the selected models evaluate each other. If a model is deselected, the evaluation logic adapts automatically.
- **Real-Time Query Detection**: Automatically detects whether a prompt requires current information and fetches it using a Google search fallback (a possible detection heuristic is sketched after this list).
- **ATS Scoring**: Performs detailed resume vs. job description matching and scoring.
- **Interactive Data Analysis & Visualization**: Generates consistent, professionally styled charts (heatmap, radar, bar) for all prompt types.
- **Batch Processing**: Handles multiple prompts from CSV files.
- **Modular Architecture**: A clean, production-ready codebase with a new `universal_model_wrapper.py` that centralizes core logic.
- **Gradio Web Interface**: A user-friendly web UI with a model selector to choose which LLMs to run.
- **Export Capabilities**: Download a ZIP bundle with all evaluation results and interactive HTML charts.
- **Automated Deployment**: GitHub Actions for continuous deployment to Hugging Face Spaces.
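
One plausible shape for the real-time detection mentioned above is a keyword heuristic like the following; the actual logic in `universal_model_wrapper.py` may use different or additional cues:

```python
import re

# Hypothetical keyword heuristic for flagging prompts that need current
# information; the shipped detector may work differently.
REALTIME_CUES = re.compile(
    r"\b(today|latest|current|right now|this week|breaking news|price of)\b",
    re.IGNORECASE,
)

def needs_realtime_data(prompt: str) -> bool:
    return bool(REALTIME_CUES.search(prompt))

assert needs_realtime_data("What is the latest Gradio release?")
assert not needs_realtime_data("Explain how binary search works.")
```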

## Project Architecture

The architecture has been refactored for simplicity and robustness.

### Core Application Files

- `app.py` - Main Gradio web interface, including UI logic and the model selector.
- `universal_model_wrapper.py` - The new core module: centralizes all LLM API calls, real-time detection, search fallback, and ATS/general prompt logic (see the dispatch sketch after this list).
- `response_generator.py` - A simplified wrapper that interfaces between the app and the universal model wrapper.
- `round_robin_evaluator.py` - A dynamic evaluation engine that adapts to the models selected in the UI.
- `llm_prompt_eval_analysis.py` - Data analysis and visualization engine.
- `llm_response_logger.py` - Quick testing and logging tool.
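
As a rough illustration of how one wrapper can centralize model access, consider a registry pattern like the sketch below. The decorator, model IDs, and stubbed bodies are assumptions made for the example, not `universal_model_wrapper.py`'s actual API.

```python
from typing import Callable, Dict, List

# Illustrative registry for centralizing model calls. In the real module,
# each registered function would call the OpenAI, Anthropic, or Google SDK.
MODEL_REGISTRY: Dict[str, Callable[[str], str]] = {}

def register(name: str):
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        MODEL_REGISTRY[name] = fn
        return fn
    return wrap

@register("GPT-4")
def call_gpt4(prompt: str) -> str:
    ...  # would call the OpenAI chat completions API

@register("Claude 3")
def call_claude(prompt: str) -> str:
    ...  # would call the Anthropic messages API

def generate_all(prompt: str, selected: List[str]) -> Dict[str, str]:
    """Run the prompt through every selected, registered model."""
    return {m: MODEL_REGISTRY[m](prompt)
            for m in selected if m in MODEL_REGISTRY}
```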

### Supporting Modules

- `search_fallback.py`: This file is kept for reference, but its core functionality has been integrated into `universal_model_wrapper.py` for a more robust, self-contained architecture.

## Usage

### Web Interface (Recommended)

Launch the Gradio web interface:

```bash
python app.py
```

The interface provides:

- **Input Section**: Enter prompts, upload files, and use the model selector checkboxes to choose which LLMs to run.
- **Results Tabs**: View responses, evaluations, search results, and interactive visualizations.
- **Export Options**: Download results as ZIP bundles with interactive HTML charts.
- **Real-Time Features**: Automatic query detection and search enhancement.

### Model Selection

The UI now includes a set of checkboxes allowing you to select any combination of models (GPT-4, Claude 3, Gemini 1.5) for a given query. The application, including the round-robin evaluation, will dynamically adapt to your selection.
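
A minimal sketch of how such a selector can be wired up in Gradio follows; the real `app.py` layout is more elaborate, and `run_comparison` here is a hypothetical handler:

```python
import gradio as gr

def run_comparison(prompt: str, models: list) -> str:
    # Hypothetical handler; the real app dispatches to the universal wrapper.
    return f"Would query {', '.join(models) or 'no models'} with: {prompt}"

with gr.Blocks() as demo:
    prompt = gr.Textbox(label="Prompt")
    models = gr.CheckboxGroup(
        choices=["GPT-4", "Claude 3", "Gemini 1.5"],
        value=["GPT-4", "Claude 3", "Gemini 1.5"],  # all selected by default
        label="Models to run",
    )
    output = gr.Textbox(label="Result")
    gr.Button("Compare").click(run_comparison, [prompt, models], output)

demo.launch()
```

Because `CheckboxGroup` passes the handler a plain list of selected names, downstream code (including the round-robin evaluator) only ever sees the models the user actually chose.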

## Technical Architecture

### Design Principles

- **Centralized Logic**: The new `universal_model_wrapper.py` acts as a single source of truth for model interaction.
- **Dynamic & Robust**: The evaluation system is no longer static; it adapts to user input, preventing crashes when models are deselected.
- **Separation of Concerns**: Each file has a clear, specific responsibility.
- **Clean Code**: Production-ready and easy to maintain.
- **Hugging Face Compatible**: No external browser dependencies for chart generation.

### Module Responsibilities

| Module | Responsibility |
| --- | --- |
| `app.py` | UI orchestration, including the model selector and deployment |
| `universal_model_wrapper.py` | All LLM calls, prompt logic, and search |
| `response_generator.py` | Connects the UI to the universal wrapper |
| `round_robin_evaluator.py` | Dynamically evaluates the currently selected models |
| `llm_prompt_eval_analysis.py` | Data analysis and visualization |

## Installation

### Prerequisites

- Python 3.8 or higher
- API keys for OpenAI, Anthropic, and Google Generative AI

### Setup Instructions

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd LLM-Comparison-Hub
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Configure API keys: create a `.env` file in the project root with your API keys:

   ```
   OPENAI_API_KEY=your_openai_key_here
   CLAUDE_API_KEY=your_claude_key_here
   GEMINI_API_KEY=your_gemini_key_here
   GOOGLE_API_KEY=your_google_key_here
   GOOGLE_CSE_ID=your_google_cse_id_here
   ```
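
At runtime the app presumably loads these with `python-dotenv`; a minimal sketch of that pattern:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

openai_key = os.getenv("OPENAI_API_KEY")
if not openai_key:
    raise RuntimeError("OPENAI_API_KEY is not set; check your .env file")
```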
    

## API Requirements

### Required APIs

- **OpenAI API**: For GPT-4 responses and ATS scoring
- **Anthropic API**: For Claude 3 responses
- **Google Generative AI**: For Gemini 1.5 responses

### Optional APIs

- **Google Custom Search**: For real-time query enhancement

## Evaluation Metrics

The system evaluates responses on eight criteria:

- **Helpfulness**: How useful and informative is the response?
- **Correctness**: How accurate and factually correct is the response?
- **Coherence**: How well-structured and logical is the response?
- **Tone Score**: How appropriate and professional is the tone?
- **Accuracy**: How precise and detailed is the information?
- **Relevance**: How well does the response address the prompt?
- **Completeness**: How comprehensive is the response?
- **Clarity**: How clear and easy to understand is the response?
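
A judge prompt over these criteria could be assembled roughly as follows; the wording and key names are a sketch, not the prompt `round_robin_evaluator.py` actually uses:

```python
# Sketch of a rubric-based judge prompt covering the eight criteria.
CRITERIA = ["helpfulness", "correctness", "coherence", "tone",
            "accuracy", "relevance", "completeness", "clarity"]

def build_rubric_prompt(prompt: str, response: str) -> str:
    keys = "\n".join(f'- "{c}": integer 1-10' for c in CRITERIA)
    return (
        "Score the response to the prompt below. Return a JSON object "
        f"with exactly these keys:\n{keys}\n\n"
        f"Prompt:\n{prompt}\n\nResponse:\n{response}"
    )
```

Asking the judge model for JSON keeps the verdicts machine-parseable, which is what makes the aggregated charts possible.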

## ATS Scoring System

When a resume and job description are provided, the system performs ATS (Applicant Tracking System) scoring:

- **Keyword Matching**: Identifies relevant skills and qualifications
- **Section Weighting**: Prioritizes important sections
- **Semantic Similarity**: Analyzes meaning and context
- **Recency/Frequency**: Considers experience relevance
- **Penalty Detection**: Identifies potential issues
- **Aggregation**: Provides an overall match score
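
For intuition, here is a toy version of the keyword-matching component alone; the full pipeline above also weights sections, measures semantic similarity, and applies penalties:

```python
import re

def keyword_match_score(resume: str, job_description: str) -> float:
    """Fraction of distinct job-description terms found in the resume."""
    tokenize = lambda text: set(re.findall(r"[a-z0-9+#.]{3,}", text.lower()))
    jd_terms, resume_terms = tokenize(job_description), tokenize(resume)
    return len(jd_terms & resume_terms) / max(len(jd_terms), 1)

print(keyword_match_score("Python and SQL developer", "Seeking Python, SQL, AWS"))
# 0.5 -- "python" and "sql" match out of 4 distinct JD terms
```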

## Output and Results

### Generated Files

- **CSV Files**: Comprehensive evaluation results with timestamps
- **Analysis Reports**: Detailed analysis and insights
- **Interactive Visualizations**: HTML charts and graphs
- **Export Bundles**: ZIP files containing all results and interactive charts

### File Naming Convention

- `evaluation_YYYYMMDD_HHMMSS.csv` - Evaluation results
- `batch_YYYYMMDD_HHMMSS/` - Results directory
- `heatmap.html`, `radar.html`, `barchart.html` - Interactive visualization files
- `bundle.zip` - Complete export package
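
These names can be produced and bundled as sketched below; the app's actual export code may organize files differently:

```python
import zipfile
from datetime import datetime
from pathlib import Path

stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # YYYYMMDD_HHMMSS
out_dir = Path(f"batch_{stamp}")
out_dir.mkdir(exist_ok=True)

csv_path = out_dir / f"evaluation_{stamp}.csv"
csv_path.write_text("evaluator,evaluatee,score\n")  # placeholder content

with zipfile.ZipFile(out_dir / "bundle.zip", "w") as bundle:
    bundle.write(csv_path, arcname=csv_path.name)
```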