---
title: LLM Evaluation Dashboard
emoji: πŸ§ͺ
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.12.0
app_file: app.py
pinned: false
license: mit
short_description: Compare LLMs on reasoning, knowledge & instructions
---

# πŸ§ͺ LLM Evaluation Dashboard

Compare the performance of multiple Large Language Models across reasoning, knowledge, and instruction-following tasks using the HuggingFace Inference API.

## 🎯 What This Does

1. **Benchmark Results** β€” View pre-computed evaluation results across 15 tasks
2. **Interactive Charts** β€” Visualize accuracy and latency comparisons
3. **Live Testing** β€” Test any model with your own custom prompts
4. **Detailed Analysis** β€” Filter and explore results by model and category

## πŸ€– Models Evaluated

| Model | Parameters | Type | Organization |
|-------|------------|------|--------------|
| Mistral-7B-Instruct | 7B | General | Mistral AI |
| Llama-3.2-3B-Instruct | 3B | General | Meta |
| Llama-3.1-70B-Instruct | 70B | General | Meta |
| Qwen2.5-72B-Instruct | 72B | General | Alibaba |
| Qwen2.5-Coder-32B | 32B | Code | Alibaba |

## πŸ“Š Evaluation Categories

### 1. Reasoning (Math & Logic)

Tests mathematical computation and logical deduction abilities.

Example tasks:

- "A store sells apples for $2 each. If I buy 3 apples and pay with $10, how much change do I get?"
- "If all roses are flowers, and some flowers fade quickly, can we conclude that some roses fade quickly?"

### 2. Knowledge (Facts)

Tests factual accuracy across science, history, and geography.

Example tasks:

- "What is the chemical symbol for gold?"
- "What planet is known as the Red Planet?"

### 3. Instruction Following

Tests the ability to follow specific format constraints.

Example tasks:

- "Return a JSON object with keys 'name' and 'age'"
- "List exactly 3 colors, one per line"
- "Write a sentence of exactly 5 words"

## πŸ“ˆ Key Findings

| Category | Best Model | Score |
|----------|------------|-------|
| Overall | Mistral-7B | 80% |
| Reasoning | Qwen2.5-Coder | 80% |
| Knowledge | Mistral-7B, Llama-3.2, Qwen-Coder | 100% |
| Instruction Following | Qwen2.5-72B | 100% |

### Insights

- **Mistral-7B** achieved the best overall accuracy (80%) with the fastest average response time (0.39 s)
- **Qwen2.5-Coder** excelled at reasoning tasks despite being code-focused
- **Qwen2.5-72B** followed instructions perfectly but struggled with reasoning
- Larger models β‰  better performance β€” the 7B Mistral outperformed the 70B+ models

## πŸ”§ Technical Implementation

### Evaluation Pipeline

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  LLM Evaluation Pipeline                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
β”‚  β”‚  15 Tasks   β”‚ β†’  β”‚  5 Models   β”‚ β†’  β”‚  75 Total   β”‚      β”‚
β”‚  β”‚ 3 Categoriesβ”‚    β”‚  HF API     β”‚    β”‚ Evaluations β”‚      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β”‚                                                             β”‚
β”‚                          ↓                                  β”‚
β”‚                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                  Scoring Functions                   β”‚   β”‚
β”‚  β”‚  β€’ contains / contains_lower (substring match)       β”‚   β”‚
β”‚  β”‚  β€’ json_valid (JSON parsing)                         β”‚   β”‚
β”‚  β”‚  β€’ line_count / word_count (format validation)       β”‚   β”‚
β”‚  β”‚  β€’ starts_with_lower (constraint checking)           β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                             β”‚
β”‚                          ↓                                  β”‚
β”‚                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚               Dashboard Visualization                β”‚   β”‚
β”‚  β”‚  β€’ Accuracy bar charts                               β”‚   β”‚
β”‚  β”‚  β€’ Category heatmaps                                 β”‚   β”‚
β”‚  β”‚  β€’ Latency comparisons                               β”‚   β”‚
β”‚  β”‚  β€’ Filterable results table                          β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
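The pipeline can be sketched as a simple nested loop over models and tasks. This is a hedged illustration, not the actual contents of `app.py`: the sample tasks and the `generate` callable (which in the Space would wrap the HF Inference API call) are stand-ins.

```python
import time

# Illustrative task list: each task carries a prompt and a pass/fail check.
# Contents are examples from the README, not the app's real task set.
TASKS = [
    {"category": "knowledge",
     "prompt": "What is the chemical symbol for gold?",
     "check": lambda r: "Au" in r},
    {"category": "reasoning",
     "prompt": "3 apples at $2 each, paid with $10: how much change?",
     "check": lambda r: "4" in r},
]

def evaluate(models, tasks, generate):
    """Run every task against every model; generate(model, prompt) -> text."""
    results = []
    for model in models:
        for task in tasks:
            start = time.time()
            response = generate(model, task["prompt"])
            results.append({
                "model": model,
                "category": task["category"],
                "correct": task["check"](response),
                "latency_s": time.time() - start,
            })
    return results
```

With 5 models and 15 tasks this yields the 75 evaluations shown in the diagram; each row carries the correctness flag and latency that the dashboard later aggregates.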

### Scoring Methods

| Check Type | Description | Example |
|------------|-------------|---------|
| `contains` | Exact substring match | "4" in "The answer is 4" |
| `contains_lower` | Case-insensitive substring match | "mars" in "MARS is red" |
| `json_valid` | Response parses as a JSON object | `{"name": "Alice"}` |
| `line_count` | Correct number of lines | 3 lines for "list 3 colors" |
| `word_count` | Correct word count | 5 words for "5-word sentence" |
| `starts_with_lower` | First word starts with the required letter (case-insensitive) | "Apple" starts with "a" |
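Each check in the table fits in a few lines. A minimal sketch assuming plain-string responses; the names mirror the table, but the exact signatures in `app.py` may differ:

```python
import json

def contains(response, expected):
    """Exact substring match."""
    return expected in response

def contains_lower(response, expected):
    """Case-insensitive substring match."""
    return expected.lower() in response.lower()

def json_valid(response):
    """True if the response contains a parseable JSON object."""
    try:
        # Tolerate prose around the JSON by extracting the outermost {...} span.
        start, end = response.index("{"), response.rindex("}") + 1
        return isinstance(json.loads(response[start:end]), dict)
    except ValueError:  # covers json.JSONDecodeError, a ValueError subclass
        return False

def line_count(response, n):
    """Exactly n non-empty lines."""
    return len([ln for ln in response.strip().splitlines() if ln.strip()]) == n

def word_count(response, n):
    """Exactly n whitespace-separated words."""
    return len(response.strip().split()) == n

def starts_with_lower(response, letter):
    """First word begins with the given letter, case-insensitively."""
    words = response.strip().split()
    return bool(words) and words[0].lower().startswith(letter.lower())
```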

### Tech Stack

| Component | Technology | Purpose |
|-----------|------------|---------|
| Frontend | Gradio | Interactive dashboard UI |
| Visualization | Plotly | Charts and heatmaps |
| LLM Access | HuggingFace Inference API | Free model inference |
| Data | Pandas | Results storage and analysis |
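As one illustration of the Pandas layer, per-evaluation rows can be rolled up into the per-model accuracy and latency figures the charts display. The column names here (`model`, `correct`, `latency_s`) are assumptions, not necessarily those used in `app.py`:

```python
import pandas as pd

def summarize(results):
    """Aggregate per-evaluation rows into per-model accuracy and mean latency."""
    df = pd.DataFrame(results)
    summary = df.groupby("model").agg(
        accuracy=("correct", "mean"),        # fraction of tasks passed
        avg_latency_s=("latency_s", "mean"), # mean response time in seconds
    )
    # Best-performing model first, matching the leaderboard ordering.
    return summary.sort_values("accuracy", ascending=False)
```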

## πŸš€ Live Model Comparison

The dashboard includes a Live Comparison feature where you can:

1. Enter any custom prompt
2. Select which models to compare
3. See responses side-by-side with latency metrics
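The loop behind those steps can be sketched as follows. This is a hedged version: in the Space the calls would presumably go through `huggingface_hub.InferenceClient`, but here each client is injected as any object exposing a `chat_completion(...)` method (the OpenAI-style response shape is assumed), so the side-by-side structure is visible without network access.

```python
import time

def compare(prompt, clients, max_tokens=256):
    """Query each model with the same prompt; clients maps name -> chat client."""
    rows = []
    for name, client in clients.items():
        start = time.time()
        out = client.chat_completion(
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        rows.append({
            "model": name,
            "response": out.choices[0].message.content,
            "latency_s": round(time.time() - start, 2),
        })
    return rows
```

Keeping the client injectable also makes rate-limit handling (one of the limitations below) easy to bolt on with a retrying wrapper.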

## ⚠️ Limitations

- **Rate Limiting**: The HF Inference API has rate limits, so some models may time out
- **Task Coverage**: 15 tasks are a sample, not a comprehensive benchmark
- **Single Run**: Results come from one evaluation run (no statistical averaging)

## πŸŽ“ What This Project Demonstrates

- **LLM Evaluation Design** β€” Creating meaningful benchmarks
- **API Integration** β€” Working with the HuggingFace Inference API
- **Data Visualization** β€” Building interactive dashboards
- **Scoring Systems** β€” Implementing automated evaluation metrics

## πŸ‘€ Author

Nav772 β€” Built as part of an AI/ML Engineering portfolio.

## πŸ“„ License

MIT License