---
title: LLM Evaluation Dashboard
emoji: 🧪
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.12.0
app_file: app.py
pinned: false
license: mit
short_description: Compare LLMs on reasoning, knowledge & instructions
---
# 🧪 LLM Evaluation Dashboard
Compare the performance of multiple Large Language Models across reasoning, knowledge, and instruction-following tasks using the HuggingFace Inference API.
## 🎯 What This Does

- **Benchmark Results** – view pre-computed evaluation results across 15 tasks
- **Interactive Charts** – visualize accuracy and latency comparisons
- **Live Testing** – test any model with your own custom prompts
- **Detailed Analysis** – filter and explore results by model and category
## 🤖 Models Evaluated
| Model | Parameters | Type | Organization |
|---|---|---|---|
| Mistral-7B-Instruct | 7B | General | Mistral AI |
| Llama-3.2-3B-Instruct | 3B | General | Meta |
| Llama-3.1-70B-Instruct | 70B | General | Meta |
| Qwen2.5-72B-Instruct | 72B | General | Alibaba |
| Qwen2.5-Coder-32B | 32B | Code | Alibaba |
## 📊 Evaluation Categories

### 1. Reasoning (Math & Logic)
Tests mathematical computation and logical deduction abilities.
Example tasks:
- "A store sells apples for $2 each. If I buy 3 apples and pay with $10, how much change do I get?"
- "If all roses are flowers, and some flowers fade quickly, can we conclude that some roses fade quickly?"
### 2. Knowledge (Facts)
Tests factual accuracy across science, history, and geography.
Example tasks:
- "What is the chemical symbol for gold?"
- "What planet is known as the Red Planet?"
### 3. Instruction Following
Tests ability to follow specific format constraints.
Example tasks:
- "Return a JSON object with keys 'name' and 'age'"
- "List exactly 3 colors, one per line"
- "Write a sentence of exactly 5 words"
## 📈 Key Findings
| Category | Best Model | Score |
|---|---|---|
| Overall | Mistral-7B | 80% |
| Reasoning | Qwen2.5-Coder | 80% |
| Knowledge | Mistral-7B, Llama-3.2, Qwen-Coder | 100% |
| Instruction Following | Qwen2.5-72B | 100% |
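The per-category scores above are aggregated from individual task results. Below is a minimal sketch of that aggregation in pure Python with hypothetical records (the actual Space stores results in a pandas DataFrame, but the grouping logic is the same):

```python
from collections import defaultdict

# Hypothetical per-task records; real runs produce 75 of these.
results = [
    {"model": "Mistral-7B", "category": "knowledge", "passed": True},
    {"model": "Mistral-7B", "category": "knowledge", "passed": True},
    {"model": "Qwen2.5-72B", "category": "knowledge", "passed": False},
    {"model": "Qwen2.5-72B", "category": "instruction", "passed": True},
]

def accuracy_by(results, keys):
    """Fraction of passed tasks, grouped by the given record keys."""
    totals, passes = defaultdict(int), defaultdict(int)
    for r in results:
        group = tuple(r[k] for k in keys)
        totals[group] += 1
        passes[group] += r["passed"]  # bool counts as 0/1
    return {g: passes[g] / totals[g] for g in totals}

scores = accuracy_by(results, ["model", "category"])

# Pick the best model per category, as in the table above.
best = {}
for (model, cat), acc in scores.items():
    if cat not in best or acc > best[cat][1]:
        best[cat] = (model, acc)
```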
### Insights
- **Mistral-7B** achieved the best overall accuracy (80%) with the fastest average response time (0.39 s)
- **Qwen2.5-Coder** excelled at reasoning tasks despite being code-focused
- **Qwen2.5-72B** had perfect instruction following but struggled with reasoning
- **Larger ≠ better** – the 7B Mistral outperformed the 70B+ models
## 🔧 Technical Implementation

### Evaluation Pipeline
```
┌───────────────────────────────────────────────────────────┐
│                  LLM Evaluation Pipeline                  │
├───────────────────────────────────────────────────────────┤
│                                                           │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    │
│  │  15 Tasks   │ ×  │  5 Models   │ =  │  75 Total   │    │
│  │ 3 Categories│    │   HF API    │    │ Evaluations │    │
│  └─────────────┘    └─────────────┘    └─────────────┘    │
│                           │                               │
│                           ▼                               │
│  ┌─────────────────────────────────────────────────────┐  │
│  │                 Scoring Functions                   │  │
│  │  • contains / contains_lower (substring match)      │  │
│  │  • json_valid (JSON parsing)                        │  │
│  │  • line_count / word_count (format validation)      │  │
│  │  • starts_with_lower (constraint checking)          │  │
│  └─────────────────────────────────────────────────────┘  │
│                           │                               │
│                           ▼                               │
│  ┌─────────────────────────────────────────────────────┐  │
│  │              Dashboard Visualization                │  │
│  │  • Accuracy bar charts                              │  │
│  │  • Category heatmaps                                │  │
│  │  • Latency comparisons                              │  │
│  │  • Filterable results table                         │  │
│  └─────────────────────────────────────────────────────┘  │
│                                                           │
└───────────────────────────────────────────────────────────┘
```
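The pipeline is essentially a nested loop over models and tasks, recording each output and its latency. In the sketch below, `query_model` is a stub standing in for the real HuggingFace Inference API call (the app likely uses `huggingface_hub`'s `InferenceClient`), and the model ID and task are illustrative:

```python
import time

def query_model(model_id, prompt):
    # Stub: the real implementation would call the HF Inference API here.
    return f"[{model_id}] response to: {prompt}"

def run_evaluation(models, tasks):
    """Run every task against every model and record output + latency."""
    records = []
    for model_id in models:
        for task in tasks:
            start = time.perf_counter()
            output = query_model(model_id, task["prompt"])
            latency = time.perf_counter() - start
            records.append({
                "model": model_id,
                "category": task["category"],
                "output": output,
                "latency_s": round(latency, 3),
            })
    return records

tasks = [{"prompt": "What is 2 + 2?", "category": "reasoning"}]
records = run_evaluation(["mistralai/Mistral-7B-Instruct-v0.3"], tasks)
```

With 5 models and 15 tasks, this loop yields the 75 evaluation records that feed the dashboard.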
### Scoring Methods

| Check Type | Description | Example |
|---|---|---|
| `contains` | Exact substring match | `"4"` in `"The answer is 4"` |
| `contains_lower` | Case-insensitive substring match | `"mars"` in `"MARS is red"` |
| `json_valid` | Response parses as a JSON object | `{"name": "Alice"}` |
| `line_count` | Correct number of lines | 3 lines for "list 3 colors" |
| `word_count` | Correct word count | 5 words for "5-word sentence" |
| `starts_with_lower` | First word starts with the required letter | `"Apple"` starts with `"a"` |
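These checks are simple enough to sketch in a few lines of Python; the Space's actual function names and signatures may differ:

```python
import json

def contains(response, expected):
    return expected in response

def contains_lower(response, expected):
    return expected.lower() in response.lower()

def json_valid(response):
    # Passes only if the response is a parseable JSON object.
    try:
        return isinstance(json.loads(response.strip()), dict)
    except json.JSONDecodeError:
        return False

def line_count(response, n):
    # Count non-empty lines to tolerate trailing whitespace.
    return len([ln for ln in response.strip().splitlines() if ln.strip()]) == n

def word_count(response, n):
    return len(response.split()) == n

def starts_with_lower(response, letter):
    words = response.split()
    return bool(words) and words[0].lower().startswith(letter.lower())
```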
### Tech Stack
| Component | Technology | Purpose |
|---|---|---|
| Frontend | Gradio | Interactive dashboard UI |
| Visualization | Plotly | Charts and heatmaps |
| LLM Access | HuggingFace Inference API | Free model inference |
| Data | Pandas | Results storage and analysis |
## 🚀 Live Model Comparison
The dashboard includes a Live Comparison feature where you can:
- Enter any custom prompt
- Select which models to compare
- See responses side-by-side with latency metrics
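One way such a side-by-side comparison might work is to query the selected models concurrently and time each call. This is a sketch under that assumption, with `query_model` stubbed in place of the real Inference API call:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def query_model(model_id, prompt):
    # Stub simulating a network call to the HF Inference API.
    time.sleep(0.01)
    return f"[{model_id}] answer"

def compare(prompt, model_ids):
    """Return (model_id, response, latency_s) for each selected model."""
    def run_one(model_id):
        start = time.perf_counter()
        text = query_model(model_id, prompt)
        return model_id, text, time.perf_counter() - start
    # One thread per model so responses arrive roughly in parallel.
    with ThreadPoolExecutor(max_workers=len(model_ids)) as pool:
        return list(pool.map(run_one, model_ids))

rows = compare("Name a red planet.", ["model-a", "model-b"])
```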
## ⚠️ Limitations

- **Rate Limiting**: the HF Inference API enforces rate limits, so some models may time out
- **Task Coverage**: 15 tasks are a sample, not a comprehensive benchmark
- **Single Run**: results come from one evaluation run, with no statistical averaging
## 🎓 What This Project Demonstrates

- **LLM Evaluation Design** – creating meaningful benchmarks
- **API Integration** – working with the HuggingFace Inference API
- **Data Visualization** – building interactive dashboards
- **Scoring Systems** – implementing automated evaluation metrics
## 👤 Author

**Nav772** – Built as part of an AI/ML Engineering portfolio.
## 📄 License
MIT License