---
title: LLM Evaluation Dashboard
emoji: πŸ§ͺ
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.12.0
app_file: app.py
pinned: false
license: mit
short_description: Compare LLMs on reasoning, knowledge & instructions
---

# πŸ§ͺ LLM Evaluation Dashboard

Compare the performance of multiple Large Language Models across reasoning, knowledge, and instruction-following tasks using the HuggingFace Inference API.

## 🎯 What This Does

1. **Benchmark Results** β€” View pre-computed evaluation results across 15 tasks
2. **Interactive Charts** β€” Visualize accuracy and latency comparisons
3. **Live Testing** β€” Test any model with your own custom prompts
4. **Detailed Analysis** β€” Filter and explore results by model and category

## πŸ€– Models Evaluated

| Model | Parameters | Type | Organization |
|-------|------------|------|--------------|
| Mistral-7B-Instruct | 7B | General | Mistral AI |
| Llama-3.2-3B-Instruct | 3B | General | Meta |
| Llama-3.1-70B-Instruct | 70B | General | Meta |
| Qwen2.5-72B-Instruct | 72B | General | Alibaba |
| Qwen2.5-Coder-32B | 32B | Code | Alibaba |

## πŸ“Š Evaluation Categories

### 1. Reasoning (Math & Logic)

Tests mathematical computation and logical deduction abilities.

Example tasks:

- "A store sells apples for $2 each. If I buy 3 apples and pay with $10, how much change do I get?"
- "If all roses are flowers, and some flowers fade quickly, can we conclude that some roses fade quickly?"

### 2. Knowledge (Facts)

Tests factual accuracy across science, history, and geography.

Example tasks:

- "What is the chemical symbol for gold?"
- "What planet is known as the Red Planet?"

### 3. Instruction Following

Tests the ability to follow specific format constraints.

Example tasks:

- "Return a JSON object with keys 'name' and 'age'"
- "List exactly 3 colors, one per line"
- "Write a sentence of exactly 5 words"

## πŸ“ˆ Key Findings

| Category | Best Model | Score |
|----------|------------|-------|
| Overall | Mistral-7B | 80% |
| Reasoning | Qwen2.5-Coder | 80% |
| Knowledge | Mistral-7B, Llama-3.2, Qwen-Coder | 100% |
| Instruction Following | Qwen2.5-72B | 100% |

### Insights

- **Mistral-7B** achieved the best overall accuracy (80%) with the fastest average response time (0.39 s)
- **Qwen2.5-Coder** excelled at reasoning tasks despite being code-focused
- **Qwen2.5-72B** followed instructions perfectly but struggled with reasoning
- Larger models β‰  better performance β€” the 7B Mistral outperformed the 70B+ models

## πŸ”§ Technical Implementation

### Evaluation Pipeline

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  LLM Evaluation Pipeline                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
β”‚  β”‚  15 Tasks   β”‚ β†’  β”‚  5 Models   β”‚ β†’  β”‚  75 Total   β”‚      β”‚
β”‚  β”‚ 3 Categoriesβ”‚    β”‚  HF API     β”‚    β”‚ Evaluations β”‚      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β”‚                                                             β”‚
β”‚                          ↓                                  β”‚
β”‚                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                  Scoring Functions                   β”‚   β”‚
β”‚  β”‚  β€’ contains / contains_lower (substring match)       β”‚   β”‚
β”‚  β”‚  β€’ json_valid (JSON parsing)                         β”‚   β”‚
β”‚  β”‚  β€’ line_count / word_count (format validation)       β”‚   β”‚
β”‚  β”‚  β€’ starts_with_lower (constraint checking)           β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                             β”‚
β”‚                          ↓                                  β”‚
β”‚                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚               Dashboard Visualization                β”‚   β”‚
β”‚  β”‚  β€’ Accuracy bar charts                               β”‚   β”‚
β”‚  β”‚  β€’ Category heatmaps                                 β”‚   β”‚
β”‚  β”‚  β€’ Latency comparisons                               β”‚   β”‚
β”‚  β”‚  β€’ Filterable results table                          β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
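The pipeline can be sketched as a simple nested loop over models and tasks. This is a hedged illustration, not the actual contents of `app.py`: the sample tasks and the `generate` callable (which in the Space would wrap the HF Inference API call) are stand-ins.

```python
import time

# Illustrative task list: each task carries a prompt and a pass/fail check.
# Contents are examples from the README, not the app's real task set.
TASKS = [
    {"category": "knowledge",
     "prompt": "What is the chemical symbol for gold?",
     "check": lambda r: "Au" in r},
    {"category": "reasoning",
     "prompt": "3 apples at $2 each, paid with $10: how much change?",
     "check": lambda r: "4" in r},
]

def evaluate(models, tasks, generate):
    """Run every task against every model; generate(model, prompt) -> text."""
    results = []
    for model in models:
        for task in tasks:
            start = time.time()
            response = generate(model, task["prompt"])
            results.append({
                "model": model,
                "category": task["category"],
                "correct": task["check"](response),
                "latency_s": time.time() - start,
            })
    return results
```

With 5 models and 15 tasks this yields the 75 evaluations shown in the diagram; each row carries the correctness flag and latency that the dashboard later aggregates.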

### Scoring Methods

| Check Type | Description | Example |
|------------|-------------|---------|
| `contains` | Exact substring match | "4" in "The answer is 4" |
| `contains_lower` | Case-insensitive substring match | "mars" in "MARS is red" |
| `json_valid` | Response parses as a JSON object | `{"name": "Alice"}` |
| `line_count` | Correct number of lines | 3 lines for "list 3 colors" |
| `word_count` | Correct word count | 5 words for "5-word sentence" |
| `starts_with_lower` | First word starts with the required letter (case-insensitive) | "Apple" starts with "a" |
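Each check in the table fits in a few lines. A minimal sketch assuming plain-string responses; the names mirror the table, but the exact signatures in `app.py` may differ:

```python
import json

def contains(response, expected):
    """Exact substring match."""
    return expected in response

def contains_lower(response, expected):
    """Case-insensitive substring match."""
    return expected.lower() in response.lower()

def json_valid(response):
    """True if the response contains a parseable JSON object."""
    try:
        # Tolerate prose around the JSON by extracting the outermost {...} span.
        start, end = response.index("{"), response.rindex("}") + 1
        return isinstance(json.loads(response[start:end]), dict)
    except ValueError:  # covers json.JSONDecodeError, a ValueError subclass
        return False

def line_count(response, n):
    """Exactly n non-empty lines."""
    return len([ln for ln in response.strip().splitlines() if ln.strip()]) == n

def word_count(response, n):
    """Exactly n whitespace-separated words."""
    return len(response.strip().split()) == n

def starts_with_lower(response, letter):
    """First word begins with the given letter, case-insensitively."""
    words = response.strip().split()
    return bool(words) and words[0].lower().startswith(letter.lower())
```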

### Tech Stack

| Component | Technology | Purpose |
|-----------|------------|---------|
| Frontend | Gradio | Interactive dashboard UI |
| Visualization | Plotly | Charts and heatmaps |
| LLM Access | HuggingFace Inference API | Free model inference |
| Data | Pandas | Results storage and analysis |
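As one illustration of the Pandas layer, per-evaluation rows can be rolled up into the per-model accuracy and latency figures the charts display. The column names here (`model`, `correct`, `latency_s`) are assumptions, not necessarily those used in `app.py`:

```python
import pandas as pd

def summarize(results):
    """Aggregate per-evaluation rows into per-model accuracy and mean latency."""
    df = pd.DataFrame(results)
    summary = df.groupby("model").agg(
        accuracy=("correct", "mean"),        # fraction of tasks passed
        avg_latency_s=("latency_s", "mean"), # mean response time in seconds
    )
    # Best-performing model first, matching the leaderboard ordering.
    return summary.sort_values("accuracy", ascending=False)
```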

## πŸš€ Live Model Comparison

The dashboard includes a Live Comparison feature where you can:

1. Enter any custom prompt
2. Select which models to compare
3. See responses side-by-side with latency metrics
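The loop behind those steps can be sketched as follows. This is a hedged version: in the Space the calls would presumably go through `huggingface_hub.InferenceClient`, but here each client is injected as any object exposing a `chat_completion(...)` method (the OpenAI-style response shape is assumed), so the side-by-side structure is visible without network access.

```python
import time

def compare(prompt, clients, max_tokens=256):
    """Query each model with the same prompt; clients maps name -> chat client."""
    rows = []
    for name, client in clients.items():
        start = time.time()
        out = client.chat_completion(
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        rows.append({
            "model": name,
            "response": out.choices[0].message.content,
            "latency_s": round(time.time() - start, 2),
        })
    return rows
```

Keeping the client injectable also makes rate-limit handling (one of the limitations below) easy to bolt on with a retrying wrapper.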

## ⚠️ Limitations

- **Rate Limiting**: The HF Inference API has rate limits, so some models may time out
- **Task Coverage**: 15 tasks are a sample, not a comprehensive benchmark
- **Single Run**: Results come from one evaluation run (no statistical averaging)

## πŸŽ“ What This Project Demonstrates

- **LLM Evaluation Design** β€” Creating meaningful benchmarks
- **API Integration** β€” Working with the HuggingFace Inference API
- **Data Visualization** β€” Building interactive dashboards
- **Scoring Systems** β€” Implementing automated evaluation metrics

## πŸ‘€ Author

Nav772 β€” Built as part of an AI/ML Engineering portfolio.

## πŸ“„ License

MIT License