---
title: LLM Evaluation Dashboard
emoji: πŸ§ͺ
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.12.0
app_file: app.py
pinned: false
license: mit
short_description: Compare LLMs on reasoning, knowledge & instructions
---
# 🧪 LLM Evaluation Dashboard
Compare the performance of multiple Large Language Models across reasoning, knowledge, and instruction-following tasks using the HuggingFace Inference API.
## 🎯 What This Does
1. **Benchmark Results** – View pre-computed evaluation results across 15 tasks
2. **Interactive Charts** – Visualize accuracy and latency comparisons
3. **Live Testing** – Test any model with your own custom prompts
4. **Detailed Analysis** – Filter and explore results by model and category
## 🤖 Models Evaluated
| Model | Parameters | Type | Organization |
|-------|------------|------|--------------|
| Mistral-7B-Instruct | 7B | General | Mistral AI |
| Llama-3.2-3B-Instruct | 3B | General | Meta |
| Llama-3.1-70B-Instruct | 70B | General | Meta |
| Qwen2.5-72B-Instruct | 72B | General | Alibaba |
| Qwen2.5-Coder-32B | 32B | Code | Alibaba |
## 📊 Evaluation Categories
### 1. Reasoning (Math & Logic)
Tests mathematical computation and logical deduction abilities.
**Example tasks:**
- "A store sells apples for $2 each. If I buy 3 apples and pay with $10, how much change do I get?"
- "If all roses are flowers, and some flowers fade quickly, can we conclude that some roses fade quickly?"
### 2. Knowledge (Facts)
Tests factual accuracy across science, history, and geography.
**Example tasks:**
- "What is the chemical symbol for gold?"
- "What planet is known as the Red Planet?"
### 3. Instruction Following
Tests ability to follow specific format constraints.
**Example tasks:**
- "Return a JSON object with keys 'name' and 'age'"
- "List exactly 3 colors, one per line"
- "Write a sentence of exactly 5 words"
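The example tasks above lend themselves to a declarative encoding that pairs each prompt with the check used to score it. The Space's actual task file isn't shown here, so this is only a sketch of one plausible schema (the field names `category`, `prompt`, and `check` are assumptions):

```python
# Hypothetical task schema - one entry per benchmark task.
# Each task carries the prompt sent to the model and the check
# that the scoring step will apply to the response.
TASKS = [
    {
        "category": "reasoning",
        "prompt": "A store sells apples for $2 each. If I buy 3 apples "
                  "and pay with $10, how much change do I get?",
        "check": {"type": "contains", "value": "4"},
    },
    {
        "category": "knowledge",
        "prompt": "What is the chemical symbol for gold?",
        "check": {"type": "contains", "value": "Au"},
    },
    {
        "category": "instruction_following",
        "prompt": "List exactly 3 colors, one per line",
        "check": {"type": "line_count", "value": 3},
    },
]
```

Keeping tasks as plain data like this makes it easy to grow the benchmark without touching the evaluation loop.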
## 📈 Key Findings
| Category | Best Model | Score |
|----------|------------|-------|
| **Overall** | Mistral-7B | 80% |
| **Reasoning** | Qwen2.5-Coder | 80% |
| **Knowledge** | Mistral-7B, Llama-3.2-3B, Qwen2.5-Coder | 100% |
| **Instruction Following** | Qwen2.5-72B | 100% |
### Insights
- **Mistral-7B** achieved the best overall accuracy (80%) with the fastest response time (0.39s avg)
- **Qwen2.5-Coder** excelled at reasoning tasks despite being code-focused
- **Qwen2.5-72B** had perfect instruction following but struggled with reasoning
- **Larger models ≠ better performance** – the 7B Mistral outperformed 70B+ models
## 🔧 Technical Implementation
### Evaluation Pipeline
```
┌──────────────────────────────────────────────────────────────┐
│                   LLM Evaluation Pipeline                    │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐     │
│  │  15 Tasks   │  →  │  5 Models   │  →  │  75 Total   │     │
│  │ 3 Categories│     │   HF API    │     │ Evaluations │     │
│  └─────────────┘     └─────────────┘     └─────────────┘     │
│                              ↓                               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                   Scoring Functions                    │  │
│  │  • contains / contains_lower (substring match)         │  │
│  │  • json_valid (JSON parsing)                           │  │
│  │  • line_count / word_count (format validation)         │  │
│  │  • starts_with_lower (constraint checking)             │  │
│  └────────────────────────────────────────────────────────┘  │
│                              ↓                               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                Dashboard Visualization                 │  │
│  │  • Accuracy bar charts                                 │  │
│  │  • Category heatmaps                                   │  │
│  │  • Latency comparisons                                 │  │
│  │  • Filterable results table                            │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
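In code, this pipeline reduces to a nested loop over models and tasks. The app's actual implementation isn't shown here, so this sketch injects the model call as a plain `generate(model_id, prompt) -> str` callable (in the real Space this would presumably wrap the HuggingFace Inference API); all names are assumptions:

```python
import time

def run_evaluation(models, tasks, generate, score):
    """Evaluate every model on every task.

    generate: callable (model_id, prompt) -> response text
    score:    callable (response, check) -> bool
    Returns one result dict per (model, task) pair - 5 models x 15 tasks
    yields the 75 evaluations shown in the diagram.
    """
    results = []
    for model in models:
        for task in tasks:
            start = time.perf_counter()
            response = generate(model, task["prompt"])
            latency = time.perf_counter() - start
            results.append({
                "model": model,
                "category": task["category"],
                "correct": score(response, task["check"]),
                "latency_s": round(latency, 3),
            })
    return results
```

Injecting `generate` keeps the loop testable offline with a stub instead of live API calls.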
### Scoring Methods
| Check Type | Description | Example |
|------------|-------------|---------|
| `contains` | Exact substring match | "4" in "The answer is 4" |
| `contains_lower` | Case-insensitive match | "mars" in "MARS is red" |
| `json_valid` | Valid JSON object | `{"name": "Alice"}` |
| `line_count` | Correct number of lines | 3 lines for "list 3 colors" |
| `word_count` | Correct word count | 5 words for "5-word sentence" |
| `starts_with_lower` | Response begins with a given letter (case-insensitive) | "Apple" starts with "a" |
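Each check type in the table is a small pure function over the response text. The Space's real scoring code isn't shown, so the signatures below are a sketch (function names mirror the check types):

```python
import json

def contains(response: str, expected: str) -> bool:
    # Case-sensitive substring match.
    return expected in response

def contains_lower(response: str, expected: str) -> bool:
    # Substring match after lowercasing both sides.
    return expected.lower() in response.lower()

def json_valid(response: str) -> bool:
    # Passes only if the response parses as a JSON object.
    try:
        return isinstance(json.loads(response), dict)
    except json.JSONDecodeError:
        return False

def line_count(response: str, expected: int) -> bool:
    # Counts non-empty lines.
    return len([ln for ln in response.splitlines() if ln.strip()]) == expected

def word_count(response: str, expected: int) -> bool:
    # Whitespace-delimited word count.
    return len(response.split()) == expected

def starts_with_lower(response: str, letter: str) -> bool:
    # First character match, case-insensitive.
    stripped = response.strip()
    return bool(stripped) and stripped[0].lower() == letter.lower()
```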
### Tech Stack
| Component | Technology | Purpose |
|-----------|------------|---------|
| **Frontend** | Gradio | Interactive dashboard UI |
| **Visualization** | Plotly | Charts and heatmaps |
| **LLM Access** | HuggingFace Inference API | Free model inference |
| **Data** | Pandas | Results storage and analysis |
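With results in a Pandas DataFrame, the per-model numbers behind the accuracy charts and category heatmaps are one `groupby` away. A sketch with toy data (the column names are assumptions, not the Space's actual schema):

```python
import pandas as pd

# Toy results table: one row per (model, task) run.
df = pd.DataFrame([
    {"model": "Mistral-7B",  "category": "knowledge", "correct": True,  "latency_s": 0.35},
    {"model": "Mistral-7B",  "category": "reasoning", "correct": True,  "latency_s": 0.42},
    {"model": "Qwen2.5-72B", "category": "knowledge", "correct": True,  "latency_s": 1.10},
    {"model": "Qwen2.5-72B", "category": "reasoning", "correct": False, "latency_s": 1.30},
])

# Overall accuracy and mean latency per model
# (booleans average to a 0-1 accuracy rate).
summary = df.groupby("model").agg(
    accuracy=("correct", "mean"),
    avg_latency_s=("latency_s", "mean"),
)

# Model x category accuracy matrix - the shape behind a heatmap.
heatmap = df.pivot_table(index="model", columns="category",
                         values="correct", aggfunc="mean")
```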
## 🚀 Live Model Comparison
The dashboard includes a **Live Comparison** feature where you can:
1. Enter any custom prompt
2. Select which models to compare
3. See responses side-by-side with latency metrics
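Stripped of the Gradio UI, the live comparison is a timed fan-out of one prompt over the selected models. A sketch with the model call injected as a callable (all names here are assumptions, not the Space's actual code):

```python
import time

def compare_models(prompt, model_ids, generate):
    """Query each selected model with the same prompt.

    generate: callable (model_id, prompt) -> response text, e.g. a thin
    wrapper around the HF Inference API. Returns one row per model with
    its response and latency, ready to render side by side.
    """
    rows = []
    for model_id in model_ids:
        start = time.perf_counter()
        try:
            response = generate(model_id, prompt)
        except Exception as exc:  # surface per-model API errors in the UI
            response = f"[error] {exc}"
        rows.append({
            "model": model_id,
            "response": response,
            "latency_s": round(time.perf_counter() - start, 2),
        })
    return rows
```

Catching exceptions per model means one rate-limited endpoint doesn't blank out the whole comparison.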
## ⚠️ Limitations
- **Rate Limiting:** the HF Inference API enforces rate limits, so some models may time out
- **Task Coverage:** the 15 tasks are a sample, not a comprehensive benchmark
- **Single Run:** Results from one evaluation run (no statistical averaging)
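A common mitigation for the rate limits noted above is to wrap each API call in a retry with exponential backoff. A generic sketch (the Space's actual error handling isn't shown; all names are assumptions):

```python
import time

def with_retries(call, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run `call()` and retry on any exception, doubling the wait
    between attempts (1s, 2s, 4s, ...). The last failure is re-raised.

    `sleep` is injectable so tests don't actually wait.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```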
## 🎓 What This Project Demonstrates
- **LLM Evaluation Design** – Creating meaningful benchmarks
- **API Integration** – Working with the HuggingFace Inference API
- **Data Visualization** – Building interactive dashboards
- **Scoring Systems** – Implementing automated evaluation metrics
## 👤 Author
**[Nav772](https://huggingface.co/Nav772)** – Built as part of an AI/ML Engineering portfolio.
## 📄 License
MIT License