---
title: LLM Evaluation Dashboard
emoji: 🧪
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.12.0
app_file: app.py
pinned: false
license: mit
short_description: Compare LLMs on reasoning, knowledge & instructions
---
# 🧪 LLM Evaluation Dashboard
Compare the performance of multiple Large Language Models across reasoning, knowledge, and instruction-following tasks using the HuggingFace Inference API.
## 🎯 What This Does
1. **Benchmark Results** – View pre-computed evaluation results across 15 tasks
2. **Interactive Charts** – Visualize accuracy and latency comparisons
3. **Live Testing** – Test any model with your own custom prompts
4. **Detailed Analysis** – Filter and explore results by model and category
## 🤖 Models Evaluated
| Model | Parameters | Type | Organization |
|-------|------------|------|--------------|
| Mistral-7B-Instruct | 7B | General | Mistral AI |
| Llama-3.2-3B-Instruct | 3B | General | Meta |
| Llama-3.1-70B-Instruct | 70B | General | Meta |
| Qwen2.5-72B-Instruct | 72B | General | Alibaba |
| Qwen2.5-Coder-32B | 32B | Code | Alibaba |
## 📊 Evaluation Categories
### 1. Reasoning (Math & Logic)
Tests mathematical computation and logical deduction abilities.
**Example tasks:**
- "A store sells apples for $2 each. If I buy 3 apples and pay with $10, how much change do I get?"
- "If all roses are flowers, and some flowers fade quickly, can we conclude that some roses fade quickly?"
### 2. Knowledge (Facts)
Tests factual accuracy across science, history, and geography.
**Example tasks:**
- "What is the chemical symbol for gold?"
- "What planet is known as the Red Planet?"
### 3. Instruction Following
Tests ability to follow specific format constraints.
**Example tasks:**
- "Return a JSON object with keys 'name' and 'age'"
- "List exactly 3 colors, one per line"
- "Write a sentence of exactly 5 words"
## 📈 Key Findings
| Category | Best Model | Score |
|----------|------------|-------|
| **Overall** | Mistral-7B | 80% |
| **Reasoning** | Qwen2.5-Coder | 80% |
| **Knowledge** | Mistral-7B, Llama-3.2, Qwen-Coder | 100% |
| **Instruction Following** | Qwen2.5-72B | 100% |
### Insights
- **Mistral-7B** achieved the best overall accuracy (80%) with the fastest response time (0.39s avg)
- **Qwen2.5-Coder** excelled at reasoning tasks despite being code-focused
- **Qwen2.5-72B** had perfect instruction following but struggled with reasoning
- **Larger models ≠ better performance** – the 7B Mistral outperformed 70B+ models
## 🔧 Technical Implementation
### Evaluation Pipeline
```
                   LLM Evaluation Pipeline

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   15 Tasks   │ →  │   5 Models   │ →  │   75 Total   │
│ 3 Categories │    │    HF API    │    │  Evaluations │
└──────────────┘    └──────────────┘    └──────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────┐
│                  Scoring Functions                   │
│  • contains / contains_lower (substring match)       │
│  • json_valid (JSON parsing)                         │
│  • line_count / word_count (format validation)       │
│  • starts_with_lower (constraint checking)           │
└──────────────────────────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────┐
│               Dashboard Visualization                │
│  • Accuracy bar charts                               │
│  • Category heatmaps                                 │
│  • Latency comparisons                               │
│  • Filterable results table                          │
└──────────────────────────────────────────────────────┘
```
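In code, the pipeline is essentially a nested loop over tasks and models. A minimal sketch of that loop, with the model call stubbed out (in the real app it goes through the HF Inference API; the helper names and model ID strings here are illustrative, not the actual `app.py` API):

```python
import time

# Illustrative task format: (prompt, category, check function)
TASKS = [
    ("What is the chemical symbol for gold?", "knowledge",
     lambda r: "au" in r.lower()),
    ("List exactly 3 colors, one per line", "instruction",
     lambda r: len(r.splitlines()) == 3),
]

MODELS = ["Mistral-7B-Instruct", "Llama-3.2-3B-Instruct"]

def fake_generate(model: str, prompt: str) -> str:
    """Stand-in for the real HF Inference API call."""
    return "Au" if "gold" in prompt else "red\ngreen\nblue"

def run_eval(generate=fake_generate):
    results = []
    for model in MODELS:
        for prompt, category, check in TASKS:
            start = time.perf_counter()
            response = generate(model, prompt)
            latency = time.perf_counter() - start
            results.append({
                "model": model,
                "category": category,
                "correct": check(response),
                "latency_s": round(latency, 3),
            })
    return results

results = run_eval()  # 2 models x 2 tasks -> 4 result rows
```

Swapping `fake_generate` for a real API-backed function turns this into the full 5 models × 15 tasks = 75-evaluation run.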
### Scoring Methods
| Check Type | Description | Example |
|------------|-------------|---------|
| `contains` | Case-sensitive substring match | "4" in "The answer is 4" |
| `contains_lower` | Case-insensitive match | "mars" in "MARS is red" |
| `json_valid` | Valid JSON object | `{"name": "Alice"}` |
| `line_count` | Correct number of lines | 3 lines for "list 3 colors" |
| `word_count` | Correct word count | 5 words for "5-word sentence" |
| `starts_with_lower` | First word begins with a given letter (case-insensitive) | "Apple" starts with "a" |
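The checks in the table are simple enough to sketch in a few lines of Python. These implementations follow the table's descriptions; the exact code in `app.py` may differ:

```python
import json

def contains(response: str, expected: str) -> bool:
    """Case-sensitive substring match."""
    return expected in response

def contains_lower(response: str, expected: str) -> bool:
    """Case-insensitive substring match."""
    return expected.lower() in response.lower()

def json_valid(response: str) -> bool:
    """True if the response parses as a JSON object."""
    try:
        return isinstance(json.loads(response.strip()), dict)
    except json.JSONDecodeError:
        return False

def line_count(response: str, expected: int) -> bool:
    """True if the response has exactly `expected` non-empty lines."""
    return len([ln for ln in response.splitlines() if ln.strip()]) == expected

def word_count(response: str, expected: int) -> bool:
    """True if the response has exactly `expected` words."""
    return len(response.split()) == expected

def starts_with_lower(response: str, letter: str) -> bool:
    """True if the first word starts with `letter`, case-insensitively."""
    words = response.split()
    return bool(words) and words[0].lower().startswith(letter.lower())
```

For example, `contains_lower("MARS is red", "mars")` passes while `contains("MARS is red", "mars")` fails, which is why knowledge tasks use the case-insensitive check.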
### Tech Stack
| Component | Technology | Purpose |
|-----------|------------|---------|
| **Frontend** | Gradio | Interactive dashboard UI |
| **Visualization** | Plotly | Charts and heatmaps |
| **LLM Access** | HuggingFace Inference API | Free model inference |
| **Data** | Pandas | Results storage and analysis |
## 🚀 Live Model Comparison
The dashboard includes a **Live Comparison** feature where you can:
1. Enter any custom prompt
2. Select which models to compare
3. See responses side-by-side with latency metrics
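The three steps above boil down to timing one generate call per selected model. A minimal sketch with stubbed-out responders standing in for the API-backed models (all names here are illustrative):

```python
import time

def compare_models(prompt, generators):
    """Run `prompt` through each model callable, recording response and latency."""
    rows = []
    for name, generate in generators.items():
        start = time.perf_counter()
        response = generate(prompt)
        rows.append({
            "model": name,
            "response": response,
            "latency_s": round(time.perf_counter() - start, 3),
        })
    return rows

# Stub generators in place of real API-backed models
stubs = {
    "Mistral-7B-Instruct": lambda p: f"Mistral says: {p[:20]}",
    "Llama-3.2-3B-Instruct": lambda p: f"Llama says: {p[:20]}",
}
rows = compare_models("Explain transformers in one sentence.", stubs)
```

The resulting rows map directly onto the side-by-side table the dashboard renders.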
## ⚠️ Limitations
- **Rate Limiting:** the HF Inference API enforces rate limits, so some models may time out
- **Task Coverage:** 15 tasks is a small sample, not a comprehensive benchmark
- **Single Run:** results come from one evaluation run, with no statistical averaging
## 🎓 What This Project Demonstrates
- **LLM Evaluation Design** – Creating meaningful benchmarks
- **API Integration** – Working with the HuggingFace Inference API
- **Data Visualization** – Building interactive dashboards
- **Scoring Systems** – Implementing automated evaluation metrics
## 👤 Author
**[Nav772](https://huggingface.co/Nav772)** – Built as part of an AI/ML Engineering portfolio.
## 📄 License
MIT License