---
title: LLM Evaluation Dashboard
emoji: 🧪
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.12.0
app_file: app.py
pinned: false
license: mit
short_description: Compare LLMs on reasoning, knowledge & instructions
---

# 🧪 LLM Evaluation Dashboard

Compare the performance of multiple Large Language Models across reasoning, knowledge, and instruction-following tasks using the HuggingFace Inference API.

## 🎯 What This Does

1. **Benchmark Results** — View pre-computed evaluation results across 15 tasks
2. **Interactive Charts** — Visualize accuracy and latency comparisons
3. **Live Testing** — Test any model with your own custom prompts
4. **Detailed Analysis** — Filter and explore results by model and category

## 🤖 Models Evaluated

| Model | Parameters | Type | Organization |
|-------|------------|------|--------------|
| Mistral-7B-Instruct | 7B | General | Mistral AI |
| Llama-3.2-3B-Instruct | 3B | General | Meta |
| Llama-3.1-70B-Instruct | 70B | General | Meta |
| Qwen2.5-72B-Instruct | 72B | General | Alibaba |
| Qwen2.5-Coder-32B | 32B | Code | Alibaba |

## 📊 Evaluation Categories

### 1. Reasoning (Math & Logic)
Tests mathematical computation and logical deduction abilities.

**Example tasks:**
- "A store sells apples for $2 each. If I buy 3 apples and pay with $10, how much change do I get?"
- "If all roses are flowers, and some flowers fade quickly, can we conclude that some roses fade quickly?"

### 2. Knowledge (Facts)
Tests factual accuracy across science, history, and geography.

**Example tasks:**
- "What is the chemical symbol for gold?"
- "What planet is known as the Red Planet?"

### 3. Instruction Following
Tests ability to follow specific format constraints.

**Example tasks:**
- "Return a JSON object with keys 'name' and 'age'"
- "List exactly 3 colors, one per line"
- "Write a sentence of exactly 5 words"
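Tasks like these can be declared as plain data. Below is a minimal sketch using prompts from the categories above; the dict schema (`category`, `prompt`, `check`, `expected`) is an illustrative assumption, not necessarily the app's actual field names.

```python
# Hypothetical task declarations; field names are assumptions for illustration.
TASKS = [
    {
        "category": "reasoning",
        "prompt": "A store sells apples for $2 each. If I buy 3 apples "
                  "and pay with $10, how much change do I get?",
        "check": "contains",       # scoring function to apply
        "expected": "4",           # value the check looks for
    },
    {
        "category": "knowledge",
        "prompt": "What planet is known as the Red Planet?",
        "check": "contains_lower",
        "expected": "mars",
    },
    {
        "category": "instruction",
        "prompt": "List exactly 3 colors, one per line",
        "check": "line_count",
        "expected": 3,
    },
]
```

Keeping tasks as data means new benchmarks only require a new entry, not new code.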

## 📈 Key Findings

| Category | Best Model | Score |
|----------|------------|-------|
| **Overall** | Mistral-7B | 80% |
| **Reasoning** | Qwen2.5-Coder | 80% |
| **Knowledge** | Mistral-7B, Llama-3.2, Qwen-Coder | 100% |
| **Instruction Following** | Qwen2.5-72B | 100% |

### Insights

- **Mistral-7B** achieved the best overall accuracy (80%) with the fastest response time (0.39s avg)
- **Qwen2.5-Coder** excelled at reasoning tasks despite being code-focused
- **Qwen2.5-72B** had perfect instruction following but struggled with reasoning
- **Larger models ≠ better performance** — the 7B Mistral outperformed 70B+ models
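Scores like the table above fall out of a small aggregation over raw per-task results. A sketch, assuming each result row carries `model`, `category`, and a boolean `passed` (these field names are assumptions):

```python
from collections import defaultdict

def leaderboard(results):
    """Map (model, category) -> accuracy, with an 'overall' pseudo-category."""
    totals = defaultdict(lambda: [0, 0])  # key -> [passed, attempted]
    for row in results:
        for key in ((row["model"], "overall"), (row["model"], row["category"])):
            totals[key][0] += row["passed"]  # bools sum as 0/1
            totals[key][1] += 1
    return {key: passed / n for key, (passed, n) in totals.items()}
```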

## 🔧 Technical Implementation

### Evaluation Pipeline
```
┌────────────────────────────────────────────────────────────┐
│                  LLM Evaluation Pipeline                   │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │   15 Tasks   │ →  │   5 Models   │ →  │   75 Total   │  │
│  │ 3 Categories │    │    HF API    │    │ Evaluations  │  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
│                                                            │
│                             ↓                              │
│                                                            │
│  ┌──────────────────────────────────────────────────────┐  │
│  │                  Scoring Functions                   │  │
│  │  • contains / contains_lower (substring match)       │  │
│  │  • json_valid (JSON parsing)                         │  │
│  │  • line_count / word_count (format validation)       │  │
│  │  • starts_with_lower (constraint checking)           │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                            │
│                             ↓                              │
│                                                            │
│  ┌──────────────────────────────────────────────────────┐  │
│  │               Dashboard Visualization                │  │
│  │  • Accuracy bar charts                               │  │
│  │  • Category heatmaps                                 │  │
│  │  • Latency comparisons                               │  │
│  │  • Filterable results table                          │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                            │
└────────────────────────────────────────────────────────────┘
```
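One pass of this pipeline (15 tasks × 5 models = 75 evaluations) can be sketched as below. The `huggingface_hub.InferenceClient` call is the plausible live path implied by the HF-API stack, but the exact call site, result field names, and 256-token cap are assumptions; the injectable `ask` hook lets the loop run without network access.

```python
import time

def evaluate(model_id, tasks, score_fns, ask=None):
    """Run every task against one model and collect scored, timed results.

    `ask(prompt) -> str` may be injected (e.g. for offline tests); by
    default it wraps the HuggingFace Inference API.
    """
    if ask is None:
        from huggingface_hub import InferenceClient  # pip install huggingface_hub
        client = InferenceClient(model=model_id)

        def ask(prompt):
            out = client.chat_completion(
                messages=[{"role": "user", "content": prompt}],
                max_tokens=256,
            )
            return out.choices[0].message.content

    results = []
    for task in tasks:
        start = time.perf_counter()
        response = ask(task["prompt"])
        results.append({
            "model": model_id,
            "category": task["category"],
            "passed": score_fns[task["check"]](response, task["expected"]),
            "latency_s": round(time.perf_counter() - start, 2),
        })
    return results
```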

### Scoring Methods

| Check Type | Description | Example |
|------------|-------------|---------|
| `contains` | Case-sensitive substring match | "4" in "The answer is 4" |
| `contains_lower` | Case-insensitive match | "mars" in "MARS is red" |
| `json_valid` | Valid JSON object | `{"name": "Alice"}` |
| `line_count` | Correct number of lines | 3 lines for "list 3 colors" |
| `word_count` | Correct word count | 5 words for "5-word sentence" |
| `starts_with_lower` | Response starts with a given letter, case-insensitive | "Apple" starts with "a" |
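These checks are simple enough to sketch in a few lines of Python. Function names follow the table above; the exact signatures in the app are assumptions.

```python
import json

def contains(response: str, expected: str) -> bool:
    """Case-sensitive substring match."""
    return expected in response

def contains_lower(response: str, expected: str) -> bool:
    """Case-insensitive substring match."""
    return expected.lower() in response.lower()

def json_valid(response: str, _expected=None) -> bool:
    """True if the response parses as a JSON object."""
    try:
        return isinstance(json.loads(response), dict)
    except json.JSONDecodeError:
        return False

def line_count(response: str, expected: int) -> bool:
    """True if the response has exactly `expected` non-empty lines."""
    lines = [l for l in response.strip().splitlines() if l.strip()]
    return len(lines) == expected

def word_count(response: str, expected: int) -> bool:
    """True if the response has exactly `expected` whitespace-separated words."""
    return len(response.split()) == expected

def starts_with_lower(response: str, expected: str) -> bool:
    """True if the response starts with the given letter, ignoring case."""
    return response.strip().lower().startswith(expected.lower())
```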

### Tech Stack

| Component | Technology | Purpose |
|-----------|------------|---------|
| **Frontend** | Gradio | Interactive dashboard UI |
| **Visualization** | Plotly | Charts and heatmaps |
| **LLM Access** | HuggingFace Inference API | Free model inference |
| **Data** | Pandas | Results storage and analysis |

## 🚀 Live Model Comparison

The dashboard includes a **Live Comparison** feature where you can:

1. Enter any custom prompt
2. Select which models to compare
3. See responses side-by-side with latency metrics
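The comparison loop amounts to fanning one prompt out to each selected model and timing each call. A minimal sketch, with the model call injected as a callable so the sketch stays API-agnostic (the `ask(model_id, prompt)` signature is an assumption):

```python
import time

def compare(prompt, model_ids, ask):
    """Query each selected model with the same prompt; return a
    side-by-side view of responses with per-model latency.

    `ask(model_id, prompt) -> str` is injected, e.g. a thin wrapper
    around the HF Inference API.
    """
    rows = {}
    for model_id in model_ids:
        start = time.perf_counter()
        response = ask(model_id, prompt)
        rows[model_id] = {
            "response": response,
            "latency_s": round(time.perf_counter() - start, 2),
        }
    return rows
```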

## ⚠️ Limitations

- **Rate Limiting:** The HF Inference API is rate-limited; some models may time out
- **Task Coverage:** The 15 tasks are a sample, not a comprehensive benchmark
- **Single Run:** Results come from one evaluation run (no statistical averaging)

## 🎓 What This Project Demonstrates

- **LLM Evaluation Design** — Creating meaningful benchmarks
- **API Integration** — Working with the HuggingFace Inference API
- **Data Visualization** — Building interactive dashboards
- **Scoring Systems** — Implementing automated evaluation metrics

## 👤 Author

**[Nav772](https://huggingface.co/Nav772)** — Built as part of an AI/ML Engineering portfolio.

## 📄 License

MIT License