---
title: LLM Evaluation Dashboard
emoji: 🧪
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.12.0
app_file: app.py
pinned: false
license: mit
short_description: Compare LLMs on reasoning, knowledge & instructions
---
# 🧪 LLM Evaluation Dashboard
Compare the performance of multiple Large Language Models across reasoning, knowledge, and instruction-following tasks using the HuggingFace Inference API.
## 🎯 What This Does
1. **Benchmark Results** – View pre-computed evaluation results across 15 tasks
2. **Interactive Charts** – Visualize accuracy and latency comparisons
3. **Live Testing** – Test any model with your own custom prompts
4. **Detailed Analysis** – Filter and explore results by model and category
## 🤖 Models Evaluated
| Model | Parameters | Type | Organization |
|-------|------------|------|--------------|
| Mistral-7B-Instruct | 7B | General | Mistral AI |
| Llama-3.2-3B-Instruct | 3B | General | Meta |
| Llama-3.1-70B-Instruct | 70B | General | Meta |
| Qwen2.5-72B-Instruct | 72B | General | Alibaba |
| Qwen2.5-Coder-32B | 32B | Code | Alibaba |
## 📊 Evaluation Categories
### 1. Reasoning (Math & Logic)
Tests mathematical computation and logical deduction abilities.
**Example tasks:**
- "A store sells apples for $2 each. If I buy 3 apples and pay with $10, how much change do I get?"
- "If all roses are flowers, and some flowers fade quickly, can we conclude that some roses fade quickly?"
### 2. Knowledge (Facts)
Tests factual accuracy across science, history, and geography.
**Example tasks:**
- "What is the chemical symbol for gold?"
- "What planet is known as the Red Planet?"
### 3. Instruction Following
Tests ability to follow specific format constraints.
**Example tasks:**
- "Return a JSON object with keys 'name' and 'age'"
- "List exactly 3 colors, one per line"
- "Write a sentence of exactly 5 words"
## 📈 Key Findings
| Category | Best Model | Score |
|----------|------------|-------|
| **Overall** | Mistral-7B | 80% |
| **Reasoning** | Qwen2.5-Coder | 80% |
| **Knowledge** | Mistral-7B, Llama-3.2, Qwen-Coder | 100% |
| **Instruction Following** | Qwen2.5-72B | 100% |
### Insights
- **Mistral-7B** achieved the best overall accuracy (80%) with the fastest response time (0.39s avg)
- **Qwen2.5-Coder** excelled at reasoning tasks despite being code-focused
- **Qwen2.5-72B** had perfect instruction following but struggled with reasoning
- **Larger models ≠ better performance** – the 7B Mistral outperformed 70B+ models
## 🔧 Technical Implementation
### Evaluation Pipeline
```
                   LLM Evaluation Pipeline

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   15 Tasks   │ →  │   5 Models   │ →  │   75 Total   │
│ 3 Categories │    │    HF API    │    │  Evaluations │
└──────────────┘    └──────────────┘    └──────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────┐
│                  Scoring Functions                   │
│  • contains / contains_lower (substring match)       │
│  • json_valid (JSON parsing)                         │
│  • line_count / word_count (format validation)       │
│  • starts_with_lower (constraint checking)           │
└──────────────────────────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────┐
│               Dashboard Visualization                │
│  • Accuracy bar charts                               │
│  • Category heatmaps                                 │
│  • Latency comparisons                               │
│  • Filterable results table                          │
└──────────────────────────────────────────────────────┘
```
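In code, the pipeline is essentially a nested loop over tasks and models. A minimal sketch of that loop, with the model call stubbed out (in the real app it goes through the HF Inference API; the helper names and model ID strings here are illustrative, not the actual `app.py` API):

```python
import time

# Illustrative task format: (prompt, category, check function)
TASKS = [
    ("What is the chemical symbol for gold?", "knowledge",
     lambda r: "au" in r.lower()),
    ("List exactly 3 colors, one per line", "instruction",
     lambda r: len(r.splitlines()) == 3),
]

MODELS = ["Mistral-7B-Instruct", "Llama-3.2-3B-Instruct"]

def fake_generate(model: str, prompt: str) -> str:
    """Stand-in for the real HF Inference API call."""
    return "Au" if "gold" in prompt else "red\ngreen\nblue"

def run_eval(generate=fake_generate):
    results = []
    for model in MODELS:
        for prompt, category, check in TASKS:
            start = time.perf_counter()
            response = generate(model, prompt)
            latency = time.perf_counter() - start
            results.append({
                "model": model,
                "category": category,
                "correct": check(response),
                "latency_s": round(latency, 3),
            })
    return results

results = run_eval()  # 2 models x 2 tasks -> 4 result rows
```

Swapping `fake_generate` for a real API-backed function turns this into the full 5 models × 15 tasks = 75-evaluation run.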
### Scoring Methods
| Check Type | Description | Example |
|------------|-------------|---------|
| `contains` | Case-sensitive substring match | "4" in "The answer is 4" |
| `contains_lower` | Case-insensitive match | "mars" in "MARS is red" |
| `json_valid` | Valid JSON object | `{"name": "Alice"}` |
| `line_count` | Correct number of lines | 3 lines for "list 3 colors" |
| `word_count` | Correct word count | 5 words for "5-word sentence" |
| `starts_with_lower` | First word begins with a given letter (case-insensitive) | "Apple" starts with "a" |
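The checks in the table are simple enough to sketch in a few lines of Python. These implementations follow the table's descriptions; the exact code in `app.py` may differ:

```python
import json

def contains(response: str, expected: str) -> bool:
    """Case-sensitive substring match."""
    return expected in response

def contains_lower(response: str, expected: str) -> bool:
    """Case-insensitive substring match."""
    return expected.lower() in response.lower()

def json_valid(response: str) -> bool:
    """True if the response parses as a JSON object."""
    try:
        return isinstance(json.loads(response.strip()), dict)
    except json.JSONDecodeError:
        return False

def line_count(response: str, expected: int) -> bool:
    """True if the response has exactly `expected` non-empty lines."""
    return len([ln for ln in response.splitlines() if ln.strip()]) == expected

def word_count(response: str, expected: int) -> bool:
    """True if the response has exactly `expected` words."""
    return len(response.split()) == expected

def starts_with_lower(response: str, letter: str) -> bool:
    """True if the first word starts with `letter`, case-insensitively."""
    words = response.split()
    return bool(words) and words[0].lower().startswith(letter.lower())
```

For example, `contains_lower("MARS is red", "mars")` passes while `contains("MARS is red", "mars")` fails, which is why knowledge tasks use the case-insensitive check.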
### Tech Stack
| Component | Technology | Purpose |
|-----------|------------|---------|
| **Frontend** | Gradio | Interactive dashboard UI |
| **Visualization** | Plotly | Charts and heatmaps |
| **LLM Access** | HuggingFace Inference API | Free model inference |
| **Data** | Pandas | Results storage and analysis |
## 🚀 Live Model Comparison
The dashboard includes a **Live Comparison** feature where you can:
1. Enter any custom prompt
2. Select which models to compare
3. See responses side-by-side with latency metrics
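The three steps above boil down to timing one generate call per selected model. A minimal sketch with stubbed-out responders standing in for the API-backed models (all names here are illustrative):

```python
import time

def compare_models(prompt, generators):
    """Run `prompt` through each model callable, recording response and latency."""
    rows = []
    for name, generate in generators.items():
        start = time.perf_counter()
        response = generate(prompt)
        rows.append({
            "model": name,
            "response": response,
            "latency_s": round(time.perf_counter() - start, 3),
        })
    return rows

# Stub generators in place of real API-backed models
stubs = {
    "Mistral-7B-Instruct": lambda p: f"Mistral says: {p[:20]}",
    "Llama-3.2-3B-Instruct": lambda p: f"Llama says: {p[:20]}",
}
rows = compare_models("Explain transformers in one sentence.", stubs)
```

The resulting rows map directly onto the side-by-side table the dashboard renders.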
## ⚠️ Limitations
- **Rate Limiting:** the HF Inference API enforces rate limits, so some models may time out
- **Task Coverage:** 15 tasks is a small sample, not a comprehensive benchmark
- **Single Run:** results come from one evaluation run, with no statistical averaging
## 🎓 What This Project Demonstrates
- **LLM Evaluation Design** – Creating meaningful benchmarks
- **API Integration** – Working with the HuggingFace Inference API
- **Data Visualization** – Building interactive dashboards
- **Scoring Systems** – Implementing automated evaluation metrics
## 👤 Author
**[Nav772](https://huggingface.co/Nav772)** – Built as part of an AI/ML Engineering portfolio.
## 📄 License
MIT License