---
title: LLM Evaluation Dashboard
emoji: 🧪
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.12.0
app_file: app.py
pinned: false
license: mit
short_description: Compare LLMs on reasoning, knowledge & instructions
---
# 🧪 LLM Evaluation Dashboard

Compare the performance of multiple Large Language Models across reasoning, knowledge, and instruction-following tasks using the HuggingFace Inference API.
## 🎯 What This Does

1. **Benchmark Results** – View pre-computed evaluation results across 15 tasks
2. **Interactive Charts** – Visualize accuracy and latency comparisons
3. **Live Testing** – Test any model with your own custom prompts
4. **Detailed Analysis** – Filter and explore results by model and category
## 🤖 Models Evaluated

| Model | Parameters | Type | Organization |
|-------|------------|------|--------------|
| Mistral-7B-Instruct | 7B | General | Mistral AI |
| Llama-3.2-3B-Instruct | 3B | General | Meta |
| Llama-3.1-70B-Instruct | 70B | General | Meta |
| Qwen2.5-72B-Instruct | 72B | General | Alibaba |
| Qwen2.5-Coder-32B | 32B | Code | Alibaba |
## 📊 Evaluation Categories

### 1. Reasoning (Math & Logic)

Tests mathematical computation and logical deduction abilities.

**Example tasks:**

- "A store sells apples for $2 each. If I buy 3 apples and pay with $10, how much change do I get?"
- "If all roses are flowers, and some flowers fade quickly, can we conclude that some roses fade quickly?"

### 2. Knowledge (Facts)

Tests factual accuracy across science, history, and geography.

**Example tasks:**

- "What is the chemical symbol for gold?"
- "What planet is known as the Red Planet?"

### 3. Instruction Following

Tests the ability to follow specific format constraints.

**Example tasks:**

- "Return a JSON object with keys 'name' and 'age'"
- "List exactly 3 colors, one per line"
- "Write a sentence of exactly 5 words"
## 🏆 Key Findings

| Category | Best Model | Score |
|----------|------------|-------|
| **Overall** | Mistral-7B | 80% |
| **Reasoning** | Qwen2.5-Coder | 80% |
| **Knowledge** | Mistral-7B, Llama-3.2, Qwen-Coder | 100% |
| **Instruction Following** | Qwen2.5-72B | 100% |

### Insights

- **Mistral-7B** achieved the best overall accuracy (80%) with the fastest average response time (0.39 s)
- **Qwen2.5-Coder** excelled at reasoning tasks despite being code-focused
- **Qwen2.5-72B** followed instructions perfectly but struggled with reasoning
- **Larger models ≠ better performance** – the 7B Mistral outperformed the 70B+ models
## 🔧 Technical Implementation

### Evaluation Pipeline

```
┌─────────────────────────────────────────────────────────────┐
│                   LLM Evaluation Pipeline                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐    │
│  │  15 Tasks   │ ──▶ │  5 Models   │ ──▶ │  75 Total   │    │
│  │ 3 Categories│     │   HF API    │     │ Evaluations │    │
│  └─────────────┘     └─────────────┘     └─────────────┘    │
│                              │                              │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                   Scoring Functions                   │  │
│  │  • contains / contains_lower (substring match)        │  │
│  │  • json_valid (JSON parsing)                          │  │
│  │  • line_count / word_count (format validation)        │  │
│  │  • starts_with_lower (constraint checking)            │  │
│  └───────────────────────────────────────────────────────┘  │
│                              │                              │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                Dashboard Visualization                │  │
│  │  • Accuracy bar charts                                │  │
│  │  • Category heatmaps                                  │  │
│  │  • Latency comparisons                                │  │
│  │  • Filterable results table                           │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
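In code, the pipeline above reduces to a double loop over models and tasks. A minimal sketch, with the HF API call and the scorer injected as callables so the loop itself stays testable offline; the names here are illustrative, not necessarily those in `app.py`:

```python
import time

# Hypothetical task/model lists; the real ones in app.py may differ.
TASKS = [
    {"prompt": "What is the chemical symbol for gold?",
     "check": "contains", "expected": "Au", "category": "knowledge"},
]
MODELS = ["mistralai/Mistral-7B-Instruct-v0.3"]

def evaluate(models, tasks, query_model, score):
    """Run every task against every model, recording pass/fail and latency.

    `query_model(model_id, prompt) -> str` is assumed to wrap the HF
    Inference API; `score(check, response, expected) -> bool` dispatches
    to one of the scoring functions.
    """
    results = []
    for model in models:
        for task in tasks:
            start = time.perf_counter()
            response = query_model(model, task["prompt"])
            latency = time.perf_counter() - start
            results.append({
                "model": model,
                "category": task["category"],
                "passed": score(task["check"], response, task["expected"]),
                "latency_s": round(latency, 2),
            })
    return results
```

With 15 tasks and 5 models this loop yields the 75 evaluations shown in the diagram.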
### Scoring Methods

| Check Type | Description | Example |
|------------|-------------|---------|
| `contains` | Case-sensitive substring match | "4" in "The answer is 4" |
| `contains_lower` | Case-insensitive substring match | "mars" in "MARS is red" |
| `json_valid` | Response parses as a JSON object | `{"name": "Alice"}` |
| `line_count` | Correct number of lines | 3 lines for "list 3 colors" |
| `word_count` | Correct word count | 5 words for "5-word sentence" |
| `starts_with_lower` | First word starts with a given letter | "Apple" starts with "a" |
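The check types in the table could be implemented along these lines (a minimal sketch; the actual scoring code in `app.py` may differ):

```python
import json

def contains(response: str, expected: str) -> bool:
    # Case-sensitive substring match.
    return expected in response

def contains_lower(response: str, expected: str) -> bool:
    # Case-insensitive substring match.
    return expected.lower() in response.lower()

def json_valid(response: str) -> bool:
    # Passes only if the response parses as a JSON object.
    try:
        return isinstance(json.loads(response.strip()), dict)
    except json.JSONDecodeError:
        return False

def line_count(response: str, expected: int) -> bool:
    # Passes if the response has exactly `expected` non-empty lines.
    lines = [ln for ln in response.splitlines() if ln.strip()]
    return len(lines) == expected

def word_count(response: str, expected: int) -> bool:
    # Passes if the response has exactly `expected` whitespace-separated words.
    return len(response.split()) == expected

def starts_with_lower(response: str, expected: str) -> bool:
    # Passes if the first word begins with the given letter, ignoring case.
    words = response.split()
    return bool(words) and words[0].lower().startswith(expected.lower())
```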
### Tech Stack

| Component | Technology | Purpose |
|-----------|------------|---------|
| **Frontend** | Gradio | Interactive dashboard UI |
| **Visualization** | Plotly | Charts and heatmaps |
| **LLM Access** | HuggingFace Inference API | Free model inference |
| **Data** | Pandas | Results storage and analysis |
## 🚀 Live Model Comparison

The dashboard includes a **Live Comparison** feature where you can:

1. Enter any custom prompt
2. Select which models to compare
3. See responses side by side with latency metrics
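The three steps above can be sketched as one helper that fans a single prompt out to several models and times each call. The `query` callable is assumed to wrap the HF Inference API (for example via `huggingface_hub.InferenceClient`); it is injected here so the comparison logic itself runs offline:

```python
import time

def compare_models(prompt, model_ids, query):
    """Query each model with the same prompt, collecting response and latency.

    `query(model_id, prompt) -> str` is assumed to wrap the HF Inference
    API; injecting it keeps this helper testable without network access.
    """
    rows = []
    for model_id in model_ids:
        start = time.perf_counter()
        try:
            text = query(model_id, prompt)
            error = None
        except Exception as exc:  # rate limits and timeouts surface here
            text, error = "", str(exc)
        rows.append({
            "model": model_id,
            "response": text,
            "latency_s": round(time.perf_counter() - start, 2),
            "error": error,
        })
    return rows
```

Returning one row per model makes the result trivial to render as a side-by-side table in Gradio.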
## ⚠️ Limitations

- **Rate limiting:** the HF Inference API enforces rate limits, so some models may time out
- **Task coverage:** 15 tasks are a small sample, not a comprehensive benchmark
- **Single run:** results come from one evaluation run, with no statistical averaging
## 🎓 What This Project Demonstrates

- **LLM Evaluation Design** – creating meaningful benchmarks
- **API Integration** – working with the HuggingFace Inference API
- **Data Visualization** – building interactive dashboards
- **Scoring Systems** – implementing automated evaluation metrics
## 👤 Author

**[Nav772](https://huggingface.co/Nav772)** – built as part of an AI/ML Engineering portfolio.

## 📄 License

MIT License