---
title: LLM Evaluation Dashboard
emoji: πŸ§ͺ
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.12.0
app_file: app.py
pinned: false
license: mit
short_description: Compare LLMs on reasoning, knowledge & instructions
---

# πŸ§ͺ LLM Evaluation Dashboard

Compare the performance of multiple Large Language Models across reasoning, knowledge, and instruction-following tasks using the HuggingFace Inference API.

## 🎯 What This Does

1. **Benchmark Results** β€” View pre-computed evaluation results across 15 tasks
2. **Interactive Charts** β€” Visualize accuracy and latency comparisons
3. **Live Testing** β€” Test any model with your own custom prompts
4. **Detailed Analysis** β€” Filter and explore results by model and category

## πŸ€– Models Evaluated

| Model | Parameters | Type | Organization |
|-------|------------|------|--------------|
| Mistral-7B-Instruct | 7B | General | Mistral AI |
| Llama-3.2-3B-Instruct | 3B | General | Meta |
| Llama-3.1-70B-Instruct | 70B | General | Meta |
| Qwen2.5-72B-Instruct | 72B | General | Alibaba |
| Qwen2.5-Coder-32B | 32B | Code | Alibaba |

## πŸ“Š Evaluation Categories

### 1. Reasoning (Math & Logic)

Tests mathematical computation and logical deduction abilities.

**Example tasks:**
- "A store sells apples for $2 each. If I buy 3 apples and pay with $10, how much change do I get?"
- "If all roses are flowers, and some flowers fade quickly, can we conclude that some roses fade quickly?"

### 2. Knowledge (Facts)

Tests factual accuracy across science, history, and geography.

**Example tasks:**
- "What is the chemical symbol for gold?"
- "What planet is known as the Red Planet?"

### 3. Instruction Following

Tests the ability to follow specific format constraints.
**Example tasks:**
- "Return a JSON object with keys 'name' and 'age'"
- "List exactly 3 colors, one per line"
- "Write a sentence of exactly 5 words"

## πŸ“ˆ Key Findings

| Category | Best Model | Score |
|----------|------------|-------|
| **Overall** | Mistral-7B | 80% |
| **Reasoning** | Qwen2.5-Coder | 80% |
| **Knowledge** | Mistral-7B, Llama-3.2, Qwen-Coder | 100% |
| **Instruction Following** | Qwen2.5-72B | 100% |

### Insights

- **Mistral-7B** achieved the best overall accuracy (80%) with the fastest average response time (0.39s)
- **Qwen2.5-Coder** excelled at reasoning tasks despite being code-focused
- **Qwen2.5-72B** had perfect instruction following but struggled with reasoning
- **Larger models β‰  better performance** β€” the 7B Mistral outperformed 70B+ models

## πŸ”§ Technical Implementation

### Evaluation Pipeline

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    LLM Evaluation Pipeline                     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚
β”‚  β”‚  15 Tasks   β”‚  β†’  β”‚  5 Models   β”‚  β†’  β”‚  75 Total   β”‚       β”‚
β”‚  β”‚ 3 Categoriesβ”‚     β”‚   HF API    β”‚     β”‚ Evaluations β”‚       β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚
β”‚                             ↓                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚                   Scoring Functions                    β”‚    β”‚
β”‚  β”‚  β€’ contains / contains_lower (substring match)         β”‚    β”‚
β”‚  β”‚  β€’ json_valid (JSON parsing)                           β”‚    β”‚
β”‚  β”‚  β€’ line_count / word_count (format validation)         β”‚    β”‚
β”‚  β”‚  β€’ starts_with_lower (constraint checking)             β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                             ↓                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚                Dashboard Visualization                 β”‚    β”‚
β”‚  β”‚  β€’ Accuracy bar charts                                 β”‚    β”‚
β”‚  β”‚  β€’ Category heatmaps                                   β”‚    β”‚
β”‚  β”‚  β€’ Latency comparisons                                 β”‚    β”‚
β”‚  β”‚  β€’ Filterable results table                            β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                                                                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

### Scoring Methods

| Check Type | Description | Example |
|------------|-------------|---------|
| `contains` | Case-sensitive substring match | "4" in "The answer is 4" |
| `contains_lower` | Case-insensitive substring match | "mars" in "MARS is red" |
| `json_valid` | Valid JSON object | `{"name": "Alice"}` |
| `line_count` | Correct number of lines | 3 lines for "list 3 colors" |
| `word_count` | Correct word count | 5 words for "5-word sentence" |
| `starts_with_lower` | First word starts with the required letter (case-insensitive) | "Apple" starts with "a" |

### Tech Stack

| Component | Technology | Purpose |
|-----------|------------|---------|
| **Frontend** | Gradio | Interactive dashboard UI |
| **Visualization** | Plotly | Charts and heatmaps |
| **LLM Access** | HuggingFace Inference API | Free model inference |
| **Data** | Pandas | Results storage and analysis |

## πŸš€ Live Model Comparison

The dashboard
includes a **Live Comparison** feature where you can:

1. Enter any custom prompt
2. Select which models to compare
3. See responses side-by-side with latency metrics

## ⚠️ Limitations

- **Rate Limiting:** The HF Inference API has rate limits; some models may time out
- **Task Coverage:** The 15 tasks are a sample, not a comprehensive benchmark
- **Single Run:** Results are from one evaluation run (no statistical averaging)

## πŸŽ“ What This Project Demonstrates

- **LLM Evaluation Design** β€” Creating meaningful benchmarks
- **API Integration** β€” Working with the HuggingFace Inference API
- **Data Visualization** β€” Building interactive dashboards
- **Scoring Systems** β€” Implementing automated evaluation metrics

## πŸ‘€ Author

**[Nav772](https://huggingface.co/Nav772)** β€” Built as part of an AI/ML Engineering portfolio.

## πŸ“„ License

MIT License
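
## 🧩 Appendix: Scoring Sketch

For illustration, the scoring checks described above can be sketched in a few lines of Python. This is a minimal standalone sketch, not the actual `app.py` implementation; the function names mirror the check types in the Scoring Methods table, but signatures and details in the real code may differ.

```python
import json


def contains(response: str, expected: str) -> bool:
    """Case-sensitive substring match."""
    return expected in response


def contains_lower(response: str, expected: str) -> bool:
    """Case-insensitive substring match."""
    return expected.lower() in response.lower()


def json_valid(response: str) -> bool:
    """True if the response parses as a JSON object."""
    try:
        return isinstance(json.loads(response.strip()), dict)
    except json.JSONDecodeError:
        return False


def line_count(response: str, expected: int) -> bool:
    """True if the response has exactly `expected` non-empty lines."""
    lines = [ln for ln in response.splitlines() if ln.strip()]
    return len(lines) == expected


def word_count(response: str, expected: int) -> bool:
    """True if the response has exactly `expected` whitespace-separated words."""
    return len(response.split()) == expected


def starts_with_lower(response: str, letter: str) -> bool:
    """True if the first word starts with the given letter, ignoring case."""
    return response.strip().lower().startswith(letter.lower())
```

For example, `contains("The answer is 4", "4")` and `json_valid('{"name": "Alice", "age": 30}')` both pass, while `word_count("too short", 5)` fails.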