---
title: LLM Evaluation Dashboard
emoji: 🧪
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.12.0
app_file: app.py
pinned: false
license: mit
short_description: Compare LLMs on reasoning, knowledge & instructions
---
# 🧪 LLM Evaluation Dashboard

Compare the performance of multiple Large Language Models across reasoning, knowledge, and instruction-following tasks using the HuggingFace Inference API.
## 🎯 What This Does

1. **Benchmark Results** – View pre-computed evaluation results across 15 tasks
2. **Interactive Charts** – Visualize accuracy and latency comparisons
3. **Live Testing** – Test any model with your own custom prompts
4. **Detailed Analysis** – Filter and explore results by model and category
## 🤖 Models Evaluated

| Model | Parameters | Type | Organization |
|-------|------------|------|--------------|
| Mistral-7B-Instruct | 7B | General | Mistral AI |
| Llama-3.2-3B-Instruct | 3B | General | Meta |
| Llama-3.1-70B-Instruct | 70B | General | Meta |
| Qwen2.5-72B-Instruct | 72B | General | Alibaba |
| Qwen2.5-Coder-32B | 32B | Code | Alibaba |
## 📊 Evaluation Categories

### 1. Reasoning (Math & Logic)

Tests mathematical computation and logical deduction abilities.

**Example tasks:**

- "A store sells apples for $2 each. If I buy 3 apples and pay with $10, how much change do I get?"
- "If all roses are flowers, and some flowers fade quickly, can we conclude that some roses fade quickly?"

### 2. Knowledge (Facts)

Tests factual accuracy across science, history, and geography.

**Example tasks:**

- "What is the chemical symbol for gold?"
- "What planet is known as the Red Planet?"

### 3. Instruction Following

Tests the ability to follow specific format constraints.

**Example tasks:**

- "Return a JSON object with keys 'name' and 'age'"
- "List exactly 3 colors, one per line"
- "Write a sentence of exactly 5 words"
## 🏆 Key Findings

| Category | Best Model | Score |
|----------|------------|-------|
| **Overall** | Mistral-7B | 80% |
| **Reasoning** | Qwen2.5-Coder | 80% |
| **Knowledge** | Mistral-7B, Llama-3.2, Qwen-Coder | 100% |
| **Instruction Following** | Qwen2.5-72B | 100% |

### Insights

- **Mistral-7B** achieved the best overall accuracy (80%) with the fastest average response time (0.39 s)
- **Qwen2.5-Coder** excelled at reasoning tasks despite being code-focused
- **Qwen2.5-72B** followed instructions perfectly but struggled with reasoning
- **Larger models ≠ better performance** – the 7B Mistral outperformed the 70B+ models
## 🔧 Technical Implementation

### Evaluation Pipeline

```
┌─────────────────────────────────────────────────────────────┐
│                   LLM Evaluation Pipeline                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐    │
│  │  15 Tasks   │ ──▶ │  5 Models   │ ──▶ │  75 Total   │    │
│  │ 3 Categories│     │   HF API    │     │ Evaluations │    │
│  └─────────────┘     └─────────────┘     └─────────────┘    │
│                              │                              │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                   Scoring Functions                   │  │
│  │  • contains / contains_lower (substring match)        │  │
│  │  • json_valid (JSON parsing)                          │  │
│  │  • line_count / word_count (format validation)        │  │
│  │  • starts_with_lower (constraint checking)            │  │
│  └───────────────────────────────────────────────────────┘  │
│                              │                              │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                Dashboard Visualization                │  │
│  │  • Accuracy bar charts                                │  │
│  │  • Category heatmaps                                  │  │
│  │  • Latency comparisons                                │  │
│  │  • Filterable results table                           │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
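In code, the pipeline above reduces to a double loop over models and tasks. A minimal sketch, with the HF API call and the scorer injected as callables so the loop itself stays testable offline; the names here are illustrative, not necessarily those in `app.py`:

```python
import time

# Hypothetical task/model lists; the real ones in app.py may differ.
TASKS = [
    {"prompt": "What is the chemical symbol for gold?",
     "check": "contains", "expected": "Au", "category": "knowledge"},
]
MODELS = ["mistralai/Mistral-7B-Instruct-v0.3"]

def evaluate(models, tasks, query_model, score):
    """Run every task against every model, recording pass/fail and latency.

    `query_model(model_id, prompt) -> str` is assumed to wrap the HF
    Inference API; `score(check, response, expected) -> bool` dispatches
    to one of the scoring functions.
    """
    results = []
    for model in models:
        for task in tasks:
            start = time.perf_counter()
            response = query_model(model, task["prompt"])
            latency = time.perf_counter() - start
            results.append({
                "model": model,
                "category": task["category"],
                "passed": score(task["check"], response, task["expected"]),
                "latency_s": round(latency, 2),
            })
    return results
```

With 15 tasks and 5 models this loop yields the 75 evaluations shown in the diagram.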
### Scoring Methods

| Check Type | Description | Example |
|------------|-------------|---------|
| `contains` | Case-sensitive substring match | "4" in "The answer is 4" |
| `contains_lower` | Case-insensitive substring match | "mars" in "MARS is red" |
| `json_valid` | Response parses as a JSON object | `{"name": "Alice"}` |
| `line_count` | Correct number of lines | 3 lines for "list 3 colors" |
| `word_count` | Correct word count | 5 words for "5-word sentence" |
| `starts_with_lower` | First word starts with a given letter | "Apple" starts with "a" |
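The check types in the table could be implemented along these lines (a minimal sketch; the actual scoring code in `app.py` may differ):

```python
import json

def contains(response: str, expected: str) -> bool:
    # Case-sensitive substring match.
    return expected in response

def contains_lower(response: str, expected: str) -> bool:
    # Case-insensitive substring match.
    return expected.lower() in response.lower()

def json_valid(response: str) -> bool:
    # Passes only if the response parses as a JSON object.
    try:
        return isinstance(json.loads(response.strip()), dict)
    except json.JSONDecodeError:
        return False

def line_count(response: str, expected: int) -> bool:
    # Passes if the response has exactly `expected` non-empty lines.
    lines = [ln for ln in response.splitlines() if ln.strip()]
    return len(lines) == expected

def word_count(response: str, expected: int) -> bool:
    # Passes if the response has exactly `expected` whitespace-separated words.
    return len(response.split()) == expected

def starts_with_lower(response: str, expected: str) -> bool:
    # Passes if the first word begins with the given letter, ignoring case.
    words = response.split()
    return bool(words) and words[0].lower().startswith(expected.lower())
```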
### Tech Stack

| Component | Technology | Purpose |
|-----------|------------|---------|
| **Frontend** | Gradio | Interactive dashboard UI |
| **Visualization** | Plotly | Charts and heatmaps |
| **LLM Access** | HuggingFace Inference API | Free model inference |
| **Data** | Pandas | Results storage and analysis |
## 🚀 Live Model Comparison

The dashboard includes a **Live Comparison** feature where you can:

1. Enter any custom prompt
2. Select which models to compare
3. See responses side by side with latency metrics
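The three steps above can be sketched as one helper that fans a single prompt out to several models and times each call. The `query` callable is assumed to wrap the HF Inference API (for example via `huggingface_hub.InferenceClient`); it is injected here so the comparison logic itself runs offline:

```python
import time

def compare_models(prompt, model_ids, query):
    """Query each model with the same prompt, collecting response and latency.

    `query(model_id, prompt) -> str` is assumed to wrap the HF Inference
    API; injecting it keeps this helper testable without network access.
    """
    rows = []
    for model_id in model_ids:
        start = time.perf_counter()
        try:
            text = query(model_id, prompt)
            error = None
        except Exception as exc:  # rate limits and timeouts surface here
            text, error = "", str(exc)
        rows.append({
            "model": model_id,
            "response": text,
            "latency_s": round(time.perf_counter() - start, 2),
            "error": error,
        })
    return rows
```

Returning one row per model makes the result trivial to render as a side-by-side table in Gradio.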
## ⚠️ Limitations

- **Rate limiting:** the HF Inference API enforces rate limits, so some models may time out
- **Task coverage:** 15 tasks are a small sample, not a comprehensive benchmark
- **Single run:** results come from one evaluation run, with no statistical averaging
## 🎓 What This Project Demonstrates

- **LLM Evaluation Design** – creating meaningful benchmarks
- **API Integration** – working with the HuggingFace Inference API
- **Data Visualization** – building interactive dashboards
- **Scoring Systems** – implementing automated evaluation metrics
## 👤 Author

**[Nav772](https://huggingface.co/Nav772)** – built as part of an AI/ML Engineering portfolio.

## 📄 License

MIT License