---
license: mit
language:
- en
tags:
- llm-evaluation
- benchmarking
- nlp
- evaluation
- accuracy
- hallucination
- reasoning
- gpt
- claude
- gemini
- mistral
- llama
- mmlu
- truthfulqa
- open-source
- python
- fastapi
- streamlit
library_name: llm-evaluation-framework
pipeline_tag: text-generation
---
# LLM Evaluation Framework
<p align="center">
<img src="https://img.shields.io/badge/python-3.10%2B-22c55e?style=flat-square&logo=python&logoColor=white"/>
<img src="https://img.shields.io/badge/License-MIT-eab308?style=flat-square"/>
<img src="https://img.shields.io/badge/FastAPI-0.115-14b8a6?style=flat-square&logo=fastapi"/>
<img src="https://img.shields.io/badge/Streamlit-1.40-ef4444?style=flat-square&logo=streamlit"/>
<img src="https://img.shields.io/badge/LiteLLM-1.52-8b5cf6?style=flat-square"/>
<img src="https://img.shields.io/github/stars/vignesh2027/LLM-Evaluation-Framework?style=flat-square&color=eab308"/>
</p>
> **Production-grade open-source LLM benchmarking.**
> Evaluate GPT-4, Claude, Gemini, Mistral and Llama on 5 metrics, side by side, in one command.
## What This Is
This is the **model card / hub page** for the LLM Evaluation Framework.
The framework itself is a Python tool, not a set of neural network weights. This page serves as
the Hugging Face hub entry point that links all resources together.
| Resource | Link |
|---|---|
| GitHub | https://github.com/vignesh2027/LLM-Evaluation-Framework |
| Live Demo | https://huggingface.co/spaces/vigneshwar234/llm-eval-demo |
| Dataset | https://huggingface.co/datasets/vigneshwar234/llm-eval-benchmark |
| Docs | https://vignesh2027.github.io/LLM-Evaluation-Framework/ |
## Quick Start
```bash
pip install llm-evaluation-framework
export OPENAI_API_KEY="sk-..."
llm-eval run --model gpt-4o-mini --benchmark mmlu --samples 100
```
**Output:**
```
╭──────────────────────────────────────╮
│ Evaluation: gpt-4o-mini              │
├──────────────────┬───────────────────┤
│ Accuracy         │ 78.00%            │
│ Avg Latency      │ 432 ms            │
│ P95 Latency      │ 1240 ms           │
│ Total Cost       │ $0.0023           │
│ Hallucination    │ 2.40%             │
│ Reasoning Score  │ 7.2 / 10          │
╰──────────────────┴───────────────────╯
```
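The `Avg Latency` and `P95 Latency` rows are summary statistics over per-request timings. The following is a minimal sketch of how such figures could be computed, assuming the nearest-rank percentile definition; the framework's actual method is not documented here, and the function names `percentile` and `sla_violation_rate` are hypothetical.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it. (Assumed definition, for illustration.)"""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def sla_violation_rate(values: Sequence[float], threshold_ms: float) -> float:
    """Fraction of requests slower than an assumed SLA threshold."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(v > threshold_ms for v in values) / len(values)


# Made-up latencies in milliseconds, for illustration only.
latencies_ms = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
avg_ms = mean(latencies_ms)
p95_ms = percentile(latencies_ms, 95)
violations = sla_violation_rate(latencies_ms, threshold_ms=800)
```

With only a handful of samples, high percentiles such as p95 and p99 collapse onto the slowest request, so they are more informative on larger runs.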
## 5 Evaluation Metrics
| Metric | Description | Output |
|---|---|---|
| **Accuracy** | 4-strategy cascade: exact → normalized → MC → fuzzy | 0.0–1.0 |
| **Latency** | p50, p75, p90, p95, p99 percentiles + SLA violation rate | ms |
| **Cost** | Real token counts × pricing table for 15+ models | $/1K tokens |
| **Hallucination Rate** | Linguistic signal analysis (v1), NLI planned (v2) | 0.0–1.0 |
| **Reasoning Quality** | Chain-of-thought depth scoring | 1–10 |
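To make the accuracy cascade concrete, here is a minimal sketch of a four-step matcher that tries exact, normalized, multiple-choice (MC), and fuzzy comparison in that order. It is an illustration under stated assumptions, not the framework's implementation: the helper names, the A-D choice set, and the 0.85 fuzzy threshold are all hypothetical.

```python
import re
import string
from difflib import SequenceMatcher
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


def extract_choice(text: str) -> Optional[str]:
    """Return the first standalone capital letter A-D in the response.
    Naive on purpose: a leading article "A" would also match."""
    match = re.search(r"\b([A-D])\b", text)
    return match.group(1) if match else None


def is_correct(prediction: str, reference: str, fuzzy_threshold: float = 0.85) -> bool:
    # 1. Exact string match.
    if prediction == reference:
        return True
    # 2. Match after normalization.
    if normalize(prediction) == normalize(reference):
        return True
    # 3. Multiple choice: applies only when the reference is a single letter.
    if reference.strip() in {"A", "B", "C", "D"}:
        return extract_choice(prediction) == reference.strip()
    # 4. Fuzzy similarity on the normalized strings (assumed threshold).
    ratio = SequenceMatcher(None, normalize(prediction), normalize(reference)).ratio()
    return ratio >= fuzzy_threshold


def accuracy(predictions: list, references: list) -> float:
    """Share of correct predictions, in the 0.0 to 1.0 range."""
    hits = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Ordering the strategies from strictest to most lenient means the cheap checks settle most cases, and the fuzzy step only decides answers that the earlier steps could not.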
## Supported Models
| Provider | Models |
|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, o1, o1-mini, GPT-3.5-turbo |
| Anthropic | Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus |
| Google | Gemini 1.5 Pro, Gemini 1.5 Flash, Gemini 2.0 Flash |
| Mistral | Mistral Large, Mistral Small |
| Meta | Llama 3 70B, Llama 3 8B (via Together AI) |
| Local | Ollama, vLLM, HuggingFace TGI |
## Sample Benchmark Results (MMLU, 100 samples)
| Model | Accuracy | Latency | Cost/1K | Hallucination | Reasoning |
|---|---|---|---|---|---|
| GPT-4o | 88.2% | 892ms | $0.0080 | 1.8% | 8.4/10 |
| Claude 3.5 Sonnet | 87.6% | 1240ms | $0.0090 | 2.1% | 8.6/10 |
| GPT-4o-mini | 78.4% | 432ms | $0.0003 | 3.2% | 7.2/10 |
| Gemini 1.5 Flash | 76.8% | 380ms | $0.0001 | 4.1% | 6.8/10 |
| Claude 3 Haiku | 74.2% | 410ms | $0.0010 | 4.8% | 6.5/10 |
**Key finding:** GPT-4o-mini reaches about 89% of GPT-4o's accuracy (78.4% vs. 88.2%) at about 4% of the cost ($0.0003 vs. $0.0080 per 1K tokens).
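The ratios behind this comparison can be recomputed directly from the table. The snippet below is a small sketch that uses only the two rows involved; the `results` dictionary and the `relative` helper are illustrative names, not part of the framework.

```python
# Figures copied from the sample results table above.
results = {
    "GPT-4o": {"accuracy": 88.2, "cost_per_1k": 0.0080},
    "GPT-4o-mini": {"accuracy": 78.4, "cost_per_1k": 0.0003},
}


def relative(metric: str, model: str, baseline: str) -> float:
    """Value of `metric` for `model` as a fraction of the baseline's value."""
    return results[model][metric] / results[baseline][metric]


accuracy_ratio = relative("accuracy", "GPT-4o-mini", "GPT-4o")
cost_ratio = relative("cost_per_1k", "GPT-4o-mini", "GPT-4o")
print(f"accuracy ratio: {accuracy_ratio:.3f}, cost ratio: {cost_ratio:.4f}")
```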
## Features
- **Async parallel evaluation**: 10 models at once via `asyncio.Semaphore`
- **Streamlit dashboard**: radar charts, latency histograms, cost vs quality scatter
- **FastAPI REST API**: 12 endpoints with OpenAPI docs
- **CLI tool**: 7 subcommands with rich terminal output
- **PDF report generator**: professional layout via ReportLab
- **SQLite persistence**: zero-config, file-based storage
- **Docker ready**: multi-stage build, `docker-compose up`
- **40+ tests, 95% coverage**: pytest, no API keys needed
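The async parallel evaluation feature relies on `asyncio.Semaphore` to cap how many evaluations run at once. A minimal sketch of that pattern follows; `run_bounded` and `fake_evaluate` are hypothetical names, and the sleep stands in for real provider calls, so this shows the concurrency technique rather than the framework's own code.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


async def run_bounded(
    models: List[str],
    evaluate: Callable[[str], Awaitable[float]],
    max_concurrent: int = 10,
) -> Dict[str, float]:
    """Evaluate every model with at most `max_concurrent` in flight at once."""
    limiter = asyncio.Semaphore(max_concurrent)

    async def guarded(name: str) -> Tuple[str, float]:
        async with limiter:  # waits here while the limit is reached
            return name, await evaluate(name)

    pairs = await asyncio.gather(*(guarded(name) for name in models))
    return dict(pairs)


async def fake_evaluate(name: str) -> float:
    """Hypothetical stand-in for a real evaluation; returns a dummy score."""
    await asyncio.sleep(0.01)  # placeholder for network latency
    return float(len(name))


if __name__ == "__main__":
    scores = asyncio.run(
        run_bounded(["model-a", "model-b", "model-c"], fake_evaluate, max_concurrent=2)
    )
    print(scores)
```

A semaphore keeps all tasks scheduled up front while bounding the number that hold a slot, which helps stay under provider rate limits without batching the work by hand.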
## Architecture
```
CLI / FastAPI / Streamlit / PDF Generator
                    │
        Core Evaluator (asyncio)
                    │
         ┌──────────┼──────────┬──────────┐
      Metrics  Benchmarks  Database    LiteLLM
      accuracy MMLU        SQLite      OpenAI
      latency  TruthfulQA              Anthropic
      cost     Custom CSV              Google
      hallucin.                        Mistral
      reasoning                        Together
```
## Install
```bash
# pip
pip install llm-evaluation-framework
# With extras
pip install "llm-evaluation-framework[dashboard,reports,dev]"
# Docker
docker-compose up -d
```
## License
MIT: free for research and commercial use.
## Citation
```bibtex
@software{vigneshwar234_llm_eval_2025,
author = {Vigneshwar S},
title = {LLM Evaluation Framework},
year = {2025},
url = {https://github.com/vignesh2027/LLM-Evaluation-Framework},
license = {MIT}
}
```