---
license: mit
language:
  - en
tags:
  - llm-evaluation
  - benchmarking
  - nlp
  - evaluation
  - accuracy
  - hallucination
  - reasoning
  - gpt
  - claude
  - gemini
  - mistral
  - llama
  - mmlu
  - truthfulqa
  - open-source
  - python
  - fastapi
  - streamlit
library_name: llm-evaluation-framework
pipeline_tag: text-generation
---

# LLM Evaluation Framework

<p align="center">
  <img src="https://img.shields.io/badge/python-3.10%2B-22c55e?style=flat-square&logo=python&logoColor=white"/>
  <img src="https://img.shields.io/badge/License-MIT-eab308?style=flat-square"/>
  <img src="https://img.shields.io/badge/FastAPI-0.115-14b8a6?style=flat-square&logo=fastapi"/>
  <img src="https://img.shields.io/badge/Streamlit-1.40-ef4444?style=flat-square&logo=streamlit"/>
  <img src="https://img.shields.io/badge/LiteLLM-1.52-8b5cf6?style=flat-square"/>
  <img src="https://img.shields.io/github/stars/vignesh2027/LLM-Evaluation-Framework?style=flat-square&color=eab308"/>
</p>

> **Production-grade open-source LLM benchmarking.**
> Evaluate GPT-4, Claude, Gemini, Mistral and Llama on 5 metrics, side by side, in one command.

## What This Is

This is the **model card / hub page** for the LLM Evaluation Framework.
The framework itself is a Python tool, not a set of neural network weights; this page serves as
the Hugging Face hub entry point linking all resources together.

| Resource | Link |
|---|---|
| GitHub | https://github.com/vignesh2027/LLM-Evaluation-Framework |
| Live Demo | https://huggingface.co/spaces/vigneshwar234/llm-eval-demo |
| Dataset | https://huggingface.co/datasets/vigneshwar234/llm-eval-benchmark |
| Docs | https://vignesh2027.github.io/LLM-Evaluation-Framework/ |

## Quick Start

```bash
pip install llm-evaluation-framework
export OPENAI_API_KEY="sk-..."
llm-eval run --model gpt-4o-mini --benchmark mmlu --samples 100
```

**Output:**
```
╭──────────────────────────────────────╮
│  Evaluation: gpt-4o-mini             │
├──────────────────┬───────────────────┤
│ Accuracy         │ 78.00%            │
│ Avg Latency      │ 432 ms            │
│ P95 Latency      │ 1240 ms           │
│ Total Cost       │ $0.0023           │
│ Hallucination    │ 2.40%             │
│ Reasoning Score  │ 7.2 / 10          │
╰──────────────────┴───────────────────╯
```

## 5 Evaluation Metrics

| Metric | Description | Output |
|---|---|---|
| **Accuracy** | 4-strategy cascade: exact → normalized → MC → fuzzy | 0.0–1.0 |
| **Latency** | p50, p75, p90, p95, p99 percentiles + SLA violation rate | ms |
| **Cost** | Real token counts × pricing table for 15+ models | $/1K tokens |
| **Hallucination Rate** | Linguistic signal analysis (v1), NLI planned (v2) | 0.0–1.0 |
| **Reasoning Quality** | Chain-of-thought depth scoring | 1–10 |
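
The accuracy cascade in the table can be read as trying progressively looser matchers until one succeeds. The following is a minimal sketch of that idea, not the framework's actual implementation: the function names (`normalize`, `extract_choice`, `cascade_match`), the 0.85 fuzzy threshold, and the use of `difflib.SequenceMatcher` for the fuzzy step are all assumptions made for illustration.

```python
import re
import string
from difflib import SequenceMatcher
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def extract_choice(text: str) -> Optional[str]:
    """Return the first standalone uppercase answer letter (A-D), if any."""
    match = re.search(r"\b([A-D])\b", text)
    return match.group(1) if match else None


def cascade_match(
    prediction: str, reference: str, fuzzy_threshold: float = 0.85
) -> Optional[str]:
    """Return the name of the first strategy that matches, or None."""
    # 1. Exact string equality.
    if prediction == reference:
        return "exact"
    # 2. Equality after normalization.
    if normalize(prediction) == normalize(reference):
        return "normalized"
    # 3. Multiple choice (MC): compare extracted answer letters.
    pred_choice = extract_choice(prediction)
    if pred_choice is not None and pred_choice == extract_choice(reference):
        return "mc"
    # 4. Fuzzy: similarity ratio on the normalized strings (threshold is an assumption).
    ratio = SequenceMatcher(None, normalize(prediction), normalize(reference)).ratio()
    if ratio >= fuzzy_threshold:
        return "fuzzy"
    return None
```

Returning the strategy name rather than a bare boolean makes it possible to report how many answers were accepted only by the looser matchers, which is useful when auditing an accuracy score.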

## Supported Models

| Provider | Models |
|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, o1, o1-mini, GPT-3.5-turbo |
| Anthropic | Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus |
| Google | Gemini 1.5 Pro, Gemini 1.5 Flash, Gemini 2.0 Flash |
| Mistral | Mistral Large, Mistral Small |
| Meta | Llama 3 70B, Llama 3 8B (via Together AI) |
| Local | Ollama, vLLM, HuggingFace TGI |

## Sample Benchmark Results (MMLU, 100 samples)

| Model | Accuracy | Latency | Cost / 1K tokens | Hallucination | Reasoning |
|---|---|---|---|---|---|
| GPT-4o | 88.2% | 892ms | $0.0080 | 1.8% | 8.4/10 |
| Claude 3.5 Sonnet | 87.6% | 1240ms | $0.0090 | 2.1% | 8.6/10 |
| GPT-4o-mini | 78.4% | 432ms | $0.0003 | 3.2% | 7.2/10 |
| Gemini 1.5 Flash | 76.8% | 380ms | $0.0001 | 4.1% | 6.8/10 |
| Claude 3 Haiku | 74.2% | 410ms | $0.0010 | 4.8% | 6.5/10 |

**Key finding:** GPT-4o-mini achieves about 89% of GPT-4o's accuracy (78.4% vs. 88.2%) at about 4% of the cost ($0.0003 vs. $0.0080 per 1K tokens).
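
The two ratios in this finding can be recomputed from the table. A short check, using only the accuracy and cost figures listed there (the dictionary layout and variable names are illustrative):

```python
# Figures copied from the sample benchmark table above.
results = {
    "GPT-4o": {"accuracy": 88.2, "cost_per_1k": 0.0080},
    "GPT-4o-mini": {"accuracy": 78.4, "cost_per_1k": 0.0003},
}

# Ratio of the smaller model's accuracy and cost to the larger model's.
accuracy_ratio = results["GPT-4o-mini"]["accuracy"] / results["GPT-4o"]["accuracy"]
cost_ratio = results["GPT-4o-mini"]["cost_per_1k"] / results["GPT-4o"]["cost_per_1k"]

print(f"Relative accuracy: {accuracy_ratio:.1%}")
print(f"Relative cost: {cost_ratio:.2%}")
```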

## Features

- **Async parallel evaluation**: 10 models at once via `asyncio.Semaphore`
- **Streamlit dashboard**: radar charts, latency histograms, cost vs quality scatter
- **FastAPI REST API**: 12 endpoints with OpenAPI docs
- **CLI tool**: 7 subcommands with rich terminal output
- **PDF report generator**: professional layout via ReportLab
- **SQLite persistence**: zero-config, file-based storage
- **Docker ready**: multi-stage build, `docker-compose up`
- **40+ tests, 95% coverage**: pytest, no API keys needed
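
The async evaluation feature bounds concurrency with `asyncio.Semaphore`. The pattern can be sketched as follows. This is an illustration under stated assumptions, not the framework's code: `evaluate_model` is a hypothetical stand-in that sleeps instead of calling a provider, `evaluate_all` is a made-up name, and the default limit of 10 mirrors the figure in the list above.

```python
import asyncio


async def evaluate_model(model: str) -> dict:
    """Hypothetical stand-in for a real evaluation; sleeps instead of calling an API."""
    await asyncio.sleep(0.01)
    return {"model": model, "status": "done"}


async def evaluate_all(models: list[str], max_concurrent: int = 10) -> list[dict]:
    """Evaluate every model, with at most max_concurrent evaluations in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(model: str) -> dict:
        # Each task waits here until one of the max_concurrent slots is free.
        async with semaphore:
            return await evaluate_model(model)

    # gather returns results in the same order as the input list.
    return await asyncio.gather(*(bounded(m) for m in models))


if __name__ == "__main__":
    # The strings below are labels only; no provider is contacted.
    outcomes = asyncio.run(evaluate_all(["model-a", "model-b", "model-c"]))
    for outcome in outcomes:
        print(outcome["model"], outcome["status"])
```

A semaphore caps the number of in-flight requests without serializing them, which keeps a large model list from exceeding provider rate limits while still overlapping network waits.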

## Architecture

```
CLI / FastAPI / Streamlit / PDF Generator
              │
        Core Evaluator (asyncio)
              │
   ┌──────────┼──────────┬──────────┐
Metrics  Benchmarks  Database  LiteLLM
accuracy  MMLU        SQLite    OpenAI
latency   TruthfulQA           Anthropic
cost      Custom CSV           Google
hallucin.                      Mistral
reasoning                      Together
```

## Install

```bash
# pip
pip install llm-evaluation-framework

# With extras
pip install "llm-evaluation-framework[dashboard,reports,dev]"

# Docker
docker-compose up -d
```

## License

MIT: free for research and commercial use.

## Citation

```bibtex
@software{vigneshwar234_llm_eval_2025,
  author  = {Vigneshwar S},
  title   = {LLM Evaluation Framework},
  year    = {2025},
  url     = {https://github.com/vignesh2027/LLM-Evaluation-Framework},
  license = {MIT}
}
```