| --- |
| license: mit |
| language: |
| - en |
| tags: |
| - llm-evaluation |
| - benchmarking |
| - nlp |
| - evaluation |
| - accuracy |
| - hallucination |
| - reasoning |
| - gpt |
| - claude |
| - gemini |
| - mistral |
| - llama |
| - mmlu |
| - truthfulqa |
| - open-source |
| - python |
| - fastapi |
| - streamlit |
| library_name: llm-evaluation-framework |
| pipeline_tag: text-generation |
| --- |
| |
| # LLM Evaluation Framework |
|
|
| <p align="center"> |
| <img src="https://img.shields.io/badge/python-3.10%2B-22c55e?style=flat-square&logo=python&logoColor=white"/> |
| <img src="https://img.shields.io/badge/License-MIT-eab308?style=flat-square"/> |
| <img src="https://img.shields.io/badge/FastAPI-0.115-14b8a6?style=flat-square&logo=fastapi"/> |
| <img src="https://img.shields.io/badge/Streamlit-1.40-ef4444?style=flat-square&logo=streamlit"/> |
| <img src="https://img.shields.io/badge/LiteLLM-1.52-8b5cf6?style=flat-square"/> |
| <img src="https://img.shields.io/github/stars/vignesh2027/LLM-Evaluation-Framework?style=flat-square&color=eab308"/> |
| </p> |
|
|
| > **Production-grade open-source LLM benchmarking.** |
| > Evaluate GPT-4, Claude, Gemini, Mistral and Llama on 5 metrics, side by side, in one command. |
|
|
| ## What This Is |
|
|
| This is the **model card / hub page** for the LLM Evaluation Framework. |
| The framework is a Python tool, not a set of model weights, so this page serves as |
| the Hugging Face Hub entry point that links all of its resources together. |
|
|
| | Resource | Link | |
| |---|---| |
| | GitHub | https://github.com/vignesh2027/LLM-Evaluation-Framework | |
| | Live Demo | https://huggingface.co/spaces/vigneshwar234/llm-eval-demo | |
| | Dataset | https://huggingface.co/datasets/vigneshwar234/llm-eval-benchmark | |
| | Docs | https://vignesh2027.github.io/LLM-Evaluation-Framework/ | |
|
|
| ## Quick Start |
|
|
| ```bash |
| pip install llm-evaluation-framework |
| export OPENAI_API_KEY="sk-..." |
| llm-eval run --model gpt-4o-mini --benchmark mmlu --samples 100 |
| ``` |
|
|
| **Output:** |
| ``` |
| ╭──────────────────────────────────────╮ |
| │ Evaluation: gpt-4o-mini              │ |
| ├──────────────────┬───────────────────┤ |
| │ Accuracy         │ 78.00%            │ |
| │ Avg Latency      │ 432 ms            │ |
| │ P95 Latency      │ 1240 ms           │ |
| │ Total Cost       │ $0.0023           │ |
| │ Hallucination    │ 2.40%             │ |
| │ Reasoning Score  │ 7.2 / 10          │ |
| ╰──────────────────┴───────────────────╯ |
| ``` |
|
|
| ## 5 Evaluation Metrics |
|
|
| | Metric | Description | Output | |
| |---|---|---| |
| | **Accuracy** | 4-strategy cascade: exact → normalized → multiple-choice (MC) → fuzzy | 0.0–1.0 | |
| | **Latency** | p50, p75, p90, p95, p99 percentiles + SLA violation rate | ms | |
| | **Cost** | Real token counts × pricing table for 15+ models | $/1K tokens | |
| | **Hallucination Rate** | Linguistic signal analysis (v1); NLI planned for v2 | 0.0–1.0 | |
| | **Reasoning Quality** | Chain-of-thought depth scoring | 1–10 | |
|
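This page does not show the framework's actual matching rules, so the following is only a minimal sketch of how a four-step accuracy cascade like the one in the table could work. The function names (`normalize`, `extract_choice`, `is_correct`), the A-D answer format, and the 0.85 fuzzy threshold are assumptions for illustration. `difflib.SequenceMatcher` stands in for whatever fuzzy matcher the project uses.

```python
import re
from difflib import SequenceMatcher
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def extract_choice(text: str) -> Optional[str]:
    """Pull an A-D option letter out of a free-form answer (assumed format)."""
    upper = text.strip().upper()
    # Prefer an explicit "answer is B" / "Answer: B" phrasing.
    match = re.search(r"ANSWER(?:\s+IS)?\s*:?\s*\(?([A-D])\b", upper)
    if match is None:
        # Otherwise accept a leading option letter such as "B", "(B)" or "B."
        match = re.match(r"\(?([A-D])\)?(?:[.):\s]|$)", upper)
    return match.group(1) if match else None


def is_correct(prediction: str, reference: str, fuzzy_threshold: float = 0.85) -> bool:
    """Try each strategy in order and stop at the first one that matches."""
    # 1. Exact string match.
    if prediction == reference:
        return True
    # 2. Match after normalization (case, punctuation, whitespace).
    if normalize(prediction) == normalize(reference):
        return True
    # 3. Multiple choice: only applies when the reference is a single option letter.
    ref_letter = reference.strip().upper()
    if re.fullmatch(r"[A-D]", ref_letter):
        return extract_choice(prediction) == ref_letter
    # 4. Fuzzy match on normalized text (threshold is an illustrative assumption).
    ratio = SequenceMatcher(None, normalize(prediction), normalize(reference)).ratio()
    return ratio >= fuzzy_threshold
```

Ordering the checks from strictest to loosest means the fuzzy step only decides cases the cheaper, less error-prone checks could not settle.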
|
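The latency row can be illustrated the same way. This is a sketch under stated assumptions, not the framework's code: `percentile` and `latency_summary` are hypothetical helpers, the percentile method is plain linear interpolation, and the 1000 ms SLA threshold is an invented default.

```python
from typing import Dict, List, Sequence


def percentile(sorted_values: Sequence[float], p: float) -> float:
    """Linear-interpolation percentile (p in [0, 100]) of an ascending list."""
    if not sorted_values:
        raise ValueError("no latency samples")
    rank = (len(sorted_values) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def latency_summary(latencies_ms: List[float], sla_ms: float = 1000.0) -> Dict[str, float]:
    """Report p50/p75/p90/p95/p99 plus the share of requests slower than the SLA."""
    ordered = sorted(latencies_ms)
    summary = {f"p{p}": percentile(ordered, p) for p in (50, 75, 90, 95, 99)}
    # sla_ms is a hypothetical threshold; a request violates it when strictly slower.
    violations = sum(1 for value in ordered if value > sla_ms)
    summary["sla_violation_rate"] = violations / len(ordered)
    return summary


if __name__ == "__main__":
    samples = [float(ms) for ms in range(100, 1200, 100)]  # 100, 200, ..., 1100
    print(latency_summary(samples))
```

Reporting tail percentiles alongside the median matters because a small share of slow requests can dominate perceived responsiveness while barely moving the average.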
| ## Supported Models |
|
|
| | Provider | Models | |
| |---|---| |
| | OpenAI | GPT-4o, GPT-4o-mini, o1, o1-mini, GPT-3.5-turbo | |
| | Anthropic | Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus | |
| | Google | Gemini 1.5 Pro, Gemini 1.5 Flash, Gemini 2.0 Flash | |
| | Mistral | Mistral Large, Mistral Small | |
| | Meta | Llama 3 70B, Llama 3 8B (via Together AI) | |
| | Local | Ollama, vLLM, HuggingFace TGI | |
|
|
| ## Sample Benchmark Results (MMLU, 100 samples) |
|
|
| | Model | Accuracy | Latency | Cost/1K | Hallucination | Reasoning | |
| |---|---|---|---|---|---| |
| | GPT-4o | 88.2% | 892ms | $0.0080 | 1.8% | 8.4/10 | |
| | Claude 3.5 Sonnet | 87.6% | 1240ms | $0.0090 | 2.1% | 8.6/10 | |
| | GPT-4o-mini | 78.4% | 432ms | $0.0003 | 3.2% | 7.2/10 | |
| | Gemini 1.5 Flash | 76.8% | 380ms | $0.0001 | 4.1% | 6.8/10 | |
| | Claude 3 Haiku | 74.2% | 410ms | $0.0010 | 4.8% | 6.5/10 | |
|
|
| **Key finding:** GPT-4o-mini achieves about 89% of GPT-4o's accuracy (78.4% vs. 88.2%) at roughly 4% of the cost ($0.0003 vs. $0.0080 per 1K tokens). |
|
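The ratios behind the key finding can be recomputed directly from the GPT-4o and GPT-4o-mini rows of the table above. The helper name `relative` is only illustrative.

```python
def relative(value: float, baseline: float) -> float:
    """Return value as a fraction of baseline."""
    return value / baseline


# Figures copied from the sample benchmark table.
gpt4o = {"accuracy": 88.2, "cost_per_1k": 0.0080}
gpt4o_mini = {"accuracy": 78.4, "cost_per_1k": 0.0003}

accuracy_ratio = relative(gpt4o_mini["accuracy"], gpt4o["accuracy"])
cost_ratio = relative(gpt4o_mini["cost_per_1k"], gpt4o["cost_per_1k"])

print(f"relative accuracy: {accuracy_ratio:.2%}")  # 88.89%
print(f"relative cost:     {cost_ratio:.2%}")      # 3.75%
```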
|
| ## Features |
|
|
| - **Async parallel evaluation**: 10 models at once via `asyncio.Semaphore` |
| - **Streamlit dashboard**: radar charts, latency histograms, cost vs quality scatter |
| - **FastAPI REST API**: 12 endpoints with OpenAPI docs |
| - **CLI tool**: 7 subcommands with rich terminal output |
| - **PDF report generator**: professional layout via ReportLab |
| - **SQLite persistence**: zero-config, file-based storage |
| - **Docker ready**: multi-stage build, `docker-compose up` |
| - **40+ tests, 95% coverage**: pytest, no API keys needed |
|
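As a rough illustration of the first feature, bounded parallel evaluation with `asyncio.Semaphore` usually follows the pattern below. This is a generic sketch, not the framework's implementation: `evaluate_model` is a hypothetical stand-in that only sleeps, and `evaluate_all` and `MAX_CONCURRENCY` are invented names.

```python
import asyncio
from typing import Dict, List

MAX_CONCURRENCY = 10  # mirrors the "10 models at once" figure; an assumed default


async def evaluate_model(model: str) -> Dict[str, str]:
    """Hypothetical placeholder for a real evaluation call to a provider."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {"model": model, "status": "ok"}


async def evaluate_all(models: List[str], limit: int = MAX_CONCURRENCY) -> List[Dict[str, str]]:
    """Evaluate every model, with at most `limit` evaluations in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(model: str) -> Dict[str, str]:
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await evaluate_model(model)

    # gather() returns results in the same order as the input list.
    return await asyncio.gather(*(bounded(m) for m in models))


if __name__ == "__main__":
    results = asyncio.run(evaluate_all(["gpt-4o", "gpt-4o-mini", "claude-3-5-sonnet"]))
    print(results)
```

The semaphore caps in-flight requests rather than the number of tasks created, which helps keep a large run within provider rate limits.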
|
| ## Architecture |
|
|
| ``` |
| CLI / FastAPI / Streamlit / PDF Generator |
|                     │ |
|         Core Evaluator (asyncio) |
|                     │ |
|          ┌──────────┼──────────┬──────────┐ |
|          Metrics    Benchmarks Database   LiteLLM |
|          accuracy   MMLU       SQLite     OpenAI |
|          latency    TruthfulQA            Anthropic |
|          cost       Custom CSV            Google |
|          hallucin.                        Mistral |
|          reasoning                        Together |
| ``` |
|
|
| ## Install |
|
|
| ```bash |
| # pip |
| pip install llm-evaluation-framework |
| |
| # With extras |
| pip install "llm-evaluation-framework[dashboard,reports,dev]" |
| |
| # Docker |
| docker-compose up -d |
| ``` |
|
|
| ## License |
|
|
| MIT: free for research and commercial use. |
|
|
| ## Citation |
|
|
| ```bibtex |
| @software{vigneshwar234_llm_eval_2025, |
| author = {Vigneshwar S}, |
| title = {LLM Evaluation Framework}, |
| year = {2025}, |
| url = {https://github.com/vignesh2027/LLM-Evaluation-Framework}, |
| license = {MIT} |
| } |
| ``` |
|
|