Systematic SLM evaluation: accuracy + cost + hallucination for RAG model selection
Hi aizip-dev team,
SLM RAG Arena is great for head-to-head comparison. For teams who want reproducible, quantitative benchmarks (not just preference-based arena scores), I built a complementary evaluation framework.
LLM Evaluation Framework for SLMs in RAG systems:
- Accuracy: reproducible, not subjective
- Hallucination Rate: critical for RAG, where models must stay grounded in the retrieved context
- Cost per 1K tokens: SLMs are often chosen for cost reasons, so this should be quantified precisely
- Latency p95: RAG pipelines are latency-sensitive
- Reasoning Quality: for SLMs that explain their retrieval reasoning
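To make the metrics concrete, here is a minimal sketch of how the first four could be aggregated from per-query results. It is an illustration only and not the framework's actual API: `EvalRecord`, `p95`, and `summarize` are hypothetical names, and the `correct` / `grounded` labels are assumed to come from whatever judging step you already have. Reasoning quality is left out because it needs a rubric or a judge model.

```python
import math
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One evaluated query (hypothetical schema, for illustration only)."""
    correct: bool      # answer matched the reference answer
    grounded: bool     # answer is supported by the retrieved context
    tokens: int        # prompt + completion tokens for the call
    cost_usd: float    # billed or estimated cost of the call
    latency_s: float   # wall-clock latency in seconds


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize(records):
    """Aggregate per-query records into the four quantitative metrics."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    total_tokens = sum(r.tokens for r in records)
    if total_tokens == 0:
        raise ValueError("total token count is zero")
    total_cost = sum(r.cost_usd for r in records)
    return {
        "accuracy": sum(r.correct for r in records) / n,
        # Share of answers NOT supported by the retrieved context.
        "hallucination_rate": sum(not r.grounded for r in records) / n,
        "cost_per_1k_tokens": 1000 * total_cost / total_tokens,
        "latency_p95_s": p95([r.latency_s for r in records]),
    }
```

Running `summarize` once per candidate SLM on the same query set gives one row per model, which can then be compared next to the arena preference scores.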
Arena preference + quantitative metrics = better SLM selection.
Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework