Systematic SLM evaluation: accuracy + cost + hallucination for RAG model selection

#2
by vigneshwar234 - opened

Hi aizip-dev team 👋

SLM RAG Arena is great for head-to-head comparison. For teams who want reproducible, quantitative benchmarks (not just preference-based arena scores), I built a complementary evaluation framework.

LLM Evaluation Framework for SLMs in RAG systems:

→ 🎯 Accuracy: reproducible, not subjective
→ 🔍 Hallucination Rate: critical for RAG, where models must stay grounded in the retrieved context
→ 💰 Cost per 1K tokens: SLMs are often chosen for cost reasons, so this should be quantified precisely
→ ⚡ Latency p95: RAG pipelines are latency-sensitive
→ 🧠 Reasoning Quality: for SLMs that explain their retrieval reasoning
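
To make the quantitative metrics above concrete, here is a minimal Python sketch of how four of them could be aggregated from per-request results. This is only an illustration, not the framework's actual API: `EvalRecord`, `p95`, and `summarize` are hypothetical names, the correctness and hallucination flags are assumed to have been produced by some earlier grading step, and the nearest-rank percentile is just one common convention.

```python
import math
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One graded request (hypothetical schema, for illustration only)."""
    correct: bool        # answer matched the reference
    hallucinated: bool   # answer made a claim not supported by the retrieved context
    tokens: int          # prompt + completion tokens for this request
    cost_usd: float      # spend attributed to this request
    latency_s: float     # end-to-end latency in seconds


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarize(records):
    """Aggregate accuracy, hallucination rate, cost per 1K tokens, and p95 latency."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    total_tokens = sum(r.tokens for r in records)
    total_cost = sum(r.cost_usd for r in records)
    return {
        "accuracy": sum(r.correct for r in records) / n,
        "hallucination_rate": sum(r.hallucinated for r in records) / n,
        "cost_per_1k_tokens": 1000 * total_cost / total_tokens,
        "latency_p95_s": p95([r.latency_s for r in records]),
    }
```

Calling `summarize(records)` on a list of `EvalRecord` objects returns a plain dict, so the same numbers can be compared across models or logged next to arena preference scores.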

Arena preference + quantitative metrics = better SLM selection.

Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework
