Batch hallucination + accuracy evaluation tool for RAG system LLMs

#1
by vigneshwar234 - opened

Hi 👋

RAGognizer's real-time token-level hallucination detection is impressive! For teams running batch evaluation across multiple LLMs before choosing a backbone for their RAG system, I built a complementary tool.

LLM Evaluation Framework provides batch evaluation with:

→ 🔍 Hallucination Rate: batch-scored across all test samples, gives a single 0.0-1.0 rate
→ 🎯 Accuracy: verified against ground truth answers
→ 💰 Cost per 1K tokens: so you can budget your RAG inference
→ ⚡ Latency p95: RAG pipelines are latency-sensitive, tail latency matters
→ 🧠 Reasoning Quality: for RAG models that cite reasoning
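
To make the metrics above concrete, here is a minimal sketch of how the four numeric ones could be aggregated from per-sample results. It is an illustration only, not the framework's actual code: `SampleResult`, `summarize`, and all field names are hypothetical, the per-sample `hallucinated` and `correct` flags are assumed to come from some upstream judge, and p95 uses the nearest-rank method. Reasoning quality is left out because it needs a scoring rubric that this post does not describe.

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    """One evaluated test sample (hypothetical schema, for illustration)."""
    hallucinated: bool   # assumed to be set by an upstream hallucination judge
    correct: bool        # assumed: answer matched the ground truth
    latency_ms: float    # end-to-end latency for this sample
    tokens: int          # total tokens billed for this sample
    cost_usd: float      # cost billed for this sample


def summarize(results: list[SampleResult]) -> dict[str, float]:
    """Aggregate per-sample results into batch-level metrics."""
    n = len(results)
    if n == 0:
        raise ValueError("need at least one sample")

    # Nearest-rank p95: smallest latency with at least 95% of samples at or below it.
    latencies = sorted(r.latency_ms for r in results)
    rank = (95 * n + 99) // 100  # integer ceil(0.95 * n), avoids float rounding
    p95 = latencies[rank - 1]

    total_tokens = sum(r.tokens for r in results)
    total_cost = sum(r.cost_usd for r in results)

    return {
        "hallucination_rate": sum(r.hallucinated for r in results) / n,
        "accuracy": sum(r.correct for r in results) / n,
        "latency_p95_ms": p95,
        "cost_per_1k_tokens": 1000 * total_cost / total_tokens if total_tokens else 0.0,
    }


if __name__ == "__main__":
    demo = [
        SampleResult(True, True, 100.0, 500, 0.001),
        SampleResult(False, True, 200.0, 500, 0.001),
        SampleResult(False, False, 300.0, 500, 0.001),
        SampleResult(False, True, 400.0, 500, 0.001),
    ]
    print(summarize(demo))
```

Running `summarize` once per candidate model on the same test set gives rows that can be compared side by side when picking a backbone.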

Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework

Would love to discuss combining batch evaluation with real-time token-level detection!
