Batch hallucination + accuracy evaluation tool for RAG system LLMs
Hi!
RAGognizer's real-time token-level hallucination detection is impressive! For teams running batch evaluation across multiple LLMs before choosing a backbone for their RAG system, I built a complementary tool.
LLM Evaluation Framework provides batch evaluation with:
- Hallucination Rate: batch-scored across all test samples, reported as a single rate from 0.0 to 1.0
- Accuracy: verified against ground-truth answers
- Cost per 1K tokens: so you can budget your RAG inference
- Latency p95: RAG pipelines are latency-sensitive, so tail latency matters
- Reasoning Quality: for RAG models that cite reasoning
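To make the aggregation concrete, here is a minimal sketch of how four of these batch metrics could be computed from per-sample results. It is an illustration only, not the framework's actual code: the `SampleResult` record, the `summarize` and `p95` helpers, the exact-match accuracy rule, and the nearest-rank percentile are all assumptions. Reasoning Quality is left out because it would need a judge model or rubric that isn't described here.

```python
import math
from dataclasses import dataclass


@dataclass
class SampleResult:
    """One evaluated test sample (hypothetical record layout)."""
    answer: str          # model output
    ground_truth: str    # reference answer
    hallucinated: bool   # verdict from whatever hallucination check is used
    latency_s: float     # end-to-end latency in seconds
    tokens: int          # total tokens billed for this sample
    cost_usd: float      # cost billed for this sample


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarize(results):
    """Aggregate per-sample results into batch-level metrics."""
    n = len(results)
    if n == 0:
        raise ValueError("need at least one result")
    total_tokens = sum(r.tokens for r in results)
    if total_tokens == 0:
        raise ValueError("total token count must be positive")
    # Assumption: accuracy is a case-insensitive exact match on the answer.
    correct = sum(
        r.answer.strip().lower() == r.ground_truth.strip().lower()
        for r in results
    )
    return {
        "hallucination_rate": sum(r.hallucinated for r in results) / n,
        "accuracy": correct / n,
        "cost_per_1k_tokens": 1000 * sum(r.cost_usd for r in results) / total_tokens,
        "latency_p95_s": p95([r.latency_s for r in results]),
    }
```

Running `summarize` once per candidate LLM on the same test set gives one comparable row per model. Exact match is a strict accuracy rule; a fuzzier comparison may suit free-form RAG answers better.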
Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework
Would love to discuss combining batch evaluation with real-time token-level detection!