Evaluate LLM generation quality on top of your retrieval: cost + hallucination
Hi Sentence Transformers team,
Quantized retrieval is impressive work on the retrieval side. For the LLM generation layer on top of retrieved passages, I built an evaluation framework.
LLM Evaluation Framework evaluates the generation side of RAG:
- Hallucination Rate: does the LLM stay grounded to retrieved content or hallucinate?
- Accuracy: answer quality against ground truth
- Cost per 1K tokens: the other side of RAG optimization alongside quantized retrieval
- Latency p95: generation latency on top of retrieval latency
- Reasoning Quality: for models that cite their retrieved reasoning
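To make three of the metrics above concrete, here is a minimal sketch of how hallucination rate, cost per 1K tokens, and p95 latency could be aggregated from per-request logs. It is not the framework's actual API: the `GenerationRecord` type, its field names, and the three helper functions are hypothetical, and the `grounded` flag is assumed to come from some external grader (human or model) that checks the answer against the retrieved passages.

```python
import math
from dataclasses import dataclass


@dataclass
class GenerationRecord:
    """Hypothetical log entry for one generation call (illustrative fields only)."""
    total_tokens: int   # prompt + completion tokens for the call
    cost_usd: float     # billed cost of the call in USD
    latency_s: float    # wall-clock generation latency in seconds
    grounded: bool      # assumed grader verdict: answer supported by retrieved passages


def hallucination_rate(records):
    """Fraction of answers the grader marked as not grounded."""
    return sum(not r.grounded for r in records) / len(records)


def cost_per_1k_tokens(records):
    """Total cost divided by total tokens, scaled to 1,000 tokens."""
    total_tokens = sum(r.total_tokens for r in records)
    total_cost = sum(r.cost_usd for r in records)
    return 1000 * total_cost / total_tokens


def latency_p95(records):
    """95th-percentile latency using the nearest-rank method."""
    latencies = sorted(r.latency_s for r in records)
    rank = math.ceil(95 * len(latencies) / 100)  # 1-based rank
    return latencies[rank - 1]
```

Nearest-rank is only one of several percentile definitions; an interpolating method would give slightly different p95 values on small samples.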
Quantized retrieval (fast, cheap) + cost-optimized generation (this tool) = efficient full RAG stack.
Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework