Evaluate LLM generation quality on top of your retrieval: cost + hallucination

#8
by vigneshwar234 - opened

Hi Sentence Transformers team 👋

The quantized retrieval work here is impressive. For the LLM generation layer that sits on top of the retrieved passages, I built an evaluation framework.

LLM Evaluation Framework evaluates the generation side of RAG:

→ 🔍 Hallucination Rate: does the LLM stay grounded in the retrieved content, or does it hallucinate?
→ 🎯 Accuracy: answer quality against ground truth
→ 💰 Cost per 1K tokens: the other side of RAG optimization alongside quantized retrieval
→ ⚡ Latency p95: generation latency on top of retrieval latency
→ 🧠 Reasoning Quality: for models that cite their retrieved reasoning
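
To make the metrics above concrete, here is a minimal sketch of how numbers like these could be computed from a batch of logged generations. It does not use the framework's actual API: `EvalRecord`, `is_grounded`, `p95`, and `summarize` are hypothetical names, the token-overlap grounding check is only a crude stand-in for a real hallucination detector, and exact-match accuracy is an assumed scoring rule.

```python
import math
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One logged generation (hypothetical schema, not the framework's)."""
    answer: str          # text produced by the LLM
    reference: str       # ground-truth answer
    passages: list[str]  # retrieved passages given to the LLM
    total_tokens: int    # prompt + completion tokens
    cost_usd: float      # billed cost of this call
    latency_s: float     # generation latency in seconds


def is_grounded(answer: str, passages: list[str], threshold: float = 0.5) -> bool:
    """Crude proxy: the share of answer tokens that also occur in the passages."""
    answer_tokens = set(answer.lower().split())
    if not answer_tokens:
        return True  # an empty answer asserts nothing
    passage_tokens = set(" ".join(passages).lower().split())
    overlap = len(answer_tokens & passage_tokens) / len(answer_tokens)
    return overlap >= threshold


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize(records: list[EvalRecord]) -> dict[str, float]:
    """Aggregate per-call records into the four numeric metrics."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    total_tokens = sum(r.total_tokens for r in records)
    total_cost = sum(r.cost_usd for r in records)
    ungrounded = sum(not is_grounded(r.answer, r.passages) for r in records)
    exact = sum(r.answer.strip().lower() == r.reference.strip().lower() for r in records)
    return {
        "hallucination_rate": ungrounded / n,
        "accuracy": exact / n,  # exact match, an assumed scoring rule
        "cost_per_1k_tokens": 1000 * total_cost / total_tokens,
        "latency_p95_s": p95([r.latency_s for r in records]),
    }
```

Calling `summarize(records)` on a list of `EvalRecord` objects returns a dictionary with one value per metric. Reasoning quality is left out because it usually needs a judge model or human rating rather than a closed-form score.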

Quantized retrieval (fast, cheap) + cost-optimized generation (this tool) = efficient full RAG stack.

Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework
