Complementary text LLM evaluation: accuracy + cost + hallucination

#136
by vigneshwar234 - opened

Hi TIGER-Lab team 👋

MMEB's multimodal embedding evaluation is impressive. For the text-side evaluation of models you're benchmarking, I built a complementary framework.

The LLM Evaluation Framework evaluates text LLMs on 5 metrics in a single run:

→ 🎯 Accuracy: 4-strategy cascade on MMLU and TruthfulQA
→ 💰 Cost per 1K tokens: especially relevant for embedding models at scale
→ ⚡ Latency p50/p95/p99
→ 🔍 Hallucination Rate: runs locally
→ 🧠 Reasoning Quality: CoT depth
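
To make the latency and cost metrics above concrete, here is a minimal Python sketch of how such numbers are commonly computed. It is illustrative only and is not taken from the framework's code: the function names `percentile`, `latency_summary`, and `cost_per_1k_tokens` are hypothetical, and the nearest-rank percentile definition is an assumption (the framework may interpolate or use a different method).

```python
def percentile(samples, q):
    """Nearest-rank percentile (assumed definition): the smallest sample
    such that at least q percent of all samples are <= it. q is an int in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= q <= 100:
        raise ValueError("q must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100, which avoids floating-point rounding issues.
    rank = (q * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Return p50/p95/p99 for a list of per-request latencies in milliseconds."""
    return {f"p{q}": percentile(latencies_ms, q) for q in (50, 95, 99)}


def cost_per_1k_tokens(total_cost_usd, total_tokens):
    """Normalize a total spend to a price per 1,000 tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return total_cost_usd / total_tokens * 1000


if __name__ == "__main__":
    # Hypothetical measurements, made up purely for illustration.
    latencies = [120, 135, 150, 180, 900]
    print(latency_summary(latencies))
    print(cost_per_1k_tokens(total_cost_usd=0.50, total_tokens=250_000))
```

Reporting p95 and p99 alongside p50 matters because a few slow requests (like the 900 ms outlier in the made-up data) barely move the median but dominate the tail.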

Live demo (no API key): https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework

Open source, free forever. Feedback from the TIGER-Lab community welcome!
