Complementary text LLM evaluation: accuracy + cost + hallucination
#136
by vigneshwar234 - opened
Hi TIGER-Lab team,
MMEB's multimodal embedding evaluation is impressive. For the text-side evaluation of models you're benchmarking, I built a complementary framework.
The LLM Evaluation Framework evaluates text LLMs on five metrics in a single run:
- Accuracy: 4-strategy cascade on MMLU and TruthfulQA
- Cost per 1K tokens: especially relevant for embedding models at scale
- Latency: p50/p95/p99
- Hallucination rate: runs locally
- Reasoning quality: chain-of-thought (CoT) depth
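
To make the cost and latency metrics concrete, here is a minimal Python sketch of how such numbers can be computed. It is an illustration only, not the framework's actual code: the function names (`percentile`, `latency_summary`, `cost_per_1k_tokens`) are hypothetical, and the nearest-rank percentile method is an assumption about one reasonable way to define p50/p95/p99.

```python
def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it.
    q is an integer between 1 and 100."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100, avoiding floating-point rounding.
    rank = max(1, -(-q * len(ordered) // 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Return p50, p95 and p99 for a list of per-request latencies (ms)."""
    return {f"p{q}": percentile(latencies_ms, q) for q in (50, 95, 99)}


def cost_per_1k_tokens(total_cost_usd, total_tokens):
    """Normalize a total spend to a cost per 1,000 tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return total_cost_usd * 1000 / total_tokens


if __name__ == "__main__":
    # Hypothetical measurements: 100 requests taking 1..100 ms.
    latencies = list(range(1, 101))
    print(latency_summary(latencies))
    # Hypothetical spend: $0.50 for 250,000 tokens.
    print(cost_per_1k_tokens(0.50, 250_000))
```

Reporting p95 and p99 alongside p50 matters because a few slow requests can dominate user-perceived latency even when the median looks fine.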
Live demo (no API key): https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework
Open source, free forever. Feedback from the TIGER-Lab community welcome!