Open source cost + hallucination evaluation to complement coding agent rankings

#7
by vigneshwar234 - opened

Hi 👋

Coding agent evaluation is fascinating. For teams choosing a coding agent for real development workflows, cost per task and hallucination rate (confabulated APIs, wrong library versions) are just as important as benchmark scores.

I built an open source LLM Evaluation Framework that captures:

→ 💰 Cost per 1K tokens: essential for high-volume coding tasks
→ 🔍 Hallucination Rate: catches overconfident wrong API suggestions
→ ⚡ Latency p95: critical for interactive coding assistant UX
→ 🎯 Accuracy: 4-strategy scorer for MC and exact-match coding tasks
→ 🧠 Reasoning Quality: CoT depth (especially important for step-by-step code generation)
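
To make the first three metrics concrete, here is a minimal sketch of how they could be computed from per-call logs. This is not the framework's actual API: the `RunRecord` schema, the function names, and the nearest-rank choice for p95 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One model call. Hypothetical log schema, not the framework's own."""
    tokens: int         # prompt + completion tokens for the call
    cost_usd: float     # billed cost of the call in USD
    latency_s: float    # wall-clock latency in seconds
    hallucinated: bool  # True if a reviewer/judge flagged a made-up API or version


def cost_per_1k_tokens(records):
    """Total cost divided by total tokens, scaled to 1,000 tokens."""
    total_tokens = sum(r.tokens for r in records)
    total_cost = sum(r.cost_usd for r in records)
    return 1000 * total_cost / total_tokens


def hallucination_rate(records):
    """Fraction of calls flagged as hallucinated (0.0 to 1.0)."""
    return sum(r.hallucinated for r in records) / len(records)


def latency_p95(records):
    """95th-percentile latency using the nearest-rank method.

    Assumes a non-empty list. Integer arithmetic avoids float rounding
    when computing ceil(0.95 * n).
    """
    ordered = sorted(r.latency_s for r in records)
    rank = (95 * len(ordered) + 99) // 100  # ceil(95 * n / 100), always >= 1
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up sample data, only to show the call pattern.
    sample = [
        RunRecord(tokens=1000, cost_usd=0.01, latency_s=0.5, hallucinated=False),
        RunRecord(tokens=1000, cost_usd=0.01, latency_s=1.0, hallucinated=True),
        RunRecord(tokens=2000, cost_usd=0.02, latency_s=1.5, hallucinated=False),
        RunRecord(tokens=4000, cost_usd=0.04, latency_s=4.0, hallucinated=False),
    ]
    print(cost_per_1k_tokens(sample))
    print(hallucination_rate(sample))
    print(latency_p95(sample))
```

Aggregating cost over total tokens (rather than averaging per-call ratios) keeps long tasks from being under-weighted, which matters for agents that make a few very large calls.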

Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework

Open source, free. Happy to discuss integrating with coding agent benchmarks!
