Open source cost + hallucination evaluation to complement coding agent rankings
Hi!
Coding agent evaluation is fascinating. For teams choosing a coding agent for real development workflows, cost per task and hallucination rate (confabulated APIs, wrong library versions) are just as important as benchmark scores.
I built an open source LLM Evaluation Framework that captures:
- Cost per 1K tokens: essential for high-volume coding tasks
- Hallucination rate: catches overconfident wrong API suggestions
- Latency p95: critical for interactive coding assistant UX
- Accuracy: 4-strategy scorer for multiple-choice (MC) and exact-match coding tasks
- Reasoning quality: chain-of-thought (CoT) depth, especially important for step-by-step code generation
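
To make the metrics above concrete, here is a minimal sketch of how they could be aggregated from per-task results. This is not the framework's actual API: `TaskResult`, `p95`, and `summarize` are hypothetical names, the record fields are assumptions, and the percentile uses the nearest-rank method, which may differ from what the framework does. Reasoning quality is left out because the post does not say how CoT depth is scored.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One evaluated coding task (hypothetical record, not the framework's schema)."""
    tokens: int          # prompt + completion tokens used for the task
    cost_usd: float      # amount billed for the task
    latency_s: float     # wall-clock seconds until the full response arrived
    hallucinated: bool   # response referenced a nonexistent API or wrong library version
    correct: bool        # response matched the expected answer


def p95(values):
    """Nearest-rank 95th percentile: smallest sample with >= 95% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize(results):
    """Aggregate per-task records into the headline metrics."""
    if not results:
        raise ValueError("need at least one task result")
    n = len(results)
    total_tokens = sum(r.tokens for r in results)
    total_cost = sum(r.cost_usd for r in results)
    return {
        "cost_per_1k_tokens": 1000 * total_cost / total_tokens,
        "cost_per_task": total_cost / n,
        "hallucination_rate": sum(r.hallucinated for r in results) / n,
        "latency_p95_s": p95([r.latency_s for r in results]),
        "accuracy": sum(r.correct for r in results) / n,
    }


# Made-up numbers for illustration only, not measurements of any real agent.
results = [
    TaskResult(tokens=1000, cost_usd=0.02, latency_s=1.0, hallucinated=False, correct=True),
    TaskResult(tokens=3000, cost_usd=0.06, latency_s=2.0, hallucinated=True, correct=False),
    TaskResult(tokens=2000, cost_usd=0.04, latency_s=4.0, hallucinated=False, correct=True),
    TaskResult(tokens=2000, cost_usd=0.04, latency_s=3.0, hallucinated=False, correct=True),
]
summary = summarize(results)
```

Reporting cost per task alongside cost per 1K tokens matters for agents, since a cheap-per-token model that takes many more tokens to finish a task can still cost more overall.
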
Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework
Open source, free. Happy to discuss integrating with coding agent benchmarks!