Open source cost + hallucination evaluation to complement coding agent rankings

by vigneshwar234 - opened 26 days ago

Hi 👋

Coding agent evaluation is fascinating. For teams choosing a coding agent for real development workflows, cost per task and hallucination rate (confabulated APIs, wrong library versions) are just as important as benchmark scores.

I built an open source LLM Evaluation Framework that captures:

→ 💰 Cost per 1K tokens: essential for high-volume coding tasks
→ 🔍 Hallucination Rate: catches overconfident but wrong API suggestions
→ ⚡ Latency p95: critical for interactive coding assistant UX
→ 🎯 Accuracy: four-strategy scorer for multiple-choice and exact-match coding tasks
→ 🧠 Reasoning Quality: chain-of-thought (CoT) depth, especially important for step-by-step code generation
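
To make the metrics above concrete, here is a rough sketch of how they could be aggregated from per-task records. It is not the framework's actual API: `TaskResult`, its fields, `p95`, and `summarize` are hypothetical names made up for illustration, and the nearest-rank percentile is just one common definition of p95. Reasoning quality is left out because it needs its own scorer.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One evaluated coding task (hypothetical record layout)."""
    tokens: int         # prompt + completion tokens used
    cost_usd: float     # amount billed for the call
    latency_s: float    # wall-clock seconds until the response finished
    correct: bool       # answer matched the reference
    hallucinated: bool  # answer cited an API or version that does not exist


def p95(values):
    """95th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarize(results):
    """Aggregate per-task records into the headline metrics."""
    if not results:
        raise ValueError("summarize needs at least one result")
    n = len(results)
    total_tokens = sum(r.tokens for r in results)
    total_cost = sum(r.cost_usd for r in results)
    return {
        "cost_per_1k_tokens": 1000 * total_cost / total_tokens,
        "cost_per_task": total_cost / n,
        "hallucination_rate": sum(r.hallucinated for r in results) / n,
        "latency_p95_s": p95([r.latency_s for r in results]),
        "accuracy": sum(r.correct for r in results) / n,
    }


if __name__ == "__main__":
    # Made-up records, only to show the call shape.
    demo = [
        TaskResult(1200, 0.012, 1.8, True, False),
        TaskResult(900, 0.009, 2.4, False, True),
        TaskResult(1500, 0.015, 1.1, True, False),
    ]
    for name, value in summarize(demo).items():
        print(f"{name}: {value:.4f}")
```

Reporting cost per task next to cost per 1K tokens helps when comparing agents, since an agent that is cheap per token can still be expensive per task if it uses many more tokens to finish.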

Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework

Open source and free. Happy to discuss integrating it with coding agent benchmarks!
