Multi-metric evaluation: cost + latency + hallucination for long-context code LLMs

by vigneshwar234 - opened 25 days ago

Hi JetBrains Research team 👋

Long Code Arena is doing really valuable work on long-context code evaluation. For teams evaluating long-context code models for IDE integration, latency and cost matter too, because both tend to grow faster than linearly with context length.

I built an open-source LLM Evaluation Framework that tracks:

→ ⚡ Latency p50/p95/p99: tail latency grows sharply with long contexts, and the p95/p99 percentiles capture it
→ 💰 Cost per 1K tokens: long-context models can cost 5-10x more per task
→ 🔍 Hallucination Rate: long-context models sometimes hallucinate file paths and function names
→ 🎯 Accuracy: verifiable task completion scoring
→ 🧠 Reasoning Quality: CoT depth for code explanation tasks
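
To make the first three metrics concrete, here is a minimal sketch of how they could be computed from raw run data. It is not the framework's actual code: the function names (`percentile`, `cost_per_1k_tokens`, `hallucination_rate`), the nearest-rank percentile method, and the sample numbers are all assumptions chosen for illustration.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of
    all samples at or below it. (Assumed method; other definitions exist.)"""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_1k_tokens(total_cost_usd, total_tokens):
    """Average cost in USD per 1,000 tokens processed."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return 1000 * total_cost_usd / total_tokens


def hallucination_rate(referenced, known):
    """Fraction of distinct identifiers the model referenced (file paths,
    function names) that do not exist in the known set for the repository."""
    referenced = set(referenced)
    if not referenced:
        return 0.0
    return len(referenced - set(known)) / len(referenced)


# Hypothetical per-request latencies in seconds; note the long tail.
latencies_s = [1.2, 1.4, 1.5, 1.7, 2.0, 2.1, 2.4, 3.0, 6.5, 14.0]
p50 = percentile(latencies_s, 50)
p95 = percentile(latencies_s, 95)
p99 = percentile(latencies_s, 99)

# Hypothetical totals for one evaluation run.
cost = cost_per_1k_tokens(total_cost_usd=0.48, total_tokens=120_000)

# Hypothetical identifiers mentioned by the model vs. those in the repo.
mentioned = ["src/utils/io.py", "src/models/net.py", "parse_config", "load_weights_v2"]
in_repo = {"src/utils/io.py", "parse_config", "src/train.py"}
rate = hallucination_rate(mentioned, in_repo)

print(f"p50={p50}s p95={p95}s p99={p99}s")
print(f"cost per 1K tokens: ${cost:.4f}")
print(f"hallucination rate: {rate:.0%}")
```

With only ten samples the p95 and p99 both land on the slowest request, which is why tail percentiles need a reasonably large number of runs per context length before they say much.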

Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework

Would love to discuss how latency/cost measurement applies to long-context benchmarks!
