Multi-metric evaluation: cost + latency + hallucination for long-context code LLMs
Hi JetBrains Research team,
Long Code Arena is doing really valuable work on long-context code evaluation. For teams evaluating long-context code models for IDE integration, latency and cost scale non-linearly with context length.
I built an open-source LLM Evaluation Framework that tracks:
- Latency p50/p95/p99: tail latency becomes severe with long contexts, and the upper percentiles capture it
- Cost per 1K tokens: long-context models can cost 5-10x more per task
- Hallucination Rate: long-context models sometimes hallucinate file paths and function names
- Accuracy: scoring based on verifiable task completion
- Reasoning Quality: chain-of-thought (CoT) depth for code explanation tasks
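
To make the first three metrics concrete, here is a minimal sketch of how they could be computed from per-call logs. It is not the framework's actual API: the `Call` record, `percentile`, `summarize`, and the `repo_paths` set are hypothetical names chosen for illustration, and the percentile uses the simple nearest-rank definition.

```python
import math
from dataclasses import dataclass


@dataclass
class Call:
    """One logged model call (hypothetical record layout)."""
    latency_s: float        # wall-clock time for the call, in seconds
    tokens: int             # prompt + completion tokens billed
    cost_usd: float         # amount billed for the call
    cited_paths: list[str]  # file paths mentioned in the model output


def percentile(values, q):
    """Nearest-rank percentile of a non-empty sequence, for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(calls, repo_paths):
    """Aggregate latency, cost, and path-hallucination metrics."""
    latencies = [c.latency_s for c in calls]
    total_tokens = sum(c.tokens for c in calls)
    total_cost = sum(c.cost_usd for c in calls)
    cited = [p for c in calls for p in c.cited_paths]
    # A cited path counts as hallucinated if it does not exist in the repo.
    missing = [p for p in cited if p not in repo_paths]
    return {
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "p99_s": percentile(latencies, 99),
        "cost_per_1k_tokens": 1000 * total_cost / total_tokens,
        "hallucinated_path_rate": len(missing) / len(cited) if cited else 0.0,
    }
```

Calling `summarize(calls, repo_paths)` on the logs from one benchmark run returns a flat dictionary, so results for different context lengths can be compared side by side.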
Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework
Would love to discuss how latency/cost measurement applies to long-context benchmarks!