	<!doctype html>
	<html lang="en">
	<head>
	<meta charset="utf-8">
	<meta name="viewport" content="width=device-width, initial-scale=1">
	<title>Evaluating LLMs on Custom Benchmarks</title>

	<!-- KaTeX -->
	<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.11/dist/katex.min.css" crossorigin="anonymous">
	<script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.11/dist/katex.min.js" crossorigin="anonymous"></script>
	<script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.11/dist/contrib/auto-render.min.js" crossorigin="anonymous"
	onload="renderMathInElement(document.body,{delimiters:[{left:'\\[',right:'\\]',display:true},{left:'\\(',right:'\\)',display:false}],throwOnError:false});"></script>

	<!-- Your stylesheet -->
	<link rel="stylesheet" href="../style.css">
	</head>
	<body><script src="../script.js" defer></script>

	<div class="background"></div>

	<main class="article">

	<a class="back" href="../index.html">← Back</a>

	<div class="meta">June 2026 · AI Evaluation</div>

	<h1>Evaluating LLMs on Custom Benchmarks: Lessons from a Hybrid Evaluator</h1>

	<p>Large language models are impressive, but measuring their real capabilities requires more than simple string matching. We recently built a custom evaluation framework for a diverse “easy” benchmark covering pattern recognition, commonsense reasoning, logic, math, language, and knowledge. Here is what we learned, and why per-category hybrid scoring matters.</p>

	<h2>The Benchmark</h2>
	<p>The dataset contains hundreds of straightforward questions across 15+ categories, grouped into six families:</p>

	<ul>
	<li><strong>Pattern-matching</strong>: Analogies, odd-one-out, sequence completion.</li>
	<li><strong>Commonsense</strong>: Simulation, causality, everyday reasoning.</li>
	<li><strong>Logic</strong>: Consistency, deduction, formal patterns.</li>
	<li><strong>Math</strong>: Arithmetic, reasoning, number patterns.</li>
	<li><strong>Language</strong>: Structure, transformation, comprehension.</li>
	<li><strong>Knowledge</strong>: Definitions and basic facts.</li>
	</ul>

	<p>Each entry follows a simple JSON structure with <code>category</code>, <code>difficulty</code>, <code>question</code>, and <code>answer</code>. Many answers are semantic, making naive exact-match evaluation misleading.</p>
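	<p>As a rough illustration, one entry might look like the record parsed below. The four field names come from the structure just described; the category label, the difficulty value, and the <code>load_entry</code> helper are assumptions made for this sketch, not the benchmark's actual contents.</p>

```python
import json

REQUIRED_FIELDS = ("category", "difficulty", "question", "answer")

# Hypothetical record: the field names match the post, the values are illustrative.
RAW_ENTRY = """
{
  "category": "pattern-matching-analogies",
  "difficulty": "easy",
  "question": "cat is to kitten as dog is to ?",
  "answer": "puppy"
}
"""


def load_entry(raw: str) -> dict:
    """Parse one benchmark entry and check that every required field is present and non-empty."""
    entry = json.loads(raw)
    missing = [f for f in REQUIRED_FIELDS if not str(entry.get(f, "")).strip()]
    if missing:
        raise ValueError(f"entry is missing fields: {missing}")
    return entry


entry = load_entry(RAW_ENTRY)
print(entry["category"], "|", entry["answer"])
```

	<p>Validating entries at load time keeps malformed records from silently skewing a category average later on.</p>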

	<h2>Why Standard Tools Fall Short</h2>
	<p>Exact string matching severely underestimates performance on reasoning tasks. A model might correctly grasp “cat is to kitten as dog is to ?” but respond with an explanation or slight phrasing variation.</p>

	<p>Key challenges:</p>
	<ul>
	<li>Semantic equivalence vs. literal text</li>
	<li>Verbose model outputs that bury the actual answer</li>
	<li>Category-specific correctness criteria</li>
	<li>Dataset duplicates (especially in logic)</li>
	</ul>
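	<p>The duplicate problem in the list above can be caught before any model is run. The sketch below shows one possible approach, assuming entries are dicts with a <code>question</code> field; <code>normalize_question</code> and <code>find_duplicates</code> are hypothetical helper names, and the sample entries are made up.</p>

```python
import string
from collections import defaultdict


def normalize_question(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


def find_duplicates(entries: list[dict]) -> dict[str, list[int]]:
    """Map each normalized question that occurs more than once to its entry indices."""
    seen = defaultdict(list)
    for index, entry in enumerate(entries):
        seen[normalize_question(entry["question"])].append(index)
    return {key: idxs for key, idxs in seen.items() if len(idxs) > 1}


# Illustrative entries, not taken from the real dataset.
sample = [
    {"category": "logic", "question": "If A implies B and A is true, is B true?"},
    {"category": "logic", "question": "if A implies B, and A is true, is B true"},
    {"category": "math", "question": "What is 2 + 3?"},
]
duplicates = find_duplicates(sample)
print(duplicates)
```

	<p>Note that removing punctuation also removes operators such as <code>+</code>, so math questions may need a gentler normalization than the one assumed here.</p>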

	<h2>The Hybrid Evaluation System</h2>
	<p>We implemented a <strong>category-aware evaluator router</strong> with mixed 0–1 scoring:</p>

	<table>
	<thead>
	<tr><th>Category Type</th><th>Scoring Method</th><th>Examples</th></tr>
	</thead>
	<tbody>
	<tr><td>Strict Exact</td><td>Normalized string equality</td><td>Math-arithmetic, Logic, Knowledge-basic</td></tr>
	<tr><td>Flexible Exact</td><td>Lowercase + punctuation removal</td><td>Pattern-matching, Language-structure</td></tr>
	<tr><td>Semantic</td><td>Embedding cosine similarity (threshold ~0.88)</td><td>Commonsense, Language-comprehension</td></tr>
	<tr><td>Hybrid</td><td>Combination of above</td><td>Language-transformation</td></tr>
	</tbody>
	</table>
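	<p>A minimal sketch of how such a router could be wired up is shown below. It is not the framework's actual code: the category prefixes in <code>ROUTES</code>, the exact normalization rules, and the fallback to flexible matching are assumptions, and <code>toy_embed</code> is a bag-of-words stand-in for whatever sentence-embedding model a real run would use. Only the 0.88 threshold is taken from the table. The hybrid row is left out because the post does not say how the methods are combined.</p>

```python
import math
import string
from collections import Counter
from typing import Callable

SEMANTIC_THRESHOLD = 0.88  # approximate threshold quoted in the table


def strict_exact(pred: str, gold: str) -> float:
    """Equality after trimming and collapsing whitespace (an assumed normalization)."""
    return float(" ".join(pred.split()) == " ".join(gold.split()))


def flexible_exact(pred: str, gold: str) -> float:
    """Equality after lowercasing and removing ASCII punctuation."""
    def clean(text: str) -> str:
        table = str.maketrans("", "", string.punctuation)
        return " ".join(text.lower().translate(table).split())
    return float(clean(pred) == clean(gold))


def toy_embed(text: str) -> Counter:
    """Stand-in bag-of-words embedding; a real run would call a sentence-embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic(pred: str, gold: str, embed: Callable[[str], Counter] = toy_embed) -> float:
    """1.0 when the cosine similarity of the two embeddings reaches the threshold."""
    return float(cosine(embed(pred), embed(gold)) >= SEMANTIC_THRESHOLD)


# Hypothetical category prefixes; the real router's keys are not shown in the post.
ROUTES = {
    "math-arithmetic": strict_exact,
    "logic": strict_exact,
    "pattern-matching": flexible_exact,
    "commonsense": semantic,
}


def score(category: str, pred: str, gold: str) -> float:
    """Dispatch to the scorer whose prefix matches the category, defaulting to flexible exact."""
    for prefix, scorer in ROUTES.items():
        if category.startswith(prefix):
            return scorer(pred, gold)
    return flexible_exact(pred, gold)


print(score("math-arithmetic", " 42 ", "42"))
print(score("pattern-matching-analogies", "Puppy.", "puppy"))
print(score("commonsense-causality", "the glass breaks", "the glass breaks"))
```

	<p>Keeping the routing table as plain data makes it easy to move a category from one scoring method to another after inspecting real outputs.</p>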

	<p>Scores are averaged within each category, and we report two overall percentages: micro (the mean over all samples, so larger categories count more) and macro (the unweighted mean of the category averages).</p>
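	<p>The difference between the two overall figures is easiest to see on a tiny example. The sketch below assumes results arrive as <code>(category, score)</code> pairs with scores between 0 and 1; the <code>aggregate</code> helper and the demo numbers are made up for illustration and are not results from the benchmark.</p>

```python
from collections import defaultdict


def aggregate(results: list[tuple[str, float]]) -> dict:
    """Return per-category means plus micro and macro averages as percentages."""
    by_category = defaultdict(list)
    for category, value in results:
        by_category[category].append(value)

    per_category = {c: 100 * sum(v) / len(v) for c, v in by_category.items()}
    # Micro: every sample counts equally, so large categories dominate.
    micro = 100 * sum(value for _, value in results) / len(results)
    # Macro: every category counts equally, regardless of its size.
    macro = sum(per_category.values()) / len(per_category)
    return {"per_category": per_category, "micro": micro, "macro": macro}


# Made-up scores for illustration: four commonsense items, one logic item.
demo = [
    ("commonsense", 1.0),
    ("commonsense", 1.0),
    ("commonsense", 1.0),
    ("commonsense", 0.0),
    ("logic", 0.0),
]
summary = aggregate(demo)
print(summary)
```

	<p>With these made-up numbers the micro average is 60% and the macro average is 37.5%. The gap shows how a small, weak category pulls the macro figure down while barely moving the micro one.</p>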

	<h2>Results on Qwen2.5-1.5B-Instruct</h2>
	<p>On this compact instruction-tuned model we observed clear patterns:</p>

	<h3>Strengths (75–85%)</h3>
	<ul>
	<li>Commonsense simulation, causality, and reasoning</li>
	<li>Language transformation and comprehension</li>
	<li>Knowledge definitions</li>
	</ul>

	<h3>Weaknesses</h3>
	<ul>
	<li>Pattern-matching (raw score low due to explanatory outputs)</li>
	<li>Symbolic logic (0–33%)</li>
	<li>Math pattern continuation</li>
	</ul>

	<blockquote>
	<p>The biggest insight: many “failures” were format issues rather than reasoning errors. The model recognized the pattern but wrapped the answer in an explanation, which exact-match scoring then marked wrong.</p>
	</blockquote>

	<h2>Key Takeaways</h2>
	<ol>
	<li><strong>Answer extraction is critical.</strong> Strip explanations, apply last-sentence or last-word heuristics, or use category-specific parsers.</li>
	<li><strong>One-size-fits-all metrics hide the truth.</strong> Per-category scoring reveals real strengths and weaknesses.</li>
	<li><strong>Embeddings provide cheap, reliable semantic judgment</strong> without relying on an LLM judge.</li>
	<li><strong>Partial credit systems improve signal.</strong> They reward conceptual correctness even when formatting is imperfect.</li>
	<li>In this run, a small model handled everyday reasoning surprisingly well but struggled with strict symbolic manipulation.</li>
	</ol>
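	<p>To make the first takeaway concrete, here is one possible extraction heuristic. It is a sketch under stated assumptions rather than the framework's parser: the cue pattern (“answer is” or “answer:”), the last-line fallback, and the <code>extract_answer</code> name are all hypothetical and would need tuning against real model outputs.</p>

```python
import re

# Assumed cue phrases: "answer is ...", "answer is: ...", or "answer: ...".
ANSWER_CUE = re.compile(r"\banswer(?:\s+is\b\s*:?|\s*:)\s*(.+)", re.IGNORECASE)


def extract_answer(output: str) -> str:
    """Pull a short answer out of a verbose response.

    Strategy: use the text after the last answer cue if one exists,
    otherwise fall back to the last non-empty line. Surrounding quotes,
    whitespace, and trailing sentence punctuation are stripped.
    """
    lines = [line.strip() for line in output.strip().splitlines() if line.strip()]
    if not lines:
        return ""
    candidate = lines[-1]
    matches = ANSWER_CUE.findall(output)
    if matches:
        candidate = matches[-1]
    return candidate.strip().strip("\"'").rstrip(".!?").strip()


verbose = (
    "A kitten is a young cat, so we need the word for a young dog.\n"
    "The answer is: Puppy."
)
print(extract_answer(verbose))
print(extract_answer("puppy"))
```

	<p>The extracted string still differs from the gold answer in case, so it would pass a lowercase-and-punctuation comparison but not a strict one. Extraction and scoring therefore have to be designed together.</p>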

	<h2>Better Benchmarks Ahead</h2>
	<p>Robust LLM evaluation is as much an engineering challenge as a modeling one. Start with a clean dataset, invest early in a flexible evaluator, and iterate based on real runs.</p>

	<p>Future improvements could include adversarial paraphrasing, multi-answer gold sets, confidence-weighted scoring, and interactive dashboards.</p>

	<p>Custom benchmarks like this one give far more actionable insights than generic leaderboards. What evaluation tricks have you discovered in your own work?</p>

	<hr>

	<p><em>Built with a hybrid Python framework combining direct inference and category-specific scoring logic.</em></p>

	</main>

	</body>
	</html>