<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Evaluating LLMs on Custom Benchmarks</title>
<!-- KaTeX -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.11/dist/katex.min.css" crossorigin="anonymous">
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.11/dist/katex.min.js" crossorigin="anonymous"></script>
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.11/dist/contrib/auto-render.min.js" crossorigin="anonymous"
onload="renderMathInElement(document.body,{delimiters:[{left:'\\[',right:'\\]',display:true},{left:'\\(',right:'\\)',display:false}],throwOnError:false});"></script>
<!-- Your stylesheet -->
<link rel="stylesheet" href="../style.css">
</head>
<body><script src="../script.js" defer></script>
<div class="background"></div>
<main class="article">
<a class="back" href="../index.html">&larr; Back</a>
<div class="meta">June 2026 &middot; AI Evaluation</div>
<h1>Evaluating LLMs on Custom Benchmarks: Lessons from a Hybrid Evaluator</h1>
<p>Large Language Models are impressive, but measuring their real capabilities requires more than simple string matching. We recently built a custom evaluation framework for a diverse “easy” benchmark covering pattern recognition, commonsense reasoning, logic, math, language, and knowledge. Here’s what we learned, and why per-category hybrid scoring matters.</p>
<h2>The Benchmark</h2>
<p>The dataset contains hundreds of straightforward questions across 15+ fine-grained categories, grouped into six families:</p>
<ul>
<li><strong>Pattern-matching</strong>: Analogies, odd-one-out, sequence completion.</li>
<li><strong>Commonsense</strong>: Simulation, causality, everyday reasoning.</li>
<li><strong>Logic</strong>: Consistency, deduction, formal patterns.</li>
<li><strong>Math</strong>: Arithmetic, reasoning, number patterns.</li>
<li><strong>Language</strong>: Structure, transformation, comprehension.</li>
<li><strong>Knowledge</strong>: Definitions and basic facts.</li>
</ul>
<p>Each entry follows a simple JSON structure with four fields: <code>category</code>, <code>difficulty</code>, <code>question</code>, and <code>answer</code>. Many answers are correct by meaning rather than by exact wording, which makes naive exact-match evaluation misleading.</p>
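<p>As a rough illustration of that structure, here is a minimal loader sketch. The four field names come from the benchmark; the <code>BenchmarkEntry</code> class, the <code>load_entries</code> helper, the assumption that the file is a JSON array, and the sample entry's label values are all hypothetical.</p>

```python
import json
from dataclasses import dataclass

REQUIRED_FIELDS = ("category", "difficulty", "question", "answer")


@dataclass(frozen=True)
class BenchmarkEntry:
    """One benchmark item (hypothetical container; field names from the post)."""
    category: str
    difficulty: str
    question: str
    answer: str


def load_entries(raw_json: str) -> list[BenchmarkEntry]:
    """Parse a JSON array of entries, failing loudly on missing fields."""
    entries = []
    for i, item in enumerate(json.loads(raw_json)):
        missing = [f for f in REQUIRED_FIELDS if f not in item]
        if missing:
            raise ValueError(f"entry {i} is missing fields: {missing}")
        # Coerce to str so numeric gold answers compare cleanly later.
        entries.append(BenchmarkEntry(**{f: str(item[f]) for f in REQUIRED_FIELDS}))
    return entries


# Made-up sample entry; the category and difficulty labels are illustrative.
SAMPLE = (
    '[{"category": "pattern-matching", "difficulty": "easy", '
    '"question": "cat is to kitten as dog is to ?", "answer": "puppy"}]'
)
entries = load_entries(SAMPLE)
```

<p>Validating fields at load time keeps schema problems from surfacing later as silent scoring errors.</p>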
<h2>Why Standard Tools Fall Short</h2>
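<p>Exact string matching severely underestimates performance on reasoning tasks. A model might correctly grasp “cat is to kitten as dog is to ?” but respond with an explanation or a slight phrasing variation, such as “The answer is puppy, because a kitten is a young cat,” which fails a literal comparison against “puppy”.</p>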
<p>Key challenges:</p>
<ul>
<li>Semantic equivalence vs. literal text</li>
<li>Verbose model outputs that bury the actual answer</li>
<li>Category-specific correctness criteria</li>
<li>Dataset duplicates (especially in logic)</li>
</ul>
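<p>The duplicate problem in the list above can be handled before any scoring happens. The sketch below is one possible approach, not the framework's actual code: it treats two entries as duplicates when they share a category and their questions match after lowercasing and stripping punctuation. The helper names <code>question_key</code> and <code>deduplicate</code> are hypothetical.</p>

```python
import re


def question_key(question: str) -> str:
    """Collapse case, punctuation, and whitespace so near-identical questions collide."""
    text = re.sub(r"[^\w\s]", " ", question.lower())
    return " ".join(text.split())


def deduplicate(entries: list[dict]) -> tuple[list[dict], int]:
    """Keep the first occurrence of each (category, normalized question) pair.

    Returns the unique entries and the number of duplicates dropped.
    """
    seen = set()
    unique = []
    for entry in entries:
        key = (entry["category"], question_key(entry["question"]))
        if key in seen:
            continue
        seen.add(key)
        unique.append(entry)
    return unique, len(entries) - len(unique)
```

<p>Reporting the dropped count alongside the results makes it clear how much of a category's sample size was inflated by repeats.</p>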
<h2>The Hybrid Evaluation System</h2>
<p>We implemented a <strong>category-aware evaluator router</strong> with mixed 0–1 scoring:</p>
<table>
<thead>
<tr><th>Category Type</th><th>Scoring Method</th><th>Examples</th></tr>
</thead>
<tbody>
<tr><td>Strict Exact</td><td>Normalized string equality</td><td>Math-arithmetic, Logic, Knowledge-basic</td></tr>
<tr><td>Flexible Exact</td><td>Lowercase + punctuation removal</td><td>Pattern-matching, Language-structure</td></tr>
<tr><td>Semantic</td><td>Embedding cosine similarity (threshold ~0.88)</td><td>Commonsense, Language-comprehension</td></tr>
<tr><td>Hybrid</td><td>Combination of above</td><td>Language-transformation</td></tr>
</tbody>
</table>
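<p>The routing table above could be sketched roughly as follows. This is an illustration under several assumptions, not the framework's actual code: “normalized” is taken to mean whitespace trimming only, the semantic scorer is binarized at the threshold, the hybrid scorer takes the better of flexible and semantic matching, and the category keys are lowercase guesses at the dataset's labels. The <code>embed</code> argument stands in for whatever sentence-embedding model is used, which the post does not name.</p>

```python
import math
import string
from typing import Callable, Sequence

Vector = Sequence[float]
_PUNCT = str.maketrans("", "", string.punctuation)


def strict_exact(pred: str, gold: str) -> float:
    # Assumption: "normalized" means trimming and collapsing whitespace only.
    return float(" ".join(pred.split()) == " ".join(gold.split()))


def flexible_exact(pred: str, gold: str) -> float:
    # Lowercase and drop ASCII punctuation before comparing.
    def norm(s: str) -> str:
        return " ".join(s.lower().translate(_PUNCT).split())
    return float(norm(pred) == norm(gold))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def make_router(embed: Callable[[str], Vector], threshold: float = 0.88):
    """Build a route(category, pred, gold) function returning a score in [0, 1]."""

    def semantic(pred: str, gold: str) -> float:
        # Assumption: score is 1.0 at or above the threshold, else 0.0.
        return float(cosine(embed(pred), embed(gold)) >= threshold)

    def hybrid(pred: str, gold: str) -> float:
        # One possible "combination": accept either kind of match.
        return max(flexible_exact(pred, gold), semantic(pred, gold))

    table = {
        "math-arithmetic": strict_exact,
        "logic": strict_exact,
        "knowledge-basic": strict_exact,
        "pattern-matching": flexible_exact,
        "language-structure": flexible_exact,
        "commonsense": semantic,
        "language-comprehension": semantic,
        "language-transformation": hybrid,
    }

    def route(category: str, pred: str, gold: str) -> float:
        # Unknown categories fall back to flexible exact matching.
        return table.get(category.lower(), flexible_exact)(pred, gold)

    return route
```

<p>Passing the embedding function in as a parameter keeps the router testable with a toy embedder and makes it easy to swap models later.</p>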
<p>Scores are averaged within each category. We also report two overall percentages: a micro average, where every sample counts equally, and a macro average, where every category counts equally regardless of its size.</p>
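<p>To make the difference between the two overall numbers concrete, here is a small aggregation sketch. The <code>aggregate</code> function and the example counts are hypothetical and are not results from the benchmark.</p>

```python
from collections import defaultdict


def aggregate(results):
    """Aggregate (category, score) pairs, with scores in [0, 1].

    Returns per-category means plus micro and macro averages.
    """
    by_category = defaultdict(list)
    for category, score in results:
        by_category[category].append(score)
    if not by_category:
        raise ValueError("no results to aggregate")

    per_category = {c: sum(s) / len(s) for c, s in by_category.items()}
    total_samples = sum(len(s) for s in by_category.values())
    # Micro: pool all samples, so large categories dominate.
    micro = sum(sum(s) for s in by_category.values()) / total_samples
    # Macro: average the category means, so each category counts once.
    macro = sum(per_category.values()) / len(per_category)
    return {"per_category": per_category, "micro": micro, "macro": macro}


# Made-up example: 8 math samples (6 correct) and 2 logic samples (0 correct).
example = [("math", 1.0)] * 6 + [("math", 0.0)] * 2 + [("logic", 0.0)] * 2
summary = aggregate(example)
# micro = 6 / 10 = 0.6, macro = (0.75 + 0.0) / 2 = 0.375
```

<p>When category sizes are uneven, the gap between the two numbers shows how much a large, easy category is propping up the headline score.</p>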
<h2>Results on Qwen2.5-1.5B-Instruct</h2>
<p>On this compact instruction-tuned model we observed clear patterns:</p>
<h3>Strengths (75–85%)</h3>
<ul>
<li>Commonsense simulation, causality, and reasoning</li>
<li>Language transformation &amp; comprehension</li>
<li>Knowledge definitions</li>
</ul>
<h3>Weaknesses</h3>
<ul>
<li>Pattern-matching (raw score low due to explanatory outputs)</li>
<li>Symbolic logic (0–33%)</li>
<li>Math pattern continuation</li>
</ul>
<blockquote>
<p>The biggest insight: many “failures” were actually format issues rather than reasoning errors. Models understood the pattern but wrapped the answer in explanations.</p>
</blockquote>
<h2>Key Takeaways</h2>
<ol>
<li><strong>Answer extraction is critical</strong> — strip explanations, use last sentence/word heuristics, or category-specific parsers.</li>
<li><strong>One-size-fits-all metrics hide the truth</strong>: a single aggregate number masks the per-category strengths and weaknesses that category-level scoring reveals.</li>
<li><strong>Embeddings provide cheap, reliable semantic judgment</strong> without relying on another LLM judge.</li>
<li><strong>Partial credit systems improve signal</strong> — reward conceptual correctness even when formatting is imperfect.</li>
<li>Small models handle everyday reasoning surprisingly well but struggle with strict symbolic manipulation.</li>
</ol>
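<p>The first and fourth takeaways can be sketched together. The heuristics below are assumptions made for illustration, not the framework's actual parser: look for an “answer is” or “answer:” marker, otherwise fall back to the last non-empty line, and award a hypothetical half credit when the gold answer appears somewhere in the output but was not cleanly extracted.</p>

```python
import re

# Matches "answer is X" or "answer: X" and captures the rest of the line.
_MARKER = re.compile(r"\banswer\s*(?:is|:)\s*(.+)", re.IGNORECASE)


def extract_answer(output: str) -> str:
    """Pull a short answer out of a verbose model response."""
    text = output.strip()
    if not text:
        return ""
    matches = list(_MARKER.finditer(text))
    if matches:
        candidate = matches[-1].group(1)  # prefer the last stated answer
    else:
        candidate = [ln for ln in text.splitlines() if ln.strip()][-1]
    # Keep only the first sentence of the candidate.
    candidate = re.split(r"(?<=[.!?])\s", candidate.strip(), maxsplit=1)[0]
    return candidate.strip().strip(" .!?\"'*")


def _tokens(s: str) -> list[str]:
    return re.sub(r"[^\w\s]", "", s.lower()).split()


def score_with_partial_credit(output: str, gold: str, partial: float = 0.5) -> float:
    """1.0 for a clean extracted match, `partial` if gold appears anywhere, else 0.0.

    The 0.5 default is an arbitrary illustrative weight.
    """
    gold_tokens = _tokens(gold)
    if _tokens(extract_answer(output)) == gold_tokens:
        return 1.0
    out_tokens = _tokens(output)
    n = len(gold_tokens)
    found = n > 0 and any(
        out_tokens[i:i + n] == gold_tokens for i in range(len(out_tokens) - n + 1)
    )
    return partial if found else 0.0
```

<p>Separating extraction from scoring means the same parser can feed any of the category-specific scorers, and the partial-credit weight can be tuned or disabled per category.</p>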
<h2>Better Benchmarks Ahead</h2>
<p>Robust LLM evaluation is as much an engineering challenge as a modeling one. Start with a clean dataset, invest early in a flexible evaluator, and iterate based on real runs.</p>
<p>Future improvements could include adversarial paraphrasing, multi-answer gold sets, confidence-weighted scoring, and interactive dashboards.</p>
<p>Custom benchmarks like this one give far more actionable insights than generic leaderboards. What evaluation tricks have you discovered in your own work?</p>
<hr>
<p><em>Built with a hybrid Python framework combining direct inference and category-specific scoring logic.</em></p>
</main>
</body>
</html>