| <html lang="en"> | |
| <head> | |
| <meta charset="utf-8"> | |
| <meta name="viewport" content="width=device-width, initial-scale=1"> | |
| <title>Evaluating LLMs on Custom Benchmarks</title> | |
| <!-- KaTeX --> | |
| <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.11/dist/katex.min.css" crossorigin="anonymous"> | |
| <script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.11/dist/katex.min.js" crossorigin="anonymous"></script> | |
| <script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.11/dist/contrib/auto-render.min.js" crossorigin="anonymous" | |
| onload="renderMathInElement(document.body,{delimiters:[{left:'\[ ',right:' \]',display:true},{left:'\( ',right:' \)',display:false}],throwOnError:false});"></script> | |
| <!-- Your stylesheet --> | |
| <link rel="stylesheet" href="../style.css"> | |
| </head> | |
| <body><script src="../script.js" defer></script> | |
| <div class="background"></div> | |
| <main class="article"> | |
| <a class="back" href="../index.html">← Back</a> | |
| <div class="meta">June 2026 · AI Evaluation</div> | |
| <h1>Evaluating LLMs on Custom Benchmarks: Lessons from a Hybrid Evaluator</h1> | |
| <p>Large Language Models are impressive, but measuring their real capabilities requires more than simple string matching. Recently, we built a custom evaluation framework for a diverse “easy” benchmark covering pattern recognition, commonsense reasoning, logic, math, language, and knowledge. Here’s what we learned, and why per-category hybrid scoring matters.</p> | |
| <h2>The Benchmark</h2> | |
| <p>The dataset contains hundreds of straightforward questions across 15+ categories, which fall into six broad areas:</p> | |
| <ul> | |
| <li><strong>Pattern-matching</strong>: Analogies, odd-one-out, sequence completion.</li> | |
| <li><strong>Commonsense</strong>: Simulation, causality, everyday reasoning.</li> | |
| <li><strong>Logic</strong>: Consistency, deduction, formal patterns.</li> | |
| <li><strong>Math</strong>: Arithmetic, reasoning, number patterns.</li> | |
| <li><strong>Language</strong>: Structure, transformation, comprehension.</li> | |
| <li><strong>Knowledge</strong>: Definitions and basic facts.</li> | |
| </ul> | |
| <p>Each entry follows a simple JSON structure with <code>category</code>, <code>difficulty</code>, <code>question</code>, and <code>answer</code>. Many gold answers are short free-text phrases that can be worded correctly in several ways, which makes naive exact-match evaluation misleading.</p> | |
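| <p>As a rough illustration of that structure, the sketch below parses a JSON array of entries and checks that each one carries the four fields. The sample entry, the category name <code>pattern-matching-analogies</code>, the array layout, and the helper <code>load_entries</code> are all assumptions made for the example, not the actual dataset or loader.</p> | |

```python
import json

# Hypothetical sample: the field names come from the article,
# the values and the top-level array layout are made up for illustration.
SAMPLE = """
[
  {"category": "pattern-matching-analogies", "difficulty": "easy",
   "question": "cat is to kitten as dog is to ?", "answer": "puppy"}
]
"""

REQUIRED_FIELDS = ("category", "difficulty", "question", "answer")


def load_entries(raw: str) -> list[dict]:
    """Parse a JSON array of entries and verify each has the four fields."""
    entries = json.loads(raw)
    for i, entry in enumerate(entries):
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            raise ValueError(f"entry {i} is missing fields: {missing}")
    return entries


entries = load_entries(SAMPLE)
print(entries[0]["question"], "->", entries[0]["answer"])
```

| <p>Failing fast on malformed entries keeps schema problems from surfacing later as mysterious scoring errors.</p> | |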
| <h2>Why Standard Tools Fall Short</h2> | |
| <p>Exact string matching severely underestimates performance on reasoning tasks. A model might correctly solve “cat is to kitten as dog is to ?” but reply with a full sentence or an explanation instead of the bare word “puppy”, and exact match then scores a correct answer as wrong.</p> | |
| <p>Key challenges:</p> | |
| <ul> | |
| <li>Semantic equivalence vs. literal text</li> | |
| <li>Verbose model outputs that bury the actual answer</li> | |
| <li>Category-specific correctness criteria</li> | |
| <li>Dataset duplicates (especially in logic)</li> | |
| </ul> | |
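| <p>The duplicate problem in particular is cheap to check for. The sketch below groups entries whose questions are identical after lowercasing and punctuation removal; the helper names <code>normalize_question</code> and <code>find_duplicates</code> and the sample entries are hypothetical, and near-duplicates with different wording would need a fuzzier comparison.</p> | |

```python
import re
from collections import defaultdict


def normalize_question(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def find_duplicates(entries: list[dict]) -> list[list[int]]:
    """Return groups of entry indices that share a category and a normalized question."""
    groups = defaultdict(list)
    for index, entry in enumerate(entries):
        key = (entry["category"], normalize_question(entry["question"]))
        groups[key].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


# Hypothetical entries, only the fields needed for the check.
sample = [
    {"category": "logic", "question": "If A then B. A. Therefore?"},
    {"category": "logic", "question": "All cats are mammals?"},
    {"category": "logic", "question": "if A then B, A; therefore"},
]
print(find_duplicates(sample))
```
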
| <h2>The Hybrid Evaluation System</h2> | |
| <p>We implemented a <strong>category-aware evaluator router</strong> with mixed 0–1 scoring:</p> | |
| <table> | |
| <thead> | |
| <tr><th>Category Type</th><th>Scoring Method</th><th>Examples</th></tr> | |
| </thead> | |
| <tbody> | |
| <tr><td>Strict Exact</td><td>Normalized string equality</td><td>Math-arithmetic, Logic, Knowledge-basic</td></tr> | |
| <tr><td>Flexible Exact</td><td>Lowercase + punctuation removal</td><td>Pattern-matching, Language-structure</td></tr> | |
| <tr><td>Semantic</td><td>Embedding cosine similarity (threshold ~0.88)</td><td>Commonsense, Language-comprehension</td></tr> | |
| <tr><td>Hybrid</td><td>Combination of above</td><td>Language-transformation</td></tr> | |
| </tbody> | |
| </table> | |
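| <p>One way such a router could look is sketched below. Several details are assumptions rather than the framework's actual code: “normalized” is taken to mean whitespace trimming only, the hybrid scorer is taken to be “flexible exact or semantic, whichever is higher”, the category keys in <code>ROUTES</code> are illustrative, and <code>toy_embed</code> is a bag-of-words stand-in for a real sentence-embedding model so that the example runs without extra dependencies.</p> | |

```python
import math
import re
from collections import Counter
from typing import Callable

Vector = dict[str, float]


def strict_exact(pred: str, gold: str) -> float:
    """Equality after trimming and collapsing whitespace (assumed normalization)."""
    return float(" ".join(pred.split()) == " ".join(gold.split()))


def flexible_exact(pred: str, gold: str) -> float:
    """Equality after lowercasing and removing punctuation."""
    def norm(s: str) -> str:
        return " ".join(re.sub(r"[^\w\s]", "", s.lower()).split())
    return float(norm(pred) == norm(gold))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_embed(text: str) -> Vector:
    """Stand-in embedding: word counts. Swap in a real embedding model in practice."""
    return dict(Counter(re.findall(r"\w+", text.lower())))


def semantic(pred: str, gold: str,
             embed: Callable[[str], Vector] = toy_embed,
             threshold: float = 0.88) -> float:
    """1.0 when the cosine similarity reaches the threshold, else 0.0."""
    return float(cosine(embed(pred), embed(gold)) >= threshold)


def hybrid(pred: str, gold: str) -> float:
    """Assumed combination: accept either a flexible exact or a semantic match."""
    return max(flexible_exact(pred, gold), semantic(pred, gold))


# Illustrative category keys; unknown categories fall back to strict matching.
ROUTES: dict[str, Callable[[str, str], float]] = {
    "math-arithmetic": strict_exact,
    "pattern-matching": flexible_exact,
    "commonsense": semantic,
    "language-transformation": hybrid,
}


def score(category: str, pred: str, gold: str) -> float:
    """Dispatch to the scorer registered for the category."""
    return ROUTES.get(category.lower(), strict_exact)(pred, gold)
```

| <p>Keeping the routing in a plain dictionary means a category can be moved to a different scorer by changing one line, which is handy while the thresholds are still being tuned.</p> | |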
| <p>Scores are aggregated as per-category averages, plus two overall percentages: micro (every sample weighted equally, so large categories dominate) and macro (every category weighted equally, regardless of size).</p> | |
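| <p>The difference between the two overall numbers is easiest to see in code. This is a minimal sketch, assuming results arrive as <code>(category, score)</code> pairs with scores in the 0–1 range; the function name <code>aggregate</code> and the sample results are made up for the example.</p> | |

```python
from collections import defaultdict


def aggregate(results):
    """Return (per-category averages, micro average, macro average).

    results: iterable of (category, score) pairs, score in [0, 1].
    """
    by_category = defaultdict(list)
    for category, score in results:
        by_category[category].append(score)
    if not by_category:
        raise ValueError("no results to aggregate")

    per_category = {c: sum(s) / len(s) for c, s in by_category.items()}
    total_samples = sum(len(s) for s in by_category.values())
    # Micro: pool all samples, so categories with more questions weigh more.
    micro = sum(sum(s) for s in by_category.values()) / total_samples
    # Macro: average the category averages, so every category counts once.
    macro = sum(per_category.values()) / len(per_category)
    return per_category, micro, macro


# Hypothetical results: three logic questions (one correct), one knowledge question.
results = [("logic", 0.0), ("logic", 0.0), ("logic", 1.0), ("knowledge", 1.0)]
per_category, micro, macro = aggregate(results)
print(f"micro={micro:.2f} macro={macro:.2f}")
```

| <p>With these made-up results the micro average is 0.50 while the macro average is about 0.67, because the single knowledge question counts as much as all three logic questions combined.</p> | |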
| <h2>Results on Qwen2.5-1.5B-Instruct</h2> | |
| <p>On this compact instruction-tuned model we observed clear patterns:</p> | |
| <h3>Strengths (75–85%)</h3> | |
| <ul> | |
| <li>Commonsense simulation, causality, and reasoning</li> | |
| <li>Language transformation &amp; comprehension</li> | |
| <li>Knowledge definitions</li> | |
| </ul> | |
| <h3>Weaknesses</h3> | |
| <ul> | |
| <li>Pattern-matching (raw score low due to explanatory outputs)</li> | |
| <li>Symbolic logic (0–33%)</li> | |
| <li>Math pattern continuation</li> | |
| </ul> | |
| <blockquote> | |
| <p>The biggest insight: many “failures” were actually format issues rather than reasoning errors. The model understood the pattern but wrapped the answer in an explanation.</p> | |
| </blockquote> | |
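| <p>A small extraction step before scoring can recover many of these wrapped answers. The heuristic below is only a sketch under assumed conventions: it looks for an “answer is …” or “answer: …” marker and otherwise falls back to the last sentence. The function name <code>extract_answer</code> and the marker pattern are hypothetical, and an output such as “the answer is puppy because …” would still keep the trailing clause.</p> | |

```python
import re

# Matches "answer is X", "answer is: X", and "answer: X" (case-insensitive).
ANSWER_MARKER = re.compile(r"\banswer\b\s*(?:is\s*:?|:)\s*(.+)", re.IGNORECASE)

# Characters trimmed from both ends of the extracted candidate.
TRIM_CHARS = " \"'.!?"


def extract_answer(output: str) -> str:
    """Pull a short answer out of a verbose model output."""
    text = output.strip()
    marked = ANSWER_MARKER.findall(text)
    if marked:
        # Prefer the last explicit marker, since models often restate the answer.
        candidate = marked[-1]
    else:
        # Fallback: take the last sentence of the output.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
        candidate = sentences[-1] if sentences else ""
    return candidate.strip().strip(TRIM_CHARS)


print(extract_answer("A kitten is a young cat, so the answer is puppy."))
```

| <p>The extracted string can then be passed to whichever scorer the category uses, so the scoring rules themselves stay strict.</p> | |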
| <h2>Key Takeaways</h2> | |
| <ol> | |
| <li><strong>Answer extraction is critical</strong>: strip explanations, apply last-sentence or last-word heuristics, or use category-specific parsers.</li> | |
| <li><strong>One-size-fits-all metrics hide the truth</strong>: per-category scoring reveals real strengths and weaknesses.</li> | |
| <li><strong>Embeddings provide cheap semantic judgment</strong>: a similarity threshold handles paraphrased answers without relying on an LLM judge.</li> | |
| <li><strong>Partial credit systems improve signal</strong>: reward conceptual correctness even when formatting is imperfect.</li> | |
| <li>On this benchmark, a small model handled everyday reasoning surprisingly well but struggled with strict symbolic manipulation.</li> | |
| </ol> | |
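| <p>To make the partial-credit idea concrete, here is one possible scheme, sketched under assumptions: full credit for a match after normalization, a reduced score when the gold answer appears as a whole word or phrase inside a longer output, and zero otherwise. The 0.5 value and the name <code>partial_credit</code> are illustrative choices, not the values used in the runs above.</p> | |

```python
import re


def _normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def partial_credit(pred: str, gold: str, contained_score: float = 0.5) -> float:
    """Score in [0, 1]: exact match, gold contained in a longer output, or miss."""
    p, g = _normalize(pred), _normalize(gold)
    if not g:
        return 0.0
    if p == g:
        return 1.0
    # Whole-word containment, so "puppyhood" does not count as "puppy".
    if re.search(rf"\b{re.escape(g)}\b", p):
        return contained_score
    return 0.0


print(partial_credit("The pattern is young animals, so puppy.", "puppy"))
```

| <p>Containment is a blunt signal, since an output that lists several candidates would also earn credit, so a scheme like this probably works best alongside answer extraction rather than in place of it.</p> | |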
| <h2>Better Benchmarks Ahead</h2> | |
| <p>Robust LLM evaluation is as much an engineering challenge as a modeling one. Start with a clean dataset, invest early in a flexible evaluator, and iterate based on real runs.</p> | |
| <p>Future improvements could include adversarial paraphrasing, multi-answer gold sets, confidence-weighted scoring, and interactive dashboards.</p> | |
| <p>Custom benchmarks like this one give far more actionable insights than generic leaderboards. What evaluation tricks have you discovered in your own work?</p> | |
| <hr> | |
| <p><em>Built with a hybrid Python framework combining direct inference and category-specific scoring logic.</em></p> | |
| </main> | |
| </body> | |
| </html> |