# Benchmark Evaluation Guide This document is the canonical repository guidance for running HAKARI-Bench evaluations. Do not rely on skill-local benchmark instructions as the source of truth. Skill files may point here, but evaluation commands, variant policy, and coverage checks should be maintained in this document. ## Core Workflow 1. Read `AGENTS.md`, this document, `README.md`, `pyproject.toml`, relevant dataset configs under `config/`, and current CLI help from the installed checkout. 2. Identify the requested models, datasets, result directory, cache policy, and whether existing result JSON should be reused or overwritten. 3. For each model, check model-specific requirements in [`docs/model_specific_benchmarking_notes.md`](model_specific_benchmarking_notes.md) before choosing prompts, attention implementation, dtype, or compatibility fallbacks. 4. Prefer the attention implementation officially recommended by the model author. If no explicit attention implementation is passed, the CLI will warn because long benchmark inference can be much slower for some models. 5. Decide the full embedding-variant plan before starting any large run. 6. Run a small validation command when options are uncertain, then scale to the requested benchmark set. 7. Keep an ignored progress checklist under `tmp/` for long benchmark waves. 8. After benchmarking, rebuild DuckDB/HTML viewer artifacts when the user asks for comparisons, leaderboards, or viewer updates. If results are split across multiple result roots, pass repeated `--results-dir` options in priority order; earlier directories win duplicate model-task JSON conflicts. 9. Audit result coverage before treating a leaderboard as final. ## Target Selection Use `--all` when the requested run should cover every built-in dataset from `config/datasets/`. Existing per-task result JSON files are skipped unless `--overwrite` is set, so `--all` can be used to fill missing benchmark coverage: ```bash uv run hakari-bench evaluate reranker \ --model MODEL_NAME \ --all \ --candidate-ranking bm25 ``` Use `--dataset` or `--collection` only for intentionally narrower runs. `--all` is mutually exclusive with `--dataset`, `--collection`, and `--split`. Common examples: ```bash # Fill missing dense results for every built-in dataset. Existing task JSON is # reused automatically; only missing tasks are evaluated. uv run --group tf4-fa2 hakari-bench evaluate dense \ --model BAAI/bge-m3 \ --all \ --dtype bf16 \ --device cuda:0 ``` ```bash # Fill missing reranker results for every built-in dataset using the dataset BM25 # candidate subset. This is the preferred default for CrossEncoder rerankers. uv run --group tf4-fa2 hakari-bench evaluate reranker \ --model BAAI/bge-reranker-v2-m3 \ --all \ --candidate-ranking bm25 \ --rerank-top-k 100 \ --batch-size 128 \ --dtype bf16 \ --device cuda:0 ``` ```bash # Pin a physical GPU for a single process. Inside the process the visible GPU is # still addressed as cuda:0. CUDA_VISIBLE_DEVICES=1 uv run --group tf4-fa2 hakari-bench evaluate reranker \ --model hotchpotch/japanese-reranker-xsmall-v2 \ --all \ --candidate-ranking bm25 \ --rerank-top-k 100 \ --batch-size 256 \ --dtype bf16 \ --device cuda:0 \ --flash-attn2 ``` ```bash # Equivalent structured target selection for scripts or job manifests. uv run --group tf4-fa2 hakari-bench evaluate reranker \ --params-json '{ "model": {"source": "BAAI/bge-reranker-v2-m3"}, "target": {"all": true}, "runtime": {"dtype": "bf16", "device": "cuda:0", "batch_size": 128}, "reranker": {"candidate_ranking": "bm25", "rerank_top_k": 100} }' ``` Use `--overwrite` only when intentionally correcting or replacing prior results. Without `--overwrite`, `--all` is safe for resuming interrupted runs and filling newly added benchmarks. ## Model Research Checklist For every model: - Check whether it is a Sentence Transformers model with prompt configuration. Prefer the built-in prompt configuration when present. - If no usable Sentence Transformers prompt configuration exists, inspect the Hugging Face model card first, then relevant articles or papers for retrieval prefixes such as query/document/passage instructions. - Use explicit prompt options only when the model requires them: `--query-prompt`, `--document-prompt`, `--query-prompt-name`, `--document-prompt-name`, `--query-encode-task`, or `--document-encode-task`. - Check whether `--trust-remote-code` is required. - Check the model's default maximum sequence length, but do not override it unless the user explicitly asks. - Do not shorten context length to avoid slow execution or memory pressure. Reduce batch size first. - If reproducibility requires a fixed dataset state, use `--dataset-revision REV`. Otherwise verify that output JSON records the resolved Hugging Face dataset SHA. - If reproducibility requires a fixed model state, use `--model-revision REV`. Output JSON records the resolved Hugging Face model SHA as a short revision when it can be resolved. ## Dense Evaluation Use the dense subcommand for ordinary SentenceTransformers-compatible embedding models: ```bash uv run hakari-bench evaluate dense \ --model MODEL_NAME \ --dataset DATASET_NAME \ --dtype bf16 ``` Dense models automatically run normalized `int8` and binary quantized search variants plus top-100 float-rescored variants whenever `--no-default-embedding-variants` is not set. Explicit dense variants no longer disable these defaults. This is the most important coverage rule: > For dense models, specify truncation dimensions with > `--embedding-variant truncate:DIMS` when dimensional comparisons are needed. > The CLI will automatically add standalone truncation, full-dim quantized and > rescored variants, and truncation x quantized/rescore variants for those dims. If a requested truncation dimension matches the encoded base embedding dimension, evaluation emits a warning and skips that no-op truncate variant because it would duplicate the original full-dimension result. Use `--no-default-embedding-variants` only when the run intentionally needs base results without automatic dense quantized/rescore variants. `--retrieval-score-device auto` keeps supported post-encode score/top-k work on the model output device. Use `--retrieval-score-device cpu` or `--retrieval-score-device cuda` only when intentionally forcing that work. ## Dense Variant Plans For plain dense baselines with no dimensional comparisons, omit explicit embedding variants and let the dense defaults run: ```bash uv run hakari-bench evaluate dense \ --model MODEL_NAME \ --dataset DATASET_NAME \ --dtype bf16 ``` For Matryoshka or other dimension comparisons, provide the truncation dimensions: - standalone truncation variants, - standalone quantized search and rescore variants at the original dimension, - truncation x quantized search and truncation x rescore grids. Example: ```bash uv run hakari-bench evaluate dense \ --model MODEL_NAME \ --dataset DATASET_NAME \ --dtype bf16 \ --embedding-variant truncate:256,128,64 ``` This command produces the complete comparison set because standalone dimensions isolate the dimension trade-off, standalone quantized/rescore variants isolate the quantization trade-off at the original dimension, and the automatically expanded grids measure combined trade-offs such as `128dim x int8` and `64dim x binary`, with and without top-100 float rescore. If a user asks only for truncation and explicitly does not want quantization, disable dense defaults and state that quantized/rescore variants are intentionally omitted: ```bash uv run hakari-bench evaluate dense \ --model MODEL_NAME \ --dataset DATASET_NAME \ --no-default-embedding-variants \ --embedding-variant truncate:512,256 ``` The benchmark implementation applies derived embedding variants after a single base encoding pass. Cross variants add transform/scoring work, not additional model encoding. No-op truncation variants whose requested dimension equals the base embedding dimension are skipped with a warning. ## Sparse Evaluation Use `evaluate sparse` for SentenceTransformers `SparseEncoder` models: ```bash uv run hakari-bench evaluate sparse \ --model MODEL_NAME \ --dataset DATASET_NAME ``` Do not add dense quantized embedding variants for sparse/SPLADE-style models. Sparse quantization is intentionally unsupported in the CLI. Sparse runs automatically include post-encode query/document max-active-dims grid variants unless `--no-default-embedding-variants` is set: - query max active dims: `8,16,24,32` - document max active dims: `64,128,256,512` These variants are derived after one full sparse model encode and do not run additional model inference. Additional query-only sparsity limits: ```bash uv run hakari-bench evaluate sparse \ --model MODEL_NAME \ --dataset DATASET_NAME \ --embedding-variant sparse-query-max-active-dims:48 ``` Additional query/document grids: ```bash uv run hakari-bench evaluate sparse \ --model MODEL_NAME \ --dataset DATASET_NAME \ --embedding-variant-grid sparse-query-max-active-dims:48 sparse-document-max-active-dims:768 ``` The base no-limit result is always included as `evaluation.embedding_evaluations[0]`. Use `--no-default-embedding-variants` when intentionally running only the base no-limit result or only explicitly specified sparse variants. ## Late-Interaction, Reranker, And BM25 Use `evaluate late-interaction` for PyLate ColBERT models. Check model-specific query/document prefixes, sequence lengths, `--trust-remote-code`, and `--late-interaction-attend-to-expansion-tokens` before running. For `jinaai/jina-colbert-v2`, the documented PyLate initialization uses `query_prefix="[QueryMarker]"`, `document_prefix="[DocumentMarker]"`, `attend_to_expansion_tokens=True`, and `trust_remote_code=True`. Use the matching CLI options unless the current model card says otherwise. Use `evaluate reranker` for CrossEncoder-style rerankers. They score only the candidate subset and require a candidate ranking such as BM25. Use `evaluate bm25` for BM25: ```bash uv run hakari-bench evaluate bm25 \ --dataset DATASET_NAME ``` By default, BM25 reads the selected dataset candidate subset. Use local BM25 computation only when explicitly requested with `--bm25-source computed`, or from `build-candidates bm25` when generating candidate subsets. ## Attention And Runtime Choices - Prefer the attention implementation officially recommended by the model author or model card. Use `--attn-implementation sdpa`, `--flash-attn2`, or `--attn-implementation flash_attention_2` explicitly when that is the intended runtime. Unspecified attention falls back to the Transformers/model default and may be substantially slower during long benchmark runs. - Do not assume Flash Attention 2 works with every model or every Transformers major version. - Compare practical options before large runs: - Transformers 4.x + Flash Attention 2. - Transformers 5.x + SDPA. - Use the `tf4-fa2` uv dependency group for the Transformers 4.x + Flash Attention 2 runtime: ```bash uv run --group tf4-fa2 hakari-bench evaluate dense \ --model MODEL_ID \ --dataset DATASET_NAME \ --flash-attn2 ``` - Treat Transformers 5.x + Flash Attention 2 as suspect unless already verified for that model in this environment. - Prefer the fastest verified configuration that preserves correctness and model defaults. - For models that fail with Flash Attention 2, retry with SDPA or no explicit attention implementation before skipping. - If CUDA OOM or repeated runtime errors occur, retry with smaller batch sizes before changing dtype, attention implementation, or sequence length. ## Long Benchmark Waves Use both GPUs when available by assigning separate processes with `CUDA_VISIBLE_DEVICES` or `--device`. Keep concurrent jobs writing to distinct model output directories. For long benchmark waves, keep an ignored checklist under `tmp/` and update it as tasks complete. Respect existing result JSON. Cached results are skipped unless `--overwrite` is provided. Use `--overwrite` only when correcting an intentionally changed run configuration. When all required Hugging Face datasets and models are already available in the local cache, run benchmark commands with `HF_DATASETS_OFFLINE=1` and `HF_HUB_OFFLINE=1`. This prevents the datasets and hub clients from calling the Hugging Face API for metadata checks, which can make repeated local evaluation runs faster and less sensitive to transient hub errors. Do not use these variables for a first run or any run that needs to download missing artifacts. Example: ```bash HF_DATASETS_OFFLINE=1 HF_HUB_OFFLINE=1 \ uv run hakari-bench evaluate dense \ --model MODEL_NAME \ --dataset DATASET_NAME \ --dtype bf16 ``` ## Result Hygiene Per-task result JSON should preserve enough metadata to explain the run: - dataset revision, - model revision, - prompts and prompt names, - embedding variants and representation metadata, - dtype and attention implementation, - Transformers, Sentence Transformers, and Torch versions, - batch size, - timing, - parameter counts, - model maximum sequence length. Top-100 ranking artifacts are optional because they are much larger than the summary task JSON. Pass `--save-top-rankings` when a run needs per-query ranked corpus ids for metric recomputation, rank-fusion analysis, or candidate audits. When enabled, each evaluated task writes a referenced artifact under `rankings/{split_or_task}.top100.json` containing base retrieval, available embedding variants, BM25/reranker outputs, and candidate-rerank outputs. Rebuild the DuckDB warehouse after evaluation to expose these rows in `retrieval_rankings`. When comparing models, check that prompt and embedding-variant choices are fair and intentional. ## Coverage Audit Before Reporting Before reporting a leaderboard or diagnosing model differences, audit coverage: 1. Confirm every base model has the expected task count for the selected view. 2. Confirm each intended embedding-variant category exists for each model: base, standalone truncation, standalone quantized search, rescore, truncation x quantized search, and truncation x rescore when those comparisons were intended. 3. Compare variant task counts against the model's base task count. Any variant with fewer rows needs investigation before it is used in a ranking. If the missing variant is a truncate dimension equal to the base embedding dimension, it should have been skipped as a no-op. 4. Inspect missing `(benchmark, task_key)` pairs for incomplete variants. 5. Confirm output JSON `config.embedding_variants` contains the intended variants. A dense truncation run should include standalone truncation, full-dim quantized/rescore, and truncation x quantized/rescore variants unless `--no-default-embedding-variants` was used. 6. Rebuild DuckDB/HTML viewer artifacts after adding or correcting benchmark results. Useful DuckDB checks: ```sql SELECT model_name, COALESCE(embedding_variant_name, 'base') AS variant, embedding_dim, quantization, COUNT(*) AS task_rows, COUNT(DISTINCT benchmark || '::' || task_key) AS distinct_tasks FROM task_results GROUP BY ALL ORDER BY model_name, variant; ``` ```sql WITH base_tasks AS ( SELECT DISTINCT model_name, benchmark, task_key FROM task_results WHERE embedding_variant_name IS NULL ), variants AS ( SELECT DISTINCT model_name, embedding_variant_name FROM task_results WHERE embedding_variant_name IS NOT NULL ), variant_tasks AS ( SELECT DISTINCT model_name, embedding_variant_name, benchmark, task_key FROM task_results WHERE embedding_variant_name IS NOT NULL ) SELECT v.model_name, v.embedding_variant_name, COUNT(*) AS missing_tasks FROM variants v JOIN base_tasks b USING (model_name) LEFT JOIN variant_tasks vt ON vt.model_name = v.model_name AND vt.embedding_variant_name = v.embedding_variant_name AND vt.benchmark = b.benchmark AND vt.task_key = b.task_key WHERE vt.task_key IS NULL GROUP BY ALL ORDER BY model_name, embedding_variant_name; ``` Summarize failures plainly. If a model keeps failing after reasonable batch-size and attention fallbacks, mark it skipped with the exact reason.