# Benchmark Evaluation Guide

This document is the canonical repository guidance for running HAKARI-Bench
evaluations. Do not rely on skill-local benchmark instructions as the source of
truth. Skill files may point here, but evaluation commands, variant policy, and
coverage checks should be maintained in this document.

## Core Workflow

1. Read `AGENTS.md`, this document, `README.md`, `pyproject.toml`, relevant
   dataset configs under `config/`, and current CLI help from the installed
   checkout.
2. Identify the requested models, datasets, result directory, cache policy, and
   whether existing result JSON should be reused or overwritten.
3. For each model, check model-specific requirements in
   [`docs/model_specific_benchmarking_notes.md`](model_specific_benchmarking_notes.md)
   before choosing prompts, attention implementation, dtype, or compatibility
   fallbacks.
4. Prefer the attention implementation officially recommended by the model
   author. If no explicit attention implementation is passed, the CLI warns,
   because the default implementation can make long benchmark inference much
   slower for some models.
5. Decide the full embedding-variant plan before starting any large run.
6. Run a small validation command when options are uncertain, then scale to the
   requested benchmark set.
7. Keep an ignored progress checklist under `tmp/` for long benchmark waves.
8. After benchmarking, rebuild DuckDB/HTML viewer artifacts when the user asks
   for comparisons, leaderboards, or viewer updates. If results are split
   across multiple result roots, pass repeated `--results-dir` options in
   priority order; earlier directories win duplicate model-task JSON conflicts.
9. Audit result coverage before treating a leaderboard as final.
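
The duplicate rule in step 8 ("earlier directories win") can be pictured as a first-match lookup across result roots. The sketch below illustrates that behavior only; it is not the tool's implementation, and the helper name `first_match` and the demo directory layout are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch of "earlier directories win": given a relative result path and
# result roots in priority order, print the first root that contains it.
# Hypothetical helper for illustration; not part of hakari-bench.
first_match() {
  local rel="$1"; shift
  local root
  for root in "$@"; do
    if [ -f "$root/$rel" ]; then
      printf '%s\n' "$root/$rel"
      return 0
    fi
  done
  return 1
}

# Demo with two throwaway result roots that both hold the same task JSON.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/new/model-a" "$demo_dir/old/model-a"
echo '{"source": "new"}' > "$demo_dir/new/model-a/task.json"
echo '{"source": "old"}' > "$demo_dir/old/model-a/task.json"

# "new" is listed first, so its copy is the one that would be used.
winner="$(first_match model-a/task.json "$demo_dir/new" "$demo_dir/old")"
echo "$winner"
```

Listing the freshest or most trusted result root first therefore decides which copy of a duplicated model-task JSON is used.
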
## Target Selection

Use `--all` when the requested run should cover every built-in dataset from
`config/datasets/`. Existing per-task result JSON files are skipped unless
`--overwrite` is set, so `--all` can be used to fill missing benchmark coverage:

```bash
uv run hakari-bench evaluate reranker \
  --model MODEL_NAME \
  --all \
  --candidate-ranking bm25
```

Use `--dataset` or `--collection` only for intentionally narrower runs. `--all`
is mutually exclusive with `--dataset`, `--collection`, and `--split`.

Common examples:

```bash
# Fill missing dense results for every built-in dataset. Existing task JSON is
# reused automatically; only missing tasks are evaluated.
uv run --group tf4-fa2 hakari-bench evaluate dense \
  --model BAAI/bge-m3 \
  --all \
  --dtype bf16 \
  --device cuda:0
```

```bash
# Fill missing reranker results for every built-in dataset using the dataset BM25
# candidate subset. This is the preferred default for CrossEncoder rerankers.
uv run --group tf4-fa2 hakari-bench evaluate reranker \
  --model BAAI/bge-reranker-v2-m3 \
  --all \
  --candidate-ranking bm25 \
  --rerank-top-k 100 \
  --batch-size 128 \
  --dtype bf16 \
  --device cuda:0
```

```bash
# Pin a physical GPU for a single process. Inside the process the visible GPU is
# still addressed as cuda:0.
CUDA_VISIBLE_DEVICES=1 uv run --group tf4-fa2 hakari-bench evaluate reranker \
  --model hotchpotch/japanese-reranker-xsmall-v2 \
  --all \
  --candidate-ranking bm25 \
  --rerank-top-k 100 \
  --batch-size 256 \
  --dtype bf16 \
  --device cuda:0 \
  --flash-attn2
```

```bash
# Equivalent structured target selection for scripts or job manifests.
uv run --group tf4-fa2 hakari-bench evaluate reranker \
  --params-json '{
    "model": {"source": "BAAI/bge-reranker-v2-m3"},
    "target": {"all": true},
    "runtime": {"dtype": "bf16", "device": "cuda:0", "batch_size": 128},
    "reranker": {"candidate_ranking": "bm25", "rerank_top_k": 100}
  }'
```

Use `--overwrite` only when intentionally correcting or replacing prior results.
Without `--overwrite`, `--all` is safe for resuming interrupted runs and filling
newly added benchmarks.
## Model Research Checklist

For every model:

- Check whether it is a Sentence Transformers model with prompt configuration.
  Prefer the built-in prompt configuration when present.
- If no usable Sentence Transformers prompt configuration exists, inspect the
  Hugging Face model card first, then relevant articles or papers for retrieval
  prefixes such as query/document/passage instructions.
- Use explicit prompt options only when the model requires them:
  `--query-prompt`, `--document-prompt`, `--query-prompt-name`,
  `--document-prompt-name`, `--query-encode-task`, or
  `--document-encode-task`.
- Check whether `--trust-remote-code` is required.
- Check the model's default maximum sequence length, but do not override it
  unless the user explicitly asks.
- Do not shorten context length to avoid slow execution or memory pressure.
  Reduce batch size first.
- If reproducibility requires a fixed dataset state, use
  `--dataset-revision REV`. Otherwise verify that output JSON records the
  resolved Hugging Face dataset SHA.
- If reproducibility requires a fixed model state, use `--model-revision REV`.
  Output JSON records the resolved Hugging Face model SHA as a short revision
  when it can be resolved.
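
As a sketch of how the checklist items combine, the block below assembles one dense command with explicit prompts and pinned revisions, then prints it for review. It assumes `--query-prompt` and `--document-prompt` take the literal prefix text. The prefix strings, `MODEL_REV`, and `DATASET_REV` are placeholders, not recommendations for any particular model.

```bash
#!/usr/bin/env bash
# Assemble a dense evaluation command with explicit prompts and pinned revisions.
# The prefix strings and revision values are placeholders; take the real values
# from the model card and the dataset repository.
MODEL="MODEL_NAME"
DATASET="DATASET_NAME"

cmd=(
  uv run hakari-bench evaluate dense
  --model "$MODEL"
  --dataset "$DATASET"
  --query-prompt "query: "
  --document-prompt "passage: "
  --model-revision "MODEL_REV"
  --dataset-revision "DATASET_REV"
)

# Print the shell-quoted command for review; run it with: "${cmd[@]}"
printf '%q ' "${cmd[@]}"; echo
```

Building the command as an array keeps prompt strings with trailing spaces intact when the command is eventually executed.
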
## Dense Evaluation

Use the dense subcommand for ordinary SentenceTransformers-compatible embedding
models:

```bash
uv run hakari-bench evaluate dense \
  --model MODEL_NAME \
  --dataset DATASET_NAME \
  --dtype bf16
```

Dense models automatically run normalized `int8` and binary quantized search
variants plus top-100 float-rescored variants whenever
`--no-default-embedding-variants` is not set. Explicit dense variants no longer
disable these defaults.

This is the most important coverage rule:

> For dense models, specify truncation dimensions with
> `--embedding-variant truncate:DIMS` when dimensional comparisons are needed.
> The CLI will automatically add standalone truncation, full-dim quantized and
> rescored variants, and truncation x quantized/rescore variants for those dims.

If a requested truncation dimension matches the encoded base embedding
dimension, evaluation emits a warning and skips that no-op truncate variant
because it would duplicate the original full-dimension result.

Use `--no-default-embedding-variants` only when the run intentionally needs base
results without automatic dense quantized/rescore variants.

`--retrieval-score-device auto` keeps supported post-encode score/top-k work on
the model output device. Use `--retrieval-score-device cpu` or
`--retrieval-score-device cuda` only when intentionally forcing that work.
## Dense Variant Plans

For plain dense baselines with no dimensional comparisons, omit explicit
embedding variants and let the dense defaults run:

```bash
uv run hakari-bench evaluate dense \
  --model MODEL_NAME \
  --dataset DATASET_NAME \
  --dtype bf16
```

For Matryoshka or other dimension comparisons, provide the truncation
dimensions. The CLI then evaluates:

- standalone truncation variants,
- standalone quantized search and rescore variants at the original dimension,
- truncation x quantized search and truncation x rescore grids.

Example:

```bash
uv run hakari-bench evaluate dense \
  --model MODEL_NAME \
  --dataset DATASET_NAME \
  --dtype bf16 \
  --embedding-variant truncate:256,128,64
```

This command produces the complete comparison set because standalone dimensions
isolate the dimension trade-off, standalone quantized/rescore variants isolate
the quantization trade-off at the original dimension, and the automatically
expanded grids measure combined trade-offs such as `128dim x int8` and
`64dim x binary`, with and without top-100 float rescore.
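
To make the size of that comparison set concrete, the sketch below enumerates the categories implied by `truncate:256,128,64`. It assumes one rescored variant per quantization type, and the printed labels are illustrative only; the variant names recorded in result JSON may differ.

```bash
#!/usr/bin/env bash
# Enumerate the variant categories implied by
# `--embedding-variant truncate:256,128,64` with dense defaults enabled.
# Labels are illustrative, not the names written to result JSON.
dims=(256 128 64)
quants=(int8 binary)

variants=()
for d in "${dims[@]}"; do
  variants+=("truncate ${d}")                  # standalone truncation
done
for q in "${quants[@]}"; do
  variants+=("full-dim x ${q}")                # standalone quantized search
  variants+=("full-dim x ${q} + rescore")      # top-100 float rescore
done
for d in "${dims[@]}"; do
  for q in "${quants[@]}"; do
    variants+=("${d}dim x ${q}")               # truncation x quantized search
    variants+=("${d}dim x ${q} + rescore")     # truncation x rescore
  done
done

printf '%s\n' "${variants[@]}"
echo "derived variants: ${#variants[@]}"
```

Under these assumptions, three dimensions and two quantization types give 3 + 4 + 12 = 19 derived variants in addition to the base result, which is the row count to expect per task in a coverage audit.
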

If a user asks only for truncation and explicitly does not want quantization,
disable dense defaults and state that quantized/rescore variants are
intentionally omitted:

```bash
uv run hakari-bench evaluate dense \
  --model MODEL_NAME \
  --dataset DATASET_NAME \
  --no-default-embedding-variants \
  --embedding-variant truncate:512,256
```

The benchmark implementation applies derived embedding variants after a single
base encoding pass. Cross variants add transform/scoring work, not additional
model encoding.

No-op truncation variants whose requested dimension equals the base embedding
dimension are skipped with a warning.
## Sparse Evaluation

Use `evaluate sparse` for SentenceTransformers `SparseEncoder` models:

```bash
uv run hakari-bench evaluate sparse \
  --model MODEL_NAME \
  --dataset DATASET_NAME
```

Do not add dense quantized embedding variants for sparse/SPLADE-style models.
Sparse quantization is intentionally unsupported in the CLI. Sparse runs
automatically include post-encode query/document max-active-dims grid variants
unless `--no-default-embedding-variants` is set:

- query max active dims: `8,16,24,32`
- document max active dims: `64,128,256,512`

These variants are derived after one full sparse model encode and do not run
additional model inference.
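
The sketch below illustrates the two ideas behind these defaults. It assumes that a max-active-dims limit keeps the K highest-weight dimensions of a sparse vector, and that the default grid pairs every query limit with every document limit. Both are assumptions for illustration, and the helper `keep_top_k` is hypothetical.

```bash
#!/usr/bin/env bash
# Part 1: effect of a max-active-dims limit on one sparse vector, assuming the
# limit keeps the K highest-weight dimensions. Input lines are "token weight".
keep_top_k() {
  LC_ALL=C sort -k2,2nr | head -n "$1"
}

# With K=2 the lowest-weight token ("the") is dropped.
printf 'tokyo 0.9\nthe 0.1\nstation 0.5\n' | keep_top_k 2

# Part 2: the default grid, assuming a full cross product of the two lists.
query_dims=(8 16 24 32)
doc_dims=(64 128 256 512)
grid=()
for q in "${query_dims[@]}"; do
  for d in "${doc_dims[@]}"; do
    grid+=("query=${q} document=${d}")
  done
done
echo "default grid variants: ${#grid[@]}"
```

Because the limits only prune already-encoded vectors, every grid cell can be scored from the same encoding pass.
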

Additional query-only sparsity limits:

```bash
uv run hakari-bench evaluate sparse \
  --model MODEL_NAME \
  --dataset DATASET_NAME \
  --embedding-variant sparse-query-max-active-dims:48
```

Additional query/document grids:

```bash
uv run hakari-bench evaluate sparse \
  --model MODEL_NAME \
  --dataset DATASET_NAME \
  --embedding-variant-grid sparse-query-max-active-dims:48 sparse-document-max-active-dims:768
```

The base no-limit result is always included as
`evaluation.embedding_evaluations[0]`. Use `--no-default-embedding-variants`
when intentionally running only the base no-limit result or only explicitly
specified sparse variants.
## Late-Interaction, Reranker, And BM25

Use `evaluate late-interaction` for PyLate ColBERT models. Check
model-specific query/document prefixes, sequence lengths, `--trust-remote-code`,
and `--late-interaction-attend-to-expansion-tokens` before running.

For `jinaai/jina-colbert-v2`, the documented PyLate initialization uses
`query_prefix="[QueryMarker]"`, `document_prefix="[DocumentMarker]"`,
`attend_to_expansion_tokens=True`, and `trust_remote_code=True`. Use the
matching CLI options unless the current model card says otherwise.

Use `evaluate reranker` for CrossEncoder-style rerankers. They score only the
candidate subset and require a candidate ranking such as BM25.

Use `evaluate bm25` for BM25:

```bash
uv run hakari-bench evaluate bm25 \
  --dataset DATASET_NAME
```

By default, BM25 reads the selected dataset candidate subset. Compute BM25
locally only when explicitly requested with `--bm25-source computed`, or through
`build-candidates bm25` when generating candidate subsets.
## Attention And Runtime Choices

- Prefer the attention implementation officially recommended by the model author
  or model card. Use `--attn-implementation sdpa`, `--flash-attn2`, or
  `--attn-implementation flash_attention_2` explicitly when that is the intended
  runtime. Unspecified attention falls back to the Transformers/model default and
  may be substantially slower during long benchmark runs.
- Do not assume Flash Attention 2 works with every model or every Transformers
  major version.
- Compare practical options before large runs:
  - Transformers 4.x + Flash Attention 2.
  - Transformers 5.x + SDPA.
- Use the `tf4-fa2` uv dependency group for the Transformers 4.x + Flash
  Attention 2 runtime:

  ```bash
  uv run --group tf4-fa2 hakari-bench evaluate dense \
    --model MODEL_ID \
    --dataset DATASET_NAME \
    --flash-attn2
  ```

- Treat Transformers 5.x + Flash Attention 2 as suspect unless already verified
  for that model in this environment.
- Prefer the fastest verified configuration that preserves correctness and
  model defaults.
- For models that fail with Flash Attention 2, retry with SDPA or no explicit
  attention implementation before skipping.
- If CUDA OOM or repeated runtime errors occur, retry with smaller batch sizes
  before changing dtype, attention implementation, or sequence length.
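
The batch-size-first fallback in the last item can be scripted. The following is a minimal sketch, assuming the wrapped command accepts `--batch-size` as its final option; `run_with_batch_fallback` is a hypothetical helper, and `fake_eval` is a stand-in so the sketch runs without a GPU.

```bash
#!/usr/bin/env bash
# Retry a command with progressively smaller batch sizes.
# Usage: run_with_batch_fallback "256 128 64" CMD [ARGS...]
# Appends `--batch-size N` for each attempt and stops at the first success.
run_with_batch_fallback() {
  local sizes="$1"; shift
  local size
  for size in $sizes; do
    echo "trying --batch-size $size" >&2
    if "$@" --batch-size "$size"; then
      return 0
    fi
  done
  echo "all batch sizes failed: $sizes" >&2
  return 1
}

# Stand-in for the real evaluation command: it "fails" unless the batch size
# (the last argument) is 64 or smaller.
fake_eval() {
  local size="${@: -1}"
  [ "$size" -le 64 ]
}

run_with_batch_fallback "256 128 64" fake_eval --model MODEL_NAME
```

In real use the wrapped command would be something like `uv run hakari-bench evaluate reranker --model MODEL_NAME --all --candidate-ranking bm25`. The sketch retries on any failure, so check the logs to confirm the failure was memory-related before accepting the smaller batch size.
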
## Long Benchmark Waves

Use both GPUs when available by assigning separate processes with
`CUDA_VISIBLE_DEVICES` or `--device`. Keep concurrent jobs writing to distinct
model output directories. For long benchmark waves, keep an ignored checklist
under `tmp/` and update it as tasks complete.
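
A two-GPU wave might be launched as sketched below, with one process per physical GPU and a different model in each so the jobs write to separate model outputs. The `launch` helper and the `DRY_RUN` switch are hypothetical conveniences; the flags and model IDs come from the examples earlier in this guide.

```bash
#!/usr/bin/env bash
# Two-GPU wave sketch: one process per physical GPU, each with its own model.
# DRY_RUN=1 (the default here) prints the commands instead of launching them.
DRY_RUN="${DRY_RUN:-1}"

launch() {
  local gpu="$1" model="$2"
  local cmd=(uv run --group tf4-fa2 hakari-bench evaluate reranker
             --model "$model" --all --candidate-ranking bm25
             --dtype bf16 --device cuda:0)
  if [ "$DRY_RUN" = "1" ]; then
    echo "CUDA_VISIBLE_DEVICES=$gpu ${cmd[*]}"
  else
    # Each process sees only its pinned GPU, addressed as cuda:0.
    CUDA_VISIBLE_DEVICES="$gpu" "${cmd[@]}" &
  fi
}

launch 0 BAAI/bge-reranker-v2-m3
launch 1 hotchpotch/japanese-reranker-xsmall-v2
wait  # no-op in dry-run mode; otherwise blocks until both jobs finish
```

Review the printed commands first, then set `DRY_RUN=0` to start the jobs in the background.
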

Respect existing result JSON. Cached results are skipped unless `--overwrite`
is provided. Use `--overwrite` only when intentionally replacing results, for
example after correcting the run configuration.

When all required Hugging Face datasets and models are already available in the
local cache, run benchmark commands with `HF_DATASETS_OFFLINE=1` and
`HF_HUB_OFFLINE=1`. This prevents the datasets and hub clients from calling the
Hugging Face API for metadata checks, which can make repeated local evaluation
runs faster and less sensitive to transient hub errors. Do not use these
variables for a first run or any run that needs to download missing artifacts.

Example:

```bash
HF_DATASETS_OFFLINE=1 HF_HUB_OFFLINE=1 \
uv run hakari-bench evaluate dense \
  --model MODEL_NAME \
  --dataset DATASET_NAME \
  --dtype bf16
```
## Result Hygiene

Per-task result JSON should preserve enough metadata to explain the run:

- dataset revision,
- model revision,
- prompts and prompt names,
- embedding variants and representation metadata,
- dtype and attention implementation,
- Transformers, Sentence Transformers, and Torch versions,
- batch size,
- timing,
- parameter counts,
- model maximum sequence length.

Top-100 ranking artifacts are optional because they are much larger than the
summary task JSON. Pass `--save-top-rankings` when a run needs per-query ranked
corpus ids for metric recomputation, rank-fusion analysis, or candidate audits.
When enabled, each evaluated task writes a referenced artifact under
`rankings/{split_or_task}.top100.json` containing base retrieval, available
embedding variants, BM25/reranker outputs, and candidate-rerank outputs. Rebuild
the DuckDB warehouse after evaluation to expose these rows in
`retrieval_rankings`.

When comparing models, check that prompt and embedding-variant choices are fair
and intentional.
## Coverage Audit Before Reporting

Before reporting a leaderboard or diagnosing model differences, audit coverage:

1. Confirm every base model has the expected task count for the selected view.
2. Confirm each intended embedding-variant category exists for each model:
   base, standalone truncation, standalone quantized search, rescore, truncation
   x quantized search, and truncation x rescore when those comparisons were
   intended.
3. Compare variant task counts against the model's base task count. Any variant
   with fewer rows needs investigation before it is used in a ranking. The one
   expected gap is a truncation dimension equal to the base embedding
   dimension, which is skipped as a no-op.
4. Inspect missing `(benchmark, task_key)` pairs for incomplete variants.
5. Confirm output JSON `config.embedding_variants` contains the intended
   variants. A dense truncation run should include standalone truncation,
   full-dim quantized/rescore, and truncation x quantized/rescore variants
   unless `--no-default-embedding-variants` was used.
6. Rebuild DuckDB/HTML viewer artifacts after adding or correcting benchmark
   results.

Useful DuckDB checks:

```sql
SELECT
  model_name,
  COALESCE(embedding_variant_name, 'base') AS variant,
  embedding_dim,
  quantization,
  COUNT(*) AS task_rows,
  COUNT(DISTINCT benchmark || '::' || task_key) AS distinct_tasks
FROM task_results
GROUP BY ALL
ORDER BY model_name, variant;
```

```sql
WITH base_tasks AS (
  SELECT DISTINCT model_name, benchmark, task_key
  FROM task_results
  WHERE embedding_variant_name IS NULL
),
variants AS (
  SELECT DISTINCT model_name, embedding_variant_name
  FROM task_results
  WHERE embedding_variant_name IS NOT NULL
),
variant_tasks AS (
  SELECT DISTINCT model_name, embedding_variant_name, benchmark, task_key
  FROM task_results
  WHERE embedding_variant_name IS NOT NULL
)
SELECT
  v.model_name,
  v.embedding_variant_name,
  COUNT(*) AS missing_tasks
FROM variants v
JOIN base_tasks b USING (model_name)
LEFT JOIN variant_tasks vt
  ON vt.model_name = v.model_name
  AND vt.embedding_variant_name = v.embedding_variant_name
  AND vt.benchmark = b.benchmark
  AND vt.task_key = b.task_key
WHERE vt.task_key IS NULL
GROUP BY ALL
ORDER BY model_name, embedding_variant_name;
```

Summarize failures plainly. If a model keeps failing after reasonable batch-size
and attention fallbacks, mark it skipped with the exact reason.