	# Benchmark Evaluation Guide

	This document is the canonical repository guidance for running HAKARI-Bench
	evaluations. Do not rely on skill-local benchmark instructions as the source of
	truth. Skill files may point here, but evaluation commands, variant policy, and
	coverage checks should be maintained in this document.

	## Core Workflow

	1. Read `AGENTS.md`, this document, `README.md`, `pyproject.toml`, relevant
	dataset configs under `config/`, and current CLI help from the installed
	checkout.
	2. Identify the requested models, datasets, result directory, cache policy, and
	whether existing result JSON should be reused or overwritten.
	3. For each model, check model-specific requirements in
	[`docs/model_specific_benchmarking_notes.md`](model_specific_benchmarking_notes.md)
	before choosing prompts, attention implementation, dtype, or compatibility
	fallbacks.
	4. Prefer the attention implementation officially recommended by the model
	author. If no explicit attention implementation is passed, the CLI warns,
	because the default can make long benchmark inference much slower for some
	models.
	5. Decide the full embedding-variant plan before starting any large run.
	6. Run a small validation command when options are uncertain, then scale to the
	requested benchmark set.
	7. Keep an ignored progress checklist under `tmp/` for long benchmark waves.
	8. After benchmarking, rebuild DuckDB/HTML viewer artifacts when the user asks
	for comparisons, leaderboards, or viewer updates. If results are split
	across multiple result roots, pass repeated `--results-dir` options in
	priority order; earlier directories win duplicate model-task JSON conflicts.
	9. Audit result coverage before treating a leaderboard as final.
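
The priority rule in step 8 can be pictured with a minimal Python sketch. It is not the viewer's actual implementation: `merge_by_priority` and the `(model, task)` key are hypothetical names, used only to illustrate that earlier result roots win and later roots merely fill gaps.

```python
from typing import Dict, List, Mapping, Tuple

# Hypothetical identity of one result file: (model name, task key).
ResultKey = Tuple[str, str]


def merge_by_priority(roots: List[Mapping[ResultKey, dict]]) -> Dict[ResultKey, dict]:
    """Merge result mappings; roots listed earlier win duplicate keys."""
    merged: Dict[ResultKey, dict] = {}
    for root in roots:  # iterate in priority order
        for key, payload in root.items():
            # setdefault keeps the first value seen, so later roots only fill gaps.
            merged.setdefault(key, payload)
    return merged


primary = {("model-a", "task-1"): {"source": "primary"}}
secondary = {
    ("model-a", "task-1"): {"source": "secondary"},  # duplicate, ignored
    ("model-a", "task-2"): {"source": "secondary"},  # fills a gap
}
merged = merge_by_priority([primary, secondary])
```

Here `merged` keeps the `primary` payload for `task-1` and takes `task-2` from `secondary`, mirroring the order in which repeated `--results-dir` options would be passed.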

	## Target Selection

	Use `--all` when the requested run should cover every built-in dataset from
	`config/datasets/`. Existing per-task result JSON files are skipped unless
	`--overwrite` is set, so `--all` can be used to fill missing benchmark coverage:

	```bash
	uv run hakari-bench evaluate reranker \
	--model MODEL_NAME \
	--all \
	--candidate-ranking bm25
	```

	Use `--dataset` or `--collection` only for intentionally narrower runs. `--all`
	is mutually exclusive with `--dataset`, `--collection`, and `--split`.

	Common examples:

	```bash
	# Fill missing dense results for every built-in dataset. Existing task JSON is
	# reused automatically; only missing tasks are evaluated.
	uv run --group tf4-fa2 hakari-bench evaluate dense \
	--model BAAI/bge-m3 \
	--all \
	--dtype bf16 \
	--device cuda:0
	```

	```bash
	# Fill missing reranker results for every built-in dataset using the dataset BM25
	# candidate subset. This is the preferred default for CrossEncoder rerankers.
	uv run --group tf4-fa2 hakari-bench evaluate reranker \
	--model BAAI/bge-reranker-v2-m3 \
	--all \
	--candidate-ranking bm25 \
	--rerank-top-k 100 \
	--batch-size 128 \
	--dtype bf16 \
	--device cuda:0
	```

	```bash
	# Pin a physical GPU for a single process. Inside the process the visible GPU is
	# still addressed as cuda:0.
	CUDA_VISIBLE_DEVICES=1 uv run --group tf4-fa2 hakari-bench evaluate reranker \
	--model hotchpotch/japanese-reranker-xsmall-v2 \
	--all \
	--candidate-ranking bm25 \
	--rerank-top-k 100 \
	--batch-size 256 \
	--dtype bf16 \
	--device cuda:0 \
	--flash-attn2
	```

	```bash
	# Equivalent structured target selection for scripts or job manifests.
	uv run --group tf4-fa2 hakari-bench evaluate reranker \
	--params-json '{
	"model": {"source": "BAAI/bge-reranker-v2-m3"},
	"target": {"all": true},
	"runtime": {"dtype": "bf16", "device": "cuda:0", "batch_size": 128},
	"reranker": {"candidate_ranking": "bm25", "rerank_top_k": 100}
	}'
	```

	Use `--overwrite` only when intentionally correcting or replacing prior results.
	Without `--overwrite`, `--all` is safe for resuming interrupted runs and filling
	newly added benchmarks.
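
The skip-unless-overwrite behavior described above amounts to a simple set difference. The following Python sketch only illustrates that rule; `select_tasks` and the task names are hypothetical and do not reflect how the CLI discovers existing result JSON.

```python
from typing import Iterable, List, Set


def select_tasks(all_tasks: Iterable[str], completed: Set[str], overwrite: bool = False) -> List[str]:
    """Return the tasks a run would evaluate.

    Without overwrite, tasks that already have result JSON are skipped,
    which is why --all is safe for resuming and gap-filling.
    """
    if overwrite:
        return list(all_tasks)
    return [task for task in all_tasks if task not in completed]


# Hypothetical task names for illustration only.
tasks = ["task-a", "task-b", "task-c"]
done = {"task-a"}
resume_plan = select_tasks(tasks, done)
overwrite_plan = select_tasks(tasks, done, overwrite=True)
```

With one task already on disk, the resume plan covers only the two missing tasks, while the overwrite plan re-evaluates all three.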

	## Model Research Checklist

	For every model:

	- Check whether it is a Sentence Transformers model with prompt configuration.
	Prefer the built-in prompt configuration when present.
	- If no usable Sentence Transformers prompt configuration exists, inspect the
	Hugging Face model card first, then relevant articles or papers for retrieval
	prefixes such as query/document/passage instructions.
	- Use explicit prompt options only when the model requires them:
	`--query-prompt`, `--document-prompt`, `--query-prompt-name`,
	`--document-prompt-name`, `--query-encode-task`, or
	`--document-encode-task`.
	- Check whether `--trust-remote-code` is required.
	- Check the model's default maximum sequence length, but do not override it
	unless the user explicitly asks.
	- Do not shorten context length to avoid slow execution or memory pressure.
	Reduce batch size first.
	- If reproducibility requires a fixed dataset state, use
	`--dataset-revision REV`. Otherwise verify that output JSON records the
	resolved Hugging Face dataset SHA.
	- If reproducibility requires a fixed model state, use `--model-revision REV`.
	Output JSON records the resolved Hugging Face model SHA as a short revision
	when it can be resolved.

	## Dense Evaluation

	Use the dense subcommand for ordinary SentenceTransformers-compatible embedding
	models:

	```bash
	uv run hakari-bench evaluate dense \
	--model MODEL_NAME \
	--dataset DATASET_NAME \
	--dtype bf16
	```

	Dense models automatically run normalized `int8` and binary quantized search
	variants plus top-100 float-rescored variants whenever
	`--no-default-embedding-variants` is not set. Explicit dense variants no longer
	disable these defaults.

	This is the most important coverage rule:

	> For dense models, specify truncation dimensions with
	> `--embedding-variant truncate:DIMS` when dimensional comparisons are needed.
	> The CLI will automatically add standalone truncation, full-dim quantized and
	> rescored variants, and truncation x quantized/rescore variants for those dims.

	If a requested truncation dimension matches the encoded base embedding
	dimension, evaluation emits a warning and skips that no-op truncate variant
	because it would duplicate the original full-dimension result.

	Use `--no-default-embedding-variants` only when the run intentionally needs base
	results without automatic dense quantized/rescore variants.
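
For readers unfamiliar with quantized search followed by float rescoring, the general two-stage pattern can be sketched in plain Python. This is an assumption-laden illustration, not HAKARI-Bench's implementation: it uses sign-based binary codes with Hamming distance for the coarse stage, and all function names are hypothetical.

```python
from typing import List, Sequence


def binarize(vec: Sequence[float]) -> List[int]:
    """Sign-based binary code: 1 for positive components, else 0 (an assumed scheme)."""
    return [1 if x > 0 else 0 for x in vec]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions where two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search_with_rescore(query: Sequence[float], docs: List[Sequence[float]], top_k: int) -> List[int]:
    """Rank docs by binary codes, then rescore the top_k candidates with floats."""
    q_bits = binarize(query)
    # Stage 1: cheap coarse ranking using only the quantized representation.
    coarse = sorted(range(len(docs)), key=lambda i: hamming(q_bits, binarize(docs[i])))[:top_k]
    # Stage 2: reorder the surviving candidates with full-precision scores.
    return sorted(coarse, key=lambda i: dot(query, docs[i]), reverse=True)


query = [0.9, 0.1, -0.2]
docs = [[0.1, 0.9, -0.1], [0.8, 0.2, -0.3], [-0.5, 0.5, 0.5]]
ranking = search_with_rescore(query, docs, top_k=2)
```

In this toy case documents 0 and 1 tie on Hamming distance, and the float rescore promotes document 1, which shows why rescored variants can recover quality lost in the quantized search.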

	`--retrieval-score-device auto` keeps supported post-encode score/top-k work on
	the model output device. Use `--retrieval-score-device cpu` or
	`--retrieval-score-device cuda` only when intentionally forcing that work.

	## Dense Variant Plans

	For plain dense baselines with no dimensional comparisons, omit explicit
	embedding variants and let the dense defaults run:

	```bash
	uv run hakari-bench evaluate dense \
	--model MODEL_NAME \
	--dataset DATASET_NAME \
	--dtype bf16
	```

	For Matryoshka or other dimension comparisons, provide only the truncation
	dimensions. The CLI expands them into:

	- standalone truncation variants,
	- standalone quantized search and rescore variants at the original dimension,
	- truncation x quantized search and truncation x rescore grids.

	Example:

	```bash
	uv run hakari-bench evaluate dense \
	--model MODEL_NAME \
	--dataset DATASET_NAME \
	--dtype bf16 \
	--embedding-variant truncate:256,128,64
	```

	This command produces the complete comparison set because standalone dimensions
	isolate the dimension trade-off, standalone quantized/rescore variants isolate
	the quantization trade-off at the original dimension, and the automatically
	expanded grids measure combined trade-offs such as `128dim x int8` and
	`64dim x binary`, with and without top-100 float rescore.
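
The size of the expanded comparison set can be estimated with a short Python sketch. The variant labels below are hypothetical and the real names in `config.embedding_variants` may differ; the sketch only enumerates the categories this section describes, including the no-op skip.

```python
from itertools import product
from typing import List, Sequence

QUANTIZATIONS = ("int8", "binary")


def plan_dense_variants(base_dim: int, truncate_dims: Sequence[int]) -> List[str]:
    """Enumerate derived variants (excluding the base result) with hypothetical labels."""
    # A truncation equal to the base dimension is a no-op and is skipped.
    dims = [d for d in truncate_dims if d != base_dim]
    plan = [f"truncate:{d}" for d in dims]  # standalone truncation
    for quant in QUANTIZATIONS:  # full-dimension quantized search and rescore
        plan.append(f"full:{quant}")
        plan.append(f"full:{quant}+rescore")
    for dim, quant in product(dims, QUANTIZATIONS):  # truncation x quantized/rescore grids
        plan.append(f"truncate:{dim} x {quant}")
        plan.append(f"truncate:{dim} x {quant}+rescore")
    return plan


plan = plan_dense_variants(base_dim=1024, truncate_dims=[256, 128, 64])
```

Under these assumptions, three truncation dimensions yield 19 derived variants on top of the base result, which is a useful sanity number when auditing coverage later.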

	If a user asks only for truncation and explicitly does not want quantization,
	disable dense defaults and state that quantized/rescore variants are
	intentionally omitted:

	```bash
	uv run hakari-bench evaluate dense \
	--model MODEL_NAME \
	--dataset DATASET_NAME \
	--no-default-embedding-variants \
	--embedding-variant truncate:512,256
	```

	The benchmark implementation applies derived embedding variants after a single
	base encoding pass. Cross variants add transform/scoring work, not additional
	model encoding.

	No-op truncation variants whose requested dimension equals the base embedding
	dimension are skipped with a warning.

	## Sparse Evaluation

	Use `evaluate sparse` for SentenceTransformers `SparseEncoder` models:

	```bash
	uv run hakari-bench evaluate sparse \
	--model MODEL_NAME \
	--dataset DATASET_NAME
	```

	Do not add dense quantized embedding variants for sparse/SPLADE-style models.
	Sparse quantization is intentionally unsupported in the CLI. Sparse runs
	automatically include post-encode query/document max-active-dims grid variants
	unless `--no-default-embedding-variants` is set:

	- query max active dims: `8,16,24,32`
	- document max active dims: `64,128,256,512`

	These variants are derived after one full sparse model encode and do not run
	additional model inference.
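
A max-active-dims limit is commonly implemented by keeping only the highest-weight dimensions of an already encoded sparse vector. The Python sketch below illustrates that idea with hypothetical names; it assumes non-negative SPLADE-style weights and assumes the default grid is the cross product of the query and document limits listed above.

```python
from itertools import product
from typing import Dict, List, Tuple

QUERY_LIMITS = (8, 16, 24, 32)
DOCUMENT_LIMITS = (64, 128, 256, 512)


def limit_active_dims(weights: Dict[int, float], max_active: int) -> Dict[int, float]:
    """Keep the max_active largest-weight dimensions of a sparse vector."""
    top = sorted(weights.items(), key=lambda item: item[1], reverse=True)[:max_active]
    return dict(top)


def default_grid() -> List[Tuple[int, int]]:
    """All (query limit, document limit) pairs, assuming a full cross product."""
    return list(product(QUERY_LIMITS, DOCUMENT_LIMITS))


# Token id -> weight for one encoded query (made-up values).
encoded_query = {3: 0.2, 7: 1.5, 9: 0.9}
pruned = limit_active_dims(encoded_query, max_active=2)
grid = default_grid()
```

Because pruning operates on stored weights, every grid cell can be derived from the same encode, which matches the statement that no additional model inference is needed.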

	Additional query-only sparsity limits:

	```bash
	uv run hakari-bench evaluate sparse \
	--model MODEL_NAME \
	--dataset DATASET_NAME \
	--embedding-variant sparse-query-max-active-dims:48
	```

	Additional query/document grids:

	```bash
	uv run hakari-bench evaluate sparse \
	--model MODEL_NAME \
	--dataset DATASET_NAME \
	--embedding-variant-grid sparse-query-max-active-dims:48 sparse-document-max-active-dims:768
	```

	The base no-limit result is always included as
	`evaluation.embedding_evaluations[0]`. Use `--no-default-embedding-variants`
	when intentionally running only the base no-limit result or only explicitly
	specified sparse variants.

	## Late-Interaction, Reranker, And BM25

	Use `evaluate late-interaction` for PyLate ColBERT models. Check
	model-specific query/document prefixes, sequence lengths, `--trust-remote-code`,
	and `--late-interaction-attend-to-expansion-tokens` before running.

	For `jinaai/jina-colbert-v2`, the documented PyLate initialization uses
	`query_prefix="[QueryMarker]"`, `document_prefix="[DocumentMarker]"`,
	`attend_to_expansion_tokens=True`, and `trust_remote_code=True`. Use the
	matching CLI options unless the current model card says otherwise.

	Use `evaluate reranker` for CrossEncoder-style rerankers. They score only the
	candidate subset and require a candidate ranking such as BM25.

	Use `evaluate bm25` for BM25:

	```bash
	uv run hakari-bench evaluate bm25 \
	--dataset DATASET_NAME
	```

	By default, BM25 reads the selected dataset candidate subset. Compute BM25
	locally only when explicitly requested with `--bm25-source computed`, or with
	`build-candidates bm25` when generating candidate subsets.

	## Attention And Runtime Choices

	- Prefer the attention implementation officially recommended by the model author
	or model card. Use `--attn-implementation sdpa`, `--flash-attn2`, or
	`--attn-implementation flash_attention_2` explicitly when that is the intended
	runtime. Unspecified attention falls back to the Transformers/model default and
	may be substantially slower during long benchmark runs.
	- Do not assume Flash Attention 2 works with every model or every Transformers
	major version.
	- Compare practical options before large runs:
	  - Transformers 4.x + Flash Attention 2.
	  - Transformers 5.x + SDPA.
	- Use the `tf4-fa2` uv dependency group for the Transformers 4.x + Flash
	Attention 2 runtime:

	```bash
	uv run --group tf4-fa2 hakari-bench evaluate dense \
	--model MODEL_ID \
	--dataset DATASET_NAME \
	--flash-attn2
	```

	- Treat Transformers 5.x + Flash Attention 2 as suspect unless already verified
	for that model in this environment.
	- Prefer the fastest verified configuration that preserves correctness and
	model defaults.
	- For models that fail with Flash Attention 2, retry with SDPA or no explicit
	attention implementation before skipping.
	- If CUDA OOM or repeated runtime errors occur, retry with smaller batch sizes
	before changing dtype, attention implementation, or sequence length.
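
The fallback order in the last two bullets can be written down as an ordered list of attempts. This Python sketch is a planning aid under stated assumptions: halving the batch size is an arbitrary choice, and `fallback_attempts` is a hypothetical helper, not part of the CLI. Dtype and sequence length are deliberately never varied.

```python
from typing import Iterator, List, Optional, Tuple

# None stands for "no explicit attention implementation".
ATTENTION_FALLBACKS: List[Optional[str]] = ["flash_attention_2", "sdpa", None]


def fallback_attempts(start_batch_size: int, min_batch_size: int = 1) -> Iterator[Tuple[Optional[str], int]]:
    """Yield (attention, batch_size) attempts: shrink the batch first, then change attention."""
    floor = max(1, min_batch_size)  # guard against a non-terminating loop at zero
    for attention in ATTENTION_FALLBACKS:
        batch_size = start_batch_size
        while batch_size >= floor:
            yield attention, batch_size
            batch_size //= 2  # assumed reduction step


attempts = list(fallback_attempts(start_batch_size=128, min_batch_size=32))
```

A driver loop would stop at the first attempt that succeeds and, if all attempts fail, record the model as skipped with the exact error.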

	## Long Benchmark Waves

	Use both GPUs when available by assigning separate processes with
	`CUDA_VISIBLE_DEVICES` or `--device`. Keep concurrent jobs writing to distinct
	model output directories. For long benchmark waves, keep an ignored checklist
	under `tmp/` and update it as tasks complete.
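
One way to prepare per-GPU processes is to build an environment and argument list for each model ahead of time. The Python sketch below is only an illustration: `build_jobs` is a hypothetical helper, the round-robin assignment is an assumption, and the command uses only flags shown elsewhere in this guide.

```python
import os
from typing import Dict, List, Tuple

Job = Tuple[Dict[str, str], List[str]]


def build_jobs(models: List[str], gpu_ids: List[int]) -> List[Job]:
    """Pair each model with a pinned GPU and a dense evaluation command."""
    jobs: List[Job] = []
    for index, model in enumerate(models):
        gpu = gpu_ids[index % len(gpu_ids)]  # round-robin assignment (assumed policy)
        # Pin the physical GPU; inside the process it is still addressed as cuda:0.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        argv = [
            "uv", "run", "hakari-bench", "evaluate", "dense",
            "--model", model,
            "--all",
            "--dtype", "bf16",
            "--device", "cuda:0",
        ]
        jobs.append((env, argv))
    return jobs


jobs = build_jobs(["BAAI/bge-m3", "MODEL_NAME"], gpu_ids=[0, 1])
```

Each `(env, argv)` pair can then be handed to `subprocess.Popen(argv, env=env)`. Jobs that land on the same GPU should run one after another rather than concurrently.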

	Respect existing result JSON. Cached results are skipped unless `--overwrite`
	is provided. Use `--overwrite` only when intentionally correcting prior results
	or rerunning after a deliberate change to the run configuration.

	When all required Hugging Face datasets and models are already available in the
	local cache, run benchmark commands with `HF_DATASETS_OFFLINE=1` and
	`HF_HUB_OFFLINE=1`. This prevents the datasets and hub clients from calling the
	Hugging Face API for metadata checks, which can make repeated local evaluation
	runs faster and less sensitive to transient hub errors. Do not use these
	variables for a first run or any run that needs to download missing artifacts.

	Example:

	```bash
	HF_DATASETS_OFFLINE=1 HF_HUB_OFFLINE=1 \
	uv run hakari-bench evaluate dense \
	--model MODEL_NAME \
	--dataset DATASET_NAME \
	--dtype bf16
	```

	## Result Hygiene

	Per-task result JSON should preserve enough metadata to explain the run:

	- dataset revision,
	- model revision,
	- prompts and prompt names,
	- embedding variants and representation metadata,
	- dtype and attention implementation,
	- Transformers, Sentence Transformers, and Torch versions,
	- batch size,
	- timing,
	- parameter counts,
	- model maximum sequence length.

	Top-100 ranking artifacts are optional because they are much larger than the
	summary task JSON. Pass `--save-top-rankings` when a run needs per-query ranked
	corpus ids for metric recomputation, rank-fusion analysis, or candidate audits.
	When enabled, each evaluated task writes a referenced artifact under
	`rankings/{split_or_task}.top100.json` containing base retrieval, available
	embedding variants, BM25/reranker outputs, and candidate-rerank outputs. Rebuild
	the DuckDB warehouse after evaluation to expose these rows in
	`retrieval_rankings`.

	When comparing models, check that prompt and embedding-variant choices are fair
	and intentional.

	## Coverage Audit Before Reporting

	Before reporting a leaderboard or diagnosing model differences, audit coverage:

	1. Confirm every base model has the expected task count for the selected view.
	2. Confirm each intended embedding-variant category exists for each model:
	base, standalone truncation, standalone quantized search, rescore, truncation
	x quantized search, and truncation x rescore when those comparisons were
	intended.
	3. Compare variant task counts against the model's base task count. Any variant
	with fewer rows needs investigation before it is used in a ranking. The one
	expected gap is a truncate dimension equal to the base embedding dimension,
	which is skipped as a no-op.
	4. Inspect missing `(benchmark, task_key)` pairs for incomplete variants.
	5. Confirm output JSON `config.embedding_variants` contains the intended
	variants. A dense truncation run should include standalone truncation,
	full-dim quantized/rescore, and truncation x quantized/rescore variants
	unless `--no-default-embedding-variants` was used.
	6. Rebuild DuckDB/HTML viewer artifacts after adding or correcting benchmark
	results.

	Useful DuckDB checks:

	```sql
	SELECT
	model_name,
	COALESCE(embedding_variant_name, 'base') AS variant,
	embedding_dim,
	quantization,
	COUNT(*) AS task_rows,
	COUNT(DISTINCT benchmark || '::' || task_key) AS distinct_tasks
	FROM task_results
	GROUP BY ALL
	ORDER BY model_name, variant;
	```

	```sql
	WITH base_tasks AS (
	SELECT DISTINCT model_name, benchmark, task_key
	FROM task_results
	WHERE embedding_variant_name IS NULL
	),
	variants AS (
	SELECT DISTINCT model_name, embedding_variant_name
	FROM task_results
	WHERE embedding_variant_name IS NOT NULL
	),
	variant_tasks AS (
	SELECT DISTINCT model_name, embedding_variant_name, benchmark, task_key
	FROM task_results
	WHERE embedding_variant_name IS NOT NULL
	)
	SELECT
	v.model_name,
	v.embedding_variant_name,
	COUNT(*) AS missing_tasks
	FROM variants v
	JOIN base_tasks b USING (model_name)
	LEFT JOIN variant_tasks vt
	ON vt.model_name = v.model_name
	AND vt.embedding_variant_name = v.embedding_variant_name
	AND vt.benchmark = b.benchmark
	AND vt.task_key = b.task_key
	WHERE vt.task_key IS NULL
	GROUP BY ALL
	ORDER BY model_name, embedding_variant_name;
	```
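
When the rows have already been exported from `task_results`, the same missing-task check can be approximated in plain Python. This sketch mirrors the logic of the second query above under the assumption that each row is a `(model_name, embedding_variant_name, benchmark, task_key)` tuple; `missing_variant_tasks` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

# (model_name, embedding_variant_name, benchmark, task_key); variant is None for base rows.
Row = Tuple[str, Optional[str], str, str]
Task = Tuple[str, str]


def missing_variant_tasks(rows: Iterable[Row]) -> Dict[Tuple[str, str], Set[Task]]:
    """Map (model, variant) to the base tasks that the variant lacks."""
    base: Dict[str, Set[Task]] = defaultdict(set)
    variants: Dict[Tuple[str, str], Set[Task]] = defaultdict(set)
    for model, variant, benchmark, task_key in rows:
        task = (benchmark, task_key)
        if variant is None:
            base[model].add(task)
        else:
            variants[(model, variant)].add(task)
    missing: Dict[Tuple[str, str], Set[Task]] = {}
    for (model, variant), tasks in variants.items():
        gap = base[model] - tasks  # base tasks with no matching variant row
        if gap:
            missing[(model, variant)] = gap
    return missing


# Made-up rows for illustration only.
rows = [
    ("model-a", None, "bench-1", "task-1"),
    ("model-a", None, "bench-1", "task-2"),
    ("model-a", "variant-x", "bench-1", "task-1"),
    ("model-a", "variant-y", "bench-1", "task-1"),
    ("model-a", "variant-y", "bench-1", "task-2"),
]
missing = missing_variant_tasks(rows)
```

Only incomplete variants appear in the result, so an empty mapping means every variant covers the model's full base task set.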

	Summarize failures plainly. If a model keeps failing after reasonable batch-size
	and attention fallbacks, mark it skipped with the exact reason.