	# HAKARI Core Set Selection Rationale

	Research date: 2026-05-21 JST

	Selection update: 2026-05-22 JST

	## Abstract

	The HAKARI Core set is the small, curated leaderboard view intended to answer a
	single question: how should a dense retrieval model be compared when the full
	HAKARI benchmark inventory is too large to interpret at once? The selected Core
	set is:

	1. `MNanoBEIR`
	2. `NanoMMTEB-v2`
	3. `NanoRTEB`
	4. `NanoMLDR`
	5. `NanoBRIGHT`
	6. `NanoCoIR`

	This document records why these six Nano sets were selected and why several
	plausible candidates were left out. The decision was made by combining external
	adoption signals, source benchmark quality, task and language diversity,
	overlap analysis, lexical baseline difficulty, and actual dense-model score
	dispersion from the evaluated DuckDB warehouse.

	The goal was not to maximize task count or to include every important domain.
	It was to keep a compact set whose aggregate score is interpretable, broad, and
	difficult to game by over-weighting one benchmark family. Domain-specific
	benchmarks that are useful but redundant, narrow, saturated, or better read in
	their own view remain available outside Core.

	The Core score also uses configured aggregation units rather than blindly
	averaging every raw task row. In particular, `MNanoBEIR` is aggregated by
	`task_name`: an ArguAna-style task is first averaged across its language
	variants and then contributes as one Core scoring unit. This preserves the
	multilingual BEIR anchor without allowing the raw language × task matrix to
	dominate the Core aggregate.

	## Final Core Set

	\| Position \| Nano set \| Role in Core \| Main reason for inclusion \|
	\| ---: \| --- \| --- \| --- \|
	\| 1 \| `MNanoBEIR` \| Classical multilingual IR anchor \| BEIR-style retrieval remains a common reference point; Core aggregates it by source task name so multilingual coverage does not dominate by raw row count. \|
	\| 2 \| `NanoMMTEB-v2` \| Broad multilingual MTEB/MMTEB anchor \| Represents modern MTEB-style retrieval coverage across many task types and languages. \|
	\| 3 \| `NanoRTEB` \| Practical retrieval domains \| Adds English RTEB-style applied retrieval tasks with strong model separation. \|
	\| 4 \| `NanoMLDR` \| Multilingual long-document retrieval \| Strong external adoption through BGE-M3/MLDR and excellent dense score dispersion across all languages. \|
	\| 5 \| `NanoBRIGHT` \| Reasoning-heavy retrieval stress test \| Hard tasks with high model separation and strong dataset usage signals. \|
	\| 6 \| `NanoCoIR` \| Code retrieval \| Preserves a code-search dimension that is not captured by general, long-document, or reasoning-heavy retrieval tasks. \|

	The most important late-stage changes were the removals of `NanoMIRACL` and
	`NanoLaw`. `NanoMIRACL` is well known, but it was too saturated in the analyzed
	dense results. `NanoLaw` is a useful legal benchmark, but too many of its tasks
	are already represented through selected Core sources, leaving only four
	effective non-duplicate tasks.

	## Pruning Decisions

	The Core set was deliberately pruned. These decisions are as important as the
	selected set because they prevent the Core score from becoming a second copy of
	the `All` view.

	\| Nano set \| Decision \| Reason \|
	\| --- \| --- \| --- \|
	\| `NanoMIRACL` \| Removed from Core after review \| MIRACL remains a canonical multilingual benchmark, but the analyzed dense results showed substantial saturation and low model separation. Its role is better served by the `All` and benchmark-specific views than by the compact Core score. \|
	\| `NanoLaw` \| Removed from Core after review \| Legal retrieval remains important, but many NanoLaw tasks duplicate similar source tasks already covered by `NanoRTEB` or `NanoMMTEB-v2`. After overlap removal, its effective contribution is only four tasks, so it is cleaner as a domain-specific view. \|
	\| `NanoLongEmbed` \| Removed from the earlier Core proposal \| Dense dispersion was good, but the set contains synthetic long-context probes such as passkey/needle-style tasks and has weaker external adoption than `NanoMLDR`. `NanoMLDR` gives a cleaner multilingual long-document retrieval signal. \|
	\| `NanoBIRCO` \| Not promoted \| `NanoBIRCO` is valuable as a complex-objective stress test, but it is small, English-only, and overlaps in role with the broader `NanoBRIGHT` reasoning-heavy stress slot. \|
	\| `NanoDAPFAM` \| Not promoted \| Patent retrieval is distinctive, but dense model dispersion was very low and many tasks were floor-like. Better suited to a domain appendix. \|
	\| `NanoMedical` \| Not promoted \| Useful medical benchmark, but after overlap removal it is less discriminative than `NanoBRIGHT`, `NanoMLDR`, or `NanoRTEB`. \|
	\| `NanoR2MED` \| Not promoted \| Hard medical reasoning stress test with good dispersion, but newer and less established. Better as an optional stress suite. \|
	\| `NanoMuPLeR` \| Not promoted \| Good dense dispersion, but high average scores and a narrow multilingual task shape. Keep as a language/domain appendix rather than Core. \|
	\| `NanoJMTEB-v2` and other language-family NanoMTEB groups \| Not promoted \| Important for language-specific diagnostics, but including them in Core would over-weight MTEB-family language views. \|
	\| `NanoCMTEB` \| Deferred \| Present in configuration, but the analyzed DuckDB did not contain comparable dense base rows for this set. \|

	## Selection Criteria

	The Core set was chosen using five criteria.

	1. External benchmark credibility

	Core tasks should come from benchmarks or datasets that are used outside this
	repository. Signals included paper citations, Hugging Face dataset likes and
	downloads, and whether the source tasks are registered in the official MTEB
	task catalog. External credibility is necessary but not sufficient: a task can
	still be excluded if it is saturated or redundant with a stronger Core source.

	2. Task diversity

	The final set covers classical multilingual IR, broad MTEB/MMTEB retrieval,
	RTEB-style applied retrieval, multilingual long-document retrieval,
	reasoning-heavy retrieval, and code retrieval.

	3. Language diversity

	The set contains broad multilingual groups (`MNanoBEIR`, `NanoMMTEB-v2`,
	`NanoMLDR`) while avoiding a Core made mostly of language-specific
	MTEB-family views.

	4. Low redundancy

	Candidate groups were checked for source-task overlap and for rank
	correlation across evaluated dense models. Some overlap is intentional for
	anchor tasks, but the final set avoids adding multiple groups that primarily
	express the same signal. This criterion is the main reason `NanoLaw` was
	moved out of Core despite its legal-domain value.

	5. Empirical model separation

	A benchmark should usually distinguish current dense models. We therefore
	measured per-task score dispersion across ten evaluated dense embedding
	models, using only base dense rows and excluding embedding variants.
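The rank-correlation check in criterion 4 can be illustrated with a small sketch. The source does not say which coefficient was used, so Spearman correlation is an assumption here, and the group names and scores are hypothetical. Two groups whose scores rank the models in nearly the same order mostly express the same signal.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-model scores; position i is the same model in every list.
group_a = [0.61, 0.55, 0.70, 0.48, 0.66]
group_b = [0.58, 0.50, 0.73, 0.45, 0.60]  # ranks the models like group_a
group_c = [0.30, 0.42, 0.28, 0.39, 0.35]  # ranks the models differently

print(spearman(group_a, group_b))
print(spearman(group_a, group_c))
```

A correlation near 1 between two candidate groups suggests the second adds little ranking information. The function assumes at least two distinct scores per group, since a constant vector has zero rank variance.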

	## Evidence from Evaluated Dense Results

	Dense score evidence came from the evaluated DuckDB warehouse available on
	2026-05-21. The analyzed rows used `embedding_variant_name IS NULL`, so int8,
	binary, truncate, and rescore variants were excluded. Ten dense embedding
	models had complete base results for 514 tasks across 32 benchmark groups.
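The row filter described above can be sketched as a query. This is an illustration only: an in-memory SQLite database stands in for the DuckDB warehouse, the table name `results` and the columns `model_name`, `task_name`, and `score` are hypothetical, and "complete" is assumed to mean that a model has a base row for every task. Only `embedding_variant_name IS NULL` comes from the analysis itself.

```python
import sqlite3

# SQLite stands in for DuckDB; the schema below is hypothetical except for
# the embedding_variant_name column named in the text.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE results (model_name TEXT, task_name TEXT,"
    " embedding_variant_name TEXT, score REAL)"
)
rows = [
    ("model-a", "task-1", None, 0.61),
    ("model-a", "task-2", None, 0.47),
    ("model-a", "task-1", "int8", 0.59),  # variant row, excluded by the filter
    ("model-b", "task-1", None, 0.55),
    ("model-b", "task-2", None, 0.52),
    ("model-c", "task-1", None, 0.70),    # model-c lacks a base row for task-2
]
con.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", rows)

# Keep base rows only, then keep models that cover every base task.
query = """
    SELECT model_name
    FROM results
    WHERE embedding_variant_name IS NULL
    GROUP BY model_name
    HAVING COUNT(DISTINCT task_name) = (
        SELECT COUNT(DISTINCT task_name)
        FROM results
        WHERE embedding_variant_name IS NULL
    )
    ORDER BY model_name
"""
complete_models = [row[0] for row in con.execute(query)]
print(complete_models)
```

Restricting the analysis to models with complete coverage keeps every task-level statistic computed over the same set of models.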

	For each raw task row, we computed score dispersion across the ten models. The
	table below reports benchmark-level means over those task-level statistics.
	`MNanoBEIR` is shown with its raw task-row count for transparency, even though
	Core scoring groups it by `task_name`.

	Definitions:

	- `avg_mean`: average task mean score across the ten dense models.
	- `avg_std`: average within-task standard deviation across the ten models.
	- `p90-p10`: average within-task 90th percentile minus 10th percentile.
	- `ceiling`: tasks with mean >= 0.90 and std <= 0.05.
	- `floor`: tasks with mean <= 0.25 and std <= 0.05.
	- `low-var`: tasks with std <= 0.03.
	- `healthy`: tasks with 0.25 < mean < 0.85 and std >= 0.05.
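The definitions above translate directly into a per-task summary. The sketch below is a minimal illustration under two assumptions the source does not settle: the standard deviation is the population form, and percentiles use linear interpolation. The example scores are hypothetical.

```python
from statistics import mean, pstdev


def percentile(values, q):
    """Linearly interpolated percentile for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def task_stats(scores):
    """Summarize one task's scores across models with the listed thresholds."""
    m = mean(scores)
    s = pstdev(scores)  # assumption: population std; the source does not say
    return {
        "mean": m,
        "std": s,
        "p90-p10": percentile(scores, 90) - percentile(scores, 10),
        "ceiling": m >= 0.90 and s <= 0.05,
        "floor": m <= 0.25 and s <= 0.05,
        "low-var": s <= 0.03,
        "healthy": 0.25 < m < 0.85 and s >= 0.05,
    }


# Hypothetical scores for two tasks, each across ten models.
spread = [0.35, 0.42, 0.48, 0.51, 0.55, 0.58, 0.62, 0.66, 0.71, 0.78]
saturated = [0.93, 0.94, 0.94, 0.95, 0.95, 0.95, 0.96, 0.96, 0.97, 0.97]

print(task_stats(spread))
print(task_stats(saturated))
```

The flags are not mutually exclusive: a saturated task can count as both `ceiling` and `low-var`, which is why the per-benchmark counts in the tables need not sum to the number of task rows.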

	\| Nano set \| Dense analysis task rows \| avg_mean \| avg_std \| p90-p10 \| ceiling \| floor \| low-var \| healthy \|
	\| --- \| --- \| ---: \| ---: \| ---: \| ---: \| ---: \| ---: \| ---: \|
	\| `MNanoBEIR` \| 182 raw, grouped by `task_name` in Core \| 0.5521 \| 0.0476 \| 0.1042 \| 2 \| 1 \| 16 \| 73 \|
	\| `NanoMMTEB-v2` proxy \| 18 \| 0.5434 \| 0.0572 \| 0.1206 \| 5 \| 2 \| 5 \| 9 \|
	\| `NanoRTEB` \| 14 \| 0.5954 \| 0.0960 \| 0.2203 \| 0 \| 0 \| 0 \| 11 \|
	\| `NanoMLDR` \| 13 \| 0.5399 \| 0.0844 \| 0.1918 \| 0 \| 0 \| 0 \| 13 \|
	\| `NanoBRIGHT` \| 20 \| 0.3289 \| 0.1021 \| 0.2436 \| 0 \| 2 \| 0 \| 14 \|
	\| `NanoCoIR` \| 10 \| 0.7872 \| 0.0938 \| 0.2115 \| 3 \| 0 \| 0 \| 4 \|

	The analyzed result database still stored the current `NanoMMTEB-v2` family
	under the legacy `NanoMMTEB` benchmark label, so the row above is used as the
	best available dense-result proxy for `NanoMMTEB-v2`.

	This table explains several choices:

	- `NanoMLDR` was selected over `NanoLongEmbed` because all 13 `NanoMLDR` tasks
	were healthy and because its external adoption signals are stronger.
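	- `NanoBRIGHT` and `NanoRTEB` were retained because they show high model
	separation (avg_std 0.1021 and 0.0960) and few saturation artifacts: neither
	has a ceiling or low-variance task, and only `NanoBRIGHT` has floor tasks
	(2 of 20).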
	- `NanoMIRACL` was removed from Core because its recognition as a multilingual
	benchmark did not offset the low dense-model dispersion observed in this
	result warehouse.
	- `NanoLaw` was removed from Core because a large part of the set duplicates
	legal tasks already represented by `NanoRTEB` or `NanoMMTEB-v2`; its
	non-duplicate legal signal remains useful in the domain-specific view.

	## Evidence from Pruned Alternatives

	\| Nano set \| Effective tasks \| avg_mean \| avg_std \| p90-p10 \| ceiling \| floor \| low-var \| healthy \| Interpretation \|
	\| --- \| ---: \| ---: \| ---: \| ---: \| ---: \| ---: \| ---: \| ---: \| --- \|
	\| `NanoMIRACL` \| 18 \| 0.7880 \| 0.0280 \| 0.0597 \| 1 \| 0 \| 12 \| 1 \| Canonical multilingual benchmark, but too saturated and low-variance for the compact Core score. \|
	\| `NanoLaw` after overlap exclusions \| 4 \| 0.5634 \| 0.0686 \| 0.1516 \| 0 \| 0 \| 0 \| 4 \| Good legal-domain signal, but too much overlap with selected Core sources and too few effective tasks for a Core slot. \|
	\| `NanoLongEmbed` \| 6 \| 0.6265 \| 0.0911 \| 0.2049 \| 0 \| 0 \| 0 \| 3 \| Good dispersion, but weaker external signal than `NanoMLDR` and more reliance on synthetic long-context probes. \|
	\| `NanoBIRCO` \| 5 \| 0.2890 \| 0.0618 \| 0.1182 \| 0 \| 1 \| 1 \| 3 \| Valuable hard benchmark, but smaller and less externally established than `NanoBRIGHT` for the Core stress slot. \|
	\| `NanoDAPFAM` \| 18 \| 0.2870 \| 0.0322 \| 0.0754 \| 0 \| 6 \| 8 \| 0 \| Too low-variance for Core, despite being domain-distinct. \|
	\| `NanoMedical` after overlap exclusions \| 7 \| 0.5323 \| 0.0509 \| 0.1059 \| 0 \| 0 \| 0 \| 4 \| Reasonable optional domain set, but not stronger than selected Core candidates. \|
	\| `NanoR2MED` \| 8 \| 0.2626 \| 0.0944 \| 0.2264 \| 0 \| 2 \| 0 \| 5 \| Hard and discriminative, but newer and less established. \|
	\| `NanoMuPLeR` \| 14 \| 0.8113 \| 0.0765 \| 0.1848 \| 0 \| 0 \| 0 \| 11 \| Strong optional language/domain suite, but high-score and narrower than Core needs. \|
	\| `NanoJMTEB-v2` \| 11 \| 0.8132 \| 0.0430 \| 0.0945 \| 4 \| 0 \| 5 \| 2 \| Important Japanese diagnostic set, but too saturated and language-specific for Core. \|

	## External Adoption and Source Quality

	External signals were collected on 2026-05-21. Citation counts were treated as
	directional rather than exact because Crossref, OpenAlex, Google Scholar, and
	Hugging Face paper pages count different objects and update at different times.
	Newer papers are expected to have fewer citations.

	\| Evidence item \| Observed signal \| Interpretation \|
	\| --- \| --- \| --- \|
	\| MTEB paper \| Crossref 307 citations, OpenAlex 350 citations \| Strong source signal for MTEB-family retrieval tasks and the general evaluation design. \|
	\| MMTEB paper \| OpenAlex 11 citations, Hugging Face paper page with 1,072 citing datasets \| Newer than MTEB, but already visible through dataset usage. \|
	\| BEIR \| Very high external recognition; citation counts vary widely by source \| Supports keeping a BEIR-style multilingual anchor through `MNanoBEIR`. \|
	\| MIRACL \| Crossref 37 citations, OpenAlex 35 citations \| Moderate citation signal, but a canonical multilingual retrieval benchmark. \|
	\| BGE-M3 / MLDR \| Crossref 419 citations, OpenAlex 384 citations, Hugging Face paper page with 444 citing models \| Strong reason to promote `NanoMLDR` into Core. \|
	\| BRIGHT \| OpenAlex 3 citations, but `xlangai/BRIGHT` had 71 HF likes and 17,528 downloads \| New benchmark with strong dataset usage and high empirical discrimination. \|
	\| LegalBench \| OpenAlex 131 citations \| Strong legal benchmark signal, but not enough by itself to justify a Core slot when many NanoLaw tasks overlap with selected Core sources. \|
	\| LegalBench plus other NanoLaw source papers \| Approximately 208 OpenAlex citations across the inspected legal source papers \| `NanoLaw` is externally meaningful, but its cleaner role is a domain-specific legal view rather than an additional Core component. \|
	\| BIRCO \| OpenAlex 1 citation \| Valuable and difficult, but less established than `NanoBRIGHT` for the Core stress slot. \|
	\| CoIR \| ACL 2025 paper with few citations so far \| Kept because code retrieval is a distinct capability axis and citations are expected to lag for recent work. \|

	## MTEB Registration Check

	Official MTEB registration was checked against
	`embeddings-benchmark/mteb` main at commit
	`16cc3869619c78499c34bdb59533004899b0f4dc` on 2026-05-21. This matters because
	tasks already present in MTEB are more likely to be understood, reproduced, and
	compared by external users. It is not, by itself, a reason to include a Nano set
	in Core.

	All `NanoLaw` tasks map to MTEB retrieval tasks:

	\| NanoLaw task \| MTEB task name \|
	\| --- \| --- \|
	\| `NanoAILACasedocs` \| `AILACasedocs` \|
	\| `NanoAILAStatutes` \| `AILAStatutes` \|
	\| `NanoGerDaLIRSmall` \| `GerDaLIRSmall` \|
	\| `NanoLeCaRDv2` \| `LeCaRDv2` \|
	\| `NanoLegalBenchConsumerContractsQA` \| `LegalBenchConsumerContractsQA` \|
	\| `NanoLegalBenchCorporateLobbying` \| `LegalBenchCorporateLobbying` \|
	\| `NanoLegalQuAD` \| `LegalQuAD` \|
	\| `NanoLegalSummarization` \| `LegalSummarization` \|

	All `NanoBIRCO` tasks also map to MTEB retrieval tasks:

	\| NanoBIRCO task \| MTEB task name \|
	\| --- \| --- \|
	\| `NanoBIRCOArguAna` \| `BIRCO-ArguAna` \|
	\| `NanoBIRCOClinicalTrial` \| `BIRCO-ClinicalTrial` \|
	\| `NanoBIRCODorisMae` \| `BIRCO-DorisMae` \|
	\| `NanoBIRCORelic` \| `BIRCO-Relic` \|
	\| `NanoBIRCOWTB` \| `BIRCO-WTB` \|

	The MTEB check did not disqualify either `NanoLaw` or `NanoBIRCO`. The deciding
	factors were overlap and Core role clarity. `NanoLaw` contributes useful legal
	coverage but duplicates several tasks already represented by selected Core
	groups. `NanoBIRCO` is mostly non-overlapping, but it is better kept as a
	specialized hard complex-objective group because `NanoBRIGHT` fills the broader
	Core stress-test role with more tasks and stronger dense-model separation.

	## NanoLaw and NanoBIRCO

	`NanoLaw` and `NanoBIRCO` were both considered for Core and then left out for
	different reasons.

	\| Property \| `NanoLaw` \| `NanoBIRCO` \|
	\| --- \| ---: \| ---: \|
	\| Domain \| Legal retrieval \| Complex-objective general IR \|
	\| Languages \| English, German, Chinese \| English \|
	\| Subtasks \| 8 \| 5 \|
	\| Queries \| 1,259 \| 408 \|
	\| Split-local documents \| 15,142 \| 18,789 \|
	\| Positive qrels \| 5,488 \| 2,909 \|
	\| Query-weighted BM25 nDCG@10 \| 0.6275 \| 0.1822 \|
	\| Query-weighted BM25 hit@10 \| 0.8133 \| 0.3750 \|
	\| Effective Core tasks after overlap removal \| 4 \| 5 \|
	\| Dense healthy tasks \| 4 of 4 effective tasks \| 3 of 5 tasks \|

	`NanoLaw` is stronger as a legal-domain benchmark, and its source papers are
	better established. However, four of its eight tasks overlap with `NanoRTEB` or
	`NanoMMTEB-v2`, leaving only four effective non-duplicate tasks for a Core-style
	overall score. That makes the legal signal better suited to a focused
	domain-specific view.

	`NanoBIRCO` is harder for BM25 (query-weighted nDCG@10 of 0.1822 versus 0.6275
	for `NanoLaw`) and mostly non-overlapping, but it is small and its
	complex-objective role is covered more broadly by `NanoBRIGHT`. Both
	benchmarks remain useful diagnostic views outside Core.

	## Dataset Scale and Lexical Baselines

	The Core set mixes easy, hard, and lexical-overlap-resistant tasks. Some
	selected components have strong BM25 baselines because the source task is
	lexical by nature. Others are deliberately hard for BM25. `NanoLaw` and
	`NanoBIRCO` are shown here as pruned comparison points.

	\| Nano set \| Subtasks \| Queries \| Split-local documents \| Positive qrels \| Query-weighted BM25 nDCG@10 \| Query-weighted BM25 hit@10 \|
	\| --- \| ---: \| ---: \| ---: \| ---: \| ---: \| ---: \|
	\| `NanoMLDR` \| 13 \| 2,089 \| 55,585 \| 2,089 \| 0.7178 \| 0.7946 \|
	\| `NanoBRIGHT` \| 20 \| 2,245 \| 121,771 \| 9,287 \| 0.2156 \| 0.4454 \|
	\| `NanoLaw` \| 8 \| 1,259 \| 15,142 \| 5,488 \| 0.6275 \| 0.8133 \|
	\| `NanoCoIR` \| 10 \| 1,850 \| 76,295 \| 1,850 \| 0.5965 \| 0.6962 \|
	\| `NanoBIRCO` \| 5 \| 408 \| 18,789 \| 2,909 \| 0.1822 \| 0.3750 \|

	These baselines also explain why Core keeps `NanoBRIGHT`: it provides hard
	reasoning-heavy retrieval where BM25 is weak, at a larger scale than
	`NanoBIRCO`. `NanoLaw` remains important, but its legal-domain signal is partly
	covered by overlapping selected Core sources. `NanoCoIR` keeps a code retrieval
	axis whose failure modes are different again.
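For readers unfamiliar with the baseline columns, the sketch below shows one way such numbers could be computed. It assumes binary relevance and assumes that "query-weighted" means each subtask's metric is weighted by its query count; neither detail is stated in this document, and the document ids and subtask values are hypothetical.

```python
import math


def ndcg_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k for a single query."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_doc_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


def hit_at_k(ranked_doc_ids, relevant_ids, k=10):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant_ids for d in ranked_doc_ids[:k]) else 0.0


def query_weighted_mean(subtasks):
    """Average a per-subtask metric, weighting each subtask by query count."""
    total_queries = sum(n for n, _ in subtasks)
    return sum(n * value for n, value in subtasks) / total_queries


# One hypothetical query whose only relevant document is ranked second.
print(ndcg_at_k(["d3", "d1", "d7"], {"d1"}))
print(hit_at_k(["d3", "d1", "d7"], {"d1"}))

# Hypothetical (query_count, nDCG@10) pairs for three subtasks.
subtasks = [(200, 0.30), (100, 0.60), (100, 0.10)]
print(query_weighted_mean(subtasks))
```

Under this weighting, a subtask with many queries moves the group baseline more than a small one, so the result can differ from a plain average over subtasks.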

	## Aggregation and Overlap Policy

	Core normally uses one scoring unit per raw task row, except for explicitly
	configured grouped components. The important exception is `MNanoBEIR`, where
	Core uses `group_by: task_name` so that each BEIR source task contributes once
	after averaging across language variants.
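The effect of `group_by: task_name` can be sketched as follows. This is a minimal illustration, not the viewer's implementation: the function name, the row layout, `OtherTask`, and all scores are hypothetical, and the final step simply averages the scoring units to make the difference visible.

```python
from collections import defaultdict
from statistics import mean


def scoring_units(rows, group_by_task_name):
    """Collapse raw task rows into scoring units.

    rows: (task_name, language, score) tuples for one model.
    """
    if not group_by_task_name:
        # Default policy: every raw task row is its own scoring unit.
        return [score for _, _, score in rows]
    by_task = defaultdict(list)
    for task_name, _language, score in rows:
        by_task[task_name].append(score)
    # Grouped policy: average language variants, one unit per task name.
    return [mean(scores) for scores in by_task.values()]


# Hypothetical rows: one task in three languages, another in one language.
rows = [
    ("ArguAna", "en", 0.60),
    ("ArguAna", "de", 0.50),
    ("ArguAna", "ja", 0.40),
    ("OtherTask", "en", 0.80),
]

raw_units = scoring_units(rows, group_by_task_name=False)
grouped_units = scoring_units(rows, group_by_task_name=True)
print(len(raw_units), mean(raw_units))          # 4 units
print(len(grouped_units), mean(grouped_units))  # 2 units
```

Without grouping, the three language variants of one task outvote the single-language task three to one; with grouping, each source task carries equal weight.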

	Some benchmark configurations also define excluded tasks to prevent duplicate
	source tasks from being counted twice in leaderboard calculations. This does
	not mean the underlying dataset lacks those tasks; it means the viewer avoids
	scoring the uploaded duplicate copy. For `NanoLaw`, the following tasks overlap
	with `NanoRTEB` or `NanoMMTEB-v2`:

	- `NanoAILACasedocs`
	- `NanoAILAStatutes`
	- `NanoLegalBenchCorporateLobbying`
	- `NanoLegalSummarization`

	The remaining effective `NanoLaw` contribution is still useful, but small enough
	to keep outside the compact Core score:

	- `NanoGerDaLIRSmall`
	- `NanoLeCaRDv2`
	- `NanoLegalBenchConsumerContractsQA`
	- `NanoLegalQuAD`

	Those four effective tasks were all healthy in the dense dispersion analysis,
	which supports keeping `NanoLaw` as a domain-specific view even though it is no
	longer part of Core.
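The overlap exclusion above amounts to a filter over the task list. The sketch below reproduces the `NanoLaw` split using the task names from this document; the function and variable names are hypothetical and do not reflect the actual configuration keys.

```python
def effective_tasks(all_tasks, excluded):
    """Return the tasks that still count after overlap exclusions, in order."""
    excluded = set(excluded)
    return [task for task in all_tasks if task not in excluded]


NANOLAW_TASKS = [
    "NanoAILACasedocs",
    "NanoAILAStatutes",
    "NanoGerDaLIRSmall",
    "NanoLeCaRDv2",
    "NanoLegalBenchConsumerContractsQA",
    "NanoLegalBenchCorporateLobbying",
    "NanoLegalQuAD",
    "NanoLegalSummarization",
]

# Tasks that overlap with NanoRTEB or NanoMMTEB-v2, as listed above.
NANOLAW_OVERLAP = [
    "NanoAILACasedocs",
    "NanoAILAStatutes",
    "NanoLegalBenchCorporateLobbying",
    "NanoLegalSummarization",
]

remaining = effective_tasks(NANOLAW_TASKS, NANOLAW_OVERLAP)
print(len(remaining), remaining)
```

The excluded tasks stay in the dataset; only the scoring step skips them.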

	## Limitations

	This selection should be revisited when any of the following occurs:

	- More dense models are evaluated across all Nano groups.
	- `NanoCMTEB` receives comparable dense base results in the same DuckDB schema.
	- A new domain benchmark achieves both strong external adoption and strong model
	separation.
	- MTEB or MMTEB significantly changes the registered task catalog.
	- Saturation increases on `NanoCoIR` or `NanoMMTEB-v2` enough to reduce their
	usefulness as Core components.

	The Core set is not intended to replace the full `All` view. It is a compact
	summary. Domain and language-specific diagnosis should still use `All`,
	`Group`, and the individual benchmark views.

	## References and Source Pointers

	- HAKARI Core configuration: `config/viewer/overall.yaml`.
	- HAKARI benchmark metadata and task docs: `docs/benchmark_tasks/`.
	- Dense-result warehouse used for the 2026-05-21 analysis:
	`output/results/hakari_bench.duckdb` in the evaluated local worktree.
	- Official MTEB repository checked on 2026-05-21:
	<https://github.com/embeddings-benchmark/mteb>.
	- BEIR paper: <https://arxiv.org/abs/2104.08663>.
	- MTEB paper: <https://aclanthology.org/2023.eacl-main.148/>.
	- MMTEB paper: <https://arxiv.org/abs/2502.13595>.
	- MIRACL paper: <https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00595/116724/MIRACL-A-Multilingual-Retrieval-Dataset>.
	- BGE-M3 / MLDR paper: <https://aclanthology.org/2024.findings-acl.137/>.
	- BRIGHT paper: <https://arxiv.org/abs/2407.12883>.
	- LegalBench paper: <https://arxiv.org/abs/2308.11462>.
	- BIRCO paper: <https://arxiv.org/abs/2402.14151>.
	- CoIR paper: <https://aclanthology.org/2025.acl-long.1072/>.