HAKARI Core Set Selection Rationale

Research date: 2026-05-21 JST

Selection update: 2026-05-22 JST

Abstract

The HAKARI Core set is the small, curated leaderboard view intended to answer a single question: how should a dense retrieval model be compared when the full HAKARI benchmark inventory is too large to interpret at once? The selected Core set is:

  1. MNanoBEIR
  2. NanoMMTEB-v2
  3. NanoRTEB
  4. NanoMLDR
  5. NanoBRIGHT
  6. NanoCoIR

This document records why these six Nano sets were selected and why several plausible candidates were left out. The decision was made by combining external adoption signals, source benchmark quality, task and language diversity, overlap analysis, lexical baseline difficulty, and actual dense-model score dispersion from the evaluated DuckDB warehouse.

The goal was not to maximize task count or to include every important domain. It was to keep a compact set whose aggregate score is interpretable, broad, and difficult to game by over-weighting one benchmark family. Domain-specific benchmarks that are useful but redundant, narrow, saturated, or better read in their own view remain available outside Core.

The Core score also uses configured aggregation units rather than blindly averaging every raw task row. In particular, MNanoBEIR is aggregated by task_name: an ArguAna-style task is first averaged across its language variants and then contributes as one Core scoring unit. This preserves the multilingual BEIR anchor without allowing the raw language x task matrix to dominate the Core aggregate.

Final Core Set

| Position | Nano set | Role in Core | Main reason for inclusion |
|---|---|---|---|
| 1 | MNanoBEIR | Classical multilingual IR anchor | BEIR-style retrieval remains a common reference point; Core aggregates it by source task name so multilingual coverage does not dominate by raw row count. |
| 2 | NanoMMTEB-v2 | Broad multilingual MTEB/MMTEB anchor | Represents modern MTEB-style retrieval coverage across many task types and languages. |
| 3 | NanoRTEB | Practical retrieval domains | Adds English RTEB-style applied retrieval tasks with strong model separation. |
| 4 | NanoMLDR | Multilingual long-document retrieval | Strong external adoption through BGE-M3/MLDR and excellent dense score dispersion across all languages. |
| 5 | NanoBRIGHT | Reasoning-heavy retrieval stress test | Hard tasks with high model separation and strong dataset usage signals. |
| 6 | NanoCoIR | Code retrieval | Preserves a code-search dimension that is not captured by general, long-document, or reasoning-heavy retrieval tasks. |

The most important late-stage changes were the removals of NanoMIRACL and NanoLaw. NanoMIRACL is well known, but it was too saturated in the analyzed dense results. NanoLaw is a useful legal benchmark, but too many of its tasks are already represented through selected Core sources, leaving only four effective non-duplicate tasks.

Pruning Decisions

The Core set was deliberately pruned. These decisions are as important as the selected set because they prevent the Core score from becoming a second copy of the All view.

| Nano set | Decision | Reason |
|---|---|---|
| NanoMIRACL | Removed from Core after review | MIRACL remains a canonical multilingual benchmark, but the analyzed dense results showed substantial saturation and low model separation. Its role is better served by the All and benchmark-specific views than by the compact Core score. |
| NanoLaw | Removed from Core after review | Legal retrieval remains important, but many NanoLaw tasks duplicate similar source tasks already covered by NanoRTEB or NanoMMTEB-v2. After overlap removal, its effective contribution is only four tasks, so it is cleaner as a domain-specific view. |
| NanoLongEmbed | Removed from the earlier Core proposal | Dense dispersion was good, but the set contains synthetic long-context probes such as passkey/needle-style tasks and has weaker external adoption than NanoMLDR. NanoMLDR gives a cleaner multilingual long-document retrieval signal. |
| NanoBIRCO | Not promoted | NanoBIRCO is valuable as a complex-objective stress test, but it is small, English-only, and overlaps in role with the broader NanoBRIGHT reasoning-heavy stress slot. |
| NanoDAPFAM | Not promoted | Patent retrieval is distinctive, but dense model dispersion was very low and many tasks were floor-like. Better suited to a domain appendix. |
| NanoMedical | Not promoted | Useful medical benchmark, but after overlap removal it is less discriminative than NanoBRIGHT, NanoMLDR, or NanoRTEB. |
| NanoR2MED | Not promoted | Hard medical reasoning stress test with good dispersion, but newer and less established. Better as an optional stress suite. |
| NanoMuPLeR | Not promoted | Good dense dispersion, but high average scores and a narrow multilingual task shape. Keep as a language/domain appendix rather than Core. |
| NanoJMTEB-v2 and other language-family NanoMTEB groups | Not promoted | Important for language-specific diagnostics, but including them in Core would over-weight MTEB-family language views. |
| NanoCMTEB | Deferred | Present in configuration, but the analyzed DuckDB did not contain comparable dense base rows for this set. |

Selection Criteria

The Core set was chosen using five criteria.

  1. External benchmark credibility

    Core tasks should come from benchmarks or datasets that are used outside this repository. Signals included paper citations, Hugging Face dataset likes and downloads, and whether the source tasks are registered in the official MTEB task catalog. External credibility is necessary but not sufficient: a task can still be excluded if it is saturated or redundant with a stronger Core source.

  2. Task diversity

    The final set covers classical multilingual IR, broad MTEB/MMTEB retrieval, RTEB-style applied retrieval, multilingual long-document retrieval, reasoning-heavy retrieval, and code retrieval.

  3. Language diversity

    The set contains broad multilingual groups (MNanoBEIR, NanoMMTEB-v2, NanoMLDR) while avoiding a Core made mostly of language-specific MTEB-family views.

  4. Low redundancy

    Candidate groups were checked for source-task overlap and for rank correlation across evaluated dense models. Some overlap is intentional for anchor tasks, but the final set avoids adding multiple groups that primarily express the same signal. This criterion is the main reason NanoLaw was moved out of Core despite its legal-domain value.

  5. Empirical model separation

    A benchmark should usually distinguish current dense models. We therefore measured per-task score dispersion across ten evaluated dense embedding models, using only base dense rows and excluding embedding variants.
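
The rank-correlation check in criterion 4 can be illustrated with a short sketch. It assumes Spearman correlation over per-model group scores; the text only says "rank correlation", so the choice of Spearman, the function names, and the example scores are all hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical group-level scores for five models, in the same model order.
group_a = [0.41, 0.55, 0.62, 0.70, 0.48]
group_b = [0.30, 0.44, 0.58, 0.61, 0.35]  # orders the models exactly like group_a
group_c = [0.80, 0.20, 0.55, 0.10, 0.60]  # orders the models very differently

rho_ab = spearman(group_a, group_b)
rho_ac = spearman(group_a, group_c)
print(round(rho_ab, 3), round(rho_ac, 3))
```

A correlation near 1 between two candidate groups suggests they rank models the same way and that one of them adds little to a compact aggregate; a low or negative value suggests a distinct signal.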

Evidence from Evaluated Dense Results

Dense score evidence came from the evaluated DuckDB warehouse available on 2026-05-21. The analyzed rows used embedding_variant_name IS NULL, so int8, binary, truncate, and rescore variants were excluded. Ten dense embedding models had complete base results for 514 tasks across 32 benchmark groups.
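
The base-row filter described above can be sketched as follows. The snippet uses Python's built-in sqlite3 module as a stand-in so that it runs without the warehouse; the same SQL shape should carry over to DuckDB, but that is an assumption. The table name results and the columns model_name, task_name, and score are hypothetical; only embedding_variant_name comes from the text.

```python
import sqlite3

# Hypothetical schema: only embedding_variant_name is named in the document.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results ("
    "model_name TEXT, task_name TEXT, embedding_variant_name TEXT, score REAL)"
)
rows = [
    ("model-a", "task-1", None, 0.61),
    ("model-a", "task-2", None, 0.47),
    ("model-a", "task-1", "int8", 0.60),    # variant row, excluded by the filter
    ("model-b", "task-1", None, 0.55),
    ("model-b", "task-2", "binary", 0.40),  # model-b has no base row for task-2
]
conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", rows)

# Keep base dense rows only, then keep models that cover every base task.
COMPLETE_BASE_MODELS = """
WITH base AS (
    SELECT model_name, task_name
    FROM results
    WHERE embedding_variant_name IS NULL
)
SELECT model_name
FROM base
GROUP BY model_name
HAVING COUNT(DISTINCT task_name) = (SELECT COUNT(DISTINCT task_name) FROM base)
ORDER BY model_name
"""
complete_models = [row[0] for row in conn.execute(COMPLETE_BASE_MODELS)]
print(complete_models)
```

Restricting the comparison to models with complete base coverage keeps every per-task statistic computed over the same set of models.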

For each raw task row, we computed score dispersion across the ten models. The table below reports benchmark-level means over those task-level statistics. MNanoBEIR is shown with its raw task-row count for transparency, even though Core scoring groups it by task_name.

Definitions:

  • avg_mean: average task mean score across the ten dense models.
  • avg_std: average within-task standard deviation across the ten models.
  • p90-p10: average within-task 90th percentile minus 10th percentile.
  • ceiling: tasks with mean >= 0.90 and std <= 0.05.
  • floor: tasks with mean <= 0.25 and std <= 0.05.
  • low-var: tasks with std <= 0.03.
  • healthy: tasks with 0.25 < mean < 0.85 and std >= 0.05.
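
The definitions above can be expressed as a small sketch. The thresholds are taken from the list; the use of population standard deviation and of linearly interpolated (inclusive) percentiles is an assumption, since the text does not say which estimators were used. Function names and example scores are hypothetical.

```python
from statistics import mean, pstdev, quantiles


def task_stats(scores):
    """Per-task dispersion statistics across models (needs at least two scores)."""
    # quantiles(n=10) returns nine cut points: index 0 is p10, index 8 is p90.
    deciles = quantiles(scores, n=10, method="inclusive")
    return {
        "mean": mean(scores),
        "std": pstdev(scores),  # assumption: population standard deviation
        "p90_p10": deciles[8] - deciles[0],
    }


def task_flags(stats):
    """Apply the ceiling / floor / low-var / healthy thresholds from the list above."""
    m, s = stats["mean"], stats["std"]
    return {
        "ceiling": m >= 0.90 and s <= 0.05,
        "floor": m <= 0.25 and s <= 0.05,
        "low_var": s <= 0.03,
        "healthy": 0.25 < m < 0.85 and s >= 0.05,
    }


# Hypothetical scores for one task across ten models.
saturated = [0.95, 0.96, 0.94, 0.95, 0.97, 0.93, 0.95, 0.96, 0.94, 0.95]
spread = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75]

saturated_flags = task_flags(task_stats(saturated))
spread_stats = task_stats(spread)
spread_flags = task_flags(spread_stats)
print(saturated_flags)
print(spread_stats, spread_flags)
```

The flags are not mutually exclusive: a saturated task is typically both ceiling and low-var, while a task can also fall outside all four categories.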

| Nano set | Dense analysis task rows | avg_mean | avg_std | p90-p10 | ceiling | floor | low-var | healthy |
|---|---|---|---|---|---|---|---|---|
| MNanoBEIR | 182 raw, grouped by task_name in Core | 0.5521 | 0.0476 | 0.1042 | 2 | 1 | 16 | 73 |
| NanoMMTEB-v2 proxy | 18 | 0.5434 | 0.0572 | 0.1206 | 5 | 2 | 5 | 9 |
| NanoRTEB | 14 | 0.5954 | 0.0960 | 0.2203 | 0 | 0 | 0 | 11 |
| NanoMLDR | 13 | 0.5399 | 0.0844 | 0.1918 | 0 | 0 | 0 | 13 |
| NanoBRIGHT | 20 | 0.3289 | 0.1021 | 0.2436 | 0 | 2 | 0 | 14 |
| NanoCoIR | 10 | 0.7872 | 0.0938 | 0.2115 | 3 | 0 | 0 | 4 |

The analyzed result database still stored the current NanoMMTEB-v2 family under the legacy NanoMMTEB benchmark label, so the row above is used as the best available dense-result proxy for NanoMMTEB-v2.

This table explains several choices:

  • NanoMLDR was selected over NanoLongEmbed because all 13 NanoMLDR tasks were healthy and because its external adoption signals are stronger.
  • NanoBRIGHT and NanoRTEB were retained because they show high model separation and few saturation artifacts.
  • NanoMIRACL was removed from Core because its recognition as a multilingual benchmark did not offset the low dense-model dispersion observed in this result warehouse.
  • NanoLaw was removed from Core because a large part of the set duplicates legal tasks already represented by NanoRTEB or NanoMMTEB-v2; its non-duplicate legal signal remains useful in the domain-specific view.

Evidence from Pruned Alternatives

| Nano set | Effective tasks | avg_mean | avg_std | p90-p10 | ceiling | floor | low-var | healthy | Interpretation |
|---|---|---|---|---|---|---|---|---|---|
| NanoMIRACL | 18 | 0.7880 | 0.0280 | 0.0597 | 1 | 0 | 12 | 1 | Canonical multilingual benchmark, but too saturated and low-variance for the compact Core score. |
| NanoLaw after overlap exclusions | 4 | 0.5634 | 0.0686 | 0.1516 | 0 | 0 | 0 | 4 | Good legal-domain signal, but too much overlap with selected Core sources and too few effective tasks for a Core slot. |
| NanoLongEmbed | 6 | 0.6265 | 0.0911 | 0.2049 | 0 | 0 | 0 | 3 | Good dispersion, but weaker external signal and more synthetic long-context overlap than NanoMLDR. |
| NanoBIRCO | 5 | 0.2890 | 0.0618 | 0.1182 | 0 | 1 | 1 | 3 | Valuable hard benchmark, but smaller and less externally established than NanoBRIGHT for the Core stress slot. |
| NanoDAPFAM | 18 | 0.2870 | 0.0322 | 0.0754 | 0 | 6 | 8 | 0 | Too low-variance for Core, despite being domain-distinct. |
| NanoMedical after overlap exclusions | 7 | 0.5323 | 0.0509 | 0.1059 | 0 | 0 | 0 | 4 | Reasonable optional domain set, but not stronger than selected Core candidates. |
| NanoR2MED | 8 | 0.2626 | 0.0944 | 0.2264 | 0 | 2 | 0 | 5 | Hard and discriminative, but newer and less established. |
| NanoMuPLeR | 14 | 0.8113 | 0.0765 | 0.1848 | 0 | 0 | 0 | 11 | Strong optional language/domain suite, but high-score and narrower than Core needs. |
| NanoJMTEB-v2 | 11 | 0.8132 | 0.0430 | 0.0945 | 4 | 0 | 5 | 2 | Important Japanese diagnostic set, but too saturated and language-specific for Core. |

External Adoption and Source Quality

External signals were collected on 2026-05-21. Citation counts were treated as directional rather than exact because Crossref, OpenAlex, Google Scholar, and Hugging Face paper pages count different objects and update at different times. Newer papers are expected to have fewer citations.

| Evidence item | Observed signal | Interpretation |
|---|---|---|
| MTEB paper | Crossref 307 citations, OpenAlex 350 citations | Strong source signal for MTEB-family retrieval tasks and the general evaluation design. |
| MMTEB paper | OpenAlex 11 citations, Hugging Face paper page with 1,072 citing datasets | Newer than MTEB, but already visible through dataset usage. |
| BEIR | Very high external recognition; citation counts vary widely by source | Supports keeping a BEIR-style multilingual anchor through MNanoBEIR. |
| MIRACL | Crossref 37 citations, OpenAlex 35 citations | Moderate citation signal, but a canonical multilingual retrieval benchmark. |
| BGE-M3 / MLDR | Crossref 419 citations, OpenAlex 384 citations, Hugging Face paper page with 444 citing models | Strong reason to promote NanoMLDR into Core. |
| BRIGHT | OpenAlex 3 citations, but xlangai/BRIGHT had 71 HF likes and 17,528 downloads | New benchmark with strong dataset usage and high empirical discrimination. |
| LegalBench | OpenAlex 131 citations | Strong legal benchmark signal, but not enough by itself to justify a Core slot when many NanoLaw tasks overlap with selected Core sources. |
| LegalBench plus other NanoLaw source papers | Approximately 208 OpenAlex citations across the inspected legal source papers | NanoLaw is externally meaningful, but its cleaner role is a domain-specific legal view rather than an additional Core component. |
| BIRCO | OpenAlex 1 citation | Valuable and difficult, but less established than NanoBRIGHT for the Core stress slot. |
| CoIR | ACL 2025-era source with low early citations | Kept because code retrieval is a distinct capability axis and citations are expected to lag for recent work. |

MTEB Registration Check

Official MTEB registration was checked against embeddings-benchmark/mteb main at commit 16cc3869619c78499c34bdb59533004899b0f4dc on 2026-05-21. This matters because tasks already present in MTEB are more likely to be understood, reproduced, and compared by external users. It is not, by itself, a reason to include a Nano set in Core.

All NanoLaw tasks map to MTEB retrieval tasks:

| NanoLaw task | MTEB task name |
|---|---|
| NanoAILACasedocs | AILACasedocs |
| NanoAILAStatutes | AILAStatutes |
| NanoGerDaLIRSmall | GerDaLIRSmall |
| NanoLeCaRDv2 | LeCaRDv2 |
| NanoLegalBenchConsumerContractsQA | LegalBenchConsumerContractsQA |
| NanoLegalBenchCorporateLobbying | LegalBenchCorporateLobbying |
| NanoLegalQuAD | LegalQuAD |
| NanoLegalSummarization | LegalSummarization |

All NanoBIRCO tasks also map to MTEB retrieval tasks:

| NanoBIRCO task | MTEB task name |
|---|---|
| NanoBIRCOArguAna | BIRCO-ArguAna |
| NanoBIRCOClinicalTrial | BIRCO-ClinicalTrial |
| NanoBIRCODorisMae | BIRCO-DorisMae |
| NanoBIRCORelic | BIRCO-Relic |
| NanoBIRCOWTB | BIRCO-WTB |
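
The two mapping tables above follow a visible naming pattern: the Nano prefix is dropped, and BIRCO tasks gain a hyphen after the family name. A minimal sketch of that pattern is shown below; it is inferred from the listed rows only, the function name is hypothetical, and it does not query the MTEB registry itself.

```python
def mteb_task_name(nano_task: str) -> str:
    """Derive the MTEB task name from a Nano task name, per the pattern in the tables."""
    prefix = "Nano"
    if not nano_task.startswith(prefix):
        raise ValueError(f"not a Nano task name: {nano_task!r}")
    name = nano_task[len(prefix):]
    # BIRCO tasks are listed in the table with a hyphen after the family name.
    if name.startswith("BIRCO"):
        return "BIRCO-" + name[len("BIRCO"):]
    return name


mapped = {
    task: mteb_task_name(task)
    for task in ["NanoLeCaRDv2", "NanoLegalQuAD", "NanoBIRCOArguAna", "NanoBIRCOWTB"]
}
print(mapped)
```

Names produced this way would still need to be confirmed against the MTEB task catalog at the pinned commit, since the pattern is not guaranteed to hold for other Nano groups.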

The MTEB check did not disqualify either NanoLaw or NanoBIRCO. The deciding factor was overlap and Core role clarity. NanoLaw contributes useful legal coverage but duplicates several tasks already represented by selected Core groups. NanoBIRCO is mostly non-overlapping, but it is better kept as a specialized hard complex-objective group because NanoBRIGHT fills the broader Core stress-test role with more tasks and stronger dense-model separation.

NanoLaw and NanoBIRCO

NanoLaw and NanoBIRCO were both considered for Core and then left out for different reasons.

| Property | NanoLaw | NanoBIRCO |
|---|---|---|
| Domain | Legal retrieval | Complex-objective general IR |
| Languages | English, German, Chinese | English |
| Subtasks | 8 | 5 |
| Queries | 1,259 | 408 |
| Split-local documents | 15,142 | 18,789 |
| Positive qrels | 5,488 | 2,909 |
| Query-weighted BM25 nDCG@10 | 0.6275 | 0.1822 |
| Query-weighted BM25 hit@10 | 0.8133 | 0.3750 |
| Effective Core tasks after overlap removal | 4 | 5 |
| Dense healthy tasks | 4 of 4 effective tasks | 3 of 5 tasks |

NanoLaw is stronger as a legal-domain benchmark, and its source papers are better established. However, four of its eight tasks overlap with NanoRTEB or NanoMMTEB-v2, leaving only four effective non-duplicate tasks for a Core-style overall score. That makes the legal signal better suited to a focused domain-specific view.

NanoBIRCO is lexically harder and mostly non-overlapping, but it is small and its complex-objective role is covered more broadly by NanoBRIGHT. Both benchmarks remain useful diagnostic views outside Core.

Dataset Scale and Lexical Baselines

The Core set mixes easy, hard, and lexical-overlap-resistant tasks. Some selected components have strong BM25 baselines because the source task is lexical by nature. Others are deliberately hard for BM25. NanoLaw and NanoBIRCO are shown here as pruned comparison points.

| Nano set | Subtasks | Queries | Split-local documents | Positive qrels | Query-weighted BM25 nDCG@10 | Query-weighted BM25 hit@10 |
|---|---|---|---|---|---|---|
| NanoMLDR | 13 | 2,089 | 55,585 | 2,089 | 0.7178 | 0.7946 |
| NanoBRIGHT | 20 | 2,245 | 121,771 | 9,287 | 0.2156 | 0.4454 |
| NanoLaw | 8 | 1,259 | 15,142 | 5,488 | 0.6275 | 0.8133 |
| NanoCoIR | 10 | 1,850 | 76,295 | 1,850 | 0.5965 | 0.6962 |
| NanoBIRCO | 5 | 408 | 18,789 | 2,909 | 0.1822 | 0.3750 |

These baselines also explain why Core keeps NanoBRIGHT: it provides hard reasoning-heavy retrieval where BM25 is weak, at a larger scale than NanoBIRCO. NanoLaw remains important, but its legal-domain signal is partly covered by overlapping selected Core sources. NanoCoIR keeps a code retrieval axis whose failure modes differ from those of both the lexical and the reasoning-heavy sets.
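
The baseline columns above can be illustrated with a short sketch of the two metrics and of query weighting. It assumes binary relevance and reads "query-weighted" as weighting each subtask's mean by its number of queries; neither detail is stated in the text, and the function names and example numbers are hypothetical.

```python
import math


def ndcg_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k for a single query."""
    dcg = sum(
        1.0 / math.log2(position + 2)  # position 0 gets discount log2(2) = 1
        for position, doc_id in enumerate(ranked_doc_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def hit_at_k(ranked_doc_ids, relevant_ids, k=10):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant_ids for d in ranked_doc_ids[:k]) else 0.0


def query_weighted_mean(subtasks):
    """subtasks: iterable of (num_queries, subtask_mean_score) pairs."""
    subtasks = list(subtasks)
    total_queries = sum(n for n, _ in subtasks)
    return sum(n * score for n, score in subtasks) / total_queries


# One hypothetical query: the only relevant document is ranked second.
ndcg = ndcg_at_k(["d3", "d1", "d7"], {"d1"})
hit = hit_at_k(["d3", "d1", "d7"], {"d1"})

# Two hypothetical subtasks with 100 and 300 queries.
weighted = query_weighted_mean([(100, 0.2), (300, 0.6)])
print(round(ndcg, 4), hit, weighted)
```

Under this reading, large subtasks pull the group-level baseline toward their own score; in the example the weighted value is 0.5, whereas an unweighted mean of the two subtask scores would be 0.4.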

Aggregation and Overlap Policy

Core normally uses one scoring unit per raw task row, except for explicitly configured grouped components. The important exception is MNanoBEIR, where Core uses group_by: task_name so that each BEIR source task contributes once after averaging across language variants.
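
A minimal sketch of this policy follows. It only builds the scoring units; how Core combines the units into a final score is not specified here and is left out. The row fields (benchmark, task_name, language, score), the function name, and the example rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def core_scoring_units(rows, grouped_benchmarks=frozenset({"MNanoBEIR"})):
    """Return (benchmark, task_name, score) units.

    Rows from grouped benchmarks are averaged per task_name across language
    variants; every other raw task row becomes its own unit.
    """
    units = []
    buckets = defaultdict(list)
    for row in rows:
        if row["benchmark"] in grouped_benchmarks:
            buckets[(row["benchmark"], row["task_name"])].append(row["score"])
        else:
            units.append((row["benchmark"], row["task_name"], row["score"]))
    for (benchmark, task_name), scores in buckets.items():
        units.append((benchmark, task_name, mean(scores)))
    return units


# Hypothetical raw rows: five MNanoBEIR language variants and one NanoRTEB row.
rows = [
    {"benchmark": "MNanoBEIR", "task_name": "ArguAna", "language": "en", "score": 0.6},
    {"benchmark": "MNanoBEIR", "task_name": "ArguAna", "language": "ja", "score": 0.4},
    {"benchmark": "MNanoBEIR", "task_name": "ArguAna", "language": "de", "score": 0.5},
    {"benchmark": "MNanoBEIR", "task_name": "TaskB", "language": "en", "score": 0.7},
    {"benchmark": "MNanoBEIR", "task_name": "TaskB", "language": "ja", "score": 0.5},
    {"benchmark": "NanoRTEB", "task_name": "TaskX", "language": "en", "score": 0.3},
]
units = core_scoring_units(rows)
unit_scores = {(b, t): s for b, t, s in units}
print(len(rows), len(units), unit_scores)
```

In this example six raw rows collapse to three units, so the grouped benchmark contributes two units instead of five.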

Some benchmark configurations also define excluded tasks to prevent duplicate source tasks from being counted twice in leaderboard calculations. This does not mean the underlying dataset lacks those tasks; it means the viewer avoids scoring the uploaded duplicate copy. For NanoLaw, the following tasks overlap with NanoRTEB or NanoMMTEB-v2:

  • NanoAILACasedocs
  • NanoAILAStatutes
  • NanoLegalBenchCorporateLobbying
  • NanoLegalSummarization

The remaining effective NanoLaw contribution is still useful, but small enough to keep outside the compact Core score:

  • NanoGerDaLIRSmall
  • NanoLeCaRDv2
  • NanoLegalBenchConsumerContractsQA
  • NanoLegalQuAD

Those four effective tasks were all healthy in the dense dispersion analysis, which supports keeping NanoLaw as a domain-specific view even though it is no longer part of Core.
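
The overlap exclusion above amounts to a set difference over task names. The sketch below uses the NanoLaw task names listed in this document; the function name and the idea of passing exclusions as a plain set are illustrative assumptions rather than the viewer's actual configuration format.

```python
def effective_tasks(all_tasks, excluded_tasks):
    """Return the tasks that still count toward scoring, preserving input order."""
    excluded = set(excluded_tasks)
    return [task for task in all_tasks if task not in excluded]


NANOLAW_TASKS = [
    "NanoAILACasedocs",
    "NanoAILAStatutes",
    "NanoGerDaLIRSmall",
    "NanoLeCaRDv2",
    "NanoLegalBenchConsumerContractsQA",
    "NanoLegalBenchCorporateLobbying",
    "NanoLegalQuAD",
    "NanoLegalSummarization",
]

# Tasks that overlap with NanoRTEB or NanoMMTEB-v2, as listed above.
NANOLAW_EXCLUDED = {
    "NanoAILACasedocs",
    "NanoAILAStatutes",
    "NanoLegalBenchCorporateLobbying",
    "NanoLegalSummarization",
}

nanolaw_effective = effective_tasks(NANOLAW_TASKS, NANOLAW_EXCLUDED)
print(len(nanolaw_effective), nanolaw_effective)
```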

Limitations

This selection should be revisited when one of the following changes:

  • More dense models are evaluated across all Nano groups.
  • NanoCMTEB receives comparable dense base results in the same DuckDB schema.
  • A new domain benchmark achieves both strong external adoption and strong model separation.
  • MTEB or MMTEB significantly changes the registered task catalog.
  • Saturation increases on NanoCoIR or NanoMMTEB-v2 enough to reduce their usefulness as Core components.

The Core set is not intended to replace the full All view. It is a compact summary. Domain and language-specific diagnosis should still use All, Group, and the individual benchmark views.

References and Source Pointers