arxiv:2606.22778

HAKARI-Bench: A Lightweight Benchmark for Comparing Retrieval Architectures and Efficiency Settings under Unified Conditions

Published on Jun 22 · Submitted by Yuichi Tateno on Jun 23
Abstract

HAKARI-Bench provides a lightweight benchmark for comparing retrieval methods across multiple configurations and languages, enabling efficient model selection and performance analysis.

With the rapid spread of retrieval-augmented generation and semantic search, choosing the right embedding and retrieval configuration is increasingly hard. Large retrieval benchmarks are comprehensive but too heavy to rerun during development, and there is little infrastructure for comparing production settings (dimensionality reduction, quantization, reranking) across many models under identical conditions. We present HAKARI-Bench, a lightweight benchmark that reconstructs existing retrieval suites into small datasets (Nano-sets): 35 benchmarks and 551 tasks across 43 languages in a unified format, enabling same-condition, model-agnostic comparison of five retrieval families (BM25, dense, sparse, late interaction, rerankers) and their efficiency variants. Across 55 models, its overall ranking reproduces the official MTEB retrieval v2, MMTEB v2 retrieval, and English BEIR (full) rankings at a Spearman correlation above 0.97. HAKARI-Bench does not replace full evaluation; it enables rapid model selection, regression detection, and reading the quality-efficiency Pareto frontier. Code, data, and leaderboard are released under the MIT license.

Community

Paper submitter

Evaluating and selecting the right retrieval configuration for RAG or semantic search has become notoriously difficult. Large-scale benchmarks such as MTEB and MMTEB are comprehensive, but they are too expensive to rerun during iterative development. Furthermore, production-time settings (embedding dimensionality reduction, quantization, and reranking) are rarely reported under identical, model-agnostic conditions.

To bridge this gap, I built HAKARI-Bench, a lightweight evaluation infrastructure that reconstructs existing retrieval benchmarks into small "Nano-sets" (35 benchmarks and 551 tasks across 43 languages).

Despite its small data footprint, it acts as a high-fidelity ranking proxy: across 55 models, its overall ranking reproduces the official MTEB retrieval v2, MMTEB v2 retrieval, and English BEIR (full) rankings with a Spearman rank correlation above 0.97.
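As a rough illustration of what that agreement check measures, here is a minimal sketch of a Spearman rank correlation between per-model scores on a small set and on a full benchmark. The model names and scores are made up for the example, and the helper functions are hypothetical; this is not the HAKARI-Bench code.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n (1 = smallest), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores, invented for illustration only.
nano_scores = {"model_a": 0.61, "model_b": 0.55, "model_c": 0.58, "model_d": 0.42}
full_scores = {"model_a": 0.66, "model_b": 0.60, "model_c": 0.59, "model_d": 0.47}

models = sorted(nano_scores)
rho = spearman([nano_scores[m] for m in models], [full_scores[m] for m in models])
print(f"Spearman rho = {rho:.2f}")
```

In this toy data two of the four models swap places, so rho is well below 1; a value above 0.97 over 55 models means the small sets order the models almost exactly as the full benchmarks do.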

🛠️ Why it matters for practical IR/RAG development:

  • 5 Retrieval Families & Realistic 2-Stage Shuffling: Evaluate BM25, Dense, Sparse (SPLADE), Late Interaction (ColBERT), and Rerankers under a unified format. To accurately simulate real-world production setups, Rerankers are evaluated over a fixed, reproducible hybrid candidate set fusing BM25 and Dense via Reciprocal Rank Fusion (RRF Top 100). This decouples candidate-generation failures from pure reranking accuracy.

  • Mapping the Quality-Efficiency Pareto Frontier: We applied post-encoding efficiency transformations uniformly across 33 dense models. You can instantly see how much quality you sacrifice when compressing embeddings to save storage, memory, and latency:

      • Int8 is nearly free: Scalar int8 quantization causes a minor, uniform drop (mean -1.95 points).

      • Binary is model-specific, but fixable: Binary quantization severely collapses some models (e.g., the multilingual-E5 family dropping up to -35.8 points) but spares others (Jina v5, Gemma, Arctic, Qwen3). Crucially, adding a simple float rescore on the top-100 candidates almost completely restores the original rankings (reducing the binary drop to a mere -0.93 points).

  • Sparse Pruning Knobs: For SPLADE-style models, document-side pruning is a "cheap knob" (lossless down to 256 active dimensions), whereas query-side pruning is an "expensive knob" that degrades quality sharply below 16 dimensions.

  • Fully Open-Source with a DuckDB Analytics Backend: We provide a multi-axis leaderboard/viewer alongside the pipeline. Because the backend stores all runs in a DuckDB warehouse, rendering is fast, and you can slice the data via SQL to filter by language, domain, context length, or parameter count.
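For readers unfamiliar with Reciprocal Rank Fusion, which the reranker setup in the first bullet relies on, the following is a minimal sketch of fusing a BM25 ranking and a dense ranking into one fixed candidate list. The constant `k=60` is the commonly used default and an assumption here (the post does not state the value used), and the function name and document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc IDs with Reciprocal Rank Fusion.

    Each document receives sum(1 / (k + rank)) over the lists it appears in,
    with 1-based ranks. k=60 is an assumed, conventional default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc ID so the
    # candidate list is reproducible across runs.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_n]


# Toy first-stage results (hypothetical document IDs).
bm25_ranking = ["d3", "d1", "d7", "d2"]
dense_ranking = ["d1", "d5", "d3", "d9"]

candidates = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(candidates)
```

Because every reranker sees the same fused candidate list, differences in the final scores reflect reranking quality and not differences in first-stage retrieval.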

HAKARI-Bench is not a replacement for full-scale evaluation; rather, it is designed to speed up model selection, regression testing, and production cost optimization under a single unified harness.
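To make the "binary quantization plus float rescore" finding from the bullets above more concrete, here is a minimal pure-Python sketch: a cheap Hamming-distance search over sign bits builds a shortlist, and full-precision dot products then reorder it. Rescoring against the original float document vectors is an assumption made for this example (the post does not spell out the exact rescoring inputs), and all function names and vectors are hypothetical.

```python
def binarize(vec):
    """Pack the sign bits of a vector into one int (bit i set if vec[i] > 0)."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Number of differing bits between two packed bit vectors."""
    return bin(a ^ b).count("1")


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def search_binary_then_rescore(query, docs, shortlist=100, top_k=10):
    """Stage 1: rank by Hamming distance on binary codes (cheap, lossy).
    Stage 2: rescore only the shortlist with float dot products."""
    q_bits = binarize(query)
    # In a real index the document codes would be precomputed and stored.
    doc_bits = {doc_id: binarize(vec) for doc_id, vec in docs.items()}
    coarse = sorted(docs, key=lambda d: (hamming(q_bits, doc_bits[d]), d))
    coarse = coarse[:shortlist]
    rescored = sorted(coarse, key=lambda d: (-dot(query, docs[d]), d))
    return rescored[:top_k]


# Toy 4-dimensional embeddings, invented for illustration.
docs = {
    "a": [0.9, 0.1, -0.2, 0.3],
    "b": [0.1, 0.9, -0.1, 0.2],
    "c": [-0.5, -0.4, 0.8, -0.1],
}
query = [0.2, 0.8, -0.1, 0.1]

print(search_binary_then_rescore(query, docs, shortlist=3, top_k=3))
```

In this toy case documents "a" and "b" have identical sign patterns, so the binary stage cannot separate them; the float rescore is what moves "b" ahead of "a".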
