hotchpotch committed on
Commit 616eae1 · verified · 1 Parent(s): 84ff9f5

Compact leaderboard selection tiles

README.md CHANGED
@@ -141,21 +141,28 @@ The selected limits are written to result JSON under
141
  Sparse embedding metadata records `nnz_total`, `nnz_mean`, `nnz_median`,
142
  `nnz_max`, and `density` for queries and documents.
143
 
144
- To compare multiple sparsity limits from one full sparse model encode, use
145
- post-encode embedding variants:
 
 
 
 
 
 
146
 
147
  ```bash
148
  uv run hakari-bench evaluate sparse \
149
  --model naver/splade-v3 \
150
  --dataset NanoBEIR-en \
151
- --embedding-variant sparse-query-max-active-dims:8,16,32 \
152
- --embedding-variant sparse-document-max-active-dims:64,128,256 \
153
- --embedding-variant-grid sparse-query-max-active-dims:8,16,32 sparse-document-max-active-dims:64,128,256
154
  ```
155
 
156
  These variants keep the top absolute-value dimensions per query/document row
157
  and record each derived result under `evaluation.embedding_evaluations`, like
158
- dense `truncate:` variants.
 
 
159
 
160
  Sparse embeddings intentionally do not support quantized embedding variants in
161
  the CLI. Use post-encode sparse truncation variants for sparse footprint and
 
141
  Sparse embedding metadata records `nnz_total`, `nnz_mean`, `nnz_median`,
142
  `nnz_max`, and `density` for queries and documents.
143
 
144
+ Sparse evaluation automatically compares the default post-encode sparsity grid
145
+ from one full sparse model encode unless `--no-default-embedding-variants` is
146
+ set:
147
+
148
+ - query max active dims: `8,16,24,32`
149
+ - document max active dims: `64,128,256,512`
150
+
151
+ To add other sparsity limits, use post-encode embedding variants:
152
 
153
  ```bash
154
  uv run hakari-bench evaluate sparse \
155
  --model naver/splade-v3 \
156
  --dataset NanoBEIR-en \
157
+ --embedding-variant sparse-query-max-active-dims:48 \
158
+ --embedding-variant-grid sparse-query-max-active-dims:48 sparse-document-max-active-dims:768
 
159
  ```
160
 
161
  These variants keep the top absolute-value dimensions per query/document row
162
  and record each derived result under `evaluation.embedding_evaluations`, like
163
+ dense `truncate:` variants. Use `--no-default-embedding-variants` when a sparse
164
+ run should write only the base no-limit result or only explicitly requested
165
+ sparse variants.
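
The "keep the top absolute-value dimensions per row" rule described above can be illustrated with a minimal sketch. It assumes a sparse row is represented as a `{dimension: weight}` dict; `keep_top_abs_dims` is a hypothetical helper written for this example and is not part of the `hakari_bench` API, whose internal representation may differ.

```python
def keep_top_abs_dims(row: dict[int, float], max_active_dims: int) -> dict[int, float]:
    """Keep at most `max_active_dims` entries of one sparse row, ranked by |weight|.

    Hypothetical helper for illustration only; not taken from hakari_bench.
    """
    if max_active_dims <= 0:
        return {}
    # Sort by descending absolute weight; ties fall back to the dimension index
    # so the result is deterministic.
    ranked = sorted(row.items(), key=lambda item: (-abs(item[1]), item[0]))
    return dict(ranked[:max_active_dims])


# One query row with four active dimensions, limited to two.
query_row = {3: 0.2, 17: -1.5, 42: 0.9, 99: 0.05}
truncated = keep_top_abs_dims(query_row, 2)
```

Because the limit is applied to already-encoded rows, each limit in the grid can be derived from the same full encode without running the model again.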
166
 
167
  Sparse embeddings intentionally do not support quantized embedding variants in
168
  the CLI. Use post-encode sparse truncation variants for sparse footprint and
config/viewer/overall.yaml CHANGED
@@ -40,10 +40,10 @@ overalls:
40
  - name: Core
41
  label: Core
42
  benchmarks:
43
- - MNanoBEIR
 
44
  - NanoMMTEB-v2
45
  - NanoRTEB
46
- - NanoMIRACL
47
  - NanoMLDR
48
  - NanoBRIGHT
49
  - NanoLaw
 
40
  - name: Core
41
  label: Core
42
  benchmarks:
43
+ - name: MNanoBEIR
44
+ group_by: task_name
45
  - NanoMMTEB-v2
46
  - NanoRTEB
 
47
  - NanoMLDR
48
  - NanoBRIGHT
49
  - NanoLaw
docs/benchmark_evaluation.md CHANGED
@@ -222,32 +222,38 @@ uv run hakari-bench evaluate sparse \
222
  ```
223
 
224
  Do not add dense quantized embedding variants for sparse/SPLADE-style models.
225
- Sparse quantization is intentionally unsupported in the CLI. Use post-encode
226
- sparse truncation variants for footprint and latency trade-offs.
 
227
 
228
- Query-only sparsity limits:
 
 
 
 
 
 
229
 
230
  ```bash
231
  uv run hakari-bench evaluate sparse \
232
  --model MODEL_NAME \
233
  --dataset DATASET_NAME \
234
- --embedding-variant sparse-query-max-active-dims:8,16,32
235
  ```
236
 
237
- Query/document grids:
238
 
239
  ```bash
240
  uv run hakari-bench evaluate sparse \
241
  --model MODEL_NAME \
242
  --dataset DATASET_NAME \
243
- --embedding-variant sparse-query-max-active-dims:8,16,32 \
244
- --embedding-variant sparse-document-max-active-dims:64,128,256 \
245
- --embedding-variant-grid sparse-query-max-active-dims:8,16,32 sparse-document-max-active-dims:64,128,256
246
  ```
247
 
248
- If only the full query x document grid is requested, the standalone query-only
249
- and document-only variants may be omitted. The base no-limit result is always
250
- included as `evaluation.embedding_evaluations[0]`.
 
251
 
252
  ## Late-Interaction, Reranker, And BM25
253
 
 
222
  ```
223
 
224
  Do not add dense quantized embedding variants for sparse/SPLADE-style models.
225
+ Sparse quantization is intentionally unsupported in the CLI. Sparse runs
226
+ automatically include post-encode query/document max-active-dims grid variants
227
+ unless `--no-default-embedding-variants` is set:
228
 
229
+ - query max active dims: `8,16,24,32`
230
+ - document max active dims: `64,128,256,512`
231
+
232
+ These variants are derived after one full sparse model encode and do not run
233
+ additional model inference.
234
+
235
+ Additional query-only sparsity limits:
236
 
237
  ```bash
238
  uv run hakari-bench evaluate sparse \
239
  --model MODEL_NAME \
240
  --dataset DATASET_NAME \
241
+ --embedding-variant sparse-query-max-active-dims:48
242
  ```
243
 
244
+ Additional query/document grids:
245
 
246
  ```bash
247
  uv run hakari-bench evaluate sparse \
248
  --model MODEL_NAME \
249
  --dataset DATASET_NAME \
250
+ --embedding-variant-grid sparse-query-max-active-dims:48 sparse-document-max-active-dims:768
 
 
251
  ```
252
 
253
+ The base no-limit result is always included as
254
+ `evaluation.embedding_evaluations[0]`. Use `--no-default-embedding-variants`
255
+ when intentionally running only the base no-limit result or only explicitly
256
+ specified sparse variants.
257
 
258
  ## Late-Interaction, Reranker, And BM25
259
 
docs/core_set_selection.md CHANGED
@@ -12,13 +12,12 @@ set is:
12
  1. `MNanoBEIR`
13
  2. `NanoMMTEB-v2`
14
  3. `NanoRTEB`
15
- 4. `NanoMIRACL`
16
- 5. `NanoMLDR`
17
- 6. `NanoBRIGHT`
18
- 7. `NanoLaw`
19
- 8. `NanoCoIR`
20
 
21
- This document records why these eight Nano sets were selected. The decision was
22
  made by combining external adoption signals, source benchmark quality, task and
23
  language diversity, overlap analysis, lexical baseline difficulty, and actual
24
  dense-model score dispersion from the evaluated DuckDB warehouse. The goal was
@@ -26,18 +25,24 @@ not to maximize task count. It was to keep a compact set whose aggregate score
26
  is interpretable, broad, and difficult to game by over-weighting one benchmark
27
  family.
28
 
 
 
 
 
 
 
 
29
  ## Final Core Set
30
 
31
  | Position | Nano set | Role in Core | Main reason for inclusion |
32
  | ---: | --- | --- | --- |
33
- | 1 | `MNanoBEIR` | Classical multilingual IR anchor | BEIR-style retrieval remains a common reference point, and the multilingual Nano version gives broad language and task coverage. |
34
  | 2 | `NanoMMTEB-v2` | Broad multilingual MTEB/MMTEB anchor | Represents modern MTEB-style retrieval coverage across many task types and languages. |
35
  | 3 | `NanoRTEB` | Practical retrieval domains | Adds English RTEB-style applied retrieval tasks with strong model separation. |
36
- | 4 | `NanoMIRACL` | Canonical multilingual IR | Keeps MIRACL as an explicit multilingual retrieval anchor even though current dense models show low score dispersion on many splits. |
37
- | 5 | `NanoMLDR` | Multilingual long-document retrieval | Strong external adoption through BGE-M3/MLDR and excellent dense score dispersion across all languages. |
38
- | 6 | `NanoBRIGHT` | Reasoning-heavy retrieval stress test | Hard tasks with high model separation and strong dataset usage signals. |
39
- | 7 | `NanoLaw` | Legal-domain retrieval | A multilingual, multi-source legal retrieval group whose tasks are registered in MTEB and better supported than `NanoBIRCO` as a Core domain representative. |
40
- | 8 | `NanoCoIR` | Code retrieval | Preserves a code-search dimension that is not captured by legal, long-document, or general IR tasks. |
41
 
42
  ## Pruned or Not Promoted Sets
43
 
@@ -47,6 +52,7 @@ the `All` view.
47
 
48
  | Nano set | Decision | Reason |
49
  | --- | --- | --- |
 
50
  | `NanoLongEmbed` | Removed from the earlier Core proposal | Dense dispersion was good, but the set contains synthetic long-context probes such as passkey/needle-style tasks and has weaker external adoption than `NanoMLDR`. `NanoMLDR` gives a cleaner multilingual long-document retrieval signal. |
51
  | `NanoBIRCO` | Replaced by `NanoLaw` | `NanoBIRCO` is valuable as a complex-objective stress test, but it is small, English-only, and has weaker paper and dataset adoption signals. `NanoLaw` provides a better Core domain slot. |
52
  | `NanoDAPFAM` | Not promoted | Patent retrieval is distinctive, but dense model dispersion was very low and many tasks were floor-like. Better suited to a domain appendix. |
@@ -69,15 +75,15 @@ The Core set was chosen using five criteria.
69
 
70
  2. Task diversity
71
 
72
- The final set covers classical IR, broad MTEB/MMTEB retrieval, RTEB-style
73
- applied retrieval, MIRACL, multilingual long-document retrieval, hard
74
  reasoning retrieval, legal retrieval, and code retrieval.
75
 
76
  3. Language diversity
77
 
78
  The set contains broad multilingual groups (`MNanoBEIR`, `NanoMMTEB-v2`,
79
- `NanoMIRACL`, `NanoMLDR`) while avoiding a Core made mostly of
80
- language-specific MTEB-family views.
81
 
82
  4. Low redundancy
83
 
@@ -112,12 +118,11 @@ Definitions:
112
  - `low-var`: tasks with std <= 0.03.
113
  - `healthy`: tasks with 0.25 < mean < 0.85 and std >= 0.05.
114
 
115
- | Nano set | Effective tasks | avg_mean | avg_std | p90-p10 | ceiling | floor | low-var | healthy |
116
- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
117
- | `MNanoBEIR` | 182 | 0.5521 | 0.0476 | 0.1042 | 2 | 1 | 16 | 73 |
118
  | `NanoMMTEB-v2` proxy | 18 | 0.5434 | 0.0572 | 0.1206 | 5 | 2 | 5 | 9 |
119
  | `NanoRTEB` | 14 | 0.5954 | 0.0960 | 0.2203 | 0 | 0 | 0 | 11 |
120
- | `NanoMIRACL` | 18 | 0.7880 | 0.0280 | 0.0597 | 1 | 0 | 12 | 1 |
121
  | `NanoMLDR` | 13 | 0.5399 | 0.0844 | 0.1918 | 0 | 0 | 0 | 13 |
122
  | `NanoBRIGHT` | 20 | 0.3289 | 0.1021 | 0.2436 | 0 | 2 | 0 | 14 |
123
  | `NanoLaw` after Core overlap exclusions | 4 | 0.5634 | 0.0686 | 0.1516 | 0 | 0 | 0 | 4 |
@@ -133,9 +138,9 @@ This table explains several choices:
133
  were healthy and because its external adoption signals are stronger.
134
  - `NanoBRIGHT` and `NanoRTEB` were retained because they show high model
135
  separation and few saturation artifacts.
136
- - `NanoMIRACL` was retained despite low dispersion. Its purpose in Core is not
137
- to maximize ranking separation, but to keep a widely recognized multilingual
138
- IR anchor.
139
  - `NanoLaw` was selected over `NanoBIRCO` after comparing domain coverage,
140
  MTEB registration, citations, and effective Core overlap.
141
 
@@ -143,6 +148,7 @@ This table explains several choices:
143
 
144
  | Nano set | Effective tasks | avg_mean | avg_std | p90-p10 | ceiling | floor | low-var | healthy | Interpretation |
145
  | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- |
 
146
  | `NanoLongEmbed` | 6 | 0.6265 | 0.0911 | 0.2049 | 0 | 0 | 0 | 3 | Good dispersion, but weaker external signal and more synthetic long-context overlap than `NanoMLDR`. |
147
  | `NanoBIRCO` | 5 | 0.2890 | 0.0618 | 0.1182 | 0 | 1 | 1 | 3 | Valuable hard benchmark, but smaller, English-only, and weaker external signal than `NanoLaw`. |
148
  | `NanoDAPFAM` | 18 | 0.2870 | 0.0322 | 0.0754 | 0 | 6 | 8 | 0 | Too low-variance for Core, despite being domain-distinct. |
@@ -251,13 +257,17 @@ sets. `NanoBRIGHT` provides hard reasoning-heavy retrieval where BM25 is weak.
251
  but not sufficient. `NanoCoIR` keeps a code retrieval axis whose failure modes
252
  are different again.
253
 
254
- ## Overlap Policy
 
 
 
 
 
255
 
256
- The Core set uses raw task rows, but some benchmark configurations define
257
- excluded tasks to prevent duplicate source tasks from being counted twice in
258
- benchmark-specific views. For `NanoLaw`, the following tasks overlap with
259
- `NanoRTEB` or `NanoMMTEB-v2` and are excluded by the viewer configuration when
260
- appropriate:
261
 
262
  - `NanoAILACasedocs`
263
  - `NanoAILAStatutes`
@@ -282,8 +292,8 @@ This selection should be revisited when one of the following changes:
282
  - A new domain benchmark achieves both strong external adoption and strong model
283
  separation.
284
  - MTEB or MMTEB significantly changes the registered task catalog.
285
- - Saturation increases on `NanoMIRACL`, `NanoCoIR`, or `NanoMMTEB-v2` enough to
286
- reduce their usefulness as Core components.
287
 
288
  The Core set is not intended to replace the full `All` view. It is a compact
289
  summary. Domain and language-specific diagnosis should still use `All`,
 
12
  1. `MNanoBEIR`
13
  2. `NanoMMTEB-v2`
14
  3. `NanoRTEB`
15
+ 4. `NanoMLDR`
16
+ 5. `NanoBRIGHT`
17
+ 6. `NanoLaw`
18
+ 7. `NanoCoIR`
 
19
 
20
+ This document records why these seven Nano sets were selected. The decision was
21
  made by combining external adoption signals, source benchmark quality, task and
22
  language diversity, overlap analysis, lexical baseline difficulty, and actual
23
  dense-model score dispersion from the evaluated DuckDB warehouse. The goal was
 
25
  is interpretable, broad, and difficult to game by over-weighting one benchmark
26
  family.
27
 
28
+ The Core score also uses configured aggregation units rather than blindly
29
+ averaging every raw task row. In particular, `MNanoBEIR` is aggregated by
30
+ `task_name`: an ArguAna-style task is first averaged across its language
31
+ variants and then contributes as one Core scoring unit. This preserves the
32
+ multilingual BEIR anchor without allowing the raw language x task matrix to
33
+ dominate the Core aggregate.
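
As a rough illustration of the grouping described above, the sketch below averages language variants of each task into one scoring unit. The `(task_name, language, score)` row shape and the function name are assumptions made for this example; the viewer's actual implementation reads `group_by` from `overall.yaml` and may differ in detail.

```python
from collections import defaultdict
from statistics import mean


def group_scores_by_task_name(rows):
    """Average raw (task_name, language, score) rows into one unit per task_name.

    Illustrative only: the row shape is an assumption, not the viewer's schema.
    """
    scores_by_task = defaultdict(list)
    for task_name, _language, score in rows:
        scores_by_task[task_name].append(score)
    return {task_name: mean(scores) for task_name, scores in scores_by_task.items()}


raw_rows = [
    ("ArguAna", "en", 0.75),
    ("ArguAna", "ja", 0.25),
    ("SciFact", "en", 0.625),
]
core_units = group_scores_by_task_name(raw_rows)
```

Here the two ArguAna language rows collapse into a single unit, so adding more languages changes that unit's value but not the number of units it contributes.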
34
+
35
  ## Final Core Set
36
 
37
  | Position | Nano set | Role in Core | Main reason for inclusion |
38
  | ---: | --- | --- | --- |
39
+ | 1 | `MNanoBEIR` | Classical multilingual IR anchor | BEIR-style retrieval remains a common reference point; Core aggregates it by source task name so multilingual coverage does not dominate by raw row count. |
40
  | 2 | `NanoMMTEB-v2` | Broad multilingual MTEB/MMTEB anchor | Represents modern MTEB-style retrieval coverage across many task types and languages. |
41
  | 3 | `NanoRTEB` | Practical retrieval domains | Adds English RTEB-style applied retrieval tasks with strong model separation. |
42
+ | 4 | `NanoMLDR` | Multilingual long-document retrieval | Strong external adoption through BGE-M3/MLDR and excellent dense score dispersion across all languages. |
43
+ | 5 | `NanoBRIGHT` | Reasoning-heavy retrieval stress test | Hard tasks with high model separation and strong dataset usage signals. |
44
+ | 6 | `NanoLaw` | Legal-domain retrieval | A multilingual, multi-source legal retrieval group whose tasks are registered in MTEB and better supported than `NanoBIRCO` as a Core domain representative. |
45
+ | 7 | `NanoCoIR` | Code retrieval | Preserves a code-search dimension that is not captured by legal, long-document, or general IR tasks. |
 
46
 
47
  ## Pruned or Not Promoted Sets
48
 
 
52
 
53
  | Nano set | Decision | Reason |
54
  | --- | --- | --- |
55
+ | `NanoMIRACL` | Removed from Core after review | MIRACL remains a canonical multilingual benchmark, but the analyzed dense results showed substantial saturation and low model separation. Its role is better served by the `All` and benchmark-specific views than by the compact Core score. |
56
  | `NanoLongEmbed` | Removed from the earlier Core proposal | Dense dispersion was good, but the set contains synthetic long-context probes such as passkey/needle-style tasks and has weaker external adoption than `NanoMLDR`. `NanoMLDR` gives a cleaner multilingual long-document retrieval signal. |
57
  | `NanoBIRCO` | Replaced by `NanoLaw` | `NanoBIRCO` is valuable as a complex-objective stress test, but it is small, English-only, and has weaker paper and dataset adoption signals. `NanoLaw` provides a better Core domain slot. |
58
  | `NanoDAPFAM` | Not promoted | Patent retrieval is distinctive, but dense model dispersion was very low and many tasks were floor-like. Better suited to a domain appendix. |
 
75
 
76
  2. Task diversity
77
 
78
+ The final set covers classical multilingual IR, broad MTEB/MMTEB retrieval,
79
+ RTEB-style applied retrieval, multilingual long-document retrieval, hard
80
  reasoning retrieval, legal retrieval, and code retrieval.
81
 
82
  3. Language diversity
83
 
84
  The set contains broad multilingual groups (`MNanoBEIR`, `NanoMMTEB-v2`,
85
+ `NanoMLDR`) while avoiding a Core made mostly of language-specific
86
+ MTEB-family views.
87
 
88
  4. Low redundancy
89
 
 
118
  - `low-var`: tasks with std <= 0.03.
119
  - `healthy`: tasks with 0.25 < mean < 0.85 and std >= 0.05.
120
 
121
+ | Nano set | Dense analysis task rows | avg_mean | avg_std | p90-p10 | ceiling | floor | low-var | healthy |
122
+ | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
123
+ | `MNanoBEIR` | 182 raw, grouped by `task_name` in Core | 0.5521 | 0.0476 | 0.1042 | 2 | 1 | 16 | 73 |
124
  | `NanoMMTEB-v2` proxy | 18 | 0.5434 | 0.0572 | 0.1206 | 5 | 2 | 5 | 9 |
125
  | `NanoRTEB` | 14 | 0.5954 | 0.0960 | 0.2203 | 0 | 0 | 0 | 11 |
 
126
  | `NanoMLDR` | 13 | 0.5399 | 0.0844 | 0.1918 | 0 | 0 | 0 | 13 |
127
  | `NanoBRIGHT` | 20 | 0.3289 | 0.1021 | 0.2436 | 0 | 2 | 0 | 14 |
128
  | `NanoLaw` after Core overlap exclusions | 4 | 0.5634 | 0.0686 | 0.1516 | 0 | 0 | 0 | 4 |
 
138
  were healthy and because its external adoption signals are stronger.
139
  - `NanoBRIGHT` and `NanoRTEB` were retained because they show high model
140
  separation and few saturation artifacts.
141
+ - `NanoMIRACL` was removed from Core because its recognition as a multilingual
142
+ benchmark did not offset the low dense-model dispersion observed in this
143
+ result warehouse.
144
  - `NanoLaw` was selected over `NanoBIRCO` after comparing domain coverage,
145
  MTEB registration, citations, and effective Core overlap.
146
 
 
148
 
149
  | Nano set | Effective tasks | avg_mean | avg_std | p90-p10 | ceiling | floor | low-var | healthy | Interpretation |
150
  | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- |
151
+ | `NanoMIRACL` | 18 | 0.7880 | 0.0280 | 0.0597 | 1 | 0 | 12 | 1 | Canonical multilingual benchmark, but too saturated and low-variance for the compact Core score. |
152
  | `NanoLongEmbed` | 6 | 0.6265 | 0.0911 | 0.2049 | 0 | 0 | 0 | 3 | Good dispersion, but weaker external signal and more synthetic long-context overlap than `NanoMLDR`. |
153
  | `NanoBIRCO` | 5 | 0.2890 | 0.0618 | 0.1182 | 0 | 1 | 1 | 3 | Valuable hard benchmark, but smaller, English-only, and weaker external signal than `NanoLaw`. |
154
  | `NanoDAPFAM` | 18 | 0.2870 | 0.0322 | 0.0754 | 0 | 6 | 8 | 0 | Too low-variance for Core, despite being domain-distinct. |
 
257
  but not sufficient. `NanoCoIR` keeps a code retrieval axis whose failure modes
258
  are different again.
259
 
260
+ ## Aggregation and Overlap Policy
261
+
262
+ Core normally uses one scoring unit per raw task row, except for explicitly
263
+ configured grouped components. The important exception is `MNanoBEIR`, where
264
+ Core uses `group_by: task_name` so that each BEIR source task contributes once
265
+ after averaging across language variants.
266
 
267
+ Some benchmark configurations also define excluded tasks to prevent duplicate
268
+ source tasks from being counted twice in benchmark-specific views. For
269
+ `NanoLaw`, the following tasks overlap with `NanoRTEB` or `NanoMMTEB-v2` and
270
+ are excluded by the viewer configuration when appropriate:
 
271
 
272
  - `NanoAILACasedocs`
273
  - `NanoAILAStatutes`
 
292
  - A new domain benchmark achieves both strong external adoption and strong model
293
  separation.
294
  - MTEB or MMTEB significantly changes the registered task catalog.
295
+ - Saturation increases on `NanoCoIR` or `NanoMMTEB-v2` enough to reduce their
296
+ usefulness as Core components.
297
 
298
  The Core set is not intended to replace the full `All` view. It is a compact
299
  summary. Domain and language-specific diagnosis should still use `All`,
docs/duckdb_schema.md CHANGED
@@ -726,10 +726,10 @@ Only models that have every expected task in the selected view are ranked.
726
  3. Build the expected task set from the remaining rows.
727
  4. Keep only models whose task-key set exactly matches the expected task set.
728
 
729
- For grouped overall views such as `Group`, the viewer first checks raw
730
- task completeness within each model/benchmark pair, then aggregates rows by the
731
- configured group key, and finally applies the complete model rule again to the
732
- aggregated task set.
733
 
734
  ### Benchmark and Overall Views
735
 
@@ -740,20 +740,22 @@ For overall views:
740
 
741
  - `All`: all configured benchmark views using raw task rows.
742
  - `Core`: a compact curated set covering broad multilingual retrieval,
743
- multilingual BEIR, English RTEB domains, MIRACL, multilingual long-document
744
  retrieval, reasoning-heavy retrieval, legal retrieval, and code retrieval
745
- (`MNanoBEIR`, `NanoMMTEB-v2`, `NanoRTEB`, `NanoMIRACL`, `NanoMLDR`,
746
- `NanoBRIGHT`, `NanoLaw`, and `NanoCoIR`).
 
747
  - `Group`: all configured benchmark views aggregated by each component's
748
  `group_by` setting before ranking.
749
  - `micro_mean`: mean over all included tasks with equal task weight.
750
  - `macro_mean`: mean of benchmark-level means with equal benchmark weight.
751
  - `mean_score`: `macro_mean` for overall views, task mean for benchmark views.
752
 
753
- `All` and `Core` use raw `task_key` values. `Group` uses the `group_by` settings
754
- from `overall.yaml` to average tasks into benchmark-local units before
755
- computing Borda and means. For task x language collections such as `MNanoBEIR`,
756
- `Group` uses the underlying task name (`task_name`) as the grouped unit.
 
757
 
758
  Grouped overall views also expose the aggregated benchmark-local units as
759
  metric columns. These columns use the aggregated `task_key` values, such as
@@ -898,11 +900,12 @@ choices:
898
  `NanoBEIR-ja` -> `ja` or the NanoMIRACL split code, rather than expanding
899
  every code in `languages`.
900
  - Viewer benchmark groups put the compact curated Core set under
901
- Core benchmarks. Other broader multilingual/domain suites remain
902
- Domain-specific unless they are an official language-specific NanoMTEB family
903
- such as `NanoJMTEB-v2`, `NanoFaMTEB-v2`, `NanoRuMTEB`, `NanoVNMTEB`, or
904
- `NanoCMTEB`. `NanoIndicQA`, `NanoMuPLeR`, and `NanoChemTEB` remain
905
- Domain-specific by viewer policy even when they expose language pages.
 
906
 
907
  The viewer logs timing records through the `hakari_bench.viewer` logger:
908
 
@@ -952,8 +955,9 @@ materialized by a custom build. It is keyed by `view_name`, `score_target`, and
952
  the four display flags
953
  `include_quantization_variants`, `include_truncate_variants`,
954
  `include_rescore_variants`, and `include_other_variants`. The viewer uses this
955
- mart when language filters, task-score columns, and task text filters are not
956
- active. Those interactive cases still fall back to the normal
 
957
  `LeaderboardService` computation from task-score rows.
958
 
959
  | column | type | meaning |
@@ -1233,9 +1237,9 @@ SELECT
1233
 
1234
  ### 3. Raw Overall View Leaderboard
1235
 
1236
- For an overall view that uses raw tasks, such as `All` or `Core`, put the
1237
- overall benchmarks into
1238
- `selected_benchmarks` and replace `model_agg` with this version:
1239
 
1240
  ```sql
1241
  benchmark_means AS (
@@ -1290,11 +1294,11 @@ model_agg AS (
1290
  The final result should return both `macro_mean` and `micro_mean`. `mean_rank`
1291
  for overall views ranks `mean_score`, which is `macro_mean`.
1292
 
1293
- ### 4. Group Leaderboard
1294
 
1295
- `Group` first averages raw tasks into benchmark-local groups, then computes
1296
- Borda, means, and per-group metric columns. Generate `overall_components` from
1297
- `config/viewer/overall.yaml`.
1298
 
1299
  ```sql
1300
  WITH
 
726
  3. Build the expected task set from the remaining rows.
727
  4. Keep only models whose task-key set exactly matches the expected task set.
728
 
729
+ For overall views with configured grouped components, such as `Core` and
730
+ `Group`, the viewer first checks raw task completeness within each
731
+ model/benchmark pair, then aggregates rows by the configured group key, and
732
+ finally applies the complete model rule again to the aggregated task set.
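
The complete model rule in step 4 amounts to an exact set comparison. A minimal sketch, assuming the expected task set has already been built by the earlier steps and that each model's task keys are available as a collection (the function and variable names are hypothetical):

```python
def complete_models(task_keys_by_model, expected_task_keys):
    """Return models whose task-key set exactly matches the expected set.

    Illustrative only; the viewer applies this rule to rows from its own tables.
    """
    expected = set(expected_task_keys)
    return sorted(
        model
        for model, task_keys in task_keys_by_model.items()
        if set(task_keys) == expected
    )


expected = {"task_a", "task_b"}
observed = {
    "model-complete": ["task_a", "task_b"],
    "model-missing": ["task_a"],
    "model-extra": ["task_a", "task_b", "task_c"],
}
ranked_models = complete_models(observed, expected)
```

For grouped views the same check would run twice: once on raw task keys per model/benchmark pair and once on the aggregated keys.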
733
 
734
  ### Benchmark and Overall Views
735
 
 
740
 
741
  - `All`: all configured benchmark views using raw task rows.
742
  - `Core`: a compact curated set covering broad multilingual retrieval,
743
+ multilingual BEIR, English RTEB domains, multilingual long-document
744
  retrieval, reasoning-heavy retrieval, legal retrieval, and code retrieval
745
+ (`MNanoBEIR`, `NanoMMTEB-v2`, `NanoRTEB`, `NanoMLDR`, `NanoBRIGHT`,
746
+ `NanoLaw`, and `NanoCoIR`). `MNanoBEIR` is grouped by `task_name` in Core so
747
+ each BEIR source task contributes once after averaging language variants.
748
  - `Group`: all configured benchmark views aggregated by each component's
749
  `group_by` setting before ranking.
750
  - `micro_mean`: mean over all included tasks with equal task weight.
751
  - `macro_mean`: mean of benchmark-level means with equal benchmark weight.
752
  - `mean_score`: `macro_mean` for overall views, task mean for benchmark views.
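
The difference between `micro_mean` and `macro_mean` can be seen in a small sketch. The `{benchmark: [task scores]}` input shape is an assumption for this example, not the viewer's data model.

```python
from statistics import mean


def micro_and_macro_mean(scores_by_benchmark):
    """Return (micro_mean, macro_mean) for a {benchmark: [task scores]} mapping."""
    # micro: every included task carries equal weight.
    all_scores = [s for scores in scores_by_benchmark.values() for s in scores]
    micro = mean(all_scores)
    # macro: every benchmark carries equal weight, whatever its task count.
    macro = mean([mean(scores) for scores in scores_by_benchmark.values()])
    return micro, macro


scores = {
    "large_benchmark": [0.25, 0.25, 0.25],
    "small_benchmark": [1.0],
}
micro, macro = micro_and_macro_mean(scores)
```

With these inputs the micro mean is pulled toward the three-task benchmark, while the macro mean sits halfway between the two benchmark means.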
753
 
754
+ `All` uses raw `task_key` values. `Core` and `Group` use any component-level
755
+ `group_by` settings from `overall.yaml` to average tasks into benchmark-local
756
+ units before computing Borda and means. For task x language collections such as
757
+ `MNanoBEIR`, Core and Group use the underlying task name (`task_name`) as the
758
+ grouped unit.
759
 
760
  Grouped overall views also expose the aggregated benchmark-local units as
761
  metric columns. These columns use the aggregated `task_key` values, such as
 
900
  `NanoBEIR-ja` -> `ja` or the NanoMIRACL split code, rather than expanding
901
  every code in `languages`.
902
  - Viewer benchmark groups put the compact curated Core set under
903
+ Core benchmarks. Other broader multilingual/domain suites, including
904
+ `NanoMIRACL`, remain Domain-specific unless they are an official
905
+ language-specific NanoMTEB family such as `NanoJMTEB-v2`, `NanoFaMTEB-v2`,
906
+ `NanoRuMTEB`, `NanoVNMTEB`, or `NanoCMTEB`. `NanoIndicQA`, `NanoMuPLeR`, and
907
+ `NanoChemTEB` remain Domain-specific by viewer policy even when they expose
908
+ language pages.
909
 
910
  The viewer logs timing records through the `hakari_bench.viewer` logger:
911
 
 
955
  the four display flags
956
  `include_quantization_variants`, `include_truncate_variants`,
957
  `include_rescore_variants`, and `include_other_variants`. The viewer uses this
958
+ mart when language filters, task-score columns, task text filters, and
959
+ component-level overall grouping are not active. Those interactive and grouped
960
+ cases still fall back to the normal
961
  `LeaderboardService` computation from task-score rows.
962
 
963
  | column | type | meaning |
 
1237
 
1238
  ### 3. Raw Overall View Leaderboard
1239
 
1240
+ For an overall view that uses raw tasks, such as `All`, put the overall
1241
+ benchmarks into `selected_benchmarks` and replace `model_agg` with this
1242
+ version:
1243
 
1244
  ```sql
1245
  benchmark_means AS (
 
1294
  The final result should return both `macro_mean` and `micro_mean`. `mean_rank`
1295
  for overall views ranks `mean_score`, which is `macro_mean`.
1296
 
1297
+ ### 4. Grouped Overall View Leaderboard
1298
 
1299
+ Grouped overall views such as `Core` and `Group` first average raw tasks into
1300
+ benchmark-local groups, then compute Borda, means, and per-group metric
1301
+ columns. Generate `overall_components` from `config/viewer/overall.yaml`.
1302
 
1303
  ```sql
1304
  WITH
hakari_bench/cli.py CHANGED
@@ -24,6 +24,7 @@ from hakari_bench.embedding_variants import (
24
  TORCH_SCORE_REPRESENTATION,
25
  dense_embedding_variants,
26
  parse_embedding_variants,
 
27
  )
28
  from hakari_bench.evaluation import LoadedIrDataset, load_ir_dataset, start_encode_pool, stop_encode_pool
29
  from hakari_bench.model_cards import load_model_cards, write_evaluation_model_card
@@ -223,7 +224,8 @@ def _add_embedding_variant_args(parser: argparse.ArgumentParser) -> None:
223
  "sparse-document-max-active-dims:DIM, normalize, int8, binary, "
224
  "rescore:int8, rescore:binary, int8-rescore, or binary-rescore. "
225
  "Dense runs automatically include full-dim quantized/rescore variants; "
226
- "explicit truncate:DIM also expands to truncate x quantized/rescore variants."
 
227
  ),
228
  )
229
  parser.add_argument(
@@ -243,7 +245,8 @@ def _add_embedding_variant_args(parser: argparse.ArgumentParser) -> None:
243
  action="store_true",
244
  help=(
245
  "Disable automatic dense int8/binary quantized and top-100 rescore variants, "
246
- "including truncate x quantized/rescore expansion."
 
247
  ),
248
  )
249
 
@@ -381,6 +384,12 @@ def parse_args(argv: list[str] | None = None) -> argparse.Namespace:
381
  args.embedding_variant_grid_values,
382
  include_defaults=not args.no_default_embedding_variants,
383
  )
 
 
 
 
 
 
384
  else:
385
  args.embedding_variants = parse_embedding_variants(
386
  args.embedding_variant_values,
 
24
  TORCH_SCORE_REPRESENTATION,
25
  dense_embedding_variants,
26
  parse_embedding_variants,
27
+ sparse_embedding_variants,
28
  )
29
  from hakari_bench.evaluation import LoadedIrDataset, load_ir_dataset, start_encode_pool, stop_encode_pool
30
  from hakari_bench.model_cards import load_model_cards, write_evaluation_model_card
 
224
  "sparse-document-max-active-dims:DIM, normalize, int8, binary, "
225
  "rescore:int8, rescore:binary, int8-rescore, or binary-rescore. "
226
  "Dense runs automatically include full-dim quantized/rescore variants; "
227
+ "explicit truncate:DIM also expands to truncate x quantized/rescore variants. "
228
+ "Sparse runs automatically include query/document max-active-dims grid variants."
229
  ),
230
  )
231
  parser.add_argument(
 
245
  action="store_true",
246
  help=(
247
  "Disable automatic dense int8/binary quantized and top-100 rescore variants, "
248
+ "including truncate x quantized/rescore expansion, and automatic sparse "
249
+ "query/document max-active-dims grid variants."
250
  ),
251
  )
252
 
 
384
  args.embedding_variant_grid_values,
385
  include_defaults=not args.no_default_embedding_variants,
386
  )
387
+ elif args.model_type == "sparse":
388
+ args.embedding_variants = sparse_embedding_variants(
389
+ args.embedding_variant_values,
390
+ args.embedding_variant_grid_values,
391
+ include_defaults=not args.no_default_embedding_variants,
392
+ )
393
  else:
394
  args.embedding_variants = parse_embedding_variants(
395
  args.embedding_variant_values,
hakari_bench/embedding_variants.py CHANGED
@@ -53,6 +53,18 @@ def default_dense_quantized_embedding_variants() -> list[dict[str, Any]]:
     return parse_embedding_variants(["int8,binary", "rescore:int8,binary"])
 
 
+def default_sparse_truncation_embedding_variants() -> list[dict[str, Any]]:
+    return parse_embedding_variants(
+        None,
+        [
+            [
+                "sparse-query-max-active-dims:8,16,24,32",
+                "sparse-document-max-active-dims:64,128,256,512",
+            ]
+        ],
+    )
+
+
 def dense_embedding_variants(
     values: list[str] | None,
     cross_values: list[list[str]] | None = None,
@@ -73,6 +85,18 @@ def dense_embedding_variants(
     return _dedupe_variants([*variants, *auto_variants])
 
 
+def sparse_embedding_variants(
+    values: list[str] | None,
+    cross_values: list[list[str]] | None = None,
+    *,
+    include_defaults: bool = True,
+) -> list[dict[str, Any]]:
+    variants = parse_embedding_variants(values, cross_values)
+    if not include_defaults:
+        return variants
+    return _dedupe_variants([*variants, *default_sparse_truncation_embedding_variants()])
+
+
 def _default_truncate_embedding_variants(dims: list[int]) -> list[dict[str, Any]]:
     if not dims:
         return []
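Note on `sparse_embedding_variants`: explicit variants come first, the default sparse grid is appended, and duplicates are dropped. The sketch below illustrates that merge under stated assumptions. It treats the grid as a Cartesian product of query and document limits, and it represents a variant as a small `dict` with made-up keys. `expand_grid`, `dedupe_variants`, and `merge_with_defaults` are hypothetical stand-ins, not the real `parse_embedding_variants` or `_dedupe_variants`.

```python
from itertools import product
from typing import Any


def expand_grid(query_dims: list[int], document_dims: list[int]) -> list[dict[str, Any]]:
    # Assumption: a "grid" pairs every query limit with every document limit.
    return [
        {"query_max_active_dims": q, "document_max_active_dims": d}
        for q, d in product(query_dims, document_dims)
    ]


def dedupe_variants(variants: list[dict[str, Any]]) -> list[dict[str, Any]]:
    # Keep the first occurrence of each variant, preserving order.
    seen: set[tuple] = set()
    unique: list[dict[str, Any]] = []
    for variant in variants:
        key = tuple(sorted(variant.items()))
        if key not in seen:
            seen.add(key)
            unique.append(variant)
    return unique


def merge_with_defaults(
    explicit: list[dict[str, Any]], *, include_defaults: bool = True
) -> list[dict[str, Any]]:
    if not include_defaults:
        return list(explicit)
    # Default limits copied from the diff above.
    defaults = expand_grid([8, 16, 24, 32], [64, 128, 256, 512])
    return dedupe_variants([*explicit, *defaults])


explicit = expand_grid([48, 32], [768, 512])
merged = merge_with_defaults(explicit)
print(len(explicit), len(merged))
```

In this sketch the explicit pair (32, 512) also appears in the default grid, so it is kept once, in its explicit position.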
hakari_bench/viewer/app.py CHANGED
@@ -713,11 +713,11 @@ def render_analysis_shell(*, view: str) -> str:
713
  <p class="text-sm text-zinc-600">Use these panels for paper-facing variant, reranking, and Nano subset audits.</p>
714
  </div>
715
  <div class="flex flex-wrap gap-2">
716
- <button type="button" class="inline-flex items-center gap-1.5 border border-zinc-300 px-3 py-1.5 text-sm text-zinc-800 hover:border-cyan-600 hover:text-cyan-700"
717
  hx-get="/analysis?{escape(variant_query, quote=True)}" hx-target="#analysis-panel" hx-swap="innerHTML">{_icon_svg("git-compare-arrows", class_name="hakari-icon action-icon shrink-0")}<span>Variant impact</span></button>
718
- <button type="button" class="inline-flex items-center gap-1.5 border border-zinc-300 px-3 py-1.5 text-sm text-zinc-800 hover:border-cyan-600 hover:text-cyan-700"
719
  hx-get="/analysis?{escape(rerank_query, quote=True)}" hx-target="#analysis-panel" hx-swap="innerHTML">{_icon_svg("arrow-down-up", class_name="hakari-icon action-icon shrink-0")}<span>Reranking diagnostics</span></button>
720
- <button type="button" class="inline-flex items-center gap-1.5 border border-zinc-300 px-3 py-1.5 text-sm text-zinc-800 hover:border-cyan-600 hover:text-cyan-700"
721
  hx-get="/analysis?{escape(dataset_query, quote=True)}" hx-target="#analysis-panel" hx-swap="innerHTML">{_icon_svg("database", class_name="hakari-icon action-icon shrink-0")}<span>Dataset diagnostics</span></button>
722
  </div>
723
  </div>
@@ -803,7 +803,7 @@ def render_tabs(
803
  doc = benchmark_docs.group_doc(view_name) if benchmark_docs is not None else None
804
  if doc is None:
805
  grouped_buttons[_view_group(view_name)].append(
806
- f"""<button type="button" class="border px-3 py-1.5 text-sm {classes}"
807
  hx-get="{_leaderboard_url(query)}" hx-push-url="{_page_url(query_payload)}"
808
  {_leaderboard_control_hx_attrs()}>
809
  {escape(view_label)}
@@ -812,8 +812,8 @@ def render_tabs(
812
  continue
813
  doc_trigger = _render_doc_summary_trigger(doc=doc, label=f"{view_label} overview")
814
  grouped_buttons[_view_group(view_name)].append(
815
- f"""<span class="doc-label-group inline-flex items-center border text-sm {classes}" data-doc-label-group="benchmark">
816
- <button type="button" class="py-1.5 pl-3 pr-0 text-left"
817
  hx-get="{_leaderboard_url(query)}" hx-push-url="{_page_url(query_payload)}"
818
  {_leaderboard_control_hx_attrs()}>
819
  {escape(view_label)}
@@ -891,7 +891,7 @@ def _render_target_group(*, result: LeaderboardResult, sort: str, direction: str
891
  query_payload["target"] = target
892
  query = urlencode(query_payload, doseq=True)
893
  buttons.append(
894
- f"""<button type="button" class="border px-3 py-1.5 text-sm {classes}"
895
  hx-get="{_leaderboard_url(query)}" hx-push-url="{_page_url(query_payload)}"
896
  {_leaderboard_control_hx_attrs()}>
897
  {escape(label)}
@@ -925,7 +925,6 @@ def _view_group(view_name: str) -> str:
925
  "MNanoBEIR",
926
  "NanoMMTEB-v2",
927
  "NanoRTEB",
928
- "NanoMIRACL",
929
  "NanoMLDR",
930
  "NanoBRIGHT",
931
  "NanoLaw",
@@ -981,7 +980,7 @@ def render_language_pages(
981
  )
982
  more = f"""
983
  <details class="relative">
984
- <summary class="cursor-pointer border border-zinc-300 bg-white px-3 py-1.5 text-sm text-zinc-700 hover:border-cyan-500 hover:text-cyan-700">More</summary>
985
  <div class="absolute z-10 mt-1 grid max-h-72 min-w-[28rem] grid-cols-3 gap-1 overflow-auto border border-zinc-300 bg-white p-2 shadow-sm sm:grid-cols-5">
986
  {more_buttons}
987
  </div>
@@ -989,7 +988,7 @@ def render_language_pages(
989
  """
990
  return f"""
991
  <nav class="mb-4 flex flex-wrap items-start gap-2" aria-label="Language pages">
992
- {_control_label(icon="languages", text="Language pages", extra_class="pt-1.5 text-sm")}
993
  {''.join(buttons)}
994
  {more}
995
  </nav>
@@ -1020,7 +1019,7 @@ def _language_page_button(
1020
  )
1021
  query = urlencode(query_payload, doseq=True)
1022
  data_attr = "" if option is None else f' data-language-page="{escape(option.code)}"'
1023
- return f"""<button type="button"{data_attr} class="border px-3 py-1.5 text-sm {classes}"
1024
  hx-get="{_leaderboard_url(query)}" hx-push-url="{_page_url(query_payload)}"
1025
  {_leaderboard_control_hx_attrs()}>{escape(label)}</button>"""
1026
 
@@ -1325,7 +1324,7 @@ def render_controls(
1325
  else ""
1326
  )
1327
  return f"""
1328
- <div class="mb-4 text-sm text-zinc-700">
1329
  <form id="column-controls" class="flex flex-wrap items-center gap-x-5 gap-y-2"
1330
  hx-get="/leaderboard" hx-push-url="true"
1331
  {_leaderboard_control_hx_attrs()}
@@ -1381,13 +1380,13 @@ def render_controls(
1381
  <label class="flex min-w-64 items-center gap-2">
1382
  <span class="shrink-0 whitespace-nowrap font-medium text-zinc-800">Model name</span>
1383
  <input id="model-filter-input" type="search" name="model_filter" value="{escape(filter_state.model_filter)}"
1384
- class="w-72 max-w-full border border-zinc-300 bg-white px-2 py-1 text-sm text-zinc-900 outline-none focus:border-cyan-700"
1385
  autocomplete="off">
1386
  </label>
1387
  <label class="flex min-w-64 items-center gap-2">
1388
  <span class="shrink-0 whitespace-nowrap font-medium text-zinc-800">Task name</span>
1389
  <input id="task-filter-input" type="search" name="task_filter" value="{escape(filter_state.task_filter)}"
1390
- class="w-72 max-w-full border border-zinc-300 bg-white px-2 py-1 text-sm text-zinc-900 outline-none focus:border-cyan-700"
1391
  autocomplete="off">
1392
  </label>
1393
  <label class="inline-flex items-center gap-2 pt-1">
@@ -1395,7 +1394,7 @@ def render_controls(
1395
  <span class="whitespace-nowrap font-medium text-zinc-800">Recalculate Borda, Mean</span>
1396
  {_render_help_tooltip("When enabled, Model name, Task name, and active facet filters narrow the ranking population before Borda and Mean are recomputed. With a Task name filter, Borda is computed from per-task ranks over the filtered tasks.")}
1397
  </label>
1398
- <button type="submit" class="border border-zinc-300 bg-zinc-50 px-3 py-1 text-sm font-medium text-zinc-800 hover:border-cyan-600 hover:text-cyan-700">
1399
  Apply
1400
  </button>
1401
  <div id="facet-filters" class="flex flex-wrap items-start gap-3">
@@ -1442,7 +1441,7 @@ def _text_filter_hidden_fields(filter_state: FilterState) -> list[tuple[str, str
1442
 
1443
  def _render_task_length_filter_inputs(filter_state: FilterState) -> str:
1444
  input_class = (
1445
- "w-24 border border-zinc-300 bg-white px-2 py-1 text-sm text-zinc-900 outline-none "
1446
  "focus:border-cyan-700"
1447
  )
1448
  active_classes = "border-cyan-700 bg-cyan-50" if filter_state.has_task_length_filters else "border-zinc-200 bg-zinc-50"
@@ -1451,7 +1450,7 @@ def _render_task_length_filter_inputs(filter_state: FilterState) -> str:
1451
  "Tasks missing length metadata are excluded when a bound is set."
1452
  )
1453
  return f"""
1454
- <fieldset class="flex flex-wrap items-center gap-2 border {active_classes} px-2 py-1.5">
1455
  <legend class="inline-flex items-center gap-1 px-1 text-xs font-semibold uppercase text-zinc-500">
1456
  <span>Task string length</span>
1457
  {_render_help_tooltip(tooltip)}
@@ -1497,7 +1496,7 @@ def render_score_groups(*, result: LeaderboardResult, sort: str, direction: str,
1497
  query = urlencode(query_payload, doseq=True)
1498
  page_url = _page_url(query_payload)
1499
  buttons.append(
1500
- f"""<button type="button" class="border px-3 py-1.5 text-sm {classes}"
1501
  hx-get="{_leaderboard_url(query)}" hx-push-url="{page_url}"
1502
  {_leaderboard_control_hx_attrs()}>
1503
  {escape(score_group.label)}
@@ -1888,7 +1887,7 @@ def _variant_analysis_toggle(
1888
  else "border-zinc-300 bg-white text-zinc-700 hover:border-cyan-600 hover:text-cyan-700"
1889
  )
1890
  return f"""
1891
- <button type="button" class="border px-3 py-1.5 text-sm {toggle_classes}"
1892
  hx-get="/analysis?{escape(toggle_query, quote=True)}"
1893
  hx-target="#analysis-panel" hx-swap="innerHTML">{escape(toggle_label)}</button>
1894
  """
@@ -2102,7 +2101,7 @@ def _render_filter_details(
2102
  for value, label in options:
2103
  checked = " checked" if value in selected_values else ""
2104
  checkboxes.append(
2105
- f"""<label class="flex min-w-0 items-center gap-2 whitespace-nowrap px-2 py-1">
2106
  <input type="checkbox" name="{escape(name)}" value="{escape(value)}" class="h-4 w-4 accent-cyan-700"{checked}>
2107
  <span>{escape(label)}</span>
2108
  </label>"""
@@ -2113,7 +2112,7 @@ def _render_filter_details(
2113
  none_page_url = _page_url(none_query)
2114
  return f"""
2115
  <details class="filter-detail border border-zinc-300 bg-white" data-filter-detail="{escape(name, quote=True)}" data-filter-icon="{escape(icon, quote=True)}">
2116
- <summary class="cursor-pointer px-2 py-1 font-medium text-zinc-800">
2117
  <span class="inline-flex items-center gap-1.5">
2118
  {_icon_svg(icon, class_name="hakari-icon filter-detail-icon shrink-0")}
2119
  <span>{escape(summary)}</span>
 
713
  <p class="text-sm text-zinc-600">Use these panels for paper-facing variant, reranking, and Nano subset audits.</p>
714
  </div>
715
  <div class="flex flex-wrap gap-2">
716
+ <button type="button" class="inline-flex items-center gap-1.5 border border-zinc-300 px-2 py-1 text-[0.8125rem] text-zinc-800 hover:border-cyan-600 hover:text-cyan-700"
717
  hx-get="/analysis?{escape(variant_query, quote=True)}" hx-target="#analysis-panel" hx-swap="innerHTML">{_icon_svg("git-compare-arrows", class_name="hakari-icon action-icon shrink-0")}<span>Variant impact</span></button>
718
+ <button type="button" class="inline-flex items-center gap-1.5 border border-zinc-300 px-2 py-1 text-[0.8125rem] text-zinc-800 hover:border-cyan-600 hover:text-cyan-700"
719
  hx-get="/analysis?{escape(rerank_query, quote=True)}" hx-target="#analysis-panel" hx-swap="innerHTML">{_icon_svg("arrow-down-up", class_name="hakari-icon action-icon shrink-0")}<span>Reranking diagnostics</span></button>
720
+ <button type="button" class="inline-flex items-center gap-1.5 border border-zinc-300 px-2 py-1 text-[0.8125rem] text-zinc-800 hover:border-cyan-600 hover:text-cyan-700"
721
  hx-get="/analysis?{escape(dataset_query, quote=True)}" hx-target="#analysis-panel" hx-swap="innerHTML">{_icon_svg("database", class_name="hakari-icon action-icon shrink-0")}<span>Dataset diagnostics</span></button>
722
  </div>
723
  </div>
 
803
  doc = benchmark_docs.group_doc(view_name) if benchmark_docs is not None else None
804
  if doc is None:
805
  grouped_buttons[_view_group(view_name)].append(
806
+ f"""<button type="button" class="border px-2 py-1 text-[0.8125rem] {classes}"
807
  hx-get="{_leaderboard_url(query)}" hx-push-url="{_page_url(query_payload)}"
808
  {_leaderboard_control_hx_attrs()}>
809
  {escape(view_label)}
 
812
  continue
813
  doc_trigger = _render_doc_summary_trigger(doc=doc, label=f"{view_label} overview")
814
  grouped_buttons[_view_group(view_name)].append(
815
+ f"""<span class="doc-label-group inline-flex items-center border text-[0.8125rem] {classes}" data-doc-label-group="benchmark">
816
+ <button type="button" class="py-1 pl-2 pr-0 text-left"
817
  hx-get="{_leaderboard_url(query)}" hx-push-url="{_page_url(query_payload)}"
818
  {_leaderboard_control_hx_attrs()}>
819
  {escape(view_label)}
 
891
  query_payload["target"] = target
892
  query = urlencode(query_payload, doseq=True)
893
  buttons.append(
894
+ f"""<button type="button" class="border px-2 py-1 text-[0.8125rem] {classes}"
895
  hx-get="{_leaderboard_url(query)}" hx-push-url="{_page_url(query_payload)}"
896
  {_leaderboard_control_hx_attrs()}>
897
  {escape(label)}
 
925
  "MNanoBEIR",
926
  "NanoMMTEB-v2",
927
  "NanoRTEB",
 
928
  "NanoMLDR",
929
  "NanoBRIGHT",
930
  "NanoLaw",
 
980
  )
981
  more = f"""
982
  <details class="relative">
983
+ <summary class="cursor-pointer border border-zinc-300 bg-white px-2 py-1 text-[0.8125rem] text-zinc-700 hover:border-cyan-500 hover:text-cyan-700">More</summary>
984
  <div class="absolute z-10 mt-1 grid max-h-72 min-w-[28rem] grid-cols-3 gap-1 overflow-auto border border-zinc-300 bg-white p-2 shadow-sm sm:grid-cols-5">
985
  {more_buttons}
986
  </div>
 
988
  """
989
  return f"""
990
  <nav class="mb-4 flex flex-wrap items-start gap-2" aria-label="Language pages">
991
+ {_control_label(icon="languages", text="Language pages", extra_class="pt-1 text-[0.8125rem]")}
992
  {''.join(buttons)}
993
  {more}
994
  </nav>
 
1019
  )
1020
  query = urlencode(query_payload, doseq=True)
1021
  data_attr = "" if option is None else f' data-language-page="{escape(option.code)}"'
1022
+ return f"""<button type="button"{data_attr} class="border px-2 py-1 text-[0.8125rem] {classes}"
1023
  hx-get="{_leaderboard_url(query)}" hx-push-url="{_page_url(query_payload)}"
1024
  {_leaderboard_control_hx_attrs()}>{escape(label)}</button>"""
1025
 
 
1324
  else ""
1325
  )
1326
  return f"""
1327
+ <div class="mb-4 text-[0.8125rem] text-zinc-700">
1328
  <form id="column-controls" class="flex flex-wrap items-center gap-x-5 gap-y-2"
1329
  hx-get="/leaderboard" hx-push-url="true"
1330
  {_leaderboard_control_hx_attrs()}
 
1380
  <label class="flex min-w-64 items-center gap-2">
1381
  <span class="shrink-0 whitespace-nowrap font-medium text-zinc-800">Model name</span>
1382
  <input id="model-filter-input" type="search" name="model_filter" value="{escape(filter_state.model_filter)}"
1383
+ class="w-72 max-w-full border border-zinc-300 bg-white px-2 py-1 text-[0.8125rem] text-zinc-900 outline-none focus:border-cyan-700"
1384
  autocomplete="off">
1385
  </label>
1386
  <label class="flex min-w-64 items-center gap-2">
1387
  <span class="shrink-0 whitespace-nowrap font-medium text-zinc-800">Task name</span>
1388
  <input id="task-filter-input" type="search" name="task_filter" value="{escape(filter_state.task_filter)}"
1389
+ class="w-72 max-w-full border border-zinc-300 bg-white px-2 py-1 text-[0.8125rem] text-zinc-900 outline-none focus:border-cyan-700"
1390
  autocomplete="off">
1391
  </label>
1392
  <label class="inline-flex items-center gap-2 pt-1">
 
1394
  <span class="whitespace-nowrap font-medium text-zinc-800">Recalculate Borda, Mean</span>
1395
  {_render_help_tooltip("When enabled, Model name, Task name, and active facet filters narrow the ranking population before Borda and Mean are recomputed. With a Task name filter, Borda is computed from per-task ranks over the filtered tasks.")}
1396
  </label>
1397
+ <button type="submit" class="border border-zinc-300 bg-zinc-50 px-2 py-0.5 text-[0.8125rem] font-medium text-zinc-800 hover:border-cyan-600 hover:text-cyan-700">
1398
  Apply
1399
  </button>
1400
  <div id="facet-filters" class="flex flex-wrap items-start gap-3">
 
1441
 
1442
  def _render_task_length_filter_inputs(filter_state: FilterState) -> str:
1443
  input_class = (
1444
+ "w-24 border border-zinc-300 bg-white px-2 py-1 text-[0.8125rem] text-zinc-900 outline-none "
1445
  "focus:border-cyan-700"
1446
  )
1447
  active_classes = "border-cyan-700 bg-cyan-50" if filter_state.has_task_length_filters else "border-zinc-200 bg-zinc-50"
 
1450
  "Tasks missing length metadata are excluded when a bound is set."
1451
  )
1452
  return f"""
1453
+ <fieldset class="flex flex-wrap items-center gap-2 border {active_classes} px-1.5 py-1">
1454
  <legend class="inline-flex items-center gap-1 px-1 text-xs font-semibold uppercase text-zinc-500">
1455
  <span>Task string length</span>
1456
  {_render_help_tooltip(tooltip)}
 
1496
  query = urlencode(query_payload, doseq=True)
1497
  page_url = _page_url(query_payload)
1498
  buttons.append(
1499
+ f"""<button type="button" class="border px-2 py-1 text-[0.8125rem] {classes}"
1500
  hx-get="{_leaderboard_url(query)}" hx-push-url="{page_url}"
1501
  {_leaderboard_control_hx_attrs()}>
1502
  {escape(score_group.label)}
 
1887
  else "border-zinc-300 bg-white text-zinc-700 hover:border-cyan-600 hover:text-cyan-700"
1888
  )
1889
  return f"""
1890
+ <button type="button" class="border px-2 py-1 text-[0.8125rem] {toggle_classes}"
1891
  hx-get="/analysis?{escape(toggle_query, quote=True)}"
1892
  hx-target="#analysis-panel" hx-swap="innerHTML">{escape(toggle_label)}</button>
1893
  """
 
2101
  for value, label in options:
2102
  checked = " checked" if value in selected_values else ""
2103
  checkboxes.append(
2104
+ f"""<label class="flex min-w-0 items-center gap-2 whitespace-nowrap px-1.5 py-0.5">
2105
  <input type="checkbox" name="{escape(name)}" value="{escape(value)}" class="h-4 w-4 accent-cyan-700"{checked}>
2106
  <span>{escape(label)}</span>
2107
  </label>"""
 
2112
  none_page_url = _page_url(none_query)
2113
  return f"""
2114
  <details class="filter-detail border border-zinc-300 bg-white" data-filter-detail="{escape(name, quote=True)}" data-filter-icon="{escape(icon, quote=True)}">
2115
+ <summary class="cursor-pointer px-1.5 py-0.5 text-[0.8125rem] font-medium text-zinc-800">
2116
  <span class="inline-flex items-center gap-1.5">
2117
  {_icon_svg(icon, class_name="hakari-icon filter-detail-icon shrink-0")}
2118
  <span>{escape(summary)}</span>
hakari_bench/viewer/assets/app.css CHANGED
@@ -1 +1 @@
1
- *,:after,:before{--tw-border-spacing-x:0;--tw-border-spacing-y:0;--tw-translate-x:0;--tw-translate-y:0;--tw-rotate:0;--tw-skew-x:0;--tw-skew-y:0;--tw-scale-x:1;--tw-scale-y:1;--tw-pan-x: ;--tw-pan-y: ;--tw-pinch-zoom: ;--tw-scroll-snap-strictness:proximity;--tw-gradient-from-position: ;--tw-gradient-via-position: ;--tw-gradient-to-position: ;--tw-ordinal: ;--tw-slashed-zero: ;--tw-numeric-figure: ;--tw-numeric-spacing: ;--tw-numeric-fraction: ;--tw-ring-inset: ;--tw-ring-offset-width:0px;--tw-ring-offset-color:#fff;--tw-ring-color:rgba(59,130,246,.5);--tw-ring-offset-shadow:0 0 #0000;--tw-ring-shadow:0 0 #0000;--tw-shadow:0 0 #0000;--tw-shadow-colored:0 0 #0000;--tw-blur: ;--tw-brightness: ;--tw-contrast: ;--tw-grayscale: ;--tw-hue-rotate: ;--tw-invert: ;--tw-saturate: ;--tw-sepia: ;--tw-drop-shadow: ;--tw-backdrop-blur: ;--tw-backdrop-brightness: ;--tw-backdrop-contrast: ;--tw-backdrop-grayscale: ;--tw-backdrop-hue-rotate: ;--tw-backdrop-invert: ;--tw-backdrop-opacity: ;--tw-backdrop-saturate: ;--tw-backdrop-sepia: ;--tw-contain-size: ;--tw-contain-layout: ;--tw-contain-paint: ;--tw-contain-style: }::backdrop{--tw-border-spacing-x:0;--tw-border-spacing-y:0;--tw-translate-x:0;--tw-translate-y:0;--tw-rotate:0;--tw-skew-x:0;--tw-skew-y:0;--tw-scale-x:1;--tw-scale-y:1;--tw-pan-x: ;--tw-pan-y: ;--tw-pinch-zoom: ;--tw-scroll-snap-strictness:proximity;--tw-gradient-from-position: ;--tw-gradient-via-position: ;--tw-gradient-to-position: ;--tw-ordinal: ;--tw-slashed-zero: ;--tw-numeric-figure: ;--tw-numeric-spacing: ;--tw-numeric-fraction: ;--tw-ring-inset: ;--tw-ring-offset-width:0px;--tw-ring-offset-color:#fff;--tw-ring-color:rgba(59,130,246,.5);--tw-ring-offset-shadow:0 0 #0000;--tw-ring-shadow:0 0 #0000;--tw-shadow:0 0 #0000;--tw-shadow-colored:0 0 #0000;--tw-blur: ;--tw-brightness: ;--tw-contrast: ;--tw-grayscale: ;--tw-hue-rotate: ;--tw-invert: ;--tw-saturate: ;--tw-sepia: ;--tw-drop-shadow: ;--tw-backdrop-blur: ;--tw-backdrop-brightness: ;--tw-backdrop-contrast: 
;--tw-backdrop-grayscale: ;--tw-backdrop-hue-rotate: ;--tw-backdrop-invert: ;--tw-backdrop-opacity: ;--tw-backdrop-saturate: ;--tw-backdrop-sepia: ;--tw-contain-size: ;--tw-contain-layout: ;--tw-contain-paint: ;--tw-contain-style: }/*! tailwindcss v3.4.17 | MIT License | https://tailwindcss.com*/*,:after,:before{box-sizing:border-box;border:0 solid #e5e7eb}:after,:before{--tw-content:""}:host,html{line-height:1.5;-webkit-text-size-adjust:100%;-moz-tab-size:4;-o-tab-size:4;tab-size:4;font-family:ui-sans-serif,system-ui,sans-serif,Apple Color Emoji,Segoe UI Emoji,Segoe UI Symbol,Noto Color Emoji;font-feature-settings:normal;font-variation-settings:normal;-webkit-tap-highlight-color:transparent}body{margin:0;line-height:inherit}hr{height:0;color:inherit;border-top-width:1px}abbr:where([title]){-webkit-text-decoration:underline dotted;text-decoration:underline dotted}h1,h2,h3,h4,h5,h6{font-size:inherit;font-weight:inherit}a{color:inherit;text-decoration:inherit}b,strong{font-weight:bolder}code,kbd,pre,samp{font-family:ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,Liberation Mono,Courier 
New,monospace;font-feature-settings:normal;font-variation-settings:normal;font-size:1em}small{font-size:80%}sub,sup{font-size:75%;line-height:0;position:relative;vertical-align:baseline}sub{bottom:-.25em}sup{top:-.5em}table{text-indent:0;border-color:inherit;border-collapse:collapse}button,input,optgroup,select,textarea{font-family:inherit;font-feature-settings:inherit;font-variation-settings:inherit;font-size:100%;font-weight:inherit;line-height:inherit;letter-spacing:inherit;color:inherit;margin:0;padding:0}button,select{text-transform:none}button,input:where([type=button]),input:where([type=reset]),input:where([type=submit]){-webkit-appearance:button;background-color:transparent;background-image:none}:-moz-focusring{outline:auto}:-moz-ui-invalid{box-shadow:none}progress{vertical-align:baseline}::-webkit-inner-spin-button,::-webkit-outer-spin-button{height:auto}[type=search]{-webkit-appearance:textfield;outline-offset:-2px}::-webkit-search-decoration{-webkit-appearance:none}::-webkit-file-upload-button{-webkit-appearance:button;font:inherit}summary{display:list-item}blockquote,dd,dl,figure,h1,h2,h3,h4,h5,h6,hr,p,pre{margin:0}fieldset{margin:0}fieldset,legend{padding:0}menu,ol,ul{list-style:none;margin:0;padding:0}dialog{padding:0}textarea{resize:vertical}input::-moz-placeholder,textarea::-moz-placeholder{opacity:1;color:#9ca3af}input::placeholder,textarea::placeholder{opacity:1;color:#9ca3af}[role=button],button{cursor:pointer}:disabled{cursor:default}audio,canvas,embed,iframe,img,object,svg,video{display:block;vertical-align:middle}img,video{max-width:100%;height:auto}[hidden]:where(:not([hidden=until-found])){display:none}:root{color-scheme:light 
dark;--hakari-bg:#f4f1e8;--hakari-surface:#fffaf0;--hakari-surface-muted:#ebe6d9;--hakari-surface-faint:#f8f4eb;--hakari-border:#d3ccb9;--hakari-border-strong:#aaa18e;--hakari-text:#1d1b18;--hakari-text-muted:#70695c;--hakari-text-faint:#918978;--hakari-accent:#256d57;--hakari-accent-soft:#dceee4;--hakari-accent-border:#80b39d;--hakari-warn-bg:#f7e7c6;--hakari-warn-border:#c28f31;--hakari-warn-text:#704b08;--hakari-danger:#b24a57;--hakari-danger-soft:#f3d7da;--hakari-purple-soft:#e8e0ef;--hakari-row-hover:#ece4d2;--hakari-font:"SFMono-Regular","Cascadia Code","Roboto Mono","Noto Sans Mono","Yu Gothic UI","Meiryo",ui-monospace,monospace}body{background-color:var(--hakari-bg);color:var(--hakari-text);font-family:var(--hakari-font);-webkit-font-smoothing:antialiased;text-rendering:optimizeLegibility}body:before{background-image:linear-gradient(hsla(0,0%,100%,.18) 1px,transparent 0),linear-gradient(90deg,rgba(29,27,24,.05) 1px,transparent 0),radial-gradient(circle at 1px 1px,rgba(29,27,24,.08) 1px,transparent 0);background-size:100% 3rem,3rem 100%,4px 4px;content:"";inset:0;opacity:.36;pointer-events:none;position:fixed;z-index:-1}@media (prefers-color-scheme:dark){:root{--hakari-bg:#29292d;--hakari-surface:#242428;--hakari-surface-muted:#303035;--hakari-surface-faint:#2a2a2f;--hakari-border:#45474c;--hakari-border-strong:#5b625f;--hakari-text:#f0eee8;--hakari-text-muted:#aaa8a1;--hakari-text-faint:#7f7f7b;--hakari-accent:#8bd99a;--hakari-accent-soft:#23352c;--hakari-accent-border:#5a8d66;--hakari-warn-bg:#3d2c1c;--hakari-warn-border:#a46a22;--hakari-warn-text:#f0c36f;--hakari-danger:#ff7d8a;--hakari-danger-soft:#41272d;--hakari-purple-soft:#332b3c;--hakari-row-hover:#38393d}body:before{background-image:linear-gradient(hsla(0,0%,100%,.035) 1px,transparent 0),linear-gradient(90deg,hsla(0,0%,100%,.025) 1px,transparent 0),radial-gradient(circle at 1px 1px,hsla(0,0%,100%,.16) 1px,transparent 
0);opacity:.42}}.visible{visibility:visible}.static{position:static}.fixed{position:fixed}.absolute{position:absolute}.relative{position:relative}.sticky{position:sticky}.bottom-4{bottom:1rem}.right-4{right:1rem}.z-10{z-index:10}.z-20{z-index:20}.z-50{z-index:50}.mx-auto{margin-left:auto;margin-right:auto}.mb-1{margin-bottom:.25rem}.mb-2{margin-bottom:.5rem}.mb-3{margin-bottom:.75rem}.mb-4{margin-bottom:1rem}.mb-5{margin-bottom:1.25rem}.ml-2{margin-left:.5rem}.mt-1{margin-top:.25rem}.mt-2{margin-top:.5rem}.mt-3{margin-top:.75rem}.mt-6{margin-top:1.5rem}.block{display:block}.inline{display:inline}.flex{display:flex}.inline-flex{display:inline-flex}.table{display:table}.grid{display:grid}.hidden{display:none}.h-3\.5{height:.875rem}.h-4{height:1rem}.h-8{height:2rem}.max-h-60{max-height:15rem}.max-h-72{max-height:18rem}.max-h-80{max-height:20rem}.w-24{width:6rem}.w-3\.5{width:.875rem}.w-4{width:1rem}.w-72{width:18rem}.w-8{width:2rem}.w-\[4\.75rem\]{width:4.75rem}.w-\[min\(92vw\2c 42rem\)\]{width:min(92vw,42rem)}.w-full{width:100%}.min-w-0{min-width:0}.min-w-64{min-width:16rem}.min-w-\[28rem\]{min-width:28rem}.min-w-\[4\.75rem\]{min-width:4.75rem}.min-w-full{min-width:100%}.max-w-4xl{max-width:56rem}.max-w-\[1600px\]{max-width:1600px}.max-w-\[4\.75rem\]{max-width:4.75rem}.max-w-full{max-width:100%}.flex-1{flex:1 1 0%}.shrink-0{flex-shrink:0}.border-collapse{border-collapse:collapse}.transform{transform:translate(var(--tw-translate-x),var(--tw-translate-y)) rotate(var(--tw-rotate)) skewX(var(--tw-skew-x)) skewY(var(--tw-skew-y)) scaleX(var(--tw-scale-x)) scaleY(var(--tw-scale-y))}.cursor-pointer{cursor:pointer}.grid-cols-2{grid-template-columns:repeat(2,minmax(0,1fr))}.grid-cols-3{grid-template-columns:repeat(3,minmax(0,1fr))}.grid-cols-\[10rem_1fr\]{grid-template-columns:10rem 
1fr}.flex-wrap{flex-wrap:wrap}.items-start{align-items:flex-start}.items-end{align-items:flex-end}.items-center{align-items:center}.justify-start{justify-content:flex-start}.justify-end{justify-content:flex-end}.justify-center{justify-content:center}.justify-between{justify-content:space-between}.gap-0\.5{gap:.125rem}.gap-1{gap:.25rem}.gap-1\.5{gap:.375rem}.gap-2{gap:.5rem}.gap-3{gap:.75rem}.gap-x-2{-moz-column-gap:.5rem;column-gap:.5rem}.gap-x-3{-moz-column-gap:.75rem;column-gap:.75rem}.gap-x-5{-moz-column-gap:1.25rem;column-gap:1.25rem}.gap-y-1{row-gap:.25rem}.gap-y-2{row-gap:.5rem}.space-y-2>:not([hidden])~:not([hidden]){--tw-space-y-reverse:0;margin-top:calc(.5rem*(1 - var(--tw-space-y-reverse)));margin-bottom:calc(.5rem*var(--tw-space-y-reverse))}.space-y-3>:not([hidden])~:not([hidden]){--tw-space-y-reverse:0;margin-top:calc(.75rem*(1 - var(--tw-space-y-reverse)));margin-bottom:calc(.75rem*var(--tw-space-y-reverse))}.overflow-auto{overflow:auto}.overflow-x-auto{overflow-x:auto}.truncate{overflow:hidden;text-overflow:ellipsis}.truncate,.whitespace-nowrap{white-space:nowrap}.whitespace-pre-wrap{white-space:pre-wrap}.break-all{word-break:break-all}.rounded{border-radius:.25rem}.rounded-full{border-radius:9999px}.border{border-width:1px}.border-b{border-bottom-width:1px}.border-l{border-left-width:1px}.border-t{border-top-width:1px}.border-amber-200{--tw-border-opacity:1;border-color:rgb(253 230 138/var(--tw-border-opacity,1))}.border-cyan-200{--tw-border-opacity:1;border-color:rgb(165 243 252/var(--tw-border-opacity,1))}.border-cyan-700{--tw-border-opacity:1;border-color:rgb(14 116 144/var(--tw-border-opacity,1))}.border-violet-200{--tw-border-opacity:1;border-color:rgb(221 214 254/var(--tw-border-opacity,1))}.border-zinc-200{--tw-border-opacity:1;border-color:rgb(228 228 231/var(--tw-border-opacity,1))}.border-zinc-300{--tw-border-opacity:1;border-color:rgb(212 212 216/var(--tw-border-opacity,1))}.border-zinc-900{--tw-border-opacity:1;border-color:rgb(24 24 
27/var(--tw-border-opacity,1))}.bg-amber-50{--tw-bg-opacity:1;background-color:rgb(255 251 235/var(--tw-bg-opacity,1))}.bg-cyan-50{--tw-bg-opacity:1;background-color:rgb(236 254 255/var(--tw-bg-opacity,1))}.bg-inherit{background-color:inherit}.bg-violet-50{--tw-bg-opacity:1;background-color:rgb(245 243 255/var(--tw-bg-opacity,1))}.bg-white{--tw-bg-opacity:1;background-color:rgb(255 255 255/var(--tw-bg-opacity,1))}.bg-zinc-100{--tw-bg-opacity:1;background-color:rgb(244 244 245/var(--tw-bg-opacity,1))}.bg-zinc-50{--tw-bg-opacity:1;background-color:rgb(250 250 250/var(--tw-bg-opacity,1))}.bg-zinc-900{--tw-bg-opacity:1;background-color:rgb(24 24 27/var(--tw-bg-opacity,1))}.p-0{padding:0}.p-2{padding:.5rem}.p-3{padding:.75rem}.px-1{padding-left:.25rem;padding-right:.25rem}.px-2{padding-left:.5rem;padding-right:.5rem}.px-3{padding-left:.75rem;padding-right:.75rem}.px-4{padding-left:1rem;padding-right:1rem}.py-0{padding-top:0;padding-bottom:0}.py-0\.5{padding-top:.125rem;padding-bottom:.125rem}.py-1{padding-top:.25rem;padding-bottom:.25rem}.py-1\.5{padding-top:.375rem;padding-bottom:.375rem}.py-2{padding-top:.5rem;padding-bottom:.5rem}.py-3{padding-top:.75rem;padding-bottom:.75rem}.py-4{padding-top:1rem;padding-bottom:1rem}.py-5{padding-top:1.25rem;padding-bottom:1.25rem}.py-6{padding-top:1.5rem;padding-bottom:1.5rem}.pb-4{padding-bottom:1rem}.pl-0\.5{padding-left:.125rem}.pl-3{padding-left:.75rem}.pr-0{padding-right:0}.pr-2{padding-right:.5rem}.pt-1{padding-top:.25rem}.pt-1\.5{padding-top:.375rem}.text-left{text-align:left}.text-center{text-align:center}.text-right{text-align:right}.align-middle{vertical-align:middle}.font-mono{font-family:ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,Liberation Mono,Courier 
New,monospace}.text-2xl{font-size:1.5rem;line-height:2rem}.text-\[0\.6875rem\]{font-size:.6875rem}.text-\[0\.8125rem\]{font-size:.8125rem}.text-\[9px\]{font-size:9px}.text-base{font-size:1rem;line-height:1.5rem}.text-lg{font-size:1.125rem;line-height:1.75rem}.text-sm{font-size:.875rem;line-height:1.25rem}.text-xl{font-size:1.25rem;line-height:1.75rem}.text-xs{font-size:.75rem;line-height:1rem}.font-medium{font-weight:500}.font-semibold{font-weight:600}.uppercase{text-transform:uppercase}.normal-case{text-transform:none}.tabular-nums{--tw-numeric-spacing:tabular-nums;font-variant-numeric:var(--tw-ordinal) var(--tw-slashed-zero) var(--tw-numeric-figure) var(--tw-numeric-spacing) var(--tw-numeric-fraction)}.leading-none{line-height:1}.leading-snug{line-height:1.375}.leading-tight{line-height:1.25}.text-amber-800{--tw-text-opacity:1;color:rgb(146 64 14/var(--tw-text-opacity,1))}.text-cyan-700{--tw-text-opacity:1;color:rgb(14 116 144/var(--tw-text-opacity,1))}.text-cyan-800{--tw-text-opacity:1;color:rgb(21 94 117/var(--tw-text-opacity,1))}.text-cyan-900{--tw-text-opacity:1;color:rgb(22 78 99/var(--tw-text-opacity,1))}.text-violet-800{--tw-text-opacity:1;color:rgb(91 33 182/var(--tw-text-opacity,1))}.text-white{--tw-text-opacity:1;color:rgb(255 255 255/var(--tw-text-opacity,1))}.text-zinc-400{--tw-text-opacity:1;color:rgb(161 161 170/var(--tw-text-opacity,1))}.text-zinc-500{--tw-text-opacity:1;color:rgb(113 113 122/var(--tw-text-opacity,1))}.text-zinc-600{--tw-text-opacity:1;color:rgb(82 82 91/var(--tw-text-opacity,1))}.text-zinc-700{--tw-text-opacity:1;color:rgb(63 63 70/var(--tw-text-opacity,1))}.text-zinc-800{--tw-text-opacity:1;color:rgb(39 39 42/var(--tw-text-opacity,1))}.text-zinc-900{--tw-text-opacity:1;color:rgb(24 24 27/var(--tw-text-opacity,1))}.text-zinc-950{--tw-text-opacity:1;color:rgb(9 9 
11/var(--tw-text-opacity,1))}.underline{text-decoration-line:underline}.underline-offset-2{text-underline-offset:2px}.accent-cyan-700{accent-color:#0e7490}.shadow-sm{--tw-shadow:0 1px 2px 0 rgba(0,0,0,.05);--tw-shadow-colored:0 1px 2px 0 var(--tw-shadow-color);box-shadow:var(--tw-ring-offset-shadow,0 0 #0000),var(--tw-ring-shadow,0 0 #0000),var(--tw-shadow)}.outline-none{outline:2px solid transparent;outline-offset:2px}.filter{filter:var(--tw-blur) var(--tw-brightness) var(--tw-contrast) var(--tw-grayscale) var(--tw-hue-rotate) var(--tw-invert) var(--tw-saturate) var(--tw-sepia) var(--tw-drop-shadow)}body.bg-zinc-50{background-color:var(--hakari-bg)}.bg-white{background-color:var(--hakari-surface)}.bg-zinc-50,.even\:bg-zinc-50:nth-child(2n){background-color:var(--hakari-surface-faint)}.bg-zinc-100{background-color:var(--hakari-surface-muted)}.bg-zinc-900{background-color:var(--hakari-accent)}.bg-cyan-50{background-color:var(--hakari-accent-soft)}.bg-amber-50{background-color:var(--hakari-warn-bg)}.bg-violet-50{background-color:var(--hakari-purple-soft)}.text-zinc-800,.text-zinc-900,.text-zinc-950{color:var(--hakari-text)}.text-zinc-600,.text-zinc-700{color:var(--hakari-text-muted)}.text-zinc-400,.text-zinc-500{color:var(--hakari-text-faint)}.text-white{color:var(--hakari-bg)}.text-cyan-700,.text-cyan-800,.text-cyan-900{color:var(--hakari-accent)}.text-amber-800{color:var(--hakari-warn-text)}.text-violet-800{color:#765a8f}.border-zinc-200,.border-zinc-300{border-color:var(--hakari-border)}.border-cyan-200,.border-cyan-500,.border-cyan-600,.border-cyan-700{border-color:var(--hakari-accent-border)}.border-amber-200{border-color:var(--hakari-warn-border)}.border-violet-200{border-color:#a78bbd}button,details,input,select,summary,textarea{border-color:var(--hakari-border)}input,select,textarea{background-color:var(--hakari-surface-faint);color:var(--hakari-text)}.hover\:text-cyan-700:hover,button:hover,summary:hover{color:var(--hakari-accent)}.model-detail-trigger,.model
-detail-trigger:hover{color:var(--hakari-text)}.hakari-icon{height:.875rem;width:.875rem}.doc-summary-trigger .hakari-icon,.tooltip-trigger .hakari-icon{height:.75rem;width:.75rem}.action-icon,.control-heading-icon,.filter-detail-icon,.section-heading-icon{color:var(--hakari-accent)}.leaderboard-row:hover>td{background-color:var(--hakari-row-hover)}.leaderboard-table-scroll{--hakari-model-col-width:clamp(18rem,40vw,36rem);--hakari-rank-col-width:4rem}.leaderboard-col-model{box-sizing:border-box;left:0;max-width:var(--hakari-model-col-width);min-width:var(--hakari-model-col-width);width:var(--hakari-model-col-width)}.leaderboard-col-rank{box-sizing:border-box;max-width:var(--hakari-rank-col-width);min-width:var(--hakari-rank-col-width);width:var(--hakari-rank-col-width)}.benchmark-doc h1{font-size:1.5rem;font-weight:600;line-height:2rem}.benchmark-doc h2{border-top:1px solid var(--hakari-border);font-size:1.125rem;font-weight:600;line-height:1.75rem;margin-top:1.5rem;padding-top:1rem}.benchmark-doc h3{font-size:1rem;font-weight:600;line-height:1.5rem;margin-top:1rem}.benchmark-doc blockquote,.benchmark-doc p,.benchmark-doc pre,.benchmark-doc table,.benchmark-doc ul{margin-top:.75rem}.benchmark-doc blockquote,.benchmark-doc li,.benchmark-doc p,.benchmark-doc td{color:var(--hakari-text-muted);font-size:.875rem;line-height:1.55}.benchmark-doc ul{list-style:disc;padding-left:1.25rem}.benchmark-doc a{color:var(--hakari-accent);text-decoration-line:underline;text-underline-offset:2px}.benchmark-doc table{min-width:100%}.benchmark-doc td{border-top:1px solid var(--hakari-border);padding:.375rem .5rem;vertical-align:top}.benchmark-doc code,.benchmark-doc pre{font-family:var(--hakari-font)}.benchmark-doc pre{border:1px solid var(--hakari-border);overflow-x:auto;padding:.75rem}.tooltip-trigger{cursor:pointer;position:relative}.tooltip-trigger:after{display:none}.global-tooltip{z-index:1000;max-width:min(24rem,calc(100vw - 
2rem));opacity:0;overflow-wrap:anywhere;pointer-events:none;text-transform:none;transition:opacity .12s ease-in-out;white-space:normal}.shadow-sm{box-shadow:0 1px 2px 0 rgba(0,0,0,.18)}.global-tooltip[data-visible=true]{opacity:1}.leaderboard-loading-toast{opacity:0;pointer-events:none;transform:translateY(.5rem);transition:opacity .16s ease-in-out,transform .16s ease-in-out}.leaderboard-loading-toast.htmx-request{opacity:1;transform:translateY(0)}[data-leaderboard-pending=true]{cursor:progress;opacity:.72}[data-leaderboard-pending=true]:after{animation:hakari-leaderboard-spin .7s linear infinite;border:2px solid;border-right:2px solid transparent;border-radius:9999px;content:"";display:inline-block;height:.55rem;margin-left:.375rem;vertical-align:-.075rem;width:.55rem}.task-z-score{box-sizing:border-box;display:inline-flex;width:3.75rem;min-width:3.75rem;flex-direction:column;align-items:flex-end;justify-content:center;border:1px solid rgba(29,27,24,.14);border-radius:0;line-height:1;padding:.1rem 
.3rem}.task-z-score-value{font-size:.8125rem;font-weight:400}.task-z-score-delta{margin-top:0;font-size:.5625rem;font-weight:400}.task-z-neutral{background-color:transparent;color:inherit}.task-z-pos-025{background-color:#f4f1df;color:#5a6335}.task-z-pos-050{background-color:#e8e5c4;color:#4f5b2e}.task-z-pos-075{background-color:#d7d49f;color:#3f4a24}.task-z-pos-100{background-color:#c5c27c;color:#30391b}.task-z-pos-125{background-color:#aaa85f;color:#252d16}.task-z-pos-150{background-color:#8c8f46;color:#fffaf0}.task-z-pos-175{background-color:#707735;color:#fffaf0}.task-z-pos-200{background-color:#566126;color:#fffaf0}.task-z-neg-025{background-color:#f7ebe4;color:#87513e}.task-z-neg-050{background-color:#efd8cb;color:#7a4234}.task-z-neg-075{background-color:#e5c0ae;color:#6e342c}.task-z-neg-100{background-color:#d99f88;color:#55251f}.task-z-neg-125{background-color:#c97c63;color:#3a1714}.task-z-neg-150{background-color:#ad5d4d;color:#fffaf0}.task-z-neg-175{background-color:#924436;color:#fffaf0}.task-z-neg-200{background-color:#733126;color:#fffaf0}@keyframes hakari-leaderboard-spin{to{transform:rotate(1turn)}}@media (prefers-reduced-motion:reduce){.leaderboard-loading-toast{transition:none}[data-leaderboard-pending=true]:after{animation:none}}@media (prefers-color-scheme:dark){.shadow-sm{box-shadow:0 1px 2px 0 
rgba(0,0,0,.5)}button,input,select,summary,textarea{color-scheme:dark}dialog::backdrop{background-color:rgba(24,24,27,.76)}.global-tooltip{border-color:var(--hakari-border-strong);background-color:var(--hakari-surface);color:var(--hakari-text)}.task-z-score{border-color:hsla(45,21%,93%,.22)}.task-z-pos-025{background-color:#022c22;color:#a7f3d0}.task-z-pos-050{background-color:#064e3b;color:#d1fae5}.task-z-pos-075{background-color:#065f46;color:#ecfdf5}.task-z-pos-100{background-color:#047857;color:#fff}.task-z-pos-125{background-color:#059669;color:#fff}.task-z-pos-150{background-color:#10b981;color:#022c22}.task-z-pos-175{background-color:#34d399;color:#022c22}.task-z-pos-200{background-color:#6ee7b7;color:#022c22}.task-z-neg-025{background-color:#4c0519;color:#fecdd3}.task-z-neg-050{background-color:#881337;color:#ffe4e6}.task-z-neg-075{background-color:#9f1239;color:#fff1f2}.task-z-neg-100{background-color:#be123c;color:#fff}.task-z-neg-125{background-color:#e11d48;color:#fff}.task-z-neg-150{background-color:#f43f5e;color:#fff}.task-z-neg-175{background-color:#e11d48;color:#fff}.task-z-neg-200{background-color:#be123c;color:#fff}}.\[index\:end\]{index:end}.\[overflow-wrap\:anywhere\]{overflow-wrap:anywhere}.backdrop\:bg-zinc-950\/35::backdrop{background-color:rgba(9,9,11,.35)}.odd\:bg-white:nth-child(odd){--tw-bg-opacity:1;background-color:rgb(255 255 255/var(--tw-bg-opacity,1));background-color:var(--hakari-surface)}.even\:bg-zinc-50:nth-child(2n){--tw-bg-opacity:1;background-color:rgb(250 250 250/var(--tw-bg-opacity,1))}.even\:bg-zinc-50:nth-child(2n)body{background-color:var(--hakari-bg)}.even\:bg-zinc-50:nth-child(2n){background-color:var(--hakari-surface-faint)}.hover\:border-cyan-500:hover{--tw-border-opacity:1;border-color:rgb(6 182 212/var(--tw-border-opacity,1))}.hover\:border-cyan-600:hover{--tw-border-opacity:1;border-color:rgb(8 145 178/var(--tw-border-opacity,1))}.hover\:text-cyan-700:hover{--tw-text-opacity:1;color:rgb(14 116 
144/var(--tw-text-opacity,1))}.hover\:underline:hover{text-decoration-line:underline}.hover\:text-cyan-700:hover{color:var(--hakari-accent)}.focus\:border-cyan-700:focus,.hover\:border-cyan-500:hover,.hover\:border-cyan-600:hover{border-color:var(--hakari-accent-border)}.focus\:border-cyan-700:focus{--tw-border-opacity:1}@media (min-width:640px){.sm\:grid-cols-3{grid-template-columns:repeat(3,minmax(0,1fr))}.sm\:grid-cols-5{grid-template-columns:repeat(5,minmax(0,1fr))}.sm\:px-6{padding-left:1.5rem;padding-right:1.5rem}}@media (min-width:1024px){.lg\:grid-cols-2{grid-template-columns:repeat(2,minmax(0,1fr))}.lg\:grid-cols-6{grid-template-columns:repeat(6,minmax(0,1fr))}}@media (min-width:1280px){.xl\:grid-cols-3{grid-template-columns:repeat(3,minmax(0,1fr))}}
+ *,:after,:before{--tw-border-spacing-x:0;--tw-border-spacing-y:0;--tw-translate-x:0;--tw-translate-y:0;--tw-rotate:0;--tw-skew-x:0;--tw-skew-y:0;--tw-scale-x:1;--tw-scale-y:1;--tw-pan-x: ;--tw-pan-y: ;--tw-pinch-zoom: ;--tw-scroll-snap-strictness:proximity;--tw-gradient-from-position: ;--tw-gradient-via-position: ;--tw-gradient-to-position: ;--tw-ordinal: ;--tw-slashed-zero: ;--tw-numeric-figure: ;--tw-numeric-spacing: ;--tw-numeric-fraction: ;--tw-ring-inset: ;--tw-ring-offset-width:0px;--tw-ring-offset-color:#fff;--tw-ring-color:rgba(59,130,246,.5);--tw-ring-offset-shadow:0 0 #0000;--tw-ring-shadow:0 0 #0000;--tw-shadow:0 0 #0000;--tw-shadow-colored:0 0 #0000;--tw-blur: ;--tw-brightness: ;--tw-contrast: ;--tw-grayscale: ;--tw-hue-rotate: ;--tw-invert: ;--tw-saturate: ;--tw-sepia: ;--tw-drop-shadow: ;--tw-backdrop-blur: ;--tw-backdrop-brightness: ;--tw-backdrop-contrast: ;--tw-backdrop-grayscale: ;--tw-backdrop-hue-rotate: ;--tw-backdrop-invert: ;--tw-backdrop-opacity: ;--tw-backdrop-saturate: ;--tw-backdrop-sepia: ;--tw-contain-size: ;--tw-contain-layout: ;--tw-contain-paint: ;--tw-contain-style: }::backdrop{--tw-border-spacing-x:0;--tw-border-spacing-y:0;--tw-translate-x:0;--tw-translate-y:0;--tw-rotate:0;--tw-skew-x:0;--tw-skew-y:0;--tw-scale-x:1;--tw-scale-y:1;--tw-pan-x: ;--tw-pan-y: ;--tw-pinch-zoom: ;--tw-scroll-snap-strictness:proximity;--tw-gradient-from-position: ;--tw-gradient-via-position: ;--tw-gradient-to-position: ;--tw-ordinal: ;--tw-slashed-zero: ;--tw-numeric-figure: ;--tw-numeric-spacing: ;--tw-numeric-fraction: ;--tw-ring-inset: ;--tw-ring-offset-width:0px;--tw-ring-offset-color:#fff;--tw-ring-color:rgba(59,130,246,.5);--tw-ring-offset-shadow:0 0 #0000;--tw-ring-shadow:0 0 #0000;--tw-shadow:0 0 #0000;--tw-shadow-colored:0 0 #0000;--tw-blur: ;--tw-brightness: ;--tw-contrast: ;--tw-grayscale: ;--tw-hue-rotate: ;--tw-invert: ;--tw-saturate: ;--tw-sepia: ;--tw-drop-shadow: ;--tw-backdrop-blur: ;--tw-backdrop-brightness: ;--tw-backdrop-contrast: 
;--tw-backdrop-grayscale: ;--tw-backdrop-hue-rotate: ;--tw-backdrop-invert: ;--tw-backdrop-opacity: ;--tw-backdrop-saturate: ;--tw-backdrop-sepia: ;--tw-contain-size: ;--tw-contain-layout: ;--tw-contain-paint: ;--tw-contain-style: }/*! tailwindcss v3.4.17 | MIT License | https://tailwindcss.com*/*,:after,:before{box-sizing:border-box;border:0 solid #e5e7eb}:after,:before{--tw-content:""}:host,html{line-height:1.5;-webkit-text-size-adjust:100%;-moz-tab-size:4;-o-tab-size:4;tab-size:4;font-family:ui-sans-serif,system-ui,sans-serif,Apple Color Emoji,Segoe UI Emoji,Segoe UI Symbol,Noto Color Emoji;font-feature-settings:normal;font-variation-settings:normal;-webkit-tap-highlight-color:transparent}body{margin:0;line-height:inherit}hr{height:0;color:inherit;border-top-width:1px}abbr:where([title]){-webkit-text-decoration:underline dotted;text-decoration:underline dotted}h1,h2,h3,h4,h5,h6{font-size:inherit;font-weight:inherit}a{color:inherit;text-decoration:inherit}b,strong{font-weight:bolder}code,kbd,pre,samp{font-family:ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,Liberation Mono,Courier 
New,monospace;font-feature-settings:normal;font-variation-settings:normal;font-size:1em}small{font-size:80%}sub,sup{font-size:75%;line-height:0;position:relative;vertical-align:baseline}sub{bottom:-.25em}sup{top:-.5em}table{text-indent:0;border-color:inherit;border-collapse:collapse}button,input,optgroup,select,textarea{font-family:inherit;font-feature-settings:inherit;font-variation-settings:inherit;font-size:100%;font-weight:inherit;line-height:inherit;letter-spacing:inherit;color:inherit;margin:0;padding:0}button,select{text-transform:none}button,input:where([type=button]),input:where([type=reset]),input:where([type=submit]){-webkit-appearance:button;background-color:transparent;background-image:none}:-moz-focusring{outline:auto}:-moz-ui-invalid{box-shadow:none}progress{vertical-align:baseline}::-webkit-inner-spin-button,::-webkit-outer-spin-button{height:auto}[type=search]{-webkit-appearance:textfield;outline-offset:-2px}::-webkit-search-decoration{-webkit-appearance:none}::-webkit-file-upload-button{-webkit-appearance:button;font:inherit}summary{display:list-item}blockquote,dd,dl,figure,h1,h2,h3,h4,h5,h6,hr,p,pre{margin:0}fieldset{margin:0}fieldset,legend{padding:0}menu,ol,ul{list-style:none;margin:0;padding:0}dialog{padding:0}textarea{resize:vertical}input::-moz-placeholder,textarea::-moz-placeholder{opacity:1;color:#9ca3af}input::placeholder,textarea::placeholder{opacity:1;color:#9ca3af}[role=button],button{cursor:pointer}:disabled{cursor:default}audio,canvas,embed,iframe,img,object,svg,video{display:block;vertical-align:middle}img,video{max-width:100%;height:auto}[hidden]:where(:not([hidden=until-found])){display:none}:root{color-scheme:light 
dark;--hakari-bg:#f4f1e8;--hakari-surface:#fffaf0;--hakari-surface-muted:#ebe6d9;--hakari-surface-faint:#f8f4eb;--hakari-border:#d3ccb9;--hakari-border-strong:#aaa18e;--hakari-text:#1d1b18;--hakari-text-muted:#70695c;--hakari-text-faint:#918978;--hakari-accent:#256d57;--hakari-accent-soft:#dceee4;--hakari-accent-border:#80b39d;--hakari-warn-bg:#f7e7c6;--hakari-warn-border:#c28f31;--hakari-warn-text:#704b08;--hakari-danger:#b24a57;--hakari-danger-soft:#f3d7da;--hakari-purple-soft:#e8e0ef;--hakari-row-hover:#ece4d2;--hakari-font:"SFMono-Regular","Cascadia Code","Roboto Mono","Noto Sans Mono","Yu Gothic UI","Meiryo",ui-monospace,monospace}body{background-color:var(--hakari-bg);color:var(--hakari-text);font-family:var(--hakari-font);-webkit-font-smoothing:antialiased;text-rendering:optimizeLegibility}body:before{background-image:linear-gradient(hsla(0,0%,100%,.18) 1px,transparent 0),linear-gradient(90deg,rgba(29,27,24,.05) 1px,transparent 0),radial-gradient(circle at 1px 1px,rgba(29,27,24,.08) 1px,transparent 0);background-size:100% 3rem,3rem 100%,4px 4px;content:"";inset:0;opacity:.36;pointer-events:none;position:fixed;z-index:-1}@media (prefers-color-scheme:dark){:root{--hakari-bg:#29292d;--hakari-surface:#242428;--hakari-surface-muted:#303035;--hakari-surface-faint:#2a2a2f;--hakari-border:#45474c;--hakari-border-strong:#5b625f;--hakari-text:#f0eee8;--hakari-text-muted:#aaa8a1;--hakari-text-faint:#7f7f7b;--hakari-accent:#8bd99a;--hakari-accent-soft:#23352c;--hakari-accent-border:#5a8d66;--hakari-warn-bg:#3d2c1c;--hakari-warn-border:#a46a22;--hakari-warn-text:#f0c36f;--hakari-danger:#ff7d8a;--hakari-danger-soft:#41272d;--hakari-purple-soft:#332b3c;--hakari-row-hover:#38393d}body:before{background-image:linear-gradient(hsla(0,0%,100%,.035) 1px,transparent 0),linear-gradient(90deg,hsla(0,0%,100%,.025) 1px,transparent 0),radial-gradient(circle at 1px 1px,hsla(0,0%,100%,.16) 1px,transparent 
0);opacity:.42}}.visible{visibility:visible}.static{position:static}.fixed{position:fixed}.absolute{position:absolute}.relative{position:relative}.sticky{position:sticky}.bottom-4{bottom:1rem}.right-4{right:1rem}.z-10{z-index:10}.z-20{z-index:20}.z-50{z-index:50}.mx-auto{margin-left:auto;margin-right:auto}.mb-1{margin-bottom:.25rem}.mb-2{margin-bottom:.5rem}.mb-3{margin-bottom:.75rem}.mb-4{margin-bottom:1rem}.mb-5{margin-bottom:1.25rem}.ml-2{margin-left:.5rem}.mt-1{margin-top:.25rem}.mt-2{margin-top:.5rem}.mt-3{margin-top:.75rem}.mt-6{margin-top:1.5rem}.block{display:block}.inline{display:inline}.flex{display:flex}.inline-flex{display:inline-flex}.table{display:table}.grid{display:grid}.hidden{display:none}.h-3\.5{height:.875rem}.h-4{height:1rem}.h-8{height:2rem}.max-h-60{max-height:15rem}.max-h-72{max-height:18rem}.max-h-80{max-height:20rem}.w-24{width:6rem}.w-3\.5{width:.875rem}.w-4{width:1rem}.w-72{width:18rem}.w-8{width:2rem}.w-\[4\.75rem\]{width:4.75rem}.w-\[min\(92vw\2c 42rem\)\]{width:min(92vw,42rem)}.w-full{width:100%}.min-w-0{min-width:0}.min-w-64{min-width:16rem}.min-w-\[28rem\]{min-width:28rem}.min-w-\[4\.75rem\]{min-width:4.75rem}.min-w-full{min-width:100%}.max-w-4xl{max-width:56rem}.max-w-\[1600px\]{max-width:1600px}.max-w-\[4\.75rem\]{max-width:4.75rem}.max-w-full{max-width:100%}.flex-1{flex:1 1 0%}.shrink-0{flex-shrink:0}.border-collapse{border-collapse:collapse}.transform{transform:translate(var(--tw-translate-x),var(--tw-translate-y)) rotate(var(--tw-rotate)) skewX(var(--tw-skew-x)) skewY(var(--tw-skew-y)) scaleX(var(--tw-scale-x)) scaleY(var(--tw-scale-y))}.cursor-pointer{cursor:pointer}.grid-cols-2{grid-template-columns:repeat(2,minmax(0,1fr))}.grid-cols-3{grid-template-columns:repeat(3,minmax(0,1fr))}.grid-cols-\[10rem_1fr\]{grid-template-columns:10rem 
1fr}.flex-wrap{flex-wrap:wrap}.items-start{align-items:flex-start}.items-end{align-items:flex-end}.items-center{align-items:center}.justify-start{justify-content:flex-start}.justify-end{justify-content:flex-end}.justify-center{justify-content:center}.justify-between{justify-content:space-between}.gap-0\.5{gap:.125rem}.gap-1{gap:.25rem}.gap-1\.5{gap:.375rem}.gap-2{gap:.5rem}.gap-3{gap:.75rem}.gap-x-2{-moz-column-gap:.5rem;column-gap:.5rem}.gap-x-3{-moz-column-gap:.75rem;column-gap:.75rem}.gap-x-5{-moz-column-gap:1.25rem;column-gap:1.25rem}.gap-y-1{row-gap:.25rem}.gap-y-2{row-gap:.5rem}.space-y-2>:not([hidden])~:not([hidden]){--tw-space-y-reverse:0;margin-top:calc(.5rem*(1 - var(--tw-space-y-reverse)));margin-bottom:calc(.5rem*var(--tw-space-y-reverse))}.space-y-3>:not([hidden])~:not([hidden]){--tw-space-y-reverse:0;margin-top:calc(.75rem*(1 - var(--tw-space-y-reverse)));margin-bottom:calc(.75rem*var(--tw-space-y-reverse))}.overflow-auto{overflow:auto}.overflow-x-auto{overflow-x:auto}.truncate{overflow:hidden;text-overflow:ellipsis}.truncate,.whitespace-nowrap{white-space:nowrap}.whitespace-pre-wrap{white-space:pre-wrap}.break-all{word-break:break-all}.rounded{border-radius:.25rem}.rounded-full{border-radius:9999px}.border{border-width:1px}.border-b{border-bottom-width:1px}.border-l{border-left-width:1px}.border-t{border-top-width:1px}.border-amber-200{--tw-border-opacity:1;border-color:rgb(253 230 138/var(--tw-border-opacity,1))}.border-cyan-200{--tw-border-opacity:1;border-color:rgb(165 243 252/var(--tw-border-opacity,1))}.border-cyan-700{--tw-border-opacity:1;border-color:rgb(14 116 144/var(--tw-border-opacity,1))}.border-violet-200{--tw-border-opacity:1;border-color:rgb(221 214 254/var(--tw-border-opacity,1))}.border-zinc-200{--tw-border-opacity:1;border-color:rgb(228 228 231/var(--tw-border-opacity,1))}.border-zinc-300{--tw-border-opacity:1;border-color:rgb(212 212 216/var(--tw-border-opacity,1))}.border-zinc-900{--tw-border-opacity:1;border-color:rgb(24 24 
27/var(--tw-border-opacity,1))}.bg-amber-50{--tw-bg-opacity:1;background-color:rgb(255 251 235/var(--tw-bg-opacity,1))}.bg-cyan-50{--tw-bg-opacity:1;background-color:rgb(236 254 255/var(--tw-bg-opacity,1))}.bg-inherit{background-color:inherit}.bg-violet-50{--tw-bg-opacity:1;background-color:rgb(245 243 255/var(--tw-bg-opacity,1))}.bg-white{--tw-bg-opacity:1;background-color:rgb(255 255 255/var(--tw-bg-opacity,1))}.bg-zinc-100{--tw-bg-opacity:1;background-color:rgb(244 244 245/var(--tw-bg-opacity,1))}.bg-zinc-50{--tw-bg-opacity:1;background-color:rgb(250 250 250/var(--tw-bg-opacity,1))}.bg-zinc-900{--tw-bg-opacity:1;background-color:rgb(24 24 27/var(--tw-bg-opacity,1))}.p-0{padding:0}.p-2{padding:.5rem}.p-3{padding:.75rem}.px-1{padding-left:.25rem;padding-right:.25rem}.px-1\.5{padding-left:.375rem;padding-right:.375rem}.px-2{padding-left:.5rem;padding-right:.5rem}.px-3{padding-left:.75rem;padding-right:.75rem}.px-4{padding-left:1rem;padding-right:1rem}.py-0{padding-top:0;padding-bottom:0}.py-0\.5{padding-top:.125rem;padding-bottom:.125rem}.py-1{padding-top:.25rem;padding-bottom:.25rem}.py-1\.5{padding-top:.375rem;padding-bottom:.375rem}.py-2{padding-top:.5rem;padding-bottom:.5rem}.py-3{padding-top:.75rem;padding-bottom:.75rem}.py-4{padding-top:1rem;padding-bottom:1rem}.py-5{padding-top:1.25rem;padding-bottom:1.25rem}.py-6{padding-top:1.5rem;padding-bottom:1.5rem}.pb-4{padding-bottom:1rem}.pl-0\.5{padding-left:.125rem}.pl-2{padding-left:.5rem}.pl-3{padding-left:.75rem}.pr-0{padding-right:0}.pr-2{padding-right:.5rem}.pt-1{padding-top:.25rem}.text-left{text-align:left}.text-center{text-align:center}.text-right{text-align:right}.align-middle{vertical-align:middle}.font-mono{font-family:ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,Liberation Mono,Courier 
New,monospace}.text-2xl{font-size:1.5rem;line-height:2rem}.text-\[0\.6875rem\]{font-size:.6875rem}.text-\[0\.8125rem\]{font-size:.8125rem}.text-\[9px\]{font-size:9px}.text-base{font-size:1rem;line-height:1.5rem}.text-lg{font-size:1.125rem;line-height:1.75rem}.text-sm{font-size:.875rem;line-height:1.25rem}.text-xl{font-size:1.25rem;line-height:1.75rem}.text-xs{font-size:.75rem;line-height:1rem}.font-medium{font-weight:500}.font-semibold{font-weight:600}.uppercase{text-transform:uppercase}.normal-case{text-transform:none}.tabular-nums{--tw-numeric-spacing:tabular-nums;font-variant-numeric:var(--tw-ordinal) var(--tw-slashed-zero) var(--tw-numeric-figure) var(--tw-numeric-spacing) var(--tw-numeric-fraction)}.leading-none{line-height:1}.leading-snug{line-height:1.375}.leading-tight{line-height:1.25}.text-amber-800{--tw-text-opacity:1;color:rgb(146 64 14/var(--tw-text-opacity,1))}.text-cyan-700{--tw-text-opacity:1;color:rgb(14 116 144/var(--tw-text-opacity,1))}.text-cyan-800{--tw-text-opacity:1;color:rgb(21 94 117/var(--tw-text-opacity,1))}.text-cyan-900{--tw-text-opacity:1;color:rgb(22 78 99/var(--tw-text-opacity,1))}.text-violet-800{--tw-text-opacity:1;color:rgb(91 33 182/var(--tw-text-opacity,1))}.text-white{--tw-text-opacity:1;color:rgb(255 255 255/var(--tw-text-opacity,1))}.text-zinc-400{--tw-text-opacity:1;color:rgb(161 161 170/var(--tw-text-opacity,1))}.text-zinc-500{--tw-text-opacity:1;color:rgb(113 113 122/var(--tw-text-opacity,1))}.text-zinc-600{--tw-text-opacity:1;color:rgb(82 82 91/var(--tw-text-opacity,1))}.text-zinc-700{--tw-text-opacity:1;color:rgb(63 63 70/var(--tw-text-opacity,1))}.text-zinc-800{--tw-text-opacity:1;color:rgb(39 39 42/var(--tw-text-opacity,1))}.text-zinc-900{--tw-text-opacity:1;color:rgb(24 24 27/var(--tw-text-opacity,1))}.text-zinc-950{--tw-text-opacity:1;color:rgb(9 9 
11/var(--tw-text-opacity,1))}.underline{text-decoration-line:underline}.underline-offset-2{text-underline-offset:2px}.accent-cyan-700{accent-color:#0e7490}.shadow-sm{--tw-shadow:0 1px 2px 0 rgba(0,0,0,.05);--tw-shadow-colored:0 1px 2px 0 var(--tw-shadow-color);box-shadow:var(--tw-ring-offset-shadow,0 0 #0000),var(--tw-ring-shadow,0 0 #0000),var(--tw-shadow)}.outline-none{outline:2px solid transparent;outline-offset:2px}.filter{filter:var(--tw-blur) var(--tw-brightness) var(--tw-contrast) var(--tw-grayscale) var(--tw-hue-rotate) var(--tw-invert) var(--tw-saturate) var(--tw-sepia) var(--tw-drop-shadow)}body.bg-zinc-50{background-color:var(--hakari-bg)}.bg-white{background-color:var(--hakari-surface)}.bg-zinc-50,.even\:bg-zinc-50:nth-child(2n){background-color:var(--hakari-surface-faint)}.bg-zinc-100{background-color:var(--hakari-surface-muted)}.bg-zinc-900{background-color:var(--hakari-accent)}.bg-cyan-50{background-color:var(--hakari-accent-soft)}.bg-amber-50{background-color:var(--hakari-warn-bg)}.bg-violet-50{background-color:var(--hakari-purple-soft)}.text-zinc-800,.text-zinc-900,.text-zinc-950{color:var(--hakari-text)}.text-zinc-600,.text-zinc-700{color:var(--hakari-text-muted)}.text-zinc-400,.text-zinc-500{color:var(--hakari-text-faint)}.text-white{color:var(--hakari-bg)}.text-cyan-700,.text-cyan-800,.text-cyan-900{color:var(--hakari-accent)}.text-amber-800{color:var(--hakari-warn-text)}.text-violet-800{color:#765a8f}.border-zinc-200,.border-zinc-300{border-color:var(--hakari-border)}.border-cyan-200,.border-cyan-500,.border-cyan-600,.border-cyan-700{border-color:var(--hakari-accent-border)}.border-amber-200{border-color:var(--hakari-warn-border)}.border-violet-200{border-color:#a78bbd}button,details,input,select,summary,textarea{border-color:var(--hakari-border)}input,select,textarea{background-color:var(--hakari-surface-faint);color:var(--hakari-text)}.hover\:text-cyan-700:hover,button:hover,summary:hover{color:var(--hakari-accent)}.model-detail-trigger,.model
-detail-trigger:hover{color:var(--hakari-text)}.hakari-icon{height:.875rem;width:.875rem}.doc-summary-trigger .hakari-icon,.tooltip-trigger .hakari-icon{height:.75rem;width:.75rem}.action-icon,.control-heading-icon,.filter-detail-icon,.section-heading-icon{color:var(--hakari-accent)}.leaderboard-row:hover>td{background-color:var(--hakari-row-hover)}.leaderboard-table-scroll{--hakari-model-col-width:clamp(18rem,40vw,36rem);--hakari-rank-col-width:4rem}.leaderboard-col-model{box-sizing:border-box;left:0;max-width:var(--hakari-model-col-width);min-width:var(--hakari-model-col-width);width:var(--hakari-model-col-width)}.leaderboard-col-rank{box-sizing:border-box;max-width:var(--hakari-rank-col-width);min-width:var(--hakari-rank-col-width);width:var(--hakari-rank-col-width)}.benchmark-doc h1{font-size:1.5rem;font-weight:600;line-height:2rem}.benchmark-doc h2{border-top:1px solid var(--hakari-border);font-size:1.125rem;font-weight:600;line-height:1.75rem;margin-top:1.5rem;padding-top:1rem}.benchmark-doc h3{font-size:1rem;font-weight:600;line-height:1.5rem;margin-top:1rem}.benchmark-doc blockquote,.benchmark-doc p,.benchmark-doc pre,.benchmark-doc table,.benchmark-doc ul{margin-top:.75rem}.benchmark-doc blockquote,.benchmark-doc li,.benchmark-doc p,.benchmark-doc td{color:var(--hakari-text-muted);font-size:.875rem;line-height:1.55}.benchmark-doc ul{list-style:disc;padding-left:1.25rem}.benchmark-doc a{color:var(--hakari-accent);text-decoration-line:underline;text-underline-offset:2px}.benchmark-doc table{min-width:100%}.benchmark-doc td{border-top:1px solid var(--hakari-border);padding:.375rem .5rem;vertical-align:top}.benchmark-doc code,.benchmark-doc pre{font-family:var(--hakari-font)}.benchmark-doc pre{border:1px solid var(--hakari-border);overflow-x:auto;padding:.75rem}.tooltip-trigger{cursor:pointer;position:relative}.tooltip-trigger:after{display:none}.global-tooltip{z-index:1000;max-width:min(24rem,calc(100vw - 
2rem));opacity:0;overflow-wrap:anywhere;pointer-events:none;text-transform:none;transition:opacity .12s ease-in-out;white-space:normal}.shadow-sm{box-shadow:0 1px 2px 0 rgba(0,0,0,.18)}.global-tooltip[data-visible=true]{opacity:1}.leaderboard-loading-toast{opacity:0;pointer-events:none;transform:translateY(.5rem);transition:opacity .16s ease-in-out,transform .16s ease-in-out}.leaderboard-loading-toast.htmx-request{opacity:1;transform:translateY(0)}[data-leaderboard-pending=true]{cursor:progress;opacity:.72}[data-leaderboard-pending=true]:after{animation:hakari-leaderboard-spin .7s linear infinite;border:2px solid;border-right:2px solid transparent;border-radius:9999px;content:"";display:inline-block;height:.55rem;margin-left:.375rem;vertical-align:-.075rem;width:.55rem}.task-z-score{box-sizing:border-box;display:inline-flex;width:3.75rem;min-width:3.75rem;flex-direction:column;align-items:flex-end;justify-content:center;border:1px solid rgba(29,27,24,.14);border-radius:0;line-height:1;padding:.1rem 
.3rem}.task-z-score-value{font-size:.8125rem;font-weight:400}.task-z-score-delta{margin-top:0;font-size:.5625rem;font-weight:400}.task-z-neutral{background-color:transparent;color:inherit}.task-z-pos-025{background-color:#f4f1df;color:#5a6335}.task-z-pos-050{background-color:#e8e5c4;color:#4f5b2e}.task-z-pos-075{background-color:#d7d49f;color:#3f4a24}.task-z-pos-100{background-color:#c5c27c;color:#30391b}.task-z-pos-125{background-color:#aaa85f;color:#252d16}.task-z-pos-150{background-color:#8c8f46;color:#fffaf0}.task-z-pos-175{background-color:#707735;color:#fffaf0}.task-z-pos-200{background-color:#566126;color:#fffaf0}.task-z-neg-025{background-color:#f7ebe4;color:#87513e}.task-z-neg-050{background-color:#efd8cb;color:#7a4234}.task-z-neg-075{background-color:#e5c0ae;color:#6e342c}.task-z-neg-100{background-color:#d99f88;color:#55251f}.task-z-neg-125{background-color:#c97c63;color:#3a1714}.task-z-neg-150{background-color:#ad5d4d;color:#fffaf0}.task-z-neg-175{background-color:#924436;color:#fffaf0}.task-z-neg-200{background-color:#733126;color:#fffaf0}@keyframes hakari-leaderboard-spin{to{transform:rotate(1turn)}}@media (prefers-reduced-motion:reduce){.leaderboard-loading-toast{transition:none}[data-leaderboard-pending=true]:after{animation:none}}@media (prefers-color-scheme:dark){.shadow-sm{box-shadow:0 1px 2px 0 
rgba(0,0,0,.5)}button,input,select,summary,textarea{color-scheme:dark}dialog::backdrop{background-color:rgba(24,24,27,.76)}.global-tooltip{border-color:var(--hakari-border-strong);background-color:var(--hakari-surface);color:var(--hakari-text)}.task-z-score{border-color:hsla(45,21%,93%,.22)}.task-z-pos-025{background-color:#022c22;color:#a7f3d0}.task-z-pos-050{background-color:#064e3b;color:#d1fae5}.task-z-pos-075{background-color:#065f46;color:#ecfdf5}.task-z-pos-100{background-color:#047857;color:#fff}.task-z-pos-125{background-color:#059669;color:#fff}.task-z-pos-150{background-color:#10b981;color:#022c22}.task-z-pos-175{background-color:#34d399;color:#022c22}.task-z-pos-200{background-color:#6ee7b7;color:#022c22}.task-z-neg-025{background-color:#4c0519;color:#fecdd3}.task-z-neg-050{background-color:#881337;color:#ffe4e6}.task-z-neg-075{background-color:#9f1239;color:#fff1f2}.task-z-neg-100{background-color:#be123c;color:#fff}.task-z-neg-125{background-color:#e11d48;color:#fff}.task-z-neg-150{background-color:#f43f5e;color:#fff}.task-z-neg-175{background-color:#e11d48;color:#fff}.task-z-neg-200{background-color:#be123c;color:#fff}}.\[index\:end\]{index:end}.\[overflow-wrap\:anywhere\]{overflow-wrap:anywhere}.backdrop\:bg-zinc-950\/35::backdrop{background-color:rgba(9,9,11,.35)}.odd\:bg-white:nth-child(odd){--tw-bg-opacity:1;background-color:rgb(255 255 255/var(--tw-bg-opacity,1));background-color:var(--hakari-surface)}.even\:bg-zinc-50:nth-child(2n){--tw-bg-opacity:1;background-color:rgb(250 250 250/var(--tw-bg-opacity,1))}.even\:bg-zinc-50:nth-child(2n)body{background-color:var(--hakari-bg)}.even\:bg-zinc-50:nth-child(2n){background-color:var(--hakari-surface-faint)}.hover\:border-cyan-500:hover{--tw-border-opacity:1;border-color:rgb(6 182 212/var(--tw-border-opacity,1))}.hover\:border-cyan-600:hover{--tw-border-opacity:1;border-color:rgb(8 145 178/var(--tw-border-opacity,1))}.hover\:text-cyan-700:hover{--tw-text-opacity:1;color:rgb(14 116 
144/var(--tw-text-opacity,1))}.hover\:underline:hover{text-decoration-line:underline}.hover\:text-cyan-700:hover{color:var(--hakari-accent)}.focus\:border-cyan-700:focus,.hover\:border-cyan-500:hover,.hover\:border-cyan-600:hover{border-color:var(--hakari-accent-border)}.focus\:border-cyan-700:focus{--tw-border-opacity:1}@media (min-width:640px){.sm\:grid-cols-3{grid-template-columns:repeat(3,minmax(0,1fr))}.sm\:grid-cols-5{grid-template-columns:repeat(5,minmax(0,1fr))}.sm\:px-6{padding-left:1.5rem;padding-right:1.5rem}}@media (min-width:1024px){.lg\:grid-cols-2{grid-template-columns:repeat(2,minmax(0,1fr))}.lg\:grid-cols-6{grid-template-columns:repeat(6,minmax(0,1fr))}}@media (min-width:1280px){.xl\:grid-cols-3{grid-template-columns:repeat(3,minmax(0,1fr))}}
hakari_bench/viewer/leaderboard.py CHANGED
@@ -233,6 +233,7 @@ class LeaderboardService:
             and not task_filter.strip()
             and not has_length_filters
             and language_filter_mode == "languages"
+            and (overall is None or not _overall_uses_grouped_components(overall))
         ):
             precomputed = _load_precomputed_leaderboard_rows(
                 duckdb_path=self.duckdb_path,
@@ -907,7 +908,7 @@ def _group_by_benchmark(rows: list[TaskScore]) -> dict[str, list[TaskScore]]:
 
 def _aggregate_overall_scores(rows: list[TaskScore], overall: OverallConfig) -> list[TaskScore]:
     component_by_benchmark = {component.name: component for component in overall.benchmark_components}
-    if not any(component.group_by is not None for component in component_by_benchmark.values()):
+    if not _overall_uses_grouped_components(overall):
         return rows
 
     expected_raw_tasks: dict[str, set[str]] = defaultdict(set)
@@ -1262,11 +1263,15 @@ def _language_filter_mode_for_view(config: ViewerConfig, view_name: str) -> Lang
 
 
 def _overall_metric_score_group(overall: OverallConfig) -> ScoreGroupConfig | None:
-    if not any(component.group_by is not None for component in overall.benchmark_components):
+    if not _overall_uses_grouped_components(overall):
         return None
     return ScoreGroupConfig(name="grouped_tasks", label="Grouped Tasks", group_by="task_key")
 
 
+def _overall_uses_grouped_components(overall: OverallConfig) -> bool:
+    return any(component.group_by is not None for component in overall.benchmark_components)
+
+
 def _record_display_model_names(records: list[TaskResultRow], *, include_variant_details: bool) -> list[str]:
     if not include_variant_details:
         return [record.model_name for record in records]
tests/test_cli.py CHANGED
@@ -66,6 +66,31 @@ def _default_dense_quantized_variants() -> list[dict[str, object]]:
     ]
 
 
+def _default_sparse_truncation_variants() -> list[dict[str, object]]:
+    return _sparse_truncation_grid_variants(
+        query_dims=[8, 16, 24, 32],
+        document_dims=[64, 128, 256, 512],
+    )
+
+
+def _sparse_truncation_grid_variants(
+    *,
+    query_dims: list[int],
+    document_dims: list[int],
+) -> list[dict[str, object]]:
+    variants: list[dict[str, object]] = []
+    for query_dim in query_dims:
+        for document_dim in document_dims:
+            variants.append(
+                _pipeline_variant(
+                    f"sparse_query_max_active_dims_{query_dim}_sparse_document_max_active_dims_{document_dim}",
+                    _truncate_sparse_max_dims_step(query_dim, target="query"),
+                    _truncate_sparse_max_dims_step(document_dim, target="corpus"),
+                )
+            )
+    return variants
+
+
 def _truncate_quantized_variants(*dims: int) -> list[dict[str, object]]:
     variants: list[dict[str, object]] = []
     for dim in dims:
@@ -465,7 +490,7 @@ def test_parse_args_no_default_keeps_explicit_truncate_variants_only() -> None:
     ]
 
 
-def test_parse_args_does_not_add_default_quantized_variants_to_sparse_models() -> None:
+def test_parse_args_defaults_to_sparse_truncation_grid_variants() -> None:
     args = parse_args(
         [
             "evaluate",
@@ -475,6 +500,20 @@ def test_parse_args_does_not_add_default_quantized_variants_to_sparse_models() -
         ]
     )
 
+    assert args.embedding_variants == _default_sparse_truncation_variants()
+
+
+def test_parse_args_can_disable_default_sparse_truncation_grid_variants() -> None:
+    args = parse_args(
+        [
+            "evaluate",
+            "sparse",
+            "--model",
+            "naver/splade-v3",
+            "--no-default-embedding-variants",
+        ]
+    )
+
     assert args.embedding_variants == []
 
 
@@ -973,6 +1012,7 @@ def test_parse_args_accepts_query_truncate_sparse_max_dims_embedding_variants()
     assert args.embedding_variants == [
         _pipeline_variant("sparse_query_max_active_dims_128", _truncate_sparse_max_dims_step(128, target="query")),
         _pipeline_variant("sparse_query_max_active_dims_64", _truncate_sparse_max_dims_step(64, target="query")),
+        *_default_sparse_truncation_variants(),
     ]
 
 
@@ -1007,52 +1047,19 @@ def test_parse_args_accepts_query_and_docs_truncate_sparse_max_dims_cross_produc
         ]
     )
 
+    explicit_variants = _sparse_truncation_grid_variants(
+        query_dims=[8, 16, 32],
+        document_dims=[64, 128, 256],
+    )
+    explicit_names = {str(variant["name"]) for variant in explicit_variants}
+
     assert args.embedding_variants == [
-        _pipeline_variant(
-            "sparse_query_max_active_dims_8_sparse_document_max_active_dims_64",
-            _truncate_sparse_max_dims_step(8, target="query"),
-            _truncate_sparse_max_dims_step(64, target="corpus"),
-        ),
-        _pipeline_variant(
-            "sparse_query_max_active_dims_8_sparse_document_max_active_dims_128",
-            _truncate_sparse_max_dims_step(8, target="query"),
-            _truncate_sparse_max_dims_step(128, target="corpus"),
-        ),
-        _pipeline_variant(
-            "sparse_query_max_active_dims_8_sparse_document_max_active_dims_256",
-            _truncate_sparse_max_dims_step(8, target="query"),
-            _truncate_sparse_max_dims_step(256, target="corpus"),
-        ),
-        _pipeline_variant(
-            "sparse_query_max_active_dims_16_sparse_document_max_active_dims_64",
-            _truncate_sparse_max_dims_step(16, target="query"),
-            _truncate_sparse_max_dims_step(64, target="corpus"),
-        ),
-        _pipeline_variant(
-            "sparse_query_max_active_dims_16_sparse_document_max_active_dims_128",
-            _truncate_sparse_max_dims_step(16, target="query"),
-            _truncate_sparse_max_dims_step(128, target="corpus"),
-        ),
-        _pipeline_variant(
-            "sparse_query_max_active_dims_16_sparse_document_max_active_dims_256",
-            _truncate_sparse_max_dims_step(16, target="query"),
-            _truncate_sparse_max_dims_step(256, target="corpus"),
-        ),
-        _pipeline_variant(
-            "sparse_query_max_active_dims_32_sparse_document_max_active_dims_64",
-            _truncate_sparse_max_dims_step(32, target="query"),
-            _truncate_sparse_max_dims_step(64, target="corpus"),
-        ),
-        _pipeline_variant(
-            "sparse_query_max_active_dims_32_sparse_document_max_active_dims_128",
-            _truncate_sparse_max_dims_step(32, target="query"),
-            _truncate_sparse_max_dims_step(128, target="corpus"),
-        ),
-        _pipeline_variant(
-            "sparse_query_max_active_dims_32_sparse_document_max_active_dims_256",
-            _truncate_sparse_max_dims_step(32, target="query"),
-            _truncate_sparse_max_dims_step(256, target="corpus"),
-        ),
+        *explicit_variants,
+        *[
+            variant
+            for variant in _default_sparse_truncation_variants()
+            if str(variant["name"]) not in explicit_names
+        ],
     ]
 
 
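
The updated cross-product test expects explicitly requested grid variants first, followed by any default grid variants whose names were not already requested. The sketch below reproduces only that ordering on variant names. `grid_variant_names` and `merge_with_defaults` are hypothetical helpers written for illustration, not functions from `hakari_bench`, and the real CLI may perform the merge differently.

```python
from itertools import product


def grid_variant_names(query_dims, document_dims):
    """Name one variant per (query, document) pair, query-major like the test helper."""
    return [
        f"sparse_query_max_active_dims_{q}_sparse_document_max_active_dims_{d}"
        for q, d in product(query_dims, document_dims)
    ]


def merge_with_defaults(explicit, defaults):
    """Hypothetical merge: explicit names first, then defaults not already present."""
    seen = set(explicit)
    return explicit + [name for name in defaults if name not in seen]


# Explicit grid from the test arguments, default grid from the README change.
explicit = grid_variant_names([8, 16, 32], [64, 128, 256])
defaults = grid_variant_names([8, 16, 24, 32], [64, 128, 256, 512])
merged = merge_with_defaults(explicit, defaults)
```

Under these assumptions the explicit 3x3 grid is a subset of the default 4x4 grid, so the merged list holds the 9 explicit names followed by the 7 remaining default names.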
 
tests/test_viewer.py CHANGED
@@ -56,7 +56,6 @@ def test_viewer_config_uses_all_core_and_grouped_overall_views() -> None:
         "MNanoBEIR",
         "NanoMMTEB-v2",
         "NanoRTEB",
-        "NanoMIRACL",
         "NanoMLDR",
         "NanoBRIGHT",
         "NanoLaw",
@@ -70,6 +69,15 @@ def test_viewer_config_uses_all_core_and_grouped_overall_views() -> None:
     core_overall = config.overall_for_view("Core")
     assert core_overall is not None
     assert core_overall.benchmark_names == core_benchmarks
+    assert [component.group_by for component in core_overall.benchmark_components] == [
+        "task_name",
+        None,
+        None,
+        None,
+        None,
+        None,
+        None,
+    ]
     grouped_overall = config.overall_for_view("Group")
     assert grouped_overall is not None
     assert [component.name for component in grouped_overall.benchmark_components] == [
@@ -135,7 +143,7 @@ def test_viewer_config_uses_all_core_and_grouped_overall_views() -> None:
     assert all(benchmark in config.overall.benchmark_names for benchmark in language_nanomteb_benchmarks)
     assert all(benchmark not in core_overall.benchmark_names for benchmark in language_nanomteb_benchmarks)
     assert "NanoMIRACL" in config.overall.benchmark_names
-    assert "NanoMIRACL" in core_overall.benchmark_names
+    assert "NanoMIRACL" not in core_overall.benchmark_names
     assert "NanoCMTEB" in config.view_names
     assert "NanoCMTEB" in config.overall.benchmark_names
     nano_law = config.benchmark_for_view("NanoLaw")
@@ -173,7 +181,7 @@ def test_core_benchmark_view_group_only_contains_primary_core_benchmarks() -> No
     assert _view_group("NanoMMTEB-v2") == "Core benchmarks"
     assert _view_group("MNanoBEIR") == "Core benchmarks"
     assert _view_group("NanoRTEB") == "Core benchmarks"
-    assert _view_group("NanoMIRACL") == "Core benchmarks"
+    assert _view_group("NanoMIRACL") == "Domain-specific"
     assert _view_group("NanoMLDR") == "Core benchmarks"
     assert _view_group("NanoBRIGHT") == "Core benchmarks"
     assert _view_group("NanoLaw") == "Core benchmarks"
@@ -602,6 +610,8 @@ def test_leaderboard_renders_grouped_benchmark_picker_and_sticky_columns(tmp_pat
     assert response.status_code == 200
     assert "Benchmark groups" in response.text
     assert 'data-icon="layers"' in response.text
+    assert 'class="border px-2 py-1 text-[0.8125rem] border-cyan-700 bg-cyan-50 text-cyan-900"' in response.text
+    assert 'class="border px-3 py-1.5 text-sm' not in response.text
     assert response.text.index("Target") < response.text.index("Overall")
     assert 'data-testid="primary-benchmark-column"' in response.text
     primary_column = response.text.split('data-testid="primary-benchmark-column"', 1)[1].split('data-testid="secondary-benchmark-column"', 1)[0]
@@ -1333,6 +1343,7 @@ def test_viewer_can_include_embedding_variants_in_ranking(tmp_path: Path) -> Non
     assert 'data-filter-icon="ruler"' in response.text
     assert 'data-filter-detail="quant_filter"' in response.text
     assert 'data-filter-icon="binary"' in response.text
+    assert 'summary class="cursor-pointer px-1.5 py-0.5 text-[0.8125rem] font-medium text-zinc-800"' in response.text
     assert "grid-cols-2" in response.text
     assert "sm:grid-cols-3" in response.text
     assert response.text.count(">All</button>") == 5
tests/test_viewer_browser.py CHANGED
@@ -56,6 +56,24 @@ def test_viewer_browser_smoke_covers_static_javascript(tmp_path: Path) -> None:
     assert section_icon_state["color"] != "rgb(0, 0, 0)"
     assert page.locator("button", has_text="Variant impact").locator("svg[data-icon='git-compare-arrows']").count() == 1
     assert page.locator("summary", has_text="Languages").locator("svg[data-icon='languages']").count() == 1
+    compact_tile_state = page.locator("button", has_text="Variant impact").first.evaluate(
+        """(el) => ({
+            fontSize: parseFloat(getComputedStyle(el).fontSize),
+            paddingLeft: parseFloat(getComputedStyle(el).paddingLeft),
+            paddingTop: parseFloat(getComputedStyle(el).paddingTop),
+        })"""
+    )
+    assert compact_tile_state["fontSize"] == pytest.approx(13.0, abs=0.1)
+    assert compact_tile_state["paddingLeft"] <= 8.0
+    assert compact_tile_state["paddingTop"] <= 4.0
+    language_tile_state = page.locator("nav[aria-label='Language pages'] button", has_text="All").first.evaluate(
+        """(el) => ({
+            fontSize: parseFloat(getComputedStyle(el).fontSize),
+            paddingLeft: parseFloat(getComputedStyle(el).paddingLeft),
+            paddingTop: parseFloat(getComputedStyle(el).paddingTop),
+        })"""
+    )
+    assert language_tile_state == pytest.approx({"fontSize": 13.0, "paddingLeft": 8.0, "paddingTop": 4.0}, abs=0.1)
     page.get_by_text("256d <- 384").wait_for(timeout=15_000)
     compact_table_state = page.locator("tbody tr:not([hidden]) td").first.evaluate(
         """(el) => ({
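
The pixel values asserted in the browser smoke test line up with the compact tile classes `px-2 py-1 text-[0.8125rem]` if the root font size is 16px. A minimal sketch of that arithmetic follows. The 16px root and the stock Tailwind spacing scale (`px-2` = 0.5rem, `py-1` = 0.25rem) are assumptions, and `rem_to_px` is a hypothetical helper, not part of the viewer.

```python
ROOT_FONT_SIZE_PX = 16.0  # assumption: browser default root font size


def rem_to_px(rem, root_px=ROOT_FONT_SIZE_PX):
    """Convert a CSS rem length to pixels for a given root font size."""
    return rem * root_px


# Expected computed style for a tile using `px-2 py-1 text-[0.8125rem]`.
compact_tile = {
    "fontSize": rem_to_px(0.8125),  # text-[0.8125rem]
    "paddingLeft": rem_to_px(0.5),  # px-2 on the stock Tailwind scale
    "paddingTop": rem_to_px(0.25),  # py-1 on the stock Tailwind scale
}
```

If the page or the test browser overrides the root font size, the computed values would scale with it.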