Compact leaderboard selection tiles
- README.md +13 -6
- config/viewer/overall.yaml +2 -2
- docs/benchmark_evaluation.md +17 -11
- docs/core_set_selection.md +41 -31
- docs/duckdb_schema.md +29 -25
- hakari_bench/cli.py +11 -2
- hakari_bench/embedding_variants.py +24 -0
- hakari_bench/viewer/app.py +20 -21
- hakari_bench/viewer/assets/app.css +1 -1
- hakari_bench/viewer/leaderboard.py +7 -2
- tests/test_cli.py +53 -46
- tests/test_viewer.py +14 -3
- tests/test_viewer_browser.py +18 -0
README.md
CHANGED
|
@@ -141,21 +141,28 @@ The selected limits are written to result JSON under
|
|
| 141 |
Sparse embedding metadata records `nnz_total`, `nnz_mean`, `nnz_median`,
|
| 142 |
`nnz_max`, and `density` for queries and documents.
|
| 143 |
|
| 144 |
-
|
| 145 |
-
|
| 146 |
|
| 147 |
```bash
|
| 148 |
uv run hakari-bench evaluate sparse \
|
| 149 |
--model naver/splade-v3 \
|
| 150 |
--dataset NanoBEIR-en \
|
| 151 |
-
--embedding-variant sparse-query-max-active-dims:
|
| 152 |
-
--embedding-variant sparse-
|
| 153 |
-
--embedding-variant-grid sparse-query-max-active-dims:8,16,32 sparse-document-max-active-dims:64,128,256
|
| 154 |
```
|
| 155 |
|
| 156 |
These variants keep the top absolute-value dimensions per query/document row
|
| 157 |
and record each derived result under `evaluation.embedding_evaluations`, like
|
| 158 |
-
dense `truncate:` variants.
|
| 159 |
|
| 160 |
Sparse embeddings intentionally do not support quantized embedding variants in
|
| 161 |
the CLI. Use post-encode sparse truncation variants for sparse footprint and
|
|
|
|
| 141 |
Sparse embedding metadata records `nnz_total`, `nnz_mean`, `nnz_median`,
|
| 142 |
`nnz_max`, and `density` for queries and documents.
|
| 143 |
|
| 144 |
+
Sparse evaluation automatically compares the default post-encode sparsity grid
|
| 145 |
+
from one full sparse model encode unless `--no-default-embedding-variants` is
|
| 146 |
+
set:
|
| 147 |
+
|
| 148 |
+
- query max active dims: `8,16,24,32`
|
| 149 |
+
- document max active dims: `64,128,256,512`
|
| 150 |
+
|
| 151 |
+
To add other sparsity limits, use post-encode embedding variants:
|
| 152 |
|
| 153 |
```bash
|
| 154 |
uv run hakari-bench evaluate sparse \
|
| 155 |
--model naver/splade-v3 \
|
| 156 |
--dataset NanoBEIR-en \
|
| 157 |
+
--embedding-variant sparse-query-max-active-dims:48 \
|
| 158 |
+
--embedding-variant-grid sparse-query-max-active-dims:48 sparse-document-max-active-dims:768
|
|
|
|
| 159 |
```
|
| 160 |
|
| 161 |
These variants keep the top absolute-value dimensions per query/document row
|
| 162 |
and record each derived result under `evaluation.embedding_evaluations`, like
|
| 163 |
+
dense `truncate:` variants. Use `--no-default-embedding-variants` when a sparse
|
| 164 |
+
run should write only the base no-limit result or only explicitly requested
|
| 165 |
+
sparse variants.
|
| 166 |
|
| 167 |
Sparse embeddings intentionally do not support quantized embedding variants in
|
| 168 |
the CLI. Use post-encode sparse truncation variants for sparse footprint and
|
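The README text above says the sparse variants "keep the top absolute-value dimensions per query/document row". A minimal sketch of that idea follows. It assumes a sparse row is held as a plain `dict` from dimension index to weight; the function name `keep_top_active_dims` and that representation are illustrative assumptions, not the hakari-bench implementation.

```python
from heapq import nlargest


def keep_top_active_dims(row: dict[int, float], max_active_dims: int) -> dict[int, float]:
    """Keep at most `max_active_dims` entries, chosen by largest absolute weight."""
    if max_active_dims <= 0:
        raise ValueError("max_active_dims must be positive")
    if len(row) <= max_active_dims:
        # Already sparse enough: return an unchanged copy.
        return dict(row)
    kept = nlargest(max_active_dims, row.items(), key=lambda item: abs(item[1]))
    # Sort by dimension index so the result is still an ordinary sparse row.
    return dict(sorted(kept))


# Toy query row (made-up weights): four active dimensions, limit of two.
query_row = {3: 0.2, 17: 1.5, 42: -0.9, 101: 0.05}
truncated = keep_top_active_dims(query_row, 2)
```

Because the truncation only needs the already-encoded row, it can be applied repeatedly with different limits without running the model again, which matches the "post-encode" wording in the README.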
config/viewer/overall.yaml
CHANGED
|
@@ -40,10 +40,10 @@ overalls:
|
|
| 40 |
- name: Core
|
| 41 |
label: Core
|
| 42 |
benchmarks:
|
| 43 |
-
- MNanoBEIR
|
|
|
|
| 44 |
- NanoMMTEB-v2
|
| 45 |
- NanoRTEB
|
| 46 |
-
- NanoMIRACL
|
| 47 |
- NanoMLDR
|
| 48 |
- NanoBRIGHT
|
| 49 |
- NanoLaw
|
|
|
|
| 40 |
- name: Core
|
| 41 |
label: Core
|
| 42 |
benchmarks:
|
| 43 |
+
- name: MNanoBEIR
|
| 44 |
+
group_by: task_name
|
| 45 |
- NanoMMTEB-v2
|
| 46 |
- NanoRTEB
|
|
|
|
| 47 |
- NanoMLDR
|
| 48 |
- NanoBRIGHT
|
| 49 |
- NanoLaw
|
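After this change the `benchmarks` list mixes plain strings with a mapping entry (`name` plus `group_by`). A loader therefore has to accept both shapes. The sketch below shows one way to normalize them, working on the already-parsed list; the `OverallComponent` class and `normalize_components` function are hypothetical names, not code from the repository.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class OverallComponent:  # hypothetical container, not from hakari-bench
    name: str
    group_by: str | None = None


def normalize_components(entries: list[str | dict[str, str]]) -> list[OverallComponent]:
    """Turn mixed string/mapping entries into a uniform list of components."""
    components = []
    for entry in entries:
        if isinstance(entry, str):
            # Plain string: benchmark name with no grouping.
            components.append(OverallComponent(name=entry))
        else:
            # Mapping: `name` is required, `group_by` is optional.
            components.append(OverallComponent(name=entry["name"], group_by=entry.get("group_by")))
    return components


# Only the entries visible in the hunk above are listed here.
core_benchmarks = [
    {"name": "MNanoBEIR", "group_by": "task_name"},
    "NanoMMTEB-v2",
    "NanoRTEB",
    "NanoMLDR",
    "NanoBRIGHT",
    "NanoLaw",
]
components = normalize_components(core_benchmarks)
```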
docs/benchmark_evaluation.md
CHANGED
|
@@ -222,32 +222,38 @@ uv run hakari-bench evaluate sparse \
|
|
| 222 |
```
|
| 223 |
|
| 224 |
Do not add dense quantized embedding variants for sparse/SPLADE-style models.
|
| 225 |
-
Sparse quantization is intentionally unsupported in the CLI.
|
| 226 |
-
|
|
|
|
| 227 |
|
| 228 |
-
|
| 229 |
|
| 230 |
```bash
|
| 231 |
uv run hakari-bench evaluate sparse \
|
| 232 |
--model MODEL_NAME \
|
| 233 |
--dataset DATASET_NAME \
|
| 234 |
-
--embedding-variant sparse-query-max-active-dims:
|
| 235 |
```
|
| 236 |
|
| 237 |
-
|
| 238 |
|
| 239 |
```bash
|
| 240 |
uv run hakari-bench evaluate sparse \
|
| 241 |
--model MODEL_NAME \
|
| 242 |
--dataset DATASET_NAME \
|
| 243 |
-
--embedding-variant sparse-query-max-active-dims:
|
| 244 |
-
--embedding-variant sparse-document-max-active-dims:64,128,256 \
|
| 245 |
-
--embedding-variant-grid sparse-query-max-active-dims:8,16,32 sparse-document-max-active-dims:64,128,256
|
| 246 |
```
|
| 247 |
|
| 248 |
-
|
| 249 |
-
|
| 250 |
-
|
|
|
|
| 251 |
|
| 252 |
## Late-Interaction, Reranker, And BM25
|
| 253 |
|
|
|
|
| 222 |
```
|
| 223 |
|
| 224 |
Do not add dense quantized embedding variants for sparse/SPLADE-style models.
|
| 225 |
+
Sparse quantization is intentionally unsupported in the CLI. Sparse runs
|
| 226 |
+
automatically include post-encode query/document max-active-dims grid variants
|
| 227 |
+
unless `--no-default-embedding-variants` is set:
|
| 228 |
|
| 229 |
+
- query max active dims: `8,16,24,32`
|
| 230 |
+
- document max active dims: `64,128,256,512`
|
| 231 |
+
|
| 232 |
+
These variants are derived after one full sparse model encode and do not run
|
| 233 |
+
additional model inference.
|
| 234 |
+
|
| 235 |
+
Additional query-only sparsity limits:
|
| 236 |
|
| 237 |
```bash
|
| 238 |
uv run hakari-bench evaluate sparse \
|
| 239 |
--model MODEL_NAME \
|
| 240 |
--dataset DATASET_NAME \
|
| 241 |
+
--embedding-variant sparse-query-max-active-dims:48
|
| 242 |
```
|
| 243 |
|
| 244 |
+
Additional query/document grids:
|
| 245 |
|
| 246 |
```bash
|
| 247 |
uv run hakari-bench evaluate sparse \
|
| 248 |
--model MODEL_NAME \
|
| 249 |
--dataset DATASET_NAME \
|
| 250 |
+
--embedding-variant-grid sparse-query-max-active-dims:48 sparse-document-max-active-dims:768
|
| 251 |
```
|
| 252 |
|
| 253 |
+
The base no-limit result is always included as
|
| 254 |
+
`evaluation.embedding_evaluations[0]`. Use `--no-default-embedding-variants`
|
| 255 |
+
when intentionally running only the base no-limit result or only explicitly
|
| 256 |
+
specified sparse variants.
|
| 257 |
|
| 258 |
## Late-Interaction, Reranker, And BM25
|
| 259 |
|
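The default grid documented above has four query limits and four document limits. Assuming a grid crosses every query limit with every document limit (an assumption suggested by the flag name and the `cross_values` parameter, not confirmed by this diff), the expansion can be sketched as follows. `expand_grid` is a hypothetical helper, not the project's parser.

```python
from itertools import product


def expand_grid(specs: list[str]) -> list[dict[str, int]]:
    """Expand 'name:v1,v2,...' specs into the cross product of their values."""
    axes = []
    for spec in specs:
        name, _, raw_values = spec.partition(":")
        axes.append([(name, int(value)) for value in raw_values.split(",")])
    # One dict per combination; the first spec varies slowest.
    return [dict(combo) for combo in product(*axes)]


default_grid = expand_grid(
    [
        "sparse-query-max-active-dims:8,16,24,32",
        "sparse-document-max-active-dims:64,128,256,512",
    ]
)
```

Under that assumption the default grid yields 16 query/document combinations, all derived from the same full sparse encode.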
docs/core_set_selection.md
CHANGED
|
@@ -12,13 +12,12 @@ set is:
|
|
| 12 |
1. `MNanoBEIR`
|
| 13 |
2. `NanoMMTEB-v2`
|
| 14 |
3. `NanoRTEB`
|
| 15 |
-
4. `
|
| 16 |
-
5. `
|
| 17 |
-
6. `
|
| 18 |
-
7. `
|
| 19 |
-
8. `NanoCoIR`
|
| 20 |
|
| 21 |
-
This document records why these
|
| 22 |
made by combining external adoption signals, source benchmark quality, task and
|
| 23 |
language diversity, overlap analysis, lexical baseline difficulty, and actual
|
| 24 |
dense-model score dispersion from the evaluated DuckDB warehouse. The goal was
|
|
@@ -26,18 +25,24 @@ not to maximize task count. It was to keep a compact set whose aggregate score
|
|
| 26 |
is interpretable, broad, and difficult to game by over-weighting one benchmark
|
| 27 |
family.
|
| 28 |
|
| 29 |
## Final Core Set
|
| 30 |
|
| 31 |
| Position | Nano set | Role in Core | Main reason for inclusion |
|
| 32 |
| ---: | --- | --- | --- |
|
| 33 |
-
| 1 | `MNanoBEIR` | Classical multilingual IR anchor | BEIR-style retrieval remains a common reference point
|
| 34 |
| 2 | `NanoMMTEB-v2` | Broad multilingual MTEB/MMTEB anchor | Represents modern MTEB-style retrieval coverage across many task types and languages. |
|
| 35 |
| 3 | `NanoRTEB` | Practical retrieval domains | Adds English RTEB-style applied retrieval tasks with strong model separation. |
|
| 36 |
-
| 4 | `
|
| 37 |
-
| 5 | `
|
| 38 |
-
| 6 | `
|
| 39 |
-
| 7 | `
|
| 40 |
-
| 8 | `NanoCoIR` | Code retrieval | Preserves a code-search dimension that is not captured by legal, long-document, or general IR tasks. |
|
| 41 |
|
| 42 |
## Pruned or Not Promoted Sets
|
| 43 |
|
|
@@ -47,6 +52,7 @@ the `All` view.
|
|
| 47 |
|
| 48 |
| Nano set | Decision | Reason |
|
| 49 |
| --- | --- | --- |
|
|
|
|
| 50 |
| `NanoLongEmbed` | Removed from the earlier Core proposal | Dense dispersion was good, but the set contains synthetic long-context probes such as passkey/needle-style tasks and has weaker external adoption than `NanoMLDR`. `NanoMLDR` gives a cleaner multilingual long-document retrieval signal. |
|
| 51 |
| `NanoBIRCO` | Replaced by `NanoLaw` | `NanoBIRCO` is valuable as a complex-objective stress test, but it is small, English-only, and has weaker paper and dataset adoption signals. `NanoLaw` provides a better Core domain slot. |
|
| 52 |
| `NanoDAPFAM` | Not promoted | Patent retrieval is distinctive, but dense model dispersion was very low and many tasks were floor-like. Better suited to a domain appendix. |
|
|
@@ -69,15 +75,15 @@ The Core set was chosen using five criteria.
|
|
| 69 |
|
| 70 |
2. Task diversity
|
| 71 |
|
| 72 |
-
The final set covers classical IR, broad MTEB/MMTEB retrieval,
|
| 73 |
-
applied retrieval,
|
| 74 |
reasoning retrieval, legal retrieval, and code retrieval.
|
| 75 |
|
| 76 |
3. Language diversity
|
| 77 |
|
| 78 |
The set contains broad multilingual groups (`MNanoBEIR`, `NanoMMTEB-v2`,
|
| 79 |
-
`
|
| 80 |
-
|
| 81 |
|
| 82 |
4. Low redundancy
|
| 83 |
|
|
@@ -112,12 +118,11 @@ Definitions:
|
|
| 112 |
- `low-var`: tasks with std <= 0.03.
|
| 113 |
- `healthy`: tasks with 0.25 < mean < 0.85 and std >= 0.05.
|
| 114 |
|
| 115 |
-
| Nano set |
|
| 116 |
-
| --- | ---
|
| 117 |
-
| `MNanoBEIR` | 182 | 0.5521 | 0.0476 | 0.1042 | 2 | 1 | 16 | 73 |
|
| 118 |
| `NanoMMTEB-v2` proxy | 18 | 0.5434 | 0.0572 | 0.1206 | 5 | 2 | 5 | 9 |
|
| 119 |
| `NanoRTEB` | 14 | 0.5954 | 0.0960 | 0.2203 | 0 | 0 | 0 | 11 |
|
| 120 |
-
| `NanoMIRACL` | 18 | 0.7880 | 0.0280 | 0.0597 | 1 | 0 | 12 | 1 |
|
| 121 |
| `NanoMLDR` | 13 | 0.5399 | 0.0844 | 0.1918 | 0 | 0 | 0 | 13 |
|
| 122 |
| `NanoBRIGHT` | 20 | 0.3289 | 0.1021 | 0.2436 | 0 | 2 | 0 | 14 |
|
| 123 |
| `NanoLaw` after Core overlap exclusions | 4 | 0.5634 | 0.0686 | 0.1516 | 0 | 0 | 0 | 4 |
|
|
@@ -133,9 +138,9 @@ This table explains several choices:
|
|
| 133 |
were healthy and because its external adoption signals are stronger.
|
| 134 |
- `NanoBRIGHT` and `NanoRTEB` were retained because they show high model
|
| 135 |
separation and few saturation artifacts.
|
| 136 |
-
- `NanoMIRACL` was
|
| 137 |
-
|
| 138 |
-
|
| 139 |
- `NanoLaw` was selected over `NanoBIRCO` after comparing domain coverage,
|
| 140 |
MTEB registration, citations, and effective Core overlap.
|
| 141 |
|
|
@@ -143,6 +148,7 @@ This table explains several choices:
|
|
| 143 |
|
| 144 |
| Nano set | Effective tasks | avg_mean | avg_std | p90-p10 | ceiling | floor | low-var | healthy | Interpretation |
|
| 145 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- |
|
|
|
|
| 146 |
| `NanoLongEmbed` | 6 | 0.6265 | 0.0911 | 0.2049 | 0 | 0 | 0 | 3 | Good dispersion, but weaker external signal and more synthetic long-context overlap than `NanoMLDR`. |
|
| 147 |
| `NanoBIRCO` | 5 | 0.2890 | 0.0618 | 0.1182 | 0 | 1 | 1 | 3 | Valuable hard benchmark, but smaller, English-only, and weaker external signal than `NanoLaw`. |
|
| 148 |
| `NanoDAPFAM` | 18 | 0.2870 | 0.0322 | 0.0754 | 0 | 6 | 8 | 0 | Too low-variance for Core, despite being domain-distinct. |
|
|
@@ -251,13 +257,17 @@ sets. `NanoBRIGHT` provides hard reasoning-heavy retrieval where BM25 is weak.
|
|
| 251 |
but not sufficient. `NanoCoIR` keeps a code retrieval axis whose failure modes
|
| 252 |
are different again.
|
| 253 |
|
| 254 |
-
## Overlap Policy
|
| 255 |
|
| 256 |
-
|
| 257 |
-
|
| 258 |
-
|
| 259 |
-
|
| 260 |
-
appropriate:
|
| 261 |
|
| 262 |
- `NanoAILACasedocs`
|
| 263 |
- `NanoAILAStatutes`
|
|
@@ -282,8 +292,8 @@ This selection should be revisited when one of the following changes:
|
|
| 282 |
- A new domain benchmark achieves both strong external adoption and strong model
|
| 283 |
separation.
|
| 284 |
- MTEB or MMTEB significantly changes the registered task catalog.
|
| 285 |
-
- Saturation increases on `
|
| 286 |
-
|
| 287 |
|
| 288 |
The Core set is not intended to replace the full `All` view. It is a compact
|
| 289 |
summary. Domain and language-specific diagnosis should still use `All`,
|
|
|
|
| 12 |
1. `MNanoBEIR`
|
| 13 |
2. `NanoMMTEB-v2`
|
| 14 |
3. `NanoRTEB`
|
| 15 |
+
4. `NanoMLDR`
|
| 16 |
+
5. `NanoBRIGHT`
|
| 17 |
+
6. `NanoLaw`
|
| 18 |
+
7. `NanoCoIR`
|
|
|
|
| 19 |
|
| 20 |
+
This document records why these seven Nano sets were selected. The decision was
|
| 21 |
made by combining external adoption signals, source benchmark quality, task and
|
| 22 |
language diversity, overlap analysis, lexical baseline difficulty, and actual
|
| 23 |
dense-model score dispersion from the evaluated DuckDB warehouse. The goal was
|
|
|
|
| 25 |
is interpretable, broad, and difficult to game by over-weighting one benchmark
|
| 26 |
family.
|
| 27 |
|
| 28 |
+
The Core score also uses configured aggregation units rather than blindly
|
| 29 |
+
averaging every raw task row. In particular, `MNanoBEIR` is aggregated by
|
| 30 |
+
`task_name`: an ArguAna-style task is first averaged across its language
|
| 31 |
+
variants and then contributes as one Core scoring unit. This preserves the
|
| 32 |
+
multilingual BEIR anchor without allowing the raw language x task matrix to
|
| 33 |
+
dominate the Core aggregate.
|
| 34 |
+
|
| 35 |
## Final Core Set
|
| 36 |
|
| 37 |
| Position | Nano set | Role in Core | Main reason for inclusion |
|
| 38 |
| ---: | --- | --- | --- |
|
| 39 |
+
| 1 | `MNanoBEIR` | Classical multilingual IR anchor | BEIR-style retrieval remains a common reference point; Core aggregates it by source task name so multilingual coverage does not dominate by raw row count. |
|
| 40 |
| 2 | `NanoMMTEB-v2` | Broad multilingual MTEB/MMTEB anchor | Represents modern MTEB-style retrieval coverage across many task types and languages. |
|
| 41 |
| 3 | `NanoRTEB` | Practical retrieval domains | Adds English RTEB-style applied retrieval tasks with strong model separation. |
|
| 42 |
+
| 4 | `NanoMLDR` | Multilingual long-document retrieval | Strong external adoption through BGE-M3/MLDR and excellent dense score dispersion across all languages. |
|
| 43 |
+
| 5 | `NanoBRIGHT` | Reasoning-heavy retrieval stress test | Hard tasks with high model separation and strong dataset usage signals. |
|
| 44 |
+
| 6 | `NanoLaw` | Legal-domain retrieval | A multilingual, multi-source legal retrieval group whose tasks are registered in MTEB and better supported than `NanoBIRCO` as a Core domain representative. |
|
| 45 |
+
| 7 | `NanoCoIR` | Code retrieval | Preserves a code-search dimension that is not captured by legal, long-document, or general IR tasks. |
|
|
|
|
| 46 |
|
| 47 |
## Pruned or Not Promoted Sets
|
| 48 |
|
|
|
|
| 52 |
|
| 53 |
| Nano set | Decision | Reason |
|
| 54 |
| --- | --- | --- |
|
| 55 |
+
| `NanoMIRACL` | Removed from Core after review | MIRACL remains a canonical multilingual benchmark, but the analyzed dense results showed substantial saturation and low model separation. Its role is better served by the `All` and benchmark-specific views than by the compact Core score. |
|
| 56 |
| `NanoLongEmbed` | Removed from the earlier Core proposal | Dense dispersion was good, but the set contains synthetic long-context probes such as passkey/needle-style tasks and has weaker external adoption than `NanoMLDR`. `NanoMLDR` gives a cleaner multilingual long-document retrieval signal. |
|
| 57 |
| `NanoBIRCO` | Replaced by `NanoLaw` | `NanoBIRCO` is valuable as a complex-objective stress test, but it is small, English-only, and has weaker paper and dataset adoption signals. `NanoLaw` provides a better Core domain slot. |
|
| 58 |
| `NanoDAPFAM` | Not promoted | Patent retrieval is distinctive, but dense model dispersion was very low and many tasks were floor-like. Better suited to a domain appendix. |
|
|
|
|
| 75 |
|
| 76 |
2. Task diversity
|
| 77 |
|
| 78 |
+
The final set covers classical multilingual IR, broad MTEB/MMTEB retrieval,
|
| 79 |
+
RTEB-style applied retrieval, multilingual long-document retrieval, hard
|
| 80 |
reasoning retrieval, legal retrieval, and code retrieval.
|
| 81 |
|
| 82 |
3. Language diversity
|
| 83 |
|
| 84 |
The set contains broad multilingual groups (`MNanoBEIR`, `NanoMMTEB-v2`,
|
| 85 |
+
`NanoMLDR`) while avoiding a Core made mostly of language-specific
|
| 86 |
+
MTEB-family views.
|
| 87 |
|
| 88 |
4. Low redundancy
|
| 89 |
|
|
|
|
| 118 |
- `low-var`: tasks with std <= 0.03.
|
| 119 |
- `healthy`: tasks with 0.25 < mean < 0.85 and std >= 0.05.
|
| 120 |
|
| 121 |
+
| Nano set | Dense analysis task rows | avg_mean | avg_std | p90-p10 | ceiling | floor | low-var | healthy |
|
| 122 |
+
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
|
| 123 |
+
| `MNanoBEIR` | 182 raw, grouped by `task_name` in Core | 0.5521 | 0.0476 | 0.1042 | 2 | 1 | 16 | 73 |
|
| 124 |
| `NanoMMTEB-v2` proxy | 18 | 0.5434 | 0.0572 | 0.1206 | 5 | 2 | 5 | 9 |
|
| 125 |
| `NanoRTEB` | 14 | 0.5954 | 0.0960 | 0.2203 | 0 | 0 | 0 | 11 |
|
|
|
|
| 126 |
| `NanoMLDR` | 13 | 0.5399 | 0.0844 | 0.1918 | 0 | 0 | 0 | 13 |
|
| 127 |
| `NanoBRIGHT` | 20 | 0.3289 | 0.1021 | 0.2436 | 0 | 2 | 0 | 14 |
|
| 128 |
| `NanoLaw` after Core overlap exclusions | 4 | 0.5634 | 0.0686 | 0.1516 | 0 | 0 | 0 | 4 |
|
|
|
|
| 138 |
were healthy and because its external adoption signals are stronger.
|
| 139 |
- `NanoBRIGHT` and `NanoRTEB` were retained because they show high model
|
| 140 |
separation and few saturation artifacts.
|
| 141 |
+
- `NanoMIRACL` was removed from Core because its recognition as a multilingual
|
| 142 |
+
benchmark did not offset the low dense-model dispersion observed in this
|
| 143 |
+
result warehouse.
|
| 144 |
- `NanoLaw` was selected over `NanoBIRCO` after comparing domain coverage,
|
| 145 |
MTEB registration, citations, and effective Core overlap.
|
| 146 |
|
|
|
|
| 148 |
|
| 149 |
| Nano set | Effective tasks | avg_mean | avg_std | p90-p10 | ceiling | floor | low-var | healthy | Interpretation |
|
| 150 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- |
|
| 151 |
+
| `NanoMIRACL` | 18 | 0.7880 | 0.0280 | 0.0597 | 1 | 0 | 12 | 1 | Canonical multilingual benchmark, but too saturated and low-variance for the compact Core score. |
|
| 152 |
| `NanoLongEmbed` | 6 | 0.6265 | 0.0911 | 0.2049 | 0 | 0 | 0 | 3 | Good dispersion, but weaker external signal and more synthetic long-context overlap than `NanoMLDR`. |
|
| 153 |
| `NanoBIRCO` | 5 | 0.2890 | 0.0618 | 0.1182 | 0 | 1 | 1 | 3 | Valuable hard benchmark, but smaller, English-only, and weaker external signal than `NanoLaw`. |
|
| 154 |
| `NanoDAPFAM` | 18 | 0.2870 | 0.0322 | 0.0754 | 0 | 6 | 8 | 0 | Too low-variance for Core, despite being domain-distinct. |
|
|
|
|
| 257 |
but not sufficient. `NanoCoIR` keeps a code retrieval axis whose failure modes
|
| 258 |
are different again.
|
| 259 |
|
| 260 |
+
## Aggregation and Overlap Policy
|
| 261 |
+
|
| 262 |
+
Core normally uses one scoring unit per raw task row, except for explicitly
|
| 263 |
+
configured grouped components. The important exception is `MNanoBEIR`, where
|
| 264 |
+
Core uses `group_by: task_name` so that each BEIR source task contributes once
|
| 265 |
+
after averaging across language variants.
|
| 266 |
|
| 267 |
+
Some benchmark configurations also define excluded tasks to prevent duplicate
|
| 268 |
+
source tasks from being counted twice in benchmark-specific views. For
|
| 269 |
+
`NanoLaw`, the following tasks overlap with `NanoRTEB` or `NanoMMTEB-v2` and
|
| 270 |
+
are excluded by the viewer configuration when appropriate:
|
|
|
|
| 271 |
|
| 272 |
- `NanoAILACasedocs`
|
| 273 |
- `NanoAILAStatutes`
|
|
|
|
| 292 |
- A new domain benchmark achieves both strong external adoption and strong model
|
| 293 |
separation.
|
| 294 |
- MTEB or MMTEB significantly changes the registered task catalog.
|
| 295 |
+
- Saturation increases on `NanoCoIR` or `NanoMMTEB-v2` enough to reduce their
|
| 296 |
+
usefulness as Core components.
|
| 297 |
|
| 298 |
The Core set is not intended to replace the full `All` view. It is a compact
|
| 299 |
summary. Domain and language-specific diagnosis should still use `All`,
|
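The `group_by: task_name` rule described in this document averages each source task across its language variants before it counts as one Core scoring unit. A small sketch of that aggregation is shown below. The row layout, the function name `group_scores`, and all scores are made-up illustrations, not values from the result warehouse.

```python
from collections import defaultdict
from statistics import fmean


def group_scores(rows: list[tuple[str, str, float]]) -> dict[str, float]:
    """Average (task_name, language, score) rows into one score per task_name."""
    by_task: dict[str, list[float]] = defaultdict(list)
    for task_name, _language, score in rows:
        by_task[task_name].append(score)
    return {task: fmean(scores) for task, scores in by_task.items()}


# Toy rows: one task with three language variants, one task with a single variant.
rows = [
    ("ArguAna", "en", 0.60),
    ("ArguAna", "ja", 0.50),
    ("ArguAna", "de", 0.40),
    ("SciFact", "en", 0.70),
]
units = group_scores(rows)

# Raw mean weights ArguAna three times; grouped mean weights each task once.
raw_mean = fmean(score for _, _, score in rows)
grouped_mean = fmean(units.values())
```

In this toy case the raw mean is pulled toward the task with more language rows, while the grouped mean gives both tasks equal weight, which is the effect the policy is meant to achieve.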
docs/duckdb_schema.md
CHANGED
|
@@ -726,10 +726,10 @@ Only models that have every expected task in the selected view are ranked.
|
|
| 726 |
3. Build the expected task set from the remaining rows.
|
| 727 |
4. Keep only models whose task-key set exactly matches the expected task set.
|
| 728 |
|
| 729 |
-
For
|
| 730 |
-
|
| 731 |
-
|
| 732 |
-
aggregated task set.
|
| 733 |
|
| 734 |
### Benchmark and Overall Views
|
| 735 |
|
|
@@ -740,20 +740,22 @@ For overall views:
|
|
| 740 |
|
| 741 |
- `All`: all configured benchmark views using raw task rows.
|
| 742 |
- `Core`: a compact curated set covering broad multilingual retrieval,
|
| 743 |
-
multilingual BEIR, English RTEB domains,
|
| 744 |
retrieval, reasoning-heavy retrieval, legal retrieval, and code retrieval
|
| 745 |
-
(`MNanoBEIR`, `NanoMMTEB-v2`, `NanoRTEB`, `
|
| 746 |
-
`
|
|
|
|
| 747 |
- `Group`: all configured benchmark views aggregated by each component's
|
| 748 |
`group_by` setting before ranking.
|
| 749 |
- `micro_mean`: mean over all included tasks with equal task weight.
|
| 750 |
- `macro_mean`: mean of benchmark-level means with equal benchmark weight.
|
| 751 |
- `mean_score`: `macro_mean` for overall views, task mean for benchmark views.
|
| 752 |
|
| 753 |
-
`All`
|
| 754 |
-
from `overall.yaml` to average tasks into benchmark-local
|
| 755 |
-
computing Borda and means. For task x language collections such as
|
| 756 |
-
`
|
|
|
|
| 757 |
|
| 758 |
Grouped overall views also expose the aggregated benchmark-local units as
|
| 759 |
metric columns. These columns use the aggregated `task_key` values, such as
|
|
@@ -898,11 +900,12 @@ choices:
|
|
| 898 |
`NanoBEIR-ja` -> `ja` or the NanoMIRACL split code, rather than expanding
|
| 899 |
every code in `languages`.
|
| 900 |
- Viewer benchmark groups put the compact curated Core set under
|
| 901 |
-
Core benchmarks. Other broader multilingual/domain suites
|
| 902 |
-
Domain-specific unless they are an official
|
| 903 |
-
such as `NanoJMTEB-v2`, `NanoFaMTEB-v2`,
|
| 904 |
-
`NanoCMTEB`. `NanoIndicQA`, `NanoMuPLeR`, and
|
| 905 |
-
Domain-specific by viewer policy even when they expose
|
|
|
|
| 906 |
|
| 907 |
The viewer logs timing records through the `hakari_bench.viewer` logger:
|
| 908 |
|
|
@@ -952,8 +955,9 @@ materialized by a custom build. It is keyed by `view_name`, `score_target`, and
|
|
| 952 |
the four display flags
|
| 953 |
`include_quantization_variants`, `include_truncate_variants`,
|
| 954 |
`include_rescore_variants`, and `include_other_variants`. The viewer uses this
|
| 955 |
-
mart when language filters, task-score columns,
|
| 956 |
-
|
|
|
|
| 957 |
`LeaderboardService` computation from task-score rows.
|
| 958 |
|
| 959 |
| column | type | meaning |
|
|
@@ -1233,9 +1237,9 @@ SELECT
|
|
| 1233 |
|
| 1234 |
### 3. Raw Overall View Leaderboard
|
| 1235 |
|
| 1236 |
-
For an overall view that uses raw tasks, such as `All`
|
| 1237 |
-
|
| 1238 |
-
|
| 1239 |
|
| 1240 |
```sql
|
| 1241 |
benchmark_means AS (
|
|
@@ -1290,11 +1294,11 @@ model_agg AS (
|
|
| 1290 |
The final result should return both `macro_mean` and `micro_mean`. `mean_rank`
|
| 1291 |
for overall views ranks `mean_score`, which is `macro_mean`.
|
| 1292 |
|
| 1293 |
-
### 4.
|
| 1294 |
|
| 1295 |
-
`Group` first
|
| 1296 |
-
Borda, means, and per-group metric
|
| 1297 |
-
`config/viewer/overall.yaml`.
|
| 1298 |
|
| 1299 |
```sql
|
| 1300 |
WITH
|
|
|
|
| 726 |
3. Build the expected task set from the remaining rows.
|
| 727 |
4. Keep only models whose task-key set exactly matches the expected task set.
|
| 728 |
|
| 729 |
+
For overall views with configured grouped components, such as `Core` and
|
| 730 |
+
`Group`, the viewer first checks raw task completeness within each
|
| 731 |
+
model/benchmark pair, then aggregates rows by the configured group key, and
|
| 732 |
+
finally applies the complete model rule again to the aggregated task set.
|
| 733 |
|
| 734 |
### Benchmark and Overall Views
|
| 735 |
|
|
|
|
| 740 |
|
| 741 |
- `All`: all configured benchmark views using raw task rows.
|
| 742 |
- `Core`: a compact curated set covering broad multilingual retrieval,
|
| 743 |
+
multilingual BEIR, English RTEB domains, multilingual long-document
|
| 744 |
retrieval, reasoning-heavy retrieval, legal retrieval, and code retrieval
|
| 745 |
+
(`MNanoBEIR`, `NanoMMTEB-v2`, `NanoRTEB`, `NanoMLDR`, `NanoBRIGHT`,
|
| 746 |
+
`NanoLaw`, and `NanoCoIR`). `MNanoBEIR` is grouped by `task_name` in Core so
|
| 747 |
+
each BEIR source task contributes once after averaging language variants.
|
| 748 |
- `Group`: all configured benchmark views aggregated by each component's
|
| 749 |
`group_by` setting before ranking.
|
| 750 |
- `micro_mean`: mean over all included tasks with equal task weight.
|
| 751 |
- `macro_mean`: mean of benchmark-level means with equal benchmark weight.
|
| 752 |
- `mean_score`: `macro_mean` for overall views, task mean for benchmark views.
|
| 753 |
|
| 754 |
+
`All` uses raw `task_key` values. `Core` and `Group` use any component-level
|
| 755 |
+
`group_by` settings from `overall.yaml` to average tasks into benchmark-local
|
| 756 |
+
units before computing Borda and means. For task x language collections such as
|
| 757 |
+
`MNanoBEIR`, Core and Group use the underlying task name (`task_name`) as the
|
| 758 |
+
grouped unit.
|
| 759 |
|
| 760 |
Grouped overall views also expose the aggregated benchmark-local units as
|
| 761 |
metric columns. These columns use the aggregated `task_key` values, such as
|
|
|
|
| 900 |
`NanoBEIR-ja` -> `ja` or the NanoMIRACL split code, rather than expanding
|
| 901 |
every code in `languages`.
|
| 902 |
- Viewer benchmark groups put the compact curated Core set under
|
| 903 |
+
Core benchmarks. Other broader multilingual/domain suites, including
|
| 904 |
+
`NanoMIRACL`, remain Domain-specific unless they are an official
|
| 905 |
+
language-specific NanoMTEB family such as `NanoJMTEB-v2`, `NanoFaMTEB-v2`,
|
| 906 |
+
`NanoRuMTEB`, `NanoVNMTEB`, or `NanoCMTEB`. `NanoIndicQA`, `NanoMuPLeR`, and
|
| 907 |
+
`NanoChemTEB` remain Domain-specific by viewer policy even when they expose
|
| 908 |
+
language pages.
|
| 909 |
|
| 910 |
The viewer logs timing records through the `hakari_bench.viewer` logger:
|
| 911 |
|
|
|
|
| 955 |
the four display flags
|
| 956 |
`include_quantization_variants`, `include_truncate_variants`,
|
| 957 |
`include_rescore_variants`, and `include_other_variants`. The viewer uses this
|
| 958 |
+
mart when language filters, task-score columns, task text filters, and
|
| 959 |
+
component-level overall grouping are not active. Those interactive and grouped
|
| 960 |
+
cases still fall back to the normal
|
| 961 |
`LeaderboardService` computation from task-score rows.
|
| 962 |
|
| 963 |
| column | type | meaning |
|
|
|
|
| 1237 |
|
| 1238 |
### 3. Raw Overall View Leaderboard
|
| 1239 |
|
| 1240 |
+
For an overall view that uses raw tasks, such as `All`, put the overall
|
| 1241 |
+
benchmarks into `selected_benchmarks` and replace `model_agg` with this
|
| 1242 |
+
version:
|
| 1243 |
|
| 1244 |
```sql
|
| 1245 |
benchmark_means AS (
|
|
|
|
| 1294 |
The final result should return both `macro_mean` and `micro_mean`. `mean_rank`
|
| 1295 |
for overall views ranks `mean_score`, which is `macro_mean`.
|
| 1296 |
|
| 1297 |
+
### 4. Grouped Overall View Leaderboard
|
| 1298 |
|
| 1299 |
+
Grouped overall views such as `Core` and `Group` first average raw tasks into
|
| 1300 |
+
benchmark-local groups, then compute Borda, means, and per-group metric
|
| 1301 |
+
columns. Generate `overall_components` from `config/viewer/overall.yaml`.
|
| 1302 |
|
| 1303 |
```sql
|
| 1304 |
WITH
|
hakari_bench/cli.py
CHANGED
|
@@ -24,6 +24,7 @@ from hakari_bench.embedding_variants import (
|
|
| 24 |
TORCH_SCORE_REPRESENTATION,
|
| 25 |
dense_embedding_variants,
|
| 26 |
parse_embedding_variants,
|
|
|
|
| 27 |
)
|
| 28 |
from hakari_bench.evaluation import LoadedIrDataset, load_ir_dataset, start_encode_pool, stop_encode_pool
|
| 29 |
from hakari_bench.model_cards import load_model_cards, write_evaluation_model_card
|
|
@@ -223,7 +224,8 @@ def _add_embedding_variant_args(parser: argparse.ArgumentParser) -> None:
|
|
| 223 |
"sparse-document-max-active-dims:DIM, normalize, int8, binary, "
|
| 224 |
"rescore:int8, rescore:binary, int8-rescore, or binary-rescore. "
|
| 225 |
"Dense runs automatically include full-dim quantized/rescore variants; "
|
| 226 |
-
"explicit truncate:DIM also expands to truncate x quantized/rescore variants."
|
|
|
|
| 227 |
),
|
| 228 |
)
|
| 229 |
parser.add_argument(
|
|
@@ -243,7 +245,8 @@ def _add_embedding_variant_args(parser: argparse.ArgumentParser) -> None:
|
|
| 243 |
action="store_true",
|
| 244 |
help=(
|
| 245 |
"Disable automatic dense int8/binary quantized and top-100 rescore variants, "
|
| 246 |
-
"including truncate x quantized/rescore expansion
|
|
|
|
| 247 |
),
|
| 248 |
)
|
| 249 |
|
|
@@ -381,6 +384,12 @@ def parse_args(argv: list[str] | None = None) -> argparse.Namespace:
|
|
| 381 |
args.embedding_variant_grid_values,
|
| 382 |
include_defaults=not args.no_default_embedding_variants,
|
| 383 |
)
|
| 384 |
else:
|
| 385 |
args.embedding_variants = parse_embedding_variants(
|
| 386 |
args.embedding_variant_values,
|
|
|
|
| 24 |
TORCH_SCORE_REPRESENTATION,
|
| 25 |
dense_embedding_variants,
|
| 26 |
parse_embedding_variants,
|
| 27 |
+
sparse_embedding_variants,
|
| 28 |
)
|
| 29 |
from hakari_bench.evaluation import LoadedIrDataset, load_ir_dataset, start_encode_pool, stop_encode_pool
|
| 30 |
from hakari_bench.model_cards import load_model_cards, write_evaluation_model_card
|
|
|
|
| 224 |
"sparse-document-max-active-dims:DIM, normalize, int8, binary, "
|
| 225 |
"rescore:int8, rescore:binary, int8-rescore, or binary-rescore. "
|
| 226 |
"Dense runs automatically include full-dim quantized/rescore variants; "
|
| 227 |
+
"explicit truncate:DIM also expands to truncate x quantized/rescore variants. "
|
| 228 |
+
"Sparse runs automatically include query/document max-active-dims grid variants."
|
| 229 |
),
|
| 230 |
)
|
| 231 |
parser.add_argument(
|
|
|
|
| 245 |
action="store_true",
|
| 246 |
help=(
|
| 247 |
"Disable automatic dense int8/binary quantized and top-100 rescore variants, "
|
| 248 |
+
"including truncate x quantized/rescore expansion, and automatic sparse "
|
| 249 |
+
"query/document max-active-dims grid variants."
|
| 250 |
),
|
| 251 |
)
|
| 252 |
|
|
|
|
| 384 |
args.embedding_variant_grid_values,
|
| 385 |
include_defaults=not args.no_default_embedding_variants,
|
| 386 |
)
|
| 387 |
+
elif args.model_type == "sparse":
|
| 388 |
+
args.embedding_variants = sparse_embedding_variants(
|
| 389 |
+
args.embedding_variant_values,
|
| 390 |
+
args.embedding_variant_grid_values,
|
| 391 |
+
include_defaults=not args.no_default_embedding_variants,
|
| 392 |
+
)
|
| 393 |
else:
|
| 394 |
args.embedding_variants = parse_embedding_variants(
|
| 395 |
args.embedding_variant_values,
|
hakari_bench/embedding_variants.py
CHANGED
|
@@ -53,6 +53,18 @@ def default_dense_quantized_embedding_variants() -> list[dict[str, Any]]:
|
|
| 53 |
return parse_embedding_variants(["int8,binary", "rescore:int8,binary"])
|
| 54 |
|
| 55 |
|
| 56 |
def dense_embedding_variants(
|
| 57 |
values: list[str] | None,
|
| 58 |
cross_values: list[list[str]] | None = None,
|
|
@@ -73,6 +85,18 @@ def dense_embedding_variants(
|
|
| 73 |
return _dedupe_variants([*variants, *auto_variants])
|
| 74 |
|
| 75 |
|
| 76 |
def _default_truncate_embedding_variants(dims: list[int]) -> list[dict[str, Any]]:
|
| 77 |
if not dims:
|
| 78 |
return []
|
|
|
|
| 53 |
return parse_embedding_variants(["int8,binary", "rescore:int8,binary"])
|
| 54 |
|
| 55 |
|
| 56 |
+
def default_sparse_truncation_embedding_variants() -> list[dict[str, Any]]:
|
| 57 |
+
return parse_embedding_variants(
|
| 58 |
+
None,
|
| 59 |
+
[
|
| 60 |
+
[
|
| 61 |
+
"sparse-query-max-active-dims:8,16,24,32",
|
| 62 |
+
"sparse-document-max-active-dims:64,128,256,512",
|
| 63 |
+
]
|
| 64 |
+
],
|
| 65 |
+
)
|
| 66 |
+
|
| 67 |
+
|
| 68 |
def dense_embedding_variants(
|
| 69 |
values: list[str] | None,
|
| 70 |
cross_values: list[list[str]] | None = None,
|
|
|
|
| 85 |
return _dedupe_variants([*variants, *auto_variants])
|
| 86 |
|
| 87 |
|
| 88 |
+
def sparse_embedding_variants(
|
| 89 |
+
values: list[str] | None,
|
| 90 |
+
cross_values: list[list[str]] | None = None,
|
| 91 |
+
*,
|
| 92 |
+
include_defaults: bool = True,
|
| 93 |
+
) -> list[dict[str, Any]]:
|
| 94 |
+
variants = parse_embedding_variants(values, cross_values)
|
| 95 |
+
if not include_defaults:
|
| 96 |
+
return variants
|
| 97 |
+
return _dedupe_variants([*variants, *default_sparse_truncation_embedding_variants()])
|
| 98 |
+
|
| 99 |
+
|
| 100 |
def _default_truncate_embedding_variants(dims: list[int]) -> list[dict[str, Any]]:
|
| 101 |
if not dims:
|
| 102 |
return []
|
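The default sparse grid added above pairs four query limits with four document limits. The sketch below shows how such a grid could expand and merge with explicit variants, assuming the grid is a plain cross product and that deduplication keeps the first occurrence. `expand_grid` and `dedupe` are hypothetical stand-ins: the real `parse_embedding_variants` and `_dedupe_variants` are not shown in this diff and may return a different structure.

```python
from itertools import product


def expand_grid(axes: list[str]) -> list[dict[str, int]]:
    """Expand 'name:v1,v2,...' axis specs into their cross product."""
    parsed = []
    for axis in axes:
        name, _, raw = axis.partition(":")
        parsed.append([(name, int(value)) for value in raw.split(",")])
    # One dict per combination, with the first axis varying slowest.
    return [dict(combo) for combo in product(*parsed)]


def dedupe(variants: list[dict[str, int]]) -> list[dict[str, int]]:
    """Drop repeated variants while preserving first-seen order."""
    seen = set()
    unique = []
    for variant in variants:
        key = tuple(sorted(variant.items()))
        if key not in seen:
            seen.add(key)
            unique.append(variant)
    return unique


# Axis strings copied from default_sparse_truncation_embedding_variants().
default_grid = expand_grid(
    [
        "sparse-query-max-active-dims:8,16,24,32",
        "sparse-document-max-active-dims:64,128,256,512",
    ]
)
# An explicit request that overlaps one of the default combinations.
explicit = expand_grid(
    ["sparse-query-max-active-dims:8", "sparse-document-max-active-dims:64"]
)
merged = dedupe([*explicit, *default_grid])
```

Under these assumptions the default grid contributes 16 combinations, and an explicit variant that already appears in the grid does not add a duplicate entry after merging.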
hakari_bench/viewer/app.py
CHANGED
|
@@ -713,11 +713,11 @@ def render_analysis_shell(*, view: str) -> str:
|
|
| 713 |
<p class="text-sm text-zinc-600">Use these panels for paper-facing variant, reranking, and Nano subset audits.</p>
|
| 714 |
</div>
|
| 715 |
<div class="flex flex-wrap gap-2">
|
| 716 |
-
<button type="button" class="inline-flex items-center gap-1.5 border border-zinc-300 px-
|
| 717 |
hx-get="/analysis?{escape(variant_query, quote=True)}" hx-target="#analysis-panel" hx-swap="innerHTML">{_icon_svg("git-compare-arrows", class_name="hakari-icon action-icon shrink-0")}<span>Variant impact</span></button>
|
| 718 |
-
<button type="button" class="inline-flex items-center gap-1.5 border border-zinc-300 px-
|
| 719 |
hx-get="/analysis?{escape(rerank_query, quote=True)}" hx-target="#analysis-panel" hx-swap="innerHTML">{_icon_svg("arrow-down-up", class_name="hakari-icon action-icon shrink-0")}<span>Reranking diagnostics</span></button>
|
| 720 |
-
<button type="button" class="inline-flex items-center gap-1.5 border border-zinc-300 px-
|
| 721 |
hx-get="/analysis?{escape(dataset_query, quote=True)}" hx-target="#analysis-panel" hx-swap="innerHTML">{_icon_svg("database", class_name="hakari-icon action-icon shrink-0")}<span>Dataset diagnostics</span></button>
|
| 722 |
</div>
|
| 723 |
</div>
|
|
@@ -803,7 +803,7 @@ def render_tabs(
|
|
| 803 |
doc = benchmark_docs.group_doc(view_name) if benchmark_docs is not None else None
|
| 804 |
if doc is None:
|
| 805 |
grouped_buttons[_view_group(view_name)].append(
|
| 806 |
-
f"""<button type="button" class="border px-
|
| 807 |
hx-get="{_leaderboard_url(query)}" hx-push-url="{_page_url(query_payload)}"
|
| 808 |
{_leaderboard_control_hx_attrs()}>
|
| 809 |
{escape(view_label)}
|
|
@@ -812,8 +812,8 @@ def render_tabs(
|
|
| 812 |
continue
|
| 813 |
doc_trigger = _render_doc_summary_trigger(doc=doc, label=f"{view_label} overview")
|
| 814 |
grouped_buttons[_view_group(view_name)].append(
|
| 815 |
-
f"""<span class="doc-label-group inline-flex items-center border text-
|
| 816 |
-
<button type="button" class="py-1
|
| 817 |
hx-get="{_leaderboard_url(query)}" hx-push-url="{_page_url(query_payload)}"
|
| 818 |
{_leaderboard_control_hx_attrs()}>
|
| 819 |
{escape(view_label)}
|
|
@@ -891,7 +891,7 @@ def _render_target_group(*, result: LeaderboardResult, sort: str, direction: str
|
|
| 891 |
query_payload["target"] = target
|
| 892 |
query = urlencode(query_payload, doseq=True)
|
| 893 |
buttons.append(
|
| 894 |
-
f"""<button type="button" class="border px-
|
| 895 |
hx-get="{_leaderboard_url(query)}" hx-push-url="{_page_url(query_payload)}"
|
| 896 |
{_leaderboard_control_hx_attrs()}>
|
| 897 |
{escape(label)}
|
|
@@ -925,7 +925,6 @@ def _view_group(view_name: str) -> str:
|
|
| 925 |
"MNanoBEIR",
|
| 926 |
"NanoMMTEB-v2",
|
| 927 |
"NanoRTEB",
|
| 928 |
-
"NanoMIRACL",
|
| 929 |
"NanoMLDR",
|
| 930 |
"NanoBRIGHT",
|
| 931 |
"NanoLaw",
|
|
@@ -981,7 +980,7 @@ def render_language_pages(
|
|
| 981 |
)
|
| 982 |
more = f"""
|
| 983 |
<details class="relative">
|
| 984 |
-
<summary class="cursor-pointer border border-zinc-300 bg-white px-
|
| 985 |
<div class="absolute z-10 mt-1 grid max-h-72 min-w-[28rem] grid-cols-3 gap-1 overflow-auto border border-zinc-300 bg-white p-2 shadow-sm sm:grid-cols-5">
|
| 986 |
{more_buttons}
|
| 987 |
</div>
|
|
@@ -989,7 +988,7 @@ def render_language_pages(
|
|
| 989 |
"""
|
| 990 |
return f"""
|
| 991 |
<nav class="mb-4 flex flex-wrap items-start gap-2" aria-label="Language pages">
|
| 992 |
-
{_control_label(icon="languages", text="Language pages", extra_class="pt-1
|
| 993 |
{''.join(buttons)}
|
| 994 |
{more}
|
| 995 |
</nav>
|
|
@@ -1020,7 +1019,7 @@ def _language_page_button(
|
|
| 1020 |
)
|
| 1021 |
query = urlencode(query_payload, doseq=True)
|
| 1022 |
data_attr = "" if option is None else f' data-language-page="{escape(option.code)}"'
|
| 1023 |
-
return f"""<button type="button"{data_attr} class="border px-
|
| 1024 |
hx-get="{_leaderboard_url(query)}" hx-push-url="{_page_url(query_payload)}"
|
| 1025 |
{_leaderboard_control_hx_attrs()}>{escape(label)}</button>"""
|
| 1026 |
|
|
@@ -1325,7 +1324,7 @@ def render_controls(
|
|
| 1325 |
else ""
|
| 1326 |
)
|
| 1327 |
return f"""
|
| 1328 |
-
<div class="mb-4 text-
|
| 1329 |
<form id="column-controls" class="flex flex-wrap items-center gap-x-5 gap-y-2"
|
| 1330 |
hx-get="/leaderboard" hx-push-url="true"
|
| 1331 |
{_leaderboard_control_hx_attrs()}
|
|
@@ -1381,13 +1380,13 @@ def render_controls(
|
|
| 1381 |
<label class="flex min-w-64 items-center gap-2">
|
| 1382 |
<span class="shrink-0 whitespace-nowrap font-medium text-zinc-800">Model name</span>
|
| 1383 |
<input id="model-filter-input" type="search" name="model_filter" value="{escape(filter_state.model_filter)}"
|
| 1384 |
-
class="w-72 max-w-full border border-zinc-300 bg-white px-2 py-1 text-
|
| 1385 |
autocomplete="off">
|
| 1386 |
</label>
|
| 1387 |
<label class="flex min-w-64 items-center gap-2">
|
| 1388 |
<span class="shrink-0 whitespace-nowrap font-medium text-zinc-800">Task name</span>
|
| 1389 |
<input id="task-filter-input" type="search" name="task_filter" value="{escape(filter_state.task_filter)}"
|
| 1390 |
-
class="w-72 max-w-full border border-zinc-300 bg-white px-2 py-1 text-
|
| 1391 |
autocomplete="off">
|
| 1392 |
</label>
|
| 1393 |
<label class="inline-flex items-center gap-2 pt-1">
|
|
@@ -1395,7 +1394,7 @@ def render_controls(
|
|
| 1395 |
<span class="whitespace-nowrap font-medium text-zinc-800">Recalculate Borda, Mean</span>
|
| 1396 |
{_render_help_tooltip("When enabled, Model name, Task name, and active facet filters narrow the ranking population before Borda and Mean are recomputed. With a Task name filter, Borda is computed from per-task ranks over the filtered tasks.")}
|
| 1397 |
</label>
|
| 1398 |
-
<button type="submit" class="border border-zinc-300 bg-zinc-50 px-
|
| 1399 |
Apply
|
| 1400 |
</button>
|
| 1401 |
<div id="facet-filters" class="flex flex-wrap items-start gap-3">
|
|
@@ -1442,7 +1441,7 @@ def _text_filter_hidden_fields(filter_state: FilterState) -> list[tuple[str, str
|
|
| 1442 |
|
| 1443 |
def _render_task_length_filter_inputs(filter_state: FilterState) -> str:
|
| 1444 |
input_class = (
|
| 1445 |
-
"w-24 border border-zinc-300 bg-white px-2 py-1 text-
|
| 1446 |
"focus:border-cyan-700"
|
| 1447 |
)
|
| 1448 |
active_classes = "border-cyan-700 bg-cyan-50" if filter_state.has_task_length_filters else "border-zinc-200 bg-zinc-50"
|
|
@@ -1451,7 +1450,7 @@ def _render_task_length_filter_inputs(filter_state: FilterState) -> str:
|
|
| 1451 |
"Tasks missing length metadata are excluded when a bound is set."
|
| 1452 |
)
|
| 1453 |
return f"""
|
| 1454 |
-
<fieldset class="flex flex-wrap items-center gap-2 border {active_classes} px-
|
| 1455 |
<legend class="inline-flex items-center gap-1 px-1 text-xs font-semibold uppercase text-zinc-500">
|
| 1456 |
<span>Task string length</span>
|
| 1457 |
{_render_help_tooltip(tooltip)}
|
|
@@ -1497,7 +1496,7 @@ def render_score_groups(*, result: LeaderboardResult, sort: str, direction: str,
|
|
| 1497 |
query = urlencode(query_payload, doseq=True)
|
| 1498 |
page_url = _page_url(query_payload)
|
| 1499 |
buttons.append(
|
| 1500 |
-
f"""<button type="button" class="border px-
|
| 1501 |
hx-get="{_leaderboard_url(query)}" hx-push-url="{page_url}"
|
| 1502 |
{_leaderboard_control_hx_attrs()}>
|
| 1503 |
{escape(score_group.label)}
|
|
@@ -1888,7 +1887,7 @@ def _variant_analysis_toggle(
|
|
| 1888 |
else "border-zinc-300 bg-white text-zinc-700 hover:border-cyan-600 hover:text-cyan-700"
|
| 1889 |
)
|
| 1890 |
return f"""
|
| 1891 |
-
<button type="button" class="border px-
|
| 1892 |
hx-get="/analysis?{escape(toggle_query, quote=True)}"
|
| 1893 |
hx-target="#analysis-panel" hx-swap="innerHTML">{escape(toggle_label)}</button>
|
| 1894 |
"""
|
|
@@ -2102,7 +2101,7 @@ def _render_filter_details(
|
|
| 2102 |
for value, label in options:
|
| 2103 |
checked = " checked" if value in selected_values else ""
|
| 2104 |
checkboxes.append(
|
| 2105 |
-
f"""<label class="flex min-w-0 items-center gap-2 whitespace-nowrap px-
|
| 2106 |
<input type="checkbox" name="{escape(name)}" value="{escape(value)}" class="h-4 w-4 accent-cyan-700"{checked}>
|
| 2107 |
<span>{escape(label)}</span>
|
| 2108 |
</label>"""
|
|
@@ -2113,7 +2112,7 @@ def _render_filter_details(
|
|
| 2113 |
none_page_url = _page_url(none_query)
|
| 2114 |
return f"""
|
| 2115 |
<details class="filter-detail border border-zinc-300 bg-white" data-filter-detail="{escape(name, quote=True)}" data-filter-icon="{escape(icon, quote=True)}">
|
| 2116 |
-
<summary class="cursor-pointer px-
|
| 2117 |
<span class="inline-flex items-center gap-1.5">
|
| 2118 |
{_icon_svg(icon, class_name="hakari-icon filter-detail-icon shrink-0")}
|
| 2119 |
<span>{escape(summary)}</span>
|
|
|
|
| 713 |
<p class="text-sm text-zinc-600">Use these panels for paper-facing variant, reranking, and Nano subset audits.</p>
|
| 714 |
</div>
|
| 715 |
<div class="flex flex-wrap gap-2">
|
| 716 |
+
<button type="button" class="inline-flex items-center gap-1.5 border border-zinc-300 px-2 py-1 text-[0.8125rem] text-zinc-800 hover:border-cyan-600 hover:text-cyan-700"
|
| 717 |
hx-get="/analysis?{escape(variant_query, quote=True)}" hx-target="#analysis-panel" hx-swap="innerHTML">{_icon_svg("git-compare-arrows", class_name="hakari-icon action-icon shrink-0")}<span>Variant impact</span></button>
|
| 718 |
+
<button type="button" class="inline-flex items-center gap-1.5 border border-zinc-300 px-2 py-1 text-[0.8125rem] text-zinc-800 hover:border-cyan-600 hover:text-cyan-700"
|
| 719 |
hx-get="/analysis?{escape(rerank_query, quote=True)}" hx-target="#analysis-panel" hx-swap="innerHTML">{_icon_svg("arrow-down-up", class_name="hakari-icon action-icon shrink-0")}<span>Reranking diagnostics</span></button>
|
| 720 |
+
<button type="button" class="inline-flex items-center gap-1.5 border border-zinc-300 px-2 py-1 text-[0.8125rem] text-zinc-800 hover:border-cyan-600 hover:text-cyan-700"
|
| 721 |
hx-get="/analysis?{escape(dataset_query, quote=True)}" hx-target="#analysis-panel" hx-swap="innerHTML">{_icon_svg("database", class_name="hakari-icon action-icon shrink-0")}<span>Dataset diagnostics</span></button>
|
| 722 |
</div>
|
| 723 |
</div>
|
|
|
|
| 803 |
doc = benchmark_docs.group_doc(view_name) if benchmark_docs is not None else None
|
| 804 |
if doc is None:
|
| 805 |
grouped_buttons[_view_group(view_name)].append(
|
| 806 |
+
f"""<button type="button" class="border px-2 py-1 text-[0.8125rem] {classes}"
|
| 807 |
hx-get="{_leaderboard_url(query)}" hx-push-url="{_page_url(query_payload)}"
|
| 808 |
{_leaderboard_control_hx_attrs()}>
|
| 809 |
{escape(view_label)}
|
|
|
|
| 812 |
continue
|
| 813 |
doc_trigger = _render_doc_summary_trigger(doc=doc, label=f"{view_label} overview")
|
| 814 |
grouped_buttons[_view_group(view_name)].append(
|
| 815 |
+
f"""<span class="doc-label-group inline-flex items-center border text-[0.8125rem] {classes}" data-doc-label-group="benchmark">
|
| 816 |
+
<button type="button" class="py-1 pl-2 pr-0 text-left"
|
| 817 |
hx-get="{_leaderboard_url(query)}" hx-push-url="{_page_url(query_payload)}"
|
| 818 |
{_leaderboard_control_hx_attrs()}>
|
| 819 |
{escape(view_label)}
|
|
|
|
| 891 |
query_payload["target"] = target
|
| 892 |
query = urlencode(query_payload, doseq=True)
|
| 893 |
buttons.append(
|
| 894 |
+
f"""<button type="button" class="border px-2 py-1 text-[0.8125rem] {classes}"
|
| 895 |
hx-get="{_leaderboard_url(query)}" hx-push-url="{_page_url(query_payload)}"
|
| 896 |
{_leaderboard_control_hx_attrs()}>
|
| 897 |
{escape(label)}
|
|
|
|
| 925 |
"MNanoBEIR",
|
| 926 |
"NanoMMTEB-v2",
|
| 927 |
"NanoRTEB",
|
|
|
|
| 928 |
"NanoMLDR",
|
| 929 |
"NanoBRIGHT",
|
| 930 |
"NanoLaw",
|
|
|
|
| 980 |
)
|
| 981 |
more = f"""
|
| 982 |
<details class="relative">
|
| 983 |
+
<summary class="cursor-pointer border border-zinc-300 bg-white px-2 py-1 text-[0.8125rem] text-zinc-700 hover:border-cyan-500 hover:text-cyan-700">More</summary>
|
| 984 |
<div class="absolute z-10 mt-1 grid max-h-72 min-w-[28rem] grid-cols-3 gap-1 overflow-auto border border-zinc-300 bg-white p-2 shadow-sm sm:grid-cols-5">
|
| 985 |
{more_buttons}
|
| 986 |
</div>
|
|
|
|
| 988 |
"""
|
| 989 |
return f"""
|
| 990 |
<nav class="mb-4 flex flex-wrap items-start gap-2" aria-label="Language pages">
|
| 991 |
+
{_control_label(icon="languages", text="Language pages", extra_class="pt-1 text-[0.8125rem]")}
|
| 992 |
{''.join(buttons)}
|
| 993 |
{more}
|
| 994 |
</nav>
|
|
|
|
| 1019 |
)
|
| 1020 |
query = urlencode(query_payload, doseq=True)
|
| 1021 |
data_attr = "" if option is None else f' data-language-page="{escape(option.code)}"'
|
| 1022 |
+
return f"""<button type="button"{data_attr} class="border px-2 py-1 text-[0.8125rem] {classes}"
|
| 1023 |
hx-get="{_leaderboard_url(query)}" hx-push-url="{_page_url(query_payload)}"
|
| 1024 |
{_leaderboard_control_hx_attrs()}>{escape(label)}</button>"""
|
| 1025 |
|
|
|
|
| 1324 |
else ""
|
| 1325 |
)
|
| 1326 |
return f"""
|
| 1327 |
+
<div class="mb-4 text-[0.8125rem] text-zinc-700">
|
| 1328 |
<form id="column-controls" class="flex flex-wrap items-center gap-x-5 gap-y-2"
|
| 1329 |
hx-get="/leaderboard" hx-push-url="true"
|
| 1330 |
{_leaderboard_control_hx_attrs()}
|
|
|
|
| 1380 |
<label class="flex min-w-64 items-center gap-2">
|
| 1381 |
<span class="shrink-0 whitespace-nowrap font-medium text-zinc-800">Model name</span>
|
| 1382 |
<input id="model-filter-input" type="search" name="model_filter" value="{escape(filter_state.model_filter)}"
|
| 1383 |
+
class="w-72 max-w-full border border-zinc-300 bg-white px-2 py-1 text-[0.8125rem] text-zinc-900 outline-none focus:border-cyan-700"
|
| 1384 |
autocomplete="off">
|
| 1385 |
</label>
|
| 1386 |
<label class="flex min-w-64 items-center gap-2">
|
| 1387 |
<span class="shrink-0 whitespace-nowrap font-medium text-zinc-800">Task name</span>
|
| 1388 |
<input id="task-filter-input" type="search" name="task_filter" value="{escape(filter_state.task_filter)}"
|
| 1389 |
+
class="w-72 max-w-full border border-zinc-300 bg-white px-2 py-1 text-[0.8125rem] text-zinc-900 outline-none focus:border-cyan-700"
|
| 1390 |
autocomplete="off">
|
| 1391 |
</label>
|
| 1392 |
<label class="inline-flex items-center gap-2 pt-1">
|
|
|
|
| 1394 |
<span class="whitespace-nowrap font-medium text-zinc-800">Recalculate Borda, Mean</span>
|
| 1395 |
{_render_help_tooltip("When enabled, Model name, Task name, and active facet filters narrow the ranking population before Borda and Mean are recomputed. With a Task name filter, Borda is computed from per-task ranks over the filtered tasks.")}
|
| 1396 |
</label>
|
| 1397 |
+
<button type="submit" class="border border-zinc-300 bg-zinc-50 px-2 py-0.5 text-[0.8125rem] font-medium text-zinc-800 hover:border-cyan-600 hover:text-cyan-700">
|
| 1398 |
Apply
|
| 1399 |
</button>
|
| 1400 |
<div id="facet-filters" class="flex flex-wrap items-start gap-3">
|
|
|
|
| 1441 |
|
| 1442 |
def _render_task_length_filter_inputs(filter_state: FilterState) -> str:
|
| 1443 |
input_class = (
|
| 1444 |
+
"w-24 border border-zinc-300 bg-white px-2 py-1 text-[0.8125rem] text-zinc-900 outline-none "
|
| 1445 |
"focus:border-cyan-700"
|
| 1446 |
)
|
| 1447 |
active_classes = "border-cyan-700 bg-cyan-50" if filter_state.has_task_length_filters else "border-zinc-200 bg-zinc-50"
|
|
|
|
| 1450 |
"Tasks missing length metadata are excluded when a bound is set."
|
| 1451 |
)
|
| 1452 |
return f"""
|
| 1453 |
+
<fieldset class="flex flex-wrap items-center gap-2 border {active_classes} px-1.5 py-1">
|
| 1454 |
<legend class="inline-flex items-center gap-1 px-1 text-xs font-semibold uppercase text-zinc-500">
|
| 1455 |
<span>Task string length</span>
|
| 1456 |
{_render_help_tooltip(tooltip)}
|
|
|
|
| 1496 |
query = urlencode(query_payload, doseq=True)
|
| 1497 |
page_url = _page_url(query_payload)
|
| 1498 |
buttons.append(
|
| 1499 |
+
f"""<button type="button" class="border px-2 py-1 text-[0.8125rem] {classes}"
|
| 1500 |
hx-get="{_leaderboard_url(query)}" hx-push-url="{page_url}"
|
| 1501 |
{_leaderboard_control_hx_attrs()}>
|
| 1502 |
{escape(score_group.label)}
|
|
|
|
| 1887 |
else "border-zinc-300 bg-white text-zinc-700 hover:border-cyan-600 hover:text-cyan-700"
|
| 1888 |
)
|
| 1889 |
return f"""
|
| 1890 |
+
<button type="button" class="border px-2 py-1 text-[0.8125rem] {toggle_classes}"
|
| 1891 |
hx-get="/analysis?{escape(toggle_query, quote=True)}"
|
| 1892 |
hx-target="#analysis-panel" hx-swap="innerHTML">{escape(toggle_label)}</button>
|
| 1893 |
"""
|
|
|
|
| 2101 |
for value, label in options:
|
| 2102 |
checked = " checked" if value in selected_values else ""
|
| 2103 |
checkboxes.append(
|
| 2104 |
+
f"""<label class="flex min-w-0 items-center gap-2 whitespace-nowrap px-1.5 py-0.5">
|
| 2105 |
<input type="checkbox" name="{escape(name)}" value="{escape(value)}" class="h-4 w-4 accent-cyan-700"{checked}>
|
| 2106 |
<span>{escape(label)}</span>
|
| 2107 |
</label>"""
|
|
|
|
| 2112 |
none_page_url = _page_url(none_query)
|
| 2113 |
return f"""
|
| 2114 |
<details class="filter-detail border border-zinc-300 bg-white" data-filter-detail="{escape(name, quote=True)}" data-filter-icon="{escape(icon, quote=True)}">
|
| 2115 |
+
<summary class="cursor-pointer px-1.5 py-0.5 text-[0.8125rem] font-medium text-zinc-800">
|
| 2116 |
<span class="inline-flex items-center gap-1.5">
|
| 2117 |
{_icon_svg(icon, class_name="hakari-icon filter-detail-icon shrink-0")}
|
| 2118 |
<span>{escape(summary)}</span>
|
hakari_bench/viewer/assets/app.css
CHANGED
|
@@ -1 +1 @@
|
|
| 1 |
-
*,:after,:before{--tw-border-spacing-x:0;--tw-border-spacing-y:0;--tw-translate-x:0;--tw-translate-y:0;--tw-rotate:0;--tw-skew-x:0;--tw-skew-y:0;--tw-scale-x:1;--tw-scale-y:1;--tw-pan-x: ;--tw-pan-y: ;--tw-pinch-zoom: ;--tw-scroll-snap-strictness:proximity;--tw-gradient-from-position: ;--tw-gradient-via-position: ;--tw-gradient-to-position: ;--tw-ordinal: ;--tw-slashed-zero: ;--tw-numeric-figure: ;--tw-numeric-spacing: ;--tw-numeric-fraction: ;--tw-ring-inset: ;--tw-ring-offset-width:0px;--tw-ring-offset-color:#fff;--tw-ring-color:rgba(59,130,246,.5);--tw-ring-offset-shadow:0 0 #0000;--tw-ring-shadow:0 0 #0000;--tw-shadow:0 0 #0000;--tw-shadow-colored:0 0 #0000;--tw-blur: ;--tw-brightness: ;--tw-contrast: ;--tw-grayscale: ;--tw-hue-rotate: ;--tw-invert: ;--tw-saturate: ;--tw-sepia: ;--tw-drop-shadow: ;--tw-backdrop-blur: ;--tw-backdrop-brightness: ;--tw-backdrop-contrast: ;--tw-backdrop-grayscale: ;--tw-backdrop-hue-rotate: ;--tw-backdrop-invert: ;--tw-backdrop-opacity: ;--tw-backdrop-saturate: ;--tw-backdrop-sepia: ;--tw-contain-size: ;--tw-contain-layout: ;--tw-contain-paint: ;--tw-contain-style: }::backdrop{--tw-border-spacing-x:0;--tw-border-spacing-y:0;--tw-translate-x:0;--tw-translate-y:0;--tw-rotate:0;--tw-skew-x:0;--tw-skew-y:0;--tw-scale-x:1;--tw-scale-y:1;--tw-pan-x: ;--tw-pan-y: ;--tw-pinch-zoom: ;--tw-scroll-snap-strictness:proximity;--tw-gradient-from-position: ;--tw-gradient-via-position: ;--tw-gradient-to-position: ;--tw-ordinal: ;--tw-slashed-zero: ;--tw-numeric-figure: ;--tw-numeric-spacing: ;--tw-numeric-fraction: ;--tw-ring-inset: ;--tw-ring-offset-width:0px;--tw-ring-offset-color:#fff;--tw-ring-color:rgba(59,130,246,.5);--tw-ring-offset-shadow:0 0 #0000;--tw-ring-shadow:0 0 #0000;--tw-shadow:0 0 #0000;--tw-shadow-colored:0 0 #0000;--tw-blur: ;--tw-brightness: ;--tw-contrast: ;--tw-grayscale: ;--tw-hue-rotate: ;--tw-invert: ;--tw-saturate: ;--tw-sepia: ;--tw-drop-shadow: ;--tw-backdrop-blur: ;--tw-backdrop-brightness: ;--tw-backdrop-contrast: 
;--tw-backdrop-grayscale: ;--tw-backdrop-hue-rotate: ;--tw-backdrop-invert: ;--tw-backdrop-opacity: ;--tw-backdrop-saturate: ;--tw-backdrop-sepia: ;--tw-contain-size: ;--tw-contain-layout: ;--tw-contain-paint: ;--tw-contain-style: }/*! tailwindcss v3.4.17 | MIT License | https://tailwindcss.com*/*,:after,:before{box-sizing:border-box;border:0 solid #e5e7eb}:after,:before{--tw-content:""}:host,html{line-height:1.5;-webkit-text-size-adjust:100%;-moz-tab-size:4;-o-tab-size:4;tab-size:4;font-family:ui-sans-serif,system-ui,sans-serif,Apple Color Emoji,Segoe UI Emoji,Segoe UI Symbol,Noto Color Emoji;font-feature-settings:normal;font-variation-settings:normal;-webkit-tap-highlight-color:transparent}body{margin:0;line-height:inherit}hr{height:0;color:inherit;border-top-width:1px}abbr:where([title]){-webkit-text-decoration:underline dotted;text-decoration:underline dotted}h1,h2,h3,h4,h5,h6{font-size:inherit;font-weight:inherit}a{color:inherit;text-decoration:inherit}b,strong{font-weight:bolder}code,kbd,pre,samp{font-family:ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,Liberation Mono,Courier 
New,monospace;font-feature-settings:normal;font-variation-settings:normal;font-size:1em}small{font-size:80%}sub,sup{font-size:75%;line-height:0;position:relative;vertical-align:baseline}sub{bottom:-.25em}sup{top:-.5em}table{text-indent:0;border-color:inherit;border-collapse:collapse}button,input,optgroup,select,textarea{font-family:inherit;font-feature-settings:inherit;font-variation-settings:inherit;font-size:100%;font-weight:inherit;line-height:inherit;letter-spacing:inherit;color:inherit;margin:0;padding:0}button,select{text-transform:none}button,input:where([type=button]),input:where([type=reset]),input:where([type=submit]){-webkit-appearance:button;background-color:transparent;background-image:none}:-moz-focusring{outline:auto}:-moz-ui-invalid{box-shadow:none}progress{vertical-align:baseline}::-webkit-inner-spin-button,::-webkit-outer-spin-button{height:auto}[type=search]{-webkit-appearance:textfield;outline-offset:-2px}::-webkit-search-decoration{-webkit-appearance:none}::-webkit-file-upload-button{-webkit-appearance:button;font:inherit}summary{display:list-item}blockquote,dd,dl,figure,h1,h2,h3,h4,h5,h6,hr,p,pre{margin:0}fieldset{margin:0}fieldset,legend{padding:0}menu,ol,ul{list-style:none;margin:0;padding:0}dialog{padding:0}textarea{resize:vertical}input::-moz-placeholder,textarea::-moz-placeholder{opacity:1;color:#9ca3af}input::placeholder,textarea::placeholder{opacity:1;color:#9ca3af}[role=button],button{cursor:pointer}:disabled{cursor:default}audio,canvas,embed,iframe,img,object,svg,video{display:block;vertical-align:middle}img,video{max-width:100%;height:auto}[hidden]:where(:not([hidden=until-found])){display:none}:root{color-scheme:light 
dark;--hakari-bg:#f4f1e8;--hakari-surface:#fffaf0;--hakari-surface-muted:#ebe6d9;--hakari-surface-faint:#f8f4eb;--hakari-border:#d3ccb9;--hakari-border-strong:#aaa18e;--hakari-text:#1d1b18;--hakari-text-muted:#70695c;--hakari-text-faint:#918978;--hakari-accent:#256d57;--hakari-accent-soft:#dceee4;--hakari-accent-border:#80b39d;--hakari-warn-bg:#f7e7c6;--hakari-warn-border:#c28f31;--hakari-warn-text:#704b08;--hakari-danger:#b24a57;--hakari-danger-soft:#f3d7da;--hakari-purple-soft:#e8e0ef;--hakari-row-hover:#ece4d2;--hakari-font:"SFMono-Regular","Cascadia Code","Roboto Mono","Noto Sans Mono","Yu Gothic UI","Meiryo",ui-monospace,monospace}body{background-color:var(--hakari-bg);color:var(--hakari-text);font-family:var(--hakari-font);-webkit-font-smoothing:antialiased;text-rendering:optimizeLegibility}body:before{background-image:linear-gradient(hsla(0,0%,100%,.18) 1px,transparent 0),linear-gradient(90deg,rgba(29,27,24,.05) 1px,transparent 0),radial-gradient(circle at 1px 1px,rgba(29,27,24,.08) 1px,transparent 0);background-size:100% 3rem,3rem 100%,4px 4px;content:"";inset:0;opacity:.36;pointer-events:none;position:fixed;z-index:-1}@media (prefers-color-scheme:dark){:root{--hakari-bg:#29292d;--hakari-surface:#242428;--hakari-surface-muted:#303035;--hakari-surface-faint:#2a2a2f;--hakari-border:#45474c;--hakari-border-strong:#5b625f;--hakari-text:#f0eee8;--hakari-text-muted:#aaa8a1;--hakari-text-faint:#7f7f7b;--hakari-accent:#8bd99a;--hakari-accent-soft:#23352c;--hakari-accent-border:#5a8d66;--hakari-warn-bg:#3d2c1c;--hakari-warn-border:#a46a22;--hakari-warn-text:#f0c36f;--hakari-danger:#ff7d8a;--hakari-danger-soft:#41272d;--hakari-purple-soft:#332b3c;--hakari-row-hover:#38393d}body:before{background-image:linear-gradient(hsla(0,0%,100%,.035) 1px,transparent 0),linear-gradient(90deg,hsla(0,0%,100%,.025) 1px,transparent 0),radial-gradient(circle at 1px 1px,hsla(0,0%,100%,.16) 1px,transparent 
0);opacity:.42}}.visible{visibility:visible}.static{position:static}.fixed{position:fixed}.absolute{position:absolute}.relative{position:relative}.sticky{position:sticky}.bottom-4{bottom:1rem}.right-4{right:1rem}.z-10{z-index:10}.z-20{z-index:20}.z-50{z-index:50}.mx-auto{margin-left:auto;margin-right:auto}.mb-1{margin-bottom:.25rem}.mb-2{margin-bottom:.5rem}.mb-3{margin-bottom:.75rem}.mb-4{margin-bottom:1rem}.mb-5{margin-bottom:1.25rem}.ml-2{margin-left:.5rem}.mt-1{margin-top:.25rem}.mt-2{margin-top:.5rem}.mt-3{margin-top:.75rem}.mt-6{margin-top:1.5rem}.block{display:block}.inline{display:inline}.flex{display:flex}.inline-flex{display:inline-flex}.table{display:table}.grid{display:grid}.hidden{display:none}.h-3\.5{height:.875rem}.h-4{height:1rem}.h-8{height:2rem}.max-h-60{max-height:15rem}.max-h-72{max-height:18rem}.max-h-80{max-height:20rem}.w-24{width:6rem}.w-3\.5{width:.875rem}.w-4{width:1rem}.w-72{width:18rem}.w-8{width:2rem}.w-\[4\.75rem\]{width:4.75rem}.w-\[min\(92vw\2c 42rem\)\]{width:min(92vw,42rem)}.w-full{width:100%}.min-w-0{min-width:0}.min-w-64{min-width:16rem}.min-w-\[28rem\]{min-width:28rem}.min-w-\[4\.75rem\]{min-width:4.75rem}.min-w-full{min-width:100%}.max-w-4xl{max-width:56rem}.max-w-\[1600px\]{max-width:1600px}.max-w-\[4\.75rem\]{max-width:4.75rem}.max-w-full{max-width:100%}.flex-1{flex:1 1 0%}.shrink-0{flex-shrink:0}.border-collapse{border-collapse:collapse}.transform{transform:translate(var(--tw-translate-x),var(--tw-translate-y)) rotate(var(--tw-rotate)) skewX(var(--tw-skew-x)) skewY(var(--tw-skew-y)) scaleX(var(--tw-scale-x)) scaleY(var(--tw-scale-y))}.cursor-pointer{cursor:pointer}.grid-cols-2{grid-template-columns:repeat(2,minmax(0,1fr))}.grid-cols-3{grid-template-columns:repeat(3,minmax(0,1fr))}.grid-cols-\[10rem_1fr\]{grid-template-columns:10rem 
1fr}.flex-wrap{flex-wrap:wrap}.items-start{align-items:flex-start}.items-end{align-items:flex-end}.items-center{align-items:center}.justify-start{justify-content:flex-start}.justify-end{justify-content:flex-end}.justify-center{justify-content:center}.justify-between{justify-content:space-between}.gap-0\.5{gap:.125rem}.gap-1{gap:.25rem}.gap-1\.5{gap:.375rem}.gap-2{gap:.5rem}.gap-3{gap:.75rem}.gap-x-2{-moz-column-gap:.5rem;column-gap:.5rem}.gap-x-3{-moz-column-gap:.75rem;column-gap:.75rem}.gap-x-5{-moz-column-gap:1.25rem;column-gap:1.25rem}.gap-y-1{row-gap:.25rem}.gap-y-2{row-gap:.5rem}.space-y-2>:not([hidden])~:not([hidden]){--tw-space-y-reverse:0;margin-top:calc(.5rem*(1 - var(--tw-space-y-reverse)));margin-bottom:calc(.5rem*var(--tw-space-y-reverse))}.space-y-3>:not([hidden])~:not([hidden]){--tw-space-y-reverse:0;margin-top:calc(.75rem*(1 - var(--tw-space-y-reverse)));margin-bottom:calc(.75rem*var(--tw-space-y-reverse))}.overflow-auto{overflow:auto}.overflow-x-auto{overflow-x:auto}.truncate{overflow:hidden;text-overflow:ellipsis}.truncate,.whitespace-nowrap{white-space:nowrap}.whitespace-pre-wrap{white-space:pre-wrap}.break-all{word-break:break-all}.rounded{border-radius:.25rem}.rounded-full{border-radius:9999px}.border{border-width:1px}.border-b{border-bottom-width:1px}.border-l{border-left-width:1px}.border-t{border-top-width:1px}.border-amber-200{--tw-border-opacity:1;border-color:rgb(253 230 138/var(--tw-border-opacity,1))}.border-cyan-200{--tw-border-opacity:1;border-color:rgb(165 243 252/var(--tw-border-opacity,1))}.border-cyan-700{--tw-border-opacity:1;border-color:rgb(14 116 144/var(--tw-border-opacity,1))}.border-violet-200{--tw-border-opacity:1;border-color:rgb(221 214 254/var(--tw-border-opacity,1))}.border-zinc-200{--tw-border-opacity:1;border-color:rgb(228 228 231/var(--tw-border-opacity,1))}.border-zinc-300{--tw-border-opacity:1;border-color:rgb(212 212 216/var(--tw-border-opacity,1))}.border-zinc-900{--tw-border-opacity:1;border-color:rgb(24 24 
27/var(--tw-border-opacity,1))}.bg-amber-50{--tw-bg-opacity:1;background-color:rgb(255 251 235/var(--tw-bg-opacity,1))}.bg-cyan-50{--tw-bg-opacity:1;background-color:rgb(236 254 255/var(--tw-bg-opacity,1))}.bg-inherit{background-color:inherit}.bg-violet-50{--tw-bg-opacity:1;background-color:rgb(245 243 255/var(--tw-bg-opacity,1))}.bg-white{--tw-bg-opacity:1;background-color:rgb(255 255 255/var(--tw-bg-opacity,1))}.bg-zinc-100{--tw-bg-opacity:1;background-color:rgb(244 244 245/var(--tw-bg-opacity,1))}.bg-zinc-50{--tw-bg-opacity:1;background-color:rgb(250 250 250/var(--tw-bg-opacity,1))}.bg-zinc-900{--tw-bg-opacity:1;background-color:rgb(24 24 27/var(--tw-bg-opacity,1))}.p-0{padding:0}.p-2{padding:.5rem}.p-3{padding:.75rem}.px-1{padding-left:.25rem;padding-right:.25rem}.px-2{padding-left:.5rem;padding-right:.5rem}.px-3{padding-left:.75rem;padding-right:.75rem}.px-4{padding-left:1rem;padding-right:1rem}.py-0{padding-top:0;padding-bottom:0}.py-0\.5{padding-top:.125rem;padding-bottom:.125rem}.py-1{padding-top:.25rem;padding-bottom:.25rem}.py-1\.5{padding-top:.375rem;padding-bottom:.375rem}.py-2{padding-top:.5rem;padding-bottom:.5rem}.py-3{padding-top:.75rem;padding-bottom:.75rem}.py-4{padding-top:1rem;padding-bottom:1rem}.py-5{padding-top:1.25rem;padding-bottom:1.25rem}.py-6{padding-top:1.5rem;padding-bottom:1.5rem}.pb-4{padding-bottom:1rem}.pl-0\.5{padding-left:.125rem}.pl-3{padding-left:.75rem}.pr-0{padding-right:0}.pr-2{padding-right:.5rem}.pt-1{padding-top:.25rem}.pt-1\.5{padding-top:.375rem}.text-left{text-align:left}.text-center{text-align:center}.text-right{text-align:right}.align-middle{vertical-align:middle}.font-mono{font-family:ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,Liberation Mono,Courier 
New,monospace}.text-2xl{font-size:1.5rem;line-height:2rem}.text-\[0\.6875rem\]{font-size:.6875rem}.text-\[0\.8125rem\]{font-size:.8125rem}.text-\[9px\]{font-size:9px}.text-base{font-size:1rem;line-height:1.5rem}.text-lg{font-size:1.125rem;line-height:1.75rem}.text-sm{font-size:.875rem;line-height:1.25rem}.text-xl{font-size:1.25rem;line-height:1.75rem}.text-xs{font-size:.75rem;line-height:1rem}.font-medium{font-weight:500}.font-semibold{font-weight:600}.uppercase{text-transform:uppercase}.normal-case{text-transform:none}.tabular-nums{--tw-numeric-spacing:tabular-nums;font-variant-numeric:var(--tw-ordinal) var(--tw-slashed-zero) var(--tw-numeric-figure) var(--tw-numeric-spacing) var(--tw-numeric-fraction)}.leading-none{line-height:1}.leading-snug{line-height:1.375}.leading-tight{line-height:1.25}.text-amber-800{--tw-text-opacity:1;color:rgb(146 64 14/var(--tw-text-opacity,1))}.text-cyan-700{--tw-text-opacity:1;color:rgb(14 116 144/var(--tw-text-opacity,1))}.text-cyan-800{--tw-text-opacity:1;color:rgb(21 94 117/var(--tw-text-opacity,1))}.text-cyan-900{--tw-text-opacity:1;color:rgb(22 78 99/var(--tw-text-opacity,1))}.text-violet-800{--tw-text-opacity:1;color:rgb(91 33 182/var(--tw-text-opacity,1))}.text-white{--tw-text-opacity:1;color:rgb(255 255 255/var(--tw-text-opacity,1))}.text-zinc-400{--tw-text-opacity:1;color:rgb(161 161 170/var(--tw-text-opacity,1))}.text-zinc-500{--tw-text-opacity:1;color:rgb(113 113 122/var(--tw-text-opacity,1))}.text-zinc-600{--tw-text-opacity:1;color:rgb(82 82 91/var(--tw-text-opacity,1))}.text-zinc-700{--tw-text-opacity:1;color:rgb(63 63 70/var(--tw-text-opacity,1))}.text-zinc-800{--tw-text-opacity:1;color:rgb(39 39 42/var(--tw-text-opacity,1))}.text-zinc-900{--tw-text-opacity:1;color:rgb(24 24 27/var(--tw-text-opacity,1))}.text-zinc-950{--tw-text-opacity:1;color:rgb(9 9 
11/var(--tw-text-opacity,1))}.underline{text-decoration-line:underline}.underline-offset-2{text-underline-offset:2px}.accent-cyan-700{accent-color:#0e7490}.shadow-sm{--tw-shadow:0 1px 2px 0 rgba(0,0,0,.05);--tw-shadow-colored:0 1px 2px 0 var(--tw-shadow-color);box-shadow:var(--tw-ring-offset-shadow,0 0 #0000),var(--tw-ring-shadow,0 0 #0000),var(--tw-shadow)}.outline-none{outline:2px solid transparent;outline-offset:2px}.filter{filter:var(--tw-blur) var(--tw-brightness) var(--tw-contrast) var(--tw-grayscale) var(--tw-hue-rotate) var(--tw-invert) var(--tw-saturate) var(--tw-sepia) var(--tw-drop-shadow)}body.bg-zinc-50{background-color:var(--hakari-bg)}.bg-white{background-color:var(--hakari-surface)}.bg-zinc-50,.even\:bg-zinc-50:nth-child(2n){background-color:var(--hakari-surface-faint)}.bg-zinc-100{background-color:var(--hakari-surface-muted)}.bg-zinc-900{background-color:var(--hakari-accent)}.bg-cyan-50{background-color:var(--hakari-accent-soft)}.bg-amber-50{background-color:var(--hakari-warn-bg)}.bg-violet-50{background-color:var(--hakari-purple-soft)}.text-zinc-800,.text-zinc-900,.text-zinc-950{color:var(--hakari-text)}.text-zinc-600,.text-zinc-700{color:var(--hakari-text-muted)}.text-zinc-400,.text-zinc-500{color:var(--hakari-text-faint)}.text-white{color:var(--hakari-bg)}.text-cyan-700,.text-cyan-800,.text-cyan-900{color:var(--hakari-accent)}.text-amber-800{color:var(--hakari-warn-text)}.text-violet-800{color:#765a8f}.border-zinc-200,.border-zinc-300{border-color:var(--hakari-border)}.border-cyan-200,.border-cyan-500,.border-cyan-600,.border-cyan-700{border-color:var(--hakari-accent-border)}.border-amber-200{border-color:var(--hakari-warn-border)}.border-violet-200{border-color:#a78bbd}button,details,input,select,summary,textarea{border-color:var(--hakari-border)}input,select,textarea{background-color:var(--hakari-surface-faint);color:var(--hakari-text)}.hover\:text-cyan-700:hover,button:hover,summary:hover{color:var(--hakari-accent)}.model-detail-trigger,.model
-detail-trigger:hover{color:var(--hakari-text)}.hakari-icon{height:.875rem;width:.875rem}.doc-summary-trigger .hakari-icon,.tooltip-trigger .hakari-icon{height:.75rem;width:.75rem}.action-icon,.control-heading-icon,.filter-detail-icon,.section-heading-icon{color:var(--hakari-accent)}.leaderboard-row:hover>td{background-color:var(--hakari-row-hover)}.leaderboard-table-scroll{--hakari-model-col-width:clamp(18rem,40vw,36rem);--hakari-rank-col-width:4rem}.leaderboard-col-model{box-sizing:border-box;left:0;max-width:var(--hakari-model-col-width);min-width:var(--hakari-model-col-width);width:var(--hakari-model-col-width)}.leaderboard-col-rank{box-sizing:border-box;max-width:var(--hakari-rank-col-width);min-width:var(--hakari-rank-col-width);width:var(--hakari-rank-col-width)}.benchmark-doc h1{font-size:1.5rem;font-weight:600;line-height:2rem}.benchmark-doc h2{border-top:1px solid var(--hakari-border);font-size:1.125rem;font-weight:600;line-height:1.75rem;margin-top:1.5rem;padding-top:1rem}.benchmark-doc h3{font-size:1rem;font-weight:600;line-height:1.5rem;margin-top:1rem}.benchmark-doc blockquote,.benchmark-doc p,.benchmark-doc pre,.benchmark-doc table,.benchmark-doc ul{margin-top:.75rem}.benchmark-doc blockquote,.benchmark-doc li,.benchmark-doc p,.benchmark-doc td{color:var(--hakari-text-muted);font-size:.875rem;line-height:1.55}.benchmark-doc ul{list-style:disc;padding-left:1.25rem}.benchmark-doc a{color:var(--hakari-accent);text-decoration-line:underline;text-underline-offset:2px}.benchmark-doc table{min-width:100%}.benchmark-doc td{border-top:1px solid var(--hakari-border);padding:.375rem .5rem;vertical-align:top}.benchmark-doc code,.benchmark-doc pre{font-family:var(--hakari-font)}.benchmark-doc pre{border:1px solid var(--hakari-border);overflow-x:auto;padding:.75rem}.tooltip-trigger{cursor:pointer;position:relative}.tooltip-trigger:after{display:none}.global-tooltip{z-index:1000;max-width:min(24rem,calc(100vw - 
2rem));opacity:0;overflow-wrap:anywhere;pointer-events:none;text-transform:none;transition:opacity .12s ease-in-out;white-space:normal}.shadow-sm{box-shadow:0 1px 2px 0 rgba(0,0,0,.18)}.global-tooltip[data-visible=true]{opacity:1}.leaderboard-loading-toast{opacity:0;pointer-events:none;transform:translateY(.5rem);transition:opacity .16s ease-in-out,transform .16s ease-in-out}.leaderboard-loading-toast.htmx-request{opacity:1;transform:translateY(0)}[data-leaderboard-pending=true]{cursor:progress;opacity:.72}[data-leaderboard-pending=true]:after{animation:hakari-leaderboard-spin .7s linear infinite;border:2px solid;border-right:2px solid transparent;border-radius:9999px;content:"";display:inline-block;height:.55rem;margin-left:.375rem;vertical-align:-.075rem;width:.55rem}.task-z-score{box-sizing:border-box;display:inline-flex;width:3.75rem;min-width:3.75rem;flex-direction:column;align-items:flex-end;justify-content:center;border:1px solid rgba(29,27,24,.14);border-radius:0;line-height:1;padding:.1rem 
.3rem}.task-z-score-value{font-size:.8125rem;font-weight:400}.task-z-score-delta{margin-top:0;font-size:.5625rem;font-weight:400}.task-z-neutral{background-color:transparent;color:inherit}.task-z-pos-025{background-color:#f4f1df;color:#5a6335}.task-z-pos-050{background-color:#e8e5c4;color:#4f5b2e}.task-z-pos-075{background-color:#d7d49f;color:#3f4a24}.task-z-pos-100{background-color:#c5c27c;color:#30391b}.task-z-pos-125{background-color:#aaa85f;color:#252d16}.task-z-pos-150{background-color:#8c8f46;color:#fffaf0}.task-z-pos-175{background-color:#707735;color:#fffaf0}.task-z-pos-200{background-color:#566126;color:#fffaf0}.task-z-neg-025{background-color:#f7ebe4;color:#87513e}.task-z-neg-050{background-color:#efd8cb;color:#7a4234}.task-z-neg-075{background-color:#e5c0ae;color:#6e342c}.task-z-neg-100{background-color:#d99f88;color:#55251f}.task-z-neg-125{background-color:#c97c63;color:#3a1714}.task-z-neg-150{background-color:#ad5d4d;color:#fffaf0}.task-z-neg-175{background-color:#924436;color:#fffaf0}.task-z-neg-200{background-color:#733126;color:#fffaf0}@keyframes hakari-leaderboard-spin{to{transform:rotate(1turn)}}@media (prefers-reduced-motion:reduce){.leaderboard-loading-toast{transition:none}[data-leaderboard-pending=true]:after{animation:none}}@media (prefers-color-scheme:dark){.shadow-sm{box-shadow:0 1px 2px 0 
rgba(0,0,0,.5)}button,input,select,summary,textarea{color-scheme:dark}dialog::backdrop{background-color:rgba(24,24,27,.76)}.global-tooltip{border-color:var(--hakari-border-strong);background-color:var(--hakari-surface);color:var(--hakari-text)}.task-z-score{border-color:hsla(45,21%,93%,.22)}.task-z-pos-025{background-color:#022c22;color:#a7f3d0}.task-z-pos-050{background-color:#064e3b;color:#d1fae5}.task-z-pos-075{background-color:#065f46;color:#ecfdf5}.task-z-pos-100{background-color:#047857;color:#fff}.task-z-pos-125{background-color:#059669;color:#fff}.task-z-pos-150{background-color:#10b981;color:#022c22}.task-z-pos-175{background-color:#34d399;color:#022c22}.task-z-pos-200{background-color:#6ee7b7;color:#022c22}.task-z-neg-025{background-color:#4c0519;color:#fecdd3}.task-z-neg-050{background-color:#881337;color:#ffe4e6}.task-z-neg-075{background-color:#9f1239;color:#fff1f2}.task-z-neg-100{background-color:#be123c;color:#fff}.task-z-neg-125{background-color:#e11d48;color:#fff}.task-z-neg-150{background-color:#f43f5e;color:#fff}.task-z-neg-175{background-color:#e11d48;color:#fff}.task-z-neg-200{background-color:#be123c;color:#fff}}.\[index\:end\]{index:end}.\[overflow-wrap\:anywhere\]{overflow-wrap:anywhere}.backdrop\:bg-zinc-950\/35::backdrop{background-color:rgba(9,9,11,.35)}.odd\:bg-white:nth-child(odd){--tw-bg-opacity:1;background-color:rgb(255 255 255/var(--tw-bg-opacity,1));background-color:var(--hakari-surface)}.even\:bg-zinc-50:nth-child(2n){--tw-bg-opacity:1;background-color:rgb(250 250 250/var(--tw-bg-opacity,1))}.even\:bg-zinc-50:nth-child(2n)body{background-color:var(--hakari-bg)}.even\:bg-zinc-50:nth-child(2n){background-color:var(--hakari-surface-faint)}.hover\:border-cyan-500:hover{--tw-border-opacity:1;border-color:rgb(6 182 212/var(--tw-border-opacity,1))}.hover\:border-cyan-600:hover{--tw-border-opacity:1;border-color:rgb(8 145 178/var(--tw-border-opacity,1))}.hover\:text-cyan-700:hover{--tw-text-opacity:1;color:rgb(14 116 
144/var(--tw-text-opacity,1))}.hover\:underline:hover{text-decoration-line:underline}.hover\:text-cyan-700:hover{color:var(--hakari-accent)}.focus\:border-cyan-700:focus,.hover\:border-cyan-500:hover,.hover\:border-cyan-600:hover{border-color:var(--hakari-accent-border)}.focus\:border-cyan-700:focus{--tw-border-opacity:1}@media (min-width:640px){.sm\:grid-cols-3{grid-template-columns:repeat(3,minmax(0,1fr))}.sm\:grid-cols-5{grid-template-columns:repeat(5,minmax(0,1fr))}.sm\:px-6{padding-left:1.5rem;padding-right:1.5rem}}@media (min-width:1024px){.lg\:grid-cols-2{grid-template-columns:repeat(2,minmax(0,1fr))}.lg\:grid-cols-6{grid-template-columns:repeat(6,minmax(0,1fr))}}@media (min-width:1280px){.xl\:grid-cols-3{grid-template-columns:repeat(3,minmax(0,1fr))}}
|
|
|
|
| 1 |
+
*,:after,:before{--tw-border-spacing-x:0;--tw-border-spacing-y:0;--tw-translate-x:0;--tw-translate-y:0;--tw-rotate:0;--tw-skew-x:0;--tw-skew-y:0;--tw-scale-x:1;--tw-scale-y:1;--tw-pan-x: ;--tw-pan-y: ;--tw-pinch-zoom: ;--tw-scroll-snap-strictness:proximity;--tw-gradient-from-position: ;--tw-gradient-via-position: ;--tw-gradient-to-position: ;--tw-ordinal: ;--tw-slashed-zero: ;--tw-numeric-figure: ;--tw-numeric-spacing: ;--tw-numeric-fraction: ;--tw-ring-inset: ;--tw-ring-offset-width:0px;--tw-ring-offset-color:#fff;--tw-ring-color:rgba(59,130,246,.5);--tw-ring-offset-shadow:0 0 #0000;--tw-ring-shadow:0 0 #0000;--tw-shadow:0 0 #0000;--tw-shadow-colored:0 0 #0000;--tw-blur: ;--tw-brightness: ;--tw-contrast: ;--tw-grayscale: ;--tw-hue-rotate: ;--tw-invert: ;--tw-saturate: ;--tw-sepia: ;--tw-drop-shadow: ;--tw-backdrop-blur: ;--tw-backdrop-brightness: ;--tw-backdrop-contrast: ;--tw-backdrop-grayscale: ;--tw-backdrop-hue-rotate: ;--tw-backdrop-invert: ;--tw-backdrop-opacity: ;--tw-backdrop-saturate: ;--tw-backdrop-sepia: ;--tw-contain-size: ;--tw-contain-layout: ;--tw-contain-paint: ;--tw-contain-style: }::backdrop{--tw-border-spacing-x:0;--tw-border-spacing-y:0;--tw-translate-x:0;--tw-translate-y:0;--tw-rotate:0;--tw-skew-x:0;--tw-skew-y:0;--tw-scale-x:1;--tw-scale-y:1;--tw-pan-x: ;--tw-pan-y: ;--tw-pinch-zoom: ;--tw-scroll-snap-strictness:proximity;--tw-gradient-from-position: ;--tw-gradient-via-position: ;--tw-gradient-to-position: ;--tw-ordinal: ;--tw-slashed-zero: ;--tw-numeric-figure: ;--tw-numeric-spacing: ;--tw-numeric-fraction: ;--tw-ring-inset: ;--tw-ring-offset-width:0px;--tw-ring-offset-color:#fff;--tw-ring-color:rgba(59,130,246,.5);--tw-ring-offset-shadow:0 0 #0000;--tw-ring-shadow:0 0 #0000;--tw-shadow:0 0 #0000;--tw-shadow-colored:0 0 #0000;--tw-blur: ;--tw-brightness: ;--tw-contrast: ;--tw-grayscale: ;--tw-hue-rotate: ;--tw-invert: ;--tw-saturate: ;--tw-sepia: ;--tw-drop-shadow: ;--tw-backdrop-blur: ;--tw-backdrop-brightness: ;--tw-backdrop-contrast: 
;--tw-backdrop-grayscale: ;--tw-backdrop-hue-rotate: ;--tw-backdrop-invert: ;--tw-backdrop-opacity: ;--tw-backdrop-saturate: ;--tw-backdrop-sepia: ;--tw-contain-size: ;--tw-contain-layout: ;--tw-contain-paint: ;--tw-contain-style: }/*! tailwindcss v3.4.17 | MIT License | https://tailwindcss.com*/*,:after,:before{box-sizing:border-box;border:0 solid #e5e7eb}:after,:before{--tw-content:""}:host,html{line-height:1.5;-webkit-text-size-adjust:100%;-moz-tab-size:4;-o-tab-size:4;tab-size:4;font-family:ui-sans-serif,system-ui,sans-serif,Apple Color Emoji,Segoe UI Emoji,Segoe UI Symbol,Noto Color Emoji;font-feature-settings:normal;font-variation-settings:normal;-webkit-tap-highlight-color:transparent}body{margin:0;line-height:inherit}hr{height:0;color:inherit;border-top-width:1px}abbr:where([title]){-webkit-text-decoration:underline dotted;text-decoration:underline dotted}h1,h2,h3,h4,h5,h6{font-size:inherit;font-weight:inherit}a{color:inherit;text-decoration:inherit}b,strong{font-weight:bolder}code,kbd,pre,samp{font-family:ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,Liberation Mono,Courier 
New,monospace;font-feature-settings:normal;font-variation-settings:normal;font-size:1em}small{font-size:80%}sub,sup{font-size:75%;line-height:0;position:relative;vertical-align:baseline}sub{bottom:-.25em}sup{top:-.5em}table{text-indent:0;border-color:inherit;border-collapse:collapse}button,input,optgroup,select,textarea{font-family:inherit;font-feature-settings:inherit;font-variation-settings:inherit;font-size:100%;font-weight:inherit;line-height:inherit;letter-spacing:inherit;color:inherit;margin:0;padding:0}button,select{text-transform:none}button,input:where([type=button]),input:where([type=reset]),input:where([type=submit]){-webkit-appearance:button;background-color:transparent;background-image:none}:-moz-focusring{outline:auto}:-moz-ui-invalid{box-shadow:none}progress{vertical-align:baseline}::-webkit-inner-spin-button,::-webkit-outer-spin-button{height:auto}[type=search]{-webkit-appearance:textfield;outline-offset:-2px}::-webkit-search-decoration{-webkit-appearance:none}::-webkit-file-upload-button{-webkit-appearance:button;font:inherit}summary{display:list-item}blockquote,dd,dl,figure,h1,h2,h3,h4,h5,h6,hr,p,pre{margin:0}fieldset{margin:0}fieldset,legend{padding:0}menu,ol,ul{list-style:none;margin:0;padding:0}dialog{padding:0}textarea{resize:vertical}input::-moz-placeholder,textarea::-moz-placeholder{opacity:1;color:#9ca3af}input::placeholder,textarea::placeholder{opacity:1;color:#9ca3af}[role=button],button{cursor:pointer}:disabled{cursor:default}audio,canvas,embed,iframe,img,object,svg,video{display:block;vertical-align:middle}img,video{max-width:100%;height:auto}[hidden]:where(:not([hidden=until-found])){display:none}:root{color-scheme:light 
dark;--hakari-bg:#f4f1e8;--hakari-surface:#fffaf0;--hakari-surface-muted:#ebe6d9;--hakari-surface-faint:#f8f4eb;--hakari-border:#d3ccb9;--hakari-border-strong:#aaa18e;--hakari-text:#1d1b18;--hakari-text-muted:#70695c;--hakari-text-faint:#918978;--hakari-accent:#256d57;--hakari-accent-soft:#dceee4;--hakari-accent-border:#80b39d;--hakari-warn-bg:#f7e7c6;--hakari-warn-border:#c28f31;--hakari-warn-text:#704b08;--hakari-danger:#b24a57;--hakari-danger-soft:#f3d7da;--hakari-purple-soft:#e8e0ef;--hakari-row-hover:#ece4d2;--hakari-font:"SFMono-Regular","Cascadia Code","Roboto Mono","Noto Sans Mono","Yu Gothic UI","Meiryo",ui-monospace,monospace}body{background-color:var(--hakari-bg);color:var(--hakari-text);font-family:var(--hakari-font);-webkit-font-smoothing:antialiased;text-rendering:optimizeLegibility}body:before{background-image:linear-gradient(hsla(0,0%,100%,.18) 1px,transparent 0),linear-gradient(90deg,rgba(29,27,24,.05) 1px,transparent 0),radial-gradient(circle at 1px 1px,rgba(29,27,24,.08) 1px,transparent 0);background-size:100% 3rem,3rem 100%,4px 4px;content:"";inset:0;opacity:.36;pointer-events:none;position:fixed;z-index:-1}@media (prefers-color-scheme:dark){:root{--hakari-bg:#29292d;--hakari-surface:#242428;--hakari-surface-muted:#303035;--hakari-surface-faint:#2a2a2f;--hakari-border:#45474c;--hakari-border-strong:#5b625f;--hakari-text:#f0eee8;--hakari-text-muted:#aaa8a1;--hakari-text-faint:#7f7f7b;--hakari-accent:#8bd99a;--hakari-accent-soft:#23352c;--hakari-accent-border:#5a8d66;--hakari-warn-bg:#3d2c1c;--hakari-warn-border:#a46a22;--hakari-warn-text:#f0c36f;--hakari-danger:#ff7d8a;--hakari-danger-soft:#41272d;--hakari-purple-soft:#332b3c;--hakari-row-hover:#38393d}body:before{background-image:linear-gradient(hsla(0,0%,100%,.035) 1px,transparent 0),linear-gradient(90deg,hsla(0,0%,100%,.025) 1px,transparent 0),radial-gradient(circle at 1px 1px,hsla(0,0%,100%,.16) 1px,transparent 
0);opacity:.42}}.visible{visibility:visible}.static{position:static}.fixed{position:fixed}.absolute{position:absolute}.relative{position:relative}.sticky{position:sticky}.bottom-4{bottom:1rem}.right-4{right:1rem}.z-10{z-index:10}.z-20{z-index:20}.z-50{z-index:50}.mx-auto{margin-left:auto;margin-right:auto}.mb-1{margin-bottom:.25rem}.mb-2{margin-bottom:.5rem}.mb-3{margin-bottom:.75rem}.mb-4{margin-bottom:1rem}.mb-5{margin-bottom:1.25rem}.ml-2{margin-left:.5rem}.mt-1{margin-top:.25rem}.mt-2{margin-top:.5rem}.mt-3{margin-top:.75rem}.mt-6{margin-top:1.5rem}.block{display:block}.inline{display:inline}.flex{display:flex}.inline-flex{display:inline-flex}.table{display:table}.grid{display:grid}.hidden{display:none}.h-3\.5{height:.875rem}.h-4{height:1rem}.h-8{height:2rem}.max-h-60{max-height:15rem}.max-h-72{max-height:18rem}.max-h-80{max-height:20rem}.w-24{width:6rem}.w-3\.5{width:.875rem}.w-4{width:1rem}.w-72{width:18rem}.w-8{width:2rem}.w-\[4\.75rem\]{width:4.75rem}.w-\[min\(92vw\2c 42rem\)\]{width:min(92vw,42rem)}.w-full{width:100%}.min-w-0{min-width:0}.min-w-64{min-width:16rem}.min-w-\[28rem\]{min-width:28rem}.min-w-\[4\.75rem\]{min-width:4.75rem}.min-w-full{min-width:100%}.max-w-4xl{max-width:56rem}.max-w-\[1600px\]{max-width:1600px}.max-w-\[4\.75rem\]{max-width:4.75rem}.max-w-full{max-width:100%}.flex-1{flex:1 1 0%}.shrink-0{flex-shrink:0}.border-collapse{border-collapse:collapse}.transform{transform:translate(var(--tw-translate-x),var(--tw-translate-y)) rotate(var(--tw-rotate)) skewX(var(--tw-skew-x)) skewY(var(--tw-skew-y)) scaleX(var(--tw-scale-x)) scaleY(var(--tw-scale-y))}.cursor-pointer{cursor:pointer}.grid-cols-2{grid-template-columns:repeat(2,minmax(0,1fr))}.grid-cols-3{grid-template-columns:repeat(3,minmax(0,1fr))}.grid-cols-\[10rem_1fr\]{grid-template-columns:10rem 
1fr}.flex-wrap{flex-wrap:wrap}.items-start{align-items:flex-start}.items-end{align-items:flex-end}.items-center{align-items:center}.justify-start{justify-content:flex-start}.justify-end{justify-content:flex-end}.justify-center{justify-content:center}.justify-between{justify-content:space-between}.gap-0\.5{gap:.125rem}.gap-1{gap:.25rem}.gap-1\.5{gap:.375rem}.gap-2{gap:.5rem}.gap-3{gap:.75rem}.gap-x-2{-moz-column-gap:.5rem;column-gap:.5rem}.gap-x-3{-moz-column-gap:.75rem;column-gap:.75rem}.gap-x-5{-moz-column-gap:1.25rem;column-gap:1.25rem}.gap-y-1{row-gap:.25rem}.gap-y-2{row-gap:.5rem}.space-y-2>:not([hidden])~:not([hidden]){--tw-space-y-reverse:0;margin-top:calc(.5rem*(1 - var(--tw-space-y-reverse)));margin-bottom:calc(.5rem*var(--tw-space-y-reverse))}.space-y-3>:not([hidden])~:not([hidden]){--tw-space-y-reverse:0;margin-top:calc(.75rem*(1 - var(--tw-space-y-reverse)));margin-bottom:calc(.75rem*var(--tw-space-y-reverse))}.overflow-auto{overflow:auto}.overflow-x-auto{overflow-x:auto}.truncate{overflow:hidden;text-overflow:ellipsis}.truncate,.whitespace-nowrap{white-space:nowrap}.whitespace-pre-wrap{white-space:pre-wrap}.break-all{word-break:break-all}.rounded{border-radius:.25rem}.rounded-full{border-radius:9999px}.border{border-width:1px}.border-b{border-bottom-width:1px}.border-l{border-left-width:1px}.border-t{border-top-width:1px}.border-amber-200{--tw-border-opacity:1;border-color:rgb(253 230 138/var(--tw-border-opacity,1))}.border-cyan-200{--tw-border-opacity:1;border-color:rgb(165 243 252/var(--tw-border-opacity,1))}.border-cyan-700{--tw-border-opacity:1;border-color:rgb(14 116 144/var(--tw-border-opacity,1))}.border-violet-200{--tw-border-opacity:1;border-color:rgb(221 214 254/var(--tw-border-opacity,1))}.border-zinc-200{--tw-border-opacity:1;border-color:rgb(228 228 231/var(--tw-border-opacity,1))}.border-zinc-300{--tw-border-opacity:1;border-color:rgb(212 212 216/var(--tw-border-opacity,1))}.border-zinc-900{--tw-border-opacity:1;border-color:rgb(24 24 
27/var(--tw-border-opacity,1))}.bg-amber-50{--tw-bg-opacity:1;background-color:rgb(255 251 235/var(--tw-bg-opacity,1))}.bg-cyan-50{--tw-bg-opacity:1;background-color:rgb(236 254 255/var(--tw-bg-opacity,1))}.bg-inherit{background-color:inherit}.bg-violet-50{--tw-bg-opacity:1;background-color:rgb(245 243 255/var(--tw-bg-opacity,1))}.bg-white{--tw-bg-opacity:1;background-color:rgb(255 255 255/var(--tw-bg-opacity,1))}.bg-zinc-100{--tw-bg-opacity:1;background-color:rgb(244 244 245/var(--tw-bg-opacity,1))}.bg-zinc-50{--tw-bg-opacity:1;background-color:rgb(250 250 250/var(--tw-bg-opacity,1))}.bg-zinc-900{--tw-bg-opacity:1;background-color:rgb(24 24 27/var(--tw-bg-opacity,1))}.p-0{padding:0}.p-2{padding:.5rem}.p-3{padding:.75rem}.px-1{padding-left:.25rem;padding-right:.25rem}.px-1\.5{padding-left:.375rem;padding-right:.375rem}.px-2{padding-left:.5rem;padding-right:.5rem}.px-3{padding-left:.75rem;padding-right:.75rem}.px-4{padding-left:1rem;padding-right:1rem}.py-0{padding-top:0;padding-bottom:0}.py-0\.5{padding-top:.125rem;padding-bottom:.125rem}.py-1{padding-top:.25rem;padding-bottom:.25rem}.py-1\.5{padding-top:.375rem;padding-bottom:.375rem}.py-2{padding-top:.5rem;padding-bottom:.5rem}.py-3{padding-top:.75rem;padding-bottom:.75rem}.py-4{padding-top:1rem;padding-bottom:1rem}.py-5{padding-top:1.25rem;padding-bottom:1.25rem}.py-6{padding-top:1.5rem;padding-bottom:1.5rem}.pb-4{padding-bottom:1rem}.pl-0\.5{padding-left:.125rem}.pl-2{padding-left:.5rem}.pl-3{padding-left:.75rem}.pr-0{padding-right:0}.pr-2{padding-right:.5rem}.pt-1{padding-top:.25rem}.text-left{text-align:left}.text-center{text-align:center}.text-right{text-align:right}.align-middle{vertical-align:middle}.font-mono{font-family:ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,Liberation Mono,Courier 
New,monospace}.text-2xl{font-size:1.5rem;line-height:2rem}.text-\[0\.6875rem\]{font-size:.6875rem}.text-\[0\.8125rem\]{font-size:.8125rem}.text-\[9px\]{font-size:9px}.text-base{font-size:1rem;line-height:1.5rem}.text-lg{font-size:1.125rem;line-height:1.75rem}.text-sm{font-size:.875rem;line-height:1.25rem}.text-xl{font-size:1.25rem;line-height:1.75rem}.text-xs{font-size:.75rem;line-height:1rem}.font-medium{font-weight:500}.font-semibold{font-weight:600}.uppercase{text-transform:uppercase}.normal-case{text-transform:none}.tabular-nums{--tw-numeric-spacing:tabular-nums;font-variant-numeric:var(--tw-ordinal) var(--tw-slashed-zero) var(--tw-numeric-figure) var(--tw-numeric-spacing) var(--tw-numeric-fraction)}.leading-none{line-height:1}.leading-snug{line-height:1.375}.leading-tight{line-height:1.25}.text-amber-800{--tw-text-opacity:1;color:rgb(146 64 14/var(--tw-text-opacity,1))}.text-cyan-700{--tw-text-opacity:1;color:rgb(14 116 144/var(--tw-text-opacity,1))}.text-cyan-800{--tw-text-opacity:1;color:rgb(21 94 117/var(--tw-text-opacity,1))}.text-cyan-900{--tw-text-opacity:1;color:rgb(22 78 99/var(--tw-text-opacity,1))}.text-violet-800{--tw-text-opacity:1;color:rgb(91 33 182/var(--tw-text-opacity,1))}.text-white{--tw-text-opacity:1;color:rgb(255 255 255/var(--tw-text-opacity,1))}.text-zinc-400{--tw-text-opacity:1;color:rgb(161 161 170/var(--tw-text-opacity,1))}.text-zinc-500{--tw-text-opacity:1;color:rgb(113 113 122/var(--tw-text-opacity,1))}.text-zinc-600{--tw-text-opacity:1;color:rgb(82 82 91/var(--tw-text-opacity,1))}.text-zinc-700{--tw-text-opacity:1;color:rgb(63 63 70/var(--tw-text-opacity,1))}.text-zinc-800{--tw-text-opacity:1;color:rgb(39 39 42/var(--tw-text-opacity,1))}.text-zinc-900{--tw-text-opacity:1;color:rgb(24 24 27/var(--tw-text-opacity,1))}.text-zinc-950{--tw-text-opacity:1;color:rgb(9 9 
11/var(--tw-text-opacity,1))}.underline{text-decoration-line:underline}.underline-offset-2{text-underline-offset:2px}.accent-cyan-700{accent-color:#0e7490}.shadow-sm{--tw-shadow:0 1px 2px 0 rgba(0,0,0,.05);--tw-shadow-colored:0 1px 2px 0 var(--tw-shadow-color);box-shadow:var(--tw-ring-offset-shadow,0 0 #0000),var(--tw-ring-shadow,0 0 #0000),var(--tw-shadow)}.outline-none{outline:2px solid transparent;outline-offset:2px}.filter{filter:var(--tw-blur) var(--tw-brightness) var(--tw-contrast) var(--tw-grayscale) var(--tw-hue-rotate) var(--tw-invert) var(--tw-saturate) var(--tw-sepia) var(--tw-drop-shadow)}body.bg-zinc-50{background-color:var(--hakari-bg)}.bg-white{background-color:var(--hakari-surface)}.bg-zinc-50,.even\:bg-zinc-50:nth-child(2n){background-color:var(--hakari-surface-faint)}.bg-zinc-100{background-color:var(--hakari-surface-muted)}.bg-zinc-900{background-color:var(--hakari-accent)}.bg-cyan-50{background-color:var(--hakari-accent-soft)}.bg-amber-50{background-color:var(--hakari-warn-bg)}.bg-violet-50{background-color:var(--hakari-purple-soft)}.text-zinc-800,.text-zinc-900,.text-zinc-950{color:var(--hakari-text)}.text-zinc-600,.text-zinc-700{color:var(--hakari-text-muted)}.text-zinc-400,.text-zinc-500{color:var(--hakari-text-faint)}.text-white{color:var(--hakari-bg)}.text-cyan-700,.text-cyan-800,.text-cyan-900{color:var(--hakari-accent)}.text-amber-800{color:var(--hakari-warn-text)}.text-violet-800{color:#765a8f}.border-zinc-200,.border-zinc-300{border-color:var(--hakari-border)}.border-cyan-200,.border-cyan-500,.border-cyan-600,.border-cyan-700{border-color:var(--hakari-accent-border)}.border-amber-200{border-color:var(--hakari-warn-border)}.border-violet-200{border-color:#a78bbd}button,details,input,select,summary,textarea{border-color:var(--hakari-border)}input,select,textarea{background-color:var(--hakari-surface-faint);color:var(--hakari-text)}.hover\:text-cyan-700:hover,button:hover,summary:hover{color:var(--hakari-accent)}.model-detail-trigger,.model
-detail-trigger:hover{color:var(--hakari-text)}.hakari-icon{height:.875rem;width:.875rem}.doc-summary-trigger .hakari-icon,.tooltip-trigger .hakari-icon{height:.75rem;width:.75rem}.action-icon,.control-heading-icon,.filter-detail-icon,.section-heading-icon{color:var(--hakari-accent)}.leaderboard-row:hover>td{background-color:var(--hakari-row-hover)}.leaderboard-table-scroll{--hakari-model-col-width:clamp(18rem,40vw,36rem);--hakari-rank-col-width:4rem}.leaderboard-col-model{box-sizing:border-box;left:0;max-width:var(--hakari-model-col-width);min-width:var(--hakari-model-col-width);width:var(--hakari-model-col-width)}.leaderboard-col-rank{box-sizing:border-box;max-width:var(--hakari-rank-col-width);min-width:var(--hakari-rank-col-width);width:var(--hakari-rank-col-width)}.benchmark-doc h1{font-size:1.5rem;font-weight:600;line-height:2rem}.benchmark-doc h2{border-top:1px solid var(--hakari-border);font-size:1.125rem;font-weight:600;line-height:1.75rem;margin-top:1.5rem;padding-top:1rem}.benchmark-doc h3{font-size:1rem;font-weight:600;line-height:1.5rem;margin-top:1rem}.benchmark-doc blockquote,.benchmark-doc p,.benchmark-doc pre,.benchmark-doc table,.benchmark-doc ul{margin-top:.75rem}.benchmark-doc blockquote,.benchmark-doc li,.benchmark-doc p,.benchmark-doc td{color:var(--hakari-text-muted);font-size:.875rem;line-height:1.55}.benchmark-doc ul{list-style:disc;padding-left:1.25rem}.benchmark-doc a{color:var(--hakari-accent);text-decoration-line:underline;text-underline-offset:2px}.benchmark-doc table{min-width:100%}.benchmark-doc td{border-top:1px solid var(--hakari-border);padding:.375rem .5rem;vertical-align:top}.benchmark-doc code,.benchmark-doc pre{font-family:var(--hakari-font)}.benchmark-doc pre{border:1px solid var(--hakari-border);overflow-x:auto;padding:.75rem}.tooltip-trigger{cursor:pointer;position:relative}.tooltip-trigger:after{display:none}.global-tooltip{z-index:1000;max-width:min(24rem,calc(100vw - 
2rem));opacity:0;overflow-wrap:anywhere;pointer-events:none;text-transform:none;transition:opacity .12s ease-in-out;white-space:normal}.shadow-sm{box-shadow:0 1px 2px 0 rgba(0,0,0,.18)}.global-tooltip[data-visible=true]{opacity:1}.leaderboard-loading-toast{opacity:0;pointer-events:none;transform:translateY(.5rem);transition:opacity .16s ease-in-out,transform .16s ease-in-out}.leaderboard-loading-toast.htmx-request{opacity:1;transform:translateY(0)}[data-leaderboard-pending=true]{cursor:progress;opacity:.72}[data-leaderboard-pending=true]:after{animation:hakari-leaderboard-spin .7s linear infinite;border:2px solid;border-right:2px solid transparent;border-radius:9999px;content:"";display:inline-block;height:.55rem;margin-left:.375rem;vertical-align:-.075rem;width:.55rem}.task-z-score{box-sizing:border-box;display:inline-flex;width:3.75rem;min-width:3.75rem;flex-direction:column;align-items:flex-end;justify-content:center;border:1px solid rgba(29,27,24,.14);border-radius:0;line-height:1;padding:.1rem 
.3rem}.task-z-score-value{font-size:.8125rem;font-weight:400}.task-z-score-delta{margin-top:0;font-size:.5625rem;font-weight:400}.task-z-neutral{background-color:transparent;color:inherit}.task-z-pos-025{background-color:#f4f1df;color:#5a6335}.task-z-pos-050{background-color:#e8e5c4;color:#4f5b2e}.task-z-pos-075{background-color:#d7d49f;color:#3f4a24}.task-z-pos-100{background-color:#c5c27c;color:#30391b}.task-z-pos-125{background-color:#aaa85f;color:#252d16}.task-z-pos-150{background-color:#8c8f46;color:#fffaf0}.task-z-pos-175{background-color:#707735;color:#fffaf0}.task-z-pos-200{background-color:#566126;color:#fffaf0}.task-z-neg-025{background-color:#f7ebe4;color:#87513e}.task-z-neg-050{background-color:#efd8cb;color:#7a4234}.task-z-neg-075{background-color:#e5c0ae;color:#6e342c}.task-z-neg-100{background-color:#d99f88;color:#55251f}.task-z-neg-125{background-color:#c97c63;color:#3a1714}.task-z-neg-150{background-color:#ad5d4d;color:#fffaf0}.task-z-neg-175{background-color:#924436;color:#fffaf0}.task-z-neg-200{background-color:#733126;color:#fffaf0}@keyframes hakari-leaderboard-spin{to{transform:rotate(1turn)}}@media (prefers-reduced-motion:reduce){.leaderboard-loading-toast{transition:none}[data-leaderboard-pending=true]:after{animation:none}}@media (prefers-color-scheme:dark){.shadow-sm{box-shadow:0 1px 2px 0 
rgba(0,0,0,.5)}button,input,select,summary,textarea{color-scheme:dark}dialog::backdrop{background-color:rgba(24,24,27,.76)}.global-tooltip{border-color:var(--hakari-border-strong);background-color:var(--hakari-surface);color:var(--hakari-text)}.task-z-score{border-color:hsla(45,21%,93%,.22)}.task-z-pos-025{background-color:#022c22;color:#a7f3d0}.task-z-pos-050{background-color:#064e3b;color:#d1fae5}.task-z-pos-075{background-color:#065f46;color:#ecfdf5}.task-z-pos-100{background-color:#047857;color:#fff}.task-z-pos-125{background-color:#059669;color:#fff}.task-z-pos-150{background-color:#10b981;color:#022c22}.task-z-pos-175{background-color:#34d399;color:#022c22}.task-z-pos-200{background-color:#6ee7b7;color:#022c22}.task-z-neg-025{background-color:#4c0519;color:#fecdd3}.task-z-neg-050{background-color:#881337;color:#ffe4e6}.task-z-neg-075{background-color:#9f1239;color:#fff1f2}.task-z-neg-100{background-color:#be123c;color:#fff}.task-z-neg-125{background-color:#e11d48;color:#fff}.task-z-neg-150{background-color:#f43f5e;color:#fff}.task-z-neg-175{background-color:#e11d48;color:#fff}.task-z-neg-200{background-color:#be123c;color:#fff}}.\[index\:end\]{index:end}.\[overflow-wrap\:anywhere\]{overflow-wrap:anywhere}.backdrop\:bg-zinc-950\/35::backdrop{background-color:rgba(9,9,11,.35)}.odd\:bg-white:nth-child(odd){--tw-bg-opacity:1;background-color:rgb(255 255 255/var(--tw-bg-opacity,1));background-color:var(--hakari-surface)}.even\:bg-zinc-50:nth-child(2n){--tw-bg-opacity:1;background-color:rgb(250 250 250/var(--tw-bg-opacity,1))}.even\:bg-zinc-50:nth-child(2n)body{background-color:var(--hakari-bg)}.even\:bg-zinc-50:nth-child(2n){background-color:var(--hakari-surface-faint)}.hover\:border-cyan-500:hover{--tw-border-opacity:1;border-color:rgb(6 182 212/var(--tw-border-opacity,1))}.hover\:border-cyan-600:hover{--tw-border-opacity:1;border-color:rgb(8 145 178/var(--tw-border-opacity,1))}.hover\:text-cyan-700:hover{--tw-text-opacity:1;color:rgb(14 116 
144/var(--tw-text-opacity,1))}.hover\:underline:hover{text-decoration-line:underline}.hover\:text-cyan-700:hover{color:var(--hakari-accent)}.focus\:border-cyan-700:focus,.hover\:border-cyan-500:hover,.hover\:border-cyan-600:hover{border-color:var(--hakari-accent-border)}.focus\:border-cyan-700:focus{--tw-border-opacity:1}@media (min-width:640px){.sm\:grid-cols-3{grid-template-columns:repeat(3,minmax(0,1fr))}.sm\:grid-cols-5{grid-template-columns:repeat(5,minmax(0,1fr))}.sm\:px-6{padding-left:1.5rem;padding-right:1.5rem}}@media (min-width:1024px){.lg\:grid-cols-2{grid-template-columns:repeat(2,minmax(0,1fr))}.lg\:grid-cols-6{grid-template-columns:repeat(6,minmax(0,1fr))}}@media (min-width:1280px){.xl\:grid-cols-3{grid-template-columns:repeat(3,minmax(0,1fr))}}
|
hakari_bench/viewer/leaderboard.py
CHANGED
|
@@ -233,6 +233,7 @@ class LeaderboardService:
|
|
| 233 |
and not task_filter.strip()
|
| 234 |
and not has_length_filters
|
| 235 |
and language_filter_mode == "languages"
|
| 236 |
):
|
| 237 |
precomputed = _load_precomputed_leaderboard_rows(
|
| 238 |
duckdb_path=self.duckdb_path,
|
|
@@ -907,7 +908,7 @@ def _group_by_benchmark(rows: list[TaskScore]) -> dict[str, list[TaskScore]]:
|
|
| 907 |
|
| 908 |
def _aggregate_overall_scores(rows: list[TaskScore], overall: OverallConfig) -> list[TaskScore]:
|
| 909 |
component_by_benchmark = {component.name: component for component in overall.benchmark_components}
|
| 910 |
-
if not
|
| 911 |
return rows
|
| 912 |
|
| 913 |
expected_raw_tasks: dict[str, set[str]] = defaultdict(set)
|
|
@@ -1262,11 +1263,15 @@ def _language_filter_mode_for_view(config: ViewerConfig, view_name: str) -> Lang
|
|
| 1262 |
|
| 1263 |
|
| 1264 |
def _overall_metric_score_group(overall: OverallConfig) -> ScoreGroupConfig | None:
|
| 1265 |
-
if not
|
| 1266 |
return None
|
| 1267 |
return ScoreGroupConfig(name="grouped_tasks", label="Grouped Tasks", group_by="task_key")
|
| 1268 |
|
| 1269 |
|
| 1270 |
def _record_display_model_names(records: list[TaskResultRow], *, include_variant_details: bool) -> list[str]:
|
| 1271 |
if not include_variant_details:
|
| 1272 |
return [record.model_name for record in records]
|
|
|
|
| 233 |
and not task_filter.strip()
|
| 234 |
and not has_length_filters
|
| 235 |
and language_filter_mode == "languages"
|
| 236 |
+
and (overall is None or not _overall_uses_grouped_components(overall))
|
| 237 |
):
|
| 238 |
precomputed = _load_precomputed_leaderboard_rows(
|
| 239 |
duckdb_path=self.duckdb_path,
|
|
|
|
| 908 |
|
| 909 |
def _aggregate_overall_scores(rows: list[TaskScore], overall: OverallConfig) -> list[TaskScore]:
|
| 910 |
component_by_benchmark = {component.name: component for component in overall.benchmark_components}
|
| 911 |
+
if not _overall_uses_grouped_components(overall):
|
| 912 |
return rows
|
| 913 |
|
| 914 |
expected_raw_tasks: dict[str, set[str]] = defaultdict(set)
|
|
|
|
| 1263 |
|
| 1264 |
|
| 1265 |
def _overall_metric_score_group(overall: OverallConfig) -> ScoreGroupConfig | None:
|
| 1266 |
+
if not _overall_uses_grouped_components(overall):
|
| 1267 |
return None
|
| 1268 |
return ScoreGroupConfig(name="grouped_tasks", label="Grouped Tasks", group_by="task_key")
|
| 1269 |
|
| 1270 |
|
| 1271 |
+
def _overall_uses_grouped_components(overall: OverallConfig) -> bool:
|
| 1272 |
+
return any(component.group_by is not None for component in overall.benchmark_components)
|
| 1273 |
+
|
| 1274 |
+
|
| 1275 |
def _record_display_model_names(records: list[TaskResultRow], *, include_variant_details: bool) -> list[str]:
|
| 1276 |
if not include_variant_details:
|
| 1277 |
return [record.model_name for record in records]
|
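The guard that this change factors out of the three call sites in `leaderboard.py` can be sketched in isolation. This is a minimal illustration, not the repository's code: `BenchmarkComponent` and `OverallConfig` below are hypothetical stand-ins that model only the `name`, `group_by`, and `benchmark_components` fields visible in the diff.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class BenchmarkComponent:
    # Hypothetical stand-in: only the fields used by the guard are modeled.
    name: str
    group_by: Optional[str] = None


@dataclass(frozen=True)
class OverallConfig:
    # Hypothetical stand-in for the viewer's overall configuration.
    benchmark_components: list = field(default_factory=list)


def overall_uses_grouped_components(overall: OverallConfig) -> bool:
    """Return True when at least one component declares a group_by key."""
    return any(component.group_by is not None for component in overall.benchmark_components)


flat = OverallConfig([BenchmarkComponent("NanoRTEB"), BenchmarkComponent("NanoMLDR")])
core = OverallConfig([BenchmarkComponent("NanoBEIR", group_by="task_name"), BenchmarkComponent("NanoRTEB")])
```

Under this sketch, `flat` would stay eligible for the precomputed-rows fast path, while `core` would fall through to per-request aggregation because one of its components is grouped.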
tests/test_cli.py
CHANGED
|
@@ -66,6 +66,31 @@ def _default_dense_quantized_variants() -> list[dict[str, object]]:
|
|
| 66 |
]
|
| 67 |
|
| 68 |
|
| 69 |
def _truncate_quantized_variants(*dims: int) -> list[dict[str, object]]:
|
| 70 |
variants: list[dict[str, object]] = []
|
| 71 |
for dim in dims:
|
|
@@ -465,7 +490,7 @@ def test_parse_args_no_default_keeps_explicit_truncate_variants_only() -> None:
|
|
| 465 |
]
|
| 466 |
|
| 467 |
|
| 468 |
-
def
|
| 469 |
args = parse_args(
|
| 470 |
[
|
| 471 |
"evaluate",
|
|
@@ -475,6 +500,20 @@ def test_parse_args_does_not_add_default_quantized_variants_to_sparse_models() -
|
|
| 475 |
]
|
| 476 |
)
|
| 477 |
|
| 478 |
assert args.embedding_variants == []
|
| 479 |
|
| 480 |
|
|
@@ -973,6 +1012,7 @@ def test_parse_args_accepts_query_truncate_sparse_max_dims_embedding_variants()
|
|
| 973 |
assert args.embedding_variants == [
|
| 974 |
_pipeline_variant("sparse_query_max_active_dims_128", _truncate_sparse_max_dims_step(128, target="query")),
|
| 975 |
_pipeline_variant("sparse_query_max_active_dims_64", _truncate_sparse_max_dims_step(64, target="query")),
|
| 976 |
]
|
| 977 |
|
| 978 |
|
|
@@ -1007,52 +1047,19 @@ def test_parse_args_accepts_query_and_docs_truncate_sparse_max_dims_cross_produc
|
|
| 1007 |
]
|
| 1008 |
)
|
| 1009 |
|
| 1010 |
assert args.embedding_variants == [
|
| 1011 |
-
|
| 1012 |
-
|
| 1013 |
-
|
| 1014 |
-
|
| 1015 |
-
|
| 1016 |
-
|
| 1017 |
-
"sparse_query_max_active_dims_8_sparse_document_max_active_dims_128",
|
| 1018 |
-
_truncate_sparse_max_dims_step(8, target="query"),
|
| 1019 |
-
_truncate_sparse_max_dims_step(128, target="corpus"),
|
| 1020 |
-
),
|
| 1021 |
-
_pipeline_variant(
|
| 1022 |
-
"sparse_query_max_active_dims_8_sparse_document_max_active_dims_256",
|
| 1023 |
-
_truncate_sparse_max_dims_step(8, target="query"),
|
| 1024 |
-
_truncate_sparse_max_dims_step(256, target="corpus"),
|
| 1025 |
-
),
|
| 1026 |
-
_pipeline_variant(
|
| 1027 |
-
"sparse_query_max_active_dims_16_sparse_document_max_active_dims_64",
|
| 1028 |
-
_truncate_sparse_max_dims_step(16, target="query"),
|
| 1029 |
-
_truncate_sparse_max_dims_step(64, target="corpus"),
|
| 1030 |
-
),
|
| 1031 |
-
_pipeline_variant(
|
| 1032 |
-
"sparse_query_max_active_dims_16_sparse_document_max_active_dims_128",
|
| 1033 |
-
_truncate_sparse_max_dims_step(16, target="query"),
|
| 1034 |
-
_truncate_sparse_max_dims_step(128, target="corpus"),
|
| 1035 |
-
),
|
| 1036 |
-
_pipeline_variant(
|
| 1037 |
-
"sparse_query_max_active_dims_16_sparse_document_max_active_dims_256",
|
| 1038 |
-
_truncate_sparse_max_dims_step(16, target="query"),
|
| 1039 |
-
_truncate_sparse_max_dims_step(256, target="corpus"),
|
| 1040 |
-
),
|
| 1041 |
-
_pipeline_variant(
|
| 1042 |
-
"sparse_query_max_active_dims_32_sparse_document_max_active_dims_64",
|
| 1043 |
-
_truncate_sparse_max_dims_step(32, target="query"),
|
| 1044 |
-
_truncate_sparse_max_dims_step(64, target="corpus"),
|
| 1045 |
-
),
|
| 1046 |
-
_pipeline_variant(
|
| 1047 |
-
"sparse_query_max_active_dims_32_sparse_document_max_active_dims_128",
|
| 1048 |
-
_truncate_sparse_max_dims_step(32, target="query"),
|
| 1049 |
-
_truncate_sparse_max_dims_step(128, target="corpus"),
|
| 1050 |
-
),
|
| 1051 |
-
_pipeline_variant(
|
| 1052 |
-
"sparse_query_max_active_dims_32_sparse_document_max_active_dims_256",
|
| 1053 |
-
_truncate_sparse_max_dims_step(32, target="query"),
|
| 1054 |
-
_truncate_sparse_max_dims_step(256, target="corpus"),
|
| 1055 |
-
),
|
| 1056 |
]
|
| 1057 |
|
| 1058 |
|
|
|
|
| 66 |
]
|
| 67 |
|
| 68 |
|
| 69 |
+
def _default_sparse_truncation_variants() -> list[dict[str, object]]:
|
| 70 |
+
return _sparse_truncation_grid_variants(
|
| 71 |
+
query_dims=[8, 16, 24, 32],
|
| 72 |
+
document_dims=[64, 128, 256, 512],
|
| 73 |
+
)
|
| 74 |
+
|
| 75 |
+
|
| 76 |
+
def _sparse_truncation_grid_variants(
|
| 77 |
+
*,
|
| 78 |
+
query_dims: list[int],
|
| 79 |
+
document_dims: list[int],
|
| 80 |
+
) -> list[dict[str, object]]:
|
| 81 |
+
variants: list[dict[str, object]] = []
|
| 82 |
+
for query_dim in query_dims:
|
| 83 |
+
for document_dim in document_dims:
|
| 84 |
+
variants.append(
|
| 85 |
+
_pipeline_variant(
|
| 86 |
+
f"sparse_query_max_active_dims_{query_dim}_sparse_document_max_active_dims_{document_dim}",
|
| 87 |
+
_truncate_sparse_max_dims_step(query_dim, target="query"),
|
| 88 |
+
_truncate_sparse_max_dims_step(document_dim, target="corpus"),
|
| 89 |
+
)
|
| 90 |
+
)
|
| 91 |
+
return variants
|
| 92 |
+
|
| 93 |
+
|
| 94 |
def _truncate_quantized_variants(*dims: int) -> list[dict[str, object]]:
|
| 95 |
variants: list[dict[str, object]] = []
|
| 96 |
for dim in dims:
|
|
|
|
| 490 |
]
|
| 491 |
|
| 492 |
|
| 493 |
+
def test_parse_args_defaults_to_sparse_truncation_grid_variants() -> None:
|
| 494 |
args = parse_args(
|
| 495 |
[
|
| 496 |
"evaluate",
|
|
|
|
| 500 |
]
|
| 501 |
)
|
| 502 |
|
| 503 |
+
assert args.embedding_variants == _default_sparse_truncation_variants()
|
| 504 |
+
|
| 505 |
+
|
| 506 |
+
def test_parse_args_can_disable_default_sparse_truncation_grid_variants() -> None:
|
| 507 |
+
args = parse_args(
|
| 508 |
+
[
|
| 509 |
+
"evaluate",
|
| 510 |
+
"sparse",
|
| 511 |
+
"--model",
|
| 512 |
+
"naver/splade-v3",
|
| 513 |
+
"--no-default-embedding-variants",
|
| 514 |
+
]
|
| 515 |
+
)
|
| 516 |
+
|
| 517 |
assert args.embedding_variants == []
|
| 518 |
|
| 519 |
|
|
|
|
| 1012 |
assert args.embedding_variants == [
|
| 1013 |
_pipeline_variant("sparse_query_max_active_dims_128", _truncate_sparse_max_dims_step(128, target="query")),
|
| 1014 |
_pipeline_variant("sparse_query_max_active_dims_64", _truncate_sparse_max_dims_step(64, target="query")),
|
| 1015 |
+
*_default_sparse_truncation_variants(),
|
| 1016 |
]
|
| 1017 |
|
| 1018 |
|
|
|
|
| 1047 |
]
|
| 1048 |
)
|
| 1049 |
|
| 1050 |
+
explicit_variants = _sparse_truncation_grid_variants(
|
| 1051 |
+
query_dims=[8, 16, 32],
|
| 1052 |
+
document_dims=[64, 128, 256],
|
| 1053 |
+
)
|
| 1054 |
+
explicit_names = {str(variant["name"]) for variant in explicit_variants}
|
| 1055 |
+
|
| 1056 |
assert args.embedding_variants == [
|
| 1057 |
+
*explicit_variants,
|
| 1058 |
+
*[
|
| 1059 |
+
variant
|
| 1060 |
+
for variant in _default_sparse_truncation_variants()
|
| 1061 |
+
if str(variant["name"]) not in explicit_names
|
| 1062 |
+
],
|
| 1063 |
]
|
| 1064 |
|
| 1065 |
|
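The expectations in the updated CLI tests follow from two small steps: a query-by-document cross product, and a merge that keeps explicit variants first and appends only those defaults whose names were not already requested. The sketch below reproduces that arithmetic under stated assumptions: `sparse_grid` and `merge_with_defaults` are hypothetical names, and the dictionaries are a simplified shape that does not match what `_pipeline_variant` returns in the repository.

```python
from itertools import product


def sparse_grid(query_dims, document_dims):
    """Cross product of query and document limits, in query-major order."""
    return [
        {
            # The name format is copied from the test helper in the diff.
            "name": f"sparse_query_max_active_dims_{q}_sparse_document_max_active_dims_{d}",
            "query_max_active_dims": q,        # simplified, hypothetical field
            "document_max_active_dims": d,     # simplified, hypothetical field
        }
        for q, d in product(query_dims, document_dims)
    ]


def merge_with_defaults(explicit, defaults):
    """Keep explicit variants first, then append defaults with unseen names."""
    seen = {variant["name"] for variant in explicit}
    return [*explicit, *(v for v in defaults if v["name"] not in seen)]


# Grid sizes taken from the tests: a 4 x 4 default grid and a 3 x 3 explicit grid.
default_grid = sparse_grid([8, 16, 24, 32], [64, 128, 256, 512])
explicit = sparse_grid([8, 16, 32], [64, 128, 256])
merged = merge_with_defaults(explicit, default_grid)
```

With these inputs the explicit grid contributes 9 variants and the defaults add the 7 combinations that involve either a query limit of 24 or a document limit of 512, so the merged list has 16 entries with no duplicate names.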
tests/test_viewer.py
CHANGED
|
@@ -56,7 +56,6 @@ def test_viewer_config_uses_all_core_and_grouped_overall_views() -> None:
|
|
| 56 |
"MNanoBEIR",
|
| 57 |
"NanoMMTEB-v2",
|
| 58 |
"NanoRTEB",
|
| 59 |
-
"NanoMIRACL",
|
| 60 |
"NanoMLDR",
|
| 61 |
"NanoBRIGHT",
|
| 62 |
"NanoLaw",
|
|
@@ -70,6 +69,15 @@ def test_viewer_config_uses_all_core_and_grouped_overall_views() -> None:
|
|
| 70 |
core_overall = config.overall_for_view("Core")
|
| 71 |
assert core_overall is not None
|
| 72 |
assert core_overall.benchmark_names == core_benchmarks
|
| 73 |
grouped_overall = config.overall_for_view("Group")
|
| 74 |
assert grouped_overall is not None
|
| 75 |
assert [component.name for component in grouped_overall.benchmark_components] == [
|
|
@@ -135,7 +143,7 @@ def test_viewer_config_uses_all_core_and_grouped_overall_views() -> None:
|
|
| 135 |
assert all(benchmark in config.overall.benchmark_names for benchmark in language_nanomteb_benchmarks)
|
| 136 |
assert all(benchmark not in core_overall.benchmark_names for benchmark in language_nanomteb_benchmarks)
|
| 137 |
assert "NanoMIRACL" in config.overall.benchmark_names
|
| 138 |
-
assert "NanoMIRACL" in core_overall.benchmark_names
|
| 139 |
assert "NanoCMTEB" in config.view_names
|
| 140 |
assert "NanoCMTEB" in config.overall.benchmark_names
|
| 141 |
nano_law = config.benchmark_for_view("NanoLaw")
|
|
@@ -173,7 +181,7 @@ def test_core_benchmark_view_group_only_contains_primary_core_benchmarks() -> No
|
|
| 173 |
assert _view_group("NanoMMTEB-v2") == "Core benchmarks"
|
| 174 |
assert _view_group("MNanoBEIR") == "Core benchmarks"
|
| 175 |
assert _view_group("NanoRTEB") == "Core benchmarks"
|
| 176 |
-
assert _view_group("NanoMIRACL") == "
|
| 177 |
assert _view_group("NanoMLDR") == "Core benchmarks"
|
| 178 |
assert _view_group("NanoBRIGHT") == "Core benchmarks"
|
| 179 |
assert _view_group("NanoLaw") == "Core benchmarks"
|
|
@@ -602,6 +610,8 @@ def test_leaderboard_renders_grouped_benchmark_picker_and_sticky_columns(tmp_pat
|
|
| 602 |
assert response.status_code == 200
|
| 603 |
assert "Benchmark groups" in response.text
|
| 604 |
assert 'data-icon="layers"' in response.text
|
| 605 |
assert response.text.index("Target") < response.text.index("Overall")
|
| 606 |
assert 'data-testid="primary-benchmark-column"' in response.text
|
| 607 |
primary_column = response.text.split('data-testid="primary-benchmark-column"', 1)[1].split('data-testid="secondary-benchmark-column"', 1)[0]
|
|
@@ -1333,6 +1343,7 @@ def test_viewer_can_include_embedding_variants_in_ranking(tmp_path: Path) -> Non
|
|
| 1333 |
assert 'data-filter-icon="ruler"' in response.text
|
| 1334 |
assert 'data-filter-detail="quant_filter"' in response.text
|
| 1335 |
assert 'data-filter-icon="binary"' in response.text
|
| 1336 |
assert "grid-cols-2" in response.text
|
| 1337 |
assert "sm:grid-cols-3" in response.text
|
| 1338 |
assert response.text.count(">All</button>") == 5
|
|
|
|
| 56 |
"MNanoBEIR",
|
| 57 |
"NanoMMTEB-v2",
|
| 58 |
"NanoRTEB",
|
|
|
|
| 59 |
"NanoMLDR",
|
| 60 |
"NanoBRIGHT",
|
| 61 |
"NanoLaw",
|
|
|
|
| 69 |
core_overall = config.overall_for_view("Core")
|
| 70 |
assert core_overall is not None
|
| 71 |
assert core_overall.benchmark_names == core_benchmarks
|
| 72 |
+
assert [component.group_by for component in core_overall.benchmark_components] == [
|
| 73 |
+
"task_name",
|
| 74 |
+
None,
|
| 75 |
+
None,
|
| 76 |
+
None,
|
| 77 |
+
None,
|
| 78 |
+
None,
|
| 79 |
+
None,
|
| 80 |
+
]
|
| 81 |
grouped_overall = config.overall_for_view("Group")
|
| 82 |
assert grouped_overall is not None
|
| 83 |
assert [component.name for component in grouped_overall.benchmark_components] == [
|
|
|
|
| 143 |
assert all(benchmark in config.overall.benchmark_names for benchmark in language_nanomteb_benchmarks)
|
| 144 |
assert all(benchmark not in core_overall.benchmark_names for benchmark in language_nanomteb_benchmarks)
|
| 145 |
assert "NanoMIRACL" in config.overall.benchmark_names
|
| 146 |
+
assert "NanoMIRACL" not in core_overall.benchmark_names
|
| 147 |
assert "NanoCMTEB" in config.view_names
|
| 148 |
assert "NanoCMTEB" in config.overall.benchmark_names
|
| 149 |
nano_law = config.benchmark_for_view("NanoLaw")
|
|
|
|
| 181 |
assert _view_group("NanoMMTEB-v2") == "Core benchmarks"
|
| 182 |
assert _view_group("MNanoBEIR") == "Core benchmarks"
|
| 183 |
assert _view_group("NanoRTEB") == "Core benchmarks"
|
| 184 |
+
assert _view_group("NanoMIRACL") == "Domain-specific"
|
| 185 |
assert _view_group("NanoMLDR") == "Core benchmarks"
|
| 186 |
assert _view_group("NanoBRIGHT") == "Core benchmarks"
|
| 187 |
assert _view_group("NanoLaw") == "Core benchmarks"
|
|
|
|
| 610 |
assert response.status_code == 200
|
| 611 |
assert "Benchmark groups" in response.text
|
| 612 |
assert 'data-icon="layers"' in response.text
|
| 613 |
+
assert 'class="border px-2 py-1 text-[0.8125rem] border-cyan-700 bg-cyan-50 text-cyan-900"' in response.text
|
| 614 |
+
assert 'class="border px-3 py-1.5 text-sm' not in response.text
|
| 615 |
assert response.text.index("Target") < response.text.index("Overall")
|
| 616 |
assert 'data-testid="primary-benchmark-column"' in response.text
|
| 617 |
primary_column = response.text.split('data-testid="primary-benchmark-column"', 1)[1].split('data-testid="secondary-benchmark-column"', 1)[0]
|
|
|
|
| 1343 |
assert 'data-filter-icon="ruler"' in response.text
|
| 1344 |
assert 'data-filter-detail="quant_filter"' in response.text
|
| 1345 |
assert 'data-filter-icon="binary"' in response.text
|
| 1346 |
+
assert 'summary class="cursor-pointer px-1.5 py-0.5 text-[0.8125rem] font-medium text-zinc-800"' in response.text
|
| 1347 |
assert "grid-cols-2" in response.text
|
| 1348 |
assert "sm:grid-cols-3" in response.text
|
| 1349 |
assert response.text.count(">All</button>") == 5
|
tests/test_viewer_browser.py
CHANGED
|
@@ -56,6 +56,24 @@ def test_viewer_browser_smoke_covers_static_javascript(tmp_path: Path) -> None:
|
|
| 56 |
assert section_icon_state["color"] != "rgb(0, 0, 0)"
|
| 57 |
assert page.locator("button", has_text="Variant impact").locator("svg[data-icon='git-compare-arrows']").count() == 1
|
| 58 |
assert page.locator("summary", has_text="Languages").locator("svg[data-icon='languages']").count() == 1
|
| 59 |
page.get_by_text("256d <- 384").wait_for(timeout=15_000)
|
| 60 |
compact_table_state = page.locator("tbody tr:not([hidden]) td").first.evaluate(
|
| 61 |
"""(el) => ({
|
|
|
|
| 56 |
assert section_icon_state["color"] != "rgb(0, 0, 0)"
|
| 57 |
assert page.locator("button", has_text="Variant impact").locator("svg[data-icon='git-compare-arrows']").count() == 1
|
| 58 |
assert page.locator("summary", has_text="Languages").locator("svg[data-icon='languages']").count() == 1
|
| 59 |
+
compact_tile_state = page.locator("button", has_text="Variant impact").first.evaluate(
|
| 60 |
+
"""(el) => ({
|
| 61 |
+
fontSize: parseFloat(getComputedStyle(el).fontSize),
|
| 62 |
+
paddingLeft: parseFloat(getComputedStyle(el).paddingLeft),
|
| 63 |
+
paddingTop: parseFloat(getComputedStyle(el).paddingTop),
|
| 64 |
+
})"""
|
| 65 |
+
)
|
| 66 |
+
assert compact_tile_state["fontSize"] == pytest.approx(13.0, abs=0.1)
|
| 67 |
+
assert compact_tile_state["paddingLeft"] <= 8.0
|
| 68 |
+
assert compact_tile_state["paddingTop"] <= 4.0
|
| 69 |
+
language_tile_state = page.locator("nav[aria-label='Language pages'] button", has_text="All").first.evaluate(
|
| 70 |
+
"""(el) => ({
|
| 71 |
+
fontSize: parseFloat(getComputedStyle(el).fontSize),
|
| 72 |
+
paddingLeft: parseFloat(getComputedStyle(el).paddingLeft),
|
| 73 |
+
paddingTop: parseFloat(getComputedStyle(el).paddingTop),
|
| 74 |
+
})"""
|
| 75 |
+
)
|
| 76 |
+
assert language_tile_state == pytest.approx({"fontSize": 13.0, "paddingLeft": 8.0, "paddingTop": 4.0}, abs=0.1)
|
| 77 |
page.get_by_text("256d <- 384").wait_for(timeout=15_000)
|
| 78 |
compact_table_state = page.locator("tbody tr:not([hidden]) td").first.evaluate(
|
| 79 |
"""(el) => ({
|
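The pixel values asserted in the browser test line up with the utility classes asserted in `tests/test_viewer.py` (`px-2 py-1 text-[0.8125rem]`). The arithmetic can be checked with a short sketch, assuming a 16px root font size and Tailwind's default spacing scale (`px-2` = 0.5rem, `py-1` = 0.25rem); `rem_to_px` is a hypothetical helper, not part of the repository.

```python
ROOT_FONT_SIZE_PX = 16.0  # assumption: the browser's default root font size


def rem_to_px(rem: float, root_px: float = ROOT_FONT_SIZE_PX) -> float:
    """Convert a CSS rem length to pixels for a given root font size."""
    return rem * root_px


expected_tile = {
    "fontSize": rem_to_px(0.8125),   # text-[0.8125rem]
    "paddingLeft": rem_to_px(0.5),   # px-2 on the default spacing scale
    "paddingTop": rem_to_px(0.25),   # py-1 on the default spacing scale
}
```

If the page or test browser overrides the root font size, the computed values scale with it, which is one reason the test compares with a tolerance rather than exact equality.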