Commit History
Cross-source dedup, plotbox polish, pretty URLs, eval page fallbacks 6b39d1f
Cross-suite signals, sortable leaderboard, theme cleanup 0314721
Mount comparability panel above leaderboard, restyle, drop empty promises 02691ce
Fix "null–null (null%)" confidence interval rendering ae31eaf
Add rule-based policy-mode summaries for model & eval views aacebd7
Cross-source dedup, plotbox polish, pretty URLs, eval page fallbacks 0b45710
Match nested benchmarks in /evals search; auto-expand families with hits 26f932a
Eval detail polish: hide empty fields, redesign splits, surface evaluator c8aca27
Move reader-mode toggle to detail pages; theme banners + apples-to-apples 4629534
Merge cross-source benchmark families; tidy leaderboard panel + table chrome 8ef4cbc
Restore curated benchmark families; polish frontier panel UX ca20f78
Live snapshot date, hide empty Updated col, clean slice contamination cb0ce7c
Humanize family names whose display matches the key under different separators b763f91
Make /models tables column-sortable; rebalance /evals + /models toolbars 5a2d59c
Clean up Source column and per-row dataset label noise eec1852
Hide subtask-scope metrics from chips by default in matrix view 4cb8b56
Render score-distribution metric picker as chips, not a dropdown 1303965
Treat single-root-metric subtask evals as slice-pickable, not matrix 4ac3a9b
Move split selector below the reporting comparison heading 629a612
Fix sort toggle direction and remove categories as sortable column c9c5a30
Sort evals list by family name; add sortable columns; use cleaned display names 919a75f
Fix ranks-high/low-in using only sidecar ordinal data 970fdbe
Wire search bar to overlaps table and hide chips in overlaps view 0f5fb5f
Compute and apply cleaned benchmark counts per model c2e86ea
Harden cleanHierarchy fallback and add family-name filter chips 8529a4b
Restructure model details + extend cleanHierarchy for split families and aggregator dedup 06313c1
Add list-view toggle to consolidate cross-family duplicate benchmarks 26eb09f
Square off deep-dive theme and surface cross-family duplicates b75f4c3
Switch family/model views to curated category tags bc08b3b
Route peer-ranks fetch through SNAPSHOT_URL sidecar 6cc7b0b
Group model/eval-detail benchmarks by hierarchy.json families f073e7a
Drop latest_timestamp fallback for release_date display 8717cca
Guard summaryText against null in PolicyOverview c3a3598
Refactor to align on benchmark hierarchy 2ed4959
Update with datafix v2 11542d9
Consolidate hierarchy terminology + handle v2 hierarchy shape 350e866
Reconcile UI with v2 backend payload + drop redundant signal cards d52d9e0
Tighten eval cards UI and clean up stale local data 32864b0
Add new component files and align app to EvalEval design system dbdd6d1
Replace shadcn-styled UI elements with design system primitives 187ffe6
Add plain-language captions and mode-aware framing for policy readers 3ad47c6
Align user-facing labels with paper terminology 4be62f9
Merge corpus dashboard into home as paper-aligned landing 5279156
Deploy DuckDB-backed frontend to da8db3e
Jenny Chim committed on