languagebench / evals

Commit History

Upload from GitHub Actions: blocklist: drop the grace for slow failing models, not just egregious ones
608b646
verified

davidpomerenke committed on

Upload from GitHub Actions: models: drop kimi-k2.6; exclude egregiously-failing models after one run
4c601bb
verified

davidpomerenke committed on

Upload from GitHub Actions: eval: check runtime budget per-batch so a slow model can't blow the 6h cap
594d28a
verified

davidpomerenke committed on

Upload from GitHub Actions: eval: fix per-combo resilience (tqdm_asyncio.gather has no return_exceptions)
18acfdb
verified

davidpomerenke committed on

Upload from GitHub Actions: eval: don't let one bad (task,language) combo crash the whole run
a92221e
verified

davidpomerenke committed on

Upload from GitHub Actions: models: migrate catalog to /api/v1/models; enforce privacy per-request
f502bec
verified

davidpomerenke committed on

Upload from GitHub Actions: discovery: surface newer flagships from curated families; blocklist: require 2 consecutive bad runs
f28fed1
verified

davidpomerenke committed on

Upload from GitHub Actions: discovery: one flagship per product line; eval: graceful 6h-safe runtime budget
4047210
verified

davidpomerenke committed on

Upload from GitHub Actions: main: gate publishing on coverage-completeness, not just error rate
c1041db
verified

davidpomerenke committed on

Upload from GitHub Actions: util: retry HF push with backoff; write local snapshot before push
2a1f0a5
verified

davidpomerenke committed on

Upload from GitHub Actions: main: checkpoint per fully-evaluated model instead of once at the end
eaa7534
verified

davidpomerenke committed on

Upload from GitHub Actions: models: replace claude-opus-4.5/4.6 with 4.8 in curated list
15e8f68
verified

davidpomerenke committed on

Upload from GitHub Actions: discovery: filter out voice/ASR/vision/build endpoints
f2add9e
verified

davidpomerenke committed on

Upload from GitHub Actions: backend: handle null creation_date in three apply() calls
f6a28ed
verified

davidpomerenke committed on

Upload from GitHub Actions: fast-fail on account-level API errors; refuse to ship runs with >80% errors
c2afc16
verified

davidpomerenke committed on

Upload from GitHub Actions: unblock workflow: materialize gcloud creds on runner; lazy-init translate client
bec2f46
verified

davidpomerenke committed on

Upload from GitHub Actions: guard main.py against partial-scale HF pushes; restore aggregated results
691e6c2
verified

davidpomerenke committed on

Upload from GitHub Actions: refresh pyproject metadata + README HF frontmatter
1eccc3f
verified

davidpomerenke committed on

Upload from GitHub Actions: new model
78be468
verified

davidpomerenke committed on

Upload from GitHub Actions: added new models
83d2972
verified

davidpomerenke committed on

Upload from GitHub Actions: added new models
93a8617
verified

davidpomerenke committed on

Upload from GitHub Actions: Merge pull request #28 from datenlabor-bmz/jn-dev
55b63ea
verified

davidpomerenke committed on

Upload from GitHub Actions: cleaned up code
2586cfe
verified

davidpomerenke committed on

Upload from GitHub Actions: added opus 4.5
0a17acf
verified

davidpomerenke committed on

Upload from GitHub Actions: add gpt-5.1, gemini-3
9ea2dd3
verified

davidpomerenke committed on

Upload from GitHub Actions: flores filter for available dev split
34b05c6
verified

davidpomerenke committed on

Upload from GitHub Actions: model name no bracket stuff
aa92add
verified

davidpomerenke committed on

Upload from GitHub Actions: drop normalization
972026c
verified

davidpomerenke committed on

Upload from GitHub Actions: improve norwegian fix
6f0e312
verified

davidpomerenke committed on

Upload from GitHub Actions: fix norwegian
0cbac6c
verified

davidpomerenke committed on

Upload from GitHub Actions: Merge pull request #22 from datenlabor-bmz/dev
2cdada4
verified

davidpomerenke committed on

Upload from GitHub Actions: Add auto-translated datasets
68a93b5
verified

davidpomerenke committed on

Upload from GitHub Actions: Merge pull request #18 from datenlabor-bmz/pr-17
a0d1624
verified

davidpomerenke committed on

Upload from GitHub Actions: Add auto-translated datasets
c790fdb
verified

davidpomerenke committed on

Upload from GitHub Actions: ran full evaluation locally
088f96f
verified

davidpomerenke committed on

Upload from GitHub Actions: minor chashing change
b39df3c
verified

davidpomerenke committed on

Upload from GitHub Actions: updated and cleaned up scripts for new eval runs
963cb78
verified

davidpomerenke committed on

Upload from GitHub Actions: Update models.py, models.json, and results.json with latest evaluation data and model additions
8eebb41
verified

davidpomerenke committed on

Upload from GitHub Actions: Add Todos for using existing machine-translated datasets rather than our own ones
56adaa2
verified

davidpomerenke committed on

Upload from GitHub Actions: updated translation functions
8f5ce26
verified

davidpomerenke committed on

Upload from GitHub Actions: import flexibility on backend
b8cbeff
verified

davidpomerenke committed on

Upload from GitHub Actions: fixed import error
0a30811
verified

davidpomerenke committed on

Upload from GitHub Actions: updated frontend and backend to fix bugs
4e8cb1a
verified

davidpomerenke committed on

Upload from GitHub Actions: Merge pull request #13 from datenlabor-bmz/jn-dev
80d21cb
verified

davidpomerenke committed on

Upload from GitHub Actions: Merge pull request #10 from datenlabor-bmz/jn-dev
c2eeeac
verified

davidpomerenke committed on

Upload from GitHub Actions: updated batch size and delay
02f927b
verified

davidpomerenke committed on

Upload from GitHub Actions: updated workflow settings
e51c770
verified

davidpomerenke committed on

Upload from GitHub Actions: Merge pull request #9 from datenlabor-bmz/jn-dev
7c06aef
verified

davidpomerenke committed on

Upload from GitHub Actions: Merge pull request #7 from datenlabor-bmz/jn-dev
6878a71
verified

davidpomerenke committed on

Upload from GitHub Actions: Merge pull request #6 from datenlabor-bmz/jn-dev
6234f5c
verified

davidpomerenke committed on