Jenny Chim
Deploy DuckDB-backed frontend to
da8db3e

Transformations registry

Canonical specs for data transformations the Next.js app currently performs that should ultimately live in the upstream pipeline (eval_cards_backend_pipeline).

Framing

The TS code in lib/ does meaningful data transformation work (token canonicalization, variant grouping, source-metadata defaults, category inference, score normalization, etc.), most of which the Python pipeline does not yet do (or does differently). For this migration:

  • TS is the current source of truth for what these transformations should produce. The TS rules have been refined over time against real evaluation data; they encode product decisions.
  • Each transformation is a candidate to move upstream. The pipeline owns data shape and content; doing the transformation there means every downstream consumer (not just this Next.js app) gets the canonical form.
  • The migration path for each item: write a spec here → write executable tests sourced from the spec → hand the spec + tests to the pipeline owner → verify pipeline output matches across the full corpus → delete the TS implementation.
  • Pre-processing in this repo (e.g. a one-shot Python script run at data-ingestion time) is allowed but not required. The default is "transformation stays in TS until the pipeline catches up." Front-load only when the TS implementation is so brittle or expensive that on-the-fly transformation is intolerable.
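
The "verify pipeline output matches across the full corpus" step can be sketched as a parity check. This is a minimal illustration, not code from the repo: `checkParity`, `CorpusRow`, and `ParityReport` are hypothetical names, the trim-and-lowercase transform is a stand-in rather than a real rule from lib/, and outputs are assumed to be primitives so that strict equality is a sufficient comparison.

```typescript
// Hypothetical helper: none of these names exist in the repo.
interface CorpusRow<I, O> {
  input: I;          // raw value the TS transformation currently receives
  pipelineValue: O;  // value the pipeline emitted for the same row
}

interface ParityReport<I, O> {
  total: number;
  matches: number;
  divergences: Array<{ input: I; tsValue: O; pipelineValue: O }>;
}

// Runs the TS transformation over every row and records where the pipeline disagrees.
// Assumes outputs are primitives, so strict equality is a sufficient comparison.
function checkParity<I, O>(
  rows: CorpusRow<I, O>[],
  tsTransform: (input: I) => O,
): ParityReport<I, O> {
  const report: ParityReport<I, O> = { total: rows.length, matches: 0, divergences: [] };
  for (const row of rows) {
    const tsValue = tsTransform(row.input);
    if (tsValue === row.pipelineValue) {
      report.matches += 1;
    } else {
      report.divergences.push({ input: row.input, tsValue, pipelineValue: row.pipelineValue });
    }
  }
  return report;
}

// Toy usage: a stand-in transformation (trim + lowercase), not a real rule from lib/.
const parityReport = checkParity(
  [
    { input: " Apache-2.0 ", pipelineValue: "apache-2.0" },
    { input: "MIT", pipelineValue: "MIT" },
  ],
  (s: string) => s.trim().toLowerCase(),
);
console.log(`${parityReport.matches}/${parityReport.total} rows match`); // prints "1/2 rows match"
```

The divergence list is what would feed a spec's "Divergences detected" section; the TS implementation is only safe to delete once that list is empty for the full corpus.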

One thing to watch: defaults vs unconditional normalization

For each rule, classify in the spec how it interacts with pre-existing data:

  • Default-only (don't overwrite when pipeline already emits a value): e.g. "if source_metadata.evaluator_relationship is missing, default to other". Pipeline-side fix is to emit the default upstream rather than letting consumers fill it in.
  • Unconditional normalization (always overwrite): e.g. "lowercase the v in version tokens regardless of upstream input". Pipeline-side fix is to apply the rule before emitting; downstream consumers should not need to re-derive.

Mis-classifying these is the failure mode that bit us in Phase 3 (treating a normalization rule as if it were a default, ending up overwriting category data). Each spec must call out which kind it is.
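
The difference between the two kinds can be sketched with the two example rules above. This is an illustration under assumptions, not the lib/ implementation: the function names are hypothetical, "missing" is assumed to mean `undefined` or `null`, and a version token is assumed to be a `V` immediately followed by a digit.

```typescript
// Hypothetical shapes and helper names; only the two example rules come from the text above.
interface SourceMetadata {
  evaluator_relationship?: string | null;
}

// Default-only: fill the field only when the pipeline emitted nothing.
// Assumption: "missing" means undefined or null.
function applyEvaluatorRelationshipDefault(meta: SourceMetadata): SourceMetadata {
  const current = meta.evaluator_relationship;
  if (current === undefined || current === null) {
    return { ...meta, evaluator_relationship: "other" };
  }
  return meta; // a value is present: leave it untouched
}

// Unconditional normalization: always rewrite, whatever upstream emitted.
// Assumption: a version token starts with "V" immediately followed by a digit (e.g. "V2").
function normalizeVersionToken(token: string): string {
  return token.replace(/^V(?=\d)/, "v");
}

console.log(applyEvaluatorRelationshipDefault({}).evaluator_relationship);
console.log(applyEvaluatorRelationshipDefault({ evaluator_relationship: "first_party" }).evaluator_relationship);
console.log(normalizeVersionToken("V2"));
```

The default-only function has an explicit "value present" branch that returns the input unchanged; the normalization function has no such branch. A spec's classification decides which of these two shapes the pipeline-side rule should take.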

Where it belongs: cleaning vs reshape

A second classification each spec should call out: what kind of work is this transformation?

  • Cleaning / standardization (changes a value): license shorthand, developer name canonicalization, identity tokens, timestamp format, category labels. These belong in the pipeline; the per-item workflow above is built for them. Default-vs-normalization (the section above) is the sub-question.
  • Reshape / dedup / aggregate (computes a derived view): variant dedup with "freshest wins", per-category counts, top-scores ranking, hierarchy flattening. These belong in DuckDB SQL, either materialized into pipeline parquet (the answer is the same for every consumer; emit it pre-computed) or expressed as a query at request time (consumers slice differently; let SQL do the work). The TS implementation here is scaffolding.

For the reshape class the migration target shape is itself a design choice: capture the operation (e.g. "max retrieved_timestamp wins per variant key, take its source_metadata along") and flag it for the parquet-schema / SQL conversation rather than mechanically translating the TS code line-by-line. See notes/migration-plan.md § "Data direction" for the full framing.
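
To make "capture the operation" concrete, the freshest-wins example can be written down as a small sketch. The row shape and the name `freshestPerVariant` are hypothetical, `retrieved_timestamp` is assumed to be already parsed into a number (e.g. unix seconds), and ties are assumed to resolve to the first row seen; the real TS implementation may differ on all of these.

```typescript
// Hypothetical row shape; field names other than retrieved_timestamp and
// source_metadata are illustrative.
interface VariantRow {
  variant_key: string;                       // grouping key for the dedup
  retrieved_timestamp: number;               // assumed already numeric (e.g. unix seconds)
  source_metadata: Record<string, unknown>;  // travels with the winning row
}

// Keeps one row per variant_key: the one with the largest retrieved_timestamp.
// On ties the first row seen is kept (strict > comparison).
function freshestPerVariant(rows: VariantRow[]): VariantRow[] {
  const best = new Map<string, VariantRow>();
  for (const row of rows) {
    const seen = best.get(row.variant_key);
    if (seen === undefined || row.retrieved_timestamp > seen.retrieved_timestamp) {
      best.set(row.variant_key, row);
    }
  }
  return Array.from(best.values());
}

// Toy usage with made-up data.
const dedupedVariants = freshestPerVariant([
  { variant_key: "a", retrieved_timestamp: 10, source_metadata: { source: "old" } },
  { variant_key: "b", retrieved_timestamp: 5, source_metadata: { source: "only" } },
  { variant_key: "a", retrieved_timestamp: 20, source_metadata: { source: "new" } },
]);
console.log(dedupedVariants.map((r) => `${r.variant_key}@${r.retrieved_timestamp}`).join(", "));
```

The same operation could plausibly be expressed in DuckDB SQL with a `row_number()` window partitioned by the variant key and ordered by `retrieved_timestamp` descending. Whether to materialize it into parquet or run it at request time is the design question to flag, not something this sketch decides.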

Index

| # | Transformation | Spec | Tests | Pipeline status | Migration item |
|---|---|---|---|---|---|
| 01 | Model identity canonicalization | 01-identity-canonicalization.md | tests/transformations/identity-canonicalization.test.ts | partial (model_family_id ✅, model_family_name ❌ on 1,260 cards) | #1 |
| 02 | Setup-alias variant merging | 02-setup-alias-merging.md | tests/transformations/setup-alias-merging.test.ts | not started (cache file shows pre-merge state; runtime normalizer is what merges) | #2 |
| 03 | License string normalization | 03-license-normalization.md | tests/transformations/license-normalization.test.ts | not implemented; pipeline emits free-text data_licensing only | #18 |
| 04 | Dataset URL synthesis | 04-dataset-url-synthesis.md | tests/transformations/dataset-url-synthesis.test.ts | not implemented; dataset_url field never populated in prod (564/587 use url[0], 22/587 use hf_repo template) | #20 |
| 05 | Slug candidate generation (file lookup) | 05-slug-candidates.md | tests/transformations/slug-candidates.test.ts | not implemented; 39% of model lookups + 43% of developer lookups need a non-zero retry position in production | #19 |
| 06 | Developer name canonicalization | 06-developer-name-canonicalization.md | tests/transformations/developer-name-canonicalization.test.ts | not implemented; pipeline emits raw developer string. TS map covers 1.8% of devs / 11.9% of cards; title-case fallback fires on 55.6% of devs / 48.4% of cards | #9 |
| 07 | Timestamp normalization | 07-timestamp-normalization.md | tests/transformations/timestamp-normalization.test.ts | not implemented; production is 99.99% unix-seconds-strings (86,178/86,183) + 5 ISO datetime. Three different TS variants exist with subtly different semantics; pipeline canonicalization to ISO 8601 collapses them | #13 |
| 08 | Benchmark display names | 08-benchmark-display-names.md | tests/transformations/benchmark-display-names.test.ts | not implemented; 30-entry hand-curated map covers ~74% of distinct suite keys but only ~3% of distinct benchmark values; ~97% fall through to a humanizeToken fallback that mangles acronyms (MMLU-PRO → MMLU PRO, helm_air_bench → Helm Air Bench). A second functionally-dead duplicate exists in lib/eval-processing.ts with substring-match semantics that disagree with the active path. | #8 |
| 09 | Metric display name expansion | 09-metric-display-name-expansion.md | tests/transformations/metric-display-name-expansion.test.ts | defensive scaffolding; both rules fire 0 times against current corpus (0/86,183 result rows for generic-name expansion; 0/587 eval-list entries for prefersBenchmarkName heuristic) | #10 |
| 10 | Params billions parsing | 10-params-parsing.md | tests/transformations/params-parsing.test.ts | partial; model-cards.json.params_billions is clean number for 87% of cards (5072/5830); per-row additional_details.params_billions is string for 31.7% of rows (27,361/86,183), 47.2% (40,648) have no resolved value at all. Five different TS parsers diverge on edge cases; Variant C's "context-window beats param count" quirk fires on 472 rows (0.55%, names like Yi-1.5-34B-32K) | #12 |
| 11 | Benchmark-card attachment (per-eval lookup join) | 11-benchmark-card-attachment.md | tests/transformations/benchmark-card-attachment.test.ts | partial; pipeline inlines benchmark_card on 88/587 evals (15%), other 499 fall through to runtime retry. Of those, 10 hit at position 0 and 489 miss entirely (most lack any matching card in benchmark-metadata.json). Map-build first-write-wins silently drops the helm_instruct card (different content from kept helm_capabilities, both named "HELM"); 29/83 cards orphaned | #17 |
| 12 | Per-instance JSONL normalization | 12-instance-level-data.md | tests/transformations/instance-level-data.test.ts | pipeline emits canonical shape; parser's defensive fallback branches mostly dead (input/ground_truth/is_correct/sample_id all 100% via the canonical path; response splits 97.31% answer_attribution / 2.49% messages / 0.20% output). 712/86,183 result rows (0.83%) have inline samples; full sets (~35k samples) sit behind source_url and are accessible only via the orphaned fetchInstanceLevelData (no current UI consumer). | #7 |

(More entries land as we work each migration item. See notes/migration-plan.md for the full backlog.)

File format

Each spec follows the same structure so the pipeline owner can read them uniformly. Template:

# <Transformation name>

## Rule
[Plain-English description of what the transformation does]

## Classification
- [ ] Default-only (do not overwrite when value present)
- [ ] Unconditional normalization (always apply)
[explanation]

- [ ] Cleaning / standardization → pipeline (changes a value's content; use per-item workflow)
- [ ] Reshape / dedup / aggregate → DuckDB SQL (computes a derived view; capture operation, flag for parquet-schema/SQL conversation)
[explanation; if both halves apply, explain the split]

## Inputs and expected outputs
[Table: input | expected output | notes / which rule branch this hits]

## Current TS implementation
- [file:line references]
- [key helpers/constants]

## Pipeline status
[Per-rule status against full live cache: matches / disagrees / not implemented]

## Divergences detected
[Concrete examples of pipeline-vs-TS disagreement, with row counts]

## Migration checklist
- [ ] Spec written
- [ ] Tests cover each rule branch
- [ ] Filed with pipeline owner (link)
- [ ] Pipeline emits matching values across full corpus
- [ ] TS code deleted; callers read pipeline fields directly
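
One way to connect a spec's "Inputs and expected outputs" table to the "Tests cover each rule branch" checklist item is to keep the table as data and check branch coverage alongside correctness. The sketch below is framework-free and hypothetical throughout: `SpecCase`, `runSpecCases`, the branch labels, and the toy `lowercaseVersionPrefix` transform are illustrative and are not taken from the existing tests under tests/transformations/.

```typescript
// Hypothetical types and helper; not taken from the existing test files.
interface SpecCase<I, O> {
  input: I;
  expected: O;
  branch: string; // which rule branch this row is meant to hit
}

// Returns the cases whose output is wrong and the required branches no case exercises.
// Assumes outputs are primitives, so strict inequality is a sufficient comparison.
function runSpecCases<I, O>(
  cases: SpecCase<I, O>[],
  transform: (input: I) => O,
  requiredBranches: string[],
): { failures: SpecCase<I, O>[]; uncoveredBranches: string[] } {
  const failures = cases.filter((c) => transform(c.input) !== c.expected);
  const covered = new Set(cases.map((c) => c.branch));
  const uncoveredBranches = requiredBranches.filter((b) => !covered.has(b));
  return { failures, uncoveredBranches };
}

// Toy transform standing in for a real rule: lowercase a leading "V" before a digit.
const lowercaseVersionPrefix = (token: string): string => token.replace(/^V(?=\d)/, "v");

const specResult = runSpecCases(
  [
    { input: "V2", expected: "v2", branch: "uppercase-v" },
    { input: "v2", expected: "v2", branch: "already-lowercase" },
  ],
  lowercaseVersionPrefix,
  ["uppercase-v", "already-lowercase", "no-version-prefix"],
);
console.log(specResult.failures.length, specResult.uncoveredBranches);
```

In this toy run both cases pass, but the "no-version-prefix" branch is reported as uncovered, which is the kind of gap the checklist item is meant to catch before a spec is handed to the pipeline owner.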