Testing strategy for the TS→pipeline migration
Drafted 2026-04-27. The motivation is the 2026-04-27 review session: subagent audits caught two regressions the parity harness missed (a 22% category change and a coding: "Reasoning" mistake based on a substring fallacy). Both required full-production-cache analysis to surface. Subagent audits are not a sustainable workflow.
Design principle: separate code drift from upstream drift
Upstream data (the published evaleval/card_backend HF dataset) is our best guess at a source of truth, but it isn't immutable. Pipeline-side relabeling, registry updates, and schema changes happen. If our regression tests run against live data, every upstream update lights up the test suite and we can't tell "I broke something" from "upstream changed something I happen to consume."
The fix: pin tests to a committed snapshot of upstream data. Refresh the snapshot deliberately (script + commit), and the snapshot diff + test diff arrive together for review. Live-data drift detection is a separate, opt-in concern.
live cache ──[refresh script: manual, reviewed]──▶ pinned fixtures (committed)
tests ──[run against]──▶ pinned fixtures (committed)
Live cache drift is checked by an opt-in audit, not by the test suite.
The three tiers
Tier A: Pipeline contract tests
Catches: "pipeline upstream silently dropped a field we depend on." Three repeated manual checks (source_metadata, category, hierarchy keys) motivated automating this.
Mechanic: vitest file that walks every fixture file and asserts presence/shape of fields the TS code depends on. Each contract is a field-level invariant.
File: tests/pipeline-contract.test.ts
Initial contract set (every one corresponds to a real failure mode):
- every model_result has source_metadata (we deleted the synthesis fallback assuming this)
- every model_result.source_metadata has evaluator_relationship in {first_party, third_party, other}
- every eval-detail has category as a non-empty string
- every eval-detail has eval_summary_id, benchmark, benchmark_leaf_name
- every model card has model_family_id matching pipelineSlugify(model_family_id)
- every hierarchy_by_category key is one of the 9 known pipeline categories
- every BenchmarkEvaluation produced by flattenModelEvaluations has source_metadata (cross-check: contract + adapter together)
- every model card has total_evaluations as a number
- every model_result.retrieved_timestamp parses as a valid Date
Exit criteria: all contracts pass against pinned fixtures. Each contract should fail loudly with the offending file path + key path when violated.
Acceptance: runs in pnpm test. Takes <2s. Adding a new contract is 5 lines.
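As an illustration of what "fail loudly with the offending file path + key path" could look like, here is a minimal, framework-free sketch of two of the contracts above. The function names, the `Violation` shape, and the assumption that rows live under a top-level `model_results` array are all hypothetical; only the field names come from the contract list.

```typescript
// Hypothetical helper types and names; only the field names come from the contract list.
type Violation = { file: string; keyPath: string; message: string };

const KNOWN_RELATIONSHIPS = new Set(["first_party", "third_party", "other"]);

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Assumption: a fixture keeps its rows under a top-level `model_results` array.
function checkModelResults(file: string, data: unknown): Violation[] {
  const violations: Violation[] = [];
  const raw = isRecord(data) ? data["model_results"] : undefined;
  const results: unknown[] = Array.isArray(raw) ? raw : [];

  results.forEach((result, index) => {
    const base = `model_results[${index}]`;
    const row: Record<string, unknown> = isRecord(result) ? result : {};

    // Contract: source_metadata exists and carries a known evaluator_relationship.
    const meta = row["source_metadata"];
    if (!isRecord(meta)) {
      violations.push({ file, keyPath: `${base}.source_metadata`, message: "missing" });
    } else {
      const relationship = meta["evaluator_relationship"];
      if (typeof relationship !== "string" || !KNOWN_RELATIONSHIPS.has(relationship)) {
        violations.push({
          file,
          keyPath: `${base}.source_metadata.evaluator_relationship`,
          message: `unexpected value ${JSON.stringify(relationship)}`,
        });
      }
    }

    // Contract: retrieved_timestamp parses as a valid Date.
    const timestamp = row["retrieved_timestamp"];
    if (typeof timestamp !== "string" || Number.isNaN(new Date(timestamp).getTime())) {
      violations.push({ file, keyPath: `${base}.retrieved_timestamp`, message: "not a valid date" });
    }
  });
  return violations;
}

// One line per violation: file path, key path, reason.
function formatViolation(v: Violation): string {
  return `${v.file} :: ${v.keyPath} :: ${v.message}`;
}
```

In a vitest file, each contract could collect violations across all fixtures and assert that the formatted list is empty, so a failure prints every offending path at once.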
Tier B: Adapter snapshot tests
Catches: "I changed TS code and didn't realize it changes the output for some input shape." This is the bulk of regression-detection.
Mechanic: vitest snapshot tests. Each adapter × each fixture → one snapshot. Regenerate via vitest --update (short form -u) when changes are intentional; review the snapshot diff alongside the code diff.
Files:
- tests/adapters/hf-eval-detail-to-summary.test.ts
- tests/adapters/hf-model-card-to-evaluation-card-data.test.ts
- tests/adapters/flatten-model-evaluations.test.ts
- tests/adapters/hf-developer-detail-to-summary.test.ts
- tests/adapters/hf-eval-entry-to-list-item.test.ts
- tests/adapters/build-benchmark-leaderboard-matrix.test.ts
- tests/adapters/build-single-metric-suite-matrix-summary.test.ts
- tests/adapters/aggregate-benchmark-summaries.test.ts
Snapshot format: tests/__snapshots__/<test>.snap (vitest default). Commit them.
Acceptance: pnpm test runs all snapshots, reports any diff, and exits non-zero on diff. Adding a new fixture is one line of test.each.
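The "each adapter × each fixture" enumeration can be sketched without the test framework. The snippet below is an assumption-laden illustration: `AdapterSpec`, `buildSnapshotCases`, and the idea of matching adapters to fixtures by directory are hypothetical, the adapter bodies are placeholders, and `hfEvalDetailToSummary` is a name inferred from the test file name rather than confirmed.

```typescript
// Hypothetical: each adapter declares which fixture directory it consumes.
type FixtureGroup = "evals" | "models" | "developers";

type AdapterSpec = {
  name: string;
  group: FixtureGroup;
  run: (input: unknown) => unknown;
};

type SnapshotCase = { title: string; spec: AdapterSpec; fixturePath: string };

// Pair every adapter with every fixture of its group, in a stable order,
// so snapshot names do not depend on directory listing order.
function buildSnapshotCases(specs: AdapterSpec[], fixturePaths: string[]): SnapshotCase[] {
  const sorted = [...fixturePaths].sort();
  const cases: SnapshotCase[] = [];
  for (const spec of specs) {
    for (const fixturePath of sorted) {
      if (fixturePath.startsWith(`${spec.group}/`)) {
        cases.push({ title: `${spec.name} <- ${fixturePath}`, spec, fixturePath });
      }
    }
  }
  return cases;
}

// Placeholder adapters standing in for the real ones.
const exampleSpecs: AdapterSpec[] = [
  { name: "hfEvalDetailToSummary", group: "evals", run: (input) => input },
  { name: "flattenModelEvaluations", group: "models", run: (input) => input },
];

const exampleCases = buildSnapshotCases(exampleSpecs, [
  "models/openai__gpt-5.json",
  "evals/helm_safety.json",
  "evals/apex_v1.json",
]);
console.log(exampleCases.map((c) => c.title));
```

In the real test files, a list like this could be handed to vitest's test.each, with each case loading its fixture, running the adapter, and calling toMatchSnapshot() on the result.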
Tier C: Full-cache differential audit
Catches: "what is the full impact of my code change across all 5,830 production models?" Used for big migration items where snapshot fixtures can't enumerate every shape.
Mechanic: a Node script that runs all adapters against either pinned fixtures or the live cache, produces a deterministic JSON digest (per-output hash + value distributions + invariant violation counts), and supports diff mode.
File: scripts/audit-adapters.mjs
Output digest shape:
{
  "version": 1,
  "source": ".cache/hf-data",
  "generated_at": "2026-04-27T22:00:00Z",
  "adapters": {
    "hfModelCardToEvaluationCardData": {
      "outputs_count": 5830,
      "outputs_hash": "sha256:...", // hash of all outputs concatenated
      "field_distributions": {
        "developer": { "OpenAI": 12, "Anthropic": 8, ... },
        "categories.length": { "1": 100, "2": 2000, "3": 3000, ... },
        "evaluator_count": { "0": 200, "1": 1500, ... }
      }
    },
    "flattenModelEvaluations": {
      "outputs_count": 86183,
      "outputs_hash": "sha256:...",
      "invariant_violations": []
    }
  }
}
Modes:
- node scripts/audit-adapters.mjs --output baseline.json → write digest
- node scripts/audit-adapters.mjs --output candidate.json → write digest after change
- node scripts/audit-adapters.mjs --diff baseline.json candidate.json → human-readable diff
- node scripts/audit-adapters.mjs --against tests/fixtures → use pinned set instead of live cache
- node scripts/audit-adapters.mjs --live --against .cache/hf-data → drift check against live data
Acceptance: runs in <30s against full live cache. Diff mode highlights field-distribution shifts, output-hash changes, and new invariant violations with sample paths.
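To make the digest and diff mechanics concrete, here is a minimal sketch of how a deterministic per-adapter digest and a diff between two digests might be computed. It is not the actual scripts/audit-adapters.mjs: `canonicalJson`, `digestAdapter`, and `diffDigests` are hypothetical names, invariant counts are omitted, and sorting object keys before hashing is an assumption about what "deterministic" should mean here.

```typescript
import { createHash } from "node:crypto";

type AdapterDigest = {
  outputs_count: number;
  outputs_hash: string;
  field_distributions: Record<string, Record<string, number>>;
};

// Serialize with sorted object keys so the hash does not depend on key order.
function canonicalJson(value: unknown): string {
  if (value === undefined) return "null";
  if (Array.isArray(value)) return `[${value.map((v) => canonicalJson(v)).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalJson(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Hash all outputs in order and count the values seen for each tracked field.
function digestAdapter(outputs: Record<string, unknown>[], fields: string[]): AdapterDigest {
  const hash = createHash("sha256");
  const field_distributions: Record<string, Record<string, number>> = {};
  for (const output of outputs) {
    hash.update(canonicalJson(output));
    hash.update("\n");
    for (const field of fields) {
      const dist = (field_distributions[field] ??= {});
      const key = String(output[field]);
      dist[key] = (dist[key] ?? 0) + 1;
    }
  }
  return {
    outputs_count: outputs.length,
    outputs_hash: `sha256:${hash.digest("hex")}`,
    field_distributions,
  };
}

// Report hash changes, count changes, and per-value distribution shifts.
function diffDigests(before: AdapterDigest, after: AdapterDigest): string[] {
  const lines: string[] = [];
  if (before.outputs_hash !== after.outputs_hash) lines.push("outputs_hash changed");
  if (before.outputs_count !== after.outputs_count) {
    lines.push(`outputs_count: ${before.outputs_count} -> ${after.outputs_count}`);
  }
  const fields = Array.from(
    new Set([...Object.keys(before.field_distributions), ...Object.keys(after.field_distributions)]),
  ).sort();
  for (const field of fields) {
    const b = before.field_distributions[field] ?? {};
    const a = after.field_distributions[field] ?? {};
    const keys = Array.from(new Set([...Object.keys(b), ...Object.keys(a)])).sort();
    for (const key of keys) {
      const from = b[key] ?? 0;
      const to = a[key] ?? 0;
      if (from !== to) lines.push(`${field}[${key}]: ${from} -> ${to}`);
    }
  }
  return lines;
}
```

A reclassification such as category "other" to "safety" on one output would show up as a hash change plus two distribution lines, which is the kind of shift the parity harness missed.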
Fixture management
Source
Fixtures are pinned copies of files in .cache/hf-data/ at a moment in time. They are committed JSON. Reviewers can see them in PR diffs.
Layout
tests/fixtures/
  manifest.json                        ← list of fixture IDs + source-cache snapshot ts
  evals/
    helm_classic_truthfulqa.json
    helm_safety.json
    apex_v1.json                       ← first-party (Mercor)
    artificial_analysis_*_aime.json    ← third-party (AA)
    helm_capabilities.json             ← composite
    helm_lite_narrativeqa.json         ← subtask
    rewardbench2_chat.json             ← coding key in hierarchy
    ...
  models/
    openai__gpt-5.json                 ← multiple variants
    anthropic__claude-opus-4-5.json    ← typical
    google__gemini-3-flash.json        ← already in the parity test
    ...
  developers/
    openai.json
    anthropic.json
    ...
Curation criteria
Every fixture earns its place by exercising a specific code path. Avoid random sampling.
Required edge cases:
- A model with multiple variants (openai__gpt-5)
- A model with subtask hierarchy (helm_lite, helm_classic)
- A first-party eval (Mercor ACE/APEX)
- A third-party eval (Artificial Analysis)
- A composite eval (helm_capabilities)
- A matrix eval id pattern (synthetic, but the adapter handles it)
- An eval with category: "other" (most of the corpus)
- An eval that the regex inferCategoryFromBenchmark and the pipeline category disagree on (truthfulqa, helm_safety)
- A model with setup-alias merging (multiple "prompt"/"fc" variants of same release)
- An ABC-only benchmark (if any are exposed in eval-list)
- An aggregate eval URL pattern (aggregate__<suite>)
Aim for ~25-35 fixtures total. Small enough to review, broad enough to catch the patterns we know about.
Refresh workflow
pnpm refresh-fixtures # copies tests/fixtures/manifest.json IDs
# from .cache/hf-data/ into tests/fixtures/
# bumps manifest.json snapshot_ts
git diff tests/fixtures/ # review what upstream changed
pnpm test # snapshot tests will probably diff
pnpm test -- -u # update snapshots if intentional
git diff tests/__snapshots__/ # review what adapter outputs changed
git add ... # commit fixtures + snapshots together
The diff in tests/fixtures/ shows raw upstream changes. The diff in tests/__snapshots__/ shows what changes when you feed the new data through the adapters. Both belong in the same commit.
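The core of the refresh step (copy the manifest's IDs out of the cache, bump snapshot_ts) could look roughly like the sketch below. The manifest shape is an assumption: the layout only says it holds fixture IDs and a snapshot timestamp, so the `fixtures` field name, `refreshFixtures`, and the injected `readFromCache` reader are hypothetical. File I/O is left to the caller so the logic stays testable.

```typescript
// Assumed manifest shape; only `snapshot_ts` is named in the refresh workflow.
type Manifest = { snapshot_ts: string; fixtures: string[] };

type RefreshResult = {
  manifest: Manifest;            // manifest with the bumped timestamp
  files: Record<string, string>; // fixture ID -> content copied from the cache
  missing: string[];             // IDs listed in the manifest but absent from the cache
};

// Re-pin every fixture listed in the manifest from the live cache.
// `readFromCache` returns undefined when the cache no longer has that file.
function refreshFixtures(
  manifest: Manifest,
  readFromCache: (id: string) => string | undefined,
  now: Date,
): RefreshResult {
  const files: Record<string, string> = {};
  const missing: string[] = [];
  for (const id of manifest.fixtures) {
    const content = readFromCache(id);
    if (content === undefined) {
      missing.push(id);
    } else {
      files[id] = content;
    }
  }
  return {
    manifest: { ...manifest, snapshot_ts: now.toISOString() },
    files,
    missing,
  };
}
```

Reporting missing IDs rather than silently dropping them keeps a vanished upstream file visible in review, in line with the rule that upstream changes are always reviewed.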
Refresh cadence
Manual, on demand. Recommended triggers:
- Before starting a new migration item (to work against current upstream)
- After observing a discrepancy between live cache and pinned fixtures
- Periodically (~monthly) to keep fixtures from drifting
There is no auto-refresh. The whole point is that upstream changes are reviewed.
Live-data drift detection
Separate from regression tests. A vitest file tests/upstream-drift.test.ts runs Tier-A contracts against the LIVE cache and reports violations. Run it manually (pnpm test:drift); not part of pnpm test. If contracts fail there but pass on fixtures, upstream has drifted and someone should refresh fixtures + investigate.
How upstream changes propagate
Three independent data layers, each updated by a different command:
huggingface.co/datasets/evaleval/card_backend   ← truth (changes when pipeline publishes)
        │  pnpm cache-hf-data                   ← user-triggered download
        ▼
.cache/hf-data/                                 ← live local cache (mutable)
        │  pnpm refresh-fixtures                ← user-triggered re-pin
        ▼
tests/fixtures/                                 ← committed pinned snapshots
        │  pnpm test (adapter outputs)
        ▼
tests/__snapshots__/                            ← committed expected outputs
Default pnpm test only sees the pinned bottom two layers, so upstream churn never flaps the regression suite by accident. Each upstream change is observed deliberately by re-pinning and reviewing the diff.
Scenario matrix: what each layer reports
| What changed upstream | pnpm test | pnpm test:drift (live cache contracts) | pnpm refresh-fixtures && pnpm test (snapshot diff) | pnpm audit-adapters --diff baseline.json candidate.json |
|---|---|---|---|---|
| Pure data refresh, no shape change | pass | pass | diff: snapshots change (timestamps, scores) | hash flips for affected adapters |
| Additive (new field that no adapter consumes) | pass | pass | pass (raw fixture diff visible, snapshots stable) | distributions stable |
| New enum value (e.g. evaluator_relationship: "fourth_party") | pass | fail: unknown-value contract | pass unless consumed | distribution gains a key |
| Drops a required field (e.g. source_metadata) | pass | fail: contract violation with N/M count | fail: contracts now fail on pinned data too | throws count rises |
| Reclassifies an existing value (e.g. category: "other" → "safety") | pass | pass (still a known string) | diff: snapshots diff for that fixture | hash flips |
| Renames a field | pass | varies | fail: snapshot diff + likely contract failure | hash + throws change |
| Rewrites the schema (breaking) | pass | fail: multiple contracts | fail: contracts + snapshots both fail | many hash flips |
The "pass" in pnpm test for every row is intentional: by design, default tests only fail when our code drifts from a pinned baseline. Upstream drift is reported by the opt-in pnpm test:drift and by the snapshot diff that lands the moment fixtures are re-pinned.
Drift-triage decision tree
A pnpm test:drift failure means live cache no longer satisfies a contract our deletions assumed. Three possibilities:
- Pipeline regressed (e.g. dropped source_metadata on some rows) → coordinate with the pipeline owner to restore. Don't refresh fixtures yet; the regression would propagate into our pinned set. The runtime assertSourceMetadata guards (lib/hf-data.ts, lib/model-data.ts) would also start firing in production, providing a second signal.
- Pipeline emitted a new value our enum doesn't recognise (e.g. new evaluator_relationship) → extend the corresponding KNOWN_* set in tests/upstream-drift.test.ts and tests/pipeline-contract.test.ts AND any consumer code that branches on the old set.
- Pipeline made a schema-level change → review the upstream commit log (git -C ../eval_cards_backend_pipeline log) for context, decide if our consumer needs updates, then refresh fixtures.
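The first two branches of the triage above can be told apart mechanically if the drift contract reports an N/M count and the set of unknown values it saw. The sketch below is one hedged way to do that; `KNOWN_EVALUATOR_RELATIONSHIPS` is a hypothetical instance of the KNOWN_* pattern, and `summarizeDrift` and `triage` are illustrative names, not existing helpers. A schema-level change cannot be recognised this way and still needs a human reading the upstream log.

```typescript
// Hypothetical instance of the KNOWN_* sets shared by the contract and drift tests.
const KNOWN_EVALUATOR_RELATIONSHIPS: ReadonlySet<string> = new Set([
  "first_party",
  "third_party",
  "other",
]);

type SourceRow = { source_metadata?: { evaluator_relationship?: string } };

type DriftSummary = {
  total: number;                         // M: rows inspected
  missing: number;                       // N: rows without the field at all
  unknownValues: Record<string, number>; // unrecognised value -> row count
};

function summarizeDrift(rows: SourceRow[]): DriftSummary {
  const summary: DriftSummary = { total: rows.length, missing: 0, unknownValues: {} };
  for (const row of rows) {
    const value = row.source_metadata?.evaluator_relationship;
    if (value === undefined) {
      summary.missing += 1;
    } else if (!KNOWN_EVALUATOR_RELATIONSHIPS.has(value)) {
      summary.unknownValues[value] = (summary.unknownValues[value] ?? 0) + 1;
    }
  }
  return summary;
}

type Triage = "ok" | "pipeline-regressed" | "new-enum-value";

// Missing fields point at a pipeline regression; unknown values point at an enum extension.
function triage(summary: DriftSummary): Triage {
  if (summary.missing > 0) return "pipeline-regressed";
  if (Object.keys(summary.unknownValues).length > 0) return "new-enum-value";
  return "ok";
}

function describeDrift(summary: DriftSummary): string {
  const unknown = Object.keys(summary.unknownValues).sort().join(", ") || "none";
  return `${summary.missing}/${summary.total} rows missing evaluator_relationship; unknown values: ${unknown}`;
}
```

Checking for missing fields first mirrors the advice above: a regression should block the fixture refresh, while a new enum value only requires extending the known set and its consumers.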
A snapshot diff after pnpm refresh-fixtures always means some output changed. Read the fixture diff and snapshot diff side-by-side:
- Fixture diff explains what upstream changed (raw data shift)
- Snapshot diff explains how the adapter projected that change into user-visible output
- Together → review and decide if the new output is correct (pnpm test -- -u) or a regression to fix
Known gaps in drift coverage
- Stale .cache/hf-data/: pnpm test:drift runs against whatever is on disk; it doesn't auto-refresh from huggingface.co. If pnpm cache-hf-data hasn't been run recently, "drift" reports stale-cache-vs-fixtures, not upstream-vs-fixtures. Fix: run pnpm cache-hf-data before pnpm test:drift when you care about true upstream.
- Hand-edited fixtures aren't detected: nothing checks that tests/fixtures/X.json matches what pnpm refresh-fixtures would produce. If someone edits a fixture for debugging and forgets to restore, tests stay green against the mutation. Mitigation would be a content-hash entry per fixture in manifest.json; defer until it's actually a problem.
- Drift covers Tier A invariants only, not Tier B snapshots: a value-reclassification (Scenario "reclassifies an existing value" above) is invisible to drift. Detection requires pnpm refresh-fixtures (snapshot diff) or pnpm audit-adapters --live --diff against an older baseline. This is by design: running snapshots against live data would flap on every refresh.
- pnpm test:drift is opt-in, not scheduled: nobody runs it unless prompted. A CI nightly cron (or pnpm test:drift in a weekly task) would catch upstream contract breaks earlier; currently you discover them only when you next run drift.
- Audit script doesn't check Tier A contracts: if a row violates a contract, the audit reports it indirectly via increased throws count (the runtime guards fire) but you'd need pnpm test:drift for the exact contract message and per-row locator.
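If the hand-edited-fixture gap ever becomes a real problem, the content-hash mitigation mentioned above could be as small as the sketch below. Everything here is hypothetical: the `FixtureHashes` map is an assumed extension of manifest.json, and `recordHashes` and `findEditedFixtures` are illustrative names.

```typescript
import { createHash } from "node:crypto";

// Assumed manifest extension: fixture ID -> sha256 of the content at refresh time.
type FixtureHashes = Record<string, string>;

function sha256Hex(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// Called by the refresh script: record a hash for every fixture it wrote.
function recordHashes(files: Record<string, string>): FixtureHashes {
  const hashes: FixtureHashes = {};
  for (const id of Object.keys(files).sort()) {
    hashes[id] = sha256Hex(files[id] ?? "");
  }
  return hashes;
}

// Called by a test: list fixtures whose on-disk content no longer matches
// the recorded hash, including fixtures that have disappeared.
function findEditedFixtures(recorded: FixtureHashes, current: Record<string, string>): string[] {
  return Object.keys(recorded)
    .sort()
    .filter((id) => {
      const content = current[id];
      return content === undefined || sha256Hex(content) !== recorded[id];
    });
}
```

A contract-style test could then assert that findEditedFixtures returns an empty list, which would turn a forgotten debugging edit into a red test instead of a silent green one.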
Build order
Tier A first (smallest, foundational). Tier B next (replaces subagent audits for normal regression detection). Tier C last (heaviest tooling).
Each tier is independently usable, so they can be built in parallel by different agents:
| Tier | Estimated effort | Depends on | Parallelizable? |
|---|---|---|---|
| A: contract tests | 1-2h | nothing | yes |
| B: snapshot tests | 2-3h | fixture set (shared) | mostly |
| C: audit script | 2-3h | nothing | yes |
| Fixture set (~25 files) | 1h | curation decisions | shared dep |
Recommended: build the fixture set + Tier A in series (one agent), Tier B and Tier C in parallel after fixtures are in.
Test-additions deferred to specific migration items
The original Tier B plan listed 8 adapters; 4 are built. The remaining 4 (hfEvalEntryToListItem, aggregateBenchmarkSummaries, buildSingleMetricSuiteMatrixSummary, createModelFamilySummary) are deferred to the migration items that touch them; adding fixtures + snapshots speculatively now would be testing-for-testing's-sake. Specifically:
- hfEvalEntryToListItem snapshot: add when starting #1 (identity parsing) or #2 (setup-alias). Needs an eval_list_entries fixture group extracted from .cache/hf-data/eval-list.json. Cover at least: a typical entry, one with display_name starting with "accuracy on " (triggers prefersBenchmarkName), one with display_name containing "for scorer", one with a missing display_name.
- Setup-alias collision fixture: add when starting #2. Pick a model with additional_details.mode ∈ {"prompt", "fc", "thinking"} appearing across multiple submissions for the same model_id. The openai__gpt-5.2 model card has thinking variants; find a corresponding model detail file.
- aggregate__<suite> pattern: add when starting #5 (composites) or #6 (matrix synthesis). The aggregate URL pattern is synthetic, not on disk; the test would call aggregateBenchmarkSummaries directly with a curated input set. Defer until that adapter is actually being touched.
- createModelFamilySummary snapshot: add when starting #3. The flatten + family-summary chain is what getModelSummaryById returns; snapshotting createModelFamilySummary(flattenModelEvaluations(model)) locks the full surface before the refactor.
Reshape-class items: testing addendum (added 2026-04-28)
The Tier B snapshot framework above assumes the migration target is "pipeline emits the value, TS reads it." That works for cleaning-class items. For reshape-class items (#3 hierarchy flatten, #5 composite rollup, #6 matrix synthesis, #14 score summary stats, #16 per-category counts; plus the reshape halves of #2 and #13), the migration target is different: pipeline emits relational rows, DuckDB SQL does the dedup/groupby/aggregate. See notes/migration-plan.md § "Data direction" for framing.
This shifts what the test set has to verify:
- Tier A contracts gain a parquet schema dimension. Today's contracts assert JSON field invariants on .cache/hf-data/**. When the parquet schema goes more relational (e.g. one row per (eval_summary_id, variant_key, retrieved_timestamp) for the variant dedup case), Tier A grows a parallel set of contracts asserting the new typed columns are present and well-typed. File: tests/parquet-contract.test.ts (new, parallel to tests/pipeline-contract.test.ts).
- Tier B snapshots become parity gates, not destinations. Today, tests/adapters/flatten-model-evaluations.test.ts snapshots the TS reshape output. Once SQL replaces the TS, the same snapshot becomes a TS-vs-SQL parity assertion: run both, diff. The snapshot is committed; the SQL output is computed at test time; equality is the gate. Reshape-class snapshots stay green during the migration exactly because they assert behavior preservation, not implementation. Don't delete them on TS removal; convert them.
- Tier C audit script grows a backend dimension. scripts/audit-adapters.mjs currently runs adapters against the live cache. Add --backend duckdb so the same adapter contract is exercised against the DuckDB read path, producing a digest that diffs against the JSON-backend digest. This is the full-corpus generalization of scripts/compare-data-backends.mjs, but at the adapter-output level rather than the HTTP-endpoint level.
- Five of the eight planned Tier B adapters are reshape-class: flattenModelEvaluations, buildBenchmarkLeaderboardMatrix, buildSingleMetricSuiteMatrixSummary, aggregateBenchmarkSummaries, createModelFamilySummary. Their snapshots are the contract the SQL replacement must match. Build them when migrating each item; the snapshots gate the deletion.
What this doesn't change: cleaning-class items (the 12 that aren't reshape) work exactly as the existing framework describes: refresh fixtures → snapshot diff → review → ship. No structural test changes needed for cleaning items.
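For the reshape-class parity gates, "run both, diff" needs a comparison that tolerates differences a SQL backend is likely to introduce, such as row order and column order. The sketch below shows one possible comparison under that assumption; `parityDiff` and the flat `Row` shape are hypothetical, and if row order is part of the adapter's contract a plain ordered comparison would be the right gate instead.

```typescript
// Hypothetical flat row shape, as a relational backend would return it.
type Row = Record<string, string | number | boolean | null>;

// Stable string form of a row: keys sorted so column order does not matter.
function canonicalRow(row: Row): string {
  return JSON.stringify(
    Object.keys(row)
      .sort()
      .map((key) => [key, row[key]]),
  );
}

// Multiset comparison: rows are matched one-to-one regardless of order,
// so duplicates on one side that are missing on the other are still reported.
function parityDiff(tsRows: Row[], sqlRows: Row[]): { onlyInTs: string[]; onlyInSql: string[] } {
  const counts = new Map<string, number>();
  for (const row of tsRows) {
    const key = canonicalRow(row);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  const onlyInSql: string[] = [];
  for (const row of sqlRows) {
    const key = canonicalRow(row);
    const remaining = counts.get(key) ?? 0;
    if (remaining > 0) {
      counts.set(key, remaining - 1);
    } else {
      onlyInSql.push(key);
    }
  }
  const onlyInTs: string[] = [];
  counts.forEach((remaining, key) => {
    for (let i = 0; i < remaining; i++) onlyInTs.push(key);
  });
  return { onlyInTs: onlyInTs.sort(), onlyInSql: onlyInSql.sort() };
}
```

The gate passes when both lists are empty; on failure, the canonical row strings give a reviewer the exact rows that one implementation produced and the other did not.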
What this DOESN'T cover
- End-to-end UI tests. No clicking through pages. Adapter snapshots are a proxy.
- Performance regression. No timing assertions.
- Pipeline-side correctness. Pipeline has its own tests in the sibling repo. Our contracts assert what we consume, not what's correct upstream.
- The DuckDB shadow read. That's covered by the existing scripts/compare-data-backends.mjs parity harness, at the HTTP-endpoint level. The adapter-level parity for reshape items (TS reshape output vs SQL reshape output) is the addendum above.
Workflows
Migration workflow (TS deletion against current upstream)
Use this for items #1, #2, #3 and any pipeline-side change that flows back into deletions in this repo.
# 1. Sync to current upstream so the work is against fresh data
pnpm cache-hf-data
pnpm test:drift # does upstream still satisfy our contracts?
# if no → triage per "Drift-triage decision tree" first
# 2. Re-pin fixtures to current upstream
pnpm refresh-fixtures
pnpm test # any pre-deletion snapshot diffs?
# if yes → review, then `pnpm test -- -u`, separate commit
# so the pin-update is isolated from the deletion
# 3. Capture a full-cache baseline so we can diff the impact of the change
pnpm audit-adapters --output /tmp/baseline.json --live
# 4. Make the deletion (or refactor)
# 5. Verify
pnpm test # snapshots flag any unexpected output change
pnpm audit-adapters --output /tmp/candidate.json --live
pnpm audit-adapters --diff /tmp/baseline.json /tmp/candidate.json # full-cache impact
pnpm compare-data-backends --json-base http://localhost:3001 --duckdb-base http://localhost:3002
# 6. Review snapshot diff alongside code diff
# - intentional behaviour change: `pnpm test -- -u`, document the why in the commit
# - unintentional: fix the code
# 7. Ship
Each step covers a distinct failure mode; nothing duplicates. Steps 3, 5b, 5c (the audit captures) are skippable for tiny changes; start with pnpm test alone and escalate if you want fuller coverage.
Light-touch workflow (small change, no upstream sync needed)
pnpm test # baseline green
# make the change
pnpm test # snapshots flag any output change
# review snapshot diff, `pnpm test -- -u` if intentional
pnpm compare-data-backends ...
Drift-only workflow (you suspect upstream changed)
pnpm cache-hf-data # ensure local cache is current
pnpm test:drift # 5 contracts against full live cache
# if green: upstream still satisfies our deletions' assumptions
# if red: triage per "Drift-triage decision tree"
Cross-repo workflow (pipeline-side change first, TS deletion later)
# In ../eval_cards_backend_pipeline
uv run --with huggingface_hub --no-project python -m scripts.pipeline --dry-run \
-e EXPORT_EXPERIMENTAL_PARQUET=1
# verify output/ has the new field
# Back in this repo
pnpm cache-hf-data # picks up the new published artifact
pnpm test:drift # do we now have a NEW contract we want to assert?
# if yes: extend tests/pipeline-contract.test.ts + drift
pnpm refresh-fixtures
pnpm test # snapshots reflect the new field if any adapter consumes it
# now eligible to delete the TS code that the pipeline emission obviates