Per-instance JSONL normalization
Drafted 2026-04-28. Migration item #7 in notes/migration-plan.md.
Framing reminder
We are refactoring for UI efficiency. TS-as-is is the canonical spec for behaviour. This spec covers a per-record value transform that extracts UI-friendly strings (input/response/correctness/etc.) from rich per-sample objects emitted by the pipeline. An audit of the full corpus on 2026-04-28 found that the parser's many fallback branches mostly do not fire, because the pipeline emits a single canonical shape; the parser nonetheless preserves them as defensive scaffolding for older or different harness output formats.
Architecture choice (cleaning-only, no SQL involvement): the inline instance_examples preview (≤5 samples per result) gets normalized in the pipeline; the per-result source_url pointing to a full JSONL dump (typically 50 samples) stays as an on-demand UI fetch, NOT pre-ingested into a parquet table. This avoids speculative pipeline ingestion work and matches the actual product need (lazy-load all samples on user click, not cross-corpus sample querying). The orphaned fetchInstanceLevelData (lib/hf-data.ts:890-917) gets re-wired to a future "show all samples" UI feature rather than deleted.
Rule (as TS implements it today)
parseInstanceLevelData (lib/hf-data.ts:933-1043)
Pure function: takes a JSON object, walks instance_examples[], returns SampleResult[].
Top-level guard:
- If `data` is null/non-object → return `[]`
- If `data.instance_examples` is array → use it
- Else if `data` itself is array → use it
- Else → `[]`
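The guard above can be sketched as a small standalone function. This is a minimal illustration, not the code in lib/hf-data.ts; the name `selectExamples` is hypothetical.

```typescript
// Hypothetical helper name; in the source this logic sits at the top of
// parseInstanceLevelData.
function selectExamples(data: unknown): unknown[] {
  // null and primitives carry no examples
  if (data === null || typeof data !== "object") return [];
  // Preferred shape: { instance_examples: [...] }
  const wrapped = (data as { instance_examples?: unknown }).instance_examples;
  if (Array.isArray(wrapped)) return wrapped;
  // Fallback: the payload is itself the array of examples
  if (Array.isArray(data)) return data;
  return [];
}
```

Note that arrays pass the `typeof data === "object"` check, which is why the bare-array branch is reachable.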
Per-example field extraction (each is a fallback chain):
input (string): first non-empty wins:
1. `raw.input` is string → use as-is
2. `raw.input.raw` is set → `String(raw.input.raw)`
3. `raw.prompt` → use as-is
4. `raw.question` → use as-is
5. `raw.doc.question` → use as-is
6. `raw.doc` exists → `JSON.stringify(raw.doc).slice(0, 500)`
7. (none) → empty string
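A minimal sketch of the input chain follows. It assumes that "non-empty" means a string of length greater than zero and that "`raw.doc` exists" means a plain object; both are interpretations of the rule above, and `extractInput`, `isRecord`, and `str` are hypothetical names.

```typescript
type Raw = Record<string, unknown>;

const isRecord = (v: unknown): v is Raw =>
  typeof v === "object" && v !== null && !Array.isArray(v);

// Assumption: "non-empty" means a string of length > 0; anything else falls through.
const str = (v: unknown): string | undefined =>
  typeof v === "string" && v.length > 0 ? v : undefined;

// Hypothetical name; in the source this logic is inline in parseInstanceLevelData.
function extractInput(raw: Raw): string {
  const input = raw.input;
  const docValue = raw.doc;
  const doc = isRecord(docValue) ? docValue : undefined;
  return (
    str(input) ?? // 1. flat string
    (isRecord(input) && input.raw != null ? str(String(input.raw)) : undefined) ?? // 2. canonical shape
    str(raw.prompt) ?? // 3
    str(raw.question) ?? // 4
    str(doc?.question) ?? // 5
    (doc ? JSON.stringify(doc).slice(0, 500) : undefined) ?? // 6. truncated last resort
    "" // 7
  );
}
```

For the production shape, `extractInput({ input: { raw: "Q1" } })` takes branch 2 and returns `"Q1"`.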
ground_truth (string | undefined): first non-null wins:
1. `raw.input.reference` is array → `array.join(", ")`; else → `String(...)`
2. `raw.ground_truth` → `String(...)`
3. `raw.target` → `String(...)`
4. `raw.gold` → `String(...)`
5. `raw.doc.answer` → `String(...)`
6. (none) → `undefined`
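The ground_truth chain differs from the input chain in that it tests for non-null rather than non-empty, so values such as `0` or `false` still count. A sketch under that reading, with the hypothetical name `extractGroundTruth`:

```typescript
type Raw = Record<string, unknown>;

const isRecord = (v: unknown): v is Raw =>
  typeof v === "object" && v !== null && !Array.isArray(v);

// Hypothetical name; illustrates "first non-null wins".
function extractGroundTruth(raw: Raw): string | undefined {
  const input = raw.input;
  const doc = raw.doc;
  const reference = isRecord(input) ? input.reference : undefined;
  if (reference != null) {
    // 1. arrays are joined, everything else is coerced
    return Array.isArray(reference) ? reference.join(", ") : String(reference);
  }
  const candidates: unknown[] = [
    raw.ground_truth, // 2
    raw.target, // 3
    raw.gold, // 4
    isRecord(doc) ? doc.answer : undefined, // 5
  ];
  const hit = candidates.find((v) => v != null);
  return hit == null ? undefined : String(hit); // 6. undefined when nothing matched
}
```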
response (string): first non-empty wins:
1. `raw.output` is set → string as-is or `JSON.stringify(...)`
2. `raw.response` → use as-is
3. `raw.model_output` → use as-is
4. `raw.answer_attribution` is non-empty array → take last element's `extracted_value` (or empty string)
5. `raw.messages` is non-empty array → reverse + find last assistant message → string content or stringified
6. `raw.filtered_resps[0][0]` → use
7. `raw.resps[0][0]` → use
8. (none) → empty string
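The response chain is the only one with several live branches, so a sketch may help. It assumes that an empty result from one branch (for example a last attribution without `extracted_value`) falls through to the next, in line with "first non-empty wins"; the source does not spell this out. All function names here are hypothetical.

```typescript
type Raw = Record<string, unknown>;

const isRecord = (v: unknown): v is Raw =>
  typeof v === "object" && v !== null && !Array.isArray(v);

const str = (v: unknown): string | undefined =>
  typeof v === "string" && v.length > 0 ? v : undefined;

// Strings pass through; anything else is serialized.
const stringify = (v: unknown): string => (typeof v === "string" ? v : JSON.stringify(v));

function lastAttribution(v: unknown): string | undefined {
  if (!Array.isArray(v) || v.length === 0) return undefined;
  const last: unknown = v[v.length - 1];
  return isRecord(last) && last.extracted_value != null ? String(last.extracted_value) : "";
}

function lastAssistantMessage(v: unknown): string | undefined {
  if (!Array.isArray(v) || v.length === 0) return undefined;
  const copy: unknown[] = [...v]; // copy so the caller's array is not reversed in place
  const msg = copy.reverse().find((m: unknown) => isRecord(m) && m.role === "assistant");
  if (!isRecord(msg) || msg.content == null) return undefined;
  return stringify(msg.content);
}

// lm-eval-harness style nesting: resps[0][0]
function firstNested(v: unknown): string | undefined {
  if (!Array.isArray(v) || !Array.isArray(v[0])) return undefined;
  const first: unknown = v[0][0];
  return first == null ? undefined : stringify(first);
}

function extractResponse(raw: Raw): string {
  const candidates: Array<string | undefined> = [
    raw.output != null ? stringify(raw.output) : undefined, // 1
    str(raw.response), // 2
    str(raw.model_output), // 3
    lastAttribution(raw.answer_attribution), // 4
    lastAssistantMessage(raw.messages), // 5
    firstNested(raw.filtered_resps), // 6
    firstNested(raw.resps), // 7
  ];
  // 8. empty string when no branch produced a non-empty value
  return candidates.find((c) => c !== undefined && c.length > 0) ?? "";
}
```

The sketch evaluates every branch eagerly for readability; a production version would likely short-circuit.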
is_correct (boolean | undefined): first defined wins:
1. `raw.evaluation.is_correct` (boolean)
2. `raw.is_correct` (boolean)
3. `raw.metrics.exact_match === 1` → true; `=== 0` → false; else undefined
4. (none) → `undefined`
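A sketch of the is_correct chain, assuming that a non-boolean `is_correct` value is ignored rather than coerced (the "(boolean)" qualifier above suggests this, but it is an interpretation). `extractIsCorrect` is a hypothetical name.

```typescript
type Raw = Record<string, unknown>;

const isRecord = (v: unknown): v is Raw =>
  typeof v === "object" && v !== null && !Array.isArray(v);

function extractIsCorrect(raw: Raw): boolean | undefined {
  const evaluation = raw.evaluation;
  const nested = isRecord(evaluation) ? evaluation.is_correct : undefined;
  if (typeof nested === "boolean") return nested; // 1. canonical shape
  const flat = raw.is_correct;
  if (typeof flat === "boolean") return flat; // 2. flat shape
  const metrics = raw.metrics;
  const exact = isRecord(metrics) ? metrics.exact_match : undefined;
  if (exact === 1) return true; // 3. metric-based correctness
  if (exact === 0) return false;
  return undefined; // 4. partial scores and missing fields stay undefined
}
```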
metadata (object | undefined): merged from (in order):
1. `raw.evaluation` (if object)
2. `raw.performance` (if object)
3. `raw.metadata` (if object)
4. `raw.metrics` (if object)
5. If merged object is empty → `undefined`
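The merge can be sketched as below. It assumes a shallow merge in which later sources overwrite earlier keys on collision; the source states the order but not the collision rule, so treat that as an assumption. `mergeMetadata` is a hypothetical name.

```typescript
type Raw = Record<string, unknown>;

const isRecord = (v: unknown): v is Raw =>
  typeof v === "object" && v !== null && !Array.isArray(v);

function mergeMetadata(raw: Raw): Raw | undefined {
  const merged: Raw = {};
  for (const key of ["evaluation", "performance", "metadata", "metrics"]) {
    const part = raw[key];
    // Assumption: shallow merge, later sources win on key collisions.
    if (isRecord(part)) Object.assign(merged, part);
  }
  return Object.keys(merged).length > 0 ? merged : undefined;
}
```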
sample_id (string): first non-null wins:
1. `raw.sample_id` → as-is
2. `raw.doc_id` → as-is
3. `raw.id` → as-is
4. (none) → `String(arrayIndex)` (positional fallback)
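A short sketch of the sample_id chain. Unlike the rule above, which passes the first three branches through as-is, this version coerces with `String()` so the declared return type always holds; that coercion is an assumption of the sketch, and `extractSampleId` is a hypothetical name.

```typescript
type Raw = Record<string, unknown>;

function extractSampleId(raw: Raw, arrayIndex: number): string {
  const id = raw.sample_id ?? raw.doc_id ?? raw.id; // branches 1 to 3
  // Assumption: coerce to string; the source passes these through as-is.
  return id != null ? String(id) : String(arrayIndex); // 4. positional fallback
}
```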
choices (any | undefined): first non-null wins:
1. `raw.choices`
2. `raw.doc.choices`
3. (none) → `undefined`
If a row's first-pass map returns null (i.e. raw was null or non-object), it's filtered out via .filter(s => s !== null).
fetchInstanceLevelData (lib/hf-data.ts:890-917): currently orphaned
Takes (url, limit?) and fetches the URL via fetch(). Splits the response text on newlines and filters out empty lines. For each line up to limit (or all lines if no limit is given), tries JSON.parse and skips malformed lines. Wraps the parsed array as { instance_examples: parsed } and passes it to parseInstanceLevelData. Returns SampleResult[], or [] on any error.
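The line-handling part of the fetcher can be illustrated in isolation. The sketch below deliberately leaves out the `fetch()` call and the hand-off to `parseInstanceLevelData` so it stays pure; `parseJsonl` is a hypothetical name, and the limit is assumed to count lines before malformed ones are dropped.

```typescript
// Hypothetical pure helper covering only the text-to-records step.
function parseJsonl(text: string, limit?: number): { instance_examples: unknown[] } {
  const lines = text.split("\n").filter((line) => line.trim().length > 0);
  // A limit of 0 or undefined means "all lines"; larger limits are capped.
  const count = limit ? Math.min(limit, lines.length) : lines.length;
  const parsed: unknown[] = [];
  for (const line of lines.slice(0, count)) {
    try {
      parsed.push(JSON.parse(line));
    } catch {
      // Malformed line: skip it and keep going.
    }
  }
  return { instance_examples: parsed };
}
```

Under that assumption, a limit of 2 applied to a body whose second line is malformed yields one record, not two.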
Active call sites: zero. Verified by grep across app/, components/, scripts/, other lib/ files. Only mention is its own declaration. Git log shows one commit ("Refresh eval cards UI and backend data flow") in its history.
The function is intended for a "load more / show all samples" UI feature: the pipeline ships a source_url per row pointing to the full JSONL (typically 50 samples), but the inline instance_examples preview carries at most 5. The fetcher would let the UI request the full set on user demand. This UI feature has never shipped.
Classification
- Unconditional normalization. The function always runs on whatever shape is provided; never gates on a pre-existing canonical field. Pipeline-side fix: emit a canonical per-sample shape so the multi-field fallback chains become unnecessary.
- Cleaning → pipeline. Pure value transform per sample. No aggregation, no reshape, no cross-record operations. Migration target: pipeline emits canonical per-sample shape on both the inline preview AND the URL JSONL files; TS parser shrinks to direct field reads or deletes entirely.
- NOT reshape. Per the architecture choice in the framing note above, samples stay as on-demand fetches via `source_url`; no parquet `instance_samples` table, no SQL queries over samples. (If a future product feature wants cross-model sample search/filter/comparison, that's a separate reshape spec.)
Inputs and expected outputs
Group A: Pipeline-canonical shape (the only shape that fires in production today)
Input shape (from cache result.instance_level_data.instance_examples[i]):
{
"schema_version": "...",
"evaluation_id": "...",
"model_id": "...",
"evaluation_name": "...",
"sample_id": "...",
"sample_hash": "...", // sometimes present
"interaction_type": "multi_turn",
"input": { "raw": "..." }, // ALWAYS object with .raw in production
"output": "..." | { ... }, // sometimes
"messages": [{ role: "...", content: "..." }, ...], // typically present
"answer_attribution": [..., { "extracted_value": "..." }], // typically present
"evaluation": { "is_correct": true|false, ... }, // ALWAYS present
"performance": { ... },
"metadata": { ... },
"token_usage": { ... },
"error": null | "...",
"hierarchy": [...]
}
Expected output (SampleResult):
| Output field | Source path that fires | Notes |
|---|---|---|
| `sample_id` | `raw.sample_id` (100% of production) | always present |
| `input` | `raw.input.raw` (100%) | branch #2 in the chain |
| `ground_truth` | `raw.input.reference` (100%) | branch #1 |
| `response` | `raw.answer_attribution` (97.31%) OR `raw.messages` (2.49%) OR `raw.output` (0.20%) | branches #4, #5, #1 |
| `is_correct` | `raw.evaluation.is_correct` (100%) | branch #1 |
| `choices` | `undefined` (100%) | no branch fires; field is unset in production |
| `metadata` | merged from `raw.evaluation`, `raw.performance`, `raw.metadata`, `raw.metrics` (always at least 2 of 4 present) | merged object |
Group B: Defensive fallback branches (zero firing rate in current production)
| Branch | Output field | Production hits | Origin (presumed) |
|---|---|---|---|
| `raw.input` (string) | input | 0 | older harness shapes |
| `raw.prompt` | input | 0 | lm-eval-harness |
| `raw.question` | input | 0 | other harnesses |
| `raw.doc.question` | input | 0 | HELM-style |
| `raw.doc` (JSON.stringify) | input | 0 | last-resort |
| `raw.ground_truth` | ground_truth | 0 | older shapes |
| `raw.target` | ground_truth | 0 | classification benchmarks |
| `raw.gold` | ground_truth | 0 | older lm-eval |
| `raw.doc.answer` | ground_truth | 0 | HELM-style |
| `raw.response` | response | 0 | older shapes |
| `raw.model_output` | response | 0 | older shapes |
| `raw.filtered_resps[0][0]` | response | 0 | lm-eval-harness format |
| `raw.resps[0][0]` | response | 0 | lm-eval-harness format |
| `raw.is_correct` | is_correct | 0 | flat shape |
| `raw.metrics.exact_match` | is_correct | 0 | metric-based correctness |
| `raw.doc_id` | sample_id | 0 | HELM-style |
| `raw.id` | sample_id | 0 | generic |
| index fallback | sample_id | 0 | last-resort |
| `raw.choices` | choices | 0 | multiple-choice |
| `raw.doc.choices` | choices | 0 | HELM multiple-choice |
These branches exist for shapes the pipeline currently does not emit. Preserve verbatim until pipeline-side guarantees the canonical shape across all data sources.
Group C: fetchInstanceLevelData JSONL parsing edge cases
| Input | Behavior |
|---|---|
| URL returns `!res.ok` (404, 500, etc.) | returns `[]` (no throw) |
| URL throws (network error) | logs warning to console, returns `[]` |
| Empty body | splits to `[]`, returns `[]` |
| Body with empty lines | `.filter(line => line.trim())` strips them |
| Body with malformed JSON line | swallowed in inner try-catch; line skipped, processing continues |
| `limit=0` or `limit=undefined` | parses ALL lines |
| `limit > lines.length` | parses all lines (capped via `Math.min`) |
Current TS implementation
| Concern | Location | Notes |
|---|---|---|
| `parseInstanceLevelData` | lib/hf-data.ts:933-1043 | The parser; ~110 lines |
| `fetchInstanceLevelData` | lib/hf-data.ts:890-917 | URL fetcher; orphaned (zero callers) |
| Active call site | lib/hf-data.ts:1273 (inside `flattenHierarchyNode`) | `parseInstanceLevelData(result.instance_level_data)` → `inlineSamples` |
| Internal call site | lib/hf-data.ts:912 | inside `fetchInstanceLevelData` itself; hands the parsed lines to the parser |
| `SampleResult` type | lib/benchmark-schema.ts:135 | Output shape definition |
| Output field | lib/benchmark-schema.ts:33 | `BenchmarkEvaluation.detailed_evaluation_results_per_samples` |
Caller chain for parseInstanceLevelData
getModelSummaryById (lib/model-data.ts:1490+) → flattenModelEvaluations → flattenHierarchyNode (lib/hf-data.ts:1273) → parseInstanceLevelData(result.instance_level_data) → set as inlineSamples on each variant bucket → propagated to BenchmarkEvaluation.detailed_evaluation_results_per_samples.
UI consumers (read data.detailed_evaluation_results_per_samples):
- `components/benchmark-detail.tsx:3869`: random sample for preview block
- `components/benchmark-detail.tsx:4174-4216`: sample preview UI in benchmark detail
- `components/benchmark-detail.tsx:4982`: variant-level sample availability check
- `components/benchmark-detail.tsx:5284-5286`: variant sample picker
- `components/benchmark-detail.tsx:5569-5608`: sample preview list with INSTANCE_PREVIEW_LIMIT and "see all" expansion
Caller chain for fetchInstanceLevelData
None. Function is exported and unreached. Preserved with the intent that a future "show all samples" UI consumer wires up to it.
Pipeline status
Side-by-side comparison
| Aspect | TS (this spec) | Pipeline today | Result for users |
|---|---|---|---|
| Inline preview shape | parser handles many variants | emits ONE canonical shape (`input.raw`, `evaluation.is_correct`, etc.) | parser's fallback branches almost all dead |
| URL JSONL shape | same parser handles | emits IDENTICAL canonical shape (verified by sampling one URL on 2026-04-28) | parser would work the same on URL data |
| Inline preview size | parser doesn't care | at most 5 samples per `instance_examples` array | UI capped at 5 today |
| Total samples per row | n/a | typically 50 (per `instance_count` field), one outlier 18 | only 10% accessible to UI today |
| URL-fetch use case | `fetchInstanceLevelData` exists | `source_url` always emitted | dead code on TS side; no UI consumer |
Concrete worked example with quantified scope
Audited 2026-04-28 against .cache/hf-data/. Verified by scripts/verify-instance-level-data.mjs.
Prevalence:
- Total model files: 5,830
- Files with any `instance_level_data`: 55 (0.94%)
- Total `(metric × model_result)` rows: 86,183
- Result rows with `instance_level_data`: 712 (0.83%)
- Total inline preview examples (sum of `instance_examples.length`): 3,532 (always ≤5 per row)
- Total full samples (sum of `instance_count`): 66,057 (full set available via `source_url`, not loaded today; ~19× larger than what UI currently shows)
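The percentages and the ~19× ratio above follow directly from the raw counts. The snippet below is only a quick arithmetic check using the figures from the list; the variable names are illustrative.

```typescript
// Figures copied from the prevalence list above.
const audit = {
  modelFiles: 5830,
  filesWithIld: 55,
  resultRows: 86183,
  rowsWithIld: 712,
  inlineExamples: 3532,
  fullSamples: 66057,
};

const pct = (part: number, whole: number): string => ((part / whole) * 100).toFixed(2);

const fileShare = pct(audit.filesWithIld, audit.modelFiles); // "0.94"
const rowShare = pct(audit.rowsWithIld, audit.resultRows); // "0.83"
const fullToInline = audit.fullSamples / audit.inlineExamples; // about 18.7
const avgInlinePerRow = audit.inlineExamples / audit.rowsWithIld; // about 4.96
```

The average of about 4.96 inline examples per row is consistent with the "≤5 per row" bound: some rows carry fewer than 5 examples.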
ild-level shape uniformity (712/712 rows):
- Top-level keys are always exactly `{interaction_type, instance_count, source_url, instance_examples}`
- `interaction_type` is always `"multi_turn"` (no single_turn samples in cache)
Per-example branch firing rates (3,532 examples):
- `input`: `input.raw` 100%
- `ground_truth`: `input.reference` 100%
- `response`: `answer_attribution` 97.31%, `messages` 2.49%, `output` 0.20%
- `is_correct`: `evaluation.is_correct` 100%
- `sample_id`: `sample_id` 100%
- `choices`: nothing (always undefined)
The 7-branch input chain, 5-branch ground_truth chain, 4-branch is_correct chain, 4-branch sample_id chain, and 2-branch choices chain are defensive scaffolding for shapes the pipeline does not currently emit. The 7-branch response chain has 3 active sub-branches.
URL JSONL shape verification: sampled one source_url (anthropic__anthropic-claude-3-7-sonnet/swe_bench_verified_mini_...); first line had identical 18 keys to the inline instance_examples[0], with input.raw and evaluation.is_correct in expected paths. Pipeline emits the same canonical shape on both inline and URL paths.
Notes for pipeline implementer
- The pipeline already emits a canonical per-sample shape consistently. No structural change needed to current emission. The migration is to make this shape an explicit guarantee, not to change what's being emitted.
- Suggested guarantee: every `instance_examples[i]` (inline AND in JSONL at `source_url`) has at minimum `{sample_id, input.raw, input.reference?, evaluation.is_correct, answer_attribution? || messages?, metadata?}`.
- Once that guarantee is documented and verified, the TS parser shrinks dramatically: extract `raw.input.raw`, `raw.input.reference`, `raw.evaluation.is_correct`, `raw.sample_id` as direct field reads. The response field still needs the 3-branch fallback (`output`, then `answer_attribution`, then `messages`, in the parser's current order) until pipeline emits a single normalized `response` field.
- Do NOT pre-ingest the URL JSONL into pipeline parquet (per the architecture choice). The runtime UI fetches `source_url` on demand; this is the orphaned `fetchInstanceLevelData`'s intended use. The benefit of pre-ingestion (cross-corpus SQL queries over samples) is speculative; defer until a product feature demands it.
- The "shape uniformity" finding (712/712 rows have identical ild-level keys; all are `multi_turn`) suggests the pipeline already enforces the canonical shape. Worth documenting in the pipeline contract test (`tests/pipeline-contract.test.ts`).
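As an illustration of what the suggested guarantee could look like as an executable check, here is a minimal type guard covering the three required fields. It is a sketch only: `hasCanonicalShape` is a hypothetical name, and the optional fields of the guarantee are intentionally not checked.

```typescript
type Raw = Record<string, unknown>;

const isRecord = (v: unknown): v is Raw =>
  typeof v === "object" && v !== null && !Array.isArray(v);

// Checks the required part of the suggested guarantee:
// sample_id, input.raw and a boolean evaluation.is_correct.
function hasCanonicalShape(raw: unknown): raw is Raw {
  if (!isRecord(raw)) return false;
  const input = raw.input;
  const evaluation = raw.evaluation;
  return (
    raw.sample_id != null &&
    isRecord(input) &&
    input.raw != null &&
    isRecord(evaluation) &&
    typeof evaluation.is_correct === "boolean"
  );
}
```

A contract test could run a check like this over every inline example and every sampled JSONL line, failing on the first record that returns `false`.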
Migration checklist
- Spec written
- Tests cover each rule branch (`tests/transformations/instance-level-data.test.ts`)
- Audit script (`scripts/verify-instance-level-data.mjs`)
- Filed with pipeline owner with the spec + tests + audit script as acceptance criterion
- Pipeline contract: explicit guarantee of canonical per-sample shape (Tier A test asserting `every instance_example has input.raw, sample_id, evaluation.is_correct`)
- TS deleted: `parseInstanceLevelData` shrinks to direct field reads (~10 lines instead of 110), or fully deleted if pipeline emits already-flat normalized records. `fetchInstanceLevelData` stays orphaned-but-preserved for the future "show all samples" UI feature, OR is wired up if that feature ships.
Future product decisions (deferred)
- "Show all samples" UI feature: would un-orphan `fetchInstanceLevelData` and let users see the full 50-sample set instead of just the 5-sample preview. Lazy fetch on user click. This is the concrete capability the URL-fetch architecture supports; the spec assumes it's a product roadmap item, not committed scope.
- Cross-model sample querying / search / filter: would require `instance_samples.parquet` and SQL queries (the alternative architecture I initially proposed). Out of scope for this spec; revisit if/when product asks.
- Single-turn samples: pipeline currently only emits `interaction_type: multi_turn`. If pipeline starts emitting single_turn shapes that exercise dormant parser branches (e.g. flat `input` strings, `prompt`/`question` fields), the spec's "100% canonical shape" claim breaks and the parser fallbacks become live again.