Spaces:

evaleval
/

general-eval-card

Running

File size: 16,640 Bytes

da8db3e

# Per-instance JSONL normalization

Drafted 2026-04-28. Migration item #7 in `notes/migration-plan.md`.

## Framing reminder

We are refactoring for UI efficiency. TS-as-is is the canonical spec for behaviour. This spec covers a per-record value transform that extracts UI-friendly strings (input/response/correctness/etc.) from rich per-sample objects emitted by the pipeline. Audited 2026-04-28 against the full corpus, the parser's many fallback branches mostly do not fire — pipeline emits a single canonical shape — but the parser preserves them as defensive scaffolding for older or different harness output formats.

**Architecture choice (cleaning-only, no SQL involvement):** the inline `instance_examples` preview (≤5 samples per result) gets normalized in pipeline; the per-result `source_url` pointing to a full JSONL dump (typically 50 samples) stays as an on-demand UI fetch, NOT pre-ingested into a parquet table. This avoids speculative pipeline ingestion work and matches the actual product need (lazy-load all samples on user click, not cross-corpus sample querying). The orphaned `fetchInstanceLevelData` (`lib/hf-data.ts:890-917`) gets re-wired to a future "show all samples" UI feature rather than deleted.

## Rule (as TS implements it today)

### `parseInstanceLevelData` (`lib/hf-data.ts:933-1043`)

Pure function: takes a JSON object, walks `instance_examples[]`, returns `SampleResult[]`.

Top-level guard:
- If `data` is null/non-object → return `[]`
- If `data.instance_examples` is array → use it
- Else if `data` itself is array → use it
- Else → `[]`

Per-example field extraction (each is a fallback chain):

**`input` (string)** — first non-empty wins:
1. `raw.input` is string → use as-is
2. `raw.input.raw` is set → `String(raw.input.raw)`
3. `raw.prompt` → use as-is
4. `raw.question` → use as-is
5. `raw.doc.question` → use as-is
6. `raw.doc` exists → `JSON.stringify(raw.doc).slice(0, 500)`
7. (none) → empty string

**`ground_truth` (string | undefined)** — first non-null wins:
1. `raw.input.reference` is array → `array.join(", ")`; else → `String(...)`
2. `raw.ground_truth` → `String(...)`
3. `raw.target` → `String(...)`
4. `raw.gold` → `String(...)`
5. `raw.doc.answer` → `String(...)`
6. (none) → `undefined`

**`response` (string)** — first non-empty wins:
1. `raw.output` is set → string-as-is or `JSON.stringify(...)`
2. `raw.response` → use as-is
3. `raw.model_output` → use as-is
4. `raw.answer_attribution` is non-empty array → take last element's `extracted_value` (or empty string)
5. `raw.messages` is non-empty array → reverse + find last assistant message → string content or stringified
6. `raw.filtered_resps[0][0]` → use
7. `raw.resps[0][0]` → use
8. (none) → empty string

**`is_correct` (boolean | undefined)** — first defined wins:
1. `raw.evaluation.is_correct` (boolean)
2. `raw.is_correct` (boolean)
3. `raw.metrics.exact_match === 1` → true; `=== 0` → false; else undefined
4. (none) → `undefined`

**`metadata` (object | undefined)** — merged from (in order):
- `raw.evaluation` (if object)
- `raw.performance` (if object)
- `raw.metadata` (if object)
- `raw.metrics` (if object)
- If merged object is empty → `undefined`

**`sample_id` (string)** — first non-null wins:
1. `raw.sample_id` → as-is
2. `raw.doc_id` → as-is
3. `raw.id` → as-is
4. (none) → `String(arrayIndex)` (positional fallback)

**`choices` (any | undefined)** — first non-null wins:
1. `raw.choices`
2. `raw.doc.choices`
3. (none) → `undefined`

If a row's first-pass map returns `null` (i.e. `raw` was null or non-object), it's filtered out via `.filter(s => s !== null)`.

### `fetchInstanceLevelData` (`lib/hf-data.ts:890-917`) — currently orphaned

Takes a `(url, limit?)`. Fetches the URL via `fetch()`. Splits text on newlines (filters empty lines). For each line up to `limit` (or all if no limit), tries `JSON.parse`; skips malformed lines. Wraps the parsed array as `{ instance_examples: parsed }` and passes to `parseInstanceLevelData`. Returns `SampleResult[]` or `[]` on any error.

**Active call sites: zero.** Verified by grep across `app/`, `components/`, `scripts/`, other `lib/` files. Only mention is its own declaration. Git log shows one commit ("Refresh eval cards UI and backend data flow") in its history.

The function is intended for "load more / show all samples" UI feature — pipeline ships a `source_url` per row pointing to the full JSONL (typically 50 samples), but the inline `instance_examples` preview only carries 5. The fetcher would let UI request the full set on user demand. **This UI feature has never shipped.**

## Classification

- **Unconditional normalization.** The function always runs on whatever shape is provided; never gates on a pre-existing canonical field. Pipeline-side fix: emit a canonical per-sample shape so the multi-field fallback chains become unnecessary.
- **Cleaning → pipeline.** Pure value transform per sample. No aggregation, no reshape, no cross-record operations. Migration target: pipeline emits canonical per-sample shape on both the inline preview AND the URL JSONL files; TS parser shrinks to direct field reads or deletes entirely.
- **NOT reshape.** Per the architecture choice in the framing note above, samples stay as on-demand fetches via `source_url`; no parquet `instance_samples` table, no SQL queries over samples. (If a future product feature wants cross-model sample search/filter/comparison, that's a separate reshape spec.)

## Inputs and expected outputs

### Group A — Pipeline-canonical shape (the only shape that fires in production today)

Input shape (from cache `result.instance_level_data.instance_examples[i]`):

```jsonc
{
  "schema_version": "...",
  "evaluation_id": "...",
  "model_id": "...",
  "evaluation_name": "...",
  "sample_id": "...",
  "sample_hash": "...",            // sometimes present
  "interaction_type": "multi_turn",
  "input": { "raw": "..." },        // ALWAYS object with .raw in production
  "output": "..." | { ... },        // sometimes
  "messages": [{ role: "...", content: "..." }, ...],  // typically present
  "answer_attribution": [..., { "extracted_value": "..." }],  // typically present
  "evaluation": { "is_correct": true|false, ... },  // ALWAYS present
  "performance": { ... },
  "metadata": { ... },
  "token_usage": { ... },
  "error": null | "...",
  "hierarchy": [...]
}
```

Expected output (`SampleResult`):

| Output field | Source path that fires | Notes |
|---|---|---|
| `sample_id` | `raw.sample_id` (100% of production) | always present |
| `input` | `raw.input.raw` (100%) | branch #2 in the chain |
| `ground_truth` | `raw.input.reference` (100%) | branch #1 |
| `response` | `raw.answer_attribution` (97.31%) OR `raw.messages` (2.49%) OR `raw.output` (0.20%) | branches #4, #5, #1 |
| `is_correct` | `raw.evaluation.is_correct` (100%) | branch #1 |
| `choices` | `undefined` (100%) | no branch fires; field is unset in production |
| `metadata` | merged from `raw.evaluation`, `raw.performance`, `raw.metadata`, `raw.metrics` (always at least 2 of 4 present) | merged object |

### Group B — Defensive fallback branches (zero firing rate in current production)

| Branch | Output field | Production hits | Origin (presumed) |
|---|---|---|---|
| `raw.input` (string) | input | 0 | older harness shapes |
| `raw.prompt` | input | 0 | lm-eval-harness |
| `raw.question` | input | 0 | other harnesses |
| `raw.doc.question` | input | 0 | HELM-style |
| `raw.doc` (JSON.stringify) | input | 0 | last-resort |
| `raw.ground_truth` | ground_truth | 0 | older shapes |
| `raw.target` | ground_truth | 0 | classification benchmarks |
| `raw.gold` | ground_truth | 0 | older lm-eval |
| `raw.doc.answer` | ground_truth | 0 | HELM-style |
| `raw.response` | response | 0 | older shapes |
| `raw.model_output` | response | 0 | older shapes |
| `raw.filtered_resps[0][0]` | response | 0 | lm-eval-harness format |
| `raw.resps[0][0]` | response | 0 | lm-eval-harness format |
| `raw.is_correct` | is_correct | 0 | flat shape |
| `raw.metrics.exact_match` | is_correct | 0 | metric-based correctness |
| `raw.doc_id` | sample_id | 0 | HELM-style |
| `raw.id` | sample_id | 0 | generic |
| index fallback | sample_id | 0 | last-resort |
| `raw.choices` | choices | 0 | multiple-choice |
| `raw.doc.choices` | choices | 0 | HELM multiple-choice |

These branches exist for shapes the pipeline currently does not emit. **Preserve verbatim** until pipeline-side guarantees the canonical shape across all data sources.

### Group C — `fetchInstanceLevelData` JSONL parsing edge cases

| Input | Behavior |
|---|---|
| URL returns `!res.ok` (404, 500, etc.) | returns `[]` (no throw) |
| URL throws (network error) | logs warning to console, returns `[]` |
| Empty body | splits to `[]`, returns `[]` |
| Body with empty lines | `.filter(line => line.trim())` strips them |
| Body with malformed JSON line | swallowed in inner try-catch; line skipped, processing continues |
| `limit=0` or `limit=undefined` | parses ALL lines |
| `limit > lines.length` | parses all lines (capped via `Math.min`) |

## Current TS implementation

| Concern | Location | Notes |
|---|---|---|
| `parseInstanceLevelData` | `lib/hf-data.ts:933-1043` | The parser; ~110 lines |
| `fetchInstanceLevelData` | `lib/hf-data.ts:890-917` | URL fetcher; orphaned (zero callers) |
| Active call site | `lib/hf-data.ts:1273` (inside `flattenHierarchyNode`) | `parseInstanceLevelData(result.instance_level_data)` → `inlineSamples` |
| Internal call site | `lib/hf-data.ts:912` | inside `fetchInstanceLevelData` itself, recursive call to the parser |
| `SampleResult` type | `lib/benchmark-schema.ts:135` | Output shape definition |
| Output field | `BenchmarkEvaluation.detailed_evaluation_results_per_samples` | `lib/benchmark-schema.ts:33` |

### Caller chain for `parseInstanceLevelData`

`getModelSummaryById` (lib/model-data.ts:1490+) → `flattenModelEvaluations` → `flattenHierarchyNode` (lib/hf-data.ts:1273) → `parseInstanceLevelData(result.instance_level_data)` → set as `inlineSamples` on each variant bucket → propagated to `BenchmarkEvaluation.detailed_evaluation_results_per_samples`.

UI consumers (read `data.detailed_evaluation_results_per_samples`):
- `components/benchmark-detail.tsx:3869` — random sample for preview block
- `components/benchmark-detail.tsx:4174-4216` — sample preview UI in benchmark detail
- `components/benchmark-detail.tsx:4982` — variant-level sample availability check
- `components/benchmark-detail.tsx:5284-5286` — variant sample picker
- `components/benchmark-detail.tsx:5569-5608` — sample preview list with INSTANCE_PREVIEW_LIMIT and "see all" expansion

### Caller chain for `fetchInstanceLevelData`

None. Function is exported and unreached. Preserved with the intent that a future "show all samples" UI consumer wires up to it.

## Pipeline status

### Side-by-side comparison

| Aspect | TS (this spec) | Pipeline today | Result for users |
|---|---|---|---|
| Inline preview shape | parser handles many variants | emits ONE canonical shape (`input.raw`, `evaluation.is_correct`, etc.) | parser's fallback branches almost all dead |
| URL JSONL shape | same parser handles | emits IDENTICAL canonical shape (verified by sampling one URL on 2026-04-28) | parser would work the same on URL data |
| Inline preview size | parser doesn't care | always exactly 5 samples per `instance_examples` array | UI capped at 5 today |
| Total samples per row | n/a | typically 50 (per `instance_count` field), one outlier 18 | only 10% accessible to UI today |
| URL-fetch use case | `fetchInstanceLevelData` exists | `source_url` always emitted | dead code on TS side; no UI consumer |

### Concrete worked example with quantified scope

Audited 2026-04-28 against `.cache/hf-data/`. Verified by `scripts/verify-instance-level-data.mjs`.

**Prevalence:**
- Total model files: 5,830
- Files with any `instance_level_data`: **55 (0.94%)**
- Total `(metric × model_result)` rows: 86,183
- Result rows with `instance_level_data`: **712 (0.83%)**
- Total inline preview examples (sum of `instance_examples.length`): **3,532** (always ≤5 per row)
- Total full samples (sum of `instance_count`): **66,057** (full set available via `source_url`, not loaded today; ~19× larger than what UI currently shows)

**ild-level shape uniformity (712/712 rows):**
- Top-level keys are always exactly `{interaction_type, instance_count, source_url, instance_examples}`
- `interaction_type` is always `"multi_turn"` (no single_turn samples in cache)

**Per-example branch firing rates (3,532 examples):**
- `input`: `input.raw` 100%
- `ground_truth`: `input.reference` 100%
- `response`: `answer_attribution` 97.31%, `messages` 2.49%, `output` 0.20%
- `is_correct`: `evaluation.is_correct` 100%
- `sample_id`: `sample_id` 100%
- `choices`: nothing (always undefined)

The 7-branch input chain, 5-branch ground_truth chain, 4-branch is_correct chain, 4-branch sample_id chain, 2-branch choices chain are **defensive scaffolding** for shapes the pipeline does not currently emit. The 7-branch response chain has 3 active sub-branches.

**URL JSONL shape verification:** sampled one source_url (`anthropic__anthropic-claude-3-7-sonnet/swe_bench_verified_mini_...`); first line had identical 18 keys to the inline `instance_examples[0]`, with `input.raw` and `evaluation.is_correct` in expected paths. Pipeline emits the same canonical shape on both inline and URL paths.

## Notes for pipeline implementer

- The pipeline already emits a canonical per-sample shape consistently. **No structural change needed to current emission.** The migration is to make this shape an explicit guarantee, not to change what's being emitted.
- Suggested guarantee: every `instance_examples[i]` (inline AND in JSONL at `source_url`) has at minimum `{sample_id, input.raw, input.reference?, evaluation.is_correct, answer_attribution? || messages?, metadata?}`.
- Once that guarantee is documented and verified, the TS parser shrinks dramatically: extract `raw.input.raw`, `raw.input.reference`, `raw.evaluation.is_correct`, `raw.sample_id` as direct field reads. The response field still needs the 3-branch fallback (answer_attribution → messages → output) until pipeline emits a single normalized `response` field.
- **Do NOT pre-ingest the URL JSONL into pipeline parquet** (per the architecture choice). The runtime UI fetches `source_url` on demand; this is the orphaned `fetchInstanceLevelData`'s intended use. The benefit of pre-ingestion (cross-corpus SQL queries over samples) is speculative; defer until a product feature demands it.
- The "shape uniformity" finding (712/712 rows have identical ild-level keys; all are `multi_turn`) suggests the pipeline already enforces the canonical shape. Worth documenting in the pipeline contract test (`tests/pipeline-contract.test.ts`).

## Migration checklist

- [x] Spec written
- [x] Tests cover each rule branch (`tests/transformations/instance-level-data.test.ts`)
- [x] Audit script (`scripts/verify-instance-level-data.mjs`)
- [ ] Filed with pipeline owner with the spec + tests + audit script as acceptance criterion
- [ ] Pipeline contract: explicit guarantee of canonical per-sample shape (Tier A test asserting `every instance_example has input.raw, sample_id, evaluation.is_correct`)
- [ ] TS deleted: `parseInstanceLevelData` shrinks to direct field reads (~10 lines instead of 110), or fully deleted if pipeline emits already-flat normalized records. `fetchInstanceLevelData` stays orphaned-but-preserved for the future "show all samples" UI feature, OR is wired up if that feature ships.

## Future product decisions (deferred)

- **"Show all samples" UI feature** — would un-orphan `fetchInstanceLevelData` and let users see the full 50-sample set instead of just the 5-sample preview. Lazy fetch on user click. This is the concrete capability the URL-fetch architecture supports; the spec assumes it's a product roadmap item, not committed scope.
- **Cross-model sample querying / search / filter** — would require `instance_samples.parquet` and SQL queries (the alternative architecture I initially proposed). Out of scope for this spec; revisit if/when product asks.
- **Single-turn samples** — pipeline currently only emits `interaction_type: multi_turn`. If pipeline starts emitting single_turn shapes that exercise dormant parser branches (e.g. flat `input` strings, `prompt`/`question` fields), the spec's "100% canonical shape" claim breaks and the parser fallbacks become live again.