# InspectorRAGet — Architecture

> Living document. Captures the current state of the codebase. Update as architecture evolves.

## Overview

InspectorRAGet is a client-side introspection platform for LLM evaluation. Users upload JSON files containing evaluation data (models, metrics, tasks, and model results) and explore results through aggregate and instance-level visualizations. The platform does not execute experiments — it is purely analytical.

Built with Next.js 16 (App Router), React 18, TypeScript 5.9, and IBM Carbon Design System.

## High-Level Data Flow

```
                              ┌─────────────────────┐
                              │   JSON Input File    │
                              │  (user upload or     │
                              │   data/ directory)   │
                              └─────────┬───────────┘
                                        │
                              ┌─────────▼───────────┐
                              │   migrator.ts        │
                              │  Schema migration    │
                              │  (v1 → v2, etc.)     │
                              └─────────┬───────────┘
                                        │
                              ┌─────────▼───────────┐
                              │   validators.ts      │
                              │  Schema validation   │
                              └─────────┬───────────┘
                                        │
                              ┌─────────▼───────────┐
                              │   processor.ts       │
                              │  Qualify/disqualify  │
                              │  tasks by metric     │
                              │  completeness        │
                              └─────────┬───────────┘
                                        │
                              ┌─────────▼───────────┐
                              │   DataStore context  │
                              │  (store.tsx)         │
                              │  Data + taskMap      │
                              │       + resultsMap   │
                              └─────────┬───────────┘
                                        │
                    ┌───────────────────┼───────────────────┐
                    │                   │                   │
            ┌───────▼──────┐   ┌───────▼──────┐   ┌───────▼──────┐
            │  Aggregate   │   │  Instance    │   │  Annotator   │
            │  Views       │   │  Views       │   │  Views       │
            │  (overview,  │   │  (task       │   │  (agreement, │
            │   model,     │   │   detail,    │   │   behavior)  │
            │   metric)    │   │   per type)  │   │              │
            └──────────────┘   └──────────────┘   └──────────────┘
```

## Directory Structure

```
InspectorRAGet/
├── src/
│   ├── app/                    # Next.js App Router — thin page shells
│   │   ├── layout.tsx          # Root: ThemeProvider → NotificationProvider → DataStoreProvider
│   │   ├── page.tsx            # / — Home landing page
│   │   ├── visualize/          # /visualize — Upload and analyze
│   │   ├── examples/           # /examples — Browse pre-loaded datasets
│   │   │   └── [example_id]/   # /examples/:id — Specific dataset analysis
│   │
│   ├── views/                  # Page-level container components
│   │   ├── home/               # Landing page cards
│   │   ├── on-board/           # Multi-step upload wizard (instructions → upload → verify)
│   │   ├── example/            # Main analysis hub — 7-tab interface
│   │   ├── examples/           # Grid of dataset tiles
│   │   ├── visualization/      # Onboard → Example router
│   │   ├── task/               # Instance-level task viewer (modal overlay)
│   │   │   └── Task.tsx        # Dispatches to type-specific TaskView via registry
│   │   ├── performance-overview/   # Aggregate metric tables + charts
│   │   ├── model-behavior/     # Per-metric distribution analysis
│   │   ├── metric-behavior/    # Cross-metric correlation
│   │   ├── model-comparator/   # Head-to-head model comparison
│   │   ├── data-characteristics/   # Dataset statistics
│   │   ├── annotator-behavior/ # Inter-annotator agreement
│   │   ├── predictions-table/  # Filterable evaluation table
│   │   ├── tasks-table/        # Task listing with filters
│   │   ├── annotations-table/  # Per-task metric scores
│   │   └── document/           # Document viewer
│   │
│   ├── task-types/             # Vertical slice per evaluation type
│   │   ├── index.ts            # Registry: taskTypeRegistry maps type string → { TaskView, Copier }
│   │   ├── qa/                 # Single-turn QA with retrieved context
│   │   │   ├── types.ts        # RetrievedDocument, RetrievedDocumentAnnotation
│   │   │   ├── TaskView.tsx    # Input + contexts + per-model response + evaluations/steps tabs
│   │   │   └── Copier.tsx
│   │   ├── generation/         # Open-ended text/JSON generation
│   │   │   ├── TaskView.tsx    # Input + per-model response + evaluations/steps tabs
│   │   │   └── Copier.tsx
│   │   ├── rag/                # Multi-turn retrieval conversation
│   │   │   ├── types.ts        # Message union (SystemMessage, UserMessage, AssistantMessage, ToolMessage, …)
│   │   │   ├── TaskView.tsx    # Conversation thread + per-model response + evaluations/steps tabs
│   │   │   ├── Copier.tsx
│   │   │   └── components/
│   │   │       ├── ChatLine.tsx    # Renders a single OpenAI-format message (status ring, retries footer)
│   │   │       ├── Avatar.tsx      # Role avatar with status ring (pass/warn/fail outline)
│   │   │       └── DocumentsViewer.tsx
│   │   ├── tool_calling/       # Function/tool calling evaluation
│   │   │   ├── types.ts        # ToolDefinition (OpenAI JSON Schema format)
│   │   │   ├── TaskView.tsx    # Conversation + available tools panel + prediction/target/evaluations/steps
│   │   │   └── Copier.tsx
│   │   └── agentic/            # Goal-directed multi-turn agent execution
│   │       ├── TaskView.tsx    # Goal + initial state + target state + execution thread + evaluations/steps
│   │       └── Copier.tsx
│   │
│   ├── components/             # Reusable UI components
│   │   ├── header/             # App header with nav and theme toggle
│   │   ├── filters/            # Generic filter controls
│   │   ├── expression-builder/ # Advanced filter expression builder
│   │   ├── selectors/          # Model, Metric, Aggregator selectors
│   │   ├── evaluations/        # EvaluationsPanel — shared human + algorithmic score tables
│   │   ├── trace/              # Execution trace: TraceGroup + TraceItem (collapsible cards for invocation/tool_execution/observation events)
│   │   ├── comments/           # Task commenting system (see Comment System section below)
│   │   ├── notification/       # Toast notifications (context provider)
│   │   ├── avatar/             # User/agent avatars
│   │   ├── task-tile/          # Task summary card
│   │   ├── example-tile/       # Dataset summary card
│   │   └── disabled/           # Disabled feature placeholder
│   │
│   ├── hooks/
│   │   ├── useBackButton.ts    # Browser back navigation
│   │   ├── useStorage.ts       # localStorage persistence
│   │   └── usePrevious.ts      # Previous render value
│   │
│   ├── utilities/
│   │   ├── strings.ts          # Hashing, truncation, search matching
│   │   ├── colors.ts           # Color scale generation
│   │   ├── objects.ts          # camelCase/snakeCase key conversion
│   │   ├── aggregators.ts      # Mean, median, majority, weighted aggregators
│   │   ├── metrics.ts          # Metric helper functions
│   │   ├── selectors.ts        # Mouse selection extraction
│   │   ├── expressions.ts      # Expression evaluation for advanced filters
│   │   ├── correlation.ts      # Statistical correlation
│   │   ├── significance.ts     # Statistical significance tests
│   │   ├── highlighter.ts      # Text overlap highlighting
│   │   └── time.ts             # Duration calculation
│   │
│   ├── workers/
│   │   └── filter.ts           # Web Worker for background data filtering
│   │
│   ├── types.ts                # Core TypeScript interfaces (re-exports task-type-specific types)
│   ├── store.tsx               # DataStoreProvider (React Context)
│   ├── migrator.ts             # Versioned schema migration chain (v1 → v2 → …)
│   ├── processor.ts            # Data qualification pipeline
│   ├── exporter.ts             # Export pipeline (split from processor.ts)
│   ├── validators.ts           # Input schema validation
│   ├── dataloader.ts           # Server-side data/ directory loader
│   └── theme.tsx               # ThemeProvider (Carbon g10/g90)
│
├── converters/                 # Dataset converters
│   ├── bfcl/                   # Berkeley Function Calling Leaderboard (single-turn and multi-turn, V3/V4)
│   └── acebench/               # ACEBench (tool-calling and agentic categories)
├── data/                       # Pre-loaded example datasets (JSON, schema v2)
├── notebooks/                  # Integration notebooks (Ragas, LM Eval, HuggingFace, BFCL)
├── public/                     # Static assets (favicon, license)
└── docs/                       # Documentation (this file)
```

## Core Data Model

Defined in `src/types.ts`. The input JSON (schema v2) has this structure:

```
RawData
├── schema_version?: number            # 2 = current; absent or 1 = legacy (auto-migrated)
├── name?: string
├── models: Model[]                    # LLMs being evaluated
│   └── { modelId, name, owner, ... }
├── metrics: Metric[]                  # Evaluation criteria
│   └── { name, type: numerical|categorical|text, author: human|algorithm, ... }
├── documents?: RetrievedDocument[]    # Corpus documents (QA/RAG tasks)
│   └── { documentId, text, title?, url?, score? }
├── filters?: string[]                 # Task fields available for filtering
├── tasks: Task[]                      # Individual evaluation instances
│   └── { taskId, taskType: qa|generation|rag|tool_calling|agentic,
│          input, targets?: TaskTarget[], tools?: ToolDefinition[],
│          flagged?, comments?: TaskComment[], annotations? }
└── results: ModelResult[]             # Model outputs + metric scores
    └── { taskId, modelId, output: Message[], scores: { [metric]: { [annotator]: { value } } },
           contexts?, comments?: TaskComment[] }
```

`output` is always a `Message[]`. For single-inference task types (qa, generation, rag, tool_calling) it is a one-element array. For `agentic` tasks it is the full execution thread: interleaved user, assistant, and tool messages across all turns. Trace events live on individual messages as `message.trace`, not at the result level.

### Key type unions

**`Message`** — OpenAI-compatible message shape:

- `role: 'system' | 'user' | 'assistant' | 'tool'`
- `content?: string` — text response
- `tool_calls?: ToolCallRecord[]` — tool-calling output (on assistant messages)
- `trace?: TraceEvent[]` — execution trace attached to the assistant message that produced it; each event is discriminated on `type`: `invocation | tool_execution | observation`
- `retries?: MessageRetry[]` — intermediate retry attempts before final output
- `metadata?: Record<string, unknown>` — benchmark-supplied metadata; known keys: `status` (`'pass' | 'fail' | 'warn'`) rendered as a coloured badge in the chat footer, and `statusDefinition` (string) shown as a hover tooltip on that badge

**`ModelResult`** carries an optional `metadata?: Record<string, unknown>` bag for benchmark-supplied per-result diagnostics. Known key: `error` — `{ kind: 'text' | 'structured', context: unknown }` used by the BFCL agentic converter to surface structured state-diff details.

**`TaskTarget`** — discriminated on `type`:

- `{ type: 'text'; value: string }` — most task types
- `{ type: 'tool_calls'; calls: ToolCallRecord[] }` — tool-calling ground truth
- `{ type: 'state'; value: Record<string, unknown> }` — agentic expected final environment state
- `{ type: 'image'; url: string }` — multimodal (future)

**`TraceEvent`** — discriminated on `type`:

- `invocation` — an intermediate LLM call within a turn (before the accepted output)
- `tool_execution` — environment response(s) following an intermediate invocation
- `observation` — runner feedback after a decode failure, empty response, or forced termination
- All events carry an optional `label` (e.g. `"step_2"` matching the inference log key) and `content` string

### Schema migration

`migrator.ts` runs before `validators.ts` on every load. The migration chain is:

- **v1 → v2:** renames legacy task types (`rag` single-turn → `qa`, `rag` multi-turn → `rag`, `text_generation`/`json_generation` → `generation`, `chat` → `rag`); wraps `model_response` string → `output: [{ role: 'assistant', content }]`; renames `annotations` → `scores`; renames `evaluations` array → `results`

Exported files are always stamped with `schema_version: CURRENT_SCHEMA_VERSION`.

After processing (`processor.ts`), tasks are qualified or disqualified based on:

1. Whether all plottable metrics have scores
2. Whether results exist for all specified models
3. Whether score values are non-empty

The qualified data becomes the `Data` interface (extends `TileData`), stored in `DataStore` context.

## State Management

**Global state** is React Context, not Redux:

- `DataStoreProvider` (`store.tsx`): holds `Data`, `taskMap: Map<taskId, Task>`, and `resultsMap: Map<"taskId::modelId", ModelResult>`
  - `updateTask(taskId, update)` — immutable Map update for task-level changes (flags, task comments)
  - `updateResult(taskId, modelId, update)` — immutable Map update for model-result-level changes (model comments)
- `ThemeProvider` (`theme.tsx`): Carbon theme toggle (light g10 / dark g90)
- `NotificationProvider` (`components/notification/`): toast messages

**Local state**: each view manages its own filters, selections, and UI state via `useState`.

**Web Workers**: `ModelBehavior` view spawns a filter worker for expensive filtering operations to avoid blocking the UI thread.

## Task-Type Registry

`src/task-types/index.ts` exports `taskTypeRegistry`:

```typescript
const taskTypeRegistry: Record<
  string,
  { TaskView: ComponentType; Copier: ComponentType }
> = {
  qa: { TaskView: QATaskView, Copier: QACopier },
  generation: { TaskView: GenerationTaskView, Copier: GenerationCopier },
  rag: { TaskView: RAGTaskView, Copier: RAGCopier },
  tool_calling: { TaskView: ToolCallingTaskView, Copier: ToolCallingCopier },
  agentic: { TaskView: AgenticTaskView, Copier: AgenticCopier },
};
```

`Task.tsx` and `TaskCopier.tsx` look up the component via `taskTypeRegistry[task.taskType]` — no if/else chains. Unknown task types degrade gracefully to null.

## Comment System

Comments live at two levels:

- **`task.comments`** — task-level observations shared across all models
- **`result.comments`** — per-model observations (e.g. noting an acceptable-but-different tool call)

`Task.tsx` routes a new comment to the correct level by inspecting the provenance component string: any component containing `::` is model-scoped (written to `updateResult`); others are task-scoped (written to `updateTask`).

### Provenance component string convention

| Component string pattern                               | Meaning                         | Scope |
| ------------------------------------------------------ | ------------------------------- | ----- |
| `input` / `messages`                                   | Input or conversation area      | Task  |
| `document_{id}`                                        | Retrieved context document      | Task  |
| `target`                                               | Ground-truth target area        | Task  |
| `{modelId}::evaluation::response`                      | Model response text             | Model |
| `{modelId}::evaluation::prediction`                    | Model prediction (tool calling) | Model |
| `{modelId}::evaluation::scores::{metric}::{annotator}` | Specific score cell             | Model |
| `{modelId}::steps::{stepId}`                           | Specific execution step         | Model |

### Floating selection button

After any `mouseup` on the `.taskViewWrapper` div, `Task.tsx` captures viewport coordinates. If provenance is also set (text was selected), `SelectionCommentButton` renders as a `position: fixed` button near the cursor. Clicking it opens `AddCommentModal` with provenance pre-filled. Clicking anywhere else clears the coords and dismisses the button.

### `provenanceTag.ts`

Single source of truth for deriving display pills from a provenance component string. Returns `{ primary: [label, carbonType], detail?: [label1, label2] }`. The `detail` pair is only set for score cells (metric + annotator) and step references (\"step\" + stepId). All three comment modals (`AddCommentModal`, `EditCommentModal`, `CommentsViewer`) import from here.

### `CommentFinding`

Optional structured annotation attached to a `TaskComment`. Discriminated on `type`:

- `tool_call` — points to the correct function name/arguments
- `query` — records the correct retrieval query
- `output` — records a corrected reference output
- `note` — free-form structured note

`CommentFindingEditor` renders type-appropriate fields filtered by `task.taskType`. Findings are stored in the comment but editing them post-creation is out of scope (display-only in `EditCommentModal`).

## Routing

| Route            | Server/Client | What it does                             |
| ---------------- | ------------- | ---------------------------------------- |
| `/`              | Server        | Renders Home with navigation cards       |
| `/visualize`     | Client        | Upload wizard → analysis view            |
| `/examples`      | Server        | Loads `data/` dir, renders dataset grid  |
| `/examples/[id]` | Server        | Loads specific dataset, renders analysis |

The analysis hub (`Example` view) provides 7 tabs:

1. **Data Characteristics** — dataset statistics, distributions
2. **Predictions Table** — filterable evaluation records
3. **Annotator Behavior** — inter-annotator agreement heatmap
4. **Performance Overview** — aggregate metrics with rankings
5. **Model Behavior** — per-metric distribution analysis
6. **Model Comparator** — head-to-head comparison
7. **Metric Behavior** — cross-metric correlation

Clicking a task in any table opens a **Task modal overlay** (`views/task/Task.tsx`) which dispatches to the type-specific `TaskView` from the registry.

## Key Technical Details

### Data Processing Pipeline

`migrator.ts` → `migrateData(raw)` → `validators.ts` → `validateInputData(data)` → `processor.ts` → `processData(raw)` returns `[Data, DisqualifiedTasks, Notification[]]`

- Migration runs first, before `camelCaseKeys`, so it operates on raw snake_case fields
- Processor validates every result has scores for all plottable metrics, ensures every task has results from all specified models, sorts categorical metric values, computes metric ranges
- `bin()` in `utilities/metrics.ts` maps a numeric value to its `[start, end]` bucket string using the metric's `range: [min, max, step]`. Values below `min` map to `<min` and values above `max` map to `>max` — both render as normal category bars in Carbon Charts, preventing unbounded raw-value bins from outliers

### Input Validation

`validators.ts` → `validateInputData(data)` returns `{ valid, reasons[] }`

- Checks required fields on models, metrics, tasks, results
- Validates metric type constraints (categorical needs values, numerical can't use majority)
- QA tasks must reference documents
- Every value in a categorical metric must carry a `numeric_value` (camelCase: `numericValue`)

**Why `numeric_value` is required on categorical metric values**

The entire aggregation and sorting pipeline operates on numbers, not label strings. `castToNumber()` in `utilities/metrics.ts` maps a string label to its `numericValue` so that mean, median, and inter-annotator agreement distance can be computed arithmetically. `computeMajority()` uses `Math.abs(castToNumber(a) - castToNumber(b))` to decide whether the top-two annotator choices are "close" (high agreement) or "far apart" (no agreement). `sortMetricValues()` in `processor.ts` sorts the metric's value list by `numericValue` so UI dropdowns, chart axes, and filter ranges reflect the researcher's intended ordering. `compareMetricAggregatedValues()` in `metrics.ts` uses `numericValue` to order chart bars for majority-aggregated metrics.

Without `numericValue`, every one of these paths falls back to `parseFloat(label)`, which returns `NaN` for any non-numeric string. That silently corrupts aggregate statistics, chart orderings, and agreement calculations. The validator rejects files with missing `numericValue` on categorical entries so the problem is surfaced at load time rather than producing incorrect visualizations.

**Convention:** assign `numericValue` so that **higher = better**. `sortMetricValues()` sorts ascending by `numericValue`, so `values[0]` becomes `minValue` (worst) and `values[last]` becomes `maxValue` (best). `PerformanceOverview` normalises scores as `(score - minValue) / (maxValue - minValue)` and ranks models with higher scores first. Example: `{ value: "poor", numeric_value: 0 }`, `{ value: "acceptable", numeric_value: 1 }`, `{ value: "good", numeric_value: 2 }`.

### JSON Key Convention

Input files use `snake_case`. The app converts to `camelCase` on load (`camelCaseKeys` in `objects.ts`) and back to `snake_case` on export (`snakeCaseKeys`). Migration runs before `camelCaseKeys`.

### Styling

- SCSS Modules: one `.module.scss` per component, co-located
- Carbon Design tokens for spacing, colors, typography
- Global styles in `src/app/global.scss`
- Theme: Carbon `g10` (light) and `g90` (dark), toggled via header
- Sass uses `@use` (not `@import`) for Carbon v11 / Turbopack compatibility
- Carbon font-face disabled via `$css--font-face: false` in global.scss (Turbopack can't resolve `~` prefix)

### Carbon Component Gotchas

**TabPanel renders all panels simultaneously (hidden, not unmounted)**

Carbon's `TabPanel` uses the HTML `hidden` attribute to hide inactive panels — it does NOT lazy-mount or unmount them. All tab panels and their full component trees are live in the DOM at all times. Consequences:

- Any component with a non-unique `id` prop will have duplicate DOM IDs across tabs. This silently breaks components that rely on `id` for internal DOM wiring (labels, aria associations, focus management). The browser uses the **first** matching element — clicks on a selector in a later tab quietly target the hidden first tab's element instead.
- **Rule:** every Carbon component that takes an `id` prop and is used in multiple tabs must have a globally unique `id`. Pattern used here: `{view-name}-{component-name}`, e.g. `model-behavior-model-selector`, `metric-behavior-model-selector`.
- Components confirmed affected: `FilterableMultiSelect`, `Toggle`, `Select`. Assume all interactive Carbon components are affected.

**`FilterableMultiSelect` — controlled vs uncontrolled**

- Use `selectedItems` (controlled) rather than `initialSelectedItems` (uncontrolled) when the parent needs to own the selection state (e.g. to filter data). The uncontrolled path fires `onChange` via a `useEffect` guarded by an `isMounted` ref; under React StrictMode's double-invoke behaviour this guard can leave the component unresponsive.
- Always add a null guard to `itemToString`: `(item) => (item ? item.name : '')`. Carbon passes `null` to `itemToString` in some internal code paths (e.g. when clearing the filter input); without the guard this throws and corrupts Downshift's internal state.

**`@carbon/charts-react/styles.css` import**

Import this stylesheet once, at the highest shared layout level (e.g. `app/layout.tsx` or `global.scss`), not per-component. All rules are scoped under `.cds--chart-holder` so there is no global style pollution, but importing it in multiple component files creates redundant CSS bundles.

### Deployment

- Dockerfile for containerized deployment
- `next.config.js` sets `output: 'standalone'` for minimal Docker images
- Security headers configured (CSP, HSTS, X-Frame-Options)
- Can also deploy on HuggingFace Spaces

## Known Technical Debt

1. **Type safety gaps** — `any` types on `Task.input` and `Model.trainingDetails`; `task.annotations` is untyped (context quality scores on RAG/QA documents, distinct from `result.scores` — see TODO comment in `types.ts`)
2. **Pre-existing ESLint warnings** — 19 errors and 15 warnings from `react-hooks/exhaustive-deps`, `react-compiler` rules, and `setState`-in-effect patterns; need case-by-case review
3. **ESLint 10 blocked** — `eslint-plugin-react` (bundled by `eslint-config-next`) uses deprecated `getFilename` API removed in ESLint 10; pinned to v9 until upstream fixes
4. **No client-side component tests** — all tests are pure utility/logic tests; interactive Carbon component bugs are only caught manually