Spaces:

kpfadnis
/

InspectorRAGet

Running

App Files Files

InspectorRAGet / docs /ARCHITECTURE.md

kpfadnis

feat (acebench): ACEBench converter, README, and architecture docs

b359fde 2 months ago

preview code

raw

history blame

27.2 kB

InspectorRAGet — Architecture

Living document. Captures the current state of the codebase. Update as architecture evolves.

Overview

InspectorRAGet is a client-side introspection platform for LLM evaluation. Users upload JSON files containing evaluation data (models, metrics, tasks, and model results) and explore results through aggregate and instance-level visualizations. The platform does not execute experiments — it is purely analytical.

Built with Next.js 16 (App Router), React 18, TypeScript 5.9, and IBM Carbon Design System.

High-Level Data Flow

                              ┌─────────────────────┐
                              │   JSON Input File    │
                              │  (user upload or     │
                              │   data/ directory)   │
                              └─────────┬───────────┘
                                        │
                              ┌─────────▼───────────┐
                              │   migrator.ts        │
                              │  Schema migration    │
                              │  (v1 → v2, etc.)     │
                              └─────────┬───────────┘
                                        │
                              ┌─────────▼───────────┐
                              │   validators.ts      │
                              │  Schema validation   │
                              └─────────┬───────────┘
                                        │
                              ┌─────────▼───────────┐
                              │   processor.ts       │
                              │  Qualify/disqualify  │
                              │  tasks by metric     │
                              │  completeness        │
                              └─────────┬───────────┘
                                        │
                              ┌─────────▼───────────┐
                              │   DataStore context  │
                              │  (store.tsx)         │
                              │  Data + taskMap      │
                              │       + resultsMap   │
                              └─────────┬───────────┘
                                        │
                    ┌───────────────────┼───────────────────┐
                    │                   │                   │
            ┌───────▼──────┐   ┌───────▼──────┐   ┌───────▼──────┐
            │  Aggregate   │   │  Instance    │   │  Annotator   │
            │  Views       │   │  Views       │   │  Views       │
            │  (overview,  │   │  (task       │   │  (agreement, │
            │   model,     │   │   detail,    │   │   behavior)  │
            │   metric)    │   │   per type)  │   │              │
            └──────────────┘   └──────────────┘   └──────────────┘

Directory Structure

InspectorRAGet/
├── src/
│   ├── app/                    # Next.js App Router — thin page shells
│   │   ├── layout.tsx          # Root: ThemeProvider → NotificationProvider → DataStoreProvider
│   │   ├── page.tsx            # / — Home landing page
│   │   ├── visualize/          # /visualize — Upload and analyze
│   │   ├── examples/           # /examples — Browse pre-loaded datasets
│   │   │   └── [example_id]/   # /examples/:id — Specific dataset analysis
│   │
│   ├── views/                  # Page-level container components
│   │   ├── home/               # Landing page cards
│   │   ├── on-board/           # Multi-step upload wizard (instructions → upload → verify)
│   │   ├── example/            # Main analysis hub — 7-tab interface
│   │   ├── examples/           # Grid of dataset tiles
│   │   ├── visualization/      # Onboard → Example router
│   │   ├── task/               # Instance-level task viewer (modal overlay)
│   │   │   └── Task.tsx        # Dispatches to type-specific TaskView via registry
│   │   ├── performance-overview/   # Aggregate metric tables + charts
│   │   ├── model-behavior/     # Per-metric distribution analysis
│   │   ├── metric-behavior/    # Cross-metric correlation
│   │   ├── model-comparator/   # Head-to-head model comparison
│   │   ├── data-characteristics/   # Dataset statistics
│   │   ├── annotator-behavior/ # Inter-annotator agreement
│   │   ├── predictions-table/  # Filterable evaluation table
│   │   ├── tasks-table/        # Task listing with filters
│   │   ├── annotations-table/  # Per-task metric scores
│   │   └── document/           # Document viewer
│   │
│   ├── task-types/             # Vertical slice per evaluation type
│   │   ├── index.ts            # Registry: taskTypeRegistry maps type string → { TaskView, Copier }
│   │   ├── qa/                 # Single-turn QA with retrieved context
│   │   │   ├── types.ts        # RetrievedDocument, RetrievedDocumentAnnotation
│   │   │   ├── TaskView.tsx    # Input + contexts + per-model response + evaluations/steps tabs
│   │   │   └── Copier.tsx
│   │   ├── generation/         # Open-ended text/JSON generation
│   │   │   ├── TaskView.tsx    # Input + per-model response + evaluations/steps tabs
│   │   │   └── Copier.tsx
│   │   ├── rag/                # Multi-turn retrieval conversation
│   │   │   ├── types.ts        # Message union (SystemMessage, UserMessage, AssistantMessage, ToolMessage, …)
│   │   │   ├── TaskView.tsx    # Conversation thread + per-model response + evaluations/steps tabs
│   │   │   ├── Copier.tsx
│   │   │   └── components/
│   │   │       ├── ChatLine.tsx    # Renders a single OpenAI-format message (status ring, retries footer)
│   │   │       ├── Avatar.tsx      # Role avatar with status ring (pass/warn/fail outline)
│   │   │       └── DocumentsViewer.tsx
│   │   ├── tool_calling/       # Function/tool calling evaluation
│   │   │   ├── types.ts        # ToolDefinition (OpenAI JSON Schema format)
│   │   │   ├── TaskView.tsx    # Conversation + available tools panel + prediction/target/evaluations/steps
│   │   │   └── Copier.tsx
│   │   └── agentic/            # Goal-directed multi-turn agent execution
│   │       ├── TaskView.tsx    # Goal + initial state + target state + execution thread + evaluations/steps
│   │       └── Copier.tsx
│   │
│   ├── components/             # Reusable UI components
│   │   ├── header/             # App header with nav and theme toggle
│   │   ├── filters/            # Generic filter controls
│   │   ├── expression-builder/ # Advanced filter expression builder
│   │   ├── selectors/          # Model, Metric, Aggregator selectors
│   │   ├── evaluations/        # EvaluationsPanel — shared human + algorithmic score tables
│   │   ├── trace/              # Execution trace: TraceGroup + TraceItem (collapsible cards for invocation/tool_execution/observation events)
│   │   ├── comments/           # Task commenting system (see Comment System section below)
│   │   ├── notification/       # Toast notifications (context provider)
│   │   ├── avatar/             # User/agent avatars
│   │   ├── task-tile/          # Task summary card
│   │   ├── example-tile/       # Dataset summary card
│   │   └── disabled/           # Disabled feature placeholder
│   │
│   ├── hooks/
│   │   ├── useBackButton.ts    # Browser back navigation
│   │   ├── useStorage.ts       # localStorage persistence
│   │   └── usePrevious.ts      # Previous render value
│   │
│   ├── utilities/
│   │   ├── strings.ts          # Hashing, truncation, search matching
│   │   ├── colors.ts           # Color scale generation
│   │   ├── objects.ts          # camelCase/snakeCase key conversion
│   │   ├── aggregators.ts      # Mean, median, majority, weighted aggregators
│   │   ├── metrics.ts          # Metric helper functions
│   │   ├── selectors.ts        # Mouse selection extraction
│   │   ├── expressions.ts      # Expression evaluation for advanced filters
│   │   ├── correlation.ts      # Statistical correlation
│   │   ├── significance.ts     # Statistical significance tests
│   │   ├── highlighter.ts      # Text overlap highlighting
│   │   └── time.ts             # Duration calculation
│   │
│   ├── workers/
│   │   └── filter.ts           # Web Worker for background data filtering
│   │
│   ├── types.ts                # Core TypeScript interfaces (re-exports task-type-specific types)
│   ├── store.tsx               # DataStoreProvider (React Context)
│   ├── migrator.ts             # Versioned schema migration chain (v1 → v2 → …)
│   ├── processor.ts            # Data qualification pipeline
│   ├── exporter.ts             # Export pipeline (split from processor.ts)
│   ├── validators.ts           # Input schema validation
│   ├── dataloader.ts           # Server-side data/ directory loader
│   └── theme.tsx               # ThemeProvider (Carbon g10/g90)
│
├── converters/                 # Dataset converters
│   ├── bfcl/                   # Berkeley Function Calling Leaderboard (single-turn and multi-turn, V3/V4)
│   └── acebench/               # ACEBench (tool-calling and agentic categories)
├── data/                       # Pre-loaded example datasets (JSON, schema v2)
├── notebooks/                  # Integration notebooks (Ragas, LM Eval, HuggingFace, BFCL)
├── public/                     # Static assets (favicon, license)
└── docs/                       # Documentation (this file)

Core Data Model

Defined in src/types.ts. The input JSON (schema v2) has this structure:

RawData
├── schema_version?: number            # 2 = current; absent or 1 = legacy (auto-migrated)
├── name?: string
├── models: Model[]                    # LLMs being evaluated
│   └── { modelId, name, owner, ... }
├── metrics: Metric[]                  # Evaluation criteria
│   └── { name, type: numerical|categorical|text, author: human|algorithm, ... }
├── documents?: RetrievedDocument[]    # Corpus documents (QA/RAG tasks)
│   └── { documentId, text, title?, url?, score? }
├── filters?: string[]                 # Task fields available for filtering
├── tasks: Task[]                      # Individual evaluation instances
│   └── { taskId, taskType: qa|generation|rag|tool_calling|agentic,
│          input, targets?: TaskTarget[], tools?: ToolDefinition[],
│          flagged?, comments?: TaskComment[], annotations? }
└── results: ModelResult[]             # Model outputs + metric scores
    └── { taskId, modelId, output: Message[], scores: { [metric]: { [annotator]: { value } } },
           contexts?, comments?: TaskComment[] }

output is always a Message[]. For single-inference task types (qa, generation, rag, tool_calling) it is a one-element array. For agentic tasks it is the full execution thread: interleaved user, assistant, and tool messages across all turns. Trace events live on individual messages as message.trace, not at the result level.

Key type unions

Message — OpenAI-compatible message shape:

role: 'system' | 'user' | 'assistant' | 'tool'
content?: string — text response
tool_calls?: ToolCallRecord[] — tool-calling output (on assistant messages)
trace?: TraceEvent[] — execution trace attached to the assistant message that produced it; each event is discriminated on type: invocation | tool_execution | observation
retries?: MessageRetry[] — intermediate retry attempts before final output
metadata?: Record<string, unknown> — benchmark-supplied metadata; known keys: status ('pass' | 'fail' | 'warn') rendered as a coloured badge in the chat footer, and statusDefinition (string) shown as a hover tooltip on that badge

ModelResult carries an optional metadata?: Record<string, unknown> bag for benchmark-supplied per-result diagnostics. Known key: error — { kind: 'text' | 'structured', context: unknown } used by the BFCL agentic converter to surface structured state-diff details.

TaskTarget — discriminated on type:

{ type: 'text'; value: string } — most task types
{ type: 'tool_calls'; calls: ToolCallRecord[] } — tool-calling ground truth
{ type: 'state'; value: Record<string, unknown> } — agentic expected final environment state
{ type: 'image'; url: string } — multimodal (future)

TraceEvent — discriminated on type:

invocation — an intermediate LLM call within a turn (before the accepted output)
tool_execution — environment response(s) following an intermediate invocation
observation — runner feedback after a decode failure, empty response, or forced termination
All events carry an optional label (e.g. "step_2" matching the inference log key) and content string

Schema migration

migrator.ts runs before validators.ts on every load. The migration chain is:

v1 → v2: renames legacy task types (rag single-turn → qa, rag multi-turn → rag, text_generation/json_generation → generation, chat → rag); wraps model_response string → output: [{ role: 'assistant', content }]; renames annotations → scores; renames evaluations array → results

Exported files are always stamped with schema_version: CURRENT_SCHEMA_VERSION.

After processing (processor.ts), tasks are qualified or disqualified based on:

Whether all plottable metrics have scores
Whether results exist for all specified models
Whether score values are non-empty

The qualified data becomes the Data interface (extends TileData), stored in DataStore context.

State Management

Global state is React Context, not Redux:

DataStoreProvider (store.tsx): holds Data, taskMap: Map<taskId, Task>, and resultsMap: Map<"taskId::modelId", ModelResult>
- updateTask(taskId, update) — immutable Map update for task-level changes (flags, task comments)
- updateResult(taskId, modelId, update) — immutable Map update for model-result-level changes (model comments)
ThemeProvider (theme.tsx): Carbon theme toggle (light g10 / dark g90)
NotificationProvider (components/notification/): toast messages

Local state: each view manages its own filters, selections, and UI state via useState.

Web Workers: ModelBehavior view spawns a filter worker for expensive filtering operations to avoid blocking the UI thread.

Task-Type Registry

src/task-types/index.ts exports taskTypeRegistry:

const taskTypeRegistry: Record<
  string,
  { TaskView: ComponentType; Copier: ComponentType }
> = {
  qa: { TaskView: QATaskView, Copier: QACopier },
  generation: { TaskView: GenerationTaskView, Copier: GenerationCopier },
  rag: { TaskView: RAGTaskView, Copier: RAGCopier },
  tool_calling: { TaskView: ToolCallingTaskView, Copier: ToolCallingCopier },
  agentic: { TaskView: AgenticTaskView, Copier: AgenticCopier },
};

Task.tsx and TaskCopier.tsx look up the component via taskTypeRegistry[task.taskType] — no if/else chains. Unknown task types degrade gracefully to null.

Comment System

Comments live at two levels:

task.comments — task-level observations shared across all models
result.comments — per-model observations (e.g. noting an acceptable-but-different tool call)

Task.tsx routes a new comment to the correct level by inspecting the provenance component string: any component containing :: is model-scoped (written to updateResult); others are task-scoped (written to updateTask).

Provenance component string convention

Component string pattern	Meaning	Scope
`input` / `messages`	Input or conversation area	Task
`document_{id}`	Retrieved context document	Task
`target`	Ground-truth target area	Task
`{modelId}::evaluation::response`	Model response text	Model
`{modelId}::evaluation::prediction`	Model prediction (tool calling)	Model
`{modelId}::evaluation::scores::{metric}::{annotator}`	Specific score cell	Model
`{modelId}::steps::{stepId}`	Specific execution step	Model

Floating selection button

After any mouseup on the .taskViewWrapper div, Task.tsx captures viewport coordinates. If provenance is also set (text was selected), SelectionCommentButton renders as a position: fixed button near the cursor. Clicking it opens AddCommentModal with provenance pre-filled. Clicking anywhere else clears the coords and dismisses the button.

`provenanceTag.ts`

Single source of truth for deriving display pills from a provenance component string. Returns { primary: [label, carbonType], detail?: [label1, label2] }. The detail pair is only set for score cells (metric + annotator) and step references ("step" + stepId). All three comment modals (AddCommentModal, EditCommentModal, CommentsViewer) import from here.

`CommentFinding`

Optional structured annotation attached to a TaskComment. Discriminated on type:

tool_call — points to the correct function name/arguments
query — records the correct retrieval query
output — records a corrected reference output
note — free-form structured note

CommentFindingEditor renders type-appropriate fields filtered by task.taskType. Findings are stored in the comment but editing them post-creation is out of scope (display-only in EditCommentModal).

Routing

Route	Server/Client	What it does
`/`	Server	Renders Home with navigation cards
`/visualize`	Client	Upload wizard → analysis view
`/examples`	Server	Loads `data/` dir, renders dataset grid
`/examples/[id]`	Server	Loads specific dataset, renders analysis

The analysis hub (Example view) provides 7 tabs:

Data Characteristics — dataset statistics, distributions
Predictions Table — filterable evaluation records
Annotator Behavior — inter-annotator agreement heatmap
Performance Overview — aggregate metrics with rankings
Model Behavior — per-metric distribution analysis
Model Comparator — head-to-head comparison
Metric Behavior — cross-metric correlation

Clicking a task in any table opens a Task modal overlay (views/task/Task.tsx) which dispatches to the type-specific TaskView from the registry.

Key Technical Details

Data Processing Pipeline

migrator.ts → migrateData(raw) → validators.ts → validateInputData(data) → processor.ts → processData(raw) returns [Data, DisqualifiedTasks, Notification[]]

Migration runs first, before camelCaseKeys, so it operates on raw snake_case fields
Processor validates every result has scores for all plottable metrics, ensures every task has results from all specified models, sorts categorical metric values, computes metric ranges
bin() in utilities/metrics.ts maps a numeric value to its [start, end] bucket string using the metric's range: [min, max, step]. Values below min map to <min and values above max map to >max — both render as normal category bars in Carbon Charts, preventing unbounded raw-value bins from outliers

Input Validation

validators.ts → validateInputData(data) returns { valid, reasons[] }

Checks required fields on models, metrics, tasks, results
Validates metric type constraints (categorical needs values, numerical can't use majority)
QA tasks must reference documents
Every value in a categorical metric must carry a numeric_value (camelCase: numericValue)

Why numeric_value is required on categorical metric values

The entire aggregation and sorting pipeline operates on numbers, not label strings. castToNumber() in utilities/metrics.ts maps a string label to its numericValue so that mean, median, and inter-annotator agreement distance can be computed arithmetically. computeMajority() uses Math.abs(castToNumber(a) - castToNumber(b)) to decide whether the top-two annotator choices are "close" (high agreement) or "far apart" (no agreement). sortMetricValues() in processor.ts sorts the metric's value list by numericValue so UI dropdowns, chart axes, and filter ranges reflect the researcher's intended ordering. compareMetricAggregatedValues() in metrics.ts uses numericValue to order chart bars for majority-aggregated metrics.

Without numericValue, every one of these paths falls back to parseFloat(label), which returns NaN for any non-numeric string. That silently corrupts aggregate statistics, chart orderings, and agreement calculations. The validator rejects files with missing numericValue on categorical entries so the problem is surfaced at load time rather than producing incorrect visualizations.

Convention: assign numericValue so that higher = better. sortMetricValues() sorts ascending by numericValue, so values[0] becomes minValue (worst) and values[last] becomes maxValue (best). PerformanceOverview normalises scores as (score - minValue) / (maxValue - minValue) and ranks models with higher scores first. Example: { value: "poor", numeric_value: 0 }, { value: "acceptable", numeric_value: 1 }, { value: "good", numeric_value: 2 }.

JSON Key Convention

Input files use snake_case. The app converts to camelCase on load (camelCaseKeys in objects.ts) and back to snake_case on export (snakeCaseKeys). Migration runs before camelCaseKeys.

Styling

SCSS Modules: one .module.scss per component, co-located
Carbon Design tokens for spacing, colors, typography
Global styles in src/app/global.scss
Theme: Carbon g10 (light) and g90 (dark), toggled via header
Sass uses @use (not @import) for Carbon v11 / Turbopack compatibility
Carbon font-face disabled via $css--font-face: false in global.scss (Turbopack can't resolve ~ prefix)

Carbon Component Gotchas

TabPanel renders all panels simultaneously (hidden, not unmounted)

Carbon's TabPanel uses the HTML hidden attribute to hide inactive panels — it does NOT lazy-mount or unmount them. All tab panels and their full component trees are live in the DOM at all times. Consequences:

Any component with a non-unique id prop will have duplicate DOM IDs across tabs. This silently breaks components that rely on id for internal DOM wiring (labels, aria associations, focus management). The browser uses the first matching element — clicks on a selector in a later tab quietly target the hidden first tab's element instead.
Rule: every Carbon component that takes an id prop and is used in multiple tabs must have a globally unique id. Pattern used here: {view-name}-{component-name}, e.g. model-behavior-model-selector, metric-behavior-model-selector.
Components confirmed affected: FilterableMultiSelect, Toggle, Select. Assume all interactive Carbon components are affected.

FilterableMultiSelect — controlled vs uncontrolled

Use selectedItems (controlled) rather than initialSelectedItems (uncontrolled) when the parent needs to own the selection state (e.g. to filter data). The uncontrolled path fires onChange via a useEffect guarded by an isMounted ref; under React StrictMode's double-invoke behaviour this guard can leave the component unresponsive.
Always add a null guard to itemToString: (item) => (item ? item.name : ''). Carbon passes null to itemToString in some internal code paths (e.g. when clearing the filter input); without the guard this throws and corrupts Downshift's internal state.

@carbon/charts-react/styles.css import

Import this stylesheet once, at the highest shared layout level (e.g. app/layout.tsx or global.scss), not per-component. All rules are scoped under .cds--chart-holder so there is no global style pollution, but importing it in multiple component files creates redundant CSS bundles.

Deployment

Dockerfile for containerized deployment
next.config.js sets output: 'standalone' for minimal Docker images
Security headers configured (CSP, HSTS, X-Frame-Options)
Can also deploy on HuggingFace Spaces

Known Technical Debt

Type safety gaps — any types on Task.input and Model.trainingDetails; task.annotations is untyped (context quality scores on RAG/QA documents, distinct from result.scores — see TODO comment in types.ts)
Pre-existing ESLint warnings — 19 errors and 15 warnings from react-hooks/exhaustive-deps, react-compiler rules, and setState-in-effect patterns; need case-by-case review
ESLint 10 blocked — eslint-plugin-react (bundled by eslint-config-next) uses deprecated getFilename API removed in ESLint 10; pinned to v9 until upstream fixes
No client-side component tests — all tests are pure utility/logic tests; interactive Carbon component bugs are only caught manually