InspectorRAGet: Architecture
Living document. Captures the current state of the codebase. Update as architecture evolves.
Overview
InspectorRAGet is a client-side introspection platform for LLM evaluation. Users upload JSON files containing evaluation data (models, metrics, tasks, and model results) and explore results through aggregate and instance-level visualizations. The platform does not execute experiments; it is purely analytical.
Built with Next.js 16 (App Router), React 18, TypeScript 5.9, and IBM Carbon Design System.
High-Level Data Flow
               ┌─────────────────────┐
               │  JSON Input File    │
               │  (user upload or    │
               │  data/ directory)   │
               └──────────┬──────────┘
                          │
               ┌──────────▼──────────┐
               │  migrator.ts        │
               │  Schema migration   │
               │  (v1 → v2, etc.)    │
               └──────────┬──────────┘
                          │
               ┌──────────▼──────────┐
               │  validators.ts      │
               │  Schema validation  │
               └──────────┬──────────┘
                          │
               ┌──────────▼──────────┐
               │  processor.ts       │
               │  Qualify/disqualify │
               │  tasks by metric    │
               │  completeness       │
               └──────────┬──────────┘
                          │
               ┌──────────▼──────────┐
               │  DataStore context  │
               │  (store.tsx)        │
               │  Data + taskMap     │
               │  + resultsMap       │
               └──────────┬──────────┘
                          │
        ┌─────────────────┼─────────────────┐
        │                 │                 │
┌───────▼──────┐  ┌───────▼──────┐  ┌───────▼──────┐
│  Aggregate   │  │  Instance    │  │  Annotator   │
│  Views       │  │  Views       │  │  Views       │
│  (overview,  │  │  (task       │  │  (agreement, │
│  model,      │  │  detail,     │  │  behavior)   │
│  metric)     │  │  per type)   │  │              │
└──────────────┘  └──────────────┘  └──────────────┘
Directory Structure
InspectorRAGet/
├── src/
│   ├── app/  # Next.js App Router: thin page shells
│   │   ├── layout.tsx  # Root: ThemeProvider → NotificationProvider → DataStoreProvider
│   │   ├── page.tsx  # / → Home landing page
│   │   ├── visualize/  # /visualize → Upload and analyze
│   │   └── examples/  # /examples → Browse pre-loaded datasets
│   │       └── [example_id]/  # /examples/:id → Specific dataset analysis
│   │
│   ├── views/  # Page-level container components
│   │   ├── home/  # Landing page cards
│   │   ├── on-board/  # Multi-step upload wizard (instructions → upload → verify)
│   │   ├── example/  # Main analysis hub: 7-tab interface
│   │   ├── examples/  # Grid of dataset tiles
│   │   ├── visualization/  # Onboard → Example router
│   │   ├── task/  # Instance-level task viewer (modal overlay)
│   │   │   └── Task.tsx  # Dispatches to type-specific TaskView via registry
│   │   ├── performance-overview/  # Aggregate metric tables + charts
│   │   ├── model-behavior/  # Per-metric distribution analysis
│   │   ├── metric-behavior/  # Cross-metric correlation
│   │   ├── model-comparator/  # Head-to-head model comparison
│   │   ├── data-characteristics/  # Dataset statistics
│   │   ├── annotator-behavior/  # Inter-annotator agreement
│   │   ├── predictions-table/  # Filterable evaluation table
│   │   ├── tasks-table/  # Task listing with filters
│   │   ├── annotations-table/  # Per-task metric scores
│   │   └── document/  # Document viewer
│   │
│   ├── task-types/  # Vertical slice per evaluation type
│   │   ├── index.ts  # Registry: taskTypeRegistry maps type string → { TaskView, Copier }
│   │   ├── qa/  # Single-turn QA with retrieved context
│   │   │   ├── types.ts  # RetrievedDocument, RetrievedDocumentAnnotation
│   │   │   ├── TaskView.tsx  # Input + contexts + per-model response + evaluations/steps tabs
│   │   │   └── Copier.tsx
│   │   ├── generation/  # Open-ended text/JSON generation
│   │   │   ├── TaskView.tsx  # Input + per-model response + evaluations/steps tabs
│   │   │   └── Copier.tsx
│   │   ├── rag/  # Multi-turn retrieval conversation
│   │   │   ├── types.ts  # Message union (SystemMessage, UserMessage, AssistantMessage, ToolMessage, …)
│   │   │   ├── TaskView.tsx  # Conversation thread + per-model response + evaluations/steps tabs
│   │   │   ├── Copier.tsx
│   │   │   └── components/
│   │   │       ├── ChatLine.tsx  # Renders a single OpenAI-format message (status ring, retries footer)
│   │   │       ├── Avatar.tsx  # Role avatar with status ring (pass/warn/fail outline)
│   │   │       └── DocumentsViewer.tsx
│   │   ├── tool_calling/  # Function/tool calling evaluation
│   │   │   ├── types.ts  # ToolDefinition (OpenAI JSON Schema format)
│   │   │   ├── TaskView.tsx  # Conversation + available tools panel + prediction/target/evaluations/steps
│   │   │   └── Copier.tsx
│   │   └── agentic/  # Goal-directed multi-turn agent execution
│   │       ├── TaskView.tsx  # Goal + initial state + target state + execution thread + evaluations/steps
│   │       └── Copier.tsx
│   │
│   ├── components/  # Reusable UI components
│   │   ├── header/  # App header with nav and theme toggle
│   │   ├── filters/  # Generic filter controls
│   │   ├── expression-builder/  # Advanced filter expression builder
│   │   ├── selectors/  # Model, Metric, Aggregator selectors
│   │   ├── evaluations/  # EvaluationsPanel: shared human + algorithmic score tables
│   │   ├── trace/  # Execution trace: TraceGroup + TraceItem (collapsible cards for invocation/tool_execution/observation events)
│   │   ├── comments/  # Task commenting system (see Comment System section below)
│   │   ├── notification/  # Toast notifications (context provider)
│   │   ├── avatar/  # User/agent avatars
│   │   ├── task-tile/  # Task summary card
│   │   ├── example-tile/  # Dataset summary card
│   │   └── disabled/  # Disabled feature placeholder
│   │
│   ├── hooks/
│   │   ├── useBackButton.ts  # Browser back navigation
│   │   ├── useStorage.ts  # localStorage persistence
│   │   └── usePrevious.ts  # Previous render value
│   │
│   ├── utilities/
│   │   ├── strings.ts  # Hashing, truncation, search matching
│   │   ├── colors.ts  # Color scale generation
│   │   ├── objects.ts  # camelCase/snakeCase key conversion
│   │   ├── aggregators.ts  # Mean, median, majority, weighted aggregators
│   │   ├── metrics.ts  # Metric helper functions
│   │   ├── selectors.ts  # Mouse selection extraction
│   │   ├── expressions.ts  # Expression evaluation for advanced filters
│   │   ├── correlation.ts  # Statistical correlation
│   │   ├── significance.ts  # Statistical significance tests
│   │   ├── highlighter.ts  # Text overlap highlighting
│   │   └── time.ts  # Duration calculation
│   │
│   ├── workers/
│   │   └── filter.ts  # Web Worker for background data filtering
│   │
│   ├── types.ts  # Core TypeScript interfaces (re-exports task-type-specific types)
│   ├── store.tsx  # DataStoreProvider (React Context)
│   ├── migrator.ts  # Versioned schema migration chain (v1 → v2 → …)
│   ├── processor.ts  # Data qualification pipeline
│   ├── exporter.ts  # Export pipeline (split from processor.ts)
│   ├── validators.ts  # Input schema validation
│   ├── dataloader.ts  # Server-side data/ directory loader
│   └── theme.tsx  # ThemeProvider (Carbon g10/g90)
│
├── converters/  # Dataset converters
│   ├── bfcl/  # Berkeley Function Calling Leaderboard (single-turn and multi-turn, V3/V4)
│   └── acebench/  # ACEBench (tool-calling and agentic categories)
├── data/  # Pre-loaded example datasets (JSON, schema v2)
├── notebooks/  # Integration notebooks (Ragas, LM Eval, HuggingFace, BFCL)
├── public/  # Static assets (favicon, license)
└── docs/  # Documentation (this file)
Core Data Model
Defined in src/types.ts. The input JSON (schema v2) has this structure:
RawData
├── schema_version?: number  # 2 = current; absent or 1 = legacy (auto-migrated)
├── name?: string
├── models: Model[]  # LLMs being evaluated
│   └── { modelId, name, owner, ... }
├── metrics: Metric[]  # Evaluation criteria
│   └── { name, type: numerical|categorical|text, author: human|algorithm, ... }
├── documents?: RetrievedDocument[]  # Corpus documents (QA/RAG tasks)
│   └── { documentId, text, title?, url?, score? }
├── filters?: string[]  # Task fields available for filtering
├── tasks: Task[]  # Individual evaluation instances
│   └── { taskId, taskType: qa|generation|rag|tool_calling|agentic,
│         input, targets?: TaskTarget[], tools?: ToolDefinition[],
│         flagged?, comments?: TaskComment[], annotations? }
└── results: ModelResult[]  # Model outputs + metric scores
    └── { taskId, modelId, output: Message[], scores: { [metric]: { [annotator]: { value } } },
          contexts?, comments?: TaskComment[] }
output is always a Message[]. For single-inference task types (qa, generation, rag, tool_calling) it is a one-element array. For agentic tasks it is the full execution thread: interleaved user, assistant, and tool messages across all turns. Trace events live on individual messages as message.trace, not at the result level.
Key type unions
Message β OpenAI-compatible message shape:
- role: 'system' | 'user' | 'assistant' | 'tool'
- content?: string - text response
- tool_calls?: ToolCallRecord[] - tool-calling output (on assistant messages)
- trace?: TraceEvent[] - execution trace attached to the assistant message that produced it; each event is discriminated on type: invocation | tool_execution | observation
- retries?: MessageRetry[] - intermediate retry attempts before final output
- metadata?: Record<string, unknown> - benchmark-supplied metadata; known keys: status ('pass' | 'fail' | 'warn'), rendered as a coloured badge in the chat footer, and statusDefinition (string), shown as a hover tooltip on that badge

ModelResult carries an optional metadata?: Record<string, unknown> bag for benchmark-supplied per-result diagnostics. Known key: error, shaped as { kind: 'text' | 'structured', context: unknown }, used by the BFCL agentic converter to surface structured state-diff details.
TaskTarget β discriminated on type:
- { type: 'text'; value: string } - most task types
- { type: 'tool_calls'; calls: ToolCallRecord[] } - tool-calling ground truth
- { type: 'state'; value: Record<string, unknown> } - agentic expected final environment state
- { type: 'image'; url: string } - multimodal (future)
TraceEvent β discriminated on type:
- invocation: an intermediate LLM call within a turn (before the accepted output)
- tool_execution: environment response(s) following an intermediate invocation
- observation: runner feedback after a decode failure, empty response, or forced termination
- All events carry an optional label (e.g. "step_2" matching the inference log key) and content string
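The unions above can be summarised as TypeScript declarations. This is a sketch reconstructed from the field lists in this section, not a copy of src/types.ts: the shapes of ToolCallRecord and MessageRetry are not described here, so they are left as opaque placeholders, both TraceEvent fields are assumed optional, and describeTarget is a hypothetical helper that only illustrates narrowing on the type discriminant.

```typescript
// Placeholders: the real shapes live in src/types.ts and are not described here.
type ToolCallRecord = Record<string, unknown>;
type MessageRetry = Record<string, unknown>;

interface TraceEvent {
  type: 'invocation' | 'tool_execution' | 'observation';
  label?: string; // e.g. "step_2"
  content?: string; // assumed optional in this sketch
}

interface Message {
  role: 'system' | 'user' | 'assistant' | 'tool';
  content?: string;
  tool_calls?: ToolCallRecord[];
  trace?: TraceEvent[];
  retries?: MessageRetry[];
  metadata?: Record<string, unknown>;
}

type TaskTarget =
  | { type: 'text'; value: string }
  | { type: 'tool_calls'; calls: ToolCallRecord[] }
  | { type: 'state'; value: Record<string, unknown> }
  | { type: 'image'; url: string };

// Hypothetical helper: each case sees only the fields of its own variant.
function describeTarget(target: TaskTarget): string {
  switch (target.type) {
    case 'text':
      return `text(${target.value.length} chars)`;
    case 'tool_calls':
      return `tool_calls(${target.calls.length})`;
    case 'state':
      return `state(${Object.keys(target.value).length} keys)`;
    case 'image':
      return `image(${target.url})`;
  }
}

// A single-inference result: output is a one-element Message[].
const singleTurnOutput: Message[] = [{ role: 'assistant', content: 'Paris' }];
```

Because every variant carries a literal type field, adding a new target kind makes the switch fail to type-check until the new case is handled.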
Schema migration
migrator.ts runs before validators.ts on every load. The migration chain is:
- v1 → v2: renames legacy task types (rag single-turn → qa, rag multi-turn → rag, text_generation/json_generation → generation, chat → rag); wraps model_response string → output: [{ role: 'assistant', content }]; renames annotations → scores; renames evaluations array → results
Exported files are always stamped with schema_version: CURRENT_SCHEMA_VERSION.
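A minimal sketch of the v1 → v2 step may help when reading migrator.ts. Only migrateData and CURRENT_SCHEMA_VERSION are names taken from this document; the helper names, the assumption that each task carries a task_type field, and the choice to stamp schema_version during migration are all illustrative. The split of legacy rag into qa versus rag is left out because the single-turn/multi-turn check is not described here.

```typescript
type Json = Record<string, unknown>;

const CURRENT_SCHEMA_VERSION = 2;

// Unambiguous v1 -> v2 task-type renames. Legacy `rag` is deliberately absent:
// deciding between `qa` and `rag` needs a turn-count check not covered here.
const TASK_TYPE_RENAMES = new Map<string, string>([
  ['text_generation', 'generation'],
  ['json_generation', 'generation'],
  ['chat', 'rag'],
]);

// Hypothetical helper: wraps model_response and renames annotations -> scores.
function migrateEvaluationV1(evaluation: Json): Json {
  const { model_response, annotations, ...rest } = evaluation;
  return {
    ...rest,
    output: [{ role: 'assistant', content: model_response }],
    scores: annotations,
  };
}

// Hypothetical helper: operates on raw snake_case keys, before camelCaseKeys.
function migrateV1ToV2(raw: Json): Json {
  const { evaluations, tasks, ...rest } = raw;
  const migratedTasks = ((tasks as Json[] | undefined) ?? []).map((task) => {
    const renamed =
      typeof task.task_type === 'string'
        ? TASK_TYPE_RENAMES.get(task.task_type)
        : undefined;
    return renamed ? { ...task, task_type: renamed } : task;
  });
  return {
    ...rest,
    schema_version: CURRENT_SCHEMA_VERSION, // assumption: stamped on migration
    tasks: migratedTasks,
    results: ((evaluations as Json[] | undefined) ?? []).map(migrateEvaluationV1),
  };
}

function migrateData(raw: Json): Json {
  // Absent schema_version is treated as legacy v1.
  const version = typeof raw.schema_version === 'number' ? raw.schema_version : 1;
  return version < 2 ? migrateV1ToV2(raw) : raw;
}
```

Structuring each step as a pure function from one version to the next keeps the chain easy to extend: a later v2 → v3 step would be another function applied after this one.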
After processing (processor.ts), tasks are qualified or disqualified based on:
- Whether all plottable metrics have scores
- Whether results exist for all specified models
- Whether score values are non-empty
The qualified data becomes the Data interface (extends TileData), stored in DataStore context.
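The three qualification criteria can be sketched as a single check per task. This is an illustration under stated assumptions, not the code in processor.ts: disqualificationReasons, ResultLike, and the reason strings are hypothetical, and "non-empty" is interpreted here as not undefined, null, or the empty string.

```typescript
type Scores = Record<string, Record<string, { value: unknown }>>;

interface ResultLike {
  taskId: string;
  modelId: string;
  scores: Scores;
}

// Hypothetical helper: an empty array means the task qualifies.
function disqualificationReasons(
  taskId: string,
  modelIds: string[],
  plottableMetrics: string[],
  results: ResultLike[],
): string[] {
  const reasons: string[] = [];
  for (const modelId of modelIds) {
    const result = results.find(
      (r) => r.taskId === taskId && r.modelId === modelId,
    );
    // Criterion: results exist for all specified models.
    if (!result) {
      reasons.push(`missing result for model ${modelId}`);
      continue;
    }
    for (const metric of plottableMetrics) {
      const byAnnotator = result.scores[metric];
      // Criterion: all plottable metrics have scores.
      if (!byAnnotator || Object.keys(byAnnotator).length === 0) {
        reasons.push(`missing ${metric} scores for model ${modelId}`);
        continue;
      }
      // Criterion: score values are non-empty.
      const hasEmpty = Object.values(byAnnotator).some(
        (s) => s.value === undefined || s.value === null || s.value === '',
      );
      if (hasEmpty) {
        reasons.push(`empty ${metric} score value for model ${modelId}`);
      }
    }
  }
  return reasons;
}
```

Collecting reasons instead of returning a boolean makes it straightforward to report why a task was disqualified.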
State Management
Global state is React Context, not Redux:
- DataStoreProvider (store.tsx): holds Data, taskMap: Map<taskId, Task>, and resultsMap: Map<"taskId::modelId", ModelResult>
  - updateTask(taskId, update): immutable Map update for task-level changes (flags, task comments)
  - updateResult(taskId, modelId, update): immutable Map update for model-result-level changes (model comments)
- ThemeProvider (theme.tsx): Carbon theme toggle (light g10 / dark g90)
- NotificationProvider (components/notification/): toast messages
Local state: each view manages its own filters, selections, and UI state via useState.
Web Workers: ModelBehavior view spawns a filter worker for expensive filtering operations to avoid blocking the UI thread.
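The "taskId::modelId" key and the immutable Map update used by the store can be illustrated with a pure function. This is a sketch, not the implementation in store.tsx: resultKey and withUpdatedResult are hypothetical names, and ModelResultLike is a reduced stand-in for ModelResult.

```typescript
interface ModelResultLike {
  taskId: string;
  modelId: string;
  comments?: string[];
}

// Composite key format described for resultsMap.
const resultKey = (taskId: string, modelId: string): string =>
  `${taskId}::${modelId}`;

// Hypothetical pure equivalent of updateResult: returns a new Map and leaves
// the input untouched, so React sees a changed reference and re-renders.
function withUpdatedResult(
  resultsMap: ReadonlyMap<string, ModelResultLike>,
  taskId: string,
  modelId: string,
  update: Partial<ModelResultLike>,
): Map<string, ModelResultLike> {
  const next = new Map(resultsMap);
  const key = resultKey(taskId, modelId);
  const existing = next.get(key);
  if (existing) {
    next.set(key, { ...existing, ...update });
  }
  return next;
}

const before = new Map<string, ModelResultLike>([
  [resultKey('t1', 'm1'), { taskId: 't1', modelId: 'm1' }],
]);
const after = withUpdatedResult(before, 't1', 'm1', {
  comments: ['acceptable alternative tool call'],
});
```

Copying the Map rather than mutating it in place is what lets consumers rely on reference equality to detect changes.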
Task-Type Registry
src/task-types/index.ts exports taskTypeRegistry:
const taskTypeRegistry: Record<
string,
{ TaskView: ComponentType; Copier: ComponentType }
> = {
qa: { TaskView: QATaskView, Copier: QACopier },
generation: { TaskView: GenerationTaskView, Copier: GenerationCopier },
rag: { TaskView: RAGTaskView, Copier: RAGCopier },
tool_calling: { TaskView: ToolCallingTaskView, Copier: ToolCallingCopier },
agentic: { TaskView: AgenticTaskView, Copier: AgenticCopier },
};
Task.tsx and TaskCopier.tsx look up the component via taskTypeRegistry[task.taskType], with no if/else chains. Unknown task types degrade gracefully to null.
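The lookup-with-fallback pattern can be shown without React. In this sketch ViewStub is a stand-in for ComponentType, the registry holds only two illustrative entries, and renderTask is a hypothetical function; none of these names come from the codebase.

```typescript
// Stand-in for React's ComponentType so the sketch runs without React.
type ViewStub = (taskId: string) => string;

const registrySketch: Record<string, { TaskView: ViewStub; Copier: ViewStub }> = {
  qa: {
    TaskView: (id) => `QATaskView(${id})`,
    Copier: (id) => `QACopier(${id})`,
  },
  agentic: {
    TaskView: (id) => `AgenticTaskView(${id})`,
    Copier: (id) => `AgenticCopier(${id})`,
  },
};

// Hypothetical dispatcher: unknown task types return null instead of throwing.
function renderTask(taskType: string, taskId: string): string | null {
  // Own-property check so inherited keys such as "constructor" are not
  // mistaken for registered task types.
  if (!Object.prototype.hasOwnProperty.call(registrySketch, taskType)) {
    return null;
  }
  return registrySketch[taskType].TaskView(taskId);
}
```

Note that indexing a Record<string, …> is typed as always defined unless noUncheckedIndexedAccess is enabled, so the runtime guard matters even though the compiler does not ask for it.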
Comment System
Comments live at two levels:
- task.comments: task-level observations shared across all models
- result.comments: per-model observations (e.g. noting an acceptable-but-different tool call)
Task.tsx routes a new comment to the correct level by inspecting the provenance component string: any component containing :: is model-scoped (written to updateResult); others are task-scoped (written to updateTask).
Provenance component string convention
| Component string pattern | Meaning | Scope |
|---|---|---|
| input / messages | Input or conversation area | Task |
| document_{id} | Retrieved context document | Task |
| target | Ground-truth target area | Task |
| {modelId}::evaluation::response | Model response text | Model |
| {modelId}::evaluation::prediction | Model prediction (tool calling) | Model |
| {modelId}::evaluation::scores::{metric}::{annotator} | Specific score cell | Model |
| {modelId}::steps::{stepId} | Specific execution step | Model |
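The routing rule (any component string containing :: is model-scoped) is small enough to sketch directly. Both function names below are hypothetical; only the :: convention and the patterns in the table come from this document.

```typescript
type CommentScope = 'task' | 'model';

// Model-scoped components always start with "{modelId}::".
function commentScope(component: string): CommentScope {
  return component.includes('::') ? 'model' : 'task';
}

// Hypothetical helper: extracts the modelId prefix for model-scoped strings.
function modelIdOf(component: string): string | undefined {
  const separatorIndex = component.indexOf('::');
  return separatorIndex > 0 ? component.slice(0, separatorIndex) : undefined;
}

const examples = [
  'input',
  'document_3',
  'target',
  'model-a::evaluation::response',
  'model-a::steps::step_2',
].map((component) => ({
  component,
  scope: commentScope(component),
  modelId: modelIdOf(component),
}));
```

One consequence of this convention is that a modelId which itself contains :: would be split at the wrong place, so the sketch assumes model identifiers never include that separator.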
Floating selection button
After any mouseup on the .taskViewWrapper div, Task.tsx captures viewport coordinates. If provenance is also set (text was selected), SelectionCommentButton renders as a position: fixed button near the cursor. Clicking it opens AddCommentModal with provenance pre-filled. Clicking anywhere else clears the coords and dismisses the button.
provenanceTag.ts
Single source of truth for deriving display pills from a provenance component string. Returns { primary: [label, carbonType], detail?: [label1, label2] }. The detail pair is only set for score cells (metric + annotator) and step references ("step" + stepId). All three comment modals (AddCommentModal, EditCommentModal, CommentsViewer) import from here.
CommentFinding
Optional structured annotation attached to a TaskComment. Discriminated on type:
- tool_call: points to the correct function name/arguments
- query: records the correct retrieval query
- output: records a corrected reference output
- note: free-form structured note
CommentFindingEditor renders type-appropriate fields filtered by task.taskType. Findings are stored in the comment but editing them post-creation is out of scope (display-only in EditCommentModal).
Routing
| Route | Server/Client | What it does |
|---|---|---|
| / | Server | Renders Home with navigation cards |
| /visualize | Client | Upload wizard → analysis view |
| /examples | Server | Loads data/ dir, renders dataset grid |
| /examples/[id] | Server | Loads specific dataset, renders analysis |
The analysis hub (Example view) provides 7 tabs:
- Data Characteristics: dataset statistics, distributions
- Predictions Table: filterable evaluation records
- Annotator Behavior: inter-annotator agreement heatmap
- Performance Overview: aggregate metrics with rankings
- Model Behavior: per-metric distribution analysis
- Model Comparator: head-to-head comparison
- Metric Behavior: cross-metric correlation
Clicking a task in any table opens a Task modal overlay (views/task/Task.tsx) which dispatches to the type-specific TaskView from the registry.
Key Technical Details
Data Processing Pipeline
migrator.ts → migrateData(raw) → validators.ts → validateInputData(data) → processor.ts → processData(raw) returns [Data, DisqualifiedTasks, Notification[]]
- Migration runs first, before camelCaseKeys, so it operates on raw snake_case fields
- Processor validates every result has scores for all plottable metrics, ensures every task has results from all specified models, sorts categorical metric values, computes metric ranges
- bin() in utilities/metrics.ts maps a numeric value to its [start, end] bucket string using the metric's range: [min, max, step]. Values below min map to <min and values above max map to >max; both render as normal category bars in Carbon Charts, which prevents outliers from producing unbounded raw-value bins
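The bucketing behaviour described for bin() can be approximated as follows. This is a sketch under assumptions, not the function in utilities/metrics.ts: the name binSketch is hypothetical, the overflow labels are assumed to interpolate the actual bound (for example "<0"), buckets are assumed to be aligned to min, and a value exactly equal to max is assumed to fall into the last bucket.

```typescript
// Hypothetical approximation of bin(): range is [min, max, step].
function binSketch(value: number, range: [number, number, number]): string {
  const [min, max, step] = range;

  // Outliers collapse into two overflow categories instead of creating
  // one raw-value bin each.
  if (value < min) return `<${min}`;
  if (value > max) return `>${max}`;

  // Start of the final bucket, used to clamp value === max into range.
  const bucketCount = Math.max(1, Math.ceil((max - min) / step));
  const lastStart = min + (bucketCount - 1) * step;

  const rawStart = min + Math.floor((value - min) / step) * step;
  const start = Math.min(rawStart, lastStart);
  const end = Math.min(start + step, max);
  return `[${start}, ${end}]`;
}

const range: [number, number, number] = [0, 10, 5];
const labels = [-1, 3, 5, 10, 11].map((v) => binSketch(v, range));
// labels is ["<0", "[0, 5]", "[5, 10]", "[5, 10]", ">10"]
```

With non-integer steps the bucket bounds are subject to floating-point rounding, so a real implementation would likely format or round them before building the label.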
Input Validation
validators.ts → validateInputData(data) returns { valid, reasons[] }
- Checks required fields on models, metrics, tasks, results
- Validates metric type constraints (categorical needs values, numerical can't use majority)
- QA tasks must reference documents
- Every value in a categorical metric must carry a numeric_value (camelCase: numericValue)
Why numeric_value is required on categorical metric values
The entire aggregation and sorting pipeline operates on numbers, not label strings. castToNumber() in utilities/metrics.ts maps a string label to its numericValue so that mean, median, and inter-annotator agreement distance can be computed arithmetically. computeMajority() uses Math.abs(castToNumber(a) - castToNumber(b)) to decide whether the top-two annotator choices are "close" (high agreement) or "far apart" (no agreement). sortMetricValues() in processor.ts sorts the metric's value list by numericValue so UI dropdowns, chart axes, and filter ranges reflect the researcher's intended ordering. compareMetricAggregatedValues() in metrics.ts uses numericValue to order chart bars for majority-aggregated metrics.
Without numericValue, every one of these paths falls back to parseFloat(label), which returns NaN for any non-numeric string. That silently corrupts aggregate statistics, chart orderings, and agreement calculations. The validator rejects files with missing numericValue on categorical entries so the problem is surfaced at load time rather than producing incorrect visualizations.
Convention: assign numericValue so that higher = better. sortMetricValues() sorts ascending by numericValue, so values[0] becomes minValue (worst) and values[last] becomes maxValue (best). PerformanceOverview normalises scores as (score - minValue) / (maxValue - minValue) and ranks models with higher scores first. Example: { value: "poor", numeric_value: 0 }, { value: "acceptable", numeric_value: 1 }, { value: "good", numeric_value: 2 }.
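The ordering and normalisation described above can be traced with the poor/acceptable/good example. The functions here are illustrative stand-ins, not the signatures of sortMetricValues() or castToNumber(): their names and parameters are assumptions, and returning 0 when all values share one numericValue is a choice made only to avoid dividing by zero in this sketch.

```typescript
interface MetricValue {
  value: string;
  numericValue: number;
}

// Ascending sort: index 0 becomes the worst value, the last index the best.
function sortByNumericValue(values: MetricValue[]): MetricValue[] {
  return [...values].sort((a, b) => a.numericValue - b.numericValue);
}

// Maps a label to its numericValue; an unknown label falls back to
// parseFloat, which yields NaN for non-numeric strings.
function labelToNumber(label: string, values: MetricValue[]): number {
  const match = values.find((v) => v.value === label);
  return match ? match.numericValue : Number.parseFloat(label);
}

// (score - minValue) / (maxValue - minValue), as used for ranking.
function normalise(score: number, sorted: MetricValue[]): number {
  const minValue = sorted[0].numericValue;
  const maxValue = sorted[sorted.length - 1].numericValue;
  if (maxValue === minValue) return 0; // assumption: avoid division by zero
  return (score - minValue) / (maxValue - minValue);
}

const sortedValues = sortByNumericValue([
  { value: 'good', numericValue: 2 },
  { value: 'poor', numericValue: 0 },
  { value: 'acceptable', numericValue: 1 },
]);
const normalisedAcceptable = normalise(
  labelToNumber('acceptable', sortedValues),
  sortedValues,
); // 0.5
```

The NaN fallback is the failure mode the validator guards against: a label without a numericValue would propagate NaN through every mean, median, and ranking that touches it.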
JSON Key Convention
Input files use snake_case. The app converts to camelCase on load (camelCaseKeys in objects.ts) and back to snake_case on export (snakeCaseKeys). Migration runs before camelCaseKeys.
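A recursive key converter along these lines would produce the behaviour described. It is a sketch, not the code in objects.ts: the function names carry a Sketch suffix to mark them as hypothetical, and this version converts every key at every depth, whereas a real implementation may need to leave data-bearing keys (such as metric names used as keys inside scores) untouched.

```typescript
const toCamel = (key: string): string =>
  key.replace(/_([a-z0-9])/g, (_match: string, c: string) => c.toUpperCase());

const toSnake = (key: string): string =>
  key.replace(/[A-Z]/g, (c: string) => `_${c.toLowerCase()}`);

// Walks arrays and plain objects, rewriting keys and leaving values as-is.
function convertKeys(value: unknown, convert: (key: string) => string): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => convertKeys(item, convert));
  }
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      out[convert(key)] = convertKeys(child, convert);
    }
    return out;
  }
  return value;
}

const camelCaseKeysSketch = (value: unknown): unknown => convertKeys(value, toCamel);
const snakeCaseKeysSketch = (value: unknown): unknown => convertKeys(value, toSnake);

const loaded = camelCaseKeysSketch({
  schema_version: 2,
  results: [{ task_id: 't1', model_id: 'm1' }],
});
const exported = snakeCaseKeysSketch(loaded);
```

Round-tripping through both converters returns the original key names for plain snake_case input, which is the property the load and export paths rely on.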
Styling
- SCSS Modules: one .module.scss per component, co-located
- Carbon Design tokens for spacing, colors, typography
- Global styles in src/app/global.scss
- Theme: Carbon g10 (light) and g90 (dark), toggled via header
- Sass uses @use (not @import) for Carbon v11 / Turbopack compatibility
- Carbon font-face disabled via $css--font-face: false in global.scss (Turbopack can't resolve ~ prefix)
Carbon Component Gotchas
TabPanel renders all panels simultaneously (hidden, not unmounted)
Carbon's TabPanel uses the HTML hidden attribute to hide inactive panels; it does NOT lazy-mount or unmount them. All tab panels and their full component trees are live in the DOM at all times. Consequences:
- Any component with a non-unique id prop will have duplicate DOM IDs across tabs. This silently breaks components that rely on id for internal DOM wiring (labels, aria associations, focus management). The browser uses the first matching element, so clicks on a selector in a later tab quietly target the hidden first tab's element instead.
- Rule: every Carbon component that takes an id prop and is used in multiple tabs must have a globally unique id. Pattern used here: {view-name}-{component-name}, e.g. model-behavior-model-selector, metric-behavior-model-selector.
- Components confirmed affected: FilterableMultiSelect, Toggle, Select. Assume all interactive Carbon components are affected.
FilterableMultiSelect: controlled vs uncontrolled
- Use selectedItems (controlled) rather than initialSelectedItems (uncontrolled) when the parent needs to own the selection state (e.g. to filter data). The uncontrolled path fires onChange via a useEffect guarded by an isMounted ref; under React StrictMode's double-invoke behaviour this guard can leave the component unresponsive.
- Always add a null guard to itemToString: (item) => (item ? item.name : ''). Carbon passes null to itemToString in some internal code paths (e.g. when clearing the filter input); without the guard this throws and corrupts Downshift's internal state.
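The id rule and the itemToString guard can be combined into one props builder. This sketch avoids JSX so it runs without React or Carbon installed: carbonId and modelSelectorProps are hypothetical helpers, ModelItem is a reduced stand-in for the model type, and the returned object only mirrors the prop names mentioned above (id, items, selectedItems, itemToString).

```typescript
interface ModelItem {
  modelId: string;
  name: string;
}

// {view-name}-{component-name}, so the same selector gets a distinct id per tab.
const carbonId = (viewName: string, componentName: string): string =>
  `${viewName}-${componentName}`;

// Null-safe: returns '' when Carbon passes null or undefined.
const itemToString = (item: ModelItem | null | undefined): string =>
  item ? item.name : '';

// Hypothetical builder for the controlled variant: the parent owns
// selectedItems and passes it down on every render.
function modelSelectorProps(
  viewName: string,
  items: ModelItem[],
  selectedItems: ModelItem[],
) {
  return {
    id: carbonId(viewName, 'model-selector'),
    items,
    selectedItems,
    itemToString,
  };
}

const models: ModelItem[] = [{ modelId: 'm1', name: 'Model One' }];
const modelBehaviorProps = modelSelectorProps('model-behavior', models, models);
const metricBehaviorProps = modelSelectorProps('metric-behavior', models, []);
```

Deriving the id from the view name in one place makes a duplicate id across tabs harder to introduce by copy and paste.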
@carbon/charts-react/styles.css import
Import this stylesheet once, at the highest shared layout level (e.g. app/layout.tsx or global.scss), not per-component. All rules are scoped under .cds--chart-holder so there is no global style pollution, but importing it in multiple component files creates redundant CSS bundles.
Deployment
- Dockerfile for containerized deployment
- next.config.js sets output: 'standalone' for minimal Docker images
- Security headers configured (CSP, HSTS, X-Frame-Options)
- Can also deploy on HuggingFace Spaces
Known Technical Debt
- Type safety gaps: any types on Task.input and Model.trainingDetails; task.annotations is untyped (context quality scores on RAG/QA documents, distinct from result.scores; see TODO comment in types.ts)
- Pre-existing ESLint warnings: 19 errors and 15 warnings from react-hooks/exhaustive-deps, react-compiler rules, and setState-in-effect patterns; need case-by-case review
- ESLint 10 blocked: eslint-plugin-react (bundled by eslint-config-next) uses the deprecated getFilename API removed in ESLint 10; pinned to v9 until upstream fixes
- No client-side component tests: all tests are pure utility/logic tests; interactive Carbon component bugs are only caught manually