Spaces:
Running
Running
| # InspectorRAGet β Architecture | |
| > Living document. Captures the current state of the codebase. Update as architecture evolves. | |
| ## Overview | |
| InspectorRAGet is a client-side introspection platform for LLM evaluation. Users upload JSON files containing evaluation data (models, metrics, tasks, and model results) and explore results through aggregate and instance-level visualizations. The platform does not execute experiments β it is purely analytical. | |
| Built with Next.js 16 (App Router), React 18, TypeScript 5.9, and IBM Carbon Design System. | |
| ## High-Level Data Flow | |
| ``` | |
| βββββββββββββββββββββββ | |
| β JSON Input File β | |
| β (user upload or β | |
| β data/ directory) β | |
| βββββββββββ¬ββββββββββββ | |
| β | |
| βββββββββββΌββββββββββββ | |
| β migrator.ts β | |
| β Schema migration β | |
| β (v1 β v2, etc.) β | |
| βββββββββββ¬ββββββββββββ | |
| β | |
| βββββββββββΌββββββββββββ | |
| β validators.ts β | |
| β Schema validation β | |
| βββββββββββ¬ββββββββββββ | |
| β | |
| βββββββββββΌββββββββββββ | |
| β processor.ts β | |
| β Qualify/disqualify β | |
| β tasks by metric β | |
| β completeness β | |
| βββββββββββ¬ββββββββββββ | |
| β | |
| βββββββββββΌββββββββββββ | |
| β DataStore context β | |
| β (store.tsx) β | |
| β Data + taskMap β | |
| β + resultsMap β | |
| βββββββββββ¬ββββββββββββ | |
| β | |
| βββββββββββββββββββββΌββββββββββββββββββββ | |
| β β β | |
| βββββββββΌβββββββ βββββββββΌβββββββ βββββββββΌβββββββ | |
| β Aggregate β β Instance β β Annotator β | |
| β Views β β Views β β Views β | |
| β (overview, β β (task β β (agreement, β | |
| β model, β β detail, β β behavior) β | |
| β metric) β β per type) β β β | |
| ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ | |
| ``` | |
| ## Directory Structure | |
| ``` | |
| InspectorRAGet/ | |
| βββ src/ | |
| β βββ app/ # Next.js App Router β thin page shells | |
| β β βββ layout.tsx # Root: ThemeProvider β NotificationProvider β DataStoreProvider | |
| β β βββ page.tsx # / β Home landing page | |
| β β βββ visualize/ # /visualize β Upload and analyze | |
| β β βββ examples/ # /examples β Browse pre-loaded datasets | |
| β β β βββ [example_id]/ # /examples/:id β Specific dataset analysis | |
| β β | |
| β βββ views/ # Page-level container components | |
| β β βββ home/ # Landing page cards | |
| β β βββ on-board/ # Multi-step upload wizard (instructions β upload β verify) | |
| β β βββ example/ # Main analysis hub β 7-tab interface | |
| β β βββ examples/ # Grid of dataset tiles | |
| β β βββ visualization/ # Onboard β Example router | |
| β β βββ task/ # Instance-level task viewer (modal overlay) | |
| β β β βββ Task.tsx # Dispatches to type-specific TaskView via registry | |
| β β βββ performance-overview/ # Aggregate metric tables + charts | |
| β β βββ model-behavior/ # Per-metric distribution analysis | |
| β β βββ metric-behavior/ # Cross-metric correlation | |
| β β βββ model-comparator/ # Head-to-head model comparison | |
| β β βββ data-characteristics/ # Dataset statistics | |
| β β βββ annotator-behavior/ # Inter-annotator agreement | |
| β β βββ predictions-table/ # Filterable evaluation table | |
| β β βββ tasks-table/ # Task listing with filters | |
| β β βββ annotations-table/ # Per-task metric scores | |
| β β βββ document/ # Document viewer | |
| β β | |
| β βββ task-types/ # Vertical slice per evaluation type | |
| β β βββ index.ts # Registry: taskTypeRegistry maps type string β { TaskView, Copier } | |
| β β βββ qa/ # Single-turn QA with retrieved context | |
| β β β βββ types.ts # RetrievedDocument, RetrievedDocumentAnnotation | |
| β β β βββ TaskView.tsx # Input + contexts + per-model response + evaluations/steps tabs | |
| β β β βββ Copier.tsx | |
| β β βββ generation/ # Open-ended text/JSON generation | |
| β β β βββ TaskView.tsx # Input + per-model response + evaluations/steps tabs | |
| β β β βββ Copier.tsx | |
| β β βββ rag/ # Multi-turn retrieval conversation | |
| β β β βββ types.ts # Message union (SystemMessage, UserMessage, AssistantMessage, ToolMessage, β¦) | |
| β β β βββ TaskView.tsx # Conversation thread + per-model response + evaluations/steps tabs | |
| β β β βββ Copier.tsx | |
| β β β βββ components/ | |
| β β β βββ ChatLine.tsx # Renders a single OpenAI-format message (status ring, retries footer) | |
| β β β βββ Avatar.tsx # Role avatar with status ring (pass/warn/fail outline) | |
| β β β βββ DocumentsViewer.tsx | |
| β β βββ tool_calling/ # Function/tool calling evaluation | |
| β β β βββ types.ts # ToolDefinition (OpenAI JSON Schema format) | |
| β β β βββ TaskView.tsx # Conversation + available tools panel + prediction/target/evaluations/steps | |
| β β β βββ Copier.tsx | |
| β β βββ agentic/ # Goal-directed multi-turn agent execution | |
| β β βββ TaskView.tsx # Goal + initial state + target state + execution thread + evaluations/steps | |
| β β βββ Copier.tsx | |
| β β | |
| β βββ components/ # Reusable UI components | |
| β β βββ header/ # App header with nav and theme toggle | |
| β β βββ filters/ # Generic filter controls | |
| β β βββ expression-builder/ # Advanced filter expression builder | |
| β β βββ selectors/ # Model, Metric, Aggregator selectors | |
| β β βββ evaluations/ # EvaluationsPanel β shared human + algorithmic score tables | |
| β β βββ trace/ # Execution trace: TraceGroup + TraceItem (collapsible cards for invocation/tool_execution/observation events) | |
| β β βββ comments/ # Task commenting system (see Comment System section below) | |
| β β βββ notification/ # Toast notifications (context provider) | |
| β β βββ avatar/ # User/agent avatars | |
| β β βββ task-tile/ # Task summary card | |
| β β βββ example-tile/ # Dataset summary card | |
| β β βββ disabled/ # Disabled feature placeholder | |
| β β | |
| β βββ hooks/ | |
| β β βββ useBackButton.ts # Browser back navigation | |
| β β βββ useStorage.ts # localStorage persistence | |
| β β βββ usePrevious.ts # Previous render value | |
| β β | |
| β βββ utilities/ | |
| β β βββ strings.ts # Hashing, truncation, search matching | |
| β β βββ colors.ts # Color scale generation | |
| β β βββ objects.ts # camelCase/snakeCase key conversion | |
| β β βββ aggregators.ts # Mean, median, majority, weighted aggregators | |
| β β βββ metrics.ts # Metric helper functions | |
| β β βββ selectors.ts # Mouse selection extraction | |
| β β βββ expressions.ts # Expression evaluation for advanced filters | |
| β β βββ correlation.ts # Statistical correlation | |
| β β βββ significance.ts # Statistical significance tests | |
| β β βββ highlighter.ts # Text overlap highlighting | |
| β β βββ time.ts # Duration calculation | |
| β β | |
| β βββ workers/ | |
| β β βββ filter.ts # Web Worker for background data filtering | |
| β β | |
| β βββ types.ts # Core TypeScript interfaces (re-exports task-type-specific types) | |
| β βββ store.tsx # DataStoreProvider (React Context) | |
| β βββ migrator.ts # Versioned schema migration chain (v1 β v2 β β¦) | |
| β βββ processor.ts # Data qualification pipeline | |
| β βββ exporter.ts # Export pipeline (split from processor.ts) | |
| β βββ validators.ts # Input schema validation | |
| β βββ dataloader.ts # Server-side data/ directory loader | |
| β βββ theme.tsx # ThemeProvider (Carbon g10/g90) | |
| β | |
| βββ converters/ # Dataset converters | |
| β βββ bfcl/ # Berkeley Function Calling Leaderboard (single-turn and multi-turn, V3/V4) | |
| β βββ acebench/ # ACEBench (tool-calling and agentic categories) | |
| βββ data/ # Pre-loaded example datasets (JSON, schema v2) | |
| βββ notebooks/ # Integration notebooks (Ragas, LM Eval, HuggingFace, BFCL) | |
| βββ public/ # Static assets (favicon, license) | |
| βββ docs/ # Documentation (this file) | |
| ``` | |
| ## Core Data Model | |
| Defined in `src/types.ts`. The input JSON (schema v2) has this structure: | |
| ``` | |
| RawData | |
| βββ schema_version?: number # 2 = current; absent or 1 = legacy (auto-migrated) | |
| βββ name?: string | |
| βββ models: Model[] # LLMs being evaluated | |
| β βββ { modelId, name, owner, ... } | |
| βββ metrics: Metric[] # Evaluation criteria | |
| β βββ { name, type: numerical|categorical|text, author: human|algorithm, ... } | |
| βββ documents?: RetrievedDocument[] # Corpus documents (QA/RAG tasks) | |
| β βββ { documentId, text, title?, url?, score? } | |
| βββ filters?: string[] # Task fields available for filtering | |
| βββ tasks: Task[] # Individual evaluation instances | |
| β βββ { taskId, taskType: qa|generation|rag|tool_calling|agentic, | |
| β input, targets?: TaskTarget[], tools?: ToolDefinition[], | |
| β flagged?, comments?: TaskComment[], annotations? } | |
| βββ results: ModelResult[] # Model outputs + metric scores | |
| βββ { taskId, modelId, output: Message[], scores: { [metric]: { [annotator]: { value } } }, | |
| contexts?, comments?: TaskComment[] } | |
| ``` | |
| `output` is always a `Message[]`. For single-inference task types (qa, generation, rag, tool_calling) it is a one-element array. For `agentic` tasks it is the full execution thread: interleaved user, assistant, and tool messages across all turns. Trace events live on individual messages as `message.trace`, not at the result level. | |
| ### Key type unions | |
| **`Message`** β OpenAI-compatible message shape: | |
| - `role: 'system' | 'user' | 'assistant' | 'tool'` | |
| - `content?: string` β text response | |
| - `tool_calls?: ToolCallRecord[]` β tool-calling output (on assistant messages) | |
| - `trace?: TraceEvent[]` β execution trace attached to the assistant message that produced it; each event is discriminated on `type`: `invocation | tool_execution | observation` | |
| - `retries?: MessageRetry[]` β intermediate retry attempts before final output | |
| - `metadata?: Record<string, unknown>` β benchmark-supplied metadata; known keys: `status` (`'pass' | 'fail' | 'warn'`) rendered as a coloured badge in the chat footer, and `statusDefinition` (string) shown as a hover tooltip on that badge | |
| **`ModelResult`** carries an optional `metadata?: Record<string, unknown>` bag for benchmark-supplied per-result diagnostics. Known key: `error` β `{ kind: 'text' | 'structured', context: unknown }` used by the BFCL agentic converter to surface structured state-diff details. | |
| **`TaskTarget`** β discriminated on `type`: | |
| - `{ type: 'text'; value: string }` β most task types | |
| - `{ type: 'tool_calls'; calls: ToolCallRecord[] }` β tool-calling ground truth | |
| - `{ type: 'state'; value: Record<string, unknown> }` β agentic expected final environment state | |
| - `{ type: 'image'; url: string }` β multimodal (future) | |
| **`TraceEvent`** β discriminated on `type`: | |
| - `invocation` β an intermediate LLM call within a turn (before the accepted output) | |
| - `tool_execution` β environment response(s) following an intermediate invocation | |
| - `observation` β runner feedback after a decode failure, empty response, or forced termination | |
| - All events carry an optional `label` (e.g. `"step_2"` matching the inference log key) and `content` string | |
| ### Schema migration | |
| `migrator.ts` runs before `validators.ts` on every load. The migration chain is: | |
| - **v1 β v2:** renames legacy task types (`rag` single-turn β `qa`, `rag` multi-turn β `rag`, `text_generation`/`json_generation` β `generation`, `chat` β `rag`); wraps `model_response` string β `output: [{ role: 'assistant', content }]`; renames `annotations` β `scores`; renames `evaluations` array β `results` | |
| Exported files are always stamped with `schema_version: CURRENT_SCHEMA_VERSION`. | |
| After processing (`processor.ts`), tasks are qualified or disqualified based on: | |
| 1. Whether all plottable metrics have scores | |
| 2. Whether results exist for all specified models | |
| 3. Whether score values are non-empty | |
| The qualified data becomes the `Data` interface (extends `TileData`), stored in `DataStore` context. | |
| ## State Management | |
| **Global state** is React Context, not Redux: | |
| - `DataStoreProvider` (`store.tsx`): holds `Data`, `taskMap: Map<taskId, Task>`, and `resultsMap: Map<"taskId::modelId", ModelResult>` | |
| - `updateTask(taskId, update)` β immutable Map update for task-level changes (flags, task comments) | |
| - `updateResult(taskId, modelId, update)` β immutable Map update for model-result-level changes (model comments) | |
| - `ThemeProvider` (`theme.tsx`): Carbon theme toggle (light g10 / dark g90) | |
| - `NotificationProvider` (`components/notification/`): toast messages | |
| **Local state**: each view manages its own filters, selections, and UI state via `useState`. | |
| **Web Workers**: `ModelBehavior` view spawns a filter worker for expensive filtering operations to avoid blocking the UI thread. | |
| ## Task-Type Registry | |
| `src/task-types/index.ts` exports `taskTypeRegistry`: | |
| ```typescript | |
| const taskTypeRegistry: Record< | |
| string, | |
| { TaskView: ComponentType; Copier: ComponentType } | |
| > = { | |
| qa: { TaskView: QATaskView, Copier: QACopier }, | |
| generation: { TaskView: GenerationTaskView, Copier: GenerationCopier }, | |
| rag: { TaskView: RAGTaskView, Copier: RAGCopier }, | |
| tool_calling: { TaskView: ToolCallingTaskView, Copier: ToolCallingCopier }, | |
| agentic: { TaskView: AgenticTaskView, Copier: AgenticCopier }, | |
| }; | |
| ``` | |
| `Task.tsx` and `TaskCopier.tsx` look up the component via `taskTypeRegistry[task.taskType]` β no if/else chains. Unknown task types degrade gracefully to null. | |
| ## Comment System | |
| Comments live at two levels: | |
| - **`task.comments`** β task-level observations shared across all models | |
| - **`result.comments`** β per-model observations (e.g. noting an acceptable-but-different tool call) | |
| `Task.tsx` routes a new comment to the correct level by inspecting the provenance component string: any component containing `::` is model-scoped (written to `updateResult`); others are task-scoped (written to `updateTask`). | |
| ### Provenance component string convention | |
| | Component string pattern | Meaning | Scope | | |
| | ------------------------------------------------------ | ------------------------------- | ----- | | |
| | `input` / `messages` | Input or conversation area | Task | | |
| | `document_{id}` | Retrieved context document | Task | | |
| | `target` | Ground-truth target area | Task | | |
| | `{modelId}::evaluation::response` | Model response text | Model | | |
| | `{modelId}::evaluation::prediction` | Model prediction (tool calling) | Model | | |
| | `{modelId}::evaluation::scores::{metric}::{annotator}` | Specific score cell | Model | | |
| | `{modelId}::steps::{stepId}` | Specific execution step | Model | | |
| ### Floating selection button | |
| After any `mouseup` on the `.taskViewWrapper` div, `Task.tsx` captures viewport coordinates. If provenance is also set (text was selected), `SelectionCommentButton` renders as a `position: fixed` button near the cursor. Clicking it opens `AddCommentModal` with provenance pre-filled. Clicking anywhere else clears the coords and dismisses the button. | |
| ### `provenanceTag.ts` | |
| Single source of truth for deriving display pills from a provenance component string. Returns `{ primary: [label, carbonType], detail?: [label1, label2] }`. The `detail` pair is only set for score cells (metric + annotator) and step references (\"step\" + stepId). All three comment modals (`AddCommentModal`, `EditCommentModal`, `CommentsViewer`) import from here. | |
| ### `CommentFinding` | |
| Optional structured annotation attached to a `TaskComment`. Discriminated on `type`: | |
| - `tool_call` β points to the correct function name/arguments | |
| - `query` β records the correct retrieval query | |
| - `output` β records a corrected reference output | |
| - `note` β free-form structured note | |
| `CommentFindingEditor` renders type-appropriate fields filtered by `task.taskType`. Findings are stored in the comment but editing them post-creation is out of scope (display-only in `EditCommentModal`). | |
| ## Routing | |
| | Route | Server/Client | What it does | | |
| | ---------------- | ------------- | ---------------------------------------- | | |
| | `/` | Server | Renders Home with navigation cards | | |
| | `/visualize` | Client | Upload wizard β analysis view | | |
| | `/examples` | Server | Loads `data/` dir, renders dataset grid | | |
| | `/examples/[id]` | Server | Loads specific dataset, renders analysis | | |
| The analysis hub (`Example` view) provides 7 tabs: | |
| 1. **Data Characteristics** β dataset statistics, distributions | |
| 2. **Predictions Table** β filterable evaluation records | |
| 3. **Annotator Behavior** β inter-annotator agreement heatmap | |
| 4. **Performance Overview** β aggregate metrics with rankings | |
| 5. **Model Behavior** β per-metric distribution analysis | |
| 6. **Model Comparator** β head-to-head comparison | |
| 7. **Metric Behavior** β cross-metric correlation | |
| Clicking a task in any table opens a **Task modal overlay** (`views/task/Task.tsx`) which dispatches to the type-specific `TaskView` from the registry. | |
| ## Key Technical Details | |
| ### Data Processing Pipeline | |
| `migrator.ts` β `migrateData(raw)` β `validators.ts` β `validateInputData(data)` β `processor.ts` β `processData(raw)` returns `[Data, DisqualifiedTasks, Notification[]]` | |
| - Migration runs first, before `camelCaseKeys`, so it operates on raw snake_case fields | |
| - Processor validates every result has scores for all plottable metrics, ensures every task has results from all specified models, sorts categorical metric values, computes metric ranges | |
| - `bin()` in `utilities/metrics.ts` maps a numeric value to its `[start, end]` bucket string using the metric's `range: [min, max, step]`. Values below `min` map to `<min` and values above `max` map to `>max` β both render as normal category bars in Carbon Charts, preventing unbounded raw-value bins from outliers | |
| ### Input Validation | |
| `validators.ts` β `validateInputData(data)` returns `{ valid, reasons[] }` | |
| - Checks required fields on models, metrics, tasks, results | |
| - Validates metric type constraints (categorical needs values, numerical can't use majority) | |
| - QA tasks must reference documents | |
| - Every value in a categorical metric must carry a `numeric_value` (camelCase: `numericValue`) | |
| **Why `numeric_value` is required on categorical metric values** | |
| The entire aggregation and sorting pipeline operates on numbers, not label strings. `castToNumber()` in `utilities/metrics.ts` maps a string label to its `numericValue` so that mean, median, and inter-annotator agreement distance can be computed arithmetically. `computeMajority()` uses `Math.abs(castToNumber(a) - castToNumber(b))` to decide whether the top-two annotator choices are "close" (high agreement) or "far apart" (no agreement). `sortMetricValues()` in `processor.ts` sorts the metric's value list by `numericValue` so UI dropdowns, chart axes, and filter ranges reflect the researcher's intended ordering. `compareMetricAggregatedValues()` in `metrics.ts` uses `numericValue` to order chart bars for majority-aggregated metrics. | |
| Without `numericValue`, every one of these paths falls back to `parseFloat(label)`, which returns `NaN` for any non-numeric string. That silently corrupts aggregate statistics, chart orderings, and agreement calculations. The validator rejects files with missing `numericValue` on categorical entries so the problem is surfaced at load time rather than producing incorrect visualizations. | |
| **Convention:** assign `numericValue` so that **higher = better**. `sortMetricValues()` sorts ascending by `numericValue`, so `values[0]` becomes `minValue` (worst) and `values[last]` becomes `maxValue` (best). `PerformanceOverview` normalises scores as `(score - minValue) / (maxValue - minValue)` and ranks models with higher scores first. Example: `{ value: "poor", numeric_value: 0 }`, `{ value: "acceptable", numeric_value: 1 }`, `{ value: "good", numeric_value: 2 }`. | |
| ### JSON Key Convention | |
| Input files use `snake_case`. The app converts to `camelCase` on load (`camelCaseKeys` in `objects.ts`) and back to `snake_case` on export (`snakeCaseKeys`). Migration runs before `camelCaseKeys`. | |
| ### Styling | |
| - SCSS Modules: one `.module.scss` per component, co-located | |
| - Carbon Design tokens for spacing, colors, typography | |
| - Global styles in `src/app/global.scss` | |
| - Theme: Carbon `g10` (light) and `g90` (dark), toggled via header | |
| - Sass uses `@use` (not `@import`) for Carbon v11 / Turbopack compatibility | |
| - Carbon font-face disabled via `$css--font-face: false` in global.scss (Turbopack can't resolve `~` prefix) | |
| ### Carbon Component Gotchas | |
| **TabPanel renders all panels simultaneously (hidden, not unmounted)** | |
| Carbon's `TabPanel` uses the HTML `hidden` attribute to hide inactive panels β it does NOT lazy-mount or unmount them. All tab panels and their full component trees are live in the DOM at all times. Consequences: | |
| - Any component with a non-unique `id` prop will have duplicate DOM IDs across tabs. This silently breaks components that rely on `id` for internal DOM wiring (labels, aria associations, focus management). The browser uses the **first** matching element β clicks on a selector in a later tab quietly target the hidden first tab's element instead. | |
| - **Rule:** every Carbon component that takes an `id` prop and is used in multiple tabs must have a globally unique `id`. Pattern used here: `{view-name}-{component-name}`, e.g. `model-behavior-model-selector`, `metric-behavior-model-selector`. | |
| - Components confirmed affected: `FilterableMultiSelect`, `Toggle`, `Select`. Assume all interactive Carbon components are affected. | |
| **`FilterableMultiSelect` β controlled vs uncontrolled** | |
| - Use `selectedItems` (controlled) rather than `initialSelectedItems` (uncontrolled) when the parent needs to own the selection state (e.g. to filter data). The uncontrolled path fires `onChange` via a `useEffect` guarded by an `isMounted` ref; under React StrictMode's double-invoke behaviour this guard can leave the component unresponsive. | |
| - Always add a null guard to `itemToString`: `(item) => (item ? item.name : '')`. Carbon passes `null` to `itemToString` in some internal code paths (e.g. when clearing the filter input); without the guard this throws and corrupts Downshift's internal state. | |
| **`@carbon/charts-react/styles.css` import** | |
| Import this stylesheet once, at the highest shared layout level (e.g. `app/layout.tsx` or `global.scss`), not per-component. All rules are scoped under `.cds--chart-holder` so there is no global style pollution, but importing it in multiple component files creates redundant CSS bundles. | |
| ### Deployment | |
| - Dockerfile for containerized deployment | |
| - `next.config.js` sets `output: 'standalone'` for minimal Docker images | |
| - Security headers configured (CSP, HSTS, X-Frame-Options) | |
| - Can also deploy on HuggingFace Spaces | |
| ## Known Technical Debt | |
| 1. **Type safety gaps** β `any` types on `Task.input` and `Model.trainingDetails`; `task.annotations` is untyped (context quality scores on RAG/QA documents, distinct from `result.scores` β see TODO comment in `types.ts`) | |
| 2. **Pre-existing ESLint warnings** β 19 errors and 15 warnings from `react-hooks/exhaustive-deps`, `react-compiler` rules, and `setState`-in-effect patterns; need case-by-case review | |
| 3. **ESLint 10 blocked** β `eslint-plugin-react` (bundled by `eslint-config-next`) uses deprecated `getFilename` API removed in ESLint 10; pinned to v9 until upstream fixes | |
| 4. **No client-side component tests** β all tests are pure utility/logic tests; interactive Carbon component bugs are only caught manually | |