
InspectorRAGet: Architecture

Living document. Captures the current state of the codebase. Update as architecture evolves.

Overview

InspectorRAGet is a client-side introspection platform for LLM evaluation. Users upload JSON files containing evaluation data (models, metrics, tasks, and model results) and explore results through aggregate and instance-level visualizations. The platform does not execute experiments; it is purely analytical.

Built with Next.js 16 (App Router), React 18, TypeScript 5.9, and IBM Carbon Design System.

High-Level Data Flow

                              ┌──────────────────────┐
                              │   JSON Input File    │
                              │  (user upload or     │
                              │   data/ directory)   │
                              └─────────┬────────────┘
                                        │
                              ┌─────────▼────────────┐
                              │   migrator.ts        │
                              │  Schema migration    │
                              │  (v1 → v2, etc.)     │
                              └─────────┬────────────┘
                                        │
                              ┌─────────▼────────────┐
                              │   validators.ts      │
                              │  Schema validation   │
                              └─────────┬────────────┘
                                        │
                              ┌─────────▼────────────┐
                              │   processor.ts       │
                              │  Qualify/disqualify  │
                              │  tasks by metric     │
                              │  completeness        │
                              └─────────┬────────────┘
                                        │
                              ┌─────────▼────────────┐
                              │   DataStore context  │
                              │  (store.tsx)         │
                              │  Data + taskMap      │
                              │       + resultsMap   │
                              └─────────┬────────────┘
                                        │
                    ┌───────────────────┼───────────────────┐
                    │                   │                   │
            ┌───────▼──────┐    ┌───────▼──────┐    ┌───────▼──────┐
            │  Aggregate   │    │  Instance    │    │  Annotator   │
            │  Views       │    │  Views       │    │  Views       │
            │  (overview,  │    │  (task       │    │  (agreement, │
            │   model,     │    │   detail,    │    │   behavior)  │
            │   metric)    │    │   per type)  │    │              │
            └──────────────┘    └──────────────┘    └──────────────┘

Directory Structure

InspectorRAGet/
├── src/
│   ├── app/                    # Next.js App Router - thin page shells
│   │   ├── layout.tsx          # Root: ThemeProvider → NotificationProvider → DataStoreProvider
│   │   ├── page.tsx            # / - Home landing page
│   │   ├── visualize/          # /visualize - Upload and analyze
│   │   ├── examples/           # /examples - Browse pre-loaded datasets
│   │   │   └── [example_id]/   # /examples/:id - Specific dataset analysis
│   │
│   ├── views/                  # Page-level container components
│   │   ├── home/               # Landing page cards
│   │   ├── on-board/           # Multi-step upload wizard (instructions → upload → verify)
│   │   ├── example/            # Main analysis hub - 7-tab interface
│   │   ├── examples/           # Grid of dataset tiles
│   │   ├── visualization/      # Onboard → Example router
│   │   ├── task/               # Instance-level task viewer (modal overlay)
│   │   │   └── Task.tsx        # Dispatches to type-specific TaskView via registry
│   │   ├── performance-overview/   # Aggregate metric tables + charts
│   │   ├── model-behavior/     # Per-metric distribution analysis
│   │   ├── metric-behavior/    # Cross-metric correlation
│   │   ├── model-comparator/   # Head-to-head model comparison
│   │   ├── data-characteristics/   # Dataset statistics
│   │   ├── annotator-behavior/ # Inter-annotator agreement
│   │   ├── predictions-table/  # Filterable evaluation table
│   │   ├── tasks-table/        # Task listing with filters
│   │   ├── annotations-table/  # Per-task metric scores
│   │   └── document/           # Document viewer
│   │
│   ├── task-types/             # Vertical slice per evaluation type
│   │   ├── index.ts            # Registry: taskTypeRegistry maps type string → { TaskView, Copier }
│   │   ├── qa/                 # Single-turn QA with retrieved context
│   │   │   ├── types.ts        # RetrievedDocument, RetrievedDocumentAnnotation
│   │   │   ├── TaskView.tsx    # Input + contexts + per-model response + evaluations/steps tabs
│   │   │   └── Copier.tsx
│   │   ├── generation/         # Open-ended text/JSON generation
│   │   │   ├── TaskView.tsx    # Input + per-model response + evaluations/steps tabs
│   │   │   └── Copier.tsx
│   │   ├── rag/                # Multi-turn retrieval conversation
│   │   │   ├── types.ts        # Message union (SystemMessage, UserMessage, AssistantMessage, ToolMessage, …)
│   │   │   ├── TaskView.tsx    # Conversation thread + per-model response + evaluations/steps tabs
│   │   │   ├── Copier.tsx
│   │   │   └── components/
│   │   │       ├── ChatLine.tsx    # Renders a single OpenAI-format message (status ring, retries footer)
│   │   │       ├── Avatar.tsx      # Role avatar with status ring (pass/warn/fail outline)
│   │   │       └── DocumentsViewer.tsx
│   │   ├── tool_calling/       # Function/tool calling evaluation
│   │   │   ├── types.ts        # ToolDefinition (OpenAI JSON Schema format)
│   │   │   ├── TaskView.tsx    # Conversation + available tools panel + prediction/target/evaluations/steps
│   │   │   └── Copier.tsx
│   │   └── agentic/            # Goal-directed multi-turn agent execution
│   │       ├── TaskView.tsx    # Goal + initial state + target state + execution thread + evaluations/steps
│   │       └── Copier.tsx
│   │
│   ├── components/             # Reusable UI components
│   │   ├── header/             # App header with nav and theme toggle
│   │   ├── filters/            # Generic filter controls
│   │   ├── expression-builder/ # Advanced filter expression builder
│   │   ├── selectors/          # Model, Metric, Aggregator selectors
│   │   ├── evaluations/        # EvaluationsPanel - shared human + algorithmic score tables
│   │   ├── trace/              # Execution trace: TraceGroup + TraceItem (collapsible cards for invocation/tool_execution/observation events)
│   │   ├── comments/           # Task commenting system (see Comment System section below)
│   │   ├── notification/       # Toast notifications (context provider)
│   │   ├── avatar/             # User/agent avatars
│   │   ├── task-tile/          # Task summary card
│   │   ├── example-tile/       # Dataset summary card
│   │   └── disabled/           # Disabled feature placeholder
│   │
│   ├── hooks/
│   │   ├── useBackButton.ts    # Browser back navigation
│   │   ├── useStorage.ts       # localStorage persistence
│   │   └── usePrevious.ts      # Previous render value
│   │
│   ├── utilities/
│   │   ├── strings.ts          # Hashing, truncation, search matching
│   │   ├── colors.ts           # Color scale generation
│   │   ├── objects.ts          # camelCase/snakeCase key conversion
│   │   ├── aggregators.ts      # Mean, median, majority, weighted aggregators
│   │   ├── metrics.ts          # Metric helper functions
│   │   ├── selectors.ts        # Mouse selection extraction
│   │   ├── expressions.ts      # Expression evaluation for advanced filters
│   │   ├── correlation.ts      # Statistical correlation
│   │   ├── significance.ts     # Statistical significance tests
│   │   ├── highlighter.ts      # Text overlap highlighting
│   │   └── time.ts             # Duration calculation
│   │
│   ├── workers/
│   │   └── filter.ts           # Web Worker for background data filtering
│   │
│   ├── types.ts                # Core TypeScript interfaces (re-exports task-type-specific types)
│   ├── store.tsx               # DataStoreProvider (React Context)
│   ├── migrator.ts             # Versioned schema migration chain (v1 → v2 → …)
│   ├── processor.ts            # Data qualification pipeline
│   ├── exporter.ts             # Export pipeline (split from processor.ts)
│   ├── validators.ts           # Input schema validation
│   ├── dataloader.ts           # Server-side data/ directory loader
│   └── theme.tsx               # ThemeProvider (Carbon g10/g90)
│
├── converters/                 # Dataset converters
│   ├── bfcl/                   # Berkeley Function Calling Leaderboard (single-turn and multi-turn, V3/V4)
│   └── acebench/               # ACEBench (tool-calling and agentic categories)
├── data/                       # Pre-loaded example datasets (JSON, schema v2)
├── notebooks/                  # Integration notebooks (Ragas, LM Eval, HuggingFace, BFCL)
├── public/                     # Static assets (favicon, license)
└── docs/                       # Documentation (this file)

Core Data Model

Defined in src/types.ts. The input JSON (schema v2) has this structure:

RawData
├── schema_version?: number            # 2 = current; absent or 1 = legacy (auto-migrated)
├── name?: string
├── models: Model[]                    # LLMs being evaluated
│   └── { modelId, name, owner, ... }
├── metrics: Metric[]                  # Evaluation criteria
│   └── { name, type: numerical|categorical|text, author: human|algorithm, ... }
├── documents?: RetrievedDocument[]    # Corpus documents (QA/RAG tasks)
│   └── { documentId, text, title?, url?, score? }
├── filters?: string[]                 # Task fields available for filtering
├── tasks: Task[]                      # Individual evaluation instances
│   └── { taskId, taskType: qa|generation|rag|tool_calling|agentic,
│          input, targets?: TaskTarget[], tools?: ToolDefinition[],
│          flagged?, comments?: TaskComment[], annotations? }
└── results: ModelResult[]             # Model outputs + metric scores
    └── { taskId, modelId, output: Message[], scores: { [metric]: { [annotator]: { value } } },
           contexts?, comments?: TaskComment[] }

output is always a Message[]. For single-inference task types (qa, generation, rag, tool_calling) it is a one-element array. For agentic tasks it is the full execution thread: interleaved user, assistant, and tool messages across all turns. Trace events live on individual messages as message.trace, not at the result level.

Key type unions

Message: an OpenAI-compatible message shape:

  • role: 'system' | 'user' | 'assistant' | 'tool'
  • content?: string - text response
  • tool_calls?: ToolCallRecord[] - tool-calling output (on assistant messages)
  • trace?: TraceEvent[] - execution trace attached to the assistant message that produced it; each event is discriminated on type: invocation | tool_execution | observation
  • retries?: MessageRetry[] - intermediate retry attempts before final output
  • metadata?: Record<string, unknown> - benchmark-supplied metadata; known keys: status ('pass' | 'fail' | 'warn') rendered as a coloured badge in the chat footer, and statusDefinition (string) shown as a hover tooltip on that badge

ModelResult carries an optional metadata?: Record<string, unknown> bag for benchmark-supplied per-result diagnostics. Known key: error, shaped as { kind: 'text' | 'structured', context: unknown }, used by the BFCL agentic converter to surface structured state-diff details.

TaskTarget, discriminated on type:

  • { type: 'text'; value: string } - most task types
  • { type: 'tool_calls'; calls: ToolCallRecord[] } - tool-calling ground truth
  • { type: 'state'; value: Record<string, unknown> } - agentic expected final environment state
  • { type: 'image'; url: string } - multimodal (future)
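
As a minimal sketch of how this union narrows in practice, the snippet below declares the four variants and a helper that summarises each one. The ToolCallRecord fields and the describeTarget helper are assumptions made for illustration; the real definitions live in src/types.ts.

```typescript
// Assumed minimal shape; the real ToolCallRecord is defined in src/types.ts.
interface ToolCallRecord {
  name: string;
  arguments: Record<string, unknown>;
}

// The four variants listed above, discriminated on `type`.
type TaskTarget =
  | { type: 'text'; value: string }
  | { type: 'tool_calls'; calls: ToolCallRecord[] }
  | { type: 'state'; value: Record<string, unknown> }
  | { type: 'image'; url: string };

// Hypothetical helper (not part of the codebase): a one-line summary per variant.
// Switching on `type` lets TypeScript narrow to the matching variant in each case.
function describeTarget(target: TaskTarget): string {
  switch (target.type) {
    case 'text':
      return `text: ${target.value}`;
    case 'tool_calls':
      return `tool_calls: ${target.calls.map((call) => call.name).join(', ')}`;
    case 'state':
      return `state keys: ${Object.keys(target.value).join(', ')}`;
    case 'image':
      return `image: ${target.url}`;
  }
}
```

Because the switch is exhaustive, adding a fifth variant to the union would surface as a compile-time error in any helper written this way.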

TraceEvent, discriminated on type:

  • invocation - an intermediate LLM call within a turn (before the accepted output)
  • tool_execution - environment response(s) following an intermediate invocation
  • observation - runner feedback after a decode failure, empty response, or forced termination
  • All events carry an optional label (e.g. "step_2" matching the inference log key) and content string
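
To illustrate that trace events hang off individual messages rather than the result, here is a small sketch that walks an output thread and tallies events by type. The trimmed Message shape, the required content field on events, and the countTraceEvents helper are assumptions for this example only.

```typescript
// Trimmed-down shapes; field optionality beyond `label` is an assumption.
type TraceEventType = 'invocation' | 'tool_execution' | 'observation';

interface TraceEvent {
  type: TraceEventType;
  label?: string; // e.g. "step_2"
  content: string;
}

interface Message {
  role: 'system' | 'user' | 'assistant' | 'tool';
  content?: string;
  trace?: TraceEvent[];
}

// Hypothetical helper: count trace events per type across a whole thread.
function countTraceEvents(output: Message[]): Record<TraceEventType, number> {
  const counts: Record<TraceEventType, number> = {
    invocation: 0,
    tool_execution: 0,
    observation: 0,
  };
  for (const message of output) {
    // Messages without a trace contribute nothing.
    for (const event of message.trace ?? []) {
      counts[event.type] += 1;
    }
  }
  return counts;
}

// Illustrative agentic thread: the trace sits on the assistant messages.
const exampleThread: Message[] = [
  { role: 'user', content: 'Book a table for two.' },
  {
    role: 'assistant',
    content: 'Done.',
    trace: [
      { type: 'invocation', label: 'step_1', content: 'search_restaurants(...)' },
      { type: 'tool_execution', label: 'step_1', content: '[...]' },
      { type: 'observation', content: 'decode failure, retrying' },
      { type: 'invocation', label: 'step_2', content: 'book_table(...)' },
    ],
  },
];
```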

Schema migration

migrator.ts runs before validators.ts on every load. The migration chain is:

  • v1 → v2:
    • renames legacy task types (rag single-turn → qa, rag multi-turn → rag, text_generation/json_generation → generation, chat → rag)
    • wraps the model_response string as output: [{ role: 'assistant', content }]
    • renames annotations → scores
    • renames the evaluations array → results

Exported files are always stamped with schema_version: CURRENT_SCHEMA_VERSION.
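
The v1 → v2 step can be sketched roughly as follows. This is not the code in migrator.ts: the function names, the task_type field name, and the Map of renames are assumptions, and the rag single-turn/multi-turn split is left out because the rule for telling the two apart is not stated here.

```typescript
type RawRecord = Record<string, unknown>;

const CURRENT_SCHEMA_VERSION = 2;

// Only the unambiguous renames; the legacy "rag" split into qa/rag is omitted.
const LEGACY_TASK_TYPES = new Map<string, string>([
  ['text_generation', 'generation'],
  ['json_generation', 'generation'],
  ['chat', 'rag'],
]);

// Assumes v1 tasks carry their type in a snake_case `task_type` field.
function migrateTask(task: RawRecord): RawRecord {
  const legacyType = task.task_type;
  const renamed = typeof legacyType === 'string' ? LEGACY_TASK_TYPES.get(legacyType) : undefined;
  return renamed ? { ...task, task_type: renamed } : task;
}

// Wrap model_response into a one-element message list and rename annotations.
function migrateEvaluation(evaluation: RawRecord): RawRecord {
  const { model_response, annotations, ...rest } = evaluation;
  return {
    ...rest,
    output: [{ role: 'assistant', content: model_response }],
    scores: annotations,
  };
}

// Operates on raw snake_case data, i.e. before any camelCase conversion.
function migrateV1ToV2(raw: RawRecord): RawRecord {
  if (raw.schema_version === CURRENT_SCHEMA_VERSION) return raw; // already current
  const { evaluations, tasks, ...rest } = raw;
  return {
    ...rest,
    schema_version: CURRENT_SCHEMA_VERSION,
    tasks: ((tasks as RawRecord[] | undefined) ?? []).map(migrateTask),
    results: ((evaluations as RawRecord[] | undefined) ?? []).map(migrateEvaluation),
  };
}
```

Returning the input untouched when it is already at the current version keeps the function safe to run on every load.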

During processing (processor.ts), tasks are qualified or disqualified based on:

  1. Whether all plottable metrics have scores
  2. Whether results exist for all specified models
  3. Whether score values are non-empty

The qualified data becomes the Data interface (extends TileData), stored in DataStore context.
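
A rough sketch of the three qualification checks, assuming the scores shape shown in the data model ({ [metric]: { [annotator]: { value } } }). The isQualified helper and its parameters are hypothetical; processor.ts may structure this differently.

```typescript
type Scores = Record<string, Record<string, { value: unknown }>>;

interface ResultLike {
  taskId: string;
  modelId: string;
  scores: Scores;
}

// Treats undefined, null, and the empty string as "empty" (an assumption).
function isEmptyValue(value: unknown): boolean {
  return value === undefined || value === null || value === '';
}

// Hypothetical helper: true when a task passes all three checks listed above.
function isQualified(
  taskId: string,
  results: ResultLike[],
  modelIds: string[],
  plottableMetrics: string[],
): boolean {
  return modelIds.every((modelId) => {
    const result = results.find((r) => r.taskId === taskId && r.modelId === modelId);
    if (!result) return false; // check 2: a result must exist for every model
    return plottableMetrics.every((metric) => {
      const byAnnotator = result.scores[metric];
      if (!byAnnotator) return false; // check 1: every plottable metric is scored
      const annotators = Object.keys(byAnnotator);
      // check 3: at least one annotator, and no empty score values
      return annotators.length > 0 && annotators.every((a) => !isEmptyValue(byAnnotator[a]?.value));
    });
  });
}
```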

State Management

Global state is React Context, not Redux:

  • DataStoreProvider (store.tsx): holds Data, taskMap: Map<taskId, Task>, and resultsMap: Map<"taskId::modelId", ModelResult>
    • updateTask(taskId, update) - immutable Map update for task-level changes (flags, task comments)
    • updateResult(taskId, modelId, update) - immutable Map update for model-result-level changes (model comments)
  • ThemeProvider (theme.tsx): Carbon theme toggle (light g10 / dark g90)
  • NotificationProvider (components/notification/): toast messages
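
The immutable Map update behind updateResult can be sketched as below. The composite key format follows the "taskId::modelId" convention above; the standalone function signature and the trimmed ModelResultLike shape are assumptions, since the real updater lives inside DataStoreProvider.

```typescript
interface ModelResultLike {
  taskId: string;
  modelId: string;
  comments?: string[];
}

// Composite key used by resultsMap.
function resultKey(taskId: string, modelId: string): string {
  return `${taskId}::${modelId}`;
}

// Returns a new Map so React sees a changed reference; the input is never mutated.
function updateResult(
  resultsMap: Map<string, ModelResultLike>,
  taskId: string,
  modelId: string,
  update: Partial<ModelResultLike>,
): Map<string, ModelResultLike> {
  const key = resultKey(taskId, modelId);
  const current = resultsMap.get(key);
  if (!current) return resultsMap; // unknown key: keep the same reference
  const next = new Map(resultsMap);
  next.set(key, { ...current, ...update });
  return next;
}
```

Copying the Map on every change is what lets context consumers detect the update by reference equality.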

Local state: each view manages its own filters, selections, and UI state via useState.

Web Workers: ModelBehavior view spawns a filter worker for expensive filtering operations to avoid blocking the UI thread.

Task-Type Registry

src/task-types/index.ts exports taskTypeRegistry:

const taskTypeRegistry: Record<
  string,
  { TaskView: ComponentType; Copier: ComponentType }
> = {
  qa: { TaskView: QATaskView, Copier: QACopier },
  generation: { TaskView: GenerationTaskView, Copier: GenerationCopier },
  rag: { TaskView: RAGTaskView, Copier: RAGCopier },
  tool_calling: { TaskView: ToolCallingTaskView, Copier: ToolCallingCopier },
  agentic: { TaskView: AgenticTaskView, Copier: AgenticCopier },
};

Task.tsx and TaskCopier.tsx look up the component via taskTypeRegistry[task.taskType], with no if/else chains. Unknown task types degrade gracefully to null.
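
The lookup-with-fallback pattern can be shown without React. In this sketch the registry entries are plain functions returning strings instead of components, and demoRegistry and renderTask are hypothetical names used only for illustration.

```typescript
// Stand-ins for React components: each "view" just returns a string.
type Renderer = (taskId: string) => string;

interface TaskTypeEntry {
  TaskView: Renderer;
  Copier: Renderer;
}

const demoRegistry: Record<string, TaskTypeEntry> = {
  qa: {
    TaskView: (taskId) => `QA view for ${taskId}`,
    Copier: (taskId) => `QA copy of ${taskId}`,
  },
  agentic: {
    TaskView: (taskId) => `Agentic view for ${taskId}`,
    Copier: (taskId) => `Agentic copy of ${taskId}`,
  },
};

// One lookup replaces an if/else chain; unknown types yield null.
function renderTask(taskType: string, taskId: string): string | null {
  const entry: TaskTypeEntry | undefined = demoRegistry[taskType];
  return entry ? entry.TaskView(taskId) : null;
}
```

Supporting a new task type then means adding one registry entry rather than touching every dispatch site.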

Comment System

Comments live at two levels:

  • task.comments - task-level observations shared across all models
  • result.comments - per-model observations (e.g. noting an acceptable-but-different tool call)

Task.tsx routes a new comment to the correct level by inspecting the provenance component string: any component containing :: is model-scoped (written to updateResult); others are task-scoped (written to updateTask).

Provenance component string convention

| Component string pattern | Meaning | Scope |
| --- | --- | --- |
| input / messages | Input or conversation area | Task |
| document_{id} | Retrieved context document | Task |
| target | Ground-truth target area | Task |
| {modelId}::evaluation::response | Model response text | Model |
| {modelId}::evaluation::prediction | Model prediction (tool calling) | Model |
| {modelId}::evaluation::scores::{metric}::{annotator} | Specific score cell | Model |
| {modelId}::steps::{stepId} | Specific execution step | Model |
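
A minimal sketch of the routing rule: a component string containing :: is model-scoped and its first segment is the model ID. The resolveCommentScope helper is hypothetical, and it assumes model IDs never contain :: themselves.

```typescript
type CommentScope = { scope: 'task' } | { scope: 'model'; modelId: string };

// Hypothetical helper mirroring the convention in the table above.
function resolveCommentScope(component: string): CommentScope {
  const separatorIndex = component.indexOf('::');
  if (separatorIndex === -1) {
    return { scope: 'task' }; // would be written via updateTask
  }
  // Everything before the first "::" is taken as the model ID.
  return { scope: 'model', modelId: component.slice(0, separatorIndex) }; // updateResult
}
```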

Floating selection button

After any mouseup on the .taskViewWrapper div, Task.tsx captures viewport coordinates. If provenance is also set (text was selected), SelectionCommentButton renders as a position: fixed button near the cursor. Clicking it opens AddCommentModal with provenance pre-filled. Clicking anywhere else clears the coords and dismisses the button.

provenanceTag.ts

Single source of truth for deriving display pills from a provenance component string. Returns { primary: [label, carbonType], detail?: [label1, label2] }. The detail pair is only set for score cells (metric + annotator) and step references ("step" + stepId). All three comment modals (AddCommentModal, EditCommentModal, CommentsViewer) import from here.
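
The return shape can be illustrated with a small sketch. Only the { primary, detail? } structure and the two cases that set detail come from the description above; the specific labels, the Carbon tag colours, and the deriveProvenanceTag name are assumptions and may not match provenanceTag.ts.

```typescript
interface ProvenanceTag {
  primary: [string, string]; // [label, Carbon tag type]
  detail?: [string, string];
}

// Hypothetical derivation; labels and colours are illustrative only.
function deriveProvenanceTag(component: string): ProvenanceTag {
  const parts = component.split('::');
  if (parts.length === 1) {
    // Task-scoped strings: input, messages, target, document_{id}
    if (component.startsWith('document_')) return { primary: ['document', 'teal'] };
    return { primary: [component, 'gray'] };
  }
  // Model-scoped: {modelId}::{area}::{kind}::{metric}::{annotator}
  const [, area, kind, metric, annotator] = parts;
  if (area === 'steps') {
    // For steps, the third segment is the step ID.
    return { primary: ['steps', 'purple'], detail: ['step', kind ?? ''] };
  }
  if (kind === 'scores' && metric !== undefined && annotator !== undefined) {
    return { primary: ['scores', 'blue'], detail: [metric, annotator] };
  }
  return { primary: [kind ?? area ?? component, 'green'] };
}
```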

CommentFinding

Optional structured annotation attached to a TaskComment. Discriminated on type:

  • tool_call - points to the correct function name/arguments
  • query - records the correct retrieval query
  • output - records a corrected reference output
  • note - free-form structured note

CommentFindingEditor renders type-appropriate fields filtered by task.taskType. Findings are stored in the comment but editing them post-creation is out of scope (display-only in EditCommentModal).

Routing

| Route | Server/Client | What it does |
| --- | --- | --- |
| / | Server | Renders Home with navigation cards |
| /visualize | Client | Upload wizard → analysis view |
| /examples | Server | Loads data/ dir, renders dataset grid |
| /examples/[id] | Server | Loads specific dataset, renders analysis |

The analysis hub (Example view) provides 7 tabs:

  1. Data Characteristics - dataset statistics, distributions
  2. Predictions Table - filterable evaluation records
  3. Annotator Behavior - inter-annotator agreement heatmap
  4. Performance Overview - aggregate metrics with rankings
  5. Model Behavior - per-metric distribution analysis
  6. Model Comparator - head-to-head comparison
  7. Metric Behavior - cross-metric correlation

Clicking a task in any table opens a Task modal overlay (views/task/Task.tsx) which dispatches to the type-specific TaskView from the registry.

Key Technical Details

Data Processing Pipeline

migrateData(raw) in migrator.ts → validateInputData(data) in validators.ts → processData(raw) in processor.ts, which returns [Data, DisqualifiedTasks, Notification[]]

  • Migration runs first, before camelCaseKeys, so it operates on raw snake_case fields
  • Processor validates every result has scores for all plottable metrics, ensures every task has results from all specified models, sorts categorical metric values, computes metric ranges
  • bin() in utilities/metrics.ts maps a numeric value to its [start, end] bucket string using the metric's range: [min, max, step]. Values below min map to <min and values above max map to >max. Both render as normal category bars in Carbon Charts, preventing unbounded raw-value bins from outliers
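
A sketch of the binning rule under stated assumptions: the exact label format ("[start, end]", "<min", ">max" with the numeric bounds substituted) and the handling of value === max are guesses, not the implementation in utilities/metrics.ts. The sketch also does no float rounding, so it is best read with integer ranges.

```typescript
// range = [min, max, step], as described above.
function bin(value: number, range: [number, number, number]): string {
  const [min, max, step] = range;
  if (value < min) return `<${min}`; // outlier below the range
  if (value > max) return `>${max}`; // outlier above the range
  // Clamp the index so value === max lands in the last bucket
  // instead of opening a new one past the range.
  const lastIndex = Math.max(Math.ceil((max - min) / step) - 1, 0);
  const index = Math.min(Math.floor((value - min) / step), lastIndex);
  const start = min + index * step;
  const end = Math.min(start + step, max);
  return `[${start}, ${end}]`;
}
```

With range [0, 10, 2], a value of 3 falls in "[2, 4]", while -1 and 42 collapse into "<0" and ">10", so a single extreme score cannot add an arbitrary new category to the chart.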

Input Validation

validateInputData(data) in validators.ts returns { valid, reasons[] }

  • Checks required fields on models, metrics, tasks, results
  • Validates metric type constraints (categorical needs values, numerical can't use majority)
  • QA tasks must reference documents
  • Every value in a categorical metric must carry a numeric_value (camelCase: numericValue)
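
The metric-level checks might look roughly like this. The { valid, reasons } return shape follows the description above; the validateMetric name, the aggregator field, and the message wording are assumptions for illustration.

```typescript
interface RawMetricValue {
  value: string;
  numeric_value?: number;
}

interface RawMetric {
  name: string;
  type: 'numerical' | 'categorical' | 'text';
  aggregator?: string; // assumed field name
  values?: RawMetricValue[];
}

// Hypothetical per-metric validator operating on raw snake_case input.
function validateMetric(metric: RawMetric): { valid: boolean; reasons: string[] } {
  const reasons: string[] = [];
  if (metric.type === 'categorical') {
    if (!metric.values || metric.values.length === 0) {
      reasons.push(`Metric "${metric.name}" is categorical but defines no values`);
    }
    for (const entry of metric.values ?? []) {
      if (typeof entry.numeric_value !== 'number' || Number.isNaN(entry.numeric_value)) {
        reasons.push(`Metric "${metric.name}" value "${entry.value}" is missing numeric_value`);
      }
    }
  }
  if (metric.type === 'numerical' && metric.aggregator === 'majority') {
    reasons.push(`Metric "${metric.name}" is numerical and cannot use the majority aggregator`);
  }
  return { valid: reasons.length === 0, reasons };
}
```

Collecting every reason instead of stopping at the first failure lets the upload step report all problems in one pass.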

Why numeric_value is required on categorical metric values

The entire aggregation and sorting pipeline operates on numbers, not label strings. castToNumber() in utilities/metrics.ts maps a string label to its numericValue so that mean, median, and inter-annotator agreement distance can be computed arithmetically. computeMajority() uses Math.abs(castToNumber(a) - castToNumber(b)) to decide whether the top-two annotator choices are "close" (high agreement) or "far apart" (no agreement). sortMetricValues() in processor.ts sorts the metric's value list by numericValue so UI dropdowns, chart axes, and filter ranges reflect the researcher's intended ordering. compareMetricAggregatedValues() in metrics.ts uses numericValue to order chart bars for majority-aggregated metrics.

Without numericValue, every one of these paths falls back to parseFloat(label), which returns NaN for any non-numeric string. That silently corrupts aggregate statistics, chart orderings, and agreement calculations. The validator rejects files with missing numericValue on categorical entries so the problem is surfaced at load time rather than producing incorrect visualizations.
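
The failure mode is easy to reproduce in isolation. The sketch below is a simplified stand-in for castToNumber() (the real signature in utilities/metrics.ts may differ) plus a plain mean, showing how a single unmapped label turns an aggregate into NaN.

```typescript
interface MetricValue {
  value: string;
  numericValue?: number;
}

// Simplified stand-in: map a label to its numericValue, else fall back to parseFloat.
function castToNumber(label: string | number, values: MetricValue[]): number {
  if (typeof label === 'number') return label;
  const match = values.find((entry) => entry.value === label);
  if (match && match.numericValue !== undefined) return match.numericValue;
  return parseFloat(label); // NaN for labels such as "good"
}

function mean(numbers: number[]): number {
  return numbers.reduce((sum, n) => sum + n, 0) / numbers.length;
}

const withNumeric: MetricValue[] = [
  { value: 'poor', numericValue: 0 },
  { value: 'good', numericValue: 2 },
];
const withoutNumeric: MetricValue[] = [{ value: 'poor' }, { value: 'good' }];

const annotatorLabels = ['poor', 'good'];
const healthyMean = mean(annotatorLabels.map((l) => castToNumber(l, withNumeric)));
const corruptedMean = mean(annotatorLabels.map((l) => castToNumber(l, withoutNumeric)));
```

Here healthyMean is 1 while corruptedMean is NaN, and nothing throws along the way, which is why the check belongs in the validator.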

Convention: assign numericValue so that higher = better. sortMetricValues() sorts ascending by numericValue, so values[0] becomes minValue (worst) and values[last] becomes maxValue (best). PerformanceOverview normalises scores as (score - minValue) / (maxValue - minValue) and ranks models with higher scores first. Example: { value: "poor", numeric_value: 0 }, { value: "acceptable", numeric_value: 1 }, { value: "good", numeric_value: 2 }.
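
The convention can be traced end to end in a short sketch. The function bodies below are illustrative rather than the code in processor.ts or PerformanceOverview, and the guard for minValue === maxValue is an added assumption.

```typescript
interface MetricValue {
  value: string;
  numericValue: number;
}

// Ascending sort: values[0] is the worst label, the last entry is the best.
function sortMetricValues(values: MetricValue[]): MetricValue[] {
  return [...values].sort((a, b) => a.numericValue - b.numericValue);
}

// (score - minValue) / (maxValue - minValue), guarding against a zero-width range.
function normalise(score: number, minValue: number, maxValue: number): number {
  if (maxValue === minValue) return 0;
  return (score - minValue) / (maxValue - minValue);
}

const sortedValues = sortMetricValues([
  { value: 'good', numericValue: 2 },
  { value: 'poor', numericValue: 0 },
  { value: 'acceptable', numericValue: 1 },
]);
const minValue = sortedValues[0]!.numericValue;
const maxValue = sortedValues[sortedValues.length - 1]!.numericValue;
const normalisedAcceptable = normalise(1, minValue, maxValue);
```

With the example values, "poor" normalises to 0, "acceptable" to 0.5, and "good" to 1, so a higher normalised score always means a better label.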

JSON Key Convention

Input files use snake_case. The app converts to camelCase on load (camelCaseKeys in objects.ts) and back to snake_case on export (snakeCaseKeys). Migration runs before camelCaseKeys.
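
A minimal recursive sketch of the two conversions. The function names match those mentioned above, but the bodies are assumptions: this version converts every object key it finds, whereas the real helpers in objects.ts may skip keys that are user data (metric names, annotator IDs).

```typescript
function toCamel(key: string): string {
  return key.replace(/_([a-z0-9])/g, (_match, char: string) => char.toUpperCase());
}

function toSnake(key: string): string {
  return key.replace(/[A-Z]/g, (char) => `_${char.toLowerCase()}`);
}

// Walk arrays and plain objects, converting keys and leaving values untouched.
function convertKeys(value: unknown, convert: (key: string) => string): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => convertKeys(item, convert));
  }
  if (value !== null && typeof value === 'object') {
    const source = value as Record<string, unknown>;
    const result: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      result[convert(key)] = convertKeys(source[key], convert);
    }
    return result;
  }
  return value; // primitives pass through
}

const camelCaseKeys = (value: unknown): unknown => convertKeys(value, toCamel);
const snakeCaseKeys = (value: unknown): unknown => convertKeys(value, toSnake);
```

Since migration runs before camelCaseKeys, a migration step written against task_id or model_response sees those names exactly as they appear in the input file.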

Styling

  • SCSS Modules: one .module.scss per component, co-located
  • Carbon Design tokens for spacing, colors, typography
  • Global styles in src/app/global.scss
  • Theme: Carbon g10 (light) and g90 (dark), toggled via header
  • Sass uses @use (not @import) for Carbon v11 / Turbopack compatibility
  • Carbon font-face disabled via $css--font-face: false in global.scss (Turbopack can't resolve ~ prefix)

Carbon Component Gotchas

TabPanel renders all panels simultaneously (hidden, not unmounted)

Carbon's TabPanel uses the HTML hidden attribute to hide inactive panels; it does NOT lazy-mount or unmount them. All tab panels and their full component trees are live in the DOM at all times. Consequences:

  • Any component with a non-unique id prop will have duplicate DOM IDs across tabs. This silently breaks components that rely on id for internal DOM wiring (labels, aria associations, focus management). The browser uses the first matching element, so clicks on a selector in a later tab quietly target the hidden first tab's element instead.
  • Rule: every Carbon component that takes an id prop and is used in multiple tabs must have a globally unique id. Pattern used here: {view-name}-{component-name}, e.g. model-behavior-model-selector, metric-behavior-model-selector.
  • Components confirmed affected: FilterableMultiSelect, Toggle, Select. Assume all interactive Carbon components are affected.
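
The naming rule is simple enough to encode in a helper, and a duplicate check can guard against regressions. Both functions below are hypothetical and not part of the codebase.

```typescript
// Builds an id following the {view-name}-{component-name} pattern.
function carbonId(viewName: string, componentName: string): string {
  return `${viewName}-${componentName}`;
}

// Returns every id that occurs more than once, e.g. across all mounted tab panels.
function findDuplicateIds(ids: string[]): string[] {
  const seen = new Set<string>();
  const duplicates = new Set<string>();
  for (const id of ids) {
    if (seen.has(id)) duplicates.add(id);
    seen.add(id);
  }
  return Array.from(duplicates);
}

const tabIds = [
  carbonId('model-behavior', 'model-selector'),
  carbonId('metric-behavior', 'model-selector'),
];
```

Prefixing with the view name means two tabs can reuse the same selector component without colliding on id.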

FilterableMultiSelect: controlled vs uncontrolled

  • Use selectedItems (controlled) rather than initialSelectedItems (uncontrolled) when the parent needs to own the selection state (e.g. to filter data). The uncontrolled path fires onChange via a useEffect guarded by an isMounted ref; under React StrictMode's double-invoke behaviour this guard can leave the component unresponsive.
  • Always add a null guard to itemToString: (item) => (item ? item.name : ''). Carbon passes null to itemToString in some internal code paths (e.g. when clearing the filter input); without the guard this throws and corrupts Downshift's internal state.

@carbon/charts-react/styles.css import

Import this stylesheet once, at the highest shared layout level (e.g. app/layout.tsx or global.scss), not per-component. All rules are scoped under .cds--chart-holder so there is no global style pollution, but importing it in multiple component files creates redundant CSS bundles.

Deployment

  • Dockerfile for containerized deployment
  • next.config.js sets output: 'standalone' for minimal Docker images
  • Security headers configured (CSP, HSTS, X-Frame-Options)
  • Can also deploy on HuggingFace Spaces

Known Technical Debt

  1. Type safety gaps: any types on Task.input and Model.trainingDetails; task.annotations is untyped (context quality scores on RAG/QA documents, distinct from result.scores; see the TODO comment in types.ts)
  2. Pre-existing ESLint findings: 19 errors and 15 warnings from react-hooks/exhaustive-deps, react-compiler rules, and setState-in-effect patterns; these need case-by-case review
  3. ESLint 10 blocked: eslint-plugin-react (bundled by eslint-config-next) uses the deprecated getFilename API removed in ESLint 10; pinned to v9 until upstream fixes it
  4. No client-side component tests: all tests are pure utility/logic tests; interactive Carbon component bugs are only caught manually