# InspectorRAGet: Architecture
> Living document. Captures the current state of the codebase. Update as architecture evolves.
## Overview
InspectorRAGet is a client-side introspection platform for LLM evaluation. Users upload JSON files containing evaluation data (models, metrics, tasks, and model results) and explore results through aggregate and instance-level visualizations. The platform does not execute experiments; it is purely analytical.
Built with Next.js 16 (App Router), React 18, TypeScript 5.9, and IBM Carbon Design System.
## High-Level Data Flow
```
                  ┌─────────────────────┐
                  │   JSON Input File   │
                  │   (user upload or   │
                  │   data/ directory)  │
                  └─────────┬───────────┘
                            │
                  ┌─────────▼───────────┐
                  │     migrator.ts     │
                  │  Schema migration   │
                  │   (v1 → v2, etc.)   │
                  └─────────┬───────────┘
                            │
                  ┌─────────▼───────────┐
                  │    validators.ts    │
                  │  Schema validation  │
                  └─────────┬───────────┘
                            │
                  ┌─────────▼───────────┐
                  │    processor.ts     │
                  │  Qualify/disqualify │
                  │   tasks by metric   │
                  │    completeness     │
                  └─────────┬───────────┘
                            │
                  ┌─────────▼───────────┐
                  │  DataStore context  │
                  │     (store.tsx)     │
                  │   Data + taskMap    │
                  │    + resultsMap     │
                  └─────────┬───────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
┌───────▼──────┐    ┌───────▼──────┐    ┌───────▼──────┐
│  Aggregate   │    │  Instance    │    │  Annotator   │
│  Views       │    │  Views       │    │  Views       │
│  (overview,  │    │  (task       │    │  (agreement, │
│   model,     │    │   detail,    │    │   behavior)  │
│   metric)    │    │   per type)  │    │              │
└──────────────┘    └──────────────┘    └──────────────┘
```
## Directory Structure
```
InspectorRAGet/
├── src/
│   ├── app/  # Next.js App Router - thin page shells
│   │   ├── layout.tsx  # Root: ThemeProvider → NotificationProvider → DataStoreProvider
│   │   ├── page.tsx  # / - Home landing page
│   │   ├── visualize/  # /visualize - Upload and analyze
│   │   ├── examples/  # /examples - Browse pre-loaded datasets
│   │   │   └── [example_id]/  # /examples/:id - Specific dataset analysis
│   │
│   ├── views/  # Page-level container components
│   │   ├── home/  # Landing page cards
│   │   ├── on-board/  # Multi-step upload wizard (instructions → upload → verify)
│   │   ├── example/  # Main analysis hub - 7-tab interface
│   │   ├── examples/  # Grid of dataset tiles
│   │   ├── visualization/  # Onboard → Example router
│   │   ├── task/  # Instance-level task viewer (modal overlay)
│   │   │   └── Task.tsx  # Dispatches to type-specific TaskView via registry
│   │   ├── performance-overview/  # Aggregate metric tables + charts
│   │   ├── model-behavior/  # Per-metric distribution analysis
│   │   ├── metric-behavior/  # Cross-metric correlation
│   │   ├── model-comparator/  # Head-to-head model comparison
│   │   ├── data-characteristics/  # Dataset statistics
│   │   ├── annotator-behavior/  # Inter-annotator agreement
│   │   ├── predictions-table/  # Filterable evaluation table
│   │   ├── tasks-table/  # Task listing with filters
│   │   ├── annotations-table/  # Per-task metric scores
│   │   └── document/  # Document viewer
│   │
│   ├── task-types/  # Vertical slice per evaluation type
│   │   ├── index.ts  # Registry: taskTypeRegistry maps type string → { TaskView, Copier }
│   │   ├── qa/  # Single-turn QA with retrieved context
│   │   │   ├── types.ts  # RetrievedDocument, RetrievedDocumentAnnotation
│   │   │   ├── TaskView.tsx  # Input + contexts + per-model response + evaluations/steps tabs
│   │   │   └── Copier.tsx
│   │   ├── generation/  # Open-ended text/JSON generation
│   │   │   ├── TaskView.tsx  # Input + per-model response + evaluations/steps tabs
│   │   │   └── Copier.tsx
│   │   ├── rag/  # Multi-turn retrieval conversation
│   │   │   ├── types.ts  # Message union (SystemMessage, UserMessage, AssistantMessage, ToolMessage, …)
│   │   │   ├── TaskView.tsx  # Conversation thread + per-model response + evaluations/steps tabs
│   │   │   ├── Copier.tsx
│   │   │   └── components/
│   │   │       ├── ChatLine.tsx  # Renders a single OpenAI-format message (status ring, retries footer)
│   │   │       ├── Avatar.tsx  # Role avatar with status ring (pass/warn/fail outline)
│   │   │       └── DocumentsViewer.tsx
│   │   ├── tool_calling/  # Function/tool calling evaluation
│   │   │   ├── types.ts  # ToolDefinition (OpenAI JSON Schema format)
│   │   │   ├── TaskView.tsx  # Conversation + available tools panel + prediction/target/evaluations/steps
│   │   │   └── Copier.tsx
│   │   └── agentic/  # Goal-directed multi-turn agent execution
│   │       ├── TaskView.tsx  # Goal + initial state + target state + execution thread + evaluations/steps
│   │       └── Copier.tsx
│   │
│   ├── components/  # Reusable UI components
│   │   ├── header/  # App header with nav and theme toggle
│   │   ├── filters/  # Generic filter controls
│   │   ├── expression-builder/  # Advanced filter expression builder
│   │   ├── selectors/  # Model, Metric, Aggregator selectors
│   │   ├── evaluations/  # EvaluationsPanel - shared human + algorithmic score tables
│   │   ├── trace/  # Execution trace: TraceGroup + TraceItem (collapsible cards for invocation/tool_execution/observation events)
│   │   ├── comments/  # Task commenting system (see Comment System section below)
│   │   ├── notification/  # Toast notifications (context provider)
│   │   ├── avatar/  # User/agent avatars
│   │   ├── task-tile/  # Task summary card
│   │   ├── example-tile/  # Dataset summary card
│   │   └── disabled/  # Disabled feature placeholder
│   │
│   ├── hooks/
│   │   ├── useBackButton.ts  # Browser back navigation
│   │   ├── useStorage.ts  # localStorage persistence
│   │   └── usePrevious.ts  # Previous render value
│   │
│   ├── utilities/
│   │   ├── strings.ts  # Hashing, truncation, search matching
│   │   ├── colors.ts  # Color scale generation
│   │   ├── objects.ts  # camelCase/snakeCase key conversion
│   │   ├── aggregators.ts  # Mean, median, majority, weighted aggregators
│   │   ├── metrics.ts  # Metric helper functions
│   │   ├── selectors.ts  # Mouse selection extraction
│   │   ├── expressions.ts  # Expression evaluation for advanced filters
│   │   ├── correlation.ts  # Statistical correlation
│   │   ├── significance.ts  # Statistical significance tests
│   │   ├── highlighter.ts  # Text overlap highlighting
│   │   └── time.ts  # Duration calculation
│   │
│   ├── workers/
│   │   └── filter.ts  # Web Worker for background data filtering
│   │
│   ├── types.ts  # Core TypeScript interfaces (re-exports task-type-specific types)
│   ├── store.tsx  # DataStoreProvider (React Context)
│   ├── migrator.ts  # Versioned schema migration chain (v1 → v2 → …)
│   ├── processor.ts  # Data qualification pipeline
│   ├── exporter.ts  # Export pipeline (split from processor.ts)
│   ├── validators.ts  # Input schema validation
│   ├── dataloader.ts  # Server-side data/ directory loader
│   └── theme.tsx  # ThemeProvider (Carbon g10/g90)
│
├── converters/  # Dataset converters
│   ├── bfcl/  # Berkeley Function Calling Leaderboard (single-turn and multi-turn, V3/V4)
│   └── acebench/  # ACEBench (tool-calling and agentic categories)
├── data/  # Pre-loaded example datasets (JSON, schema v2)
├── notebooks/  # Integration notebooks (Ragas, LM Eval, HuggingFace, BFCL)
├── public/  # Static assets (favicon, license)
└── docs/  # Documentation (this file)
```
## Core Data Model
Defined in `src/types.ts`. The input JSON (schema v2) has this structure:
```
RawData
├── schema_version?: number  # 2 = current; absent or 1 = legacy (auto-migrated)
├── name?: string
├── models: Model[]  # LLMs being evaluated
│   └── { modelId, name, owner, ... }
├── metrics: Metric[]  # Evaluation criteria
│   └── { name, type: numerical|categorical|text, author: human|algorithm, ... }
├── documents?: RetrievedDocument[]  # Corpus documents (QA/RAG tasks)
│   └── { documentId, text, title?, url?, score? }
├── filters?: string[]  # Task fields available for filtering
├── tasks: Task[]  # Individual evaluation instances
│   └── { taskId, taskType: qa|generation|rag|tool_calling|agentic,
│         input, targets?: TaskTarget[], tools?: ToolDefinition[],
│         flagged?, comments?: TaskComment[], annotations? }
└── results: ModelResult[]  # Model outputs + metric scores
    └── { taskId, modelId, output: Message[], scores: { [metric]: { [annotator]: { value } } },
          contexts?, comments?: TaskComment[] }
```
`output` is always a `Message[]`. For single-inference task types (qa, generation, rag, tool_calling) it is a one-element array. For `agentic` tasks it is the full execution thread: interleaved user, assistant, and tool messages across all turns. Trace events live on individual messages as `message.trace`, not at the result level.
### Key type unions
**`Message`** (OpenAI-compatible message shape):
- `role: 'system' | 'user' | 'assistant' | 'tool'`
- `content?: string` - text response
- `tool_calls?: ToolCallRecord[]` - tool-calling output (on assistant messages)
- `trace?: TraceEvent[]` - execution trace attached to the assistant message that produced it; each event is discriminated on `type`: `invocation | tool_execution | observation`
- `retries?: MessageRetry[]` - intermediate retry attempts before final output
- `metadata?: Record<string, unknown>` - benchmark-supplied metadata; known keys: `status` (`'pass' | 'fail' | 'warn'`), rendered as a coloured badge in the chat footer, and `statusDefinition` (string), shown as a hover tooltip on that badge
**`ModelResult`** carries an optional `metadata?: Record<string, unknown>` bag for benchmark-supplied per-result diagnostics. Known key: `error`, shaped as `{ kind: 'text' | 'structured', context: unknown }`, used by the BFCL agentic converter to surface structured state-diff details.
**`TaskTarget`** (discriminated on `type`):
- `{ type: 'text'; value: string }` - most task types
- `{ type: 'tool_calls'; calls: ToolCallRecord[] }` - tool-calling ground truth
- `{ type: 'state'; value: Record<string, unknown> }` - agentic expected final environment state
- `{ type: 'image'; url: string }` - multimodal (future)
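
As an illustration of how the `type` discriminant can be consumed, here is a minimal sketch. The union follows the list above; `ToolCallRecord` is reduced to a hypothetical `{ name, arguments }` pair, and `describeTarget` is an illustrative helper rather than a function from the codebase.

```typescript
// Hypothetical, simplified stand-in for the real ToolCallRecord in src/types.ts.
interface ToolCallRecord {
  name: string;
  arguments: Record<string, unknown>;
}

type TaskTarget =
  | { type: 'text'; value: string }
  | { type: 'tool_calls'; calls: ToolCallRecord[] }
  | { type: 'state'; value: Record<string, unknown> }
  | { type: 'image'; url: string };

// Illustrative helper: narrowing on `type` exposes only that variant's fields.
function describeTarget(target: TaskTarget): string {
  switch (target.type) {
    case 'text':
      return target.value;
    case 'tool_calls':
      return target.calls.map((call) => call.name).join(', ');
    case 'state':
      return JSON.stringify(target.value);
    case 'image':
      return target.url;
  }
}
```

Because the switch covers every variant, adding a new target type to the union surfaces as a compile error wherever a case is missing.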
**`TraceEvent`** (discriminated on `type`):
- `invocation` - an intermediate LLM call within a turn (before the accepted output)
- `tool_execution` - environment response(s) following an intermediate invocation
- `observation` - runner feedback after a decode failure, empty response, or forced termination
- All events carry an optional `label` (e.g. `"step_2"` matching the inference log key) and `content` string
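
Since trace events hang off individual messages rather than the result, a consumer that wants the whole trace has to walk `output`. The sketch below shows one way to do that; the types are trimmed to the fields described above, and `collectTrace` is a hypothetical helper, not part of the codebase.

```typescript
// Trimmed-down shapes based on the descriptions above (not the full definitions).
interface TraceEvent {
  type: 'invocation' | 'tool_execution' | 'observation';
  label?: string;
  content?: string;
}

interface Message {
  role: 'system' | 'user' | 'assistant' | 'tool';
  content?: string;
  trace?: TraceEvent[];
}

// Hypothetical helper: gather every trace event in thread order.
function collectTrace(output: Message[]): TraceEvent[] {
  const events: TraceEvent[] = [];
  for (const message of output) {
    if (message.trace) {
      events.push(...message.trace);
    }
  }
  return events;
}
```

For single-inference task types this walks a one-element array; for `agentic` results it covers the full execution thread.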
### Schema migration
`migrator.ts` runs before `validators.ts` on every load. The migration chain is:
- **v1 → v2:** renames legacy task types (`rag` single-turn → `qa`, `rag` multi-turn → `rag`, `text_generation`/`json_generation` → `generation`, `chat` → `rag`); wraps `model_response` string → `output: [{ role: 'assistant', content }]`; renames `annotations` → `scores`; renames `evaluations` array → `results`
Exported files are always stamped with `schema_version: CURRENT_SCHEMA_VERSION`.
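
To make the result-level part of the v1 to v2 step concrete, here is a rough sketch of the field rewrites listed above. It is not the code in `migrator.ts`: the `V1Evaluation` and `V2Result` shapes, the `task_id`/`model_id` field names, and the value `2` for `CURRENT_SCHEMA_VERSION` are assumptions, and task-type renaming is left out because the rule for telling single-turn from multi-turn `rag` is not described here.

```typescript
// Hypothetical v1/v2 shapes, reduced to the fields named in the migration notes.
interface V1Evaluation {
  task_id: string;
  model_id: string;
  model_response: string;
  annotations: Record<string, unknown>;
}

interface V2Result {
  task_id: string;
  model_id: string;
  output: { role: 'assistant'; content: string }[];
  scores: Record<string, unknown>;
}

// Assumed value; the real constant lives in the codebase.
const CURRENT_SCHEMA_VERSION = 2;

function migrateV1ToV2(raw: { evaluations: V1Evaluation[] }): {
  schema_version: number;
  results: V2Result[];
} {
  return {
    schema_version: CURRENT_SCHEMA_VERSION,
    // The `evaluations` array becomes `results`.
    results: raw.evaluations.map(
      (evaluation): V2Result => ({
        task_id: evaluation.task_id,
        model_id: evaluation.model_id,
        // Wrap the plain string response in a one-element message array.
        output: [{ role: 'assistant', content: evaluation.model_response }],
        // `annotations` is renamed to `scores`; the payload is carried over as-is.
        scores: evaluation.annotations,
      }),
    ),
  };
}
```

The sketch works on snake_case keys because migration runs on the raw input, before any key conversion.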
During processing (`processor.ts`), tasks are qualified or disqualified based on:
1. Whether all plottable metrics have scores
2. Whether results exist for all specified models
3. Whether score values are non-empty
The qualified data becomes the `Data` interface (extends `TileData`), stored in `DataStore` context.
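
The three qualification criteria can be read as a single predicate per task. The following is a simplified sketch under that reading, not the logic in `processor.ts`; `ResultLike` and `isQualified` are hypothetical names, and "non-empty" is assumed to mean not `undefined`, `null`, or an empty string.

```typescript
// Hypothetical, trimmed-down result shape (camelCase, as after loading).
interface ResultLike {
  taskId: string;
  modelId: string;
  scores: Record<string, Record<string, { value: unknown }>>;
}

function isQualified(
  taskId: string,
  results: ResultLike[],
  modelIds: string[],
  plottableMetrics: string[],
): boolean {
  return modelIds.every((modelId) => {
    const result = results.find(
      (candidate) => candidate.taskId === taskId && candidate.modelId === modelId,
    );
    // Criterion 2: a result must exist for every specified model.
    if (!result) return false;

    return plottableMetrics.every((metric) => {
      // Criterion 1: every plottable metric must have scores.
      const byAnnotator = result.scores[metric];
      if (!byAnnotator) return false;

      // Criterion 3: score values must be non-empty (assumed definition).
      const entries = Object.values(byAnnotator);
      return (
        entries.length > 0 &&
        entries.every(
          (entry) => entry.value !== undefined && entry.value !== null && entry.value !== '',
        )
      );
    });
  });
}
```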
## State Management
**Global state** is React Context, not Redux:
- `DataStoreProvider` (`store.tsx`): holds `Data`, `taskMap: Map<taskId, Task>`, and `resultsMap: Map<"taskId::modelId", ModelResult>`
  - `updateTask(taskId, update)`: immutable Map update for task-level changes (flags, task comments)
  - `updateResult(taskId, modelId, update)`: immutable Map update for model-result-level changes (model comments)
- `ThemeProvider` (`theme.tsx`): Carbon theme toggle (light g10 / dark g90)
- `NotificationProvider` (`components/notification/`): toast messages
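
The composite `resultsMap` key and the "immutable Map update" idea can be sketched outside React as below. This is an illustration only: the real `updateResult` lives inside `DataStoreProvider` and takes no map argument, and `ModelResultLike` and `resultKey` are hypothetical names.

```typescript
// Hypothetical, trimmed-down result shape.
interface ModelResultLike {
  taskId: string;
  modelId: string;
  comments?: string[];
}

// Key format taken from the resultsMap description: "taskId::modelId".
const resultKey = (taskId: string, modelId: string): string => `${taskId}::${modelId}`;

// Returns a new Map instead of mutating the old one, so React sees a new reference.
function updateResult(
  resultsMap: ReadonlyMap<string, ModelResultLike>,
  taskId: string,
  modelId: string,
  update: Partial<ModelResultLike>,
): ReadonlyMap<string, ModelResultLike> {
  const key = resultKey(taskId, modelId);
  const existing = resultsMap.get(key);
  if (!existing) return resultsMap; // nothing to update

  const next = new Map(resultsMap);
  next.set(key, { ...existing, ...update });
  return next;
}
```

Copying the Map is what lets context consumers detect the change by reference; mutating it in place would leave the reference unchanged and skip re-renders.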
**Local state**: each view manages its own filters, selections, and UI state via `useState`.
**Web Workers**: `ModelBehavior` view spawns a filter worker for expensive filtering operations to avoid blocking the UI thread.
## Task-Type Registry
`src/task-types/index.ts` exports `taskTypeRegistry`:
```typescript
const taskTypeRegistry: Record<
string,
{ TaskView: ComponentType; Copier: ComponentType }
> = {
qa: { TaskView: QATaskView, Copier: QACopier },
generation: { TaskView: GenerationTaskView, Copier: GenerationCopier },
rag: { TaskView: RAGTaskView, Copier: RAGCopier },
tool_calling: { TaskView: ToolCallingTaskView, Copier: ToolCallingCopier },
agentic: { TaskView: AgenticTaskView, Copier: AgenticCopier },
};
```
`Task.tsx` and `TaskCopier.tsx` look up the component via `taskTypeRegistry[task.taskType]`, with no if/else chains. Unknown task types degrade gracefully to null.
## Comment System
Comments live at two levels:
- **`task.comments`**: task-level observations shared across all models
- **`result.comments`**: per-model observations (e.g. noting an acceptable-but-different tool call)
`Task.tsx` routes a new comment to the correct level by inspecting the provenance component string: any component containing `::` is model-scoped (written to `updateResult`); others are task-scoped (written to `updateTask`).
### Provenance component string convention
| Component string pattern | Meaning | Scope |
| ------------------------------------------------------ | ------------------------------- | ----- |
| `input` / `messages` | Input or conversation area | Task |
| `document_{id}` | Retrieved context document | Task |
| `target` | Ground-truth target area | Task |
| `{modelId}::evaluation::response` | Model response text | Model |
| `{modelId}::evaluation::prediction` | Model prediction (tool calling) | Model |
| `{modelId}::evaluation::scores::{metric}::{annotator}` | Specific score cell | Model |
| `{modelId}::steps::{stepId}` | Specific execution step | Model |
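
The routing rule (a `::` anywhere in the component string means model scope) is small enough to sketch directly. `commentScope` and `modelIdOf` are hypothetical helper names used for illustration; the actual routing sits in `Task.tsx`.

```typescript
type CommentScope = 'task' | 'model';

// Any component string containing "::" is model-scoped; everything else is task-scoped.
function commentScope(component: string): CommentScope {
  return component.includes('::') ? 'model' : 'task';
}

// For model-scoped strings, the model id is the segment before the first "::".
function modelIdOf(component: string): string | undefined {
  if (!component.includes('::')) return undefined;
  return component.split('::')[0];
}
```

For example, `target` resolves to task scope, while `{modelId}::steps::{stepId}` resolves to model scope with the leading segment as the model id. The sketch assumes model ids never contain `::` themselves.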
### Floating selection button
After any `mouseup` on the `.taskViewWrapper` div, `Task.tsx` captures viewport coordinates. If provenance is also set (text was selected), `SelectionCommentButton` renders as a `position: fixed` button near the cursor. Clicking it opens `AddCommentModal` with provenance pre-filled. Clicking anywhere else clears the coords and dismisses the button.
### `provenanceTag.ts`
Single source of truth for deriving display pills from a provenance component string. Returns `{ primary: [label, carbonType], detail?: [label1, label2] }`. The `detail` pair is only set for score cells (metric + annotator) and step references ("step" + stepId). All three comment modals (`AddCommentModal`, `EditCommentModal`, `CommentsViewer`) import from here.
### `CommentFinding`
Optional structured annotation attached to a `TaskComment`. Discriminated on `type`:
- `tool_call`: points to the correct function name/arguments
- `query`: records the correct retrieval query
- `output`: records a corrected reference output
- `note`: free-form structured note
`CommentFindingEditor` renders type-appropriate fields filtered by `task.taskType`. Findings are stored in the comment but editing them post-creation is out of scope (display-only in `EditCommentModal`).
## Routing
| Route | Server/Client | What it does |
| ---------------- | ------------- | ---------------------------------------- |
| `/` | Server | Renders Home with navigation cards |
| `/visualize` | Client | Upload wizard → analysis view |
| `/examples` | Server | Loads `data/` dir, renders dataset grid |
| `/examples/[id]` | Server | Loads specific dataset, renders analysis |
The analysis hub (`Example` view) provides 7 tabs:
1. **Data Characteristics**: dataset statistics, distributions
2. **Predictions Table**: filterable evaluation records
3. **Annotator Behavior**: inter-annotator agreement heatmap
4. **Performance Overview**: aggregate metrics with rankings
5. **Model Behavior**: per-metric distribution analysis
6. **Model Comparator**: head-to-head comparison
7. **Metric Behavior**: cross-metric correlation
Clicking a task in any table opens a **Task modal overlay** (`views/task/Task.tsx`) which dispatches to the type-specific `TaskView` from the registry.
## Key Technical Details
### Data Processing Pipeline
`migrator.ts`: `migrateData(raw)` → `validators.ts`: `validateInputData(data)` → `processor.ts`: `processData(raw)`, which returns `[Data, DisqualifiedTasks, Notification[]]`
- Migration runs first, before `camelCaseKeys`, so it operates on raw snake_case fields
- Processor validates every result has scores for all plottable metrics, ensures every task has results from all specified models, sorts categorical metric values, computes metric ranges
- `bin()` in `utilities/metrics.ts` maps a numeric value to its `[start, end]` bucket string using the metric's `range: [min, max, step]`. Values below `min` map to `<min` and values above `max` map to `>max`; both render as normal category bars in Carbon Charts, preventing unbounded raw-value bins from outliers
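
The bucketing behaviour described for `bin()` can be approximated as follows. This is a sketch, not the implementation in `utilities/metrics.ts`: the exact label format (`[start, end]` with a comma and space), the clamping of the last bucket to `max`, and the assumption that `step > 0` are all guesses made for illustration.

```typescript
// Sketch of range-based binning; label formatting is an assumption.
function bin(value: number, range: [number, number, number]): string {
  const [min, max, step] = range;

  // Out-of-range values collapse into two overflow categories.
  if (value < min) return `<${min}`;
  if (value > max) return `>${max}`;

  // Keep value === max inside the final bucket instead of opening a new one.
  const bucketCount = Math.max(1, Math.ceil((max - min) / step));
  const index = Math.min(Math.floor((value - min) / step), bucketCount - 1);

  const start = min + index * step;
  const end = Math.min(start + step, max);
  return `[${start}, ${end}]`;
}
```

With `range = [0, 10, 5]`, a value of `7` lands in `[5, 10]`, `-1` maps to `<0`, and `11` maps to `>10`, so an outlier adds at most one extra bar on either side of the chart.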
### Input Validation
`validators.ts` → `validateInputData(data)` returns `{ valid, reasons[] }`
- Checks required fields on models, metrics, tasks, results
- Validates metric type constraints (categorical needs values, numerical can't use majority)
- QA tasks must reference documents
- Every value in a categorical metric must carry a `numeric_value` (camelCase: `numericValue`)
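
As an illustration of the categorical-metric checks, here is a minimal sketch that mirrors the `{ valid, reasons[] }` return shape. It covers only two of the rules above (categorical metrics need values, and each value needs `numeric_value`); `RawMetric`, `RawMetricValue`, `validateCategoricalMetrics`, and the message wording are hypothetical.

```typescript
// Hypothetical raw (snake_case) shapes, trimmed to what the check needs.
interface RawMetricValue {
  value: string;
  numeric_value?: number;
}

interface RawMetric {
  name: string;
  type: 'numerical' | 'categorical' | 'text';
  values?: RawMetricValue[];
}

function validateCategoricalMetrics(metrics: RawMetric[]): {
  valid: boolean;
  reasons: string[];
} {
  const reasons: string[] = [];

  for (const metric of metrics) {
    if (metric.type !== 'categorical') continue;

    if (!metric.values || metric.values.length === 0) {
      reasons.push(`Metric "${metric.name}" is categorical but defines no values`);
      continue;
    }

    for (const entry of metric.values) {
      if (typeof entry.numeric_value !== 'number' || Number.isNaN(entry.numeric_value)) {
        reasons.push(`Metric "${metric.name}" value "${entry.value}" is missing numeric_value`);
      }
    }
  }

  return { valid: reasons.length === 0, reasons };
}
```

Collecting every reason before returning, rather than failing on the first one, lets the upload wizard report all problems in a file at once.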
**Why `numeric_value` is required on categorical metric values**
The entire aggregation and sorting pipeline operates on numbers, not label strings. `castToNumber()` in `utilities/metrics.ts` maps a string label to its `numericValue` so that mean, median, and inter-annotator agreement distance can be computed arithmetically. `computeMajority()` uses `Math.abs(castToNumber(a) - castToNumber(b))` to decide whether the top-two annotator choices are "close" (high agreement) or "far apart" (no agreement). `sortMetricValues()` in `processor.ts` sorts the metric's value list by `numericValue` so UI dropdowns, chart axes, and filter ranges reflect the researcher's intended ordering. `compareMetricAggregatedValues()` in `metrics.ts` uses `numericValue` to order chart bars for majority-aggregated metrics.
Without `numericValue`, every one of these paths falls back to `parseFloat(label)`, which returns `NaN` for any non-numeric string. That silently corrupts aggregate statistics, chart orderings, and agreement calculations. The validator rejects files with missing `numericValue` on categorical entries so the problem is surfaced at load time rather than producing incorrect visualizations.
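
The failure mode is easy to reproduce in isolation. The sketch below is a guess at the shape of the lookup, not the actual `castToNumber()` signature; `MetricValue` is a hypothetical trimmed type.

```typescript
// Hypothetical, trimmed-down categorical value entry.
interface MetricValue {
  value: string;
  numericValue?: number;
}

// Look up the label's numericValue; fall back to parsing the label itself.
function castToNumber(label: string, values: MetricValue[]): number {
  const match = values.find((entry) => entry.value === label);
  return match?.numericValue ?? parseFloat(label);
}

const withNumbers: MetricValue[] = [{ value: 'good', numericValue: 2 }];
const withoutNumbers: MetricValue[] = [{ value: 'good' }];

const ok = castToNumber('good', withNumbers); // 2
const broken = castToNumber('good', withoutNumbers); // NaN, because parseFloat('good') is NaN
```

Once a single `NaN` enters a mean or a distance calculation, the whole result becomes `NaN`, which is why the check happens at load time.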
**Convention:** assign `numericValue` so that **higher = better**. `sortMetricValues()` sorts ascending by `numericValue`, so `values[0]` becomes `minValue` (worst) and `values[last]` becomes `maxValue` (best). `PerformanceOverview` normalises scores as `(score - minValue) / (maxValue - minValue)` and ranks models with higher scores first. Example: `{ value: "poor", numeric_value: 0 }`, `{ value: "acceptable", numeric_value: 1 }`, `{ value: "good", numeric_value: 2 }`.
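
A small worked example of the normalisation formula, using the `poor`/`acceptable`/`good` values above. The helper names are hypothetical, and returning `0` when `maxValue === minValue` is an assumption added to avoid division by zero, not documented behaviour.

```typescript
// Hypothetical, trimmed-down categorical value entry.
interface MetricValue {
  value: string;
  numericValue: number;
}

// Smallest and largest numericValue, i.e. worst and best under the convention.
function valueBounds(values: MetricValue[]): [number, number] {
  const numbers = values.map((entry) => entry.numericValue);
  return [Math.min(...numbers), Math.max(...numbers)];
}

// (score - minValue) / (maxValue - minValue), guarded against a zero-width range.
function normalise(score: number, minValue: number, maxValue: number): number {
  if (maxValue === minValue) return 0; // assumption, see lead-in
  return (score - minValue) / (maxValue - minValue);
}

const qualityValues: MetricValue[] = [
  { value: 'poor', numericValue: 0 },
  { value: 'acceptable', numericValue: 1 },
  { value: 'good', numericValue: 2 },
];

const [minValue, maxValue] = valueBounds(qualityValues);
const normalisedAcceptable = normalise(1, minValue, maxValue); // 0.5
```

If the numbering were reversed (`good` = 0, `poor` = 2), the same formula would give the best label a normalised score of 0, and models would be ranked backwards.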
### JSON Key Convention
Input files use `snake_case`. The app converts to `camelCase` on load (`camelCaseKeys` in `objects.ts`) and back to `snake_case` on export (`snakeCaseKeys`). Migration runs before `camelCaseKeys`.
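
A recursive key conversion of this kind can be sketched as below. This is not the code in `objects.ts`: it converts every key it encounters, whereas the real `camelCaseKeys` may treat data-bearing keys (such as metric names inside `scores`) differently.

```typescript
// Convert a single snake_case key to camelCase.
function toCamel(key: string): string {
  return key.replace(/_([a-z0-9])/g, (_match, ch: string) => ch.toUpperCase());
}

// Walk arrays and plain objects, renaming keys along the way.
function camelCaseKeys(input: unknown): unknown {
  if (Array.isArray(input)) {
    return input.map((item) => camelCaseKeys(item));
  }
  if (input !== null && typeof input === 'object') {
    const converted: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(input as Record<string, unknown>)) {
      converted[toCamel(key)] = camelCaseKeys(value);
    }
    return converted;
  }
  // Primitives pass through untouched.
  return input;
}
```

Applied to `{ task_id: 't1', numeric_value: 1 }`, the sketch yields `{ taskId: 't1', numericValue: 1 }`; an export path would run the inverse mapping.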
### Styling
- SCSS Modules: one `.module.scss` per component, co-located
- Carbon Design tokens for spacing, colors, typography
- Global styles in `src/app/global.scss`
- Theme: Carbon `g10` (light) and `g90` (dark), toggled via header
- Sass uses `@use` (not `@import`) for Carbon v11 / Turbopack compatibility
- Carbon font-face disabled via `$css--font-face: false` in global.scss (Turbopack can't resolve `~` prefix)
### Carbon Component Gotchas
**TabPanel renders all panels simultaneously (hidden, not unmounted)**
Carbon's `TabPanel` uses the HTML `hidden` attribute to hide inactive panels; it does NOT lazy-mount or unmount them. All tab panels and their full component trees are live in the DOM at all times. Consequences:
- Any component with a non-unique `id` prop will have duplicate DOM IDs across tabs. This silently breaks components that rely on `id` for internal DOM wiring (labels, aria associations, focus management). The browser uses the **first** matching element, so clicks on a selector in a later tab quietly target the hidden first tab's element instead.
- **Rule:** every Carbon component that takes an `id` prop and is used in multiple tabs must have a globally unique `id`. Pattern used here: `{view-name}-{component-name}`, e.g. `model-behavior-model-selector`, `metric-behavior-model-selector`.
- Components confirmed affected: `FilterableMultiSelect`, `Toggle`, `Select`. Assume all interactive Carbon components are affected.
**`FilterableMultiSelect`: controlled vs uncontrolled**
- Use `selectedItems` (controlled) rather than `initialSelectedItems` (uncontrolled) when the parent needs to own the selection state (e.g. to filter data). The uncontrolled path fires `onChange` via a `useEffect` guarded by an `isMounted` ref; under React StrictMode's double-invoke behaviour this guard can leave the component unresponsive.
- Always add a null guard to `itemToString`: `(item) => (item ? item.name : '')`. Carbon passes `null` to `itemToString` in some internal code paths (e.g. when clearing the filter input); without the guard this throws and corrupts Downshift's internal state.
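
Written out as a standalone function, the null guard looks like this. `ModelItem` is a hypothetical item shape; in the app the function would be passed as the `itemToString` prop.

```typescript
// Hypothetical item shape for a model selector.
interface ModelItem {
  name: string;
}

// Tolerates the null/undefined items Carbon may pass internally.
const itemToString = (item: ModelItem | null | undefined): string =>
  item ? item.name : '';
```

Typing the parameter as nullable makes the compiler enforce the guard, so a later refactor cannot quietly drop it.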
**`@carbon/charts-react/styles.css` import**
Import this stylesheet once, at the highest shared layout level (e.g. `app/layout.tsx` or `global.scss`), not per-component. All rules are scoped under `.cds--chart-holder` so there is no global style pollution, but importing it in multiple component files creates redundant CSS bundles.
### Deployment
- Dockerfile for containerized deployment
- `next.config.js` sets `output: 'standalone'` for minimal Docker images
- Security headers configured (CSP, HSTS, X-Frame-Options)
- Can also deploy on HuggingFace Spaces
## Known Technical Debt
1. **Type safety gaps**: `any` types on `Task.input` and `Model.trainingDetails`; `task.annotations` is untyped (context quality scores on RAG/QA documents, distinct from `result.scores`; see TODO comment in `types.ts`)
2. **Pre-existing ESLint warnings**: 19 errors and 15 warnings from `react-hooks/exhaustive-deps`, `react-compiler` rules, and `setState`-in-effect patterns; need case-by-case review
3. **ESLint 10 blocked**: `eslint-plugin-react` (bundled by `eslint-config-next`) uses the deprecated `getFilename` API removed in ESLint 10; pinned to v9 until upstream fixes it
4. **No client-side component tests**: all tests are pure utility/logic tests; interactive Carbon component bugs are only caught manually