## Goal

Rebuild the Gradio `dashboard.py` as a browser-based inspector for the Explainer OpenEnv at `https://kgdrathan-explainer-env.hf.space`. No Python, no Gradio, no `config.yaml`, no environment/provider/model dropdowns. The dashboard runs episodes (reset → explore → generate → repair → done), shows everything, and supports single-step and auto-run.

## Runtime config (server-side only, no UI selectors)

Environment variables (read inside server functions, never in the browser bundle):

- `ENV_BASE_URL`: default `https://kgdrathan-explainer-env.hf.space`
- `API_BASE_URL`: default `https://router.huggingface.co/v1`
- `HF_TOKEN`: required, stored as a Lovable Cloud secret
- `MODEL_NAME`: default `Qwen/Qwen2.5-72B-Instruct`

The dashboard shows `ENV_BASE_URL` and `MODEL_NAME` as a small read-only metadata strip (not editable).

## Architecture

```text
Browser (React inspector)
   │  useServerFn
   ▼
TanStack server functions
   ├─ envReset({ seed?, episode_id? })
   ├─ envStep({ action })            ── proxies POST /step on ENV_BASE_URL
   ├─ envSchema()                    ── GET /schema (cached)
   └─ llmCall({ phase, obs, prior }) ── builds prompt, calls API_BASE_URL
                                        with HF_TOKEN, returns { raw, parsed action }
```

All env and LLM HTTP traffic goes through server functions. The browser never sees `HF_TOKEN`. CORS is irrelevant because calls are same-origin RPC.

## Explainer env contract (verified from `/schema`)

- POST `/reset` → `{ observation, done }` where `observation` is an `ExplainerObservation` (topic, content, tier, keywords, phase, feedback, search_results, top_chunks, explored_context, explore_steps_left, repair_attempts_left, last_errors, available_tools, metadata, reward).
- POST `/step` body: `{ action: ExplainerAction }`.
  Action shape:
  - `action_type`: `"explore" | "generate" | "repair"`
  - explore: `tool` (one of search_wikipedia / search_hf_papers / search_arxiv / search_scholar / fetch_docs / search_hf_hub), `query`, `intent`
  - generate / repair: `format` (`"marimo" | "manim"`), `code`, `narration`, `repair_notes` (repair only)
- GET `/metadata`, GET `/schema` for header info and tool list.

The episode is "done" when `observation.done === true` or `phase === "done"`.

## LLM logic (port of `inference.py` to TypeScript)

Reimplement as pure TS in `src/server/llm/`:

- `buildExplorePrompt(obs, accumulatedContext)`
- `buildGeneratePrompt(obs, accumulatedContext)`
- `buildRepairPrompt(obs, lastCode, lastErrors)`
- `parseExploreResponse(text)` → `{ tool, query, intent }` or `"SKIP"`
- `parseGenerateResponse(text)` → `{ format, code, narration }`
- `callLLM(messages)` → OpenAI-compatible `POST {API_BASE_URL}/chat/completions` with `Authorization: Bearer ${HF_TOKEN}` and `model: MODEL_NAME`.

Phase routing inside `runStep`:

- `phase === "explore"` → explore prompt; on `SKIP`, force a generate step instead.
- `phase === "generate"` → generate prompt; if env returns `phase === "repair"`, surface errors.
- `phase === "repair"` → repair prompt seeded with `last_errors` + previous code.

## Episode state (single React store, e.g. Zustand)

```text
{
  sessionId, episodeId, envUrl, modelName,
  obs,                 // latest ExplainerObservation
  phase, step, done, score, status,
  task: { topic, tier, difficulty, keywords, content, dataAvailable },
  research: { exploredContext, topChunks: [], lastSearchResults },
  generation: { lastFormat, lastCode, lastNarration, generatedRaw, parsedAction },
  rewards: [],         // per-step total
  rewardDetails: [],   // per-step component breakdown
  log: [],             // [START]/[LLM]/[STEP]/[END]/[WARN] entries
  autoRunning: false
}
```

A "session" reset just generates a new `episodeId` and calls `/reset`. There is no long-lived server-side handle, so each `envStep` call is stateless toward the env (the env tracks state internally per episode_id).

## Task bank

Port `ALL_TASKS` from the Python `task_bank` to a TS constant `TASKS = [{ topic, difficulty, tier }, ...]`. Dropdown shows `(random)` plus `topic [difficulty, tier]`. Picking a task passes `topic` to `/reset` via the action body if the env supports it; otherwise it's stored as a target hint and the env's own random topic is used (we'll detect from the schema at build time).

## UI (single page at `/`)

Inspector layout, dark theme, monospace accents; not a marketing page.

```text
┌─ Header ─────────────────────────────────────────────────────────┐
│ Topic · Tier/Difficulty · Phase badge · Step n · Score · Status  │
│ env=ENV_BASE_URL   model=MODEL_NAME                              │
├─ Controls ───────────────────────────────────────────────────────┤
│ [Task ▼ (random)…]  [Reset Episode]  [Next Step]  [Auto Run ▶/■] │
├─ Left column ─────────────────┬─ Right column ───────────────────┤
│ Observation                   │ LLM panel                        │
│ • topic / content (collapsed) │ • raw response                   │
│ • keywords, data_available    │ • parsed action (JSON)           │
│ • feedback (latest)           │ • generated code (syntax hl.)    │
│                               │                                  │
│ Research                      │ Rewards                          │
│ • last search_results         │ • per-step total summary         │
│ • Top 5 chunks table          │ • component breakdown table      │
│   (rank, source, title,       │                                  │
│    score, url, snippet)       │                                  │
├──────────────────────────────────────────────────────────────────┤
│ Timeline / log (scrollable, color-coded by tag)                  │
└──────────────────────────────────────────────────────────────────┘
```

Shadcn primitives only: Card, Table, Badge, Button, Select, ScrollArea, Tabs (for code/raw/parsed), Separator. Code blocks use a lightweight highlighter.

## Behavior

- **Reset Episode**: clears state, calls `envReset`, populates task fields from observation, logs `[START]`.
- **Next Step**: reads `phase`, calls `llmCall` for that phase, logs `[LLM]` with raw + parsed, calls `envStep`, merges observation, appends reward + components, logs `[STEP]`. Disabled when `done`.
- **Auto Run**: loops Next Step with a small delay until `done`, repair attempts exhausted, or user hits Stop. Logs `[END]` with success / score / rewards.
- Errors from the env or LLM go into the log as `[WARN]` / `[ERROR]` and surface as a toast; the run halts but state is preserved.

## Reward handling

Port the Python helpers:

- `rewardComponents(obsMetadata, feedback)` → filtered numeric components (uses the explore/generate/repair allow-lists).
- `parseRewardComponentsFromFeedback(feedback)` as a fallback for old observations.
- Total per phase: `explore_total | generate_total | repair_total` if present, else sum of visible components.
- Final episode score: `normalized_episode_score(rewards)` ported as a simple TS function (mean of generate+repair totals, clamped to [0,1]); `success = score >= SUCCESS_SCORE_THRESHOLD` (constant ported from `explainer_env/constants.py`, default 0.6, confirmed during implementation).

## Technical details

- **Files added**:
  - `src/routes/index.tsx`: dashboard page, replaces placeholder.
  - `src/server/env.functions.ts`: `envReset`, `envStep`, `envMetadata`, `envSchema` server fns calling `${process.env.ENV_BASE_URL}`.
  - `src/server/llm/prompts.ts`, `src/server/llm/parse.ts`, `src/server/llm/client.ts`: port of inference.py prompt/parse/call.
  - `src/server/llm.functions.ts`: `runLlmStep({ phase, obs, prior })` server fn.
  - `src/server/config.functions.ts`: `getRuntimeConfig()` returning `{ envUrl, modelName }` (no secrets).
  - `src/lib/tasks.ts`: ported task bank.
  - `src/lib/rewards.ts`: reward parsing/normalization.
  - `src/lib/types.ts`: `ExplainerObservation`, `ExplainerAction`, etc.
  - `src/store/episode.ts`: Zustand store.
  - `src/components/inspector/*`: Header, Controls, ObservationPanel, LlmPanel, ResearchPanel, RewardsPanel, Log.
- **Secret**: `HF_TOKEN` added via Lovable Cloud secrets after plan approval. The user will be prompted to paste it.
- **Stack stays** TanStack Start + React + Tailwind + shadcn. No new heavy deps; add only `zustand` and a small syntax highlighter (`shiki` or `highlight.js`), whichever is smaller at implementation time.
- **Out of scope**: persistence across reloads, multi-user sessions, auth, charts. Pure single-tab inspector.

## Open items resolved during implementation

- Confirm whether `/reset` accepts a `topic` argument; if not, we still show the task picker but treat it as a display filter and surface a warning when the returned topic differs.
- Confirm exact `SUCCESS_SCORE_THRESHOLD` and `normalized_episode_score` formula from `explainer_env/constants.py` (you can paste it, otherwise default to mean-of-totals ≥ 0.6).
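
As a rough illustration of the action shape and done condition listed under the Explainer env contract, the types in `src/lib/types.ts` might look like the sketch below. The interface names `ExploreAction`, `GenerateAction`, `RepairAction`, and the helpers `isEpisodeDone` and `buildStepBody` are hypothetical; only the field names and literal values come from the contract above, and the real `/schema` output should take precedence.

```typescript
// Sketch only: field names follow the contract in this plan; helper names are assumptions.
type ExploreTool =
  | "search_wikipedia"
  | "search_hf_papers"
  | "search_arxiv"
  | "search_scholar"
  | "fetch_docs"
  | "search_hf_hub";

type OutputFormat = "marimo" | "manim";

interface ExploreAction {
  action_type: "explore";
  tool: ExploreTool;
  query: string;
  intent: string;
}

interface GenerateAction {
  action_type: "generate";
  format: OutputFormat;
  code: string;
  narration: string;
}

interface RepairAction {
  action_type: "repair";
  format: OutputFormat;
  code: string;
  narration: string;
  repair_notes: string; // repair only
}

// Discriminated union keyed on action_type.
type ExplainerAction = ExploreAction | GenerateAction | RepairAction;

// Only the two observation fields needed for the done check are modelled here.
interface ObservationDoneFields {
  phase: string;
  done?: boolean;
}

// Done when observation.done === true or phase === "done".
function isEpisodeDone(obs: ObservationDoneFields): boolean {
  return obs.done === true || obs.phase === "done";
}

// POST /step expects the action wrapped as { action }.
function buildStepBody(action: ExplainerAction): { action: ExplainerAction } {
  return { action };
}
```

Keeping `action_type` as the discriminant lets the compiler reject, for example, a `repair_notes` field on an explore action before the request ever reaches the env.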
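
The `callLLM(messages)` item under LLM logic describes an OpenAI-compatible `POST {API_BASE_URL}/chat/completions` request. A minimal sketch of how `src/server/llm/client.ts` could assemble that request is shown below, split out as a pure function so it can be tested without network access. `buildChatRequest`, `LlmConfig`, `ChatMessage`, and `ChatRequest` are hypothetical names, and the sketch assumes the config values have already been read from the environment on the server.

```typescript
// Sketch only: builds the request but does not send it.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Assumed to be filled server-side from API_BASE_URL, HF_TOKEN and MODEL_NAME.
interface LlmConfig {
  apiBaseUrl: string;
  hfToken: string;
  modelName: string;
}

interface ChatRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildChatRequest(config: LlmConfig, messages: ChatMessage[]): ChatRequest {
  // HF_TOKEN is required by the plan, so fail early rather than send an unauthenticated call.
  if (config.hfToken.length === 0) {
    throw new Error("HF_TOKEN is required");
  }
  // Drop trailing slashes so the joined URL has exactly one separator.
  const base = config.apiBaseUrl.replace(/\/+$/, "");
  return {
    url: `${base}/chat/completions`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${config.hfToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: config.modelName, messages }),
  };
}
```

Inside a server function, the result could then be sent with something like `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`, which keeps the token on the server as the plan requires.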
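
The scoring rules in the Reward handling section can be sketched as follows. This assumes the default the plan itself falls back to (mean of generate and repair totals, clamped to [0,1], success at 0.6); the actual formula in the Python source may differ, and `StepReward`, `phaseTotal`, and `isSuccess` are hypothetical names for illustration.

```typescript
// Sketch only: mirrors the fallback rules stated in this plan, not the confirmed Python formula.
type Phase = "explore" | "generate" | "repair";

interface StepReward {
  phase: Phase;
  total: number;
}

// Assumed default; the plan says to confirm it against explainer_env/constants.py.
const SUCCESS_SCORE_THRESHOLD = 0.6;

// Per-phase total: use `<phase>_total` if present, else sum the visible components.
function phaseTotal(phase: Phase, components: Record<string, number>): number {
  const explicit = components[`${phase}_total`];
  if (explicit !== undefined) return explicit;
  return Object.keys(components).reduce((sum, key) => sum + (components[key] ?? 0), 0);
}

function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// Mean of generate + repair totals, clamped to [0, 1]; explore steps are ignored.
function normalizedEpisodeScore(rewards: StepReward[]): number {
  const scored = rewards.filter((r) => r.phase !== "explore");
  if (scored.length === 0) return 0;
  const mean = scored.reduce((sum, r) => sum + r.total, 0) / scored.length;
  return clamp01(mean);
}

function isSuccess(rewards: StepReward[]): boolean {
  return normalizedEpisodeScore(rewards) >= SUCCESS_SCORE_THRESHOLD;
}
```

Returning 0 for an episode with no generate or repair steps is a choice made for this sketch; the plan does not say how the Python helper treats that case.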