| ## Goal | |
| Rebuild the Gradio `dashboard.py` as a browser-based inspector for the Explainer OpenEnv at `https://kgdrathan-explainer-env.hf.space`. No Python, no Gradio, no `config.yaml`, no environment/provider/model dropdowns. The dashboard runs episodes (reset → explore → generate → repair → done), shows every observation, LLM response, reward, and log entry, and supports single-step and auto-run. | |
| ## Runtime config (server-side only, no UI selectors) | |
| Environment variables (read inside server functions, never in the browser bundle): | |
| - `ENV_BASE_URL`: default `https://kgdrathan-explainer-env.hf.space` | |
| - `API_BASE_URL`: default `https://router.huggingface.co/v1` | |
| - `HF_TOKEN`: required, stored as a Lovable Cloud secret | |
| - `MODEL_NAME`: default `Qwen/Qwen2.5-72B-Instruct` | |
| The dashboard shows `ENV_BASE_URL` and `MODEL_NAME` as a small read-only metadata strip (not editable). | |
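
A minimal sketch of how the server side might resolve these variables and expose only the non-secret subset to the UI. The helper names (`resolveServerConfig`, `toPublicConfig`) and the choice to pass the environment in as a plain object are assumptions made for illustration; in a real server function the argument would presumably be `process.env`.

```typescript
// Hypothetical helpers; names and shapes are illustrative, not taken from the project.
type Env = Record<string, string | undefined>;

interface ServerConfig {
  envBaseUrl: string;
  apiBaseUrl: string;
  hfToken: string;
  modelName: string;
}

// Defaults copied from the variable list above.
const DEFAULTS = {
  ENV_BASE_URL: "https://kgdrathan-explainer-env.hf.space",
  API_BASE_URL: "https://router.huggingface.co/v1",
  MODEL_NAME: "Qwen/Qwen2.5-72B-Instruct",
} as const;

// Reads the four variables, applies defaults, and fails fast when HF_TOKEN is missing.
function resolveServerConfig(env: Env): ServerConfig {
  const hfToken = env.HF_TOKEN;
  if (!hfToken) {
    throw new Error("HF_TOKEN is required but not set");
  }
  return {
    envBaseUrl: env.ENV_BASE_URL || DEFAULTS.ENV_BASE_URL,
    apiBaseUrl: env.API_BASE_URL || DEFAULTS.API_BASE_URL,
    hfToken,
    modelName: env.MODEL_NAME || DEFAULTS.MODEL_NAME,
  };
}

// Only these two fields are returned to the browser for the read-only metadata strip.
function toPublicConfig(cfg: ServerConfig): { envUrl: string; modelName: string } {
  return { envUrl: cfg.envBaseUrl, modelName: cfg.modelName };
}
```

Keeping the secret out of the returned object, rather than filtering it later in the UI, means the token cannot reach the browser bundle by accident.
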
| ## Architecture | |
| ```text | |
| Browser (React inspector) | |
|   │ useServerFn | |
|   ▼ | |
| TanStack server functions | |
|   ├─ envReset({ seed?, episode_id? }) | |
|   ├─ envStep({ action })            ── proxies POST /step on ENV_BASE_URL | |
|   ├─ envSchema()                    ── GET /schema (cached) | |
|   └─ llmCall({ phase, obs, prior }) ── builds prompt, calls API_BASE_URL with HF_TOKEN, | |
|                                        returns { raw, parsed action } | |
| ``` | |
| All env and LLM HTTP traffic goes through server functions. The browser never sees `HF_TOKEN`. CORS is irrelevant because calls are same-origin RPC. | |
| ## Explainer env contract (verified from `/schema`) | |
| - POST `/reset` → `{ observation, done }` where `observation` is an `ExplainerObservation` (topic, content, tier, keywords, phase, feedback, search_results, top_chunks, explored_context, explore_steps_left, repair_attempts_left, last_errors, available_tools, metadata, reward). | |
| - POST `/step` body: `{ action: ExplainerAction }`. Action shape: | |
|   - `action_type`: `"explore" | "generate" | "repair"` | |
|   - explore: `tool` (one of search_wikipedia / search_hf_papers / search_arxiv / search_scholar / fetch_docs / search_hf_hub), `query`, `intent` | |
|   - generate / repair: `format` (`"marimo" | "manim"`), `code`, `narration`, `repair_notes` (repair only) | |
| - GET `/metadata`, GET `/schema` for header info and tool list. | |
| The episode is "done" when `observation.done === true` or `phase === "done"`. | |
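
The contract above could be typed roughly as follows. This is a sketch: the field names come from the list above, but `StepResponse`, `stepBody`, and `isEpisodeDone` are hypothetical names. Checking a top-level `done` flag on `/step` responses is an assumption carried over from the `/reset` response shape.

```typescript
// Tool and format names are taken from the contract above.
type ExploreTool =
  | "search_wikipedia"
  | "search_hf_papers"
  | "search_arxiv"
  | "search_scholar"
  | "fetch_docs"
  | "search_hf_hub";

type OutputFormat = "marimo" | "manim";

// Discriminated union on action_type, so each phase only carries its own fields.
type ExplainerAction =
  | { action_type: "explore"; tool: ExploreTool; query: string; intent: string }
  | { action_type: "generate"; format: OutputFormat; code: string; narration: string }
  | {
      action_type: "repair";
      format: OutputFormat;
      code: string;
      narration: string;
      repair_notes: string;
    };

// Only the fields needed for the done check; the real observation has many more.
interface StepResponse {
  done?: boolean; // assumed: top-level flag, mirroring the /reset response
  observation: { phase: string; done?: boolean };
}

// Serializes the POST /step body: { action: ExplainerAction }.
function stepBody(action: ExplainerAction): string {
  return JSON.stringify({ action });
}

// Done when any done flag is true or the phase has reached "done".
function isEpisodeDone(res: StepResponse): boolean {
  return (
    res.done === true ||
    res.observation.done === true ||
    res.observation.phase === "done"
  );
}
```
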
| ## LLM logic (port of `inference.py` to TypeScript) | |
| Reimplement as pure TS in `src/server/llm/`: | |
| - `buildExplorePrompt(obs, accumulatedContext)` | |
| - `buildGeneratePrompt(obs, accumulatedContext)` | |
| - `buildRepairPrompt(obs, lastCode, lastErrors)` | |
| - `parseExploreResponse(text)` → `{ tool, query, intent }` or `"SKIP"` | |
| - `parseGenerateResponse(text)` → `{ format, code, narration }` | |
| - `callLLM(messages)`: OpenAI-compatible `POST {API_BASE_URL}/chat/completions` with `Authorization: Bearer ${HF_TOKEN}` and `model: MODEL_NAME`. | |
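
As an illustration of the `callLLM` request shape, the sketch below builds the OpenAI-compatible request as plain data and extracts the assistant text from a response. `buildChatRequest` and `extractAssistantText` are hypothetical helper names; the actual port may structure this differently, and the HTTP call itself (for example via `fetch`) is left out so the example stays side-effect free.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds the request for POST {API_BASE_URL}/chat/completions.
function buildChatRequest(
  apiBaseUrl: string,
  hfToken: string,
  modelName: string,
  messages: ChatMessage[],
): ChatRequest {
  const base = apiBaseUrl.replace(/\/+$/, ""); // tolerate a trailing slash
  return {
    url: `${base}/chat/completions`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${hfToken}`,
    },
    body: JSON.stringify({ model: modelName, messages }),
  };
}

// Reads choices[0].message.content from an OpenAI-style chat completion.
function extractAssistantText(json: unknown): string {
  const shaped = json as
    | { choices?: Array<{ message?: { content?: unknown } }> }
    | null
    | undefined;
  const content = shaped?.choices?.[0]?.message?.content;
  if (typeof content !== "string") {
    throw new Error("Unexpected chat completion response shape");
  }
  return content;
}
```
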
| Phase routing inside `runStep`: | |
| - `phase === "explore"` → explore prompt; on `SKIP`, force a generate step instead. | |
| - `phase === "generate"` → generate prompt; if env returns `phase === "repair"`, surface errors. | |
| - `phase === "repair"` → repair prompt seeded with `last_errors` + previous code. | |
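
The routing rules above reduce to a small pure function. The sketch below assumes a hypothetical `nextActionType` helper that decides which action type to send to the env; prompt construction and the LLM call are omitted.

```typescript
type Phase = "explore" | "generate" | "repair" | "done";

// Result of parsing the explore response: a tool call or the literal "SKIP".
type ExploreDecision = { tool: string; query: string; intent: string } | "SKIP";

type NextAction = "explore" | "generate" | "repair" | null;

// Maps the current phase (and, for explore, the parsed LLM decision)
// to the action type that should be sent to the env next.
function nextActionType(phase: Phase, exploreDecision?: ExploreDecision): NextAction {
  switch (phase) {
    case "explore":
      // On SKIP, force a generate step instead of another explore step.
      return exploreDecision === "SKIP" ? "generate" : "explore";
    case "generate":
      return "generate";
    case "repair":
      return "repair";
    default:
      // phase === "done": nothing left to send.
      return null;
  }
}
```
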
| ## Episode state (single React store, e.g. Zustand) | |
| ```text | |
| { | |
|   sessionId, episodeId, | |
|   envUrl, modelName, | |
|   obs,               // latest ExplainerObservation | |
|   phase, step, done, score, status, | |
|   task: { topic, tier, difficulty, keywords, content, dataAvailable }, | |
|   research: { exploredContext, topChunks: [], lastSearchResults }, | |
|   generation: { lastFormat, lastCode, lastNarration, generatedRaw, parsedAction }, | |
|   rewards: [],       // per-step total | |
|   rewardDetails: [], // per-step component breakdown | |
|   log: [],           // [START]/[LLM]/[STEP]/[END]/[WARN] entries | |
|   autoRunning: false | |
| } | |
| ``` | |
| A "session" reset just generates a new `episodeId` and calls `/reset`. There is no long-lived server-side handle, so each `envStep` call is stateless toward the env (the env tracks state internally per `episode_id`). | |
| ## Task bank | |
| Port `ALL_TASKS` from the Python `task_bank` to a TS constant `TASKS = [{ topic, difficulty, tier }, ...]`. The dropdown shows `(random)` plus one `topic [difficulty, tier]` entry per task. Picking a task passes `topic` to `/reset` in the request body if the env supports it; otherwise the choice is stored as a target hint and the env's own random topic is used (we'll detect which from the schema at build time). | |
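
A sketch of how the dropdown labels could be derived from the ported constant. The `Task` field types (plain strings) are an assumption, and the two entries in the usage note are placeholders rather than real `ALL_TASKS` content.

```typescript
// Assumed shape: the plan only names the three fields, not their types.
interface Task {
  topic: string;
  difficulty: string;
  tier: string;
}

const RANDOM_LABEL = "(random)";

// Formats one task as `topic [difficulty, tier]`.
function taskLabel(task: Task): string {
  return `${task.topic} [${task.difficulty}, ${task.tier}]`;
}

// The dropdown always starts with "(random)", followed by one label per task.
function dropdownOptions(tasks: Task[]): string[] {
  return [RANDOM_LABEL, ...tasks.map(taskLabel)];
}
```

For example, `dropdownOptions([{ topic: "placeholder topic", difficulty: "easy", tier: "intro" }])` yields `["(random)", "placeholder topic [easy, intro]"]`.
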
| ## UI (single page at `/`) | |
| Inspector layout, dark theme, monospace accents; this is not a marketing page. | |
| ```text | |
| ┌─ Header ─────────────────────────────────────────────────────────┐ | |
| │ Topic · Tier/Difficulty · Phase badge · Step n · Score · Status  │ | |
| │ env=ENV_BASE_URL model=MODEL_NAME                                │ | |
| ├─ Controls ───────────────────────────────────────────────────────┤ | |
| │ [Task ▼ (random)…] [Reset Episode] [Next Step] [Auto Run ▶/■]    │ | |
| ├─ Left column ─────────────────┬─ Right column ───────────────────┤ | |
| │ Observation                   │ LLM panel                        │ | |
| │ • topic / content (collapsed) │ • raw response                   │ | |
| │ • keywords, data_available    │ • parsed action (JSON)           │ | |
| │ • feedback (latest)           │ • generated code (syntax hl.)    │ | |
| │                               │                                  │ | |
| │ Research                      │ Rewards                          │ | |
| │ • last search_results         │ • per-step total summary         │ | |
| │ • Top 5 chunks table          │ • component breakdown table      │ | |
| │   (rank, source, title,       │                                  │ | |
| │   score, url, snippet)        │                                  │ | |
| ├───────────────────────────────┴──────────────────────────────────┤ | |
| │ Timeline / log (scrollable, color-coded by tag)                  │ | |
| └──────────────────────────────────────────────────────────────────┘ | |
| ``` | |
| Shadcn primitives only: Card, Table, Badge, Button, Select, ScrollArea, Tabs (for code/raw/parsed), Separator. Code blocks use a lightweight highlighter. | |
| ## Behavior | |
| - **Reset Episode**: clears state, calls `envReset`, populates task fields from observation, logs `[START]`. | |
| - **Next Step**: reads `phase`, calls `llmCall` for that phase, logs `[LLM]` with raw + parsed, calls `envStep`, merges observation, appends reward + components, logs `[STEP]`. Disabled when `done`. | |
| - **Auto Run**: loops Next Step with a small delay until `done`, repair attempts exhausted, or user hits Stop. Logs `[END]` with success / score / rewards. | |
| - Errors from the env or LLM go into the log as `[WARN]` / `[ERROR]` and surface as a toast; the run halts but state is preserved. | |
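
The Auto Run stop conditions can be sketched as a pure check plus a small loop. Everything here is hypothetical naming (`RunState`, `stopReason`, `autoRun`). Reading "repair attempts exhausted" as `phase === "repair"` with `repair_attempts_left` at zero is an assumption. The step and sleep functions are injected so the loop does not depend on any particular store or timer API.

```typescript
interface RunState {
  done: boolean;
  phase: string;
  repairAttemptsLeft: number;
  stopRequested: boolean; // set when the user hits Stop
}

type StopReason = "user_stop" | "done" | "repair_exhausted" | null;

// Returns why the auto-run loop should stop, or null if it should continue.
function stopReason(state: RunState): StopReason {
  if (state.stopRequested) return "user_stop";
  if (state.done || state.phase === "done") return "done";
  // Assumed interpretation of "repair attempts exhausted".
  if (state.phase === "repair" && state.repairAttemptsLeft <= 0) return "repair_exhausted";
  return null;
}

// Repeats Next Step with a small delay until a stop condition is met.
// An error thrown by nextStep propagates out and halts the loop; the
// caller's state is left untouched so it can still be inspected.
async function autoRun(
  getState: () => RunState,
  nextStep: () => Promise<void>,
  sleep: (ms: number) => Promise<void>,
  delayMs: number = 300,
): Promise<StopReason> {
  for (;;) {
    const reason = stopReason(getState());
    if (reason !== null) return reason;
    await nextStep();
    await sleep(delayMs);
  }
}
```
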
| ## Reward handling | |
| Port the Python helpers: | |
| - `rewardComponents(obsMetadata, feedback)` → filtered numeric components (uses the explore/generate/repair allow-lists). | |
| - `parseRewardComponentsFromFeedback(feedback)` as a fallback for old observations. | |
| - Total per phase: `explore_total | generate_total | repair_total` if present, else sum of visible components. | |
| - Final episode score: `normalized_episode_score(rewards)` ported as a simple TS function (mean of generate+repair totals, clamped to [0,1]); `success = score >= SUCCESS_SCORE_THRESHOLD` (constant ported from `explainer_env/constants.py`; default 0.6, to be confirmed during implementation). | |
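
A sketch of the total and score rules described above, using the default threshold of 0.6 that is still to be confirmed. `stepTotal`, `normalizedEpisodeScore`, and `isSuccess` are illustrative names, and returning 0 for an episode with no generate or repair steps is an assumption; the real Python formula may differ.

```typescript
type Phase = "explore" | "generate" | "repair";

interface StepReward {
  phase: Phase;
  total: number;
}

// Default from the plan; the real value lives in explainer_env/constants.py.
const SUCCESS_SCORE_THRESHOLD = 0.6;

// Uses `<phase>_total` when present, else the sum of the visible components.
function stepTotal(phase: Phase, components: Record<string, number>): number {
  const explicit = components[`${phase}_total`];
  if (typeof explicit === "number") return explicit;
  return Object.values(components).reduce((sum, value) => sum + value, 0);
}

// Mean of generate and repair totals, clamped to [0, 1].
function normalizedEpisodeScore(steps: StepReward[]): number {
  const scored = steps.filter((s) => s.phase !== "explore");
  if (scored.length === 0) return 0; // assumed: nothing to score yet
  const mean = scored.reduce((sum, s) => sum + s.total, 0) / scored.length;
  return Math.min(1, Math.max(0, mean));
}

function isSuccess(score: number): boolean {
  return score >= SUCCESS_SCORE_THRESHOLD;
}
```

With one generate step at 0.5 and one repair step at 0.75, the score is 0.625, which passes the assumed 0.6 threshold; explore totals are ignored.
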
| ## Technical details | |
| - **Files added**: | |
|   - `src/routes/index.tsx`: dashboard page, replaces placeholder. | |
|   - `src/server/env.functions.ts`: `envReset`, `envStep`, `envMetadata`, `envSchema` server fns calling `${process.env.ENV_BASE_URL}`. | |
|   - `src/server/llm/prompts.ts`, `src/server/llm/parse.ts`, `src/server/llm/client.ts`: port of inference.py prompt/parse/call. | |
|   - `src/server/llm.functions.ts`: `runLlmStep({ phase, obs, prior })` server fn. | |
|   - `src/server/config.functions.ts`: `getRuntimeConfig()` returning `{ envUrl, modelName }` (no secrets). | |
|   - `src/lib/tasks.ts`: ported task bank. | |
|   - `src/lib/rewards.ts`: reward parsing/normalization. | |
|   - `src/lib/types.ts`: `ExplainerObservation`, `ExplainerAction`, etc. | |
|   - `src/store/episode.ts`: Zustand store. | |
|   - `src/components/inspector/*`: Header, Controls, ObservationPanel, LlmPanel, ResearchPanel, RewardsPanel, Log. | |
| - **Secret**: `HF_TOKEN` added via Lovable Cloud secrets after plan approval. The user will be prompted to paste it. | |
| - **Stack stays** TanStack Start + React + Tailwind + shadcn. No new heavy deps: add only `zustand` and a small syntax highlighter (`shiki` or `highlight.js`), whichever is smaller at implementation time. | |
| - **Out of scope**: persistence across reloads, multi-user sessions, auth, charts. Pure single-tab inspector. | |
| ## Open items to resolve during implementation | |
| - Confirm whether `/reset` accepts a `topic` argument; if not, we still show the task picker but treat it as a display filter and surface a warning when the returned topic differs. | |
| - Confirm exact `SUCCESS_SCORE_THRESHOLD` and `normalized_episode_score` formula from `explainer_env/constants.py` (you can paste it, otherwise default to mean-of-totals ≥ 0.6). | |