Gnan Deep Rathan K
Deploy dashboard without binary lockfile
1b83e76

Goal

Rebuild the Gradio dashboard.py as a browser-based inspector for the Explainer OpenEnv at https://kgdrathan-explainer-env.hf.space. No Python, no Gradio, no config.yaml, no environment/provider/model dropdowns. The dashboard runs episodes (reset → explore → generate → repair → done), shows every observation, LLM response, and reward along the way, and supports single-step and auto-run.

Runtime config (server-side only, no UI selectors)

Environment variables (read inside server functions, never in the browser bundle):

  • ENV_BASE_URL: default https://kgdrathan-explainer-env.hf.space
  • API_BASE_URL: default https://router.huggingface.co/v1
  • HF_TOKEN: required, stored as a Lovable Cloud secret
  • MODEL_NAME: default Qwen/Qwen2.5-72B-Instruct

The dashboard shows ENV_BASE_URL and MODEL_NAME as a small read-only metadata strip (not editable).
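
A minimal sketch of how the server-side config read above could look. The helper names (resolveConfig, publicConfig) and the RuntimeEnv type are illustrative assumptions; only the variable names and defaults come from this plan. The env-like object would typically be process.env inside a server function.

```typescript
// Hypothetical helpers; only the variable names and defaults come from the plan.
type RuntimeEnv = Record<string, string | undefined>;

interface ServerConfig {
  envBaseUrl: string;
  apiBaseUrl: string;
  hfToken: string;
  modelName: string;
}

const DEFAULTS = {
  ENV_BASE_URL: "https://kgdrathan-explainer-env.hf.space",
  API_BASE_URL: "https://router.huggingface.co/v1",
  MODEL_NAME: "Qwen/Qwen2.5-72B-Instruct",
};

// Resolve config from an env-like object, applying defaults.
// HF_TOKEN has no default and must be present.
function resolveConfig(env: RuntimeEnv): ServerConfig {
  const hfToken = env["HF_TOKEN"];
  if (!hfToken) {
    throw new Error("HF_TOKEN is required but not set");
  }
  return {
    envBaseUrl: env["ENV_BASE_URL"] || DEFAULTS.ENV_BASE_URL,
    apiBaseUrl: env["API_BASE_URL"] || DEFAULTS.API_BASE_URL,
    hfToken,
    modelName: env["MODEL_NAME"] || DEFAULTS.MODEL_NAME,
  };
}

// Only the non-secret fields are exposed to the browser for the metadata strip.
function publicConfig(cfg: ServerConfig): { envUrl: string; modelName: string } {
  return { envUrl: cfg.envBaseUrl, modelName: cfg.modelName };
}
```

Keeping publicConfig as a separate function makes it harder to return the token to the client by accident.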

Architecture

Browser (React inspector)
        │  useServerFn
        ▼
TanStack server functions
  ├─ envReset({ seed?, episode_id? })
  ├─ envStep({ action })             ── proxies POST /step on ENV_BASE_URL
  ├─ envSchema()                     ── GET /schema (cached)
  └─ llmCall({ phase, obs, prior })  ── builds prompt, calls API_BASE_URL with HF_TOKEN,
                                        returns { raw, parsed action }

All env and LLM HTTP traffic goes through server functions. The browser never sees HF_TOKEN. CORS is irrelevant because calls are same-origin RPC.

Explainer env contract (verified from /schema)

  • POST /reset → { observation, done } where observation is an ExplainerObservation (topic, content, tier, keywords, phase, feedback, search_results, top_chunks, explored_context, explore_steps_left, repair_attempts_left, last_errors, available_tools, metadata, reward).
  • POST /step body: { action: ExplainerAction }. Action shape:
    • action_type: "explore" | "generate" | "repair"
    • explore: tool (one of search_wikipedia / search_hf_papers / search_arxiv / search_scholar / fetch_docs / search_hf_hub), query, intent
    • generate / repair: format ("marimo" | "manim"), code, narration, repair_notes (repair only)
  • GET /metadata, GET /schema for header info and tool list.

The episode is "done" when observation.done === true or phase === "done".
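
The contract above could be typed roughly as follows. This is a sketch under assumptions: it treats the action fields as flat properties on the action object and all of them as required, which the schema may not confirm. buildStepRequest, StepRequest, and StepResult are illustrative names, not part of the env.

```typescript
// Tool names and field names come from the contract above.
// ASSUMPTION: fields are flat on the action object and all required.
type ExploreTool =
  | "search_wikipedia"
  | "search_hf_papers"
  | "search_arxiv"
  | "search_scholar"
  | "fetch_docs"
  | "search_hf_hub";

type OutputFormat = "marimo" | "manim";

type ExplainerAction =
  | { action_type: "explore"; tool: ExploreTool; query: string; intent: string }
  | { action_type: "generate"; format: OutputFormat; code: string; narration: string }
  | {
      action_type: "repair";
      format: OutputFormat;
      code: string;
      narration: string;
      repair_notes: string;
    };

interface StepRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build the POST /step request; the body wraps the action as { action }.
function buildStepRequest(envBaseUrl: string, action: ExplainerAction): StepRequest {
  const base = envBaseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    url: `${base}/step`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ action }),
  };
}

// Minimal view of an env response, just enough for the done check.
interface StepResult {
  observation: { phase: string; done?: boolean };
  done?: boolean;
}

// Done when observation.done is true or phase is "done"; the top-level
// done flag returned next to the observation is also honoured.
function isEpisodeDone(result: StepResult): boolean {
  return (
    result.done === true ||
    result.observation.done === true ||
    result.observation.phase === "done"
  );
}
```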

LLM logic (port of inference.py to TypeScript)

Reimplement as pure TS in src/server/llm/:

  • buildExplorePrompt(obs, accumulatedContext)
  • buildGeneratePrompt(obs, accumulatedContext)
  • buildRepairPrompt(obs, lastCode, lastErrors)
  • parseExploreResponse(text) → { tool, query, intent } or "SKIP"
  • parseGenerateResponse(text) → { format, code, narration }
  • callLLM(messages) → OpenAI-compatible POST {API_BASE_URL}/chat/completions with Authorization: Bearer ${HF_TOKEN} and model: MODEL_NAME.
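
As a rough illustration of two of the functions listed above, the sketch below builds the chat-completions request and parses an explore reply. The reply format is an assumption: inference.py is not reproduced here, so the parser assumes the model answers either with the literal SKIP or with a JSON object containing tool, query, and intent. Falling back to SKIP on unparseable text is likewise a choice made for this sketch, and buildChatRequest is an illustrative helper name.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface LlmConfig {
  apiBaseUrl: string;
  hfToken: string;
  modelName: string;
}

// Build an OpenAI-compatible chat completions request (no network call here).
function buildChatRequest(cfg: LlmConfig, messages: ChatMessage[]) {
  return {
    url: `${cfg.apiBaseUrl.replace(/\/+$/, "")}/chat/completions`,
    method: "POST" as const,
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${cfg.hfToken}`,
    },
    body: JSON.stringify({ model: cfg.modelName, messages }),
  };
}

type ExploreParse = { tool: string; query: string; intent: string } | "SKIP";

// ASSUMPTION: the reply is either "SKIP" or contains one JSON object with
// tool/query/intent. The real format used by inference.py may differ.
function parseExploreResponse(text: string): ExploreParse {
  const trimmed = text.trim();
  if (/^SKIP\b/i.test(trimmed)) return "SKIP";
  const start = trimmed.indexOf("{");
  const end = trimmed.lastIndexOf("}");
  if (start === -1 || end <= start) return "SKIP";
  try {
    const obj = JSON.parse(trimmed.slice(start, end + 1));
    if (typeof obj.tool === "string" && typeof obj.query === "string") {
      return {
        tool: obj.tool,
        query: obj.query,
        intent: typeof obj.intent === "string" ? obj.intent : "",
      };
    }
  } catch (err) {
    // Unparseable JSON falls through to SKIP below.
  }
  return "SKIP";
}
```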

Phase routing inside runStep:

  • phase === "explore" → explore prompt; on SKIP, force a generate step instead.
  • phase === "generate" → generate prompt; if env returns phase === "repair", surface errors.
  • phase === "repair" → repair prompt seeded with last_errors + previous code.
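
The routing rules above reduce to a small pure function. This sketch assumes a hypothetical routePhase helper and an exploreSkipped flag that the caller sets once the explore reply parsed as SKIP; neither name comes from inference.py.

```typescript
type Phase = "explore" | "generate" | "repair" | "done";
type PromptKind = "explore" | "generate" | "repair";

// Decide which prompt to build next, or null when the episode is over.
// exploreSkipped is true when the LLM answered SKIP during the explore phase,
// in which case a generate step is forced instead.
function routePhase(phase: Phase, exploreSkipped: boolean = false): PromptKind | null {
  if (phase === "done") return null;
  if (phase === "explore" && exploreSkipped) return "generate";
  return phase;
}
```

A caller would first route with exploreSkipped = false, and route again with true only if the explore reply came back as SKIP.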

Episode state (single React store, e.g. Zustand)

{
  sessionId, episodeId,
  envUrl, modelName,
  obs,                   // latest ExplainerObservation
  phase, step, done, score, status,
  task: { topic, tier, difficulty, keywords, content, dataAvailable },
  research: { exploredContext, topChunks: [], lastSearchResults },
  generation: { lastFormat, lastCode, lastNarration, generatedRaw, parsedAction },
  rewards: [],           // per-step total
  rewardDetails: [],     // per-step component breakdown
  log: [],               // [START]/[LLM]/[STEP]/[END]/[WARN]/[ERROR] entries
  autoRunning: false
}

A "session" reset just generates a new episodeId and calls /reset. There is no long-lived server-side handle, so each envStep call is stateless on the dashboard side (the env tracks state internally per episode_id).
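
To illustrate how an observation might be merged into this state, here is a trimmed-down sketch written as plain functions rather than a Zustand store, so it stays self-contained. The Lite types, the startEpisode and applyStep names, and the exact log strings are assumptions made for the example.

```typescript
// Trimmed-down types: only the fields needed to show the merge.
interface ObservationLite {
  phase: string;
  done?: boolean;
  reward?: number | null;
}

interface EpisodeStateLite {
  episodeId: string;
  obs: ObservationLite | null;
  phase: string;
  step: number;
  done: boolean;
  rewards: number[];
  log: string[];
}

function isDone(obs: ObservationLite): boolean {
  return obs.done === true || obs.phase === "done";
}

// State right after /reset: step counter at 0 and a [START] log entry.
function startEpisode(episodeId: string, first: ObservationLite): EpisodeStateLite {
  return {
    episodeId,
    obs: first,
    phase: first.phase,
    step: 0,
    done: isDone(first),
    rewards: [],
    log: [`[START] episode=${episodeId}`],
  };
}

// Merge one /step observation without mutating the previous state.
function applyStep(state: EpisodeStateLite, obs: ObservationLite): EpisodeStateLite {
  const reward = typeof obs.reward === "number" ? obs.reward : 0;
  const step = state.step + 1;
  return {
    ...state,
    obs,
    phase: obs.phase,
    step,
    done: isDone(obs),
    rewards: [...state.rewards, reward],
    log: [...state.log, `[STEP] step=${step} phase=${obs.phase} reward=${reward}`],
  };
}
```

In a Zustand store the same functions could back the reset and step actions, with set() receiving their return values.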

Task bank

Port ALL_TASKS from the Python task_bank to a TS constant TASKS = [{ topic, difficulty, tier }, ...]. The dropdown shows (random) plus one entry per task, formatted as topic [difficulty, tier]. Picking a task passes topic to /reset in the request body if the env supports it; otherwise the topic is stored as a target hint and the env's own random topic is used (support is detected from the schema at build time).
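
A sketch of the dropdown logic described above. The two TASKS entries are placeholders, not real tasks: the actual list has to be ported from ALL_TASKS. Typing difficulty and tier as strings is also an assumption, as are the helper names.

```typescript
interface Task {
  topic: string;
  difficulty: string; // ASSUMPTION: string-valued in the Python task bank
  tier: string; // ASSUMPTION: string-valued in the Python task bank
}

// PLACEHOLDER entries for illustration only; port the real ALL_TASKS here.
const TASKS: Task[] = [
  { topic: "Placeholder topic A", difficulty: "easy", tier: "tier-a" },
  { topic: "Placeholder topic B", difficulty: "hard", tier: "tier-b" },
];

const RANDOM_OPTION = "(random)";

// Labels for the Select: "(random)" first, then "topic [difficulty, tier]".
function taskOptions(tasks: Task[]): string[] {
  return [RANDOM_OPTION, ...tasks.map((t) => `${t.topic} [${t.difficulty}, ${t.tier}]`)];
}

// Map the selected option index back to a topic; index 0 means random,
// so no topic is sent and the env picks its own.
function selectedTopic(tasks: Task[], optionIndex: number): string | undefined {
  return optionIndex <= 0 ? undefined : tasks[optionIndex - 1]?.topic;
}

// True when a topic was requested but the env returned a different one,
// which is the case where the UI should surface a warning.
function topicMismatch(requested: string | undefined, returned: string): boolean {
  return requested !== undefined && requested !== returned;
}
```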

UI (single page at /)

Inspector layout, dark theme, monospace accents; not a marketing page.

┌─ Header ─────────────────────────────────────────────────────────┐
│ Topic · Tier/Difficulty · Phase badge · Step n · Score · Status  │
│ env=ENV_BASE_URL  model=MODEL_NAME                               │
├─ Controls ───────────────────────────────────────────────────────┤
│ [Task ▼ (random)…] [Reset Episode] [Next Step] [Auto Run ▶/■]    │
├─ Left column ──────────────────┬─ Right column ──────────────────┤
│ Observation                    │ LLM panel                       │
│  • topic / content (collapsed) │  • raw response                 │
│  • keywords, data_available    │  • parsed action (JSON)         │
│  • feedback (latest)           │  • generated code (syntax hl.)  │
│                                │                                 │
│ Research                       │ Rewards                         │
│  • last search_results         │  • per-step total summary       │
│  • Top 5 chunks table          │  • component breakdown table    │
│    (rank, source, title,       │                                 │
│     score, url, snippet)       │                                 │
├────────────────────────────────┴─────────────────────────────────┤
│ Timeline / log (scrollable, color-coded by tag)                  │
└──────────────────────────────────────────────────────────────────┘

Shadcn primitives only: Card, Table, Badge, Button, Select, ScrollArea, Tabs (for code/raw/parsed), Separator. Code blocks use a lightweight highlighter.

Behavior

  • Reset Episode: clears state, calls envReset, populates task fields from observation, logs [START].
  • Next Step: reads phase, calls llmCall for that phase, logs [LLM] with raw + parsed, calls envStep, merges observation, appends reward + components, logs [STEP]. Disabled when done.
  • Auto Run: loops Next Step with a small delay until done, repair attempts exhausted, or user hits Stop. Logs [END] with success / score / rewards.
  • Errors from the env or LLM go into the log as [WARN] / [ERROR] and surface as a toast; the run halts but state is preserved.
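
The Auto Run behavior above can be sketched as a loop around an injected step function. The names (shouldContinue, autoRun, RunState) and the 300 ms default delay are illustrative, and reading "repair attempts exhausted" as "phase is repair with no attempts left" is an assumption. The sleep function is injected so the loop can run without real timers.

```typescript
interface RunState {
  done: boolean;
  phase: string;
  repairAttemptsLeft: number;
}

// Stop on user request, when the episode is done, or when the env is in
// the repair phase with no attempts left (ASSUMED meaning of "exhausted").
function shouldContinue(state: RunState, stopRequested: boolean): boolean {
  if (stopRequested || state.done) return false;
  if (state.phase === "repair" && state.repairAttemptsLeft <= 0) return false;
  return true;
}

// Repeats nextStep until shouldContinue says stop. On error the loop halts
// and the last good state is returned alongside the error message.
async function autoRun(
  initial: RunState,
  nextStep: (s: RunState) => Promise<RunState>,
  isStopRequested: () => boolean,
  sleep: (ms: number) => Promise<void>,
  delayMs: number = 300,
): Promise<{ state: RunState; error?: string }> {
  let state = initial;
  while (shouldContinue(state, isStopRequested())) {
    try {
      state = await nextStep(state);
    } catch (err) {
      return { state, error: err instanceof Error ? err.message : String(err) };
    }
    await sleep(delayMs);
  }
  return { state };
}
```

In the dashboard, nextStep would wrap the same handler as the Next Step button, and isStopRequested would read the autoRunning flag from the store.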

Reward handling

Port the Python helpers:

  • rewardComponents(obsMetadata, feedback) β†’ filtered numeric components (uses the explore/generate/repair allow-lists).
  • parseRewardComponentsFromFeedback(feedback) as a fallback for old observations.
  • Total per phase: explore_total | generate_total | repair_total if present, else sum of visible components.
  • Final episode score: normalized_episode_score(rewards) ported as a simple TS function (mean of generate+repair totals, clamped to [0,1]); success = score >= SUCCESS_SCORE_THRESHOLD (constant ported from explainer_env/constants.py; default 0.6, to be confirmed during implementation).
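
A sketch of the total and score rules above, using the fallback formula stated in this plan (mean of generate and repair totals, clamped to [0,1], threshold 0.6). The real normalized_episode_score and SUCCESS_SCORE_THRESHOLD still have to be checked against the Python source, and the StepReward shape is an assumption.

```typescript
type RewardPhase = "explore" | "generate" | "repair";

interface StepReward {
  phase: RewardPhase;
  components: Record<string, number>; // already filtered to visible components
}

// Plan default; to be confirmed against explainer_env/constants.py.
const SUCCESS_SCORE_THRESHOLD = 0.6;

// Use <phase>_total when present, otherwise sum the visible components.
// Keys ending in _total are skipped in the sum to avoid double counting.
function phaseTotal(step: StepReward): number {
  const explicit = step.components[`${step.phase}_total`];
  if (typeof explicit === "number") return explicit;
  let sum = 0;
  for (const name of Object.keys(step.components)) {
    if (!/_total$/.test(name)) sum += step.components[name] ?? 0;
  }
  return sum;
}

// Mean of generate and repair totals, clamped to [0, 1].
// Explore steps do not contribute; an episode with none of either scores 0.
function normalizedEpisodeScore(steps: StepReward[]): number {
  const totals = steps.filter((s) => s.phase !== "explore").map(phaseTotal);
  if (totals.length === 0) return 0;
  const mean = totals.reduce((a, b) => a + b, 0) / totals.length;
  return Math.min(1, Math.max(0, mean));
}

function isSuccess(score: number): boolean {
  return score >= SUCCESS_SCORE_THRESHOLD;
}
```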

Technical details

  • Files added:
    • src/routes/index.tsx: dashboard page, replaces placeholder.
    • src/server/env.functions.ts: envReset, envStep, envMetadata, envSchema server fns calling ${process.env.ENV_BASE_URL}.
    • src/server/llm/prompts.ts, src/server/llm/parse.ts, src/server/llm/client.ts: port of inference.py prompt/parse/call.
    • src/server/llm.functions.ts: runLlmStep({ phase, obs, prior }) server fn.
    • src/server/config.functions.ts: getRuntimeConfig() returning { envUrl, modelName } (no secrets).
    • src/lib/tasks.ts: ported task bank.
    • src/lib/rewards.ts: reward parsing/normalization.
    • src/lib/types.ts: ExplainerObservation, ExplainerAction, etc.
    • src/store/episode.ts: Zustand store.
    • src/components/inspector/*: Header, Controls, ObservationPanel, LlmPanel, ResearchPanel, RewardsPanel, Log.
  • Secret: HF_TOKEN added via Lovable Cloud secrets after plan approval. The user will be prompted to paste it.
  • Stack stays TanStack Start + React + Tailwind + shadcn. No new heavy deps; add only zustand and a small syntax highlighter (shiki or highlight.js), picking the smaller at implementation time.
  • Out of scope: persistence across reloads, multi-user sessions, auth, charts. Pure single-tab inspector.

Open items (to be resolved during implementation)

  • Confirm whether /reset accepts a topic argument; if not, we still show the task picker but treat it as a display filter and surface a warning when the returned topic differs.
  • Confirm exact SUCCESS_SCORE_THRESHOLD and normalized_episode_score formula from explainer_env/constants.py (you can paste it, otherwise default to mean-of-totals ≥ 0.6).