Goal
Rebuild the Gradio dashboard.py as a browser-based inspector for the Explainer OpenEnv at https://kgdrathan-explainer-env.hf.space. No Python, no Gradio, no config.yaml, no environment/provider/model dropdowns. The dashboard runs episodes (reset → explore → generate → repair → done), shows every observation, LLM response, reward, and log entry, and supports single-step and auto-run.
Runtime config (server-side only, no UI selectors)
Environment variables (read inside server functions, never in the browser bundle):
- ENV_BASE_URL: default https://kgdrathan-explainer-env.hf.space
- API_BASE_URL: default https://router.huggingface.co/v1
- HF_TOKEN: required, stored as a Lovable Cloud secret
- MODEL_NAME: default Qwen/Qwen2.5-72B-Instruct
The dashboard shows ENV_BASE_URL and MODEL_NAME as a small read-only metadata strip (not editable).
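A minimal sketch of how the server side might read these variables. The names `RuntimeConfig`, `loadRuntimeConfig`, and `publicConfig` are hypothetical, and the environment is passed in as a plain object (rather than read from `process.env` directly) so the sketch stays self-contained.

```typescript
// Hypothetical config shape; field names are illustrative, not from the plan.
interface RuntimeConfig {
  envBaseUrl: string;
  apiBaseUrl: string;
  hfToken: string;
  modelName: string;
}

type EnvSource = Record<string, string | undefined>;

// Defaults copied from the variable list above.
const DEFAULTS = {
  ENV_BASE_URL: "https://kgdrathan-explainer-env.hf.space",
  API_BASE_URL: "https://router.huggingface.co/v1",
  MODEL_NAME: "Qwen/Qwen2.5-72B-Instruct",
};

// Reads the four variables; unset or empty values fall back to the defaults,
// except HF_TOKEN, which has no default and is required.
function loadRuntimeConfig(env: EnvSource): RuntimeConfig {
  const hfToken = env.HF_TOKEN;
  if (!hfToken) {
    throw new Error("HF_TOKEN is required");
  }
  return {
    envBaseUrl: env.ENV_BASE_URL || DEFAULTS.ENV_BASE_URL,
    apiBaseUrl: env.API_BASE_URL || DEFAULTS.API_BASE_URL,
    hfToken,
    modelName: env.MODEL_NAME || DEFAULTS.MODEL_NAME,
  };
}

// Only the non-secret fields are returned for the read-only metadata strip.
function publicConfig(cfg: RuntimeConfig): { envUrl: string; modelName: string } {
  return { envUrl: cfg.envBaseUrl, modelName: cfg.modelName };
}
```

In a server function this would presumably be called as `loadRuntimeConfig(process.env)`, with only `publicConfig(...)` ever returned to the browser.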
Architecture
Browser (React inspector)
   │  useServerFn
   ▼
TanStack server functions
   ├─ envReset({ seed?, episode_id? })
   ├─ envStep({ action })              ── proxies POST /step on ENV_BASE_URL
   ├─ envSchema()                      ── GET /schema (cached)
   └─ llmCall({ phase, obs, prior })   ── builds prompt, calls API_BASE_URL with HF_TOKEN,
                                          returns { raw, parsed action }
All env and LLM HTTP traffic goes through server functions. The browser never sees HF_TOKEN. CORS is irrelevant because calls are same-origin RPC.
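The proxying described above can be sketched as a pure function that maps each server function to the HTTP request it would send to the env. `EnvCall`, `HttpRequest`, and `buildEnvRequest` are hypothetical names; the actual network call and the TanStack wiring are left out.

```typescript
// One variant per env-facing server function in the architecture above.
type EnvCall =
  | { kind: "reset"; seed?: number; episode_id?: string }
  | { kind: "step"; action: unknown }
  | { kind: "schema" };

interface HttpRequest {
  url: string;
  method: "GET" | "POST";
  body?: string;
}

// Builds the request description; the caller performs the request server-side.
function buildEnvRequest(baseUrl: string, call: EnvCall): HttpRequest {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  switch (call.kind) {
    case "reset":
      // JSON.stringify omits undefined fields, so absent options are not sent.
      return {
        url: `${root}/reset`,
        method: "POST",
        body: JSON.stringify({ seed: call.seed, episode_id: call.episode_id }),
      };
    case "step":
      return {
        url: `${root}/step`,
        method: "POST",
        body: JSON.stringify({ action: call.action }),
      };
    case "schema":
      return { url: `${root}/schema`, method: "GET" };
  }
}
```

Keeping request construction separate from the transport makes the proxy easy to check without touching the network.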
Explainer env contract (verified from /schema)
- POST /reset → { observation, done } where observation is an ExplainerObservation (topic, content, tier, keywords, phase, feedback, search_results, top_chunks, explored_context, explore_steps_left, repair_attempts_left, last_errors, available_tools, metadata, reward).
- POST /step body: { action: ExplainerAction }. Action shape:
  - action_type: "explore" | "generate" | "repair"
  - explore: tool (one of search_wikipedia / search_hf_papers / search_arxiv / search_scholar / fetch_docs / search_hf_hub), query, intent
  - generate / repair: format ("marimo" | "manim"), code, narration, repair_notes (repair only)
- GET /metadata, GET /schema for header info and tool list.
The episode is "done" when observation.done === true or phase === "done".
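As a rough illustration, the action shape and the done-check could be typed as below. Field names come from the contract above; the exact field types and the optionality of `repair_notes` are assumptions, and `ObservationDoneView`, `isEpisodeDone`, and `stepBody` are hypothetical helper names.

```typescript
// Tool names copied from the contract; types are assumed until checked against /schema.
type ExploreTool =
  | "search_wikipedia"
  | "search_hf_papers"
  | "search_arxiv"
  | "search_scholar"
  | "fetch_docs"
  | "search_hf_hub";

type OutputFormat = "marimo" | "manim";

type ExplainerAction =
  | { action_type: "explore"; tool: ExploreTool; query: string; intent: string }
  | { action_type: "generate"; format: OutputFormat; code: string; narration: string }
  | {
      action_type: "repair";
      format: OutputFormat;
      code: string;
      narration: string;
      repair_notes?: string; // assumed optional; the plan only says "repair only"
    };

// Only the fields the done-check needs; the full observation has many more.
interface ObservationDoneView {
  done?: boolean;
  phase?: string;
}

// Mirrors the rule above: done flag set, or phase reported as "done".
function isEpisodeDone(obs: ObservationDoneView): boolean {
  return obs.done === true || obs.phase === "done";
}

// Wraps an action in the { action } envelope that POST /step expects.
function stepBody(action: ExplainerAction): string {
  return JSON.stringify({ action });
}
```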
LLM logic (port of inference.py to TypeScript)
Reimplement as pure TS in src/server/llm/:
- buildExplorePrompt(obs, accumulatedContext)
- buildGeneratePrompt(obs, accumulatedContext)
- buildRepairPrompt(obs, lastCode, lastErrors)
- parseExploreResponse(text) → { tool, query, intent } or "SKIP"
- parseGenerateResponse(text) → { format, code, narration }
- callLLM(messages) → OpenAI-compatible POST {API_BASE_URL}/chat/completions with Authorization: Bearer ${HF_TOKEN} and model: MODEL_NAME.
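The request that callLLM sends could look roughly like this. It is a sketch under assumptions: `LlmConfig`, `buildChatRequest`, and `extractContent` are hypothetical names, and the response handling assumes the usual OpenAI-style `choices[0].message.content` layout. The prompt builders and parsers are not shown because their formats live in inference.py.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface LlmConfig {
  apiBaseUrl: string;
  hfToken: string;
  modelName: string;
}

interface ChatRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds the OpenAI-compatible chat completion request; sending it is left to the caller.
function buildChatRequest(cfg: LlmConfig, messages: ChatMessage[]): ChatRequest {
  return {
    url: `${cfg.apiBaseUrl.replace(/\/+$/, "")}/chat/completions`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${cfg.hfToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: cfg.modelName, messages }),
  };
}

// Pulls the assistant text out of an OpenAI-style response payload.
function extractContent(payload: unknown): string {
  const typed = payload as { choices?: Array<{ message?: { content?: unknown } }> } | null;
  const content = typed?.choices?.[0]?.message?.content;
  if (typeof content !== "string") {
    throw new Error("LLM response had no message content");
  }
  return content;
}
```

Because the token only appears in the `Authorization` header built on the server, nothing in this path needs to reach the browser bundle.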
Phase routing inside runStep:
- phase === "explore" → explore prompt; on SKIP, force a generate step instead.
- phase === "generate" → generate prompt; if env returns phase === "repair", surface errors.
- phase === "repair" → repair prompt seeded with last_errors + previous code.
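The routing rules above can be sketched as three small pure functions. All names here (`promptForPhase`, `resolveExplore`, `shouldSurfaceErrors`) are hypothetical, and the sketch only decides what to do next; it does not build prompts or call the env.

```typescript
type Phase = "explore" | "generate" | "repair" | "done";
type PromptKind = "explore" | "generate" | "repair";
type ExploreParse = { tool: string; query: string; intent: string } | "SKIP";

// Each live phase uses the prompt of the same name; "done" needs no prompt.
function promptForPhase(phase: Phase): PromptKind | null {
  return phase === "done" ? null : phase;
}

type ExploreOutcome =
  | {
      kind: "explore";
      action: { action_type: "explore"; tool: string; query: string; intent: string };
    }
  | { kind: "force_generate" };

// On SKIP the step is rerouted to generation instead of sending an explore action.
function resolveExplore(parsed: ExploreParse): ExploreOutcome {
  if (parsed === "SKIP") {
    return { kind: "force_generate" };
  }
  return {
    kind: "explore",
    action: { action_type: "explore", tool: parsed.tool, query: parsed.query, intent: parsed.intent },
  };
}

// After a generate step, a returned "repair" phase means the env reported errors to show.
function shouldSurfaceErrors(sentPhase: Phase, returnedPhase: Phase): boolean {
  return sentPhase === "generate" && returnedPhase === "repair";
}
```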
Episode state (single React store, e.g. Zustand)
{
  sessionId, episodeId,
  envUrl, modelName,
  obs,                       // latest ExplainerObservation
  phase, step, done, score, status,
  task: { topic, tier, difficulty, keywords, content, dataAvailable },
  research: { exploredContext, topChunks: [], lastSearchResults },
  generation: { lastFormat, lastCode, lastNarration, generatedRaw, parsedAction },
  rewards: [],               // per-step total
  rewardDetails: [],         // per-step component breakdown
  log: [],                   // [START]/[LLM]/[STEP]/[END]/[WARN] entries
  autoRunning: false
}
A "session" reset just generates a new episodeId and calls /reset. There is no long-lived server-side handle, so each envStep call is stateless on the dashboard server (the env tracks state internally per episode_id).
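A reduced sketch of the reset and step transitions, written as pure functions that a Zustand store could wrap. `EpisodeCore`, `startEpisode`, and `applyStep` are hypothetical names, only a few of the fields above are modeled, and the log line formats are illustrative.

```typescript
// Subset of the observation that these transitions read.
interface ObservationLike {
  phase: string;
  done?: boolean;
  reward?: number | null;
}

// Subset of the episode state shown above.
interface EpisodeCore {
  episodeId: string;
  phase: string;
  step: number;
  done: boolean;
  rewards: number[];
  log: string[];
}

function isDone(obs: ObservationLike): boolean {
  return obs.done === true || obs.phase === "done";
}

// A reset discards prior state and starts from the first observation.
function startEpisode(newEpisodeId: string, first: ObservationLike): EpisodeCore {
  return {
    episodeId: newEpisodeId,
    phase: first.phase,
    step: 0,
    done: isDone(first),
    rewards: [],
    log: [`[START] episode=${newEpisodeId}`],
  };
}

// Merges one env response into the state without mutating the previous state.
function applyStep(state: EpisodeCore, obs: ObservationLike): EpisodeCore {
  const reward = typeof obs.reward === "number" ? obs.reward : 0; // missing reward counts as 0
  const step = state.step + 1;
  return {
    ...state,
    phase: obs.phase,
    step,
    done: isDone(obs),
    rewards: [...state.rewards, reward],
    log: [...state.log, `[STEP] step=${step} phase=${obs.phase} reward=${reward}`],
  };
}
```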
Task bank
Port ALL_TASKS from the Python task_bank to a TS constant TASKS = [{ topic, difficulty, tier }, ...]. The dropdown shows (random) plus one entry per task, labeled topic [difficulty, tier]. Picking a task passes topic to /reset in the request body if the env supports it; otherwise it is stored as a target hint and the env's own random topic is used (we'll detect which case applies from the schema at build time).
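The dropdown labels and the mismatch warning could be derived as follows. This is a sketch: `Task`, `taskLabel`, `dropdownOptions`, and `topicWarning` are hypothetical names, `difficulty` and `tier` are typed as plain strings because their real value sets live in the Python task bank, and no actual task entries are reproduced here.

```typescript
interface Task {
  topic: string;
  difficulty: string;
  tier: string;
}

const RANDOM_OPTION = "(random)";

// Formats one task as "topic [difficulty, tier]".
function taskLabel(task: Task): string {
  return `${task.topic} [${task.difficulty}, ${task.tier}]`;
}

// The dropdown always starts with the random option, followed by one label per task.
function dropdownOptions(tasks: readonly Task[]): string[] {
  return [RANDOM_OPTION, ...tasks.map(taskLabel)];
}

// Returns a log line when the env ignored the requested topic, or null when
// nothing was requested or the topics match.
function topicWarning(requested: string | null, returned: string): string | null {
  if (requested === null || requested === returned) {
    return null;
  }
  return `[WARN] requested topic "${requested}" but env returned "${returned}"`;
}
```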
UI (single page at /)
Inspector layout, dark theme, monospace accents; not a marketing page.
┌─ Header ──────────────────────────────────────────────────────────┐
│ Topic · Tier/Difficulty · Phase badge · Step n · Score · Status   │
│ env=ENV_BASE_URL   model=MODEL_NAME                               │
├─ Controls ────────────────────────────────────────────────────────┤
│ [Task ▼ (random)…] [Reset Episode] [Next Step] [Auto Run ▶/■]     │
├─ Left column ──────────────────┬─ Right column ───────────────────┤
│ Observation                    │ LLM panel                        │
│  • topic / content (collapsed) │  • raw response                  │
│  • keywords, data_available    │  • parsed action (JSON)          │
│  • feedback (latest)           │  • generated code (syntax hl.)   │
│                                │                                  │
│ Research                       │ Rewards                          │
│  • last search_results         │  • per-step total summary        │
│  • Top 5 chunks table          │  • component breakdown table     │
│    (rank, source, title,       │                                  │
│     score, url, snippet)       │                                  │
├────────────────────────────────┴──────────────────────────────────┤
│ Timeline / log (scrollable, color-coded by tag)                   │
└───────────────────────────────────────────────────────────────────┘
Shadcn primitives only: Card, Table, Badge, Button, Select, ScrollArea, Tabs (for code/raw/parsed), Separator. Code blocks use a lightweight highlighter.
Behavior
- Reset Episode: clears state, calls envReset, populates task fields from observation, logs [START].
- Next Step: reads phase, calls llmCall for that phase, logs [LLM] with raw + parsed, calls envStep, merges observation, appends reward + components, logs [STEP]. Disabled when done.
- Auto Run: loops Next Step with a small delay until done, repair attempts exhausted, or user hits Stop. Logs [END] with success / score / rewards.
- Errors from the env or LLM go into the log as [WARN]/[ERROR] and surface as a toast; the run halts but state is preserved.
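The Auto Run stop conditions can be sketched as a single predicate that the loop checks before each Next Step. `AutoRunView` and `autoRunStopReason` are hypothetical names, and treating "repair attempts exhausted" as "phase is repair with no attempts left" is an assumption about how the env reports it.

```typescript
// The slice of store state the auto-run loop needs to inspect.
interface AutoRunView {
  done: boolean;
  phase: string;
  repairAttemptsLeft: number;
  stopRequested: boolean;
}

type StopReason = "user_stop" | "done" | "repairs_exhausted";

// Returns why the loop should stop, or null if another step should run.
function autoRunStopReason(view: AutoRunView): StopReason | null {
  if (view.stopRequested) {
    return "user_stop";
  }
  if (view.done || view.phase === "done") {
    return "done";
  }
  // Assumption: exhaustion only matters once the episode is in the repair phase.
  if (view.phase === "repair" && view.repairAttemptsLeft <= 0) {
    return "repairs_exhausted";
  }
  return null;
}
```

The loop itself would then be: while `autoRunStopReason(...)` is null, run Next Step and wait for the small delay; once it returns a reason, log [END] with that reason alongside success, score, and rewards.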
Reward handling
Port the Python helpers:
- rewardComponents(obsMetadata, feedback) → filtered numeric components (uses the explore/generate/repair allow-lists).
- parseRewardComponentsFromFeedback(feedback) as a fallback for old observations.
- Total per phase: explore_total | generate_total | repair_total if present, else sum of visible components.
- Final episode score: normalized_episode_score(rewards) ported as a simple TS function (mean of generate+repair totals, clamped to [0,1]); success = score >= SUCCESS_SCORE_THRESHOLD (constant ported from explainer_env/constants.py, default 0.6, to be confirmed during implementation).
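A sketch of the totals and the episode score as described above. It follows the plan's stated default (mean of generate and repair totals, clamped to [0,1], threshold 0.6), which is still unconfirmed against the Python source. `phaseTotal`, `normalizedEpisodeScore`, and `isSuccess` are hypothetical names, and returning 0 when no generate or repair step exists is an assumption.

```typescript
type Phase = "explore" | "generate" | "repair";

// Plan default; the real value lives in explainer_env/constants.py and is unconfirmed.
const SUCCESS_SCORE_THRESHOLD = 0.6;

// Uses the explicit <phase>_total when present, otherwise sums the non-total components.
function phaseTotal(phase: Phase, components: Record<string, number>): number {
  const explicit = components[`${phase}_total`];
  if (typeof explicit === "number") {
    return explicit;
  }
  let sum = 0;
  for (const name of Object.keys(components)) {
    const value = components[name];
    if (!/_total$/.test(name) && typeof value === "number") {
      sum += value;
    }
  }
  return sum;
}

interface StepReward {
  phase: Phase;
  total: number;
}

// Mean of generate and repair totals, clamped to [0, 1]; explore steps are ignored.
function normalizedEpisodeScore(rewards: readonly StepReward[]): number {
  const scored = rewards.filter((r) => r.phase !== "explore");
  if (scored.length === 0) {
    return 0; // assumption: no scored steps means a zero score
  }
  const mean = scored.reduce((acc, r) => acc + r.total, 0) / scored.length;
  return Math.min(1, Math.max(0, mean));
}

function isSuccess(score: number): boolean {
  return score >= SUCCESS_SCORE_THRESHOLD;
}
```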
Technical details
- Files added:
  - src/routes/index.tsx: dashboard page, replaces placeholder.
  - src/server/env.functions.ts: envReset, envStep, envMetadata, envSchema server fns calling ${process.env.ENV_BASE_URL}.
  - src/server/llm/prompts.ts, src/server/llm/parse.ts, src/server/llm/client.ts: port of inference.py prompt/parse/call.
  - src/server/llm.functions.ts: runLlmStep({ phase, obs, prior }) server fn.
  - src/server/config.functions.ts: getRuntimeConfig() returning { envUrl, modelName } (no secrets).
  - src/lib/tasks.ts: ported task bank.
  - src/lib/rewards.ts: reward parsing/normalization.
  - src/lib/types.ts: ExplainerObservation, ExplainerAction, etc.
  - src/store/episode.ts: Zustand store.
  - src/components/inspector/*: Header, Controls, ObservationPanel, LlmPanel, ResearchPanel, RewardsPanel, Log.
- Secret: HF_TOKEN added via Lovable Cloud secrets after plan approval. The user will be prompted to paste it.
- Stack stays TanStack Start + React + Tailwind + shadcn. No new heavy deps; add only zustand and a small syntax highlighter (shiki or highlight.js), picking the smaller at implementation time.
- Out of scope: persistence across reloads, multi-user sessions, auth, charts. Pure single-tab inspector.
Open items resolved during implementation
- Confirm whether /reset accepts a topic argument; if not, we still show the task picker but treat it as a display filter and surface a warning when the returned topic differs.
- Confirm exact SUCCESS_SCORE_THRESHOLD and normalized_episode_score formula from explainer_env/constants.py (you can paste it, otherwise default to mean-of-totals ≥ 0.6).