explainer-env-dashboard / plan.md

Gnan Deep Rathan K

Deploy dashboard without binary lockfile

1b83e76 23 days ago

Goal

Rebuild the Gradio dashboard.py as a browser-based inspector for the Explainer OpenEnv at https://kgdrathan-explainer-env.hf.space. No Python, no Gradio, no config.yaml, no environment/provider/model dropdowns. The dashboard runs episodes (reset → explore → generate → repair → done), shows everything, and supports single-step and auto-run.

Runtime config (server-side only, no UI selectors)

Environment variables (read inside server functions, never in the browser bundle):

ENV_BASE_URL — default https://kgdrathan-explainer-env.hf.space
API_BASE_URL — default https://router.huggingface.co/v1
HF_TOKEN — required, stored as a Lovable Cloud secret
MODEL_NAME — default Qwen/Qwen2.5-72B-Instruct

The dashboard shows ENV_BASE_URL and MODEL_NAME as a small read-only metadata strip (not editable).
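
A minimal sketch of how the server side might resolve these variables, assuming the defaults listed above. The names loadRuntimeConfig, publicConfig, and CONFIG_DEFAULTS are illustrative and not part of the plan; the environment is passed in as a plain record (for example process.env) so the function can be exercised without a real server.

```typescript
// Server-side runtime config, resolved from environment variables.
interface RuntimeConfig {
  envBaseUrl: string;
  apiBaseUrl: string;
  hfToken: string;
  modelName: string;
}

// Defaults copied from the plan. HF_TOKEN deliberately has no default.
const CONFIG_DEFAULTS = {
  ENV_BASE_URL: "https://kgdrathan-explainer-env.hf.space",
  API_BASE_URL: "https://router.huggingface.co/v1",
  MODEL_NAME: "Qwen/Qwen2.5-72B-Instruct",
} as const;

// Hypothetical helper: reads the four variables and fails fast when the token is missing.
function loadRuntimeConfig(env: Record<string, string | undefined>): RuntimeConfig {
  const hfToken = env["HF_TOKEN"]?.trim();
  if (!hfToken) {
    throw new Error("HF_TOKEN is required but not set");
  }
  return {
    envBaseUrl: env["ENV_BASE_URL"]?.trim() || CONFIG_DEFAULTS.ENV_BASE_URL,
    apiBaseUrl: env["API_BASE_URL"]?.trim() || CONFIG_DEFAULTS.API_BASE_URL,
    hfToken,
    modelName: env["MODEL_NAME"]?.trim() || CONFIG_DEFAULTS.MODEL_NAME,
  };
}

// Only the non-secret fields are returned for the read-only metadata strip.
function publicConfig(config: RuntimeConfig): { envUrl: string; modelName: string } {
  return { envUrl: config.envBaseUrl, modelName: config.modelName };
}
```

In a server function this would be called as loadRuntimeConfig(process.env), with only publicConfig(...) ever returned to the browser.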

Architecture

Browser (React inspector)
        │  useServerFn
        ▼
TanStack server functions
  ├─ envReset({ seed?, episode_id? })
  ├─ envStep({ action })          ── proxies POST /step on ENV_BASE_URL
  ├─ envSchema()                  ── GET /schema (cached)
  └─ llmCall({ phase, obs, prior }) ── builds prompt, calls API_BASE_URL with HF_TOKEN,
                                       returns { raw, parsed action }

All env and LLM HTTP traffic goes through server functions. The browser never sees HF_TOKEN. CORS is irrelevant because calls are same-origin RPC.

Explainer env contract (verified from `/schema`)

POST /reset → { observation, done } where observation is an ExplainerObservation (topic, content, tier, keywords, phase, feedback, search_results, top_chunks, explored_context, explore_steps_left, repair_attempts_left, last_errors, available_tools, metadata, reward).
POST /step body: { action: ExplainerAction }. Action shape:
- action_type: "explore" | "generate" | "repair"
- explore: tool (one of search_wikipedia / search_hf_papers / search_arxiv / search_scholar / fetch_docs / search_hf_hub), query, intent
- generate / repair: format ("marimo" | "manim"), code, narration, repair_notes (repair only)
GET /metadata, GET /schema for header info and tool list.

The episode is "done" when the response's done flag is true or observation.phase === "done".
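
The contract above could be typed roughly as follows. This is a sketch under assumptions: the action fields are taken to be flat on the action object (the schema may nest them), the /step response is assumed to have the same { observation, done } shape as /reset, and the helper names buildStepBody and isEpisodeDone are illustrative.

```typescript
// Tool names as listed in the contract.
type ExploreTool =
  | "search_wikipedia"
  | "search_hf_papers"
  | "search_arxiv"
  | "search_scholar"
  | "fetch_docs"
  | "search_hf_hub";

type OutputFormat = "marimo" | "manim";

// Assumption: fields sit directly on the action; check /schema for the real layout.
interface ExploreAction {
  action_type: "explore";
  tool: ExploreTool;
  query: string;
  intent: string;
}

interface GenerateAction {
  action_type: "generate";
  format: OutputFormat;
  code: string;
  narration: string;
}

interface RepairAction {
  action_type: "repair";
  format: OutputFormat;
  code: string;
  narration: string;
  repair_notes: string;
}

type ExplainerAction = ExploreAction | GenerateAction | RepairAction;

// Only the fields needed for the done check are typed here.
interface EnvResult {
  observation: { phase: string; [key: string]: unknown };
  done: boolean;
}

// POST /step expects the action wrapped in an { action } envelope.
function buildStepBody(action: ExplainerAction): string {
  return JSON.stringify({ action });
}

// Either signal ends the episode.
function isEpisodeDone(result: EnvResult): boolean {
  return result.done === true || result.observation.phase === "done";
}
```

The discriminated union on action_type lets the compiler reject, for example, a generate action that carries repair_notes.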

LLM logic (port of `inference.py` to TypeScript)

Reimplement as pure TS in src/server/llm/:

buildExplorePrompt(obs, accumulatedContext)
buildGeneratePrompt(obs, accumulatedContext)
buildRepairPrompt(obs, lastCode, lastErrors)
parseExploreResponse(text) → { tool, query, intent } or "SKIP"
parseGenerateResponse(text) → { format, code, narration }
callLLM(messages) → OpenAI-compatible POST {API_BASE_URL}/chat/completions with Authorization: Bearer ${HF_TOKEN} and model: MODEL_NAME.
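
As a rough illustration of the callLLM request, the sketch below only builds the request and reads the reply, leaving the actual fetch to the caller. buildChatRequest, extractCompletionText, and the LlmSettings type are hypothetical names; the response path choices[0].message.content follows the OpenAI-compatible chat completions format.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface LlmSettings {
  apiBaseUrl: string;
  hfToken: string;
  modelName: string;
}

interface PreparedRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds the POST {API_BASE_URL}/chat/completions request without sending it.
function buildChatRequest(settings: LlmSettings, messages: ChatMessage[]): PreparedRequest {
  // Drop trailing slashes so the joined URL has exactly one separator.
  const base = settings.apiBaseUrl.replace(/\/+$/, "");
  return {
    url: `${base}/chat/completions`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${settings.hfToken}`,
    },
    body: JSON.stringify({ model: settings.modelName, messages }),
  };
}

// Pulls the first choice's text out of an OpenAI-compatible response payload.
function extractCompletionText(payload: unknown): string {
  const choices = (payload as { choices?: Array<{ message?: { content?: unknown } }> } | null)?.choices;
  const content = choices?.[0]?.message?.content;
  if (typeof content !== "string") {
    throw new Error("LLM response has no text content");
  }
  return content;
}
```

A server function would then run something like const res = await fetch(req.url, { method: req.method, headers: req.headers, body: req.body }) and pass await res.json() to extractCompletionText.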

Phase routing inside runStep:

phase === "explore" → explore prompt; on SKIP, force a generate step instead.
phase === "generate" → generate prompt; if env returns phase === "repair", surface errors.
phase === "repair" → repair prompt seeded with last_errors + previous code.
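
The routing rules above can be sketched as a small pure function. planStep and the StepPlan type are illustrative names, and the sketch assumes the explore response has already been parsed into either a tool request or the literal "SKIP".

```typescript
// Result of parseExploreResponse: a tool request or the SKIP sentinel.
type ExploreDecision = { tool: string; query: string; intent: string } | "SKIP";

// What the dashboard should do next for the current phase.
type StepPlan =
  | { kind: "explore"; tool: string; query: string; intent: string }
  | { kind: "generate" }
  | { kind: "repair" };

// Hypothetical router: maps the env phase (plus the parsed explore decision) to a plan.
function planStep(phase: string, exploreDecision?: ExploreDecision): StepPlan {
  switch (phase) {
    case "explore":
      if (exploreDecision === undefined) {
        throw new Error("explore phase needs a parsed explore decision");
      }
      // On SKIP, force a generate step instead of exploring further.
      if (exploreDecision === "SKIP") {
        return { kind: "generate" };
      }
      return { kind: "explore", ...exploreDecision };
    case "generate":
      return { kind: "generate" };
    case "repair":
      return { kind: "repair" };
    default:
      // "done" and unknown phases have no next step.
      throw new Error(`no step can be planned for phase "${phase}"`);
  }
}
```

Keeping the router free of I/O means the SKIP fallback can be checked without calling the LLM or the env.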

Episode state (single React store, e.g. Zustand)

{
  sessionId, episodeId,
  envUrl, modelName,
  obs,                   // latest ExplainerObservation
  phase, step, done, score, status,
  task: { topic, tier, difficulty, keywords, content, dataAvailable },
  research: { exploredContext, topChunks: [], lastSearchResults },
  generation: { lastFormat, lastCode, lastNarration, generatedRaw, parsedAction },
  rewards: [],           // per-step total
  rewardDetails: [],     // per-step component breakdown
  log: [],               // [START]/[LLM]/[STEP]/[END]/[WARN]/[ERROR] entries
  autoRunning: false
}

A "session" reset just generates a new episodeId and calls /reset. The dashboard keeps no long-lived server-side handle, so its server functions hold no per-episode state; the env tracks state internally per episode_id.

Task bank

Port ALL_TASKS from the Python task_bank to a TS constant TASKS = [{ topic, difficulty, tier }, ...]. The dropdown shows (random) plus one entry per task, labelled topic [difficulty, tier]. Picking a task passes topic to /reset in the request body if the env supports it; otherwise the choice is stored as a target hint and the env's own random topic is used (we'll detect which from the schema at build time).
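
A sketch of the dropdown labelling and the topic-mismatch check. The entries in SAMPLE_TASKS are placeholders for illustration only (the real list comes from ALL_TASKS), the string types for difficulty and tier are an assumption, and taskLabel, dropdownOptions, and topicDiffers are hypothetical helper names.

```typescript
interface TaskEntry {
  topic: string;
  difficulty: string; // assumption: may be an enum or number in the Python task bank
  tier: string;       // assumption: same caveat as difficulty
}

// Placeholder entries, NOT the real task bank.
const SAMPLE_TASKS: TaskEntry[] = [
  { topic: "Example topic A", difficulty: "easy", tier: "tier-1" },
  { topic: "Example topic B", difficulty: "hard", tier: "tier-2" },
];

const RANDOM_OPTION = "(random)";

// Label format from the plan: topic [difficulty, tier].
function taskLabel(task: TaskEntry): string {
  return `${task.topic} [${task.difficulty}, ${task.tier}]`;
}

// The dropdown lists (random) first, then one label per task.
function dropdownOptions(tasks: TaskEntry[]): string[] {
  return [RANDOM_OPTION, ...tasks.map(taskLabel)];
}

// True when a specific topic was requested but the env returned a different one.
function topicDiffers(requestedTopic: string | null, returnedTopic: string): boolean {
  if (requestedTopic === null) {
    return false; // (random) was selected, so any topic is acceptable
  }
  return requestedTopic.trim().toLowerCase() !== returnedTopic.trim().toLowerCase();
}
```

If /reset turns out not to accept a topic, topicDiffers is where the warning for a mismatched returned topic could be triggered.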

UI (single page at `/`)

Inspector layout, dark theme, monospace accents — not a marketing page.

┌─ Header ─────────────────────────────────────────────────────────┐
│ Topic · Tier/Difficulty · Phase badge · Step n · Score · Status  │
│ env=ENV_BASE_URL  model=MODEL_NAME                               │
├─ Controls ───────────────────────────────────────────────────────┤
│ [Task ▼ (random)…] [Reset Episode] [Next Step] [Auto Run ▶/■]  │
├─ Left column ─────────────────┬─ Right column ──────────────────┤
│ Observation                    │ LLM panel                       │
│  • topic / content (collapsed) │  • raw response                 │
│  • keywords, data_available    │  • parsed action (JSON)         │
│  • feedback (latest)           │  • generated code (syntax hl.)  │
│                                │                                 │
│ Research                       │ Rewards                         │
│  • last search_results         │  • per-step total summary       │
│  • Top 5 chunks table          │  • component breakdown table    │
│    (rank, source, title,       │                                 │
│     score, url, snippet)       │                                 │
├──────────────────────────────────────────────────────────────────┤
│ Timeline / log (scrollable, color-coded by tag)                  │
└──────────────────────────────────────────────────────────────────┘

Shadcn primitives only: Card, Table, Badge, Button, Select, ScrollArea, Tabs (for code/raw/parsed), Separator. Code blocks use a lightweight highlighter.

Behavior

Reset Episode: clears state, calls envReset, populates task fields from observation, logs [START].
Next Step: reads phase, calls llmCall for that phase, logs [LLM] with raw + parsed, calls envStep, merges observation, appends reward + components, logs [STEP]. Disabled when done.
Auto Run: loops Next Step with a small delay until done, repair attempts exhausted, or user hits Stop. Logs [END] with success / score / rewards.
Errors from the env or LLM go into the log as [WARN] / [ERROR] and surface as a toast; the run halts but state is preserved.
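
The step-merge and auto-run stop conditions might look like the sketch below. RunState, StepOutcome, applyStepOutcome, and shouldContinueAutoRun are illustrative names; reading the reward from observation.reward and treating "repair attempts exhausted" as "phase is repair with no attempts left" are assumptions.

```typescript
// Subset of the episode store that the run loop cares about.
interface RunState {
  phase: string;
  done: boolean;
  step: number;
  repairAttemptsLeft: number;
  rewards: number[];
  stopRequested: boolean;
}

// Subset of an env response used when merging a step.
interface StepOutcome {
  observation: { phase: string; repair_attempts_left: number; reward: number | null };
  done: boolean;
}

// Merges one env response into the state without mutating the previous state.
function applyStepOutcome(state: RunState, outcome: StepOutcome): RunState {
  const { observation } = outcome;
  return {
    ...state,
    phase: observation.phase,
    done: outcome.done || observation.phase === "done",
    step: state.step + 1,
    repairAttemptsLeft: observation.repair_attempts_left,
    rewards: [...state.rewards, observation.reward ?? 0],
  };
}

// Auto Run keeps going until done, repair attempts run out, or the user hits Stop.
function shouldContinueAutoRun(state: RunState): boolean {
  if (state.stopRequested || state.done || state.phase === "done") {
    return false;
  }
  if (state.phase === "repair" && state.repairAttemptsLeft <= 0) {
    return false;
  }
  return true;
}
```

The Auto Run loop would then be a while (shouldContinueAutoRun(state)) around the same handler that Next Step uses, with a small delay between iterations.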

Reward handling

Port the Python helpers:

rewardComponents(obsMetadata, feedback) → filtered numeric components (uses the explore/generate/repair allow-lists).
parseRewardComponentsFromFeedback(feedback) as a fallback for old observations.
Total per phase: explore_total | generate_total | repair_total if present, else sum of visible components.
Final episode score: normalized_episode_score(rewards) ported as a simple TS function (mean of generate+repair totals, clamped to [0,1]); success = score >= SUCCESS_SCORE_THRESHOLD (constant ported from explainer_env/constants.py; default 0.6, to be confirmed during implementation).
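
A sketch of the total and score rules as described here, not a port of the actual Python helpers. The formula (mean of generate and repair totals, clamped to [0,1]) and the 0.6 threshold are the plan's unconfirmed defaults; StepReward, phaseTotal, normalizedEpisodeScore, and isSuccess are illustrative names.

```typescript
interface StepReward {
  phase: "explore" | "generate" | "repair";
  components: Record<string, number>;
}

// Plan default; the real value lives in explainer_env/constants.py and is unconfirmed.
const SUCCESS_SCORE_THRESHOLD = 0.6;

// Uses <phase>_total when present, otherwise sums the remaining components.
function phaseTotal(reward: StepReward): number {
  const explicit = reward.components[`${reward.phase}_total`];
  if (typeof explicit === "number" && Number.isFinite(explicit)) {
    return explicit;
  }
  return Object.entries(reward.components)
    .filter(([key]) => !key.endsWith("_total"))
    .reduce((sum, [, value]) => sum + value, 0);
}

// Assumed formula: mean of generate and repair totals, clamped to [0, 1].
function normalizedEpisodeScore(rewards: StepReward[]): number {
  const totals = rewards.filter((r) => r.phase !== "explore").map(phaseTotal);
  if (totals.length === 0) {
    return 0; // no generate or repair step has been scored yet
  }
  const mean = totals.reduce((a, b) => a + b, 0) / totals.length;
  return Math.min(1, Math.max(0, mean));
}

function isSuccess(score: number): boolean {
  return score >= SUCCESS_SCORE_THRESHOLD;
}
```

For example, a generate step with generate_total 0.5 and a repair step whose components sum to 0.75 give a score of 0.625, which clears the assumed threshold; explore rewards do not enter the mean.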

Technical details

Files added:
- src/routes/index.tsx — dashboard page, replaces placeholder.
- src/server/env.functions.ts — envReset, envStep, envMetadata, envSchema server fns calling ${process.env.ENV_BASE_URL}.
- src/server/llm/prompts.ts, src/server/llm/parse.ts, src/server/llm/client.ts — port of inference.py prompt/parse/call.
- src/server/llm.functions.ts — runLlmStep({ phase, obs, prior }) server fn.
- src/server/config.functions.ts — getRuntimeConfig() returning { envUrl, modelName } (no secrets).
- src/lib/tasks.ts — ported task bank.
- src/lib/rewards.ts — reward parsing/normalization.
- src/lib/types.ts — ExplainerObservation, ExplainerAction, etc.
- src/store/episode.ts — Zustand store.
- src/components/inspector/* — Header, Controls, ObservationPanel, LlmPanel, ResearchPanel, RewardsPanel, Log.
Secret: HF_TOKEN added via Lovable Cloud secrets after plan approval. The user will be prompted to paste it.
Stack stays TanStack Start + React + Tailwind + shadcn. No new heavy deps; add only zustand and a small syntax highlighter (shiki or highlight.js) — pick the smaller at implementation time.
Out of scope: persistence across reloads, multi-user sessions, auth, charts. Pure single-tab inspector.

Open items to resolve during implementation

Confirm whether /reset accepts a topic argument; if not, we still show the task picker but treat it as a display filter and surface a warning when the returned topic differs.
Confirm exact SUCCESS_SCORE_THRESHOLD and normalized_episode_score formula from explainer_env/constants.py (you can paste it, otherwise default to mean-of-totals ≥ 0.6).