explainer-env-dashboard / plan.md

Gnan Deep Rathan K

Deploy dashboard without binary lockfile

1b83e76 23 days ago

	## Goal

	Rebuild the Gradio `dashboard.py` as a browser-based inspector for the Explainer OpenEnv at `https://kgdrathan-explainer-env.hf.space`. No Python, no Gradio, no `config.yaml`, no environment/provider/model dropdowns. The dashboard runs episodes (reset → explore → generate → repair → done), shows the observation, LLM output, research results, rewards, and log for every step, and supports both single-step and auto-run modes.

	## Runtime config (server-side only, no UI selectors)

	Environment variables (read inside server functions, never in the browser bundle):

	- `ENV_BASE_URL` — default `https://kgdrathan-explainer-env.hf.space`
	- `API_BASE_URL` — default `https://router.huggingface.co/v1`
	- `HF_TOKEN` — required, stored as a Lovable Cloud secret
	- `MODEL_NAME` — default `Qwen/Qwen2.5-72B-Instruct`

	The dashboard shows `ENV_BASE_URL` and `MODEL_NAME` as a small read-only metadata strip (not editable).
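
As a rough illustration of the config rules above, the following sketch resolves the four variables from an env-like record and exposes only the non-secret fields. The names `resolveRuntimeConfig` and `publicConfig` are hypothetical, and treating an empty string as "unset" is an assumption.

```typescript
// Sketch only: function and type names are illustrative, not taken from the plan.
type EnvSource = Record<string, string | undefined>;

interface RuntimeConfig {
  envBaseUrl: string;
  apiBaseUrl: string;
  hfToken: string;
  modelName: string;
}

// Defaults as listed in the plan.
const DEFAULTS = {
  ENV_BASE_URL: "https://kgdrathan-explainer-env.hf.space",
  API_BASE_URL: "https://router.huggingface.co/v1",
  MODEL_NAME: "Qwen/Qwen2.5-72B-Instruct",
} as const;

// Intended to run inside a server function, e.g. with `process.env` as the source.
function resolveRuntimeConfig(env: EnvSource): RuntimeConfig {
  const hfToken = env["HF_TOKEN"];
  if (!hfToken) {
    throw new Error("HF_TOKEN is required but not set");
  }
  return {
    // Assumption: an empty string counts as unset and falls back to the default.
    envBaseUrl: env["ENV_BASE_URL"] || DEFAULTS.ENV_BASE_URL,
    apiBaseUrl: env["API_BASE_URL"] || DEFAULTS.API_BASE_URL,
    hfToken,
    modelName: env["MODEL_NAME"] || DEFAULTS.MODEL_NAME,
  };
}

// Only these two fields are returned to the browser for the metadata strip.
function publicConfig(cfg: RuntimeConfig): { envUrl: string; modelName: string } {
  return { envUrl: cfg.envBaseUrl, modelName: cfg.modelName };
}
```

Keeping the secret-bearing object and the browser-facing object as separate types makes it harder to return `HF_TOKEN` from a server function by accident.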

	## Architecture

	```text
	Browser (React inspector)
	   │  useServerFn
	   ▼
	TanStack server functions
	   ├─ envReset({ seed?, episode_id? })
	   ├─ envStep({ action })             ── proxies POST /step on ENV_BASE_URL
	   ├─ envSchema()                     ── GET /schema (cached)
	   └─ llmCall({ phase, obs, prior })  ── builds prompt, calls API_BASE_URL with HF_TOKEN,
	                                         returns { raw, parsed action }
	```

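	All env and LLM HTTP traffic goes through server functions, so the browser never sees `HF_TOKEN`. CORS does not come into play: the browser only makes same-origin RPC calls, and the cross-origin requests to the env and the LLM router are issued server-side.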

	## Explainer env contract (verified from `/schema`)

	- POST `/reset` → `{ observation, done }` where `observation` is an `ExplainerObservation` (topic, content, tier, keywords, phase, feedback, search_results, top_chunks, explored_context, explore_steps_left, repair_attempts_left, last_errors, available_tools, metadata, reward).
	- POST `/step` body: `{ action: ExplainerAction }`. Action shape:
	  - `action_type`: `"explore" | "generate" | "repair"`
	  - explore: `tool` (one of `search_wikipedia` / `search_hf_papers` / `search_arxiv` / `search_scholar` / `fetch_docs` / `search_hf_hub`), `query`, `intent`
	  - generate / repair: `format` (`"marimo" | "manim"`), `code`, `narration`, `repair_notes` (repair only)
	- GET `/metadata`, GET `/schema` for header info and tool list.

	The episode is "done" when `observation.done === true` or `phase === "done"`.
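
The contract above could be typed roughly as follows. This is a sketch: which fields are required versus optional is an assumption (the real `/schema` is authoritative), and `ObservationLike`, `stepRequestBody`, and `isEpisodeDone` are hypothetical names.

```typescript
// Tool names as listed in the contract above.
type ExploreTool =
  | "search_wikipedia"
  | "search_hf_papers"
  | "search_arxiv"
  | "search_scholar"
  | "fetch_docs"
  | "search_hf_hub";

type OutputFormat = "marimo" | "manim";

// Assumption: every listed field is required for its action type.
type ExplainerAction =
  | { action_type: "explore"; tool: ExploreTool; query: string; intent: string }
  | { action_type: "generate"; format: OutputFormat; code: string; narration: string }
  | {
      action_type: "repair";
      format: OutputFormat;
      code: string;
      narration: string;
      repair_notes: string;
    };

// Only the fields needed for the done check are modelled here.
interface ObservationLike {
  phase: string;
  done?: boolean;
}

// Serialises the POST /step body: { action: ExplainerAction }.
function stepRequestBody(action: ExplainerAction): string {
  return JSON.stringify({ action });
}

// Mirrors the rule stated above: done flag set, or phase equal to "done".
function isEpisodeDone(obs: ObservationLike): boolean {
  return obs.done === true || obs.phase === "done";
}
```

A discriminated union on `action_type` lets the compiler reject, for example, an explore action that carries `code`.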

	## LLM logic (port of `inference.py` to TypeScript)

	Reimplement as pure TS in `src/server/llm/`:

	- `buildExplorePrompt(obs, accumulatedContext)`
	- `buildGeneratePrompt(obs, accumulatedContext)`
	- `buildRepairPrompt(obs, lastCode, lastErrors)`
	- `parseExploreResponse(text)` → `{ tool, query, intent }` or `"SKIP"`
	- `parseGenerateResponse(text)` → `{ format, code, narration }`
	- `callLLM(messages)` → OpenAI-compatible `POST {API_BASE_URL}/chat/completions` with `Authorization: Bearer ${HF_TOKEN}` and `model: MODEL_NAME`.
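
As one possible shape for the transport half of `callLLM`, the sketch below builds the OpenAI-compatible request and pulls the assistant text out of the response. `buildChatRequest`, `extractContent`, and `HttpRequestSpec` are hypothetical helpers; the actual HTTP call (for example via `fetch`) is deliberately left out.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Plain description of the request, so it can be handed to any HTTP client.
interface HttpRequestSpec {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildChatRequest(
  apiBaseUrl: string,
  hfToken: string,
  modelName: string,
  messages: ChatMessage[],
): HttpRequestSpec {
  // Strip trailing slashes so the path is not doubled.
  const base = apiBaseUrl.replace(/\/+$/, "");
  return {
    url: `${base}/chat/completions`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${hfToken}`,
    },
    body: JSON.stringify({ model: modelName, messages }),
  };
}

// Reads choices[0].message.content from an OpenAI-style chat completion payload.
function extractContent(payload: unknown): string {
  const typed = payload as
    | { choices?: Array<{ message?: { content?: unknown } }> }
    | null;
  const content = typed?.choices?.[0]?.message?.content;
  if (typeof content !== "string") {
    throw new Error("Chat completion payload has no text content");
  }
  return content;
}
```

Separating request construction from the network call keeps the token handling testable without hitting the router.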

	Phase routing inside `runStep`:

	- `phase === "explore"` → explore prompt; on `SKIP`, force a generate step instead.
	- `phase === "generate"` → generate prompt; if the env then returns `phase === "repair"`, surface `last_errors` in the UI.
	- `phase === "repair"` → repair prompt seeded with `last_errors` + previous code.
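
The routing rules above might be expressed as two small pure functions. This is a sketch under the assumption that "force a generate step" means falling through to the generate prompt within the same step; `promptKindForPhase` and `resolveExplore` are hypothetical names.

```typescript
type Phase = "explore" | "generate" | "repair" | "done";
type PromptKind = "explore" | "generate" | "repair";
type ExploreParse = { tool: string; query: string; intent: string } | "SKIP";

type ExploreDecision =
  | { next: "submit-explore"; tool: string; query: string; intent: string }
  | { next: "force-generate" };

// Picks which prompt builder to use for the current phase; null once the episode is done.
function promptKindForPhase(phase: Phase): PromptKind | null {
  switch (phase) {
    case "explore":
      return "explore";
    case "generate":
      return "generate";
    case "repair":
      return "repair";
    case "done":
      return null;
  }
}

// Decides what to do with a parsed explore reply: submit it, or skip ahead to generate.
function resolveExplore(parsed: ExploreParse): ExploreDecision {
  if (parsed === "SKIP") {
    return { next: "force-generate" };
  }
  return { next: "submit-explore", ...parsed };
}
```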

	## Episode state (single React store, e.g. Zustand)

	```text
	{
	  sessionId, episodeId,
	  envUrl, modelName,
	  obs,                 // latest ExplainerObservation
	  phase, step, done, score, status,
	  task: { topic, tier, difficulty, keywords, content, dataAvailable },
	  research: { exploredContext, topChunks: [], lastSearchResults },
	  generation: { lastFormat, lastCode, lastNarration, generatedRaw, parsedAction },
	  rewards: [],         // per-step total
	  rewardDetails: [],   // per-step component breakdown
	  log: [],             // [START]/[LLM]/[STEP]/[END]/[WARN]/[ERROR] entries
	  autoRunning: false
	}
	```

	A "session" reset just generates a new `episodeId` and calls `/reset`. The dashboard keeps no long-lived server-side handle: each `envStep` call is an independent request, and the env tracks episode state internally per `episode_id`.

	## Task bank

	Port `ALL_TASKS` from the Python `task_bank` to a TS constant `TASKS = [{ topic, difficulty, tier }, ...]`. The dropdown shows `(random)` plus one `topic [difficulty, tier]` entry per task. Picking a task passes `topic` in the `/reset` request body if the env supports it; otherwise it is stored as a target hint and the env's own random topic is used (support is detected from the schema at build time).
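
A minimal sketch of the dropdown labelling described above. The `Task` field types are assumed to be plain strings, the helper names are hypothetical, and the sample task in the usage note is a placeholder rather than an entry from the real task bank.

```typescript
// Assumption: difficulty and tier are plain strings in the ported task bank.
interface Task {
  topic: string;
  difficulty: string;
  tier: string;
}

const RANDOM_OPTION = "(random)";

// Formats one dropdown entry as `topic [difficulty, tier]`.
function taskLabel(task: Task): string {
  return `${task.topic} [${task.difficulty}, ${task.tier}]`;
}

// The dropdown always starts with the random option, followed by one label per task.
function dropdownOptions(tasks: readonly Task[]): string[] {
  return [RANDOM_OPTION, ...tasks.map(taskLabel)];
}

// True when a specific topic was requested but the env returned a different one.
function topicMismatch(requested: string | null, returned: string): boolean {
  return requested !== null && requested !== returned;
}
```

For a placeholder task `{ topic: "Gradient descent", difficulty: "easy", tier: "beginner" }`, `taskLabel` yields `Gradient descent [easy, beginner]`.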

	## UI (single page at `/`)

	Inspector layout, dark theme, monospace accents — not a marketing page.

	```text
	┌─ Header ─────────────────────────────────────────────────────────┐
	│ Topic · Tier/Difficulty · Phase badge · Step n · Score · Status  │
	│ env=ENV_BASE_URL  model=MODEL_NAME                               │
	├─ Controls ───────────────────────────────────────────────────────┤
	│ [Task ▼ (random)…] [Reset Episode] [Next Step] [Auto Run ▶/■]    │
	├─ Left column ───────────────────┬─ Right column ─────────────────┤
	│ Observation                     │ LLM panel                      │
	│ • topic / content (collapsed)   │ • raw response                 │
	│ • keywords, data_available      │ • parsed action (JSON)         │
	│ • feedback (latest)             │ • generated code (syntax hl.)  │
	│                                 │                                │
	│ Research                        │ Rewards                        │
	│ • last search_results           │ • per-step total summary       │
	│ • Top 5 chunks table            │ • component breakdown table    │
	│   (rank, source, title,         │                                │
	│    score, url, snippet)         │                                │
	├─────────────────────────────────┴────────────────────────────────┤
	│ Timeline / log (scrollable, color-coded by tag)                  │
	└──────────────────────────────────────────────────────────────────┘
	```

	Shadcn primitives only: Card, Table, Badge, Button, Select, ScrollArea, Tabs (for code/raw/parsed), Separator. Code blocks use a lightweight highlighter.

	## Behavior

	- Reset Episode: clears state, calls `envReset`, populates task fields from observation, logs `[START]`.
	- Next Step: reads `phase`, calls `llmCall` for that phase, logs `[LLM]` with raw + parsed, calls `envStep`, merges observation, appends reward + components, logs `[STEP]`. Disabled when `done`.
	- Auto Run: loops Next Step with a small delay until `done`, repair attempts exhausted, or user hits Stop. Logs `[END]` with success / score / rewards.
	- Errors from the env or LLM go into the log as `[WARN]` / `[ERROR]` and surface as a toast; the run halts but state is preserved.
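
The Auto Run behavior could be sketched as a loop over an injected `nextStep` callback. The `maxSteps` safety cap, the default delay, and the names `autoRun`, `StepOutcome`, and `StopReason` are assumptions for illustration, not part of the plan.

```typescript
type StopReason = "done" | "repairs-exhausted" | "stopped" | "error" | "max-steps";

// Minimal view of what one Next Step call reports back to the loop.
interface StepOutcome {
  done: boolean;
  phase: string;
  repairAttemptsLeft: number;
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });
}

// Repeats Next Step until the episode ends, repairs run out, the user stops, or a step fails.
async function autoRun(
  nextStep: () => Promise<StepOutcome>,
  shouldStop: () => boolean,
  delayMs: number = 250, // assumed "small delay"
  maxSteps: number = 50, // assumed safety cap, not in the plan
): Promise<{ steps: number; reason: StopReason }> {
  let steps = 0;
  while (steps < maxSteps) {
    if (shouldStop()) return { steps, reason: "stopped" };
    let outcome: StepOutcome;
    try {
      outcome = await nextStep();
    } catch {
      // Halt on error; the caller keeps whatever state it already has.
      return { steps, reason: "error" };
    }
    steps += 1;
    if (outcome.done) return { steps, reason: "done" };
    if (outcome.phase === "repair" && outcome.repairAttemptsLeft <= 0) {
      return { steps, reason: "repairs-exhausted" };
    }
    await sleep(delayMs);
  }
  return { steps, reason: "max-steps" };
}
```

Returning a reason instead of throwing lets the caller log a single `[END]` entry for every way the loop can stop.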

	## Reward handling

	Port the Python helpers:

	- `rewardComponents(obsMetadata, feedback)` → filtered numeric components (uses the explore/generate/repair allow-lists).
	- `parseRewardComponentsFromFeedback(feedback)` as a fallback for old observations.
	- Total per phase: `explore_total | generate_total | repair_total` if present, else sum of visible components.
	- Final episode score: `normalized_episode_score(rewards)` ported as a simple TS function (mean of generate+repair totals, clamped to [0,1]); `success = score >= SUCCESS_SCORE_THRESHOLD` (constant ported from `explainer_env/constants.py`; default 0.6, to be confirmed during implementation).
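
The totals and score rules above might look like this in TypeScript. It is a sketch of the default formula stated in the plan (mean of generate and repair totals, clamped to [0,1], threshold 0.6), not the confirmed Python implementation; returning 0 for an episode with no generate or repair step is an assumption.

```typescript
// Plan default; the real value lives in explainer_env/constants.py.
const SUCCESS_SCORE_THRESHOLD = 0.6;

type RewardPhase = "explore" | "generate" | "repair";

interface StepReward {
  phase: RewardPhase;
  total: number;
}

// Uses `<phase>_total` when present, otherwise sums the visible components.
function phaseTotal(phase: RewardPhase, components: Record<string, number>): number {
  const explicit = components[`${phase}_total`];
  if (explicit !== undefined) return explicit;
  return Object.keys(components).reduce((sum, key) => sum + (components[key] ?? 0), 0);
}

// Mean of generate and repair totals, clamped to [0, 1].
function normalizedEpisodeScore(rewards: StepReward[]): number {
  const scored = rewards.filter((r) => r.phase !== "explore").map((r) => r.total);
  if (scored.length === 0) return 0; // assumption: no scored steps means score 0
  const mean = scored.reduce((a, b) => a + b, 0) / scored.length;
  return Math.min(1, Math.max(0, mean));
}

function isSuccess(score: number): boolean {
  return score >= SUCCESS_SCORE_THRESHOLD;
}
```

For example, a generate total of 0.5 followed by a repair total of 0.9 gives a mean of 0.7, which clears the 0.6 threshold; explore totals do not enter the mean.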

	## Technical details

	- Files added:
	  - `src/routes/index.tsx`: dashboard page, replaces placeholder.
	  - `src/server/env.functions.ts`: `envReset`, `envStep`, `envMetadata`, `envSchema` server fns calling `${process.env.ENV_BASE_URL}`.
	  - `src/server/llm/prompts.ts`, `src/server/llm/parse.ts`, `src/server/llm/client.ts`: port of `inference.py` prompt/parse/call.
	  - `src/server/llm.functions.ts`: `llmCall({ phase, obs, prior })` server fn.
	  - `src/server/config.functions.ts`: `getRuntimeConfig()` returning `{ envUrl, modelName }` (no secrets).
	  - `src/lib/tasks.ts`: ported task bank.
	  - `src/lib/rewards.ts`: reward parsing/normalization.
	  - `src/lib/types.ts`: `ExplainerObservation`, `ExplainerAction`, etc.
	  - `src/store/episode.ts`: Zustand store.
	  - `src/components/inspector/*`: Header, Controls, ObservationPanel, LlmPanel, ResearchPanel, RewardsPanel, Log.
	- Secret: `HF_TOKEN` added via Lovable Cloud secrets after plan approval. The user will be prompted to paste it.
	- Stack stays TanStack Start + React + Tailwind + shadcn. No new heavy deps; add only `zustand` and a small syntax highlighter (`shiki` or `highlight.js`) — pick the smaller at implementation time.
	- Out of scope: persistence across reloads, multi-user sessions, auth, charts. Pure single-tab inspector.

	## Open items to resolve during implementation

	- Confirm whether `/reset` accepts a `topic` argument; if not, we still show the task picker but treat it as a display filter and surface a warning when the returned topic differs.
	- Confirm exact `SUCCESS_SCORE_THRESHOLD` and `normalized_episode_score` formula from `explainer_env/constants.py` (you can paste it, otherwise default to mean-of-totals ≥ 0.6).