## Goal
Rebuild the Gradio `dashboard.py` as a browser-based inspector for the Explainer OpenEnv at `https://kgdrathan-explainer-env.hf.space`. No Python, no Gradio, no `config.yaml`, no environment/provider/model dropdowns. The dashboard runs episodes (reset → explore → generate → repair → done), shows everything, and supports single-step and auto-run.
## Runtime config (server-side only, no UI selectors)
Environment variables (read inside server functions, never in the browser bundle):
- `ENV_BASE_URL`: default `https://kgdrathan-explainer-env.hf.space`
- `API_BASE_URL`: default `https://router.huggingface.co/v1`
- `HF_TOKEN`: required, stored as a Lovable Cloud secret
- `MODEL_NAME`: default `Qwen/Qwen2.5-72B-Instruct`
The dashboard shows `ENV_BASE_URL` and `MODEL_NAME` as a small read-only metadata strip (not editable).
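
As a rough illustration of the rules above, the sketch below reads the four variables from an injected environment map, applies the listed defaults, and exposes only the two non-secret fields. The names `loadRuntimeConfig` and `toPublicConfig` are hypothetical, and passing the environment in as a plain object (rather than reading `process.env` directly) is an assumption made to keep the example self-contained.

```typescript
// Sketch only: function and type names are illustrative, not from the plan.
type Env = Record<string, string | undefined>;

interface RuntimeConfig {
  envBaseUrl: string;
  apiBaseUrl: string;
  hfToken: string;
  modelName: string;
}

// Defaults copied from the environment variable list above.
const DEFAULTS = {
  ENV_BASE_URL: "https://kgdrathan-explainer-env.hf.space",
  API_BASE_URL: "https://router.huggingface.co/v1",
  MODEL_NAME: "Qwen/Qwen2.5-72B-Instruct",
};

// Server-side only: fails fast when the required secret is missing.
function loadRuntimeConfig(env: Env): RuntimeConfig {
  const hfToken = env.HF_TOKEN;
  if (!hfToken) {
    throw new Error("HF_TOKEN is required but not set");
  }
  return {
    envBaseUrl: env.ENV_BASE_URL || DEFAULTS.ENV_BASE_URL,
    apiBaseUrl: env.API_BASE_URL || DEFAULTS.API_BASE_URL,
    hfToken,
    modelName: env.MODEL_NAME || DEFAULTS.MODEL_NAME,
  };
}

// Only these two fields are sent to the browser for the metadata strip.
function toPublicConfig(config: RuntimeConfig): { envUrl: string; modelName: string } {
  return { envUrl: config.envBaseUrl, modelName: config.modelName };
}
```

In a server function this would presumably be called as `loadRuntimeConfig(process.env)`, with `toPublicConfig` backing the read-only strip.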
## Architecture
```text
Browser (React inspector)
   │  useServerFn
   ▼
TanStack server functions
   ├─ envReset({ seed?, episode_id? })
   ├─ envStep({ action })              ── proxies POST /step on ENV_BASE_URL
   ├─ envSchema()                      ── GET /schema (cached)
   └─ llmCall({ phase, obs, prior })   ── builds prompt, calls API_BASE_URL with HF_TOKEN,
                                          returns { raw, parsed action }
```
All env and LLM HTTP traffic goes through server functions. The browser never sees `HF_TOKEN`. CORS is irrelevant because calls are same-origin RPC.
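
One way to picture the proxying described above is a small helper that turns a server function call into the outbound HTTP request for the env. This is a sketch under assumptions: `buildEnvRequest` and `OutboundRequest` are hypothetical names, and the choice to attach no `Authorization` header to env calls is assumed rather than taken from the plan.

```typescript
// Hypothetical shape of the request a server function would hand to fetch().
interface OutboundRequest {
  url: string;
  method: "GET" | "POST";
  headers: Record<string, string>;
  body?: string;
}

type EnvPath = "/reset" | "/step" | "/schema" | "/metadata";

// Builds the request for one env endpoint; runs server-side only.
function buildEnvRequest(envBaseUrl: string, path: EnvPath, payload?: unknown): OutboundRequest {
  // Strip trailing slashes so the joined URL never contains "//".
  const url = envBaseUrl.replace(/\/+$/, "") + path;
  if (path === "/schema" || path === "/metadata") {
    return { url, method: "GET", headers: { Accept: "application/json" } };
  }
  return {
    url,
    method: "POST",
    headers: { "Content-Type": "application/json", Accept: "application/json" },
    // An empty object is sent when no payload is given (assumed acceptable for /reset).
    body: JSON.stringify(payload === undefined ? {} : payload),
  };
}
```

Because the browser only invokes the server function, the env URL and any credentials stay on the server, and the browser never makes a cross-origin request.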
## Explainer env contract (verified from `/schema`)
- POST `/reset` → `{ observation, done }` where `observation` is an `ExplainerObservation` (topic, content, tier, keywords, phase, feedback, search_results, top_chunks, explored_context, explore_steps_left, repair_attempts_left, last_errors, available_tools, metadata, reward).
- POST `/step` body: `{ action: ExplainerAction }`. Action shape:
- `action_type`: `"explore" | "generate" | "repair"`
- explore: `tool` (one of search_wikipedia / search_hf_papers / search_arxiv / search_scholar / fetch_docs / search_hf_hub), `query`, `intent`
- generate / repair: `format` (`"marimo" | "manim"`), `code`, `narration`, `repair_notes` (repair only)
- GET `/metadata`, GET `/schema` for header info and tool list.
The episode is "done" when `observation.done === true` or `phase === "done"`.
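
The action shape above maps naturally onto a discriminated union. The sketch below is one possible typing, not the env's published schema: which fields are required versus optional is an assumption (here `repair_notes` is treated as optional), and only the observation fields needed for the done check are typed.

```typescript
type ExploreTool =
  | "search_wikipedia"
  | "search_hf_papers"
  | "search_arxiv"
  | "search_scholar"
  | "fetch_docs"
  | "search_hf_hub";

type OutputFormat = "marimo" | "manim";

// Discriminated on action_type, mirroring the contract listed above.
type ExplainerAction =
  | { action_type: "explore"; tool: ExploreTool; query: string; intent: string }
  | { action_type: "generate"; format: OutputFormat; code: string; narration: string }
  | {
      action_type: "repair";
      format: OutputFormat;
      code: string;
      narration: string;
      repair_notes?: string; // assumed optional
    };

// Minimal view of the observation: just what the done check needs.
interface ObservationLike {
  phase: string;
  done?: boolean;
}

// Done when the flag is set or the phase has reached "done".
function isEpisodeDone(observation: ObservationLike): boolean {
  return observation.done === true || observation.phase === "done";
}

// Serializes the POST /step body as { action: ... }.
function stepBody(action: ExplainerAction): string {
  return JSON.stringify({ action });
}
```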
## LLM logic (port of `inference.py` to TypeScript)
Reimplement as pure TS in `src/server/llm/`:
- `buildExplorePrompt(obs, accumulatedContext)`
- `buildGeneratePrompt(obs, accumulatedContext)`
- `buildRepairPrompt(obs, lastCode, lastErrors)`
- `parseExploreResponse(text)` → `{ tool, query, intent }` or `"SKIP"`
- `parseGenerateResponse(text)` → `{ format, code, narration }`
- `callLLM(messages)`: OpenAI-compatible `POST {API_BASE_URL}/chat/completions` with `Authorization: Bearer ${HF_TOKEN}` and `model: MODEL_NAME`.
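
The request that `callLLM` sends could be assembled roughly as follows. This is a sketch, not the port of `inference.py`: `buildChatRequest`, `extractCompletionText`, and `LlmSettings` are hypothetical names, and sampling parameters such as temperature are left out because the plan does not specify them. The response path `choices[0].message.content` follows the OpenAI-compatible chat completions format.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface LlmSettings {
  apiBaseUrl: string;
  hfToken: string;
  modelName: string;
}

// Builds the POST {API_BASE_URL}/chat/completions request; server-side only.
function buildChatRequest(settings: LlmSettings, messages: ChatMessage[]) {
  return {
    url: settings.apiBaseUrl.replace(/\/+$/, "") + "/chat/completions",
    method: "POST" as const,
    headers: {
      "Content-Type": "application/json",
      Authorization: "Bearer " + settings.hfToken,
    },
    body: JSON.stringify({ model: settings.modelName, messages }),
  };
}

// Pulls the assistant text out of a parsed response; returns "" if the shape is unexpected.
function extractCompletionText(response: unknown): string {
  const typed = response as { choices?: Array<{ message?: { content?: unknown } }> } | null;
  const first = typed && typed.choices ? typed.choices[0] : undefined;
  const content = first && first.message ? first.message.content : undefined;
  return typeof content === "string" ? content : "";
}
```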
Phase routing inside `runStep`:
- `phase === "explore"` → explore prompt; on `SKIP`, force a generate step instead.
- `phase === "generate"` → generate prompt; if env returns `phase === "repair"`, surface errors.
- `phase === "repair"` → repair prompt seeded with `last_errors` + previous code.
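
The routing rules above reduce to two small decisions: which prompt to build for the current phase, and what to do with the explore parse result. The sketch below separates them; `promptForPhase` and `afterExploreParse` are hypothetical names, and returning `null` for the `done` phase is an assumption.

```typescript
type Phase = "explore" | "generate" | "repair" | "done";
type PromptKind = "explore" | "generate" | "repair";
type ExploreParse = { tool: string; query: string; intent: string } | "SKIP";

// Picks the prompt builder for the current phase; null means nothing left to do.
function promptForPhase(phase: Phase): PromptKind | null {
  switch (phase) {
    case "explore":
      return "explore";
    case "generate":
      return "generate";
    case "repair":
      return "repair";
    default:
      return null;
  }
}

type ExploreDecision =
  | { actionType: "explore"; tool: string; query: string; intent: string }
  | { actionType: "generate" };

// On SKIP the step is converted into a generate step instead of an explore action.
function afterExploreParse(parsed: ExploreParse): ExploreDecision {
  if (parsed === "SKIP") {
    return { actionType: "generate" };
  }
  return { actionType: "explore", tool: parsed.tool, query: parsed.query, intent: parsed.intent };
}
```

When `afterExploreParse` returns `generate`, `runStep` would then build the generate prompt and call the LLM a second time before stepping the env.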
## Episode state (single React store, e.g. Zustand)
```text
{
sessionId, episodeId,
envUrl, modelName,
obs, // latest ExplainerObservation
phase, step, done, score, status,
task: { topic, tier, difficulty, keywords, content, dataAvailable },
research: { exploredContext, topChunks: [], lastSearchResults },
generation: { lastFormat, lastCode, lastNarration, generatedRaw, parsedAction },
rewards: [], // per-step total
rewardDetails: [], // per-step component breakdown
log: [], // [START]/[LLM]/[STEP]/[END]/[WARN] entries
autoRunning: false
}
```
A "session" reset just generates a new `episodeId` and calls `/reset`. The dashboard holds no long-lived server-side handle, so each `envStep` call is stateless on the dashboard's side; the env tracks state internally per `episode_id`.
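
To show how a step could be folded into this state, here is a minimal sketch using plain immutable update functions over a reduced subset of the fields. It deliberately avoids Zustand so that it stays self-contained; inside the real store these would presumably run within `set(...)` callbacks. The names `startEpisode` and `applyStep` and the exact log line format are hypothetical.

```typescript
// Reduced subset of the store shape above, enough to show the update pattern.
interface EpisodeProgress {
  episodeId: string;
  phase: string;
  step: number;
  done: boolean;
  rewards: number[];
  log: string[];
}

interface StepOutcome {
  phase: string;
  done: boolean;
  reward: number;
}

// Fresh state for a new episode; the caller supplies the new episode id.
function startEpisode(episodeId: string, initialPhase: string): EpisodeProgress {
  return {
    episodeId,
    phase: initialPhase,
    step: 0,
    done: false,
    rewards: [],
    log: ["[START] episode=" + episodeId],
  };
}

// Returns a new state object; the previous state is left untouched.
function applyStep(state: EpisodeProgress, outcome: StepOutcome): EpisodeProgress {
  const step = state.step + 1;
  return {
    episodeId: state.episodeId,
    phase: outcome.phase,
    step,
    done: outcome.done || outcome.phase === "done",
    rewards: state.rewards.concat(outcome.reward),
    log: state.log.concat("[STEP] step=" + step + " phase=" + outcome.phase + " reward=" + outcome.reward),
  };
}
```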
## Task bank
Port `ALL_TASKS` from the Python `task_bank` to a TS constant `TASKS = [{ topic, difficulty, tier }, ...]`. Dropdown shows `(random)` plus `topic [difficulty, tier]`. Picking a task passes `topic` to `/reset` via the action body if the env supports it; otherwise it's stored as a target hint and the env's own random topic is used (we'll detect from the schema at build time).
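
A small sketch of the dropdown labels and of a check for the case where the env ignores the requested topic. The entries in `SAMPLE_TASKS` are placeholders, not values from the Python `ALL_TASKS`; `taskLabel`, `taskOptions`, and `topicMismatchWarning` are hypothetical names, and the case-insensitive comparison is an assumption.

```typescript
interface Task {
  topic: string;
  difficulty: string;
  tier: string;
}

// Placeholder entries only: the real list is ported from the Python task bank.
const SAMPLE_TASKS: Task[] = [
  { topic: "Example topic A", difficulty: "easy", tier: "beginner" },
  { topic: "Example topic B", difficulty: "hard", tier: "advanced" },
];

const RANDOM_OPTION = "(random)";

// Formats one entry as `topic [difficulty, tier]`.
function taskLabel(task: Task): string {
  return task.topic + " [" + task.difficulty + ", " + task.tier + "]";
}

// Dropdown options: "(random)" first, then one label per task.
function taskOptions(tasks: Task[]): string[] {
  return [RANDOM_OPTION].concat(tasks.map(taskLabel));
}

// Returns a warning when a specific topic was requested but the env returned another one.
function topicMismatchWarning(requested: string | null, returned: string): string | null {
  if (requested === null) return null; // "(random)" was selected
  if (requested.trim().toLowerCase() === returned.trim().toLowerCase()) return null;
  return "[WARN] requested topic \"" + requested + "\" but env returned \"" + returned + "\"";
}
```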
## UI (single page at `/`)
Inspector layout, dark theme, monospace accents; not a marketing page.
```text
┌─ Header ─────────────────────────────────────────────────────────┐
│ Topic · Tier/Difficulty · Phase badge · Step n · Score · Status  │
│ env=ENV_BASE_URL  model=MODEL_NAME                               │
├─ Controls ───────────────────────────────────────────────────────┤
│ [Task ▼ (random)…] [Reset Episode] [Next Step] [Auto Run ▶/■]    │
├─ Left column ──────────────────┬─ Right column ──────────────────┤
│ Observation                    │ LLM panel                       │
│ • topic / content (collapsed)  │ • raw response                  │
│ • keywords, data_available     │ • parsed action (JSON)          │
│ • feedback (latest)            │ • generated code (syntax hl.)   │
│                                │                                 │
│ Research                       │ Rewards                         │
│ • last search_results          │ • per-step total summary        │
│ • Top 5 chunks table           │ • component breakdown table     │
│   (rank, source, title,        │                                 │
│    score, url, snippet)        │                                 │
├────────────────────────────────┴─────────────────────────────────┤
│ Timeline / log (scrollable, color-coded by tag)                  │
└──────────────────────────────────────────────────────────────────┘
```
Shadcn primitives only: Card, Table, Badge, Button, Select, ScrollArea, Tabs (for code/raw/parsed), Separator. Code blocks use a lightweight highlighter.
## Behavior
- **Reset Episode**: clears state, calls `envReset`, populates task fields from observation, logs `[START]`.
- **Next Step**: reads `phase`, calls `llmCall` for that phase, logs `[LLM]` with raw + parsed, calls `envStep`, merges observation, appends reward + components, logs `[STEP]`. Disabled when `done`.
- **Auto Run**: loops Next Step with a small delay until `done`, repair attempts exhausted, or user hits Stop. Logs `[END]` with success / score / rewards.
- Errors from the env or LLM go into the log as `[WARN]` / `[ERROR]` and surface as a toast; the run halts but state is preserved.
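
The Auto Run loop described above might look like the following sketch. Several details are assumptions: "repair attempts exhausted" is interpreted as being in the `repair` phase with none left, the `maxSteps` cap is an added safety limit that the plan does not mention, and `sleep` is injected so the helper has no dependency on timers. All names are hypothetical.

```typescript
interface RunState {
  done: boolean;
  phase: string;
  repairAttemptsLeft: number;
  stopRequested: boolean; // set when the user hits Stop
}

// Pure stop condition, checked before every step.
function shouldContinueAutoRun(state: RunState): boolean {
  if (state.done || state.stopRequested) return false;
  // Assumed meaning of "repair attempts exhausted".
  if (state.phase === "repair" && state.repairAttemptsLeft <= 0) return false;
  return true;
}

// Repeats Next Step with a small delay; returns the number of completed steps.
async function autoRun(
  getState: () => RunState,
  nextStep: () => Promise<void>,
  sleep: (ms: number) => Promise<void>,
  delayMs: number = 500,
  maxSteps: number = 50, // safety cap, not part of the plan
): Promise<number> {
  let steps = 0;
  while (steps < maxSteps && shouldContinueAutoRun(getState())) {
    try {
      await nextStep();
    } catch (error) {
      // Halt on env or LLM errors; nothing is reset, so state is preserved.
      break;
    }
    steps += 1;
    await sleep(delayMs);
  }
  return steps;
}
```

Reading the state through `getState()` on every iteration, rather than capturing it once, is what lets a Stop click take effect between steps.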
## Reward handling
Port the Python helpers:
- `rewardComponents(obsMetadata, feedback)` → filtered numeric components (uses the explore/generate/repair allow-lists).
- `parseRewardComponentsFromFeedback(feedback)` as a fallback for old observations.
- Total per phase: `explore_total | generate_total | repair_total` if present, else sum of visible components.
- Final episode score: `normalized_episode_score(rewards)` ported as a simple TS function (mean of generate+repair totals, clamped to [0,1]); `success = score >= SUCCESS_SCORE_THRESHOLD` (constant ported from `explainer_env/constants.py`, default 0.6, to be confirmed during implementation).
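
The totals and the final score described above can be sketched as below. This encodes the plan's stated default (mean of generate and repair totals, clamped to [0, 1], threshold 0.6), which is still subject to confirmation against the Python source. Returning 0 for an episode with no generate or repair steps is an assumption, and the camelCase function names are hypothetical.

```typescript
type RewardPhase = "explore" | "generate" | "repair";

// Default from the plan; not yet confirmed against explainer_env/constants.py.
const SUCCESS_SCORE_THRESHOLD = 0.6;

// Uses `<phase>_total` when present, else the sum of the visible components.
function stepTotal(phase: RewardPhase, components: Record<string, number>): number {
  const explicit = components[phase + "_total"];
  if (typeof explicit === "number") return explicit;
  let sum = 0;
  for (const name of Object.keys(components)) {
    const value = components[name];
    if (typeof value === "number") sum += value;
  }
  return sum;
}

// Mean of the generate and repair totals, clamped to [0, 1].
function normalizedEpisodeScore(generateAndRepairTotals: number[]): number {
  if (generateAndRepairTotals.length === 0) return 0; // assumed behavior for empty input
  let sum = 0;
  for (const total of generateAndRepairTotals) sum += total;
  const mean = sum / generateAndRepairTotals.length;
  return Math.min(1, Math.max(0, mean));
}

function isSuccess(score: number): boolean {
  return score >= SUCCESS_SCORE_THRESHOLD;
}
```

For example, totals of 0.5 and 1.0 give a mean of 0.75, which clears the 0.6 threshold and counts as a success.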
## Technical details
- **Files added**:
- `src/routes/index.tsx`: dashboard page, replaces placeholder.
- `src/server/env.functions.ts`: `envReset`, `envStep`, `envMetadata`, `envSchema` server fns calling `${process.env.ENV_BASE_URL}`.
- `src/server/llm/prompts.ts`, `src/server/llm/parse.ts`, `src/server/llm/client.ts`: port of inference.py prompt/parse/call.
- `src/server/llm.functions.ts`: `runLlmStep({ phase, obs, prior })` server fn.
- `src/server/config.functions.ts`: `getRuntimeConfig()` returning `{ envUrl, modelName }` (no secrets).
- `src/lib/tasks.ts`: ported task bank.
- `src/lib/rewards.ts`: reward parsing/normalization.
- `src/lib/types.ts`: `ExplainerObservation`, `ExplainerAction`, etc.
- `src/store/episode.ts`: Zustand store.
- `src/components/inspector/*`: Header, Controls, ObservationPanel, LlmPanel, ResearchPanel, RewardsPanel, Log.
- **Secret**: `HF_TOKEN` added via Lovable Cloud secrets after plan approval. The user will be prompted to paste it.
- **Stack stays** TanStack Start + React + Tailwind + shadcn. No new heavy deps; add only `zustand` and a small syntax highlighter (`shiki` or `highlight.js`), picking the smaller at implementation time.
- **Out of scope**: persistence across reloads, multi-user sessions, auth, charts. Pure single-tab inspector.
## Open items resolved during implementation
- Confirm whether `/reset` accepts a `topic` argument; if not, we still show the task picker but treat it as a display filter and surface a warning when the returned topic differs.
- Confirm exact `SUCCESS_SCORE_THRESHOLD` and `normalized_episode_score` formula from `explainer_env/constants.py` (you can paste it, otherwise default to mean-of-totals ≥ 0.6).