## Goal

Rebuild the Gradio `dashboard.py` as a browser-based inspector for the Explainer OpenEnv at `https://kgdrathan-explainer-env.hf.space`. The rebuild uses no Python, no Gradio, no `config.yaml`, and no environment/provider/model dropdowns. The dashboard runs episodes (reset → explore → generate → repair → done), displays every observation, LLM response, and reward breakdown, and supports both single-step and auto-run modes.

## Runtime config (server-side only, no UI selectors)

Environment variables (read inside server functions, never in the browser bundle):

- `ENV_BASE_URL` — default `https://kgdrathan-explainer-env.hf.space`
- `API_BASE_URL` — default `https://router.huggingface.co/v1`
- `HF_TOKEN` — required, stored as a Lovable Cloud secret
- `MODEL_NAME` — default `Qwen/Qwen2.5-72B-Instruct`

The dashboard shows `ENV_BASE_URL` and `MODEL_NAME` as a small read-only metadata strip (not editable).
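The defaults above could be resolved in one place on the server. The following is a minimal sketch, not the final implementation: `loadRuntimeConfig`, `publicConfig`, and the `EnvMap` type are hypothetical names, and the env map is passed in as a parameter (on the server it would be `process.env`) so the helper stays easy to test.

```typescript
// Hypothetical helper names; only the variable names and defaults come from the plan.
interface RuntimeConfig {
  envBaseUrl: string;
  apiBaseUrl: string;
  hfToken: string;
  modelName: string;
}

type EnvMap = Record<string, string | undefined>;

// Resolve config from an env-like map, applying the documented defaults.
function loadRuntimeConfig(env: EnvMap): RuntimeConfig {
  const hfToken = env.HF_TOKEN;
  if (!hfToken) {
    // HF_TOKEN has no default: fail early instead of sending unauthenticated calls.
    throw new Error("HF_TOKEN is required but not set");
  }
  return {
    envBaseUrl: env.ENV_BASE_URL || "https://kgdrathan-explainer-env.hf.space",
    apiBaseUrl: env.API_BASE_URL || "https://router.huggingface.co/v1",
    hfToken,
    modelName: env.MODEL_NAME || "Qwen/Qwen2.5-72B-Instruct",
  };
}

// Only the non-secret fields are returned to the UI for the metadata strip.
function publicConfig(cfg: RuntimeConfig): { envUrl: string; modelName: string } {
  return { envUrl: cfg.envBaseUrl, modelName: cfg.modelName };
}
```

Keeping `publicConfig` separate makes it harder to leak `hfToken` into a server function's return value by accident.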

## Architecture

```text
Browser (React inspector)
        │  useServerFn
        ▼
TanStack server functions
  ├─ envReset({ seed?, episode_id? })
  ├─ envStep({ action })          ── proxies POST /step on ENV_BASE_URL
  ├─ envSchema()                  ── GET /schema (cached)
  └─ llmCall({ phase, obs, prior })── builds prompt, calls API_BASE_URL with HF_TOKEN,
                                       returns { raw, parsed action }
```

All env and LLM HTTP traffic goes through server functions, so the browser never sees `HF_TOKEN`. CORS does not come into play because the browser only makes same-origin RPC calls.
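The proxying step inside a server function might look like the sketch below. It deliberately leaves out the TanStack server-function wrapper and shows only the outbound call; `proxyEnvPost`, `envUrl`, and the `FetchLike` type are hypothetical names, and the fetch implementation is injected so the helper can be exercised without a network.

```typescript
// Minimal structural type so a fake fetch can be passed in tests.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Join base URL and path without producing a double slash.
function envUrl(baseUrl: string, path: string): string {
  return `${baseUrl.replace(/\/+$/, "")}${path}`;
}

// Server-side proxy: the browser calls a server function, which calls this.
async function proxyEnvPost(
  fetchImpl: FetchLike,
  baseUrl: string,
  path: "/reset" | "/step",
  body: unknown,
): Promise<unknown> {
  const res = await fetchImpl(envUrl(baseUrl, path), {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) {
    // Surface HTTP failures so the caller can log them and halt the run.
    throw new Error(`env ${path} failed with HTTP ${res.status}`);
  }
  return res.json();
}
```

In the server function itself, the global `fetch` would be passed as `fetchImpl` and `ENV_BASE_URL` as `baseUrl`.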

## Explainer env contract (verified from `/schema`)

- POST `/reset` → `{ observation, done }` where `observation` is an `ExplainerObservation` (topic, content, tier, keywords, phase, feedback, search_results, top_chunks, explored_context, explore_steps_left, repair_attempts_left, last_errors, available_tools, metadata, reward).
- POST `/step` body: `{ action: ExplainerAction }`. Action shape:
  - `action_type`: `"explore" | "generate" | "repair"`
  - explore: `tool` (one of search_wikipedia / search_hf_papers / search_arxiv / search_scholar / fetch_docs / search_hf_hub), `query`, `intent`
  - generate / repair: `format` (`"marimo" | "manim"`), `code`, `narration`, `repair_notes` (repair only)
- GET `/metadata`, GET `/schema` for header info and tool list.

The episode is "done" when `observation.done === true` or `phase === "done"`.
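The contract above can be captured as a discriminated union, sketched below under a few assumptions: every listed field is treated as required except `repair_notes`, the top-level `done` flag from `/reset` is also honored on `/step` responses, and `stepBody` and `isEpisodeDone` are hypothetical helper names. The real `/schema` output should take precedence wherever it disagrees.

```typescript
type ExploreTool =
  | "search_wikipedia" | "search_hf_papers" | "search_arxiv"
  | "search_scholar" | "fetch_docs" | "search_hf_hub";
type OutputFormat = "marimo" | "manim";

// One variant per action_type; the compiler rejects e.g. an explore action with code.
type ExplainerAction =
  | { action_type: "explore"; tool: ExploreTool; query: string; intent: string }
  | { action_type: "generate"; format: OutputFormat; code: string; narration: string }
  | {
      action_type: "repair";
      format: OutputFormat;
      code: string;
      narration: string;
      repair_notes?: string; // assumed optional; check /schema
    };

// Only the fields needed for the done check are modelled here.
interface StepResult {
  observation: { phase: string; done?: boolean };
  done?: boolean;
}

// Serialize the POST /step body: { action: ExplainerAction }.
function stepBody(action: ExplainerAction): string {
  return JSON.stringify({ action });
}

// Done when any done flag is set or the phase has reached "done".
function isEpisodeDone(result: StepResult): boolean {
  return (
    result.done === true ||
    result.observation.done === true ||
    result.observation.phase === "done"
  );
}
```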

## LLM logic (port of `inference.py` to TypeScript)

Reimplement as pure TS in `src/server/llm/`:

- `buildExplorePrompt(obs, accumulatedContext)`
- `buildGeneratePrompt(obs, accumulatedContext)`
- `buildRepairPrompt(obs, lastCode, lastErrors)`
- `parseExploreResponse(text)` → `{ tool, query, intent }` or `"SKIP"`
- `parseGenerateResponse(text)` → `{ format, code, narration }`
- `callLLM(messages)` → OpenAI-compatible `POST {API_BASE_URL}/chat/completions` with `Authorization: Bearer ${HF_TOKEN}` and `model: MODEL_NAME`.
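As a rough illustration of the `callLLM` item, the request construction and response extraction can be split into two pure helpers. This is a sketch: `buildChatRequest`, `extractContent`, and `LlmConfig` are hypothetical names. The request and response shapes follow the OpenAI-compatible chat completions format (`model` and `messages` in, `choices[0].message.content` out).

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface LlmConfig {
  apiBaseUrl: string; // API_BASE_URL
  hfToken: string;    // HF_TOKEN, server-side only
  modelName: string;  // MODEL_NAME
}

// Build the URL and fetch init for POST {API_BASE_URL}/chat/completions.
function buildChatRequest(cfg: LlmConfig, messages: ChatMessage[]) {
  return {
    url: `${cfg.apiBaseUrl.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${cfg.hfToken}`,
      },
      body: JSON.stringify({ model: cfg.modelName, messages }),
    },
  };
}

// Pull the assistant text out of a chat completions response payload.
function extractContent(payload: unknown): string {
  if (typeof payload !== "object" || payload === null) {
    throw new Error("LLM response was not a JSON object");
  }
  const choices = (payload as { choices?: Array<{ message?: { content?: unknown } }> }).choices;
  const content = choices?.[0]?.message?.content;
  if (typeof content !== "string") {
    throw new Error("LLM response had no text content");
  }
  return content;
}
```

With these in place, `callLLM` reduces to `fetch(url, init)`, a status check, and `extractContent(await res.json())`.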

Phase routing inside `runStep`:

- `phase === "explore"` → explore prompt; on `SKIP`, force a generate step instead.
- `phase === "generate"` → generate prompt; if env returns `phase === "repair"`, surface errors.
- `phase === "repair"` → repair prompt seeded with `last_errors` + previous code.
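The routing rules above reduce to two small decisions, sketched here with hypothetical function names (`promptForPhase`, `afterExploreParse`). The sketch covers only which prompt to build, not the prompt text or the parsing itself.

```typescript
type Phase = "explore" | "generate" | "repair" | "done";
type PromptKind = "explore" | "generate" | "repair";
type ExploreParse = { tool: string; query: string; intent: string } | "SKIP";

// Which prompt a phase starts with; null means the episode is over.
function promptForPhase(phase: Phase): PromptKind | null {
  return phase === "done" ? null : phase;
}

// After parsing an explore response: SKIP forces a generate step instead,
// otherwise the parsed tool call is sent as an explore action.
function afterExploreParse(parsed: ExploreParse): PromptKind {
  return parsed === "SKIP" ? "generate" : "explore";
}
```

For example, when `phase` is `"explore"` and the model answers `SKIP`, the step builds a generate prompt and submits a generate action within the same Next Step click.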

## Episode state (single React store, e.g. Zustand)

```text
{
  sessionId, episodeId,
  envUrl, modelName,
  obs,                   // latest ExplainerObservation
  phase, step, done, score, status,
  task: { topic, tier, difficulty, keywords, content, dataAvailable },
  research: { exploredContext, topChunks: [], lastSearchResults },
  generation: { lastFormat, lastCode, lastNarration, generatedRaw, parsedAction },
  rewards: [],           // per-step total
  rewardDetails: [],     // per-step component breakdown
  log: [],               // [START]/[LLM]/[STEP]/[END]/[WARN]/[ERROR] entries
  autoRunning: false
}
```

A "session" reset generates a new `episodeId` and calls `/reset`. The dashboard server keeps no long-lived handle, so every `envStep` call is stateless on the dashboard side; the env tracks state internally per `episode_id`.
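The state transitions can be written as pure functions that a Zustand store would wrap. The sketch below models only a subset of the fields listed above; the function names and the log line wording are illustrative assumptions, not the final log format.

```typescript
interface Observation {
  topic: string;
  phase: string;
  done?: boolean;
}

interface EpisodeState {
  episodeId: string | null;
  obs: Observation | null;
  phase: string;
  step: number;
  done: boolean;
  rewards: number[];
  log: string[];
}

// State before any episode has been started.
function initialState(): EpisodeState {
  return { episodeId: null, obs: null, phase: "idle", step: 0, done: false, rewards: [], log: [] };
}

// Reset: discard the previous episode and seed state from the first observation.
function startEpisode(episodeId: string, obs: Observation): EpisodeState {
  return {
    ...initialState(),
    episodeId,
    obs,
    phase: obs.phase,
    log: [`[START] topic=${obs.topic}`],
  };
}

// Merge one step result, returning a new object so React sees the change.
function applyStep(state: EpisodeState, obs: Observation, reward: number): EpisodeState {
  const step = state.step + 1;
  return {
    ...state,
    obs,
    phase: obs.phase,
    step,
    done: obs.done === true || obs.phase === "done",
    rewards: [...state.rewards, reward],
    log: [...state.log, `[STEP] n=${step} phase=${obs.phase} reward=${reward}`],
  };
}
```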

## Task bank

Port `ALL_TASKS` from the Python `task_bank` to a TS constant `TASKS = [{ topic, difficulty, tier }, ...]`. The dropdown shows `(random)` plus one `topic [difficulty, tier]` entry per task. If the env supports it, picking a task passes `topic` in the `/reset` request body; otherwise the selection is stored as a target hint and the env's own random topic is used (support is detected from the schema at build time).
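A sketch of the ported constant and the dropdown labels follows. The two entries are placeholders, not real tasks from `ALL_TASKS`, and `difficulty` and `tier` are typed as plain strings because their actual value sets are not specified here. `taskLabel`, `dropdownOptions`, and `topicMismatch` are hypothetical helpers.

```typescript
interface TaskEntry {
  topic: string;
  difficulty: string;
  tier: string;
}

// Placeholder entries only; the real list is ported from the Python task bank.
const TASKS: TaskEntry[] = [
  { topic: "placeholder topic A", difficulty: "easy", tier: "tier-a" },
  { topic: "placeholder topic B", difficulty: "hard", tier: "tier-b" },
];

const RANDOM_OPTION = "(random)";

// Label format from the plan: topic [difficulty, tier].
function taskLabel(task: TaskEntry): string {
  return `${task.topic} [${task.difficulty}, ${task.tier}]`;
}

// "(random)" first, then one label per task.
function dropdownOptions(tasks: TaskEntry[]): string[] {
  return [RANDOM_OPTION, ...tasks.map(taskLabel)];
}

// True when a specific topic was requested but the env returned another one,
// which is the case where the UI would surface a warning.
function topicMismatch(requested: string | null, returned: string): boolean {
  return requested !== null && requested !== returned;
}
```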

## UI (single page at `/`)

Inspector layout, dark theme, monospace accents — not a marketing page.

```text
┌─ Header ─────────────────────────────────────────────────────────┐
│ Topic · Tier/Difficulty · Phase badge · Step n · Score · Status  │
│ env=ENV_BASE_URL  model=MODEL_NAME                               │
├─ Controls ───────────────────────────────────────────────────────┤
│ [Task ▼ (random)…] [Reset Episode] [Next Step] [Auto Run ▶/■]  │
├─ Left column ─────────────────┬─ Right column ──────────────────┤
│ Observation                    │ LLM panel                       │
│  • topic / content (collapsed) │  • raw response                 │
│  • keywords, data_available    │  • parsed action (JSON)         │
│  • feedback (latest)           │  • generated code (syntax hl.)  │
│                                │                                 │
│ Research                       │ Rewards                         │
│  • last search_results         │  • per-step total summary       │
│  • Top 5 chunks table          │  • component breakdown table    │
│    (rank, source, title,       │                                 │
│     score, url, snippet)       │                                 │
├──────────────────────────────────────────────────────────────────┤
│ Timeline / log (scrollable, color-coded by tag)                  │
└──────────────────────────────────────────────────────────────────┘
```

Shadcn primitives only: Card, Table, Badge, Button, Select, ScrollArea, Tabs (for code/raw/parsed), Separator. Code blocks use a lightweight highlighter.

## Behavior

- **Reset Episode**: clears state, calls `envReset`, populates task fields from observation, logs `[START]`.
- **Next Step**: reads `phase`, calls `llmCall` for that phase, logs `[LLM]` with raw + parsed, calls `envStep`, merges observation, appends reward + components, logs `[STEP]`. Disabled when `done`.
- **Auto Run**: loops Next Step with a small delay until `done`, repair attempts exhausted, or user hits Stop. Logs `[END]` with success / score / rewards.
- Errors from the env or LLM go into the log as `[WARN]` / `[ERROR]` and surface as a toast; the run halts but state is preserved.
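The Auto Run stop conditions can be expressed as a predicate plus a small loop, sketched below. Two points are assumptions rather than part of the plan: "repair attempts exhausted" is read as being in the repair phase with `repairAttemptsLeft` at zero, and the `maxSteps` cap is an added safety net. All names are hypothetical.

```typescript
interface RunState {
  done: boolean;
  phase: string;
  repairAttemptsLeft: number;
}

// Stop when the user asked to, the episode is done, or repairs are exhausted.
function shouldStop(state: RunState, stopRequested: boolean): boolean {
  if (stopRequested || state.done) return true;
  return state.phase === "repair" && state.repairAttemptsLeft <= 0;
}

// Loop Next Step with a delay; returns how many steps were executed.
// If nextStep throws, the error propagates and the loop halts with state untouched.
async function autoRun(
  getState: () => RunState,
  nextStep: () => Promise<void>,
  isStopRequested: () => boolean,
  delayMs = 500,
  maxSteps = 50, // assumed safety cap, not part of the plan
): Promise<number> {
  let steps = 0;
  while (steps < maxSteps && !shouldStop(getState(), isStopRequested())) {
    await nextStep();
    steps += 1;
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
  return steps;
}
```

Reading state through `getState()` on every iteration matters: the loop must see the observation merged by the previous step, not a snapshot taken when Auto Run was clicked.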

## Reward handling

Port the Python helpers:

- `rewardComponents(obsMetadata, feedback)` → filtered numeric components (uses the explore/generate/repair allow-lists).
- `parseRewardComponentsFromFeedback(feedback)` as a fallback for old observations.
- Total per phase: `explore_total | generate_total | repair_total` if present, else sum of visible components.
- Final episode score: `normalized_episode_score(rewards)` ported as a simple TS function (mean of generate+repair totals, clamped to [0,1]); `success = score >= SUCCESS_SCORE_THRESHOLD` (constant ported from `explainer_env/constants.py`; default 0.6, to be confirmed during implementation).
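The total and score rules above could be ported roughly as follows. This sketch uses the default formula stated in the plan (mean of generate and repair totals, clamped to [0,1], threshold 0.6), which still needs to be checked against the Python source. Returning 0 for an episode with no generate or repair steps is an added assumption, and the allow-list filtering of components is omitted.

```typescript
type RewardPhase = "explore" | "generate" | "repair";

interface StepReward {
  phase: RewardPhase;
  total: number;
}

// Default from the plan; the real value lives in explainer_env/constants.py.
const SUCCESS_SCORE_THRESHOLD = 0.6;

// Use `<phase>_total` when present, else sum the visible components.
function phaseTotal(phase: RewardPhase, components: Record<string, number>): number {
  const explicit = components[`${phase}_total`];
  if (typeof explicit === "number") return explicit;
  return Object.values(components).reduce((sum, value) => sum + value, 0);
}

// Mean of generate + repair totals, clamped to [0, 1].
function normalizedEpisodeScore(rewards: StepReward[]): number {
  const scored = rewards.filter((r) => r.phase !== "explore").map((r) => r.total);
  if (scored.length === 0) return 0; // assumption: no scored steps means score 0
  const mean = scored.reduce((sum, value) => sum + value, 0) / scored.length;
  return Math.min(1, Math.max(0, mean));
}

function isSuccess(score: number): boolean {
  return score >= SUCCESS_SCORE_THRESHOLD;
}
```

For instance, an episode with generate total 0.5 and repair total 0.75 scores 0.625 regardless of its explore rewards, which clears the default threshold.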

## Technical details

- **Files added**:
  - `src/routes/index.tsx` — dashboard page, replaces placeholder.
  - `src/server/env.functions.ts` — `envReset`, `envStep`, `envMetadata`, `envSchema` server fns calling `${process.env.ENV_BASE_URL}`.
  - `src/server/llm/prompts.ts`, `src/server/llm/parse.ts`, `src/server/llm/client.ts` — port of inference.py prompt/parse/call.
  - `src/server/llm.functions.ts` — `llmCall({ phase, obs, prior })` server fn.
  - `src/server/config.functions.ts` — `getRuntimeConfig()` returning `{ envUrl, modelName }` (no secrets).
  - `src/lib/tasks.ts` — ported task bank.
  - `src/lib/rewards.ts` — reward parsing/normalization.
  - `src/lib/types.ts` — `ExplainerObservation`, `ExplainerAction`, etc.
  - `src/store/episode.ts` — Zustand store.
  - `src/components/inspector/*` — Header, Controls, ObservationPanel, LlmPanel, ResearchPanel, RewardsPanel, Log.
- **Secret**: `HF_TOKEN` added via Lovable Cloud secrets after plan approval. The user will be prompted to paste it.
- **Stack stays** TanStack Start + React + Tailwind + shadcn. No new heavy deps; add only `zustand` and a small syntax highlighter (`shiki` or `highlight.js`) — pick the smaller at implementation time.
- **Out of scope**: persistence across reloads, multi-user sessions, auth, charts. Pure single-tab inspector.

## Open items to resolve during implementation

- Confirm whether `/reset` accepts a `topic` argument; if not, we still show the task picker but treat it as a display filter and surface a warning when the returned topic differs.
- Confirm the exact `SUCCESS_SCORE_THRESHOLD` value and `normalized_episode_score` formula from `explainer_env/constants.py`. If the source is not pasted in, default to the mean of totals with success at score ≥ 0.6.