Person D Notes
Use this file for working notes and short-term reminders.
Durable deviations belong in docs/changes.md.
2026-03-08 demo-flow refinement
- Dashboard now frames the product as paper -> brief -> negotiate -> judge -> train.
- Episode page now foregrounds the source paper and explicitly connects the terminal judge result to the training loop.
- Controls now read as replication setup instead of generic episode controls.
- Compare page is positioned as a seeded evaluation bench rather than the primary training-results story.
- The frontend default step action is now scenario-aware, so the live episode path produces valid judged runs instead of immediate invalid-action penalties on ML cases.
- The negotiation panel now shows an explicit `Advance First Round` CTA, so a newly reset episode no longer looks frozen at 0 messages.
- The dashboard `Replicate a Paper` CTA now launches a seeded live demo automatically: reset, first proposal, autoplay, and judged completion all happen without extra clicks.
- The replication setup card now performs a backend health check up front and surfaces a concrete startup command instead of the opaque browser-level `Failed to fetch` message when the API server is down.
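The up-front health check described above can be sketched as a small probe. This is a minimal sketch, not the actual implementation: the `/health` route matches these notes, but the port, URL constant, and the exact startup command shown to the user are assumptions.

```python
import urllib.request
import urllib.error

# Assumed values: the real setup card may use a different port and command.
HEALTH_URL = "http://127.0.0.1:8000/health"
STARTUP_HINT = "Backend unreachable. Start the API server first (e.g. uvicorn server.app:app)."

def check_backend(url: str = HEALTH_URL, timeout: float = 2.0) -> str:
    """Probe the backend health route before rendering the setup card.

    Returns "ok" when the API answers, otherwise a concrete startup hint
    instead of letting the browser surface a bare "Failed to fetch".
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "ok" if resp.status == 200 else STARTUP_HINT
    except (urllib.error.URLError, OSError):
        return STARTUP_HINT
```

The point of the wrapper is that every failure mode (refused connection, timeout, DNS) collapses into one actionable message for the card.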
2026-03-08 three-outcome live demo
- The live demo now has three seeded story modes on the dashboard: `fast-agreement`, `learning-opportunity`, and `no-agreement`.
- Each mode runs against the real backend with deterministic episode data and renders a post-episode results report instead of stopping at a generic terminal state.
- The results report now shows executed rounds, disagreement count, replicability score, paper reliability quality, reward and score charts, training interpretation, and next-tool suggestions.
- Verified backend-driven outputs for the current seeded ML demo cases:
  - `fast-agreement` -> round 2, verdict `accept`, cumulative reward 2.906845
  - `learning-opportunity` -> round 6, verdict `accept`, cumulative reward 4.537097
  - `no-agreement` -> round 6, verdict `timeout`, cumulative reward 0.366529
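Since the three modes are deterministic, the verified numbers above can double as a pinned regression check. The values come straight from the runs noted here; the helper name and call shape are hypothetical.

```python
# Expected terminal stats for the three seeded demo modes, taken from the
# verified backend runs recorded in these notes.
EXPECTED = {
    "fast-agreement":       {"round": 2, "verdict": "accept",  "reward": 2.906845},
    "learning-opportunity": {"round": 6, "verdict": "accept",  "reward": 4.537097},
    "no-agreement":         {"round": 6, "verdict": "timeout", "reward": 0.366529},
}

def check_mode(mode: str, round_: int, verdict: str, reward: float,
               tol: float = 1e-6) -> bool:
    """Compare one live run's terminal stats against the pinned expectations."""
    exp = EXPECTED[mode]
    return (round_ == exp["round"]
            and verdict == exp["verdict"]
            and abs(reward - exp["reward"]) <= tol)
```

A tolerance on the reward keeps the check robust to float formatting while still catching real drift in the seeded episodes.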
2026-03-08 training page with real artifacts
- Added a dedicated `/training` page instead of relying on the old packaged dashboard card.
- The new page is backed by real artifact values from the existing outputs:
  - local deterministic baseline summary
  - live ART/OpenEnv scientist checkpoints
  - seeded hold-out compare summary
  - scientist and lab-manager preview summaries
- The training story is now explicit and honest:
  - the training pipeline works
  - live reward moved positive by later checkpoints
  - hold-out compare still shows the trained Scientist underperforming baseline
  - more training and parser/invalid-action cleanup are still needed
- Header nav now includes `Training`, the dashboard training CTA points there, and the dashboard training teaser uses the same artifact-backed data.
2026-03-08 automated demo video build
- Added `scripts/build_demo_video.py` to synthesize an ElevenLabs voiceover from `.env`, capture clean frontend screenshots, generate captioned slides, and build the final mp4 with `ffmpeg`.
- Added `docs/demo_video_script_60s.md` as the canonical one-minute narration and shot list.
- Generated the current outputs under `replicalab/outputs/demo_video/`:
  - `audio/voiceover.mp3`
  - `replicalab_demo_60s.mp4`
  - `text/voiceover.txt`
  - `text/voiceover.srt`
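The final mux step of a build like this is typically an ffmpeg invocation over a slide concat list plus the voiceover. A sketch of how the command could be assembled, assuming standard ffmpeg flags; the real `build_demo_video.py` may use different options, and `slides.txt` is a hypothetical concat-list filename.

```python
from pathlib import Path

def ffmpeg_mux_args(slides_concat: Path, voiceover: Path, out_mp4: Path) -> list[str]:
    """Build the final ffmpeg command: timed slides (via the concat demuxer)
    plus the narration track, ending when the shorter stream ends."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", str(slides_concat),  # slide images + durations
        "-i", str(voiceover),                                     # ElevenLabs narration
        "-c:v", "libx264", "-pix_fmt", "yuv420p",                 # broadly playable video
        "-c:a", "aac", "-shortest",
        str(out_mp4),
    ]

args = ffmpeg_mux_args(
    Path("slides.txt"),
    Path("replicalab/outputs/demo_video/audio/voiceover.mp3"),
    Path("replicalab/outputs/demo_video/replicalab_demo_60s.mp4"),
)
```

Keeping the arg list as data (rather than a shell string) makes it easy to log and to hand to `subprocess.run` without quoting bugs.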
2026-03-08 Hugging Face Space redeploy
- Investigated the public Space after it showed only the backend landing page instead of the React app.
- Confirmed the repo already had the correct multi-stage Dockerfile and SPA-serving `server/app.py`, but the runtime SHA was still pinned to an older backend-only container.
- Synced the current app files to `ayushozha/replicalab` through the Hugging Face API, restarted the Space, and waited for the runtime SHA to advance to the new repo revision.
- Reverified:
  - https://ayushozha-replicalab.hf.space/ now serves the React frontend
  - https://ayushozha-replicalab.hf.space/episode?... returns 200
  - https://ayushozha-replicalab.hf.space/health still reports {"status":"ok","env":"real","version":"0.1.0"}
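The "wait for the runtime SHA to advance" step above is just a poll loop. A minimal sketch with the SHA getter injected so it can run without network access; in the real flow the getter would wrap a Hugging Face API runtime query, and the upload/restart would go through `huggingface_hub`.

```python
import time
from typing import Callable, Optional

def wait_for_runtime_sha(get_sha: Callable[[], Optional[str]],
                         target_sha: str,
                         attempts: int = 30,
                         delay: float = 0.0) -> bool:
    """Poll until the Space runtime reports the new repo revision.

    `get_sha` is injected (it would wrap a real runtime query); the loop
    gives up after `attempts` polls so a stuck rebuild fails loudly.
    """
    for _ in range(attempts):
        if get_sha() == target_sha:
            return True
        time.sleep(delay)
    return False
```

The explicit attempt cap matters here: the original symptom was a runtime silently pinned to an older container, which an unbounded wait would never surface.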
2026-03-08 policy-results clarification page
- Added a dedicated `/policies` frontend route for the question: baseline vs trained vs oracle.
- The new page makes the current runtime explicit:
  - `/compare` is still the seeded deterministic benchmark
  - the public app is not currently mounting the trained Scientist adapter
  - the public app is not currently mounting the Anthropic oracle path
  - the Judge remains deterministic
- Updated `/compare` with a callout so it no longer implies that it is already comparing live mounted model policies.
2026-03-08 localhost model-backed Scientist mode
- Added live runtime detection to the episode flow through `/runtime`.
- Non-demo localhost episodes now prefer the backend `/agent-step` route over the frontend default action builder when a model runtime is available.
- The episode page now surfaces the current Scientist runtime directly in the UI so it is clear whether localhost is using the baseline or a model-backed path.
- Current live localhost mode is `ollama` with `glm-5:cloud`.
- Anthropic-backed Scientist mode exists in code, but the current Anthropic account cannot run live due to insufficient API credits, so localhost falls back to the Ollama runtime for real model-driven stepping.
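The routing decision above can be sketched as one predicate. This is an illustrative sketch only: the field names in the `/runtime` payload (`mode`) and the returned labels are assumptions, not the actual frontend code.

```python
def pick_step_route(is_demo: bool, runtime: dict) -> str:
    """Decide how a localhost episode advances the Scientist.

    Demo episodes keep the frontend default-action builder; non-demo
    episodes use the backend /agent-step route whenever /runtime reports
    a model-backed runtime (e.g. ollama). Payload shape is assumed.
    """
    model_backed = runtime.get("mode") in {"ollama", "anthropic"}
    if not is_demo and model_backed:
        return "/agent-step"
    return "frontend-default-action"
```

Centralizing the choice in one function also gives the UI a single source of truth for displaying which path localhost is actually using.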
2026-03-08 dynamic live-run and judge-caveat cleanup
- The main dashboard CTA no longer launches the same fixed seeded flow every time. `Replicate a Random Paper` now generates a fresh seeded route with a random scenario family, difficulty, and seed, then autostarts the live episode path.
- The three fixed cards remain available, but are now labeled as scripted outcomes rather than the default live experience.
- Accepted verdicts that still carry weak-component reasons are now shown as `Accept with caveats` in the judge-facing UI instead of `Accept` plus a contradictory `Failure Reasons` block.
- The results page now reports those cases as conditional replication candidates rather than clean wins.
- The stage animation and completion toast now treat accepted-with-caveats runs as partial wins instead of full celebratory successes.
- Live reset verification confirmed the random path can surface distinct paper briefs across scenario families, including CIFAR-10 replication and offline mean-reversion backtest cases.
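The accept-with-caveats rule above reduces to a small mapping from raw verdict plus weak-component reasons to a UI label. A sketch under assumptions: the exact label strings and verdict values in the real UI may differ.

```python
def classify_verdict(verdict: str, failure_reasons: list[str]) -> str:
    """Map a raw judge verdict plus weak-component reasons to a UI label.

    An `accept` that still carries weak-component reasons becomes
    "Accept with caveats" (a conditional replication candidate, shown as
    a partial win) rather than a clean "Accept" next to a contradictory
    failure list. Label strings here are assumptions for this sketch.
    """
    if verdict == "accept":
        return "Accept with caveats" if failure_reasons else "Accept"
    return verdict.capitalize()  # e.g. "Timeout", "Reject"
```

The same classification can drive the stage animation and completion toast, so all three surfaces agree on whether a run was a full or partial win.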