Person D Notes
Use this file for working notes and short-term reminders.
Durable deviations belong in docs/changes.md.
2026-03-08 demo-flow refinement
- Dashboard now frames the product as paper -> brief -> negotiate -> judge -> train.
- Episode page now foregrounds the source paper and explicitly connects the terminal judge result to the training loop.
- Controls now read as replication setup instead of generic episode controls.
- Compare page is positioned as a seeded evaluation bench rather than the primary training-results story.
- The frontend default step action is now scenario-aware, so the live episode path produces valid judged runs instead of immediate invalid-action penalties on ML cases.
- The negotiation panel now shows an explicit `Advance First Round` CTA, so a newly reset episode no longer looks frozen at 0 messages.
- The dashboard `Replicate a Paper` CTA now launches a seeded live demo automatically: reset, first proposal, autoplay, and judged completion all happen without extra clicks.
- The replication setup card now performs a backend health check up front and surfaces a concrete startup command instead of the opaque browser-level `Failed to fetch` message when the API server is down.
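The up-front health check described above can be sketched as a small probe. This is a minimal sketch, not the actual implementation: the `/health` route matches these notes, but the port, URL constant, and the exact startup command shown to the user are assumptions.

```python
import urllib.request
import urllib.error

# Assumed values: the real setup card may use a different port and command.
HEALTH_URL = "http://127.0.0.1:8000/health"
STARTUP_HINT = "Backend unreachable. Start the API server first (e.g. uvicorn server.app:app)."

def check_backend(url: str = HEALTH_URL, timeout: float = 2.0) -> str:
    """Probe the backend health route before rendering the setup card.

    Returns "ok" when the API answers, otherwise a concrete startup hint
    instead of letting the browser surface a bare "Failed to fetch".
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "ok" if resp.status == 200 else STARTUP_HINT
    except (urllib.error.URLError, OSError):
        return STARTUP_HINT
```

The point of the wrapper is that every failure mode (refused connection, timeout, DNS) collapses into one actionable message for the card.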
2026-03-08 three-outcome live demo
- The live demo now has three seeded story modes on the dashboard: `fast-agreement`, `learning-opportunity`, and `no-agreement`.
- Each mode runs against the real backend with deterministic episode data and renders a post-episode results report instead of stopping at a generic terminal state.
- The results report now shows executed rounds, disagreement count, replicability score, paper reliability quality, reward and score charts, training interpretation, and next-tool suggestions.
- Verified backend-driven outputs for the current seeded ML demo cases:
  - `fast-agreement` -> round 2, verdict `accept`, cumulative reward 2.906845
  - `learning-opportunity` -> round 6, verdict `accept`, cumulative reward 4.537097
  - `no-agreement` -> round 6, verdict `timeout`, cumulative reward 0.366529
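Since the three modes are deterministic, the verified numbers above can double as a pinned regression check. The values come straight from the runs noted here; the helper name and call shape are hypothetical.

```python
# Expected terminal stats for the three seeded demo modes, taken from the
# verified backend runs recorded in these notes.
EXPECTED = {
    "fast-agreement":       {"round": 2, "verdict": "accept",  "reward": 2.906845},
    "learning-opportunity": {"round": 6, "verdict": "accept",  "reward": 4.537097},
    "no-agreement":         {"round": 6, "verdict": "timeout", "reward": 0.366529},
}

def check_mode(mode: str, round_: int, verdict: str, reward: float,
               tol: float = 1e-6) -> bool:
    """Compare one live run's terminal stats against the pinned expectations."""
    exp = EXPECTED[mode]
    return (round_ == exp["round"]
            and verdict == exp["verdict"]
            and abs(reward - exp["reward"]) <= tol)
```

A tolerance on the reward keeps the check robust to float formatting while still catching real drift in the seeded episodes.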
2026-03-08 training page with real artifacts
- Added a dedicated `/training` page instead of relying on the old packaged dashboard card.
- The new page is backed by real artifact values from the existing outputs:
  - local deterministic baseline summary
  - live ART/OpenEnv scientist checkpoints
  - seeded hold-out compare summary
  - scientist and lab-manager preview summaries
- The training story is now explicit and honest:
  - the training pipeline works
  - live reward moved positive by later checkpoints
  - hold-out compare still shows the trained Scientist underperforming baseline
  - more training and parser/invalid-action cleanup are still needed
- Header nav now includes `Training`, the dashboard training CTA points there, and the dashboard training teaser uses the same artifact-backed data.
2026-03-08 automated demo video build
- Added `scripts/build_demo_video.py` to synthesize an ElevenLabs voiceover from `.env`, capture clean frontend screenshots, generate captioned slides, and build the final mp4 with `ffmpeg`.
- Added `docs/demo_video_script_60s.md` as the canonical one-minute narration and shot list.
- Generated the current outputs under `replicalab/outputs/demo_video/`:
  - `audio/voiceover.mp3`
  - `replicalab_demo_60s.mp4`
  - `text/voiceover.txt`
  - `text/voiceover.srt`
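The final mux step of a build like this is typically an ffmpeg invocation over a slide concat list plus the voiceover. A sketch of how the command could be assembled, assuming standard ffmpeg flags; the real `build_demo_video.py` may use different options, and `slides.txt` is a hypothetical concat-list filename.

```python
from pathlib import Path

def ffmpeg_mux_args(slides_concat: Path, voiceover: Path, out_mp4: Path) -> list[str]:
    """Build the final ffmpeg command: timed slides (via the concat demuxer)
    plus the narration track, ending when the shorter stream ends."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", str(slides_concat),  # slide images + durations
        "-i", str(voiceover),                                     # ElevenLabs narration
        "-c:v", "libx264", "-pix_fmt", "yuv420p",                 # broadly playable video
        "-c:a", "aac", "-shortest",
        str(out_mp4),
    ]

args = ffmpeg_mux_args(
    Path("slides.txt"),
    Path("replicalab/outputs/demo_video/audio/voiceover.mp3"),
    Path("replicalab/outputs/demo_video/replicalab_demo_60s.mp4"),
)
```

Keeping the arg list as data (rather than a shell string) makes it easy to log and to hand to `subprocess.run` without quoting bugs.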
2026-03-08 Hugging Face Space redeploy
- Investigated the public Space after it showed only the backend landing page instead of the React app.
- Confirmed the repo already had the correct multi-stage Dockerfile and SPA-serving `server/app.py`, but the runtime SHA was still pinned to an older backend-only container.
- Synced the current app files to `ayushozha/replicalab` through the Hugging Face API, restarted the Space, and waited for the runtime SHA to advance to the new repo revision.
- Reverified:
  - https://ayushozha-replicalab.hf.space/ now serves the React frontend
  - https://ayushozha-replicalab.hf.space/episode?... returns 200
  - https://ayushozha-replicalab.hf.space/health still reports {"status":"ok","env":"real","version":"0.1.0"}
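The "wait for the runtime SHA to advance" step above is just a poll loop. A minimal sketch with the SHA getter injected so it can run without network access; in the real flow the getter would wrap a Hugging Face API runtime query, and the upload/restart would go through `huggingface_hub`.

```python
import time
from typing import Callable, Optional

def wait_for_runtime_sha(get_sha: Callable[[], Optional[str]],
                         target_sha: str,
                         attempts: int = 30,
                         delay: float = 0.0) -> bool:
    """Poll until the Space runtime reports the new repo revision.

    `get_sha` is injected (it would wrap a real runtime query); the loop
    gives up after `attempts` polls so a stuck rebuild fails loudly.
    """
    for _ in range(attempts):
        if get_sha() == target_sha:
            return True
        time.sleep(delay)
    return False
```

The explicit attempt cap matters here: the original symptom was a runtime silently pinned to an older container, which an unbounded wait would never surface.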
2026-03-08 policy-results clarification page
- Added a dedicated `/policies` frontend route for the question: baseline vs trained vs oracle.
- The new page makes the current runtime explicit:
  - `/compare` is still the seeded deterministic benchmark
  - the public app is not currently mounting the trained Scientist adapter
  - the public app is not currently mounting the Anthropic oracle path
  - the Judge remains deterministic
- Updated `/compare` with a callout so it no longer implies that it is already comparing live mounted model policies.
2026-03-08 localhost model-backed Scientist mode
- Added live runtime detection to the episode flow through `/runtime`.
- Non-demo localhost episodes now prefer the backend `/agent-step` route over the frontend default action builder when a model runtime is available.
- The episode page now surfaces the current Scientist runtime directly in the UI so it is clear whether localhost is using the baseline or a model-backed path.
- Current live localhost mode is `ollama` with `glm-5:cloud`.
- Anthropic-backed Scientist mode exists in code, but the current Anthropic account cannot run live due to insufficient API credits, so localhost falls back to the Ollama runtime for real model-driven stepping.
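The routing decision above can be sketched as one predicate. This is an illustrative sketch only: the field names in the `/runtime` payload (`mode`) and the returned labels are assumptions, not the actual frontend code.

```python
def pick_step_route(is_demo: bool, runtime: dict) -> str:
    """Decide how a localhost episode advances the Scientist.

    Demo episodes keep the frontend default-action builder; non-demo
    episodes use the backend /agent-step route whenever /runtime reports
    a model-backed runtime (e.g. ollama). Payload shape is assumed.
    """
    model_backed = runtime.get("mode") in {"ollama", "anthropic"}
    if not is_demo and model_backed:
        return "/agent-step"
    return "frontend-default-action"
```

Centralizing the choice in one function also gives the UI a single source of truth for displaying which path localhost is actually using.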
2026-03-08 dynamic live-run and judge-caveat cleanup
- The main dashboard CTA no longer launches the same fixed seeded flow every time. `Replicate a Random Paper` now generates a fresh seeded route with a random scenario family, difficulty, and seed, then autostarts the live episode path.
- The three fixed cards remain available, but are now labeled as scripted outcomes rather than the default live experience.
- Accepted verdicts that still carry weak-component reasons are now shown as `Accept with caveats` in the judge-facing UI instead of `Accept` plus a contradictory `Failure Reasons` block.
- The results page now reports those cases as conditional replication candidates rather than clean wins.
- The stage animation and completion toast now treat accepted-with-caveats runs as partial wins instead of full celebratory successes.
- Live reset verification confirmed the random path can surface distinct paper briefs across scenario families, including CIFAR-10 replication and offline mean-reversion backtest cases.
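The accept-with-caveats rule above reduces to a small mapping from raw verdict plus weak-component reasons to a UI label. A sketch under assumptions: the exact label strings and verdict values in the real UI may differ.

```python
def classify_verdict(verdict: str, failure_reasons: list[str]) -> str:
    """Map a raw judge verdict plus weak-component reasons to a UI label.

    An `accept` that still carries weak-component reasons becomes
    "Accept with caveats" (a conditional replication candidate, shown as
    a partial win) rather than a clean "Accept" next to a contradictory
    failure list. Label strings here are assumptions for this sketch.
    """
    if verdict == "accept":
        return "Accept with caveats" if failure_reasons else "Accept"
    return verdict.capitalize()  # e.g. "Timeout", "Reject"
```

The same classification can drive the stage animation and completion toast, so all three surfaces agree on whether a run was a full or partial win.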