feat: prompt-comparison section + retry logic + scale-up to 15 tasks × {mini,nano} × 3 prompts 6ca9a91 verified TheUnicat committed 12 days ago
fix: re-judge experiment rollouts after credit-exhaustion; retry logic in batched + single-criterion judge 8410720 verified TheUnicat committed 12 days ago
feat: V₀=0.5 baseline + det ceilings, prompt-pill UI, 60 experiment rollouts 2cd2802 verified TheUnicat committed 12 days ago
feat: per-turn state value + turn score on Demo (state_v1 trajectories baked into 228 rollouts) e7cdebc verified TheUnicat committed 12 days ago
stream rollout messages per-turn via env_response hook (no more end-of-rollout replay) 59ea1c7 verified TheUnicat committed 14 days ago
purge stale standalone rollouts, mirror to current 228 97cbf70 verified TheUnicat committed 14 days ago
default judge → opus 4.7; expose configured judge via /api/health 9578447 verified TheUnicat committed 14 days ago
fix: move RUNS_DIR to /app/runs (HF persistent /data hides image-baked content) 5b90296 verified TheUnicat committed 14 days ago