# Ayush Notes

Use this file for short-lived working notes, reminders, and handoff details. Do not use this file for durable deviations from the original plan. Put those in `docs/changes.md`.

Current local training-data note:

- A 50-paper experiment-design corpus now exists under `data/papers/`.
- Use `data/papers/manifest.json` for the full scenario-to-paper mapping.
- Most entries are marked `alternative` because many scenario titles in `ReplicaLab_50_Scenarios_Training_Plan.md` are synthetic summaries rather than directly downloadable published paper titles.

Current V2 training architecture note:

- The reusable training stack now lives under `replicalab/training/`.
- `notebooks/train_minimal_colab.ipynb` is now the explicit sponsor-facing minimal Colab script using Unsloth + HF TRL.
- `notebooks/train_colab.ipynb` is the judged notebook driver, but heavy runs are expected to use the `replicalab-train` entrypoint on Northflank H100.
- The primary shared base is now `Qwen/Qwen3.5-9B`, with separate Scientist GRPO and Lab Manager SFT adapters.
- The reduced-scale fallback is `Qwen/Qwen3.5-4B`.
- The audit-only judge candidate is `Qwen/Qwen3.5-122B-A10B`.
- The deterministic rubric remains the only training reward source, even when Anthropic-backed oracle features are enabled for V2 overlays.
- `docs/training_goals.md` now defines the current model goals and the separation between metric improvements and the larger execution-env redesign.
- A March 9 operational check found that the current Hugging Face token is valid for Hub auth but belongs to a non-billable personal account (`canPay=false`, no orgs), so it is not currently enough to provision paid large-model hosting on Hugging Face.
- The current Northflank manual job `replicalab-train` still has runtime env values, but `northflank start job run` returns `409 No deployment configured`, so the job cannot launch until a runnable image/deployment is attached.
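For handoff convenience, here is a minimal sketch of filtering the scenario-to-paper mapping for `alternative` entries. The manifest schema shown is an assumption (the real keys in `data/papers/manifest.json` may differ), so the sample data is inlined rather than read from disk.

```python
# Hypothetical manifest shape: scenario id -> paper record with a match status.
# The actual data/papers/manifest.json schema may use different keys.
sample_manifest = {
    "finance_trading_01": {"paper": "papers/finance_trading_01.pdf", "match": "exact"},
    "ml_benchmark_07": {"paper": "papers/ml_benchmark_07.pdf", "match": "alternative"},
}

def alternative_scenarios(manifest: dict) -> list[str]:
    """Return scenario ids whose mapped paper is an 'alternative' stand-in."""
    return sorted(
        sid for sid, entry in manifest.items()
        if entry.get("match") == "alternative"
    )

print(alternative_scenarios(sample_manifest))  # → ['ml_benchmark_07']
```

Swapping `sample_manifest` for `json.load(open("data/papers/manifest.json"))` would run this against the real corpus, assuming the schema matches.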
- The live Northflank service on the same `nf-gpu-hack-16-64` plan does not currently expose `nvidia-smi` or `/dev/nvidia*` inside the container, so GPU availability should be treated as unverified until the runtime is fixed and a direct hardware probe succeeds.

Current Northflank notebook note:

- The dedicated notebook service now lives in project `notebook-openport` as service `jupyter-pytorch`.
- The pasted notebook hostname `app--jupyter-pytorch--h74j66w224jx.code.run` is stale; the live public notebook endpoint on 2026-03-09 is `app--jupyter-pytorch--9y6g97v7czb9.code.run`.
- The notebook runtime does expose a real `NVIDIA H100 80GB HBM3` GPU.
- `/home/jovyan/replicalab-ai` and `/home/jovyan/replicalab-qwen3.5-grpo` already exist in that notebook, with saved adapter checkpoints through `checkpoint-200`.
- The saved `grpo_training.log` shows the notebook ran on H100 but did not complete cleanly: baseline eval emitted `string indices must be integers, not 'str'`, and the final inference cell failed in `tokenizer.apply_chat_template(...)` with the same content-structure issue.

Current ART/OpenEnv runtime note:

- The active live Scientist RL path is now `art-scientist-train` in `replicalab/training/cli.py`.
- Fresh-runtime smoke validation completed on 2026-03-08 for:
  - `scientist-preview-smoke-20260308b`
  - `lab-manager-preview-smoke-20260308b`
  - `art-scientist-smoke-20260308b`
  - `art-scientist-compare-smoke-20260308b`
- The live ART Scientist checkpoint reached `step7`, but the current trained checkpoint still underperforms the deterministic baseline on held-out comparison.
- The main remaining work is experiment-quality iteration, not missing training infrastructure.
- Evaluation summaries now track `paper_understanding` and `communication_quality`, and the shared benchmark-history plots live under `replicalab/outputs/training/history/`.
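The `string indices must be integers, not 'str'` failure noted above is the classic symptom of mixing plain-string message content with structured (list-of-parts) content before `tokenizer.apply_chat_template(...)`. A minimal normalizer sketch, assuming the notebook's messages follow the common `{"role": ..., "content": str | list[{"type": "text", "text": ...}]}` shape (that shape is an assumption; the real fix depends on what the notebook actually builds):

```python
def normalize_messages(messages: list[dict]) -> list[dict]:
    """Coerce each message's content to a plain string so template code that
    expects `content: str` never indexes into a string with a key.
    Hypothetical helper; adapt to the notebook's actual message structure."""
    normalized = []
    for msg in messages:
        content = msg.get("content", "")
        if isinstance(content, list):
            # Structured content: keep only the text parts of each fragment.
            content = "".join(
                part.get("text", "") if isinstance(part, dict) else str(part)
                for part in content
            )
        normalized.append({"role": msg.get("role", "user"), "content": content})
    return normalized

msgs = [{"role": "user", "content": [{"type": "text", "text": "Design the assay."}]}]
print(normalize_messages(msgs))
# → [{'role': 'user', 'content': 'Design the assay.'}]
```

Running model outputs through a normalizer like this before both the baseline-eval scoring and the final inference cell would address both failures at once, if the diagnosis above is right.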
Current localhost model-runtime note:

- `server/app.py` now exposes `/runtime` and `/agent-step` so the local app can run a backend-selected Scientist policy instead of the frontend stub.
- Anthropic-backed Scientist inference was wired, but the current Anthropic account cannot be used live because the API billing balance is too low.
- Localhost therefore currently runs in `ollama` mode with `glm-5:cloud` as the working model-backed Scientist path.
- The server applies a small deterministic safety adapter to model outputs before env stepping:
  - trims controls to fit the sample size
  - aligns equipment and reagent requests to the available inventory
  - clamps duration to the current lab time limit
- If the local model stalls or errors, `/agent-step` falls back to the deterministic baseline Scientist and records that in the step metadata as `scientist_runtime=ollama_fallback`.

Current March 9 H100 benchmark note:

- The full multi-round `scientist-local-compare-eval` path is live on the Northflank H100 notebook, but the current notebook image is missing the fast linear-attention path for the saved `unsloth/Qwen3.5-0.8B` adapter, so large sharded rollout sweeps did not flush artifacts on a practical same-turn timescale.
- A fallback live H100 first-step benchmark was run on 2026-03-09 instead: `250` shared reset cases with both baseline and trained Scientist first-step actions, for `500` total simulations.
- The merged artifact root is `replicalab/outputs/training/h100-one-step-500-20260309/`.
- The benchmark spans `34` trainable papers.
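For handoff clarity, the deterministic safety adapter in the localhost note above can be sketched as a pure clamping function. The `Action` and `LabLimits` types here are hypothetical stand-ins, not the actual `server/app.py` types:

```python
from dataclasses import dataclass, replace

# Hypothetical shapes; the real server/app.py action and limit types may differ.
@dataclass(frozen=True)
class Action:
    controls: list[str]
    equipment: list[str]
    reagents: list[str]
    duration_hours: float

@dataclass(frozen=True)
class LabLimits:
    sample_size: int            # max controls the design can support
    inventory: frozenset[str]   # available equipment and reagents
    time_limit_hours: float

def apply_safety_adapter(action: Action, limits: LabLimits) -> Action:
    """Deterministically clamp a model-proposed action before env stepping:
    trim controls, drop unavailable equipment/reagents, cap duration."""
    return replace(
        action,
        controls=action.controls[: limits.sample_size],
        equipment=[e for e in action.equipment if e in limits.inventory],
        reagents=[r for r in action.reagents if r in limits.inventory],
        duration_hours=min(action.duration_hours, limits.time_limit_hours),
    )

raw = Action(["c1", "c2", "c3"], ["pcr", "laser"], ["buffer"], 48.0)
limits = LabLimits(2, frozenset({"pcr", "buffer"}), 24.0)
safe = apply_safety_adapter(raw, limits)
print(safe.controls, safe.equipment, safe.duration_hours)
# → ['c1', 'c2'] ['pcr'] 24.0
```

The design choice worth preserving on handoff: the adapter never rejects an action outright, it only shrinks it to fit the lab, so the fallback-to-baseline path is reserved for model stalls and hard errors.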
- Summary result:
  - baseline average first-step paper understanding: `0.61692084`
  - trained average first-step paper understanding: `0.063866752`
  - baseline average first-step reward: `0.3`
  - trained average first-step reward: `0.05`
  - trained request-info rate: `1.0`
  - invalid-action rate stayed `0.0` for both labels
- Scenario-level understanding:
  - baseline `finance_trading`: `0.596033`
  - trained `finance_trading`: `0.018182`
  - baseline `ml_benchmark`: `0.633333`
  - trained `ml_benchmark`: `0.099762`
- Current interpretation: the saved `replicalab-qwen3.5-grpo` adapter is materially worse than the deterministic baseline on first-step paper grounding, and currently behaves like a universal `request_info` policy under a fast decode budget.
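To make the summary numbers above reproducible for whoever picks this up, a sketch of the per-label aggregation over the merged per-case records. The record schema is an assumption; the real files under the artifact root may use different keys, so the sample data is inlined:

```python
from statistics import mean

# Hypothetical per-case record shape for the merged first-step artifacts;
# the real benchmark output files may use different field names.
records = [
    {"label": "baseline", "paper_understanding": 0.62, "reward": 0.30},
    {"label": "trained",  "paper_understanding": 0.06, "reward": 0.05},
    {"label": "baseline", "paper_understanding": 0.61, "reward": 0.30},
    {"label": "trained",  "paper_understanding": 0.07, "reward": 0.05},
]

def label_mean(records: list[dict], label: str, key: str) -> float:
    """Average a metric over all records with the given policy label."""
    return mean(r[key] for r in records if r["label"] == label)

print(round(label_mean(records, "baseline", "paper_understanding"), 3))  # → 0.615
print(round(label_mean(records, "trained", "reward"), 3))                # → 0.05
```

Pointing `records` at the JSON under `replicalab/outputs/training/h100-one-step-500-20260309/` (once the real field names are confirmed) should recover the summary table above.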