# Ayush Notes

Use this file for short-lived working notes, reminders, and handoff details.
Do not use this file for durable deviations from the original plan. Put those in `docs/changes.md`.
Current local training-data note:

- A 50-paper experiment-design corpus now exists under `data/papers/`.
- Use `data/papers/manifest.json` for the full scenario-to-paper mapping.
- Most entries are marked `alternative` because many scenario titles in `ReplicaLab_50_Scenarios_Training_Plan.md` are synthetic summaries rather than directly downloadable published paper titles.
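A minimal sketch of pulling the `alternative`-flagged scenarios out of the manifest. The schema here (scenario id keyed to an entry with a `match` field) is an assumption for illustration, not the verified format of `data/papers/manifest.json`:

```python
import json

# Hypothetical manifest shape -- the real data/papers/manifest.json schema
# may differ; adjust the field names before relying on this.
SAMPLE_MANIFEST = {
    "finance_trading": {"paper_title": "Momentum in Equity Markets", "match": "alternative"},
    "ml_benchmark": {"paper_title": "Benchmarking Deep Nets", "match": "exact"},
}

def alternative_scenarios(manifest: dict) -> list[str]:
    """Return scenario ids whose mapped paper is only an alternative match."""
    return sorted(sid for sid, entry in manifest.items()
                  if entry.get("match") == "alternative")

if __name__ == "__main__":
    # With the real file: manifest = json.load(open("data/papers/manifest.json"))
    print(alternative_scenarios(SAMPLE_MANIFEST))
```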
Current V2 training architecture note:

- The reusable training stack now lives under `replicalab/training/`.
- `notebooks/train_minimal_colab.ipynb` is now the explicit sponsor-facing minimal Colab script using Unsloth + HF TRL.
- `notebooks/train_colab.ipynb` is the judged notebook driver, but heavy runs are expected to use the `replicalab-train` entrypoint on Northflank H100.
- The primary shared base is now `Qwen/Qwen3.5-9B` with separate Scientist GRPO and Lab Manager SFT adapters.
- The reduced-scale fallback is `Qwen/Qwen3.5-4B`.
- The audit-only judge candidate is `Qwen/Qwen3.5-122B-A10B`.
- The deterministic rubric remains the only training reward source even when Anthropic-backed oracle features are enabled for V2 overlays.
- `docs/training_goals.md` now defines the current model goals and the separation between metric improvements and the larger execution-env redesign.
- A March 9 operational check found that the current Hugging Face token is valid for Hub auth but belongs to a non-billable personal account (`canPay=false`, no orgs), so it is not currently enough to provision paid large-model hosting on Hugging Face.
- The current Northflank manual job `replicalab-train` still has runtime env values, but `northflank start job run` returns `409 No deployment configured`, so the job cannot launch until a runnable image/deployment is attached.
- The live Northflank service on the same `nf-gpu-hack-16-64` plan does not currently expose `nvidia-smi` or `/dev/nvidia*` inside the container, so GPU availability should be treated as unverified until the runtime is fixed and a direct hardware probe succeeds.
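The "direct hardware probe" above can be a pure-stdlib check along these lines: require both a working `nvidia-smi` binary and at least one `/dev/nvidia*` device node before trusting any GPU claim. A sketch, not the repo's actual probe:

```python
import glob
import shutil
import subprocess

def gpu_available() -> bool:
    """Conservative GPU probe: only report True when nvidia-smi exists,
    device nodes are present, and `nvidia-smi -L` actually lists a GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    if not glob.glob("/dev/nvidia*"):
        return False
    try:
        result = subprocess.run(["nvidia-smi", "-L"],
                                capture_output=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and b"GPU" in result.stdout
```

Any single failing check keeps the answer at "unverified", matching the note's stance.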
Current Northflank notebook note:

- The dedicated notebook service now lives in project `notebook-openport` as service `jupyter-pytorch`.
- The pasted notebook hostname `app--jupyter-pytorch--h74j66w224jx.code.run` is stale; the live public notebook endpoint on 2026-03-09 is `app--jupyter-pytorch--9y6g97v7czb9.code.run`.
- The notebook runtime does expose a real `NVIDIA H100 80GB HBM3` GPU.
- `/home/jovyan/replicalab-ai` and `/home/jovyan/replicalab-qwen3.5-grpo` already exist in that notebook, with saved adapter checkpoints through `checkpoint-200`.
- The saved `grpo_training.log` shows the notebook ran on H100 but did not complete cleanly: baseline eval emitted `string indices must be integers, not 'str'`, and the final inference cell failed in `tokenizer.apply_chat_template(...)` with the same content-structure issue.
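The `string indices must be integers, not 'str'` failure is consistent with a message `content` arriving as a list of content blocks where the chat template expects a plain string. One hedged fix is to normalize messages before calling `tokenizer.apply_chat_template(...)`; the helper name and block schema below are assumptions, not the repo's code:

```python
def normalize_messages(messages: list[dict]) -> list[dict]:
    """Coerce each message's content to a plain string so a chat template
    that expects str content never receives a list of content blocks.
    Hypothetical helper; adapt to the tokenizer's actual expected schema."""
    out = []
    for msg in messages:
        content = msg.get("content", "")
        if isinstance(content, list):
            # Flatten [{"type": "text", "text": "..."}]-style blocks.
            content = "".join(
                block.get("text", "") if isinstance(block, dict) else str(block)
                for block in content
            )
        out.append({"role": msg.get("role", "user"), "content": str(content)})
    return out
```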
Current ART/OpenEnv runtime note:

- The active live Scientist RL path is now `art-scientist-train` in `replicalab/training/cli.py`.
- Fresh-runtime smoke validation completed on 2026-03-08 for:
  - `scientist-preview-smoke-20260308b`
  - `lab-manager-preview-smoke-20260308b`
  - `art-scientist-smoke-20260308b`
  - `art-scientist-compare-smoke-20260308b`
- The live ART Scientist checkpoint reached `step7`, but the current trained checkpoint still underperforms the deterministic baseline on held-out comparison.
- The main remaining work is experiment quality iteration, not missing training infrastructure.
- Evaluation summaries now track `paper_understanding` and `communication_quality`, and the shared benchmark-history plots live under `replicalab/outputs/training/history/`.
Current localhost model-runtime note:

- `server/app.py` now exposes `/runtime` and `/agent-step` so the local app can run a backend-selected Scientist policy instead of the frontend stub.
- Anthropic-backed Scientist inference was wired, but the current Anthropic account cannot be used live because the API billing balance is too low.
- Localhost therefore currently runs in `ollama` mode with `glm-5:cloud` as the working model-backed Scientist path.
- The server applies a small deterministic safety adapter to model outputs before env stepping:
  - trims controls to fit sample size
  - aligns equipment and reagent requests to the available inventory
  - clamps duration to the current lab time limit
- If the local model stalls or errors, `/agent-step` falls back to the deterministic baseline Scientist and records that in the step metadata as `scientist_runtime=ollama_fallback`.
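The three adapter rules above can be sketched deterministically. All field names here (`controls`, `equipment`, `reagents`, `duration`, `sample_size`, `inventory`, `time_limit`) are assumptions about the env schema, not the actual `server/app.py` code:

```python
def apply_safety_adapter(action: dict, lab: dict) -> dict:
    """Deterministic safety adapter sketch mirroring the rules above:
    trim controls to sample size, keep only requests present in the
    inventory, and clamp duration to the lab time limit."""
    adapted = dict(action)
    # Rule 1: trim controls to fit sample size.
    adapted["controls"] = list(action.get("controls", []))[: lab.get("sample_size", 0)]
    # Rule 2: align equipment/reagent requests to the available inventory.
    inventory = set(lab.get("inventory", []))
    adapted["equipment"] = [e for e in action.get("equipment", []) if e in inventory]
    adapted["reagents"] = [r for r in action.get("reagents", []) if r in inventory]
    # Rule 3: clamp duration to the current lab time limit.
    adapted["duration"] = min(action.get("duration", 0), lab.get("time_limit", 0))
    return adapted
```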
Current March 9 H100 benchmark note:

- The full multi-round `scientist-local-compare-eval` path is live on the Northflank H100 notebook, but the current notebook image is missing the fast linear-attention path for the saved `unsloth/Qwen3.5-0.8B` adapter, so large sharded rollout sweeps did not flush artifacts on a practical same-turn timescale.
- A fallback live H100 first-step benchmark was run on 2026-03-09 instead: `250` shared reset cases with both baseline and trained Scientist first-step actions, for `500` total simulations.
- The merged artifact root is `replicalab/outputs/training/h100-one-step-500-20260309/`.
- The benchmark spans `34` trainable papers.
- Summary result:
  - baseline average first-step paper understanding: `0.61692084`
  - trained average first-step paper understanding: `0.063866752`
  - baseline average first-step reward: `0.3`
  - trained average first-step reward: `0.05`
  - trained request-info rate: `1.0`
  - invalid-action rate stayed `0.0` for both labels
- Scenario-level understanding:
  - baseline `finance_trading`: `0.596033`
  - trained `finance_trading`: `0.018182`
  - baseline `ml_benchmark`: `0.633333`
  - trained `ml_benchmark`: `0.099762`
- Current interpretation: the saved `replicalab-qwen3.5-grpo` adapter is materially worse than the deterministic baseline on first-step paper grounding and currently behaves like a universal `request_info` policy under a fast decode budget.
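The per-label averages above can be recomputed from the merged artifact records with a small aggregator; the record schema assumed here (a `label` field plus flat metric keys) is an illustration, not the verified artifact format:

```python
from collections import defaultdict

def mean_by_label(records: list[dict], metric: str) -> dict[str, float]:
    """Average a first-step metric per label (e.g. baseline vs trained)
    across shared reset cases. Record schema is assumed, not verified."""
    sums: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for rec in records:
        sums[rec["label"]] += float(rec[metric])
        counts[rec["label"]] += 1
    return {label: sums[label] / counts[label] for label in sums}
```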