# Ayush Notes

Use this file for short-lived working notes, reminders, and handoff details.

Do not use this file for durable deviations from the original plan. Put those in `docs/changes.md`.

Current local training-data note:

- A 50-paper experiment-design corpus now exists under `data/papers/`.
- Use `data/papers/manifest.json` for the full scenario-to-paper mapping.
- Most entries are marked `alternative` because many scenario titles in
  `ReplicaLab_50_Scenarios_Training_Plan.md` are synthetic summaries rather
  than directly downloadable published paper titles.
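
A minimal sketch of how the scenario-to-paper mapping could be consumed; the manifest schema assumed here (`scenario_id`, `paper`, `status`) is a guess for illustration, not the confirmed layout of `data/papers/manifest.json`:

```python
import json
from pathlib import Path

def load_scenario_map(manifest_path):
    """Map scenario IDs to papers, skipping 'alternative' substitutes.

    Assumes a list of entries like
    {"scenario_id": ..., "paper": ..., "status": ...}.
    """
    entries = json.loads(Path(manifest_path).read_text())
    return {
        e["scenario_id"]: e["paper"]
        for e in entries
        if e.get("status") != "alternative"
    }
```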

Current V2 training architecture note:

- The reusable training stack now lives under `replicalab/training/`.
- `notebooks/train_minimal_colab.ipynb` is now the explicit sponsor-facing minimal Colab script using Unsloth + HF TRL.
- `notebooks/train_colab.ipynb` is the judged notebook driver, but heavy runs
  are expected to use the `replicalab-train` entrypoint on Northflank H100.
- The primary shared base is now `Qwen/Qwen3.5-9B` with separate Scientist
  GRPO and Lab Manager SFT adapters.
- The reduced-scale fallback is `Qwen/Qwen3.5-4B`.
- The audit-only judge candidate is `Qwen/Qwen3.5-122B-A10B`.
- The deterministic rubric remains the only training reward source even when
  Anthropic-backed oracle features are enabled for V2 overlays.
- `docs/training_goals.md` now defines the current model goals and the
  separation between metric improvements and the larger execution-env redesign.
- A March 9 operational check found that the current Hugging Face token is
  valid for Hub auth but belongs to a non-billable personal account
  (`canPay=false`, no orgs), so it is not sufficient to provision paid
  large-model hosting on Hugging Face.
- The current Northflank manual job `replicalab-train` still has runtime env
  values, but `northflank start job run` returns `409 No deployment
  configured`, so the job cannot launch until a runnable image/deployment is
  attached.
- The live Northflank service on the same `nf-gpu-hack-16-64` plan does not
  currently expose `nvidia-smi` or `/dev/nvidia*` inside the container, so GPU
  availability should be treated as unverified until the runtime is fixed and a
  direct hardware probe succeeds.
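
A direct hardware probe along these lines could settle the GPU question; this is a stdlib-only sketch, and the specific signals checked (`nvidia-smi` on PATH, `/dev/nvidia*` nodes) are my choice, not a prescribed procedure:

```python
import glob
import shutil
import subprocess

def probe_gpu():
    """Collect independent GPU-visibility signals from inside a container."""
    info = {
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "dev_nodes": sorted(glob.glob("/dev/nvidia*")),
        "gpu_names": [],
    }
    if info["nvidia_smi_on_path"]:
        try:
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
                capture_output=True, text=True, timeout=10,
            )
            info["gpu_names"] = [
                line.strip() for line in out.stdout.splitlines() if line.strip()
            ]
        except (OSError, subprocess.TimeoutExpired):
            pass  # leave gpu_names empty; the probe itself failed
    return info
```

If all three signals are empty or false, GPU availability should stay marked unverified.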

Current Northflank notebook note:

- The dedicated notebook service now lives in project `notebook-openport` as
  service `jupyter-pytorch`.
- The pasted notebook hostname `app--jupyter-pytorch--h74j66w224jx.code.run`
  is stale; the live public notebook endpoint on 2026-03-09 is
  `app--jupyter-pytorch--9y6g97v7czb9.code.run`.
- The notebook runtime does expose a real `NVIDIA H100 80GB HBM3` GPU.
- `/home/jovyan/replicalab-ai` and `/home/jovyan/replicalab-qwen3.5-grpo`
  already exist in that notebook, with saved adapter checkpoints through
  `checkpoint-200`.
- The saved `grpo_training.log` shows the notebook ran on H100 but did not
  complete cleanly: baseline eval emitted `string indices must be integers, not
  'str'`, and the final inference cell failed in
  `tokenizer.apply_chat_template(...)` with the same content-structure issue.

Current ART/OpenEnv runtime note:

- The active live Scientist RL path is now `art-scientist-train` in
  `replicalab/training/cli.py`.
- Fresh-runtime smoke validation completed on 2026-03-08 for:
  - `scientist-preview-smoke-20260308b`
  - `lab-manager-preview-smoke-20260308b`
  - `art-scientist-smoke-20260308b`
  - `art-scientist-compare-smoke-20260308b`
- The live ART Scientist checkpoint reached `step7`, but the current trained
  checkpoint still underperforms the deterministic baseline on held-out
  comparison.
- The main remaining work is experiment quality iteration, not missing training
  infrastructure.
- Evaluation summaries now track `paper_understanding` and
  `communication_quality`, and the shared benchmark-history plots live under
  `replicalab/outputs/training/history/`.

Current localhost model-runtime note:

- `server/app.py` now exposes `/runtime` and `/agent-step` so the local app can run a backend-selected Scientist policy instead of the frontend stub.
- Anthropic-backed Scientist inference was wired, but the current Anthropic account cannot be used live because the API billing balance is too low.
- Localhost therefore currently runs in `ollama` mode with `glm-5:cloud` as the working model-backed Scientist path.
- The server applies a small deterministic safety adapter to model outputs before env stepping:
  - trims controls to fit sample size
  - aligns equipment and reagent requests to the available inventory
  - clamps duration to the current lab time limit
- If the local model stalls or errors, `/agent-step` falls back to the deterministic baseline Scientist and records that in the step metadata as `scientist_runtime=ollama_fallback`.
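
The safety-adapter steps above can be sketched roughly as follows; every field name used here (`controls`, `equipment`, `reagents`, `duration_hours`, `sample_size`, `inventory`, `time_limit_hours`) is an assumption for illustration, not the actual `server/app.py` schema:

```python
def apply_safety_adapter(action, lab_state):
    """Clamp a model-proposed action to what the lab can actually run."""
    adapted = dict(action)
    # Trim controls to fit the current sample size.
    adapted["controls"] = action.get("controls", [])[:lab_state["sample_size"]]
    # Align equipment and reagent requests to the available inventory.
    inventory = set(lab_state["inventory"])
    adapted["equipment"] = [e for e in action.get("equipment", []) if e in inventory]
    adapted["reagents"] = [r for r in action.get("reagents", []) if r in inventory]
    # Clamp duration to the current lab time limit.
    adapted["duration_hours"] = min(
        action.get("duration_hours", 0), lab_state["time_limit_hours"]
    )
    return adapted
```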

Current March 9 H100 benchmark note:

- The full multi-round `scientist-local-compare-eval` path is live on the
  Northflank H100 notebook, but the current notebook image is missing the fast
  linear-attention path for the saved `unsloth/Qwen3.5-0.8B` adapter, so large
  sharded rollout sweeps could not flush artifacts within a practical
  single-session timescale.
- A fallback live H100 first-step benchmark was run on 2026-03-09 instead:
  `250` shared reset cases with both baseline and trained Scientist first-step
  actions, for `500` total simulations.
- The merged artifact root is
  `replicalab/outputs/training/h100-one-step-500-20260309/`.
- The benchmark spans `34` trainable papers.
- Summary result:
  - baseline average first-step paper understanding: `0.61692084`
  - trained average first-step paper understanding: `0.063866752`
  - baseline average first-step reward: `0.3`
  - trained average first-step reward: `0.05`
  - trained request-info rate: `1.0`
  - invalid-action rate stayed `0.0` for both labels
- Scenario-level understanding:
  - baseline `finance_trading`: `0.596033`
  - trained `finance_trading`: `0.018182`
  - baseline `ml_benchmark`: `0.633333`
  - trained `ml_benchmark`: `0.099762`
- Current interpretation: the saved `replicalab-qwen3.5-grpo` adapter is
  materially worse than the deterministic baseline on first-step paper
  grounding and currently behaves like a universal `request_info` policy under
  a fast decode budget.
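
If the per-label and per-scenario averages above ever need recomputation from the merged per-case records, something like this would do it; the record schema (`label`, `scenario`, `paper_understanding`) is assumed, not the actual artifact layout under `h100-one-step-500-20260309/`:

```python
from collections import defaultdict

def summarize(records):
    """Average paper_understanding per (label, scenario) pair.

    Assumes records like
    {"label": "baseline", "scenario": "ml_benchmark", "paper_understanding": 0.6}.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for r in records:
        key = (r["label"], r["scenario"])
        sums[key][0] += r["paper_understanding"]
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```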