replicalab / docs /ayush /notes.md
maxxie114's picture
Initial HF Spaces deployment
80d8c84

Ayush Notes

Use this file for short-lived working notes, reminders, and handoff details.

Do not use this file for durable deviations from the original plan. Put those in docs/changes.md.

Current local training-data note:

  • A 50-paper experiment-design corpus now exists under data/papers/.
  • Use data/papers/manifest.json for the full scenario-to-paper mapping.
  • Most entries are marked alternative because many scenario titles in ReplicaLab_50_Scenarios_Training_Plan.md are synthetic summaries rather than directly downloadable published paper titles.

Current V2 training architecture note:

  • The reusable training stack now lives under replicalab/training/.
  • notebooks/train_minimal_colab.ipynb is now the explicit sponsor-facing minimal Colab script using Unsloth + HF TRL.
  • notebooks/train_colab.ipynb is the judged notebook driver, but heavy runs are expected to use the replicalab-train entrypoint on Northflank H100.
  • The primary shared base is now Qwen/Qwen3.5-9B with separate Scientist GRPO and Lab Manager SFT adapters.
  • The reduced-scale fallback is Qwen/Qwen3.5-4B.
  • The audit-only judge candidate is Qwen/Qwen3.5-122B-A10B.
  • The deterministic rubric remains the only training reward source even when Anthropic-backed oracle features are enabled for V2 overlays.
  • docs/training_goals.md now defines the current model goals and the separation between metric improvements and the larger execution-env redesign.
  • A March 9 operational check found that the current Hugging Face token is valid for Hub auth but belongs to a non-billable personal account (canPay=false, no orgs), so it is not currently enough to provision paid large-model hosting on Hugging Face.
  • The current Northflank manual job replicalab-train still has runtime env values, but northflank start job run returns 409 No deployment configured, so the job cannot launch until a runnable image/deployment is attached.
  • The live Northflank service on the same nf-gpu-hack-16-64 plan does not currently expose nvidia-smi or /dev/nvidia* inside the container, so GPU availability should be treated as unverified until the runtime is fixed and a direct hardware probe succeeds.

Current Northflank notebook note:

  • The dedicated notebook service now lives in project notebook-openport as service jupyter-pytorch.
  • The pasted notebook hostname app--jupyter-pytorch--h74j66w224jx.code.run is stale; the live public notebook endpoint on 2026-03-09 is app--jupyter-pytorch--9y6g97v7czb9.code.run.
  • The notebook runtime does expose a real NVIDIA H100 80GB HBM3 GPU.
  • /home/jovyan/replicalab-ai and /home/jovyan/replicalab-qwen3.5-grpo already exist in that notebook, with saved adapter checkpoints through checkpoint-200.
  • The saved grpo_training.log shows the notebook ran on H100 but did not complete cleanly: baseline eval emitted string indices must be integers, not 'str', and the final inference cell failed in tokenizer.apply_chat_template(...) with the same content-structure issue.

Current ART/OpenEnv runtime note:

  • The active live Scientist RL path is now art-scientist-train in replicalab/training/cli.py.
  • Fresh-runtime smoke validation completed on 2026-03-08 for:
    • scientist-preview-smoke-20260308b
    • lab-manager-preview-smoke-20260308b
    • art-scientist-smoke-20260308b
    • art-scientist-compare-smoke-20260308b
  • The live ART Scientist checkpoint reached step7, but the current trained checkpoint still underperforms the deterministic baseline on held-out comparison.
  • The main remaining work is experiment quality iteration, not missing training infrastructure.
  • Evaluation summaries now track paper_understanding and communication_quality, and the shared benchmark-history plots live under replicalab/outputs/training/history/.

Current localhost model-runtime note:

  • server/app.py now exposes /runtime and /agent-step so the local app can run a backend-selected Scientist policy instead of the frontend stub.
  • Anthropic-backed Scientist inference was wired, but the current Anthropic account cannot be used live because the API billing balance is too low.
  • Localhost therefore currently runs in ollama mode with glm-5:cloud as the working model-backed Scientist path.
  • The server applies a small deterministic safety adapter to model outputs before env stepping:
    • trims controls to fit sample size
    • aligns equipment and reagent requests to the available inventory
    • clamps duration to the current lab time limit
  • If the local model stalls or errors, /agent-step falls back to the deterministic baseline Scientist and records that in the step metadata as scientist_runtime=ollama_fallback.

Current March 9 H100 benchmark note:

  • The full multi-round scientist-local-compare-eval path is live on the Northflank H100 notebook, but the current notebook image is missing the fast linear-attention path for the saved unsloth/Qwen3.5-0.8B adapter, so large sharded rollout sweeps did not flush artifacts on a practical same-turn timescale.
  • A fallback live H100 first-step benchmark was run on 2026-03-09 instead: 250 shared reset cases with both baseline and trained Scientist first-step actions, for 500 total simulations.
  • The merged artifact root is replicalab/outputs/training/h100-one-step-500-20260309/.
  • The benchmark spans 34 trainable papers.
  • Summary result:
    • baseline average first-step paper understanding: 0.61692084
    • trained average first-step paper understanding: 0.063866752
    • baseline average first-step reward: 0.3
    • trained average first-step reward: 0.05
    • trained request-info rate: 1.0
    • invalid-action rate stayed 0.0 for both labels
  • Scenario-level understanding:
    • baseline finance_trading: 0.596033
    • trained finance_trading: 0.018182
    • baseline ml_benchmark: 0.633333
    • trained ml_benchmark: 0.099762
  • Current interpretation: the saved replicalab-qwen3.5-grpo adapter is materially worse than the deterministic baseline on first-step paper grounding and currently behaves like a universal request_info policy under a fast decode budget.