Ayush Notes
Use this file for short-lived working notes, reminders, and handoff details.
Do not use this file for durable deviations from the original plan. Put those in docs/changes.md.
Current local training-data note:
- A 50-paper experiment-design corpus now exists under `data/papers/`.
- Use `data/papers/manifest.json` for the full scenario-to-paper mapping.
- Most entries are marked `alternative` because many scenario titles in `ReplicaLab_50_Scenarios_Training_Plan.md` are synthetic summaries rather than directly downloadable published paper titles.
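A consumer of the manifest can build the scenario-to-paper map in a few lines. This is a sketch only: the record fields (`scenario`, `paper_title`, `match`) and the sample entries are assumptions, not the real `manifest.json` schema.

```python
import json

# Hypothetical manifest shape: a list of {scenario, paper_title, match} records.
SAMPLE = json.loads("""[
  {"scenario": "finance_trading", "paper_title": "Some Trading Paper", "match": "alternative"},
  {"scenario": "ml_benchmark", "paper_title": "Some Benchmark Paper", "match": "exact"}
]""")

def scenario_to_paper(entries):
    """Build the scenario -> paper mapping and count 'alternative' matches."""
    mapping = {e["scenario"]: e["paper_title"] for e in entries}
    n_alt = sum(1 for e in entries if e["match"] == "alternative")
    return mapping, n_alt

mapping, n_alt = scenario_to_paper(SAMPLE)
print(mapping["finance_trading"], n_alt)  # → Some Trading Paper 1
```

For the real file, replace `SAMPLE` with `json.load(open("data/papers/manifest.json"))` once the actual schema is confirmed.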
Current V2 training architecture note:
- The reusable training stack now lives under `replicalab/training/`.
- `notebooks/train_minimal_colab.ipynb` is now the explicit sponsor-facing minimal Colab script using Unsloth + HF TRL.
- `notebooks/train_colab.ipynb` is the judged notebook driver, but heavy runs are expected to use the `replicalab-train` entrypoint on Northflank H100.
- The primary shared base is now `Qwen/Qwen3.5-9B` with separate Scientist GRPO and Lab Manager SFT adapters.
- The reduced-scale fallback is `Qwen/Qwen3.5-4B`.
- The audit-only judge candidate is `Qwen/Qwen3.5-122B-A10B`.
- The deterministic rubric remains the only training reward source even when Anthropic-backed oracle features are enabled for V2 overlays.
- `docs/training_goals.md` now defines the current model goals and the separation between metric improvements and the larger execution-env redesign.
- A March 9 operational check found that the current Hugging Face token is valid for Hub auth but belongs to a non-billable personal account (`canPay=false`, no orgs), so it is not currently sufficient to provision paid large-model hosting on Hugging Face.
- The current Northflank manual job `replicalab-train` still has runtime env values, but `northflank start job run` returns `409 No deployment configured`, so the job cannot launch until a runnable image/deployment is attached.
- The live Northflank service on the same `nf-gpu-hack-16-64` plan does not currently expose `nvidia-smi` or `/dev/nvidia*` inside the container, so GPU availability should be treated as unverified until the runtime is fixed and a direct hardware probe succeeds.
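A direct hardware probe for the fixed runtime can be a minimal sketch like the following. It assumes nothing about the project; it only reports what the container actually exposes (`nvidia-smi -L` lists visible GPUs, one per line).

```python
import os
import shutil
import subprocess

def probe_gpu():
    """Report exactly what the container exposes, without assuming a GPU."""
    report = {
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "dev_nodes": sorted(
            "/dev/" + d for d in os.listdir("/dev") if d.startswith("nvidia")
        ),
    }
    if report["nvidia_smi_on_path"]:
        # `nvidia-smi -L` lists visible GPUs, one per line.
        out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
        report["gpus"] = out.stdout.strip().splitlines()
    return report

print(probe_gpu())
```

Until both `nvidia_smi_on_path` is true and `dev_nodes` is non-empty, the service should keep being treated as CPU-only.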
Current Northflank notebook note:
- The dedicated notebook service now lives in project `notebook-openport` as service `jupyter-pytorch`.
- The pasted notebook hostname `app--jupyter-pytorch--h74j66w224jx.code.run` is stale; the live public notebook endpoint on 2026-03-09 is `app--jupyter-pytorch--9y6g97v7czb9.code.run`.
- The notebook runtime does expose a real `NVIDIA H100 80GB HBM3` GPU.
- `/home/jovyan/replicalab-ai` and `/home/jovyan/replicalab-qwen3.5-grpo` already exist in that notebook, with saved adapter checkpoints through `checkpoint-200`.
- The saved `grpo_training.log` shows the notebook ran on H100 but did not complete cleanly: baseline eval emitted `string indices must be integers, not 'str'`, and the final inference cell failed in `tokenizer.apply_chat_template(...)` with the same content-structure issue.
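That error pattern usually means code indexed message `content` as a string while the payload carried structured parts (or vice versa). The exact failing payload is not in the log excerpt, so this normalizer is an assumed fix for the common case, not a confirmed one:

```python
def normalize_content(messages):
    """Flatten structured message content into plain strings.

    Assumed failure mode: `content` arrives as a list of
    {"type": "text", "text": ...} parts but downstream code
    (eval or chat templating) indexes it as a plain string.
    """
    fixed = []
    for m in messages:
        content = m["content"]
        if isinstance(content, list):
            content = "".join(
                part.get("text", "") for part in content if isinstance(part, dict)
            )
        fixed.append({"role": m["role"], "content": content})
    return fixed

msgs = [{"role": "user", "content": [{"type": "text", "text": "Summarize the paper."}]}]
print(normalize_content(msgs))
# → [{'role': 'user', 'content': 'Summarize the paper.'}]
```

Running messages through a normalizer like this before `tokenizer.apply_chat_template(...)` and before the baseline eval would rule the content shape in or out as the root cause.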
Current ART/OpenEnv runtime note:
- The active live Scientist RL path is now `art-scientist-train` in `replicalab/training/cli.py`.
- Fresh-runtime smoke validation completed on 2026-03-08 for: `scientist-preview-smoke-20260308b`, `lab-manager-preview-smoke-20260308b`, `art-scientist-smoke-20260308b`, and `art-scientist-compare-smoke-20260308b`.
- The live ART Scientist checkpoint reached step `7`, but the current trained checkpoint still underperforms the deterministic baseline on held-out comparison.
- The main remaining work is experiment quality iteration, not missing training infrastructure.
- Evaluation summaries now track `paper_understanding` and `communication_quality`, and the shared benchmark-history plots live under `replicalab/outputs/training/history/`.
Current localhost model-runtime note:
- `server/app.py` now exposes `/runtime` and `/agent-step` so the local app can run a backend-selected Scientist policy instead of the frontend stub.
- Anthropic-backed Scientist inference was wired, but the current Anthropic account cannot be used live because the API billing balance is too low.
- Localhost therefore currently runs in `ollama` mode with `glm-5:cloud` as the working model-backed Scientist path.
- The server applies a small deterministic safety adapter to model outputs before env stepping:
  - trims controls to fit sample size
  - aligns equipment and reagent requests to the available inventory
  - clamps duration to the current lab time limit
- If the local model stalls or errors, `/agent-step` falls back to the deterministic baseline Scientist and records that in the step metadata as `scientist_runtime=ollama_fallback`.
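The three clamping steps above can be sketched as one pure function. Field names (`controls`, `equipment`, `reagents`, `duration`) and the env-state shape are assumptions for illustration, not the real action schema in `server/app.py`:

```python
def apply_safety_adapter(action, env_state):
    """Hypothetical clamp of a model-proposed action to env constraints."""
    safe = dict(action)
    # Trim controls to fit the sample size.
    safe["controls"] = safe.get("controls", [])[: env_state["sample_size"]]
    # Keep only equipment/reagent requests present in the inventory.
    inv = set(env_state["inventory"])
    safe["equipment"] = [e for e in safe.get("equipment", []) if e in inv]
    safe["reagents"] = [r for r in safe.get("reagents", []) if r in inv]
    # Clamp the requested duration to the lab time limit.
    safe["duration"] = min(safe.get("duration", 0), env_state["time_limit"])
    return safe

state = {"sample_size": 2, "inventory": ["pcr", "buffer"], "time_limit": 8}
action = {"controls": ["a", "b", "c"], "equipment": ["pcr", "laser"],
          "reagents": ["buffer"], "duration": 12}
print(apply_safety_adapter(action, state))
```

Keeping the adapter deterministic and side-effect-free means the same model output always maps to the same env action, which keeps fallback comparisons fair.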
Current March 9 H100 benchmark note:
- The full multi-round `scientist-local-compare-eval` path is live on the Northflank H100 notebook, but the current notebook image is missing the fast linear-attention path for the saved `unsloth/Qwen3.5-0.8B` adapter, so large sharded rollout sweeps did not flush artifacts on a practical same-turn timescale.
- A fallback live H100 first-step benchmark was run on 2026-03-09 instead: `250` shared reset cases with both baseline and trained Scientist first-step actions, for `500` total simulations.
- The merged artifact root is `replicalab/outputs/training/h100-one-step-500-20260309/`.
- The benchmark spans `34` trainable papers.
- Summary result:
  - baseline average first-step paper understanding: `0.61692084`
  - trained average first-step paper understanding: `0.063866752`
  - baseline average first-step reward: `0.3`
  - trained average first-step reward: `0.05`
  - trained request-info rate: `1.0`
  - invalid-action rate stayed `0.0` for both labels
- Scenario-level understanding:
  - baseline `finance_trading`: `0.596033`
  - trained `finance_trading`: `0.018182`
  - baseline `ml_benchmark`: `0.633333`
  - trained `ml_benchmark`: `0.099762`
- Current interpretation: the saved `replicalab-qwen3.5-grpo` adapter is materially worse than the deterministic baseline on first-step paper grounding and currently behaves like a universal `request_info` policy under a fast decode budget.
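Re-deriving the per-label averages from the rollout records is a few lines. The record shape below is a stand-in with made-up values (the real per-rollout schema under the artifact root is not shown in these notes):

```python
from statistics import mean

# Stand-in records: field names and values are assumptions, not the
# real artifacts under replicalab/outputs/training/h100-one-step-500-20260309/.
rows = [
    {"label": "baseline", "paper_understanding": 0.75, "reward": 0.3},
    {"label": "baseline", "paper_understanding": 0.25, "reward": 0.3},
    {"label": "trained",  "paper_understanding": 0.10, "reward": 0.05},
    {"label": "trained",  "paper_understanding": 0.00, "reward": 0.05},
]

def summarize(rows, label, field):
    """Average `field` over all rollouts carrying the given label."""
    return mean(r[field] for r in rows if r["label"] == label)

print(summarize(rows, "baseline", "paper_understanding"))  # → 0.5
```

The same `summarize` call with a scenario filter added would reproduce the scenario-level breakdown above from the raw records.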