Spaces:

openenv-community
/

replicalab

Running

App Files Files Community

replicalab / docs /ayush /notes.md

maxxie114

Initial HF Spaces deployment

80d8c84 2 days ago

preview code

raw

history blame contribute delete

6.24 kB

Ayush Notes

Use this file for short-lived working notes, reminders, and handoff details.

Do not use this file for durable deviations from the original plan. Put those in docs/changes.md.

Current local training-data note:

A 50-paper experiment-design corpus now exists under data/papers/.
Use data/papers/manifest.json for the full scenario-to-paper mapping.
Most entries are marked alternative because many scenario titles in ReplicaLab_50_Scenarios_Training_Plan.md are synthetic summaries rather than directly downloadable published paper titles.

Current V2 training architecture note:

The reusable training stack now lives under replicalab/training/.
notebooks/train_minimal_colab.ipynb is now the explicit sponsor-facing minimal Colab script using Unsloth + HF TRL.
notebooks/train_colab.ipynb is the judged notebook driver, but heavy runs are expected to use the replicalab-train entrypoint on Northflank H100.
The primary shared base is now Qwen/Qwen3.5-9B with separate Scientist GRPO and Lab Manager SFT adapters.
The reduced-scale fallback is Qwen/Qwen3.5-4B.
The audit-only judge candidate is Qwen/Qwen3.5-122B-A10B.
The deterministic rubric remains the only training reward source even when Anthropic-backed oracle features are enabled for V2 overlays.
docs/training_goals.md now defines the current model goals and the separation between metric improvements and the larger execution-env redesign.
A March 9 operational check found that the current Hugging Face token is valid for Hub auth but belongs to a non-billable personal account (canPay=false, no orgs), so it is not currently enough to provision paid large-model hosting on Hugging Face.
The current Northflank manual job replicalab-train still has runtime env values, but northflank start job run returns 409 No deployment configured, so the job cannot launch until a runnable image/deployment is attached.
The live Northflank service on the same nf-gpu-hack-16-64 plan does not currently expose nvidia-smi or /dev/nvidia* inside the container, so GPU availability should be treated as unverified until the runtime is fixed and a direct hardware probe succeeds.

Current Northflank notebook note:

The dedicated notebook service now lives in project notebook-openport as service jupyter-pytorch.
The pasted notebook hostname app--jupyter-pytorch--h74j66w224jx.code.run is stale; the live public notebook endpoint on 2026-03-09 is app--jupyter-pytorch--9y6g97v7czb9.code.run.
The notebook runtime does expose a real NVIDIA H100 80GB HBM3 GPU.
/home/jovyan/replicalab-ai and /home/jovyan/replicalab-qwen3.5-grpo already exist in that notebook, with saved adapter checkpoints through checkpoint-200.
The saved grpo_training.log shows the notebook ran on H100 but did not complete cleanly: baseline eval emitted string indices must be integers, not 'str', and the final inference cell failed in tokenizer.apply_chat_template(...) with the same content-structure issue.

Current ART/OpenEnv runtime note:

The active live Scientist RL path is now art-scientist-train in replicalab/training/cli.py.
Fresh-runtime smoke validation completed on 2026-03-08 for:
- scientist-preview-smoke-20260308b
- lab-manager-preview-smoke-20260308b
- art-scientist-smoke-20260308b
- art-scientist-compare-smoke-20260308b
The live ART Scientist checkpoint reached step7, but the current trained checkpoint still underperforms the deterministic baseline on held-out comparison.
The main remaining work is experiment quality iteration, not missing training infrastructure.
Evaluation summaries now track paper_understanding and communication_quality, and the shared benchmark-history plots live under replicalab/outputs/training/history/.

Current localhost model-runtime note:

server/app.py now exposes /runtime and /agent-step so the local app can run a backend-selected Scientist policy instead of the frontend stub.
Anthropic-backed Scientist inference was wired, but the current Anthropic account cannot be used live because the API billing balance is too low.
Localhost therefore currently runs in ollama mode with glm-5:cloud as the working model-backed Scientist path.
The server applies a small deterministic safety adapter to model outputs before env stepping:
- trims controls to fit sample size
- aligns equipment and reagent requests to the available inventory
- clamps duration to the current lab time limit
If the local model stalls or errors, /agent-step falls back to the deterministic baseline Scientist and records that in the step metadata as scientist_runtime=ollama_fallback.

Current March 9 H100 benchmark note:

The full multi-round scientist-local-compare-eval path is live on the Northflank H100 notebook, but the current notebook image is missing the fast linear-attention path for the saved unsloth/Qwen3.5-0.8B adapter, so large sharded rollout sweeps did not flush artifacts on a practical same-turn timescale.
A fallback live H100 first-step benchmark was run on 2026-03-09 instead: 250 shared reset cases with both baseline and trained Scientist first-step actions, for 500 total simulations.
The merged artifact root is replicalab/outputs/training/h100-one-step-500-20260309/.
The benchmark spans 34 trainable papers.
Summary result:
- baseline average first-step paper understanding: 0.61692084
- trained average first-step paper understanding: 0.063866752
- baseline average first-step reward: 0.3
- trained average first-step reward: 0.05
- trained request-info rate: 1.0
- invalid-action rate stayed 0.0 for both labels
Scenario-level understanding:
- baseline finance_trading: 0.596033
- trained finance_trading: 0.018182
- baseline ml_benchmark: 0.633333
- trained ml_benchmark: 0.099762
Current interpretation: the saved replicalab-qwen3.5-grpo adapter is materially worse than the deterministic baseline on first-step paper grounding and currently behaves like a universal request_info policy under a fast decode budget.