ayushozha committed on
Commit a29a83d · 1 Parent(s): cd197ac

Add local H100 scientist eval tooling
docs/ayush/notes.md CHANGED
@@ -84,3 +84,33 @@ Current localhost model-runtime note:
   - clamps duration to the current lab time limit
 - If the local model stalls or errors, `/agent-step` falls back to the deterministic baseline Scientist and records that in the step metadata as `scientist_runtime=ollama_fallback`.
 
+Current March 9 H100 benchmark note:
+
+- The full multi-round `scientist-local-compare-eval` path is live on the
+  Northflank H100 notebook, but the current notebook image is missing the fast
+  linear-attention path for the saved `unsloth/Qwen3.5-0.8B` adapter, so large
+  sharded rollout sweeps did not flush artifacts on a practical same-turn
+  timescale.
+- A fallback live H100 first-step benchmark was run on 2026-03-09 instead:
+  `250` shared reset cases with both baseline and trained Scientist first-step
+  actions, for `500` total simulations.
+- The merged artifact root is
+  `replicalab/outputs/training/h100-one-step-500-20260309/`.
+- The benchmark spans `34` trainable papers.
+- Summary result:
+  - baseline average first-step paper understanding: `0.61692084`
+  - trained average first-step paper understanding: `0.063866752`
+  - baseline average first-step reward: `0.3`
+  - trained average first-step reward: `0.05`
+  - trained request-info rate: `1.0`
+  - invalid-action rate stayed `0.0` for both labels
+- Scenario-level understanding:
+  - baseline `finance_trading`: `0.596033`
+  - trained `finance_trading`: `0.018182`
+  - baseline `ml_benchmark`: `0.633333`
+  - trained `ml_benchmark`: `0.099762`
+- Current interpretation: the saved `replicalab-qwen3.5-grpo` adapter is
+  materially worse than the deterministic baseline on first-step paper
+  grounding and currently behaves like a universal `request_info` policy under
+  a fast decode budget.
+
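For a quick sanity check on the regression reported in the note, the baseline-minus-trained gaps can be recomputed from the published summary numbers. The snippet below is illustrative only (values copied verbatim from the note, not read from the artifact root):

```python
# Recompute the first-step gaps from the benchmark note above.
# Numbers are copied verbatim from the summary result.
baseline_understanding = 0.61692084
trained_understanding = 0.063866752
baseline_reward = 0.3
trained_reward = 0.05

understanding_gap = baseline_understanding - trained_understanding
reward_gap = baseline_reward - trained_reward

print(f"understanding gap: {understanding_gap:.6f}")  # understanding gap: 0.553054
print(f"reward gap: {reward_gap:.2f}")                # reward gap: 0.25
```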
docs/changes.md CHANGED
@@ -94,4 +94,5 @@ Rules:
 | 2026-03-09 | Person B (Ayush) | Post-MVP training refinement | Shifted the active training iteration from the older `Qwen3-8B` assumption to `Qwen3.5-9B`, added prompt-goal expansion plus paper-understanding and communication metrics, and started persisting cross-run benchmark history plots | Model quality is now the bottleneck, so the next useful work is better training coverage and evaluation signal rather than more plumbing; the user also requested a clearer separation between immediate metric work and a later execution-environment redesign | Scientist and Lab Manager defaults now target `Qwen/Qwen3.5-9B`, eval outputs now track `paper_understanding` and `communication_quality`, shared benchmark history now accumulates under `replicalab/outputs/training/history/`, and `docs/training_goals.md` records the larger execution-env phase as a separate architecture track | Keep the deterministic judge as the reward source; treat any large-model judge such as `Qwen3.5-122B-A10B` as audit-only until an explicit architecture change is approved |
 | 2026-03-09 | Person B (Ayush) | Deployment reality check for HF + Northflank | Recorded the current hosted-model and training-launch blockers after verifying the live tokens and remote resources instead of assuming the documented path was still operational | The project docs described HF-heavy hosting and Northflank H100 training as available paths, but the current HF account is not billable and the current Northflank training job is not runnable yet | Verified via live checks that the HF token authenticates but the account reports `canPay=false` with no orgs, that `replicalab-train` returns `409 No deployment configured` when started, and that the live `replicalab-ai` container on `nf-gpu-hack-16-64` does not expose `nvidia-smi` or `/dev/nvidia*` | Before promising heavy-model hosting or H100 training, attach a runnable image to the job, re-probe GPU visibility from inside the runtime, and enable a billing-backed HF account or move serving to another provider |
 | 2026-03-09 | Person B (Ayush) | Northflank notebook validation | Validated the separate Northflank notebook service after the original pasted notebook hostname turned out to be stale | The repo previously had an unrunnable training job but the team also had a live Jupyter route; without checking the actual service, it was unclear whether H100 access existed, whether the notebook credentials worked, and whether the saved training state was usable | Verified the live `notebook-openport/jupyter-pytorch` service, confirmed successful Jupyter login, confirmed in-container `NVIDIA H100 80GB HBM3`, identified the live notebook DNS `app--jupyter-pytorch--9y6g97v7czb9.code.run`, and inspected the saved GRPO outputs/logs showing checkpoints through step 200 followed by a chat-template/content-format failure | Use the notebook as the current heavy-run path only after reconciling its repo state with the main workspace and fixing the `apply_chat_template` message-format bug |
+| 2026-03-09 | Person B (Ayush) | H100 paper-understanding benchmark | Shifted the active H100 benchmark from a planned full multi-round rollout sweep to a first-step live environment benchmark on the same notebook | The current notebook image lacks the fast linear-attention path for the saved `unsloth/Qwen3.5-0.8B` adapter, so repeated sharded `scientist-local-compare-eval` attempts stayed active for a long time without producing same-turn artifacts even after retry and token-budget cuts | Produced a merged live H100 benchmark artifact set at `replicalab/outputs/training/h100-one-step-500-20260309/` covering `500` total simulations (`250` shared reset cases × baseline/trained first-step actions); the current saved adapter underperformed badly versus the deterministic baseline on first-step paper understanding and collapsed to `request_info` on every trained sample | If a full multi-round benchmark is still required later, first fix the notebook image to restore the fast attention path or move the eval to a more efficient runtime |
 
docs/completion.md CHANGED
@@ -25,6 +25,15 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
 | Remaining | 0 |
 | **Completion rate** | **100.00%** |
 
+Post-MVP benchmark note:
+
+- On 2026-03-09, a live Northflank H100 first-step benchmark was added as an
+  operational post-MVP artifact under
+  `replicalab/outputs/training/h100-one-step-500-20260309/`.
+- It covers `500` total simulations (`250` shared reset cases × baseline and
+  trained first-step actions) and records paper-understanding regression data
+  for the current saved Scientist adapter.
+
 ### Completion by Person
 
 | Person | Assigned | Completed (own) | Completed (by others) | Remaining | Rate |
replicalab/training/__init__.py CHANGED
@@ -25,6 +25,11 @@ from replicalab.training.lab_manager_sft import (
     preview_lab_manager_training,
     train_lab_manager_sft,
 )
+from replicalab.training.local_eval import (
+    PaperBalancedEvaluationCase,
+    build_local_scientist_policy,
+    build_trainable_paper_cases,
+)
 from replicalab.training.metrics import EvaluationSummary, summarize_episodes
 from replicalab.training.rollout import EpisodeRecord, RolloutWorker, StepRecord
 from replicalab.training.scientist_grpo import (
@@ -46,6 +51,7 @@ __all__ = [
     "FrozenEvidencePack",
     "LabManagerSFTConfig",
     "LabManagerSFTExample",
+    "PaperBalancedEvaluationCase",
     "RolloutWorker",
     "ScientistGRPOConfig",
     "ScientistPromptExample",
@@ -55,6 +61,7 @@ __all__ = [
     "build_lab_manager_sft_examples",
     "build_scientist_prompt_examples",
     "evaluate_policy",
+    "build_local_scientist_policy",
     "load_frozen_evidence_packs",
     "preview_lab_manager_training",
     "preview_scientist_training",
@@ -62,4 +69,5 @@ __all__ = [
     "summarize_episodes",
     "train_lab_manager_sft",
     "train_scientist_grpo",
+    "build_trainable_paper_cases",
 ]
replicalab/training/cli.py CHANGED
@@ -30,6 +30,10 @@ from replicalab.training.history import (
     build_benchmark_history_row,
     load_benchmark_history,
 )
+from replicalab.training.local_eval import (
+    build_local_scientist_policy,
+    build_trainable_paper_cases,
+)
 from replicalab.training.lab_manager_sft import (
     LabManagerSFTConfig,
     preview_lab_manager_training,
@@ -67,6 +71,8 @@ def main(argv: Sequence[str] | None = None) -> int:
         return _run_baseline_eval(args)
     if args.command == "scientist-compare-eval":
         return _run_scientist_compare_eval(args)
+    if args.command == "scientist-local-compare-eval":
+        return _run_scientist_local_compare_eval(args)
     if args.command == "art-scientist-train":
         return _run_art_scientist_train(args)
 
@@ -261,6 +267,69 @@ def _build_parser() -> argparse.ArgumentParser:
         help="Sampling temperature for the trained remote Scientist.",
     )
 
+    local_compare_eval = subparsers.add_parser(
+        "scientist-local-compare-eval",
+        help="Compare baseline Scientist versus a local trained LoRA adapter.",
+    )
+    _add_common_artifact_args(local_compare_eval, prefix="eval-local-compare")
+    local_compare_eval.add_argument(
+        "--base-url",
+        default="https://ayushozha-replicalab.hf.space",
+        help="ReplicaLab environment base URL.",
+    )
+    local_compare_eval.add_argument(
+        "--transport",
+        default="rest",
+        choices=("rest", "ws"),
+        help="Transport used by ReplicaLabClient.",
+    )
+    local_compare_eval.add_argument(
+        "--adapter-dir",
+        required=True,
+        help="Path to the trained local Scientist adapter directory.",
+    )
+    local_compare_eval.add_argument(
+        "--base-model",
+        default="Qwen/Qwen3.5-9B",
+        help="Base model used by the local adapter.",
+    )
+    local_compare_eval.add_argument(
+        "--case-count",
+        type=int,
+        default=500,
+        help="Number of live rollout simulations to run.",
+    )
+    local_compare_eval.add_argument(
+        "--case-offset",
+        type=int,
+        default=0,
+        help="Starting case index for sharded live rollout runs.",
+    )
+    local_compare_eval.add_argument(
+        "--difficulties",
+        nargs="+",
+        default=["easy", "medium", "hard"],
+        help="Difficulty levels to cycle through when building the live rollout set.",
+    )
+    local_compare_eval.add_argument(
+        "--max-completion-tokens",
+        type=int,
+        default=450,
+        help="Max completion tokens for the local trained Scientist.",
+    )
+    local_compare_eval.add_argument(
+        "--temperature",
+        type=float,
+        default=0.0,
+        help="Sampling temperature for the local trained Scientist.",
+    )
+    local_compare_eval.add_argument(
+        "--max-retries",
+        type=int,
+        default=2,
+        help="Maximum parse-retry attempts for the local trained Scientist.",
+    )
+
     art_train = subparsers.add_parser(
         "art-scientist-train",
         help="Run ART serverless RL training against the ReplicaLab OpenEnv deployment.",
@@ -683,6 +752,99 @@ def _run_scientist_compare_eval(args: argparse.Namespace) -> int:
     return 0
 
 
+def _run_scientist_local_compare_eval(args: argparse.Namespace) -> int:
+    layout = _build_layout(
+        prefix="eval-local-compare",
+        persist_root=args.persist_root,
+        run_name=args.run_name,
+    )
+    case_specs = build_trainable_paper_cases(
+        args.case_count,
+        case_index_offset=args.case_offset,
+        difficulties=args.difficulties,
+    )
+    cases = [spec.to_evaluation_case() for spec in case_specs]
+    trained_policy = build_local_scientist_policy(
+        base_model=args.base_model,
+        adapter_dir=args.adapter_dir,
+        max_completion_tokens=args.max_completion_tokens,
+        temperature=args.temperature,
+        max_retries=args.max_retries,
+    )
+    records_by_label, rows = compare_policies(
+        base_url=args.base_url,
+        policies=[
+            ("baseline", build_baseline_scientist_action),
+            ("trained", trained_policy),
+        ],
+        cases=cases,
+        transport=args.transport,
+    )
+    write_json(
+        layout.manifests_dir / "evaluation_cases.json",
+        [spec.model_dump(mode="json") for spec in case_specs],
+    )
+    _write_run_metadata(
+        layout,
+        {
+            "kind": "scientist_local_compare_eval",
+            "base_url": args.base_url,
+            "transport": args.transport,
+            "adapter_dir": args.adapter_dir,
+            "base_model": args.base_model,
+            "case_count": args.case_count,
+            "case_offset": args.case_offset,
+            "difficulties": args.difficulties,
+            "max_retries": args.max_retries,
+            "bounded_tool_policy": [
+                "search_evidence",
+                "run_code_check",
+                "inspect_image",
+            ],
+        },
+    )
+    for label, records in records_by_label.items():
+        for spec, record in zip(case_specs, records):
+            append_jsonl(
+                layout.metrics_jsonl,
+                {
+                    "label": label,
+                    "case_index": spec.case_index,
+                    "expected_evidence_id": spec.expected_evidence_id,
+                    "expected_paper_title": spec.expected_paper_title,
+                    **episode_to_metrics(record).model_dump(mode="json"),
+                },
+            )
+    rows_payload = [row.model_dump(mode="json") for row in rows]
+    unique_papers = len({spec.expected_evidence_id for spec in case_specs})
+    write_json(
+        layout.summary_json,
+        {
+            "rows": rows_payload,
+            "case_count": len(case_specs),
+            "unique_expected_papers": unique_papers,
+        },
+    )
+    _plot_comparison_summary(rows_payload, layout=layout)
+    _append_history_and_plots(
+        layout=layout,
+        kind="scientist_local_compare_eval",
+        rows=rows_payload,
+    )
+    print(
+        json.dumps(
+            {
+                "rows": rows_payload,
+                "case_count": len(case_specs),
+                "unique_expected_papers": unique_papers,
+            },
+            indent=2,
+            sort_keys=True,
+        )
+    )
+    return 0
+
+
 def _run_art_scientist_train(args: argparse.Namespace) -> int:
     layout = _build_layout(
         prefix="art-scientist",
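The `scientist-local-compare-eval` subcommand above follows a plain argparse subparser pattern. A minimal standalone sketch of that pattern (a hypothetical parser carrying only a subset of the real flags, not the repo's CLI) shows how the defaults resolve when only the required `--adapter-dir` is passed:

```python
import argparse

# Illustrative subparser sketch; flag names mirror the diff above,
# but this parser is a stand-in, not the repo's _build_parser().
parser = argparse.ArgumentParser(prog="replicalab-train")
subparsers = parser.add_subparsers(dest="command")

local_compare_eval = subparsers.add_parser("scientist-local-compare-eval")
local_compare_eval.add_argument("--adapter-dir", required=True)
local_compare_eval.add_argument("--case-count", type=int, default=500)
local_compare_eval.add_argument("--case-offset", type=int, default=0)
local_compare_eval.add_argument(
    "--difficulties", nargs="+", default=["easy", "medium", "hard"]
)
local_compare_eval.add_argument("--temperature", type=float, default=0.0)

args = parser.parse_args(
    ["scientist-local-compare-eval", "--adapter-dir", "outputs/adapter"]
)
print(args.case_count, args.case_offset, args.difficulties)
# 500 0 ['easy', 'medium', 'hard']
```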
replicalab/training/local_eval.py ADDED
@@ -0,0 +1,210 @@
+"""Helpers for local-adapter evaluation against live ReplicaLab rollouts."""
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Callable, Sequence
+
+from pydantic import BaseModel, ConfigDict
+
+from replicalab.agents.scientist_policy import (
+    ScientistOutputParseError,
+    _build_live_scientist_system_prompt,
+    call_scientist_with_retry,
+    format_scientist_observation,
+)
+from replicalab.models import ScientistAction, ScientistActionType, ScientistObservation
+from replicalab.training.corpus import load_frozen_evidence_packs, select_evidence_pack
+from replicalab.training.evaluation import EvaluationCase
+from replicalab.training.runtime import require_module
+
+
+class PaperBalancedEvaluationCase(BaseModel):
+    """One deterministic rollout case with expected paper metadata."""
+
+    model_config = ConfigDict(extra="forbid")
+
+    case_index: int
+    seed: int
+    scenario: str
+    difficulty: str
+    expected_evidence_id: str
+    expected_paper_title: str
+
+    def to_evaluation_case(self) -> EvaluationCase:
+        return EvaluationCase(
+            seed=self.seed,
+            scenario=self.scenario,
+            difficulty=self.difficulty,
+        )
+
+
+def build_trainable_paper_cases(
+    total_cases: int,
+    *,
+    case_index_offset: int = 0,
+    difficulties: Sequence[str] = ("easy", "medium", "hard"),
+) -> list[PaperBalancedEvaluationCase]:
+    """Build a deterministic live-eval set balanced across trainable papers."""
+
+    if total_cases < 1:
+        raise ValueError("total_cases must be at least 1")
+    if case_index_offset < 0:
+        raise ValueError("case_index_offset must be at least 0")
+    if not difficulties:
+        raise ValueError("difficulties must not be empty")
+
+    packs = [pack for pack in load_frozen_evidence_packs() if pack.trainable_in_env]
+    if not packs:
+        raise ValueError("No trainable evidence packs are wired into the current env.")
+
+    by_template: dict[str, list[object]] = {}
+    for pack in packs:
+        assert pack.template is not None
+        by_template.setdefault(pack.template, []).append(pack)
+    for template in by_template:
+        by_template[template] = sorted(
+            by_template[template],
+            key=lambda pack: pack.scenario_number,  # type: ignore[attr-defined]
+        )
+
+    targets: list[tuple[str, int, object]] = []
+    for template in sorted(by_template):
+        for pack_index, pack in enumerate(by_template[template]):
+            targets.append((template, pack_index, pack))
+
+    cases: list[PaperBalancedEvaluationCase] = []
+    for local_index in range(total_cases):
+        case_index = case_index_offset + local_index
+        template, pack_index, pack = targets[case_index % len(targets)]
+        cycle_index = case_index // len(targets)
+        template_pack_count = len(by_template[template])
+        seed = pack_index + cycle_index * template_pack_count
+        difficulty = difficulties[(case_index + pack_index) % len(difficulties)]
+        cases.append(
+            PaperBalancedEvaluationCase(
+                case_index=case_index,
+                seed=seed,
+                scenario=template,
+                difficulty=difficulty,
+                expected_evidence_id=pack.evidence_id,  # type: ignore[attr-defined]
+                expected_paper_title=pack.downloaded_paper_title,  # type: ignore[attr-defined]
+            )
+        )
+
+    return cases
+
+
+def build_local_scientist_policy(
+    *,
+    base_model: str,
+    adapter_dir: str | Path,
+    max_completion_tokens: int = 450,
+    temperature: float = 0.0,
+    max_retries: int = 2,
+) -> Callable[[ScientistObservation], ScientistAction]:
+    """Create a sync Scientist policy callable backed by a local PEFT adapter."""
+
+    torch = require_module("torch")
+    transformers = require_module("transformers")
+    peft = require_module("peft")
+
+    adapter_path = Path(adapter_dir).expanduser().resolve()
+    if not adapter_path.exists():
+        raise FileNotFoundError(f"Adapter directory does not exist: {adapter_path}")
+
+    tokenizer = transformers.AutoTokenizer.from_pretrained(
+        str(adapter_path),
+        trust_remote_code=True,
+    )
+    model = transformers.AutoModelForCausalLM.from_pretrained(
+        base_model,
+        torch_dtype=torch.bfloat16,
+        device_map="auto",
+        trust_remote_code=True,
+    )
+    model = peft.PeftModel.from_pretrained(model, str(adapter_path))
+    model.eval()
+    device = next(model.parameters()).device
+
+    evidence_packs = [
+        pack for pack in load_frozen_evidence_packs() if pack.trainable_in_env
+    ]
+
+    def generate_fn(messages: list[dict[str, str]]) -> str:
+        prompt_text = tokenizer.apply_chat_template(
+            messages,
+            tokenize=False,
+            add_generation_prompt=True,
+        )
+        enc = tokenizer(prompt_text, return_tensors="pt").to(device)
+        generation_kwargs = {
+            "input_ids": enc["input_ids"],
+            "attention_mask": enc["attention_mask"],
+            "max_new_tokens": max_completion_tokens,
+            "pad_token_id": tokenizer.eos_token_id,
+            "do_sample": temperature > 0.0,
+        }
+        if temperature > 0.0:
+            generation_kwargs["temperature"] = temperature
+        with torch.no_grad():
+            outputs = model.generate(**generation_kwargs)
+        generated_ids = outputs[0][enc["input_ids"].shape[1]:]
+        return tokenizer.decode(generated_ids, skip_special_tokens=True)
+
+    def policy_fn(
+        observation: ScientistObservation,
+        *,
+        seed: int | None = None,
+        scenario: str | None = None,
+        difficulty: str | None = None,
+    ) -> ScientistAction:
+        evidence_pack = None
+        if seed is not None and scenario is not None:
+            evidence_pack = select_evidence_pack(
+                evidence_packs,
+                template=scenario,  # type: ignore[arg-type]
+                seed=seed,
+            )
+
+        user_message = format_scientist_observation(observation)
+        if evidence_pack is not None:
+            user_message += "\n\nFrozen evidence pack:\n" + evidence_pack.prompt_block()
+
+        try:
+            result = call_scientist_with_retry(
+                generate_fn,
+                _build_live_scientist_system_prompt(
+                    observation,
+                    evidence_pack=evidence_pack,
+                    difficulty=difficulty,
+                    scenario=scenario,
+                ),
+                observation,
+                max_retries=max_retries,
+                user_message_override=user_message,
+            )
+            return result.action
+        except ScientistOutputParseError:
+            return ScientistAction(
+                action_type=ScientistActionType.REQUEST_INFO,
+                sample_size=0,
+                controls=[],
+                technique="",
+                duration_days=0,
+                required_equipment=[],
+                required_reagents=[],
+                questions=[
+                    "Please restate the main blocking requirement or missing evidence."
+                ],
+                rationale="",
+            )
+
+    return policy_fn
+
+
+__all__ = [
+    "PaperBalancedEvaluationCase",
+    "build_local_scientist_policy",
+    "build_trainable_paper_cases",
+]
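The round-robin indexing in `build_trainable_paper_cases` can be exercised without the repo's evidence packs. The sketch below reimplements just the `case_index % len(targets)` / `case_index // len(targets)` scheme over synthetic pack ids; it is an illustrative stand-in, not the repo function:

```python
# Illustrative stand-in for the balancing loop in build_trainable_paper_cases:
# every (template, pack) target is visited once per cycle before any pack
# repeats, and the cycle index distinguishes repeat visits.
def balance_cases(targets: list[str], total_cases: int, offset: int = 0):
    cases = []
    for local_index in range(total_cases):
        case_index = offset + local_index
        pack_id = targets[case_index % len(targets)]   # round-robin pack pick
        cycle_index = case_index // len(targets)       # how many full passes so far
        cases.append((case_index, pack_id, cycle_index))
    return cases

targets = [f"paper-{i}" for i in range(34)]  # 34 trainable papers, as in the note
cases = balance_cases(targets, 50)
print(len({pack for _, pack, _ in cases[:34]}))  # 34 (each paper seen once first)
print(cases[34][1], cases[34][2])                # paper-0 1 (second cycle begins)
```

This mirrors why the benchmark note can claim the eval "spans 34 trainable papers": any run of at least 34 cases covers every pack at least once.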
tests/test_local_eval.py ADDED
@@ -0,0 +1,28 @@
+from __future__ import annotations
+
+from replicalab.training.local_eval import build_trainable_paper_cases
+
+
+def test_build_trainable_paper_cases_builds_exact_requested_count() -> None:
+    cases = build_trainable_paper_cases(50)
+
+    assert len(cases) == 50
+    assert all(case.scenario in {"ml_benchmark", "finance_trading"} for case in cases)
+    assert len({case.expected_evidence_id for case in cases[:34]}) == 34
+    assert len({case.expected_evidence_id for case in cases}) >= 34
+
+
+def test_build_trainable_paper_cases_rejects_non_positive_count() -> None:
+    try:
+        build_trainable_paper_cases(0)
+    except ValueError as exc:
+        assert "at least 1" in str(exc)
+    else:
+        raise AssertionError("Expected ValueError for non-positive case count")
+
+
+def test_build_trainable_paper_cases_supports_offsets() -> None:
+    cases = build_trainable_paper_cases(3, case_index_offset=34)
+
+    assert [case.case_index for case in cases] == [34, 35, 36]
+    assert len({case.expected_evidence_id for case in cases}) == 3
tests/test_training_cli.py CHANGED
@@ -206,3 +206,132 @@ def test_scientist_compare_eval_cli_writes_rows(tmp_path, monkeypatch) -> None:
     assert [row["label"] for row in payload["rows"]] == ["baseline", "trained"]
     assert payload["rows"][1]["average_reward"] == 3.5
     assert history_path.exists()
+
+
+def test_scientist_local_compare_eval_cli_writes_cases_and_metrics(tmp_path, monkeypatch) -> None:
+    baseline_record = EpisodeRecord(
+        seed=0,
+        scenario="ml_benchmark",
+        difficulty="easy",
+        episode_id="baseline-1",
+        total_reward=1.0,
+        reward_breakdown=RewardBreakdown(rigor=0.3, feasibility=0.4, fidelity=0.5),
+        verdict="timeout",
+        agreement_reached=False,
+    )
+    trained_record = EpisodeRecord(
+        seed=0,
+        scenario="ml_benchmark",
+        difficulty="easy",
+        episode_id="trained-1",
+        total_reward=2.5,
+        reward_breakdown=RewardBreakdown(rigor=0.7, feasibility=0.8, fidelity=0.75),
+        verdict="accept",
+        agreement_reached=True,
+    )
+    rows = [
+        PolicyComparisonRow(
+            label="baseline",
+            episode_count=1,
+            average_reward=1.0,
+            average_rounds=2.0,
+            agreement_rate=0.0,
+            invalid_action_rate=0.0,
+            average_invalid_bounded_tool_rate=0.0,
+            average_rigor=0.3,
+            average_feasibility=0.4,
+            average_fidelity=0.5,
+            average_parsimony=1.0,
+            average_tool_trace_count=0.0,
+            average_paper_understanding=0.2,
+            average_communication_quality=0.0,
+        ),
+        PolicyComparisonRow(
+            label="trained",
+            episode_count=1,
+            average_reward=2.5,
+            average_rounds=1.0,
+            agreement_rate=1.0,
+            invalid_action_rate=0.0,
+            average_invalid_bounded_tool_rate=0.0,
+            average_rigor=0.7,
+            average_feasibility=0.8,
+            average_fidelity=0.75,
+            average_parsimony=1.0,
+            average_tool_trace_count=0.0,
+            average_paper_understanding=0.6,
+            average_communication_quality=0.0,
+        ),
+    ]
+
+    class _CaseSpec:
+        case_index = 7
+        expected_evidence_id = "ml:paper-1"
+        expected_paper_title = "Paper 1"
+
+        def to_evaluation_case(self) -> object:
+            return object()
+
+        def model_dump(self, mode: str = "json") -> dict[str, object]:
+            return {
+                "case_index": 7,
+                "seed": 0,
+                "scenario": "ml_benchmark",
+                "difficulty": "easy",
+                "expected_evidence_id": "ml:paper-1",
+                "expected_paper_title": "Paper 1",
+            }
+
+    monkeypatch.setattr(
+        "replicalab.training.cli.build_trainable_paper_cases",
+        lambda *args, **kwargs: [_CaseSpec()],
+    )
+    monkeypatch.setattr(
+        "replicalab.training.cli.build_local_scientist_policy",
+        lambda **_: (lambda _obs: None),
+    )
+    monkeypatch.setattr(
+        "replicalab.training.cli.compare_policies",
+        lambda **_: (
+            {"baseline": [baseline_record], "trained": [trained_record]},
+            rows,
+        ),
+    )
+    monkeypatch.setattr(
+        "replicalab.training.cli.plot_evaluation_bars",
+        lambda *args, **kwargs: None,
+    )
+    monkeypatch.setattr(
+        "replicalab.training.cli.plot_benchmark_history",
+        lambda *args, **kwargs: None,
+    )
+
+    exit_code = main(
+        [
+            "scientist-local-compare-eval",
+            "--persist-root",
+            str(tmp_path),
+            "--run-name",
+            "local-compare-test",
+            "--adapter-dir",
+            str(tmp_path / "adapter"),
+            "--case-count",
+            "1",
+            "--case-offset",
+            "7",
+        ]
+    )
+
+    assert exit_code == 0
+    summary_path = tmp_path / "local-compare-test" / "reports" / "summary.json"
+    metrics_path = tmp_path / "local-compare-test" / "reports" / "metrics.jsonl"
+    cases_path = tmp_path / "local-compare-test" / "manifests" / "evaluation_cases.json"
+    payload = json.loads(summary_path.read_text(encoding="utf-8"))
+    assert payload["case_count"] == 1
+    assert payload["unique_expected_papers"] == 1
+    metrics_lines = metrics_path.read_text(encoding="utf-8").strip().splitlines()
+    assert len(metrics_lines) == 2
+    first_metric = json.loads(metrics_lines[0])
+    assert first_metric["case_index"] == 7
+    assert first_metric["expected_evidence_id"] == "ml:paper-1"
+    assert cases_path.exists()
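One detail worth noting from `build_local_scientist_policy` in this commit: greedy decoding is selected whenever `--temperature` is `0.0`, because `do_sample` is derived from the temperature and the `temperature` key is only added when sampling is on. A dependency-free sketch of that kwarg construction (illustrative; the real code also passes tensors and `pad_token_id` to `model.generate`):

```python
# Sketch of the generation-kwarg gating in build_local_scientist_policy:
# temperature 0.0 -> greedy decode (do_sample=False, no temperature key),
# temperature > 0 -> sampling with that temperature.
def build_generation_kwargs(max_new_tokens: int, temperature: float) -> dict:
    kwargs = {
        "max_new_tokens": max_new_tokens,
        "do_sample": temperature > 0.0,
    }
    if temperature > 0.0:
        kwargs["temperature"] = temperature
    return kwargs

print(build_generation_kwargs(450, 0.0))
# {'max_new_tokens': 450, 'do_sample': False}
print(build_generation_kwargs(450, 0.7))
# {'max_new_tokens': 450, 'do_sample': True, 'temperature': 0.7}
```

Omitting the `temperature` key at `0.0` also avoids the transformers warning about passing `temperature` while `do_sample=False`.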