ayushozha committed on
Commit a29a83d · 1 Parent(s): cd197ac

Add local H100 scientist eval tooling
docs/ayush/notes.md CHANGED
@@ -84,3 +84,33 @@ Current localhost model-runtime note:
   - clamps duration to the current lab time limit
 - If the local model stalls or errors, `/agent-step` falls back to the deterministic baseline Scientist and records that in the step metadata as `scientist_runtime=ollama_fallback`.
 
+Current March 9 H100 benchmark note:
+
+- The full multi-round `scientist-local-compare-eval` path is live on the
+  Northflank H100 notebook, but the current notebook image is missing the fast
+  linear-attention path for the saved `unsloth/Qwen3.5-0.8B` adapter, so large
+  sharded rollout sweeps did not flush artifacts on a practical same-turn
+  timescale.
+- A fallback live H100 first-step benchmark was run on 2026-03-09 instead:
+  `250` shared reset cases with both baseline and trained Scientist first-step
+  actions, for `500` total simulations.
+- The merged artifact root is
+  `replicalab/outputs/training/h100-one-step-500-20260309/`.
+- The benchmark spans `34` trainable papers.
+- Summary result:
+  - baseline average first-step paper understanding: `0.61692084`
+  - trained average first-step paper understanding: `0.063866752`
+  - baseline average first-step reward: `0.3`
+  - trained average first-step reward: `0.05`
+  - trained request-info rate: `1.0`
+  - invalid-action rate stayed `0.0` for both labels
+- Scenario-level understanding:
+  - baseline `finance_trading`: `0.596033`
+  - trained `finance_trading`: `0.018182`
+  - baseline `ml_benchmark`: `0.633333`
+  - trained `ml_benchmark`: `0.099762`
+- Current interpretation: the saved `replicalab-qwen3.5-grpo` adapter is
+  materially worse than the deterministic baseline on first-step paper
+  grounding and currently behaves like a universal `request_info` policy under
+  a fast decode budget.
+
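For a quick sanity check on the regression reported in the note, the baseline-minus-trained gaps can be recomputed from the published summary numbers. The snippet below is illustrative only (values copied verbatim from the note, not read from the artifact root):

```python
# Recompute the first-step gaps from the benchmark note above.
# Numbers are copied verbatim from the summary result.
baseline_understanding = 0.61692084
trained_understanding = 0.063866752
baseline_reward = 0.3
trained_reward = 0.05

understanding_gap = baseline_understanding - trained_understanding
reward_gap = baseline_reward - trained_reward

print(f"understanding gap: {understanding_gap:.6f}")  # understanding gap: 0.553054
print(f"reward gap: {reward_gap:.2f}")                # reward gap: 0.25
```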
docs/changes.md CHANGED
@@ -94,4 +94,5 @@ Rules:
 | 2026-03-09 | Person B (Ayush) | Post-MVP training refinement | Shifted the active training iteration from the older `Qwen3-8B` assumption to `Qwen3.5-9B`, added prompt-goal expansion plus paper-understanding and communication metrics, and started persisting cross-run benchmark history plots | Model quality is now the bottleneck, so the next useful work is better training coverage and evaluation signal rather than more plumbing; the user also requested a clearer separation between immediate metric work and a later execution-environment redesign | Scientist and Lab Manager defaults now target `Qwen/Qwen3.5-9B`, eval outputs now track `paper_understanding` and `communication_quality`, shared benchmark history now accumulates under `replicalab/outputs/training/history/`, and `docs/training_goals.md` records the larger execution-env phase as a separate architecture track | Keep the deterministic judge as the reward source; treat any large-model judge such as `Qwen3.5-122B-A10B` as audit-only until an explicit architecture change is approved |
 | 2026-03-09 | Person B (Ayush) | Deployment reality check for HF + Northflank | Recorded the current hosted-model and training-launch blockers after verifying the live tokens and remote resources instead of assuming the documented path was still operational | The project docs described HF-heavy hosting and Northflank H100 training as available paths, but the current HF account is not billable and the current Northflank training job is not runnable yet | Verified via live checks that the HF token authenticates but the account reports `canPay=false` with no orgs, that `replicalab-train` returns `409 No deployment configured` when started, and that the live `replicalab-ai` container on `nf-gpu-hack-16-64` does not expose `nvidia-smi` or `/dev/nvidia*` | Before promising heavy-model hosting or H100 training, attach a runnable image to the job, re-probe GPU visibility from inside the runtime, and enable a billing-backed HF account or move serving to another provider |
 | 2026-03-09 | Person B (Ayush) | Northflank notebook validation | Validated the separate Northflank notebook service after the original pasted notebook hostname turned out to be stale | The repo previously had an unrunnable training job but the team also had a live Jupyter route; without checking the actual service, it was unclear whether H100 access existed, whether the notebook credentials worked, and whether the saved training state was usable | Verified the live `notebook-openport/jupyter-pytorch` service, confirmed successful Jupyter login, confirmed in-container `NVIDIA H100 80GB HBM3`, identified the live notebook DNS `app--jupyter-pytorch--9y6g97v7czb9.code.run`, and inspected the saved GRPO outputs/logs showing checkpoints through step 200 followed by a chat-template/content-format failure | Use the notebook as the current heavy-run path only after reconciling its repo state with the main workspace and fixing the `apply_chat_template` message-format bug |
+| 2026-03-09 | Person B (Ayush) | H100 paper-understanding benchmark | Shifted the active H100 benchmark from a planned full multi-round rollout sweep to a first-step live environment benchmark on the same notebook | The current notebook image lacks the fast linear-attention path for the saved `unsloth/Qwen3.5-0.8B` adapter, so repeated sharded `scientist-local-compare-eval` attempts stayed active for a long time without producing same-turn artifacts even after retry and token-budget cuts | Produced a merged live H100 benchmark artifact set at `replicalab/outputs/training/h100-one-step-500-20260309/` covering `500` total simulations (`250` shared reset cases × baseline/trained first-step actions); the current saved adapter underperformed badly versus the deterministic baseline on first-step paper understanding and collapsed to `request_info` on every trained sample | If a full multi-round benchmark is still required later, first fix the notebook image to restore the fast attention path or move the eval to a more efficient runtime |
 
docs/completion.md CHANGED
@@ -25,6 +25,15 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
 | Remaining | 0 |
 | **Completion rate** | **100.00%** |
 
+Post-MVP benchmark note:
+
+- On 2026-03-09, a live Northflank H100 first-step benchmark was added as an
+  operational post-MVP artifact under
+  `replicalab/outputs/training/h100-one-step-500-20260309/`.
+- It covers `500` total simulations (`250` shared reset cases × baseline and
+  trained first-step actions) and records paper-understanding regression data
+  for the current saved Scientist adapter.
+
 ### Completion by Person
 
 | Person | Assigned | Completed (own) | Completed (by others) | Remaining | Rate |
replicalab/training/__init__.py CHANGED
@@ -25,6 +25,11 @@ from replicalab.training.lab_manager_sft import (
     preview_lab_manager_training,
     train_lab_manager_sft,
 )
+from replicalab.training.local_eval import (
+    PaperBalancedEvaluationCase,
+    build_local_scientist_policy,
+    build_trainable_paper_cases,
+)
 from replicalab.training.metrics import EvaluationSummary, summarize_episodes
 from replicalab.training.rollout import EpisodeRecord, RolloutWorker, StepRecord
 from replicalab.training.scientist_grpo import (
@@ -46,6 +51,7 @@ __all__ = [
     "FrozenEvidencePack",
     "LabManagerSFTConfig",
     "LabManagerSFTExample",
+    "PaperBalancedEvaluationCase",
     "RolloutWorker",
     "ScientistGRPOConfig",
     "ScientistPromptExample",
@@ -55,6 +61,7 @@ __all__ = [
     "build_lab_manager_sft_examples",
     "build_scientist_prompt_examples",
     "evaluate_policy",
+    "build_local_scientist_policy",
     "load_frozen_evidence_packs",
     "preview_lab_manager_training",
     "preview_scientist_training",
@@ -62,4 +69,5 @@ __all__ = [
     "summarize_episodes",
     "train_lab_manager_sft",
     "train_scientist_grpo",
+    "build_trainable_paper_cases",
 ]
replicalab/training/cli.py CHANGED
@@ -30,6 +30,10 @@ from replicalab.training.history import (
     build_benchmark_history_row,
     load_benchmark_history,
 )
+from replicalab.training.local_eval import (
+    build_local_scientist_policy,
+    build_trainable_paper_cases,
+)
 from replicalab.training.lab_manager_sft import (
     LabManagerSFTConfig,
     preview_lab_manager_training,
@@ -67,6 +71,8 @@ def main(argv: Sequence[str] | None = None) -> int:
         return _run_baseline_eval(args)
     if args.command == "scientist-compare-eval":
         return _run_scientist_compare_eval(args)
+    if args.command == "scientist-local-compare-eval":
+        return _run_scientist_local_compare_eval(args)
     if args.command == "art-scientist-train":
         return _run_art_scientist_train(args)
 
@@ -261,6 +267,69 @@ def _build_parser() -> argparse.ArgumentParser:
         help="Sampling temperature for the trained remote Scientist.",
     )
 
+    local_compare_eval = subparsers.add_parser(
+        "scientist-local-compare-eval",
+        help="Compare baseline Scientist versus a local trained LoRA adapter.",
+    )
+    _add_common_artifact_args(local_compare_eval, prefix="eval-local-compare")
+    local_compare_eval.add_argument(
+        "--base-url",
+        default="https://ayushozha-replicalab.hf.space",
+        help="ReplicaLab environment base URL.",
+    )
+    local_compare_eval.add_argument(
+        "--transport",
+        default="rest",
+        choices=("rest", "ws"),
+        help="Transport used by ReplicaLabClient.",
+    )
+    local_compare_eval.add_argument(
+        "--adapter-dir",
+        required=True,
+        help="Path to the trained local Scientist adapter directory.",
+    )
+    local_compare_eval.add_argument(
+        "--base-model",
+        default="Qwen/Qwen3.5-9B",
+        help="Base model used by the local adapter.",
+    )
+    local_compare_eval.add_argument(
+        "--case-count",
+        type=int,
+        default=500,
+        help="Number of live rollout simulations to run.",
+    )
+    local_compare_eval.add_argument(
+        "--case-offset",
+        type=int,
+        default=0,
+        help="Starting case index for sharded live rollout runs.",
+    )
+    local_compare_eval.add_argument(
+        "--difficulties",
+        nargs="+",
+        default=["easy", "medium", "hard"],
+        help="Difficulty levels to cycle through when building the live rollout set.",
+    )
+    local_compare_eval.add_argument(
+        "--max-completion-tokens",
+        type=int,
+        default=450,
+        help="Max completion tokens for the local trained Scientist.",
+    )
+    local_compare_eval.add_argument(
+        "--temperature",
+        type=float,
+        default=0.0,
+        help="Sampling temperature for the local trained Scientist.",
+    )
+    local_compare_eval.add_argument(
+        "--max-retries",
+        type=int,
+        default=2,
+        help="Maximum parse-retry attempts for the local trained Scientist.",
+    )
+
     art_train = subparsers.add_parser(
         "art-scientist-train",
         help="Run ART serverless RL training against the ReplicaLab OpenEnv deployment.",
@@ -683,6 +752,99 @@ def _run_scientist_compare_eval(args: argparse.Namespace) -> int:
     return 0
 
 
+def _run_scientist_local_compare_eval(args: argparse.Namespace) -> int:
+    layout = _build_layout(
+        prefix="eval-local-compare",
+        persist_root=args.persist_root,
+        run_name=args.run_name,
+    )
+    case_specs = build_trainable_paper_cases(
+        args.case_count,
+        case_index_offset=args.case_offset,
+        difficulties=args.difficulties,
+    )
+    cases = [spec.to_evaluation_case() for spec in case_specs]
+    trained_policy = build_local_scientist_policy(
+        base_model=args.base_model,
+        adapter_dir=args.adapter_dir,
+        max_completion_tokens=args.max_completion_tokens,
+        temperature=args.temperature,
+        max_retries=args.max_retries,
+    )
+    records_by_label, rows = compare_policies(
+        base_url=args.base_url,
+        policies=[
+            ("baseline", build_baseline_scientist_action),
+            ("trained", trained_policy),
+        ],
+        cases=cases,
+        transport=args.transport,
+    )
+    write_json(
+        layout.manifests_dir / "evaluation_cases.json",
+        [spec.model_dump(mode="json") for spec in case_specs],
+    )
+    _write_run_metadata(
+        layout,
+        {
+            "kind": "scientist_local_compare_eval",
+            "base_url": args.base_url,
+            "transport": args.transport,
+            "adapter_dir": args.adapter_dir,
+            "base_model": args.base_model,
+            "case_count": args.case_count,
+            "case_offset": args.case_offset,
+            "difficulties": args.difficulties,
+            "max_retries": args.max_retries,
+            "bounded_tool_policy": [
+                "search_evidence",
+                "run_code_check",
+                "inspect_image",
+            ],
+        },
+    )
+    for label, records in records_by_label.items():
+        for spec, record in zip(case_specs, records):
+            append_jsonl(
+                layout.metrics_jsonl,
+                {
+                    "label": label,
+                    "case_index": spec.case_index,
+                    "expected_evidence_id": spec.expected_evidence_id,
+                    "expected_paper_title": spec.expected_paper_title,
+                    **episode_to_metrics(record).model_dump(mode="json"),
+                },
+            )
+    rows_payload = [row.model_dump(mode="json") for row in rows]
+    unique_papers = len({spec.expected_evidence_id for spec in case_specs})
+    write_json(
+        layout.summary_json,
+        {
+            "rows": rows_payload,
+            "case_count": len(case_specs),
+            "unique_expected_papers": unique_papers,
+        },
+    )
+    _plot_comparison_summary(rows_payload, layout=layout)
+    _append_history_and_plots(
+        layout=layout,
+        kind="scientist_local_compare_eval",
+        rows=rows_payload,
+    )
+    print(
+        json.dumps(
+            {
+                "rows": rows_payload,
+                "case_count": len(case_specs),
+                "unique_expected_papers": unique_papers,
+            },
+            indent=2,
+            sort_keys=True,
+        )
+    )
+    return 0
+
+
 def _run_art_scientist_train(args: argparse.Namespace) -> int:
     layout = _build_layout(
         prefix="art-scientist",
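The `scientist-local-compare-eval` subcommand above follows a plain argparse subparser pattern. A minimal standalone sketch of that pattern (a hypothetical parser carrying only a subset of the real flags, not the repo's CLI) shows how the defaults resolve when only the required `--adapter-dir` is passed:

```python
import argparse

# Illustrative subparser sketch; flag names mirror the diff above,
# but this parser is a stand-in, not the repo's _build_parser().
parser = argparse.ArgumentParser(prog="replicalab-train")
subparsers = parser.add_subparsers(dest="command")

local_compare_eval = subparsers.add_parser("scientist-local-compare-eval")
local_compare_eval.add_argument("--adapter-dir", required=True)
local_compare_eval.add_argument("--case-count", type=int, default=500)
local_compare_eval.add_argument("--case-offset", type=int, default=0)
local_compare_eval.add_argument(
    "--difficulties", nargs="+", default=["easy", "medium", "hard"]
)
local_compare_eval.add_argument("--temperature", type=float, default=0.0)

args = parser.parse_args(
    ["scientist-local-compare-eval", "--adapter-dir", "outputs/adapter"]
)
print(args.case_count, args.case_offset, args.difficulties)
# 500 0 ['easy', 'medium', 'hard']
```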
replicalab/training/local_eval.py ADDED
@@ -0,0 +1,210 @@
+"""Helpers for local-adapter evaluation against live ReplicaLab rollouts."""
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Callable, Sequence
+
+from pydantic import BaseModel, ConfigDict
+
+from replicalab.agents.scientist_policy import (
+    ScientistOutputParseError,
+    _build_live_scientist_system_prompt,
+    call_scientist_with_retry,
+    format_scientist_observation,
+)
+from replicalab.models import ScientistAction, ScientistActionType, ScientistObservation
+from replicalab.training.corpus import load_frozen_evidence_packs, select_evidence_pack
+from replicalab.training.evaluation import EvaluationCase
+from replicalab.training.runtime import require_module
+
+
+class PaperBalancedEvaluationCase(BaseModel):
+    """One deterministic rollout case with expected paper metadata."""
+
+    model_config = ConfigDict(extra="forbid")
+
+    case_index: int
+    seed: int
+    scenario: str
+    difficulty: str
+    expected_evidence_id: str
+    expected_paper_title: str
+
+    def to_evaluation_case(self) -> EvaluationCase:
+        return EvaluationCase(
+            seed=self.seed,
+            scenario=self.scenario,
+            difficulty=self.difficulty,
+        )
+
+
+def build_trainable_paper_cases(
+    total_cases: int,
+    *,
+    case_index_offset: int = 0,
+    difficulties: Sequence[str] = ("easy", "medium", "hard"),
+) -> list[PaperBalancedEvaluationCase]:
+    """Build a deterministic live-eval set balanced across trainable papers."""
+
+    if total_cases < 1:
+        raise ValueError("total_cases must be at least 1")
+    if case_index_offset < 0:
+        raise ValueError("case_index_offset must be at least 0")
+    if not difficulties:
+        raise ValueError("difficulties must not be empty")
+
+    packs = [pack for pack in load_frozen_evidence_packs() if pack.trainable_in_env]
+    if not packs:
+        raise ValueError("No trainable evidence packs are wired into the current env.")
+
+    by_template: dict[str, list[object]] = {}
+    for pack in packs:
+        assert pack.template is not None
+        by_template.setdefault(pack.template, []).append(pack)
+    for template in by_template:
+        by_template[template] = sorted(
+            by_template[template],
+            key=lambda pack: pack.scenario_number,  # type: ignore[attr-defined]
+        )
+
+    targets: list[tuple[str, int, object]] = []
+    for template in sorted(by_template):
+        for pack_index, pack in enumerate(by_template[template]):
+            targets.append((template, pack_index, pack))
+
+    cases: list[PaperBalancedEvaluationCase] = []
+    for local_index in range(total_cases):
+        case_index = case_index_offset + local_index
+        template, pack_index, pack = targets[case_index % len(targets)]
+        cycle_index = case_index // len(targets)
+        template_pack_count = len(by_template[template])
+        seed = pack_index + cycle_index * template_pack_count
+        difficulty = difficulties[(case_index + pack_index) % len(difficulties)]
+        cases.append(
+            PaperBalancedEvaluationCase(
+                case_index=case_index,
+                seed=seed,
+                scenario=template,
+                difficulty=difficulty,
+                expected_evidence_id=pack.evidence_id,  # type: ignore[attr-defined]
+                expected_paper_title=pack.downloaded_paper_title,  # type: ignore[attr-defined]
+            )
+        )
+
+    return cases
+
+
+def build_local_scientist_policy(
+    *,
+    base_model: str,
+    adapter_dir: str | Path,
+    max_completion_tokens: int = 450,
+    temperature: float = 0.0,
+    max_retries: int = 2,
+) -> Callable[[ScientistObservation], ScientistAction]:
+    """Create a sync Scientist policy callable backed by a local PEFT adapter."""
+
+    torch = require_module("torch")
+    transformers = require_module("transformers")
+    peft = require_module("peft")
+
+    adapter_path = Path(adapter_dir).expanduser().resolve()
+    if not adapter_path.exists():
+        raise FileNotFoundError(f"Adapter directory does not exist: {adapter_path}")
+
+    tokenizer = transformers.AutoTokenizer.from_pretrained(
+        str(adapter_path),
+        trust_remote_code=True,
+    )
+    model = transformers.AutoModelForCausalLM.from_pretrained(
+        base_model,
+        torch_dtype=torch.bfloat16,
+        device_map="auto",
+        trust_remote_code=True,
+    )
+    model = peft.PeftModel.from_pretrained(model, str(adapter_path))
+    model.eval()
+    device = next(model.parameters()).device
+
+    evidence_packs = [
+        pack for pack in load_frozen_evidence_packs() if pack.trainable_in_env
+    ]
+
+    def generate_fn(messages: list[dict[str, str]]) -> str:
+        prompt_text = tokenizer.apply_chat_template(
+            messages,
+            tokenize=False,
+            add_generation_prompt=True,
+        )
+        enc = tokenizer(prompt_text, return_tensors="pt").to(device)
+        generation_kwargs = {
+            "input_ids": enc["input_ids"],
+            "attention_mask": enc["attention_mask"],
+            "max_new_tokens": max_completion_tokens,
+            "pad_token_id": tokenizer.eos_token_id,
+            "do_sample": temperature > 0.0,
+        }
+        if temperature > 0.0:
+            generation_kwargs["temperature"] = temperature
+        with torch.no_grad():
+            outputs = model.generate(**generation_kwargs)
+        generated_ids = outputs[0][enc["input_ids"].shape[1]:]
+        return tokenizer.decode(generated_ids, skip_special_tokens=True)
+
+    def policy_fn(
+        observation: ScientistObservation,
+        *,
+        seed: int | None = None,
+        scenario: str | None = None,
+        difficulty: str | None = None,
+    ) -> ScientistAction:
+        evidence_pack = None
+        if seed is not None and scenario is not None:
+            evidence_pack = select_evidence_pack(
+                evidence_packs,
+                template=scenario,  # type: ignore[arg-type]
+                seed=seed,
+            )
+
+        user_message = format_scientist_observation(observation)
+        if evidence_pack is not None:
+            user_message += "\n\nFrozen evidence pack:\n" + evidence_pack.prompt_block()
+
+        try:
+            result = call_scientist_with_retry(
+                generate_fn,
+                _build_live_scientist_system_prompt(
+                    observation,
+                    evidence_pack=evidence_pack,
+                    difficulty=difficulty,
+                    scenario=scenario,
+                ),
+                observation,
+                max_retries=max_retries,
+                user_message_override=user_message,
+            )
+            return result.action
+        except ScientistOutputParseError:
+            return ScientistAction(
+                action_type=ScientistActionType.REQUEST_INFO,
+                sample_size=0,
+                controls=[],
+                technique="",
+                duration_days=0,
+                required_equipment=[],
+                required_reagents=[],
+                questions=[
+                    "Please restate the main blocking requirement or missing evidence."
+                ],
+                rationale="",
+            )
+
+    return policy_fn
+
+
+__all__ = [
+    "PaperBalancedEvaluationCase",
+    "build_local_scientist_policy",
+    "build_trainable_paper_cases",
+]
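The round-robin indexing in `build_trainable_paper_cases` can be exercised without the repo's evidence packs. The sketch below reimplements just the `case_index % len(targets)` / `case_index // len(targets)` scheme over synthetic pack ids; it is an illustrative stand-in, not the repo function:

```python
# Illustrative stand-in for the balancing loop in build_trainable_paper_cases:
# every (template, pack) target is visited once per cycle before any pack
# repeats, and the cycle index distinguishes repeat visits.
def balance_cases(targets: list[str], total_cases: int, offset: int = 0):
    cases = []
    for local_index in range(total_cases):
        case_index = offset + local_index
        pack_id = targets[case_index % len(targets)]   # round-robin pack pick
        cycle_index = case_index // len(targets)       # how many full passes so far
        cases.append((case_index, pack_id, cycle_index))
    return cases

targets = [f"paper-{i}" for i in range(34)]  # 34 trainable papers, as in the note
cases = balance_cases(targets, 50)
print(len({pack for _, pack, _ in cases[:34]}))  # 34 (each paper seen once first)
print(cases[34][1], cases[34][2])                # paper-0 1 (second cycle begins)
```

This mirrors why the benchmark note can claim the eval "spans 34 trainable papers": any run of at least 34 cases covers every pack at least once.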
tests/test_local_eval.py ADDED
@@ -0,0 +1,28 @@
+from __future__ import annotations
+
+from replicalab.training.local_eval import build_trainable_paper_cases
+
+
+def test_build_trainable_paper_cases_builds_exact_requested_count() -> None:
+    cases = build_trainable_paper_cases(50)
+
+    assert len(cases) == 50
+    assert all(case.scenario in {"ml_benchmark", "finance_trading"} for case in cases)
+    assert len({case.expected_evidence_id for case in cases[:34]}) == 34
+    assert len({case.expected_evidence_id for case in cases}) >= 34
+
+
+def test_build_trainable_paper_cases_rejects_non_positive_count() -> None:
+    try:
+        build_trainable_paper_cases(0)
+    except ValueError as exc:
+        assert "at least 1" in str(exc)
+    else:
+        raise AssertionError("Expected ValueError for non-positive case count")
+
+
+def test_build_trainable_paper_cases_supports_offsets() -> None:
+    cases = build_trainable_paper_cases(3, case_index_offset=34)
+
+    assert [case.case_index for case in cases] == [34, 35, 36]
+    assert len({case.expected_evidence_id for case in cases}) == 3
tests/test_training_cli.py CHANGED
@@ -206,3 +206,132 @@ def test_scientist_compare_eval_cli_writes_rows(tmp_path, monkeypatch) -> None:
     assert [row["label"] for row in payload["rows"]] == ["baseline", "trained"]
     assert payload["rows"][1]["average_reward"] == 3.5
     assert history_path.exists()
+
+
+def test_scientist_local_compare_eval_cli_writes_cases_and_metrics(tmp_path, monkeypatch) -> None:
+    baseline_record = EpisodeRecord(
+        seed=0,
+        scenario="ml_benchmark",
+        difficulty="easy",
+        episode_id="baseline-1",
+        total_reward=1.0,
+        reward_breakdown=RewardBreakdown(rigor=0.3, feasibility=0.4, fidelity=0.5),
+        verdict="timeout",
+        agreement_reached=False,
+    )
+    trained_record = EpisodeRecord(
+        seed=0,
+        scenario="ml_benchmark",
+        difficulty="easy",
+        episode_id="trained-1",
+        total_reward=2.5,
+        reward_breakdown=RewardBreakdown(rigor=0.7, feasibility=0.8, fidelity=0.75),
+        verdict="accept",
+        agreement_reached=True,
+    )
+    rows = [
+        PolicyComparisonRow(
+            label="baseline",
+            episode_count=1,
+            average_reward=1.0,
+            average_rounds=2.0,
+            agreement_rate=0.0,
+            invalid_action_rate=0.0,
+            average_invalid_bounded_tool_rate=0.0,
+            average_rigor=0.3,
+            average_feasibility=0.4,
+            average_fidelity=0.5,
+            average_parsimony=1.0,
+            average_tool_trace_count=0.0,
+            average_paper_understanding=0.2,
+            average_communication_quality=0.0,
+        ),
+        PolicyComparisonRow(
+            label="trained",
+            episode_count=1,
+            average_reward=2.5,
+            average_rounds=1.0,
+            agreement_rate=1.0,
+            invalid_action_rate=0.0,
+            average_invalid_bounded_tool_rate=0.0,
+            average_rigor=0.7,
+            average_feasibility=0.8,
+            average_fidelity=0.75,
+            average_parsimony=1.0,
+            average_tool_trace_count=0.0,
+            average_paper_understanding=0.6,
+            average_communication_quality=0.0,
+        ),
+    ]
+
+    class _CaseSpec:
+        case_index = 7
+        expected_evidence_id = "ml:paper-1"
+        expected_paper_title = "Paper 1"
+
+        def to_evaluation_case(self) -> object:
+            return object()
+
+        def model_dump(self, mode: str = "json") -> dict[str, object]:
+            return {
+                "case_index": 7,
+                "seed": 0,
+                "scenario": "ml_benchmark",
+                "difficulty": "easy",
+                "expected_evidence_id": "ml:paper-1",
+                "expected_paper_title": "Paper 1",
+            }
+
+    monkeypatch.setattr(
+        "replicalab.training.cli.build_trainable_paper_cases",
+        lambda *args, **kwargs: [_CaseSpec()],
+    )
+    monkeypatch.setattr(
+        "replicalab.training.cli.build_local_scientist_policy",
+        lambda **_: (lambda _obs: None),
+    )
+    monkeypatch.setattr(
+        "replicalab.training.cli.compare_policies",
+        lambda **_: (
+            {"baseline": [baseline_record], "trained": [trained_record]},
+            rows,
+        ),
+    )
+    monkeypatch.setattr(
+        "replicalab.training.cli.plot_evaluation_bars",
+        lambda *args, **kwargs: None,
+    )
+    monkeypatch.setattr(
+        "replicalab.training.cli.plot_benchmark_history",
+        lambda *args, **kwargs: None,
+    )
+
+    exit_code = main(
+        [
+            "scientist-local-compare-eval",
+            "--persist-root",
+            str(tmp_path),
+            "--run-name",
+            "local-compare-test",
+            "--adapter-dir",
+            str(tmp_path / "adapter"),
+            "--case-count",
+            "1",
+            "--case-offset",
+            "7",
+        ]
+    )
+
+    assert exit_code == 0
+    summary_path = tmp_path / "local-compare-test" / "reports" / "summary.json"
+    metrics_path = tmp_path / "local-compare-test" / "reports" / "metrics.jsonl"
+    cases_path = tmp_path / "local-compare-test" / "manifests" / "evaluation_cases.json"
+    payload = json.loads(summary_path.read_text(encoding="utf-8"))
+    assert payload["case_count"] == 1
+    assert payload["unique_expected_papers"] == 1
+    metrics_lines = metrics_path.read_text(encoding="utf-8").strip().splitlines()
+    assert len(metrics_lines) == 2
+    first_metric = json.loads(metrics_lines[0])
+    assert first_metric["case_index"] == 7
+    assert first_metric["expected_evidence_id"] == "ml:paper-1"
+    assert cases_path.exists()
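One detail worth noting from `build_local_scientist_policy` in this commit: greedy decoding is selected whenever `--temperature` is `0.0`, because `do_sample` is derived from the temperature and the `temperature` key is only added when sampling is on. A dependency-free sketch of that kwarg construction (illustrative; the real code also passes tensors and `pad_token_id` to `model.generate`):

```python
# Sketch of the generation-kwarg gating in build_local_scientist_policy:
# temperature 0.0 -> greedy decode (do_sample=False, no temperature key),
# temperature > 0 -> sampling with that temperature.
def build_generation_kwargs(max_new_tokens: int, temperature: float) -> dict:
    kwargs = {
        "max_new_tokens": max_new_tokens,
        "do_sample": temperature > 0.0,
    }
    if temperature > 0.0:
        kwargs["temperature"] = temperature
    return kwargs

print(build_generation_kwargs(450, 0.0))
# {'max_new_tokens': 450, 'do_sample': False}
print(build_generation_kwargs(450, 0.7))
# {'max_new_tokens': 450, 'do_sample': True, 'temperature': 0.7}
```

Omitting the `temperature` key at `0.0` also avoids the transformers warning about passing `temperature` while `do_sample=False`.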