Add local H100 scientist eval tooling
- docs/ayush/notes.md +30 -0
- docs/changes.md +1 -0
- docs/completion.md +9 -0
- replicalab/training/__init__.py +8 -0
- replicalab/training/cli.py +162 -0
- replicalab/training/local_eval.py +210 -0
- tests/test_local_eval.py +28 -0
- tests/test_training_cli.py +129 -0
docs/ayush/notes.md
CHANGED
@@ -84,3 +84,33 @@ Current localhost model-runtime note:
 - clamps duration to the current lab time limit
 - If the local model stalls or errors, `/agent-step` falls back to the deterministic baseline Scientist and records that in the step metadata as `scientist_runtime=ollama_fallback`.
 
+Current March 9 H100 benchmark note:
+
+- The full multi-round `scientist-local-compare-eval` path is live on the
+  Northflank H100 notebook, but the current notebook image is missing the fast
+  linear-attention path for the saved `unsloth/Qwen3.5-0.8B` adapter, so large
+  sharded rollout sweeps did not flush artifacts on a practical same-turn
+  timescale.
+- A fallback live H100 first-step benchmark was run on 2026-03-09 instead:
+  `250` shared reset cases with both baseline and trained Scientist first-step
+  actions, for `500` total simulations.
+- The merged artifact root is
+  `replicalab/outputs/training/h100-one-step-500-20260309/`.
+- The benchmark spans `34` trainable papers.
+- Summary result:
+  - baseline average first-step paper understanding: `0.61692084`
+  - trained average first-step paper understanding: `0.063866752`
+  - baseline average first-step reward: `0.3`
+  - trained average first-step reward: `0.05`
+  - trained request-info rate: `1.0`
+  - invalid-action rate stayed `0.0` for both labels
+- Scenario-level understanding:
+  - baseline `finance_trading`: `0.596033`
+  - trained `finance_trading`: `0.018182`
+  - baseline `ml_benchmark`: `0.633333`
+  - trained `ml_benchmark`: `0.099762`
+- Current interpretation: the saved `replicalab-qwen3.5-grpo` adapter is
+  materially worse than the deterministic baseline on first-step paper
+  grounding and currently behaves like a universal `request_info` policy under
+  a fast decode budget.
+
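The summary numbers above are plain per-label means over the first-step metric rows. A minimal sketch of that aggregation, using hypothetical row dicts shaped like the per-case `metrics.jsonl` entries (the field names and values here are illustrative assumptions, not the exact schema):

```python
from statistics import mean

# Hypothetical per-case metric rows, one per (label, case) pair.
rows = [
    {"label": "baseline", "paper_understanding": 0.62, "reward": 0.3, "action_type": "run_experiment"},
    {"label": "baseline", "paper_understanding": 0.61, "reward": 0.3, "action_type": "run_experiment"},
    {"label": "trained", "paper_understanding": 0.06, "reward": 0.05, "action_type": "request_info"},
    {"label": "trained", "paper_understanding": 0.07, "reward": 0.05, "action_type": "request_info"},
]


def summarize(rows: list[dict], label: str) -> dict:
    """Average the first-step metrics for one policy label."""
    picked = [r for r in rows if r["label"] == label]
    return {
        "avg_paper_understanding": mean(r["paper_understanding"] for r in picked),
        "avg_reward": mean(r["reward"] for r in picked),
        # Fraction of cases where the policy asked for info instead of acting.
        "request_info_rate": mean(
            1.0 if r["action_type"] == "request_info" else 0.0 for r in picked
        ),
    }


print(summarize(rows, "trained"))
```

A `request_info_rate` of `1.0` for the trained label, as in the benchmark above, means every trained sample collapsed to the same action type.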
docs/changes.md
CHANGED
@@ -94,4 +94,5 @@ Rules:
 | 2026-03-09 | Person B (Ayush) | Post-MVP training refinement | Shifted the active training iteration from the older `Qwen3-8B` assumption to `Qwen3.5-9B`, added prompt-goal expansion plus paper-understanding and communication metrics, and started persisting cross-run benchmark history plots | Model quality is now the bottleneck, so the next useful work is better training coverage and evaluation signal rather than more plumbing; the user also requested a clearer separation between immediate metric work and a later execution-environment redesign | Scientist and Lab Manager defaults now target `Qwen/Qwen3.5-9B`, eval outputs now track `paper_understanding` and `communication_quality`, shared benchmark history now accumulates under `replicalab/outputs/training/history/`, and `docs/training_goals.md` records the larger execution-env phase as a separate architecture track | Keep the deterministic judge as the reward source; treat any large-model judge such as `Qwen3.5-122B-A10B` as audit-only until an explicit architecture change is approved |
 | 2026-03-09 | Person B (Ayush) | Deployment reality check for HF + Northflank | Recorded the current hosted-model and training-launch blockers after verifying the live tokens and remote resources instead of assuming the documented path was still operational | The project docs described HF-heavy hosting and Northflank H100 training as available paths, but the current HF account is not billable and the current Northflank training job is not runnable yet | Verified via live checks that the HF token authenticates but the account reports `canPay=false` with no orgs, that `replicalab-train` returns `409 No deployment configured` when started, and that the live `replicalab-ai` container on `nf-gpu-hack-16-64` does not expose `nvidia-smi` or `/dev/nvidia*` | Before promising heavy-model hosting or H100 training, attach a runnable image to the job, re-probe GPU visibility from inside the runtime, and enable a billing-backed HF account or move serving to another provider |
 | 2026-03-09 | Person B (Ayush) | Northflank notebook validation | Validated the separate Northflank notebook service after the original pasted notebook hostname turned out to be stale | The repo previously had an unrunnable training job but the team also had a live Jupyter route; without checking the actual service, it was unclear whether H100 access existed, whether the notebook credentials worked, and whether the saved training state was usable | Verified the live `notebook-openport/jupyter-pytorch` service, confirmed successful Jupyter login, confirmed in-container `NVIDIA H100 80GB HBM3`, identified the live notebook DNS `app--jupyter-pytorch--9y6g97v7czb9.code.run`, and inspected the saved GRPO outputs/logs showing checkpoints through step 200 followed by a chat-template/content-format failure | Use the notebook as the current heavy-run path only after reconciling its repo state with the main workspace and fixing the `apply_chat_template` message-format bug |
+| 2026-03-09 | Person B (Ayush) | H100 paper-understanding benchmark | Shifted the active H100 benchmark from a planned full multi-round rollout sweep to a first-step live environment benchmark on the same notebook | The current notebook image lacks the fast linear-attention path for the saved `unsloth/Qwen3.5-0.8B` adapter, so repeated sharded `scientist-local-compare-eval` attempts stayed active for a long time without producing same-turn artifacts even after retry and token-budget cuts | Produced a merged live H100 benchmark artifact set at `replicalab/outputs/training/h100-one-step-500-20260309/` covering `500` total simulations (`250` shared reset cases × baseline/trained first-step actions); the current saved adapter underperformed badly versus the deterministic baseline on first-step paper understanding and collapsed to `request_info` on every trained sample | If a full multi-round benchmark is still required later, first fix the notebook image to restore the fast attention path or move the eval to a more efficient runtime |
 
docs/completion.md
CHANGED
@@ -25,6 +25,15 @@ Source of truth: `ReplicaLab_Comprehensive_Task_Division.md`
 | Remaining | 0 |
 | **Completion rate** | **100.00%** |
 
+Post-MVP benchmark note:
+
+- On 2026-03-09, a live Northflank H100 first-step benchmark was added as an
+  operational post-MVP artifact under
+  `replicalab/outputs/training/h100-one-step-500-20260309/`.
+- It covers `500` total simulations (`250` shared reset cases × baseline and
+  trained first-step actions) and records paper-understanding regression data
+  for the current saved Scientist adapter.
+
 ### Completion by Person
 
 | Person | Assigned | Completed (own) | Completed (by others) | Remaining | Rate |
replicalab/training/__init__.py
CHANGED
@@ -25,6 +25,11 @@ from replicalab.training.lab_manager_sft import (
     preview_lab_manager_training,
     train_lab_manager_sft,
 )
+from replicalab.training.local_eval import (
+    PaperBalancedEvaluationCase,
+    build_local_scientist_policy,
+    build_trainable_paper_cases,
+)
 from replicalab.training.metrics import EvaluationSummary, summarize_episodes
 from replicalab.training.rollout import EpisodeRecord, RolloutWorker, StepRecord
 from replicalab.training.scientist_grpo import (
@@ -46,6 +51,7 @@ __all__ = [
     "FrozenEvidencePack",
     "LabManagerSFTConfig",
     "LabManagerSFTExample",
+    "PaperBalancedEvaluationCase",
     "RolloutWorker",
     "ScientistGRPOConfig",
     "ScientistPromptExample",
@@ -55,6 +61,7 @@ __all__ = [
     "build_lab_manager_sft_examples",
     "build_scientist_prompt_examples",
     "evaluate_policy",
+    "build_local_scientist_policy",
     "load_frozen_evidence_packs",
     "preview_lab_manager_training",
     "preview_scientist_training",
@@ -62,4 +69,5 @@ __all__ = [
     "summarize_episodes",
     "train_lab_manager_sft",
     "train_scientist_grpo",
+    "build_trainable_paper_cases",
 ]
replicalab/training/cli.py
CHANGED
@@ -30,6 +30,10 @@ from replicalab.training.history import (
     build_benchmark_history_row,
     load_benchmark_history,
 )
+from replicalab.training.local_eval import (
+    build_local_scientist_policy,
+    build_trainable_paper_cases,
+)
 from replicalab.training.lab_manager_sft import (
     LabManagerSFTConfig,
     preview_lab_manager_training,
@@ -67,6 +71,8 @@ def main(argv: Sequence[str] | None = None) -> int:
         return _run_baseline_eval(args)
     if args.command == "scientist-compare-eval":
         return _run_scientist_compare_eval(args)
+    if args.command == "scientist-local-compare-eval":
+        return _run_scientist_local_compare_eval(args)
     if args.command == "art-scientist-train":
         return _run_art_scientist_train(args)
 
@@ -261,6 +267,69 @@ def _build_parser() -> argparse.ArgumentParser:
         help="Sampling temperature for the trained remote Scientist.",
     )
 
+    local_compare_eval = subparsers.add_parser(
+        "scientist-local-compare-eval",
+        help="Compare baseline Scientist versus a local trained LoRA adapter.",
+    )
+    _add_common_artifact_args(local_compare_eval, prefix="eval-local-compare")
+    local_compare_eval.add_argument(
+        "--base-url",
+        default="https://ayushozha-replicalab.hf.space",
+        help="ReplicaLab environment base URL.",
+    )
+    local_compare_eval.add_argument(
+        "--transport",
+        default="rest",
+        choices=("rest", "ws"),
+        help="Transport used by ReplicaLabClient.",
+    )
+    local_compare_eval.add_argument(
+        "--adapter-dir",
+        required=True,
+        help="Path to the trained local Scientist adapter directory.",
+    )
+    local_compare_eval.add_argument(
+        "--base-model",
+        default="Qwen/Qwen3.5-9B",
+        help="Base model used by the local adapter.",
+    )
+    local_compare_eval.add_argument(
+        "--case-count",
+        type=int,
+        default=500,
+        help="Number of live rollout simulations to run.",
+    )
+    local_compare_eval.add_argument(
+        "--case-offset",
+        type=int,
+        default=0,
+        help="Starting case index for sharded live rollout runs.",
+    )
+    local_compare_eval.add_argument(
+        "--difficulties",
+        nargs="+",
+        default=["easy", "medium", "hard"],
+        help="Difficulty levels to cycle through when building the live rollout set.",
+    )
+    local_compare_eval.add_argument(
+        "--max-completion-tokens",
+        type=int,
+        default=450,
+        help="Max completion tokens for the local trained Scientist.",
+    )
+    local_compare_eval.add_argument(
+        "--temperature",
+        type=float,
+        default=0.0,
+        help="Sampling temperature for the local trained Scientist.",
+    )
+    local_compare_eval.add_argument(
+        "--max-retries",
+        type=int,
+        default=2,
+        help="Maximum parse-retry attempts for the local trained Scientist.",
+    )
+
     art_train = subparsers.add_parser(
         "art-scientist-train",
         help="Run ART serverless RL training against the ReplicaLab OpenEnv deployment.",
@@ -683,6 +752,99 @@ def _run_scientist_compare_eval(args: argparse.Namespace) -> int:
     return 0
 
 
+def _run_scientist_local_compare_eval(args: argparse.Namespace) -> int:
+    layout = _build_layout(
+        prefix="eval-local-compare",
+        persist_root=args.persist_root,
+        run_name=args.run_name,
+    )
+    case_specs = build_trainable_paper_cases(
+        args.case_count,
+        case_index_offset=args.case_offset,
+        difficulties=args.difficulties,
+    )
+    cases = [spec.to_evaluation_case() for spec in case_specs]
+    trained_policy = build_local_scientist_policy(
+        base_model=args.base_model,
+        adapter_dir=args.adapter_dir,
+        max_completion_tokens=args.max_completion_tokens,
+        temperature=args.temperature,
+        max_retries=args.max_retries,
+    )
+    records_by_label, rows = compare_policies(
+        base_url=args.base_url,
+        policies=[
+            ("baseline", build_baseline_scientist_action),
+            ("trained", trained_policy),
+        ],
+        cases=cases,
+        transport=args.transport,
+    )
+    write_json(
+        layout.manifests_dir / "evaluation_cases.json",
+        [spec.model_dump(mode="json") for spec in case_specs],
+    )
+    _write_run_metadata(
+        layout,
+        {
+            "kind": "scientist_local_compare_eval",
+            "base_url": args.base_url,
+            "transport": args.transport,
+            "adapter_dir": args.adapter_dir,
+            "base_model": args.base_model,
+            "case_count": args.case_count,
+            "case_offset": args.case_offset,
+            "difficulties": args.difficulties,
+            "max_retries": args.max_retries,
+            "bounded_tool_policy": [
+                "search_evidence",
+                "run_code_check",
+                "inspect_image",
+            ],
+        },
+    )
+    for label, records in records_by_label.items():
+        for spec, record in zip(case_specs, records):
+            append_jsonl(
+                layout.metrics_jsonl,
+                {
+                    "label": label,
+                    "case_index": spec.case_index,
+                    "expected_evidence_id": spec.expected_evidence_id,
+                    "expected_paper_title": spec.expected_paper_title,
+                    **episode_to_metrics(record).model_dump(mode="json"),
+                },
+            )
+    rows_payload = [row.model_dump(mode="json") for row in rows]
+    unique_papers = len({spec.expected_evidence_id for spec in case_specs})
+    write_json(
+        layout.summary_json,
+        {
+            "rows": rows_payload,
+            "case_count": len(case_specs),
+            "unique_expected_papers": unique_papers,
+        },
+    )
+    _plot_comparison_summary(rows_payload, layout=layout)
+    _append_history_and_plots(
+        layout=layout,
+        kind="scientist_local_compare_eval",
+        rows=rows_payload,
+    )
+    print(
+        json.dumps(
+            {
+                "rows": rows_payload,
+                "case_count": len(case_specs),
+                "unique_expected_papers": unique_papers,
+            },
+            indent=2,
+            sort_keys=True,
+        )
+    )
+    return 0
+
+
 def _run_art_scientist_train(args: argparse.Namespace) -> int:
     layout = _build_layout(
         prefix="art-scientist",
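The `--case-count`/`--case-offset` pair exists so a large run can be split into shards and merged afterwards. When combining shard summaries, averages must be re-weighted by each shard's case count rather than averaged naively. A small illustrative sketch (the `case_count`/`average_reward` keys mirror the summary JSON written above, but this helper is not part of the CLI):

```python
def merge_shard_summaries(shards: list[dict]) -> dict:
    """Merge per-shard eval summaries, weighting each average by the
    number of cases the shard actually ran."""
    total = sum(s["case_count"] for s in shards)
    weighted = sum(s["average_reward"] * s["case_count"] for s in shards)
    return {"case_count": total, "average_reward": weighted / total}


# A 200-case shard and a 50-case shard merge to 250 cases with a
# count-weighted mean reward, not the naive mean of the two shard averages.
merged = merge_shard_summaries(
    [
        {"case_count": 200, "average_reward": 0.3},
        {"case_count": 50, "average_reward": 0.2},
    ]
)
print(merged)  # {'case_count': 250, 'average_reward': 0.28}
```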
replicalab/training/local_eval.py
ADDED
|
@@ -0,0 +1,210 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Helpers for local-adapter evaluation against live ReplicaLab rollouts."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
from pathlib import Path
|
| 6 |
+
from typing import Callable, Sequence
|
| 7 |
+
|
| 8 |
+
from pydantic import BaseModel, ConfigDict
|
| 9 |
+
|
| 10 |
+
from replicalab.agents.scientist_policy import (
|
| 11 |
+
ScientistOutputParseError,
|
| 12 |
+
_build_live_scientist_system_prompt,
|
| 13 |
+
call_scientist_with_retry,
|
| 14 |
+
format_scientist_observation,
|
| 15 |
+
)
|
| 16 |
+
from replicalab.models import ScientistAction, ScientistActionType, ScientistObservation
|
| 17 |
+
from replicalab.training.corpus import load_frozen_evidence_packs, select_evidence_pack
|
| 18 |
+
from replicalab.training.evaluation import EvaluationCase
|
| 19 |
+
from replicalab.training.runtime import require_module
|
| 20 |
+
|
| 21 |
+
|
| 22 |
+
class PaperBalancedEvaluationCase(BaseModel):
|
| 23 |
+
"""One deterministic rollout case with expected paper metadata."""
|
| 24 |
+
|
| 25 |
+
model_config = ConfigDict(extra="forbid")
|
| 26 |
+
|
| 27 |
+
case_index: int
|
| 28 |
+
seed: int
|
| 29 |
+
scenario: str
|
| 30 |
+
difficulty: str
|
| 31 |
+
expected_evidence_id: str
|
| 32 |
+
expected_paper_title: str
|
| 33 |
+
|
| 34 |
+
def to_evaluation_case(self) -> EvaluationCase:
|
| 35 |
+
return EvaluationCase(
|
| 36 |
+
seed=self.seed,
|
| 37 |
+
scenario=self.scenario,
|
| 38 |
+
difficulty=self.difficulty,
|
| 39 |
+
)
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
def build_trainable_paper_cases(
|
| 43 |
+
total_cases: int,
|
| 44 |
+
*,
|
| 45 |
+
case_index_offset: int = 0,
|
| 46 |
+
difficulties: Sequence[str] = ("easy", "medium", "hard"),
|
| 47 |
+
) -> list[PaperBalancedEvaluationCase]:
|
| 48 |
+
"""Build a deterministic live-eval set balanced across trainable papers."""
|
| 49 |
+
|
| 50 |
+
if total_cases < 1:
|
| 51 |
+
raise ValueError("total_cases must be at least 1")
|
| 52 |
+
if case_index_offset < 0:
|
| 53 |
+
raise ValueError("case_index_offset must be at least 0")
|
| 54 |
+
if not difficulties:
|
| 55 |
+
raise ValueError("difficulties must not be empty")
|
| 56 |
+
|
| 57 |
+
packs = [pack for pack in load_frozen_evidence_packs() if pack.trainable_in_env]
|
| 58 |
+
if not packs:
|
| 59 |
+
raise ValueError("No trainable evidence packs are wired into the current env.")
|
| 60 |
+
|
| 61 |
+
by_template: dict[str, list[object]] = {}
|
| 62 |
+
for pack in packs:
|
| 63 |
+
assert pack.template is not None
|
| 64 |
+
by_template.setdefault(pack.template, []).append(pack)
|
| 65 |
+
for template in by_template:
|
| 66 |
+
by_template[template] = sorted(
|
| 67 |
+
by_template[template],
|
| 68 |
+
key=lambda pack: pack.scenario_number, # type: ignore[attr-defined]
|
| 69 |
+
)
|
| 70 |
+
|
| 71 |
+
targets: list[tuple[str, int, object]] = []
|
| 72 |
+
for template in sorted(by_template):
|
| 73 |
+
for pack_index, pack in enumerate(by_template[template]):
|
| 74 |
+
targets.append((template, pack_index, pack))
|
| 75 |
+
|
| 76 |
+
cases: list[PaperBalancedEvaluationCase] = []
|
| 77 |
+
for local_index in range(total_cases):
|
| 78 |
+
case_index = case_index_offset + local_index
|
| 79 |
+
template, pack_index, pack = targets[case_index % len(targets)]
|
| 80 |
+
cycle_index = case_index // len(targets)
|
| 81 |
+
template_pack_count = len(by_template[template])
|
| 82 |
+
seed = pack_index + cycle_index * template_pack_count
|
| 83 |
+
difficulty = difficulties[(case_index + pack_index) % len(difficulties)]
|
| 84 |
+
cases.append(
|
| 85 |
+
PaperBalancedEvaluationCase(
|
| 86 |
+
case_index=case_index,
|
| 87 |
+
seed=seed,
|
| 88 |
+
scenario=template,
|
| 89 |
+
difficulty=difficulty,
|
| 90 |
+
expected_evidence_id=pack.evidence_id, # type: ignore[attr-defined]
|
| 91 |
+
expected_paper_title=pack.downloaded_paper_title, # type: ignore[attr-defined]
|
| 92 |
+
)
|
| 93 |
+
)
|
| 94 |
+
|
| 95 |
+
return cases
|
| 96 |
+
|
| 97 |
+
|
| 98 |
+
def build_local_scientist_policy(
|
| 99 |
+
*,
|
| 100 |
+
base_model: str,
|
| 101 |
+
adapter_dir: str | Path,
|
| 102 |
+
max_completion_tokens: int = 450,
|
| 103 |
+
temperature: float = 0.0,
|
| 104 |
+
max_retries: int = 2,
|
| 105 |
+
) -> Callable[[ScientistObservation], ScientistAction]:
|
| 106 |
+
"""Create a sync Scientist policy callable backed by a local PEFT adapter."""
|
| 107 |
+
|
| 108 |
+
torch = require_module("torch")
|
| 109 |
+
transformers = require_module("transformers")
|
| 110 |
+
peft = require_module("peft")
|
| 111 |
+
|
| 112 |
+
adapter_path = Path(adapter_dir).expanduser().resolve()
|
| 113 |
+
if not adapter_path.exists():
|
| 114 |
+
raise FileNotFoundError(f"Adapter directory does not exist: {adapter_path}")
|
| 115 |
+
|
| 116 |
+
tokenizer = transformers.AutoTokenizer.from_pretrained(
|
| 117 |
+
str(adapter_path),
|
| 118 |
+
trust_remote_code=True,
|
| 119 |
+
)
|
| 120 |
+
model = transformers.AutoModelForCausalLM.from_pretrained(
|
| 121 |
+
base_model,
|
| 122 |
+
torch_dtype=torch.bfloat16,
|
| 123 |
+
device_map="auto",
|
| 124 |
+
trust_remote_code=True,
|
| 125 |
+
)
|
| 126 |
+
model = peft.PeftModel.from_pretrained(model, str(adapter_path))
|
| 127 |
+
model.eval()
|
| 128 |
+
device = next(model.parameters()).device
|
| 129 |
+
|
| 130 |
+
evidence_packs = [
|
| 131 |
+
pack for pack in load_frozen_evidence_packs() if pack.trainable_in_env
|
| 132 |
+
]
|
| 133 |
+
|
| 134 |
+
def generate_fn(messages: list[dict[str, str]]) -> str:
|
| 135 |
+
prompt_text = tokenizer.apply_chat_template(
|
| 136 |
+
messages,
|
| 137 |
+
tokenize=False,
|
| 138 |
+
add_generation_prompt=True,
|
| 139 |
+
)
|
| 140 |
+
enc = tokenizer(prompt_text, return_tensors="pt").to(device)
|
| 141 |
+
generation_kwargs = {
|
| 142 |
+
"input_ids": enc["input_ids"],
|
| 143 |
+
"attention_mask": enc["attention_mask"],
|
| 144 |
+
"max_new_tokens": max_completion_tokens,
|
| 145 |
+
"pad_token_id": tokenizer.eos_token_id,
|
| 146 |
+
"do_sample": temperature > 0.0,
|
| 147 |
+
}
|
| 148 |
+
if temperature > 0.0:
|
| 149 |
+
generation_kwargs["temperature"] = temperature
|
| 150 |
+
with torch.no_grad():
|
| 151 |
+
outputs = model.generate(**generation_kwargs)
|
| 152 |
+
generated_ids = outputs[0][enc["input_ids"].shape[1]:]
|
| 153 |
+
return tokenizer.decode(generated_ids, skip_special_tokens=True)
|
| 154 |
+
|
| 155 |
+
def policy_fn(
|
| 156 |
+
observation: ScientistObservation,
|
| 157 |
+
*,
|
| 158 |
+
seed: int | None = None,
|
| 159 |
+
scenario: str | None = None,
|
| 160 |
+
difficulty: str | None = None,
|
| 161 |
+
) -> ScientistAction:
|
| 162 |
+
evidence_pack = None
|
| 163 |
+
if seed is not None and scenario is not None:
|
| 164 |
+
evidence_pack = select_evidence_pack(
|
| 165 |
+
evidence_packs,
|
| 166 |
+
template=scenario, # type: ignore[arg-type]
|
| 167 |
+
seed=seed,
|
| 168 |
+
)
|
| 169 |
+
|
| 170 |
+
user_message = format_scientist_observation(observation)
|
| 171 |
+
if evidence_pack is not None:
|
| 172 |
+
user_message += "\n\nFrozen evidence pack:\n" + evidence_pack.prompt_block()
|
| 173 |
+
|
| 174 |
+
try:
|
| 175 |
+
result = call_scientist_with_retry(
|
| 176 |
+
generate_fn,
|
| 177 |
+
_build_live_scientist_system_prompt(
|
| 178 |
+
observation,
|
| 179 |
+
evidence_pack=evidence_pack,
|
| 180 |
+
difficulty=difficulty,
|
| 181 |
+
scenario=scenario,
|
| 182 |
+
),
|
| 183 |
+
observation,
|
| 184 |
+
max_retries=max_retries,
|
| 185 |
+
user_message_override=user_message,
|
| 186 |
+
)
|
| 187 |
+
return result.action
|
| 188 |
+
except ScientistOutputParseError:
|
| 189 |
+
return ScientistAction(
|
| 190 |
+
action_type=ScientistActionType.REQUEST_INFO,
|
| 191 |
+
sample_size=0,
|
| 192 |
+
controls=[],
|
| 193 |
+
technique="",
|
| 194 |
+
duration_days=0,
|
| 195 |
+
required_equipment=[],
|
| 196 |
+
required_reagents=[],
|
| 197 |
+
questions=[
|
| 198 |
+
"Please restate the main blocking requirement or missing evidence."
|
| 199 |
+
],
|
| 200 |
+
rationale="",
|
| 201 |
+
)
|
| 202 |
+
|
| 203 |
+
return policy_fn
|
| 204 |
+
|
| 205 |
+
|
| 206 |
+
__all__ = [
|
| 207 |
+
"PaperBalancedEvaluationCase",
|
| 208 |
+
"build_local_scientist_policy",
|
| 209 |
+
"build_trainable_paper_cases",
|
| 210 |
+
]
|
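The `except ScientistOutputParseError` branch above is an instance of a generic retry-then-fallback pattern: attempt structured parsing of the model reply a bounded number of times, then return a deterministic safe action. A minimal self-contained sketch of that pattern (hypothetical names, not the replicalab API):

```python
import json


class ParseError(Exception):
    """Raised when a model reply cannot be parsed into an action."""


def parse_action(reply: str) -> dict:
    # Accept only JSON objects that carry an action_type field.
    try:
        action = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ParseError(str(exc)) from exc
    if not isinstance(action, dict) or "action_type" not in action:
        raise ParseError("missing action_type")
    return action


def act(generate, max_retries: int = 2) -> dict:
    # Up to max_retries + 1 attempts, then a deterministic safe fallback,
    # mirroring the REQUEST_INFO action returned in the diff above.
    for _ in range(max_retries + 1):
        try:
            return parse_action(generate())
        except ParseError:
            continue
    return {
        "action_type": "request_info",
        "questions": ["Please restate the main blocking requirement."],
    }
```

The fallback keeps the episode loop alive on malformed output instead of crashing the evaluation run.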
tests/test_local_eval.py
ADDED
@@ -0,0 +1,28 @@
+from __future__ import annotations
+
+from replicalab.training.local_eval import build_trainable_paper_cases
+
+
+def test_build_trainable_paper_cases_builds_exact_requested_count() -> None:
+    cases = build_trainable_paper_cases(50)
+
+    assert len(cases) == 50
+    assert all(case.scenario in {"ml_benchmark", "finance_trading"} for case in cases)
+    assert len({case.expected_evidence_id for case in cases[:34]}) == 34
+    assert len({case.expected_evidence_id for case in cases}) >= 34
+
+
+def test_build_trainable_paper_cases_rejects_non_positive_count() -> None:
+    try:
+        build_trainable_paper_cases(0)
+    except ValueError as exc:
+        assert "at least 1" in str(exc)
+    else:
+        raise AssertionError("Expected ValueError for non-positive case count")
+
+
+def test_build_trainable_paper_cases_supports_offsets() -> None:
+    cases = build_trainable_paper_cases(3, case_index_offset=34)
+
+    assert [case.case_index for case in cases] == [34, 35, 36]
+    assert len({case.expected_evidence_id for case in cases}) == 3
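Together these tests pin down the builder's contract: an exact case count, round-robin coverage of the 34-paper pool (so any window of 34 consecutive cases hits 34 distinct papers), and offsets that continue the global case index. A minimal self-contained sketch of a builder with that contract (hypothetical pool composition and names, not the replicalab implementation):

```python
from dataclasses import dataclass

# Hypothetical 34-paper pool spanning both scenarios.
_PAPER_POOL = [
    (f"ml:paper-{i}", "ml_benchmark") if i % 2 == 0
    else (f"fin:paper-{i}", "finance_trading")
    for i in range(34)
]


@dataclass(frozen=True)
class Case:
    case_index: int
    scenario: str
    expected_evidence_id: str


def build_cases(count: int, *, case_index_offset: int = 0) -> list[Case]:
    if count < 1:
        raise ValueError("case count must be at least 1")
    cases = []
    for i in range(count):
        idx = case_index_offset + i
        # Round-robin over the pool keeps any 34-case window paper-distinct.
        evidence_id, scenario = _PAPER_POOL[idx % len(_PAPER_POOL)]
        cases.append(Case(idx, scenario, evidence_id))
    return cases
```

Because the mapping from `case_index` to paper is deterministic, sharded runs started at different offsets never double-count or skip a paper within a pool-sized window.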
tests/test_training_cli.py
CHANGED
@@ -206,3 +206,132 @@ def test_scientist_compare_eval_cli_writes_rows(tmp_path, monkeypatch) -> None:
     assert [row["label"] for row in payload["rows"]] == ["baseline", "trained"]
     assert payload["rows"][1]["average_reward"] == 3.5
     assert history_path.exists()
+
+
+def test_scientist_local_compare_eval_cli_writes_cases_and_metrics(tmp_path, monkeypatch) -> None:
+    baseline_record = EpisodeRecord(
+        seed=0,
+        scenario="ml_benchmark",
+        difficulty="easy",
+        episode_id="baseline-1",
+        total_reward=1.0,
+        reward_breakdown=RewardBreakdown(rigor=0.3, feasibility=0.4, fidelity=0.5),
+        verdict="timeout",
+        agreement_reached=False,
+    )
+    trained_record = EpisodeRecord(
+        seed=0,
+        scenario="ml_benchmark",
+        difficulty="easy",
+        episode_id="trained-1",
+        total_reward=2.5,
+        reward_breakdown=RewardBreakdown(rigor=0.7, feasibility=0.8, fidelity=0.75),
+        verdict="accept",
+        agreement_reached=True,
+    )
+    rows = [
+        PolicyComparisonRow(
+            label="baseline",
+            episode_count=1,
+            average_reward=1.0,
+            average_rounds=2.0,
+            agreement_rate=0.0,
+            invalid_action_rate=0.0,
+            average_invalid_bounded_tool_rate=0.0,
+            average_rigor=0.3,
+            average_feasibility=0.4,
+            average_fidelity=0.5,
+            average_parsimony=1.0,
+            average_tool_trace_count=0.0,
+            average_paper_understanding=0.2,
+            average_communication_quality=0.0,
+        ),
+        PolicyComparisonRow(
+            label="trained",
+            episode_count=1,
+            average_reward=2.5,
+            average_rounds=1.0,
+            agreement_rate=1.0,
+            invalid_action_rate=0.0,
+            average_invalid_bounded_tool_rate=0.0,
+            average_rigor=0.7,
+            average_feasibility=0.8,
+            average_fidelity=0.75,
+            average_parsimony=1.0,
+            average_tool_trace_count=0.0,
+            average_paper_understanding=0.6,
+            average_communication_quality=0.0,
+        ),
+    ]
+
+    class _CaseSpec:
+        case_index = 7
+        expected_evidence_id = "ml:paper-1"
+        expected_paper_title = "Paper 1"
+
+        def to_evaluation_case(self) -> object:
+            return object()
+
+        def model_dump(self, mode: str = "json") -> dict[str, object]:
+            return {
+                "case_index": 7,
+                "seed": 0,
+                "scenario": "ml_benchmark",
+                "difficulty": "easy",
+                "expected_evidence_id": "ml:paper-1",
+                "expected_paper_title": "Paper 1",
+            }
+
+    monkeypatch.setattr(
+        "replicalab.training.cli.build_trainable_paper_cases",
+        lambda *args, **kwargs: [_CaseSpec()],
+    )
+    monkeypatch.setattr(
+        "replicalab.training.cli.build_local_scientist_policy",
+        lambda **_: (lambda _obs: None),
+    )
+    monkeypatch.setattr(
+        "replicalab.training.cli.compare_policies",
+        lambda **_: (
+            {"baseline": [baseline_record], "trained": [trained_record]},
+            rows,
+        ),
+    )
+    monkeypatch.setattr(
+        "replicalab.training.cli.plot_evaluation_bars",
+        lambda *args, **kwargs: None,
+    )
+    monkeypatch.setattr(
+        "replicalab.training.cli.plot_benchmark_history",
+        lambda *args, **kwargs: None,
+    )
+
+    exit_code = main(
+        [
+            "scientist-local-compare-eval",
+            "--persist-root",
+            str(tmp_path),
+            "--run-name",
+            "local-compare-test",
+            "--adapter-dir",
+            str(tmp_path / "adapter"),
+            "--case-count",
+            "1",
+            "--case-offset",
+            "7",
+        ]
+    )
+
+    assert exit_code == 0
+    summary_path = tmp_path / "local-compare-test" / "reports" / "summary.json"
+    metrics_path = tmp_path / "local-compare-test" / "reports" / "metrics.jsonl"
+    cases_path = tmp_path / "local-compare-test" / "manifests" / "evaluation_cases.json"
+    payload = json.loads(summary_path.read_text(encoding="utf-8"))
+    assert payload["case_count"] == 1
+    assert payload["unique_expected_papers"] == 1
+    metrics_lines = metrics_path.read_text(encoding="utf-8").strip().splitlines()
+    assert len(metrics_lines) == 2
+    first_metric = json.loads(metrics_lines[0])
+    assert first_metric["case_index"] == 7
+    assert first_metric["expected_evidence_id"] == "ml:paper-1"
+    assert cases_path.exists()