# Evaluation Architecture

This document describes how evaluation flows through the codebase and the design choices behind the current structure.

## Audience

- **Problem contributors** (see `CONTRIBUTING.md`)
- **Model submitters** (see `SUBMIT.md`)
- **General researchers** using Frontier-CS to evaluate solutions

## Goals

- Clear separation between single-problem and batch evaluation.
- Shared validation and config parsing across research backends.
- Predictable cleanup to avoid orphaned cloud resources.
- Explicit naming to avoid backend ambiguity.

## Architecture at a Glance

- **CLI**:
  - `frontier eval` → `SingleEvaluator`
  - `frontier batch` → `BatchEvaluator`
  - `frontier list` / `frontier show` → problem discovery
- **CI**:
  - Validate Problems → `scripts/validate_problems.py` → `SingleEvaluator`
  - Weekly Batch Evaluation → `scripts/run_eval.sh` → `BatchEvaluator`

## Components

### SingleEvaluator

Unified API for **single-problem evaluation**:

- Selects a runner based on track and backend.
- Registers SIGINT/atexit hooks to tear down SkyPilot clusters on exit.

### BatchEvaluator

Orchestrates **batch evaluation** with parallel workers and SkyPilot cluster pools:

- Work queues with resumable state and result aggregation.
- Resource-grouped cluster pools: pairs are grouped by `ResourceSignature` (cloud, accelerators, instance type) so that CPU-only and GPU problems run on separate pools.
- Hash-based resume: each result stores solution/problem hashes so stale results are automatically re-evaluated when source changes.

### Runners

Runners execute evaluation. The class hierarchy:

```
Runner (ABC)
├── ResearchRunner              # shared: problem validation, config loading, uv install
│   ├── ResearchDockerRunner    # local container
│   └── ResearchSkyPilotRunner  # cloud via SkyPilot
├── AlgorithmicLocalRunner      # go-judge via HTTP
└── AlgorithmicSkyPilotRunner   # go-judge on SkyPilot
```

### Solution Generation (`gen/`)

The `gen/` module generates solutions by calling LLMs.
It is independent of the evaluation pipeline: `frontier eval` and `frontier batch` do not depend on it. It provides an LLM interface, an API key pool for concurrent requests, and solution file formatting.

## Design Decisions

- **Single vs Batch**: `SingleEvaluator` stays focused on one-off runs (simple API + cleanup hooks), while `BatchEvaluator` owns scheduling, resumable state, and cluster pools. This keeps single-run paths lightweight and batch runs scalable.
- **Shared research helpers**: input validation and config parsing live in the `ResearchRunner` base class so Docker and SkyPilot backends stay in sync.
- **Cleanup strategy**: research evaluations tear down clusters by default unless `keep_cluster` is set. `SingleEvaluator` cleans up via an active-cluster registry; `BatchEvaluator` manages its own pool lifecycle.
- **Naming**: runner class names encode track + backend (e.g., `ResearchDockerRunner`) to remove ambiguity in logs and docs.
- **Score semantics**: a score of **0** can mean the evaluator ran successfully; failures are reported via status/metadata rather than score alone.
- **Reference solutions**: problems ship with `reference.cpp`/`reference.py` so CI can verify end-to-end evaluation without model submissions.
- **Results separation**: evaluation outputs go to a dedicated results repository to keep the main repo lean and auditable.
- **Internal vs public**: internal test cases and tooling live in a private repo; public artifacts are kept minimal but compatible.
- **Weekly vs local**: weekly CI uses `scripts/run_eval.sh` with batch scheduling; local runs use the same script or `frontier eval` for quick iteration.
- **Resource-grouped cluster pools**: `BatchEvaluator` groups pairs by `ResourceSignature` (cloud × accelerators × instance type) and creates a separate pool per group, avoiding the waste of running CPU-only problems on GPU clusters.
- **Hash-based resume**: resuming a batch compares solution/problem hashes against stored results.
  Changed inputs are re-evaluated even when a prior result exists, preventing silently stale scores.
- **Generation vs evaluation**: solution generation (`gen/`) is fully decoupled from evaluation. Generated files are plain source files with no special metadata; the evaluator has no dependency on the generation pipeline.

## Runner Flow (Research)

Both research runners share the same pre-evaluation steps (via `ResearchRunner`):

1. Validate solution file and `.FAILED` marker.
2. Verify the problem path exists.
3. Load `config.yaml` and runtime settings.
4. Build uv install command if `uv_project` is specified.

Execution diverges at the backend:

- **Docker**: launches a local container.
- **SkyPilot**: provisions a cloud VM and runs remotely.

## Operations (Cleanup + CI)

- **Cleanup**: research evaluations tear down clusters by default unless `keep_cluster=True`. `SingleEvaluator` uses an active-cluster registry to clean up on SIGINT/atexit; `BatchEvaluator` manages its own cluster pool lifecycle.
- **CI**: problem validation runs single evals; the weekly batch job runs full evaluations on SkyPilot (typically GCP).
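
The cleanup behaviour described above (an active-cluster registry that tears clusters down on SIGINT/atexit unless `keep_cluster` is set) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `ActiveClusterRegistry`, its methods, and the injected `teardown` callback are hypothetical and are not taken from the Frontier-CS codebase. The callback stands in for whatever call actually terminates a SkyPilot cluster.

```python
import atexit
import signal
from typing import Callable, Dict, List


class ActiveClusterRegistry:
    """Hypothetical registry of clusters launched by this process."""

    def __init__(self, teardown: Callable[[str], None]) -> None:
        # `teardown` is an assumed callback that terminates one cluster by name.
        self._teardown = teardown
        # Maps cluster name -> keep_cluster flag.
        self._clusters: Dict[str, bool] = {}

    def register(self, name: str, keep_cluster: bool = False) -> None:
        """Record a cluster as active once it has been launched."""
        self._clusters[name] = keep_cluster

    def unregister(self, name: str) -> None:
        """Forget a cluster that was already torn down on the normal path."""
        self._clusters.pop(name, None)

    def teardown_all(self) -> List[str]:
        """Tear down every registered cluster not marked keep_cluster.

        Returns the names that were torn down successfully. The registry is
        emptied first, so a second call (for example SIGINT followed by
        atexit) is a no-op.
        """
        pending = [name for name, keep in self._clusters.items() if not keep]
        self._clusters.clear()
        torn_down = []
        for name in pending:
            try:
                self._teardown(name)
                torn_down.append(name)
            except Exception as exc:
                # Keep going so one failure does not orphan the other clusters.
                print(f"failed to tear down {name}: {exc}")
        return torn_down

    def install_hooks(self) -> None:
        """Run teardown_all on interpreter exit and on Ctrl-C.

        Must be called from the main thread, because signal.signal only
        works there.
        """
        atexit.register(self.teardown_all)

        def _on_sigint(signum, frame):
            self.teardown_all()
            # Re-raise the usual interrupt so the program still stops.
            raise KeyboardInterrupt

        signal.signal(signal.SIGINT, _on_sigint)
```

In this sketch a single-run evaluator would call `register` right after launching a cluster, `unregister` after a normal teardown, and `install_hooks` once at startup. Injecting the teardown callback keeps the registry testable without touching any cloud API.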