| # Evaluation Architecture |
|
|
| This document describes how evaluation flows through the codebase and the |
| design choices behind the current structure. |
|
|
| ## Audience |
|
|
| - **Problem contributors** (see `CONTRIBUTING.md`) |
| - **Model submitters** (see `SUBMIT.md`) |
| - **General researchers** using Frontier-CS to evaluate solutions |
|
|
| ## Goals |
|
|
| - Clear separation between single-problem and batch evaluation. |
| - Shared validation and config parsing across research backends. |
| - Predictable cleanup to avoid orphaned cloud resources. |
| - Explicit naming to avoid backend ambiguity. |
|
|
| ## Architecture at a Glance |
|
|
| - **CLI**: |
| - `frontier eval` β `SingleEvaluator` |
| - `frontier batch` β `BatchEvaluator` |
| - `frontier list` / `frontier show` β problem discovery |
| - **CI**: |
| - Validate Problems β `scripts/validate_problems.py` β `SingleEvaluator` |
| - Weekly Batch Evaluation β `scripts/run_eval.sh` β `BatchEvaluator` |
|
|
| ## Components |
|
|
| ### SingleEvaluator |
| Unified API for **single-problem evaluation**: |
|
|
| - Selects a runner based on track and backend. |
| - Registers SIGINT/atexit hooks to tear down SkyPilot clusters on exit. |
|
|
| ### BatchEvaluator |
| Orchestrates **batch evaluation** with parallel workers and SkyPilot |
| cluster pools: |
|
|
| - Work queues with resumable state and result aggregation. |
| - Resource-grouped cluster pools β pairs are grouped by |
| `ResourceSignature` (cloud, accelerators, instance type) so that |
| CPU-only and GPU problems run on separate pools. |
| - Hash-based resume β each result stores solution/problem hashes so stale |
| results are automatically re-evaluated when source changes. |
|
|
| ### Runners |
| Runners execute evaluation. The class hierarchy: |
|
|
| ``` |
| Runner (ABC) |
| βββ ResearchRunner # shared: problem validation, config loading, uv install |
| β βββ ResearchDockerRunner # local container |
| β βββ ResearchSkyPilotRunner # cloud via SkyPilot |
| βββ AlgorithmicLocalRunner # go-judge via HTTP |
| βββ AlgorithmicSkyPilotRunner # go-judge on SkyPilot |
| ``` |
|
|
| ### Solution Generation (`gen/`) |
| The `gen/` module generates solutions by calling LLMs. It is independent of |
| the evaluation pipeline β `frontier eval` and `frontier batch` do not |
| depend on it. It provides an LLM interface, an API key pool for concurrent |
| requests, and solution file formatting. |
|
|
| ## Design Decisions |
|
|
| - **Single vs Batch**: `SingleEvaluator` stays focused on one-off runs |
| (simple API + cleanup hooks), while `BatchEvaluator` owns scheduling, |
| resumable state, and cluster pools. This keeps single-run paths |
| lightweight and batch runs scalable. |
| - **Shared research helpers**: input validation and config parsing live in |
| the `ResearchRunner` base class so Docker and SkyPilot backends stay in |
| sync. |
| - **Cleanup strategy**: research evaluations tear down clusters by default |
| unless `keep_cluster` is set. `SingleEvaluator` cleans up via an |
| active-cluster registry; `BatchEvaluator` manages its own pool lifecycle. |
| - **Naming**: runner class names encode track + backend |
| (e.g., `ResearchDockerRunner`) to remove ambiguity in logs and docs. |
| - **Score semantics**: a score of **0** can mean the evaluator ran |
| successfully; failures are reported via status/metadata rather than score |
| alone. |
| - **Reference solutions**: problems ship with `reference.cpp`/`reference.py` |
| so CI can verify end-to-end evaluation without model submissions. |
| - **Results separation**: evaluation outputs go to a dedicated results |
| repository to keep the main repo lean and auditable. |
| - **Internal vs public**: internal test cases and tooling live in a private |
| repo; public artifacts are kept minimal but compatible. |
| - **Weekly vs local**: weekly CI uses `scripts/run_eval.sh` with batch |
| scheduling; local runs use the same script or `frontier eval` for quick |
| iteration. |
| - **Resource-grouped cluster pools**: `BatchEvaluator` groups pairs by |
| `ResourceSignature` (cloud Γ accelerators Γ instance type) and creates a |
| separate pool per group, avoiding the waste of running CPU-only problems |
| on GPU clusters. |
| - **Hash-based resume**: resuming a batch compares solution/problem hashes |
| against stored results. Changed inputs are re-evaluated even when a prior |
| result exists, preventing silently stale scores. |
| - **Generation vs evaluation**: solution generation (`gen/`) is fully |
| decoupled from evaluation. Generated files are plain source files with no |
| special metadata; the evaluator has no dependency on the generation |
| pipeline. |
|
|
| ## Runner Flow (Research) |
|
|
| Both research runners share the same pre-evaluation steps (via |
| `ResearchRunner`): |
|
|
| 1. Validate solution file and `.FAILED` marker. |
| 2. Verify the problem path exists. |
| 3. Load `config.yaml` and runtime settings. |
| 4. Build uv install command if `uv_project` is specified. |
|
|
| Execution diverges at the backend: |
|
|
| - **Docker** β launches a local container. |
| - **SkyPilot** β provisions a cloud VM and runs remotely. |
|
|
| ## Operations (Cleanup + CI) |
|
|
| - **Cleanup**: research evaluations tear down clusters by default unless |
| `keep_cluster=True`. `SingleEvaluator` uses an active-cluster registry to |
| clean up on SIGINT/atexit; `BatchEvaluator` manages its own cluster pool |
| lifecycle. |
|
|
| - **CI**: problem validation runs single evals; the weekly batch job runs |
| full evaluations on SkyPilot (typically GCP). |
|
|