Evaluation Architecture
This document describes how evaluation flows through the codebase and the design choices behind the current structure.
Audience
- Problem contributors (see CONTRIBUTING.md)
- Model submitters (see SUBMIT.md)
- General researchers using Frontier-CS to evaluate solutions
Goals
- Clear separation between single-problem and batch evaluation.
- Shared validation and config parsing across research backends.
- Predictable cleanup to avoid orphaned cloud resources.
- Explicit naming to avoid backend ambiguity.
Architecture at a Glance
- CLI:
  - frontier eval → SingleEvaluator
  - frontier batch → BatchEvaluator
  - frontier list / frontier show → problem discovery
- CI:
  - Validate Problems → scripts/validate_problems.py → SingleEvaluator
  - Weekly Batch Evaluation → scripts/run_eval.sh → BatchEvaluator
Components
SingleEvaluator
Unified API for single-problem evaluation:
- Selects a runner based on track and backend.
- Registers SIGINT/atexit hooks to tear down SkyPilot clusters on exit.
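The cleanup hooks described above can be sketched roughly as follows. This is a minimal illustration, not the actual SingleEvaluator code: the registry, the function names, and the default_teardown stand-in are all hypothetical, and the real teardown call into SkyPilot is deliberately left out.

```python
import atexit
import signal
import sys

# Hypothetical registry of cluster names launched by this process.
_active_clusters = set()


def default_teardown(name):
    """Stand-in for the real teardown call; here it only logs."""
    print(f"tearing down cluster {name}", file=sys.stderr)


def register_cluster(name):
    _active_clusters.add(name)


def unregister_cluster(name):
    _active_clusters.discard(name)


def cleanup_all(teardown=default_teardown):
    """Tear down every registered cluster. Safe to call repeatedly."""
    cleaned = []
    while _active_clusters:
        name = _active_clusters.pop()
        try:
            teardown(name)
            cleaned.append(name)
        except Exception as exc:
            # Keep going so one failed teardown does not strand the rest.
            print(f"failed to tear down {name}: {exc}", file=sys.stderr)
    return cleaned


def _handle_sigint(signum, frame):
    # Clean up first, then let the usual Ctrl-C behaviour proceed.
    cleanup_all()
    raise KeyboardInterrupt


def install_cleanup_hooks():
    """Call once from the main thread before launching any cluster."""
    atexit.register(cleanup_all)
    signal.signal(signal.SIGINT, _handle_sigint)
```

Because cleanup_all empties the registry as it goes, the atexit call that follows a SIGINT finds nothing left to do, so clusters are not torn down twice.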
BatchEvaluator
Orchestrates batch evaluation with parallel workers and SkyPilot cluster pools:
- Work queues with resumable state and result aggregation.
- Resource-grouped cluster pools: pairs are grouped by ResourceSignature (cloud, accelerators, instance type) so that CPU-only and GPU problems run on separate pools.
- Hash-based resume: each result stores solution/problem hashes so stale results are automatically re-evaluated when source changes.
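A rough sketch of the resource grouping described above, under stated assumptions: the field names mirror the three components listed in the text, but the actual ResourceSignature definition may differ, and group_pairs and signature_of are hypothetical helpers.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen makes instances hashable, so they can be dict keys
class ResourceSignature:
    cloud: str
    accelerators: Optional[str]   # None for CPU-only problems (assumption)
    instance_type: Optional[str]


def group_pairs(pairs, signature_of):
    """Group (solution, problem) pairs by the resources the problem needs.

    signature_of is a hypothetical callable mapping a pair to its
    ResourceSignature; each resulting group would get its own cluster pool.
    """
    groups = defaultdict(list)
    for pair in pairs:
        groups[signature_of(pair)].append(pair)
    return dict(groups)
```

With illustrative values, pairs whose problems need ResourceSignature("gcp", None, "n2-standard-8") land in one group and those needing ResourceSignature("gcp", "L4:1", None) in another, so CPU-only work never occupies a GPU pool.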
Runners
Runners execute evaluation. The class hierarchy:
Runner (ABC)
├── ResearchRunner              # shared: problem validation, config loading, uv install
│   ├── ResearchDockerRunner    # local container
│   └── ResearchSkyPilotRunner  # cloud via SkyPilot
├── AlgorithmicLocalRunner      # go-judge via HTTP
└── AlgorithmicSkyPilotRunner   # go-judge on SkyPilot
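The hierarchy, together with the track/backend selection that SingleEvaluator performs, might look like the skeleton below. The class names come from the tree above; the evaluate method, its return value, the track and backend strings, and select_runner are assumptions made for illustration only.

```python
from abc import ABC, abstractmethod


class Runner(ABC):
    @abstractmethod
    def evaluate(self, problem, solution):
        """Run one evaluation and return a result (shape is hypothetical)."""


class ResearchRunner(Runner):
    """Shared research logic (validation, config loading) would live here."""

    def evaluate(self, problem, solution):
        return {"runner": type(self).__name__, "problem": problem}


class ResearchDockerRunner(ResearchRunner):
    pass


class ResearchSkyPilotRunner(ResearchRunner):
    pass


class AlgorithmicLocalRunner(Runner):
    def evaluate(self, problem, solution):
        return {"runner": type(self).__name__, "problem": problem}


class AlgorithmicSkyPilotRunner(AlgorithmicLocalRunner):
    # Subclassing here only avoids repeating the placeholder method;
    # in the tree above both algorithmic runners derive directly from Runner.
    pass


# Hypothetical (track, backend) keys; the real identifiers may differ.
RUNNERS = {
    ("research", "docker"): ResearchDockerRunner,
    ("research", "skypilot"): ResearchSkyPilotRunner,
    ("algorithmic", "local"): AlgorithmicLocalRunner,
    ("algorithmic", "skypilot"): AlgorithmicSkyPilotRunner,
}


def select_runner(track, backend):
    try:
        return RUNNERS[(track, backend)]()
    except KeyError:
        raise ValueError(f"no runner for track={track!r}, backend={backend!r}") from None
```

A lookup table keeps the valid combinations explicit, so an unsupported pairing fails with a clear error instead of silently falling back to another backend.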
Solution Generation (gen/)
The gen/ module generates solutions by calling LLMs. It is independent of
the evaluation pipeline; frontier eval and frontier batch do not
depend on it. It provides an LLM interface, an API key pool for concurrent
requests, and solution file formatting.
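One simple way an API key pool for concurrent requests could work is round-robin rotation behind a lock. The sketch below is an assumption about the general idea, not a description of the gen/ module; ApiKeyPool and next_key are hypothetical names.

```python
import itertools
import threading


class ApiKeyPool:
    """Hand out API keys in round-robin order; safe to share across threads."""

    def __init__(self, keys):
        keys = list(keys)
        if not keys:
            raise ValueError("at least one API key is required")
        self._cycle = itertools.cycle(keys)
        self._lock = threading.Lock()

    def next_key(self):
        # The lock ensures two workers never advance the iterator at once.
        with self._lock:
            return next(self._cycle)
```

Each worker would call next_key() before issuing a request, spreading load evenly across the available keys.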
Design Decisions
- Single vs Batch: SingleEvaluator stays focused on one-off runs (simple API + cleanup hooks), while BatchEvaluator owns scheduling, resumable state, and cluster pools. This keeps single-run paths lightweight and batch runs scalable.
- Shared research helpers: input validation and config parsing live in the ResearchRunner base class so Docker and SkyPilot backends stay in sync.
- Cleanup strategy: research evaluations tear down clusters by default unless keep_cluster is set. SingleEvaluator cleans up via an active-cluster registry; BatchEvaluator manages its own pool lifecycle.
- Naming: runner class names encode track + backend (e.g., ResearchDockerRunner) to remove ambiguity in logs and docs.
- Score semantics: a score of 0 can mean the evaluator ran successfully; failures are reported via status/metadata rather than score alone.
- Reference solutions: problems ship with reference.cpp / reference.py so CI can verify end-to-end evaluation without model submissions.
- Results separation: evaluation outputs go to a dedicated results repository to keep the main repo lean and auditable.
- Internal vs public: internal test cases and tooling live in a private repo; public artifacts are kept minimal but compatible.
- Weekly vs local: weekly CI uses scripts/run_eval.sh with batch scheduling; local runs use the same script or frontier eval for quick iteration.
- Resource-grouped cluster pools: BatchEvaluator groups pairs by ResourceSignature (cloud × accelerators × instance type) and creates a separate pool per group, avoiding the waste of running CPU-only problems on GPU clusters.
- Hash-based resume: resuming a batch compares solution/problem hashes against stored results. Changed inputs are re-evaluated even when a prior result exists, preventing silently stale scores.
- Generation vs evaluation: solution generation (gen/) is fully decoupled from evaluation. Generated files are plain source files with no special metadata; the evaluator has no dependency on the generation pipeline.
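The hash-based resume decision from the list above can be illustrated with a short sketch. The choice of SHA-256, the stored field names solution_hash and problem_hash, and the helper names are assumptions; the real result schema may differ.

```python
import hashlib


def content_hash(data):
    """Hex digest of raw file bytes (SHA-256 chosen here as an assumption)."""
    return hashlib.sha256(data).hexdigest()


def needs_evaluation(stored, solution_bytes, problem_bytes):
    """Return True when there is no stored result or either input changed.

    stored is a hypothetical dict loaded from a previous run, or None.
    """
    if stored is None:
        return True
    return (
        stored.get("solution_hash") != content_hash(solution_bytes)
        or stored.get("problem_hash") != content_hash(problem_bytes)
    )
```

Comparing both hashes means an edited problem invalidates old scores just as an edited solution does, which is what keeps a resumed batch from reporting stale results.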
Runner Flow (Research)
Both research runners share the same pre-evaluation steps (via
ResearchRunner):
- Validate solution file and .FAILED marker.
- Verify the problem path exists.
- Load config.yaml and runtime settings.
- Build uv install command if uv_project is specified.
Execution diverges at the backend:
- Docker: launches a local container.
- SkyPilot: provisions a cloud VM and runs remotely.
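The shared pre-evaluation steps listed above could be sketched as a single preparation function. Several details here are assumptions rather than facts about ResearchRunner: the .FAILED marker is taken to be a sibling file sharing the solution's stem, uv_project is taken to be a top-level config key holding a path relative to the problem directory, the install command is shown as uv sync --project, and PyYAML is assumed for parsing.

```python
from pathlib import Path


def _load_yaml(path):
    import yaml  # PyYAML; assumed to be installed

    with path.open() as handle:
        return yaml.safe_load(handle) or {}


def prepare_research_eval(solution, problem_dir, load_config=_load_yaml):
    """Run the shared checks and return what a backend needs to start."""
    solution, problem_dir = Path(solution), Path(problem_dir)

    # 1. Validate the solution file and the .FAILED marker (assumed layout).
    if not solution.is_file():
        raise FileNotFoundError(f"solution not found: {solution}")
    if solution.with_suffix(".FAILED").exists():
        raise ValueError(f"solution is marked as failed: {solution}")

    # 2. Verify the problem path exists.
    if not problem_dir.is_dir():
        raise FileNotFoundError(f"problem not found: {problem_dir}")

    # 3. Load config.yaml.
    config = load_config(problem_dir / "config.yaml")

    # 4. Build the uv install command only when uv_project is set.
    install_cmd = None
    uv_project = config.get("uv_project")
    if uv_project:
        install_cmd = ["uv", "sync", "--project", str(problem_dir / uv_project)]

    return {"config": config, "install_cmd": install_cmd}
```

A Docker or SkyPilot backend would then consume the returned dictionary, which is the point of keeping these steps in one shared place.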
Operations (Cleanup + CI)
Cleanup: research evaluations tear down clusters by default unless keep_cluster=True. SingleEvaluator uses an active-cluster registry to clean up on SIGINT/atexit; BatchEvaluator manages its own cluster pool lifecycle.

CI: problem validation runs single evals; the weekly batch job runs full evaluations on SkyPilot (typically GCP).
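The tear-down-by-default behaviour can be pictured as a context manager that honours keep_cluster. This is only a sketch: managed_cluster and the launch and teardown callables are hypothetical placeholders for whatever the runners actually call.

```python
from contextlib import contextmanager


@contextmanager
def managed_cluster(name, launch, teardown, keep_cluster=False):
    """Launch a cluster and tear it down on exit unless keep_cluster is set."""
    try:
        # Launching inside the try block means a failed or partial launch
        # still reaches the teardown below.
        launch(name)
        yield name
    finally:
        if not keep_cluster:
            teardown(name)
```

Used as `with managed_cluster("eval-1", launch, teardown): ...`, the cluster is removed even when the evaluation raises, which is the default the text describes.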