Evaluation Architecture

This document describes how evaluation flows through the codebase and the design choices behind the current structure.

Audience

Problem contributors (see CONTRIBUTING.md)
Model submitters (see SUBMIT.md)
General researchers using Frontier-CS to evaluate solutions

Goals

Clear separation between single-problem and batch evaluation.
Shared validation and config parsing across research backends.
Predictable cleanup to avoid orphaned cloud resources.
Explicit naming to avoid backend ambiguity.

Architecture at a Glance

CLI:
- frontier eval → SingleEvaluator
- frontier batch → BatchEvaluator
- frontier list / frontier show → problem discovery

CI:
- Validate Problems → scripts/validate_problems.py → SingleEvaluator
- Weekly Batch Evaluation → scripts/run_eval.sh → BatchEvaluator

Components

SingleEvaluator

Unified API for single-problem evaluation:

Selects a runner based on track and backend.
Registers SIGINT/atexit hooks to tear down SkyPilot clusters on exit.
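The cleanup hooks can be pictured with a minimal sketch of an active-cluster registry. Everything here is an assumption made for illustration: the class name `ActiveClusterRegistry`, its methods, and the `teardown` callable (which in practice would wrap whatever SkyPilot teardown call the codebase uses) are hypothetical, not the project's actual API.

```python
import atexit
import signal
import threading
from typing import Callable, Set


class ActiveClusterRegistry:
    """Hypothetical registry of clusters that still need teardown."""

    def __init__(self, teardown: Callable[[str], None]) -> None:
        self._teardown = teardown  # assumed: callable that tears down one cluster by name
        self._clusters: Set[str] = set()
        # Re-entrant lock so a signal handler firing on the main thread cannot deadlock.
        self._lock = threading.RLock()

    def register(self, name: str) -> None:
        with self._lock:
            self._clusters.add(name)

    def unregister(self, name: str) -> None:
        with self._lock:
            self._clusters.discard(name)

    def cleanup(self) -> None:
        # Snapshot and clear first, so a second call (SIGINT followed by atexit) is a no-op.
        with self._lock:
            pending, self._clusters = sorted(self._clusters), set()
        for name in pending:
            try:
                self._teardown(name)
            except Exception as exc:  # keep going so one failure does not orphan the rest
                print(f"failed to tear down {name}: {exc}")

    def install_hooks(self) -> None:
        """Run cleanup on normal interpreter exit and on Ctrl-C (main thread only)."""
        atexit.register(self.cleanup)

        def on_sigint(signum, frame):
            self.cleanup()
            raise KeyboardInterrupt

        signal.signal(signal.SIGINT, on_sigint)
```

Under this sketch, a run with keep_cluster set would simply never register its cluster, so neither hook would touch it.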

BatchEvaluator

Orchestrates batch evaluation with parallel workers and SkyPilot cluster pools:

Work queues with resumable state and result aggregation.
Resource-grouped cluster pools: solution/problem pairs are grouped by ResourceSignature (cloud, accelerators, instance type) so that CPU-only and GPU problems run on separate pools.
Hash-based resume: each result stores the solution and problem hashes, so stale results are automatically re-evaluated when the source changes.
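Resource grouping can be sketched as follows. `ResourceSignature` and its three fields come from the description above; the field types, the `Pair` alias, and the `group_pairs` helper are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)  # frozen makes instances hashable, so they can be dict keys
class ResourceSignature:
    cloud: Optional[str] = None
    accelerators: Optional[str] = None
    instance_type: Optional[str] = None


Pair = Tuple[str, str]  # assumed shape: (solution name, problem name)


def group_pairs(
    items: Iterable[Tuple[Pair, ResourceSignature]],
) -> Dict[ResourceSignature, List[Pair]]:
    """Bucket pairs by signature; each bucket would be served by its own pool."""
    groups: Dict[ResourceSignature, List[Pair]] = defaultdict(list)
    for pair, signature in items:
        groups[signature].append(pair)
    return dict(groups)
```

Because each bucket maps to one pool, a CPU-only signature never shares a cluster with a GPU signature.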

Runners

Runners execute evaluation. The class hierarchy:

Runner (ABC)
├── ResearchRunner              # shared: problem validation, config loading, uv install
│   ├── ResearchDockerRunner    # local container
│   └── ResearchSkyPilotRunner  # cloud via SkyPilot
├── AlgorithmicLocalRunner      # go-judge via HTTP
└── AlgorithmicSkyPilotRunner   # go-judge on SkyPilot
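As a rough sketch, the hierarchy and the track/backend selection could look like the code below. The class names match the tree above; the `evaluate` and `prepare` signatures, the string keys for track and backend, and the `select_runner` helper are assumptions, not the project's actual interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, Tuple, Type


class Runner(ABC):
    @abstractmethod
    def evaluate(self, problem: str, solution: str) -> dict:
        """Run one evaluation and return a result mapping (assumed shape)."""


class ResearchRunner(Runner):
    def prepare(self, problem: str, solution: str) -> dict:
        # Shared steps (validation, config loading, uv install) would live here.
        return {"problem": problem, "solution": solution}


class ResearchDockerRunner(ResearchRunner):
    def evaluate(self, problem: str, solution: str) -> dict:
        return {**self.prepare(problem, solution), "backend": "docker"}


class ResearchSkyPilotRunner(ResearchRunner):
    def evaluate(self, problem: str, solution: str) -> dict:
        return {**self.prepare(problem, solution), "backend": "skypilot"}


class AlgorithmicLocalRunner(Runner):
    def evaluate(self, problem: str, solution: str) -> dict:
        return {"problem": problem, "solution": solution, "backend": "local"}


class AlgorithmicSkyPilotRunner(Runner):
    def evaluate(self, problem: str, solution: str) -> dict:
        return {"problem": problem, "solution": solution, "backend": "skypilot"}


# Assumed key values; the real track and backend identifiers may differ.
RUNNERS: Dict[Tuple[str, str], Type[Runner]] = {
    ("research", "docker"): ResearchDockerRunner,
    ("research", "skypilot"): ResearchSkyPilotRunner,
    ("algorithmic", "local"): AlgorithmicLocalRunner,
    ("algorithmic", "skypilot"): AlgorithmicSkyPilotRunner,
}


def select_runner(track: str, backend: str) -> Runner:
    try:
        return RUNNERS[(track, backend)]()
    except KeyError:
        raise ValueError(f"no runner for track={track!r}, backend={backend!r}") from None
```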

Solution Generation (`gen/`)

The gen/ module generates solutions by calling LLMs. It is independent of the evaluation pipeline — frontier eval and frontier batch do not depend on it. It provides an LLM interface, an API key pool for concurrent requests, and solution file formatting.

Design Decisions

Single vs Batch: SingleEvaluator stays focused on one-off runs (simple API + cleanup hooks), while BatchEvaluator owns scheduling, resumable state, and cluster pools. This keeps single-run paths lightweight and batch runs scalable.
Shared research helpers: input validation and config parsing live in the ResearchRunner base class so Docker and SkyPilot backends stay in sync.
Cleanup strategy: research evaluations tear down clusters by default unless keep_cluster=True. SingleEvaluator cleans up via an active-cluster registry; BatchEvaluator manages its own pool lifecycle.
Naming: runner class names encode track + backend (e.g., ResearchDockerRunner) to remove ambiguity in logs and docs.
Score semantics: a score of 0 is a valid outcome and can mean the evaluator ran successfully; failures are reported via status/metadata, so a zero score alone never signals an error.
Reference solutions: problems ship with reference.cpp/reference.py so CI can verify end-to-end evaluation without model submissions.
Results separation: evaluation outputs go to a dedicated results repository to keep the main repo lean and auditable.
Internal vs public: internal test cases and tooling live in a private repo; public artifacts are kept minimal but compatible.
Weekly vs local: weekly CI uses scripts/run_eval.sh with batch scheduling; local runs use the same script or frontier eval for quick iteration.
Resource-grouped cluster pools: BatchEvaluator groups pairs by ResourceSignature (cloud × accelerators × instance type) and creates a separate pool per group, avoiding the waste of running CPU-only problems on GPU clusters.
Hash-based resume: resuming a batch compares solution/problem hashes against stored results. Changed inputs are re-evaluated even when a prior result exists, preventing silently stale scores.
Generation vs evaluation: solution generation (gen/) is fully decoupled from evaluation. Generated files are plain source files with no special metadata; the evaluator has no dependency on the generation pipeline.
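The hash-based resume decision above reduces to a small comparison, sketched here. The `StoredResult` record, the choice of SHA-256, and the function names are assumptions; the actual hashing scheme and result schema may differ.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def content_hash(data: bytes) -> str:
    """Hash raw content; SHA-256 is an assumed choice."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class StoredResult:  # hypothetical shape of a persisted result
    score: float
    solution_hash: str
    problem_hash: str


def needs_evaluation(
    stored: Optional[StoredResult], solution_hash: str, problem_hash: str
) -> bool:
    """Re-run when no result exists or when either input changed since it was stored."""
    if stored is None:
        return True
    return (
        stored.solution_hash != solution_hash
        or stored.problem_hash != problem_hash
    )
```

Comparing both hashes means an edit to the problem definition invalidates every stored result for that problem, not only results for edited solutions.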

Runner Flow (Research)

Both research runners share the same pre-evaluation steps (via ResearchRunner):

Validate solution file and .FAILED marker.
Verify the problem path exists.
Load config.yaml and runtime settings.
Build uv install command if uv_project is specified.
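The four shared steps above might be sketched like this. Several details are assumptions made for illustration: the .FAILED marker is taken to be a sibling file named after the solution, uv_project is taken to be a path relative to the problem directory, the exact uv command is a guess, and the YAML parser is passed in (for example `yaml.safe_load` from PyYAML) to keep the sketch free of third-party imports.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List, Optional


@dataclass
class Prepared:  # hypothetical container for the shared pre-evaluation output
    config: dict
    install_cmd: Optional[List[str]]


def prepare(
    solution: Path, problem_dir: Path, parse_yaml: Callable[[str], dict]
) -> Prepared:
    # 1. Validate the solution file and its .FAILED marker (assumed naming).
    if not solution.is_file():
        raise FileNotFoundError(f"solution not found: {solution}")
    if solution.with_name(solution.name + ".FAILED").exists():
        raise ValueError(f"solution is marked as failed: {solution}")

    # 2. Verify the problem path exists.
    if not problem_dir.is_dir():
        raise FileNotFoundError(f"problem not found: {problem_dir}")

    # 3. Load config.yaml with the caller-supplied parser.
    config = parse_yaml((problem_dir / "config.yaml").read_text()) or {}

    # 4. Build the uv install command only when uv_project is specified.
    uv_project = config.get("uv_project")
    install_cmd = (
        ["uv", "sync", "--project", str(problem_dir / uv_project)]
        if uv_project
        else None
    )
    return Prepared(config=config, install_cmd=install_cmd)
```

Both backends would call the same function and then hand the result to their own launch step, which is what keeps Docker and SkyPilot behavior aligned.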

Execution diverges at the backend:

Docker — launches a local container.
SkyPilot — provisions a cloud VM and runs remotely.

Operations (Cleanup + CI)

Cleanup: research evaluations tear down clusters by default unless keep_cluster=True. SingleEvaluator uses an active-cluster registry to clean up on SIGINT/atexit; BatchEvaluator manages its own cluster pool lifecycle.
CI: problem validation runs single evals; the weekly batch job runs full evaluations on SkyPilot (typically GCP).