	# Evaluation Architecture

	This document describes how evaluation flows through the codebase and the
	design choices behind the current structure.

	## Audience

	- Problem contributors (see `CONTRIBUTING.md`)
	- Model submitters (see `SUBMIT.md`)
	- General researchers using Frontier-CS to evaluate solutions

	## Goals

	- Clear separation between single-problem and batch evaluation.
	- Shared validation and config parsing across research backends.
	- Predictable cleanup to avoid orphaned cloud resources.
	- Explicit naming to avoid backend ambiguity.

	## Architecture at a Glance

	- CLI:
	  - `frontier eval` → `SingleEvaluator`
	  - `frontier batch` → `BatchEvaluator`
	  - `frontier list` / `frontier show` → problem discovery
	- CI:
	  - Validate Problems → `scripts/validate_problems.py` → `SingleEvaluator`
	  - Weekly Batch Evaluation → `scripts/run_eval.sh` → `BatchEvaluator`

	## Components

	### SingleEvaluator
	Unified API for single-problem evaluation:

	- Selects a runner based on track and backend.
	- Registers SIGINT/atexit hooks to tear down SkyPilot clusters on exit.
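
The cleanup hooks can be pictured as a small registry of live cluster names that is drained on exit. The following is a minimal sketch, not the actual `SingleEvaluator` code: `register_cluster`, `unregister_cluster`, `teardown_all`, and `install_cleanup_hooks` are hypothetical names, and the `teardown` callback stands in for whatever SkyPilot teardown call the real implementation uses.

```python
import atexit
import signal
import threading

# Hypothetical registry of clusters that are currently up.
_active_clusters: set[str] = set()
_lock = threading.Lock()


def register_cluster(name: str) -> None:
    """Record a cluster as live so it can be cleaned up later."""
    with _lock:
        _active_clusters.add(name)


def unregister_cluster(name: str) -> None:
    """Forget a cluster that was already torn down normally."""
    with _lock:
        _active_clusters.discard(name)


def teardown_all(teardown=print) -> list[str]:
    """Tear down every registered cluster once and return the names handled."""
    with _lock:
        names = sorted(_active_clusters)
        _active_clusters.clear()  # makes a second call a no-op
    for name in names:
        teardown(name)  # placeholder for the real teardown call
    return names


def install_cleanup_hooks() -> None:
    """Run teardown_all on normal interpreter exit and on Ctrl-C."""
    atexit.register(teardown_all)

    def _on_sigint(signum, frame):
        teardown_all()
        raise KeyboardInterrupt

    # signal.signal may only be called from the main thread.
    signal.signal(signal.SIGINT, _on_sigint)
```

Clearing the registry inside the lock means the SIGINT handler and the `atexit` hook can both fire without tearing down the same cluster twice.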

	### BatchEvaluator
	Orchestrates batch evaluation with parallel workers and SkyPilot
	cluster pools:

	- Work queues with resumable state and result aggregation.
	- Resource-grouped cluster pools — pairs are grouped by
	`ResourceSignature` (cloud, accelerators, instance type) so that
	CPU-only and GPU problems run on separate pools.
	- Hash-based resume — each result stores solution/problem hashes so stale
	results are automatically re-evaluated when source changes.
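
To illustrate resource-grouped pooling, the sketch below buckets (solution, problem) pairs by a hashable signature. It is a simplified illustration: the three fields follow the description above, but the real `ResourceSignature` may carry more, and `group_pairs` is a hypothetical helper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen makes instances hashable, so they can be dict keys
class ResourceSignature:
    cloud: str
    accelerators: Optional[str]   # None for CPU-only problems
    instance_type: Optional[str]


def group_pairs(pairs):
    """Group (solution, problem, signature) triples into one bucket per signature."""
    pools = defaultdict(list)
    for solution, problem, signature in pairs:
        pools[signature].append((solution, problem))
    return dict(pools)
```

Each resulting bucket would then be served by its own cluster pool, so a CPU-only pair never occupies a GPU cluster.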

	### Runners
	Runners execute evaluation. The class hierarchy:

	```
	Runner (ABC)
	├── ResearchRunner              # shared: problem validation, config loading, uv install
	│   ├── ResearchDockerRunner    # local container
	│   └── ResearchSkyPilotRunner  # cloud via SkyPilot
	├── AlgorithmicLocalRunner      # go-judge via HTTP
	└── AlgorithmicSkyPilotRunner   # go-judge on SkyPilot
	```
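
As a rough sketch of how a runner might be chosen from track and backend, the hierarchy above can be mirrored with a lookup table. The class names come from this document; the `run` method, the `RUNNERS` table, `select_runner`, and the track/backend strings are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class Runner(ABC):
    @abstractmethod
    def run(self, problem_id: str, solution_path: str) -> str:
        """Evaluate one solution; returns a description here for illustration."""


class ResearchRunner(Runner):
    """Shared research steps would live here; subclasses supply the backend."""
    backend = "unset"

    def run(self, problem_id: str, solution_path: str) -> str:
        return f"research/{self.backend}: {problem_id} <- {solution_path}"


class ResearchDockerRunner(ResearchRunner):
    backend = "docker"


class ResearchSkyPilotRunner(ResearchRunner):
    backend = "skypilot"


class AlgorithmicLocalRunner(Runner):
    def run(self, problem_id: str, solution_path: str) -> str:
        return f"algorithmic/local: {problem_id} <- {solution_path}"


class AlgorithmicSkyPilotRunner(Runner):
    def run(self, problem_id: str, solution_path: str) -> str:
        return f"algorithmic/skypilot: {problem_id} <- {solution_path}"


# Hypothetical (track, backend) keys; the real identifiers may differ.
RUNNERS = {
    ("research", "docker"): ResearchDockerRunner,
    ("research", "skypilot"): ResearchSkyPilotRunner,
    ("algorithmic", "local"): AlgorithmicLocalRunner,
    ("algorithmic", "skypilot"): AlgorithmicSkyPilotRunner,
}


def select_runner(track: str, backend: str) -> Runner:
    """Instantiate the runner registered for this track/backend pair."""
    try:
        return RUNNERS[(track, backend)]()
    except KeyError:
        raise ValueError(f"no runner for track={track!r}, backend={backend!r}") from None
```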

	### Solution Generation (`gen/`)
	The `gen/` module generates solutions by calling LLMs. It is independent of
	the evaluation pipeline — `frontier eval` and `frontier batch` do not
	depend on it. It provides an LLM interface, an API key pool for concurrent
	requests, and solution file formatting.

	## Design Decisions

	- Single vs Batch: `SingleEvaluator` stays focused on one-off runs
	(simple API + cleanup hooks), while `BatchEvaluator` owns scheduling,
	resumable state, and cluster pools. This keeps single-run paths
	lightweight and batch runs scalable.
	- Shared research helpers: input validation and config parsing live in
	the `ResearchRunner` base class so Docker and SkyPilot backends stay in
	sync.
	- Cleanup strategy: research evaluations tear down clusters by default
	unless `keep_cluster` is set. `SingleEvaluator` cleans up via an
	active-cluster registry; `BatchEvaluator` manages its own pool lifecycle.
	- Naming: runner class names encode track + backend
	(e.g., `ResearchDockerRunner`) to remove ambiguity in logs and docs.
	- Score semantics: a score of 0 does not by itself indicate failure; the
	evaluator may have run successfully and awarded no points. Failures are
	reported via status/metadata rather than score alone.
	- Reference solutions: problems ship with `reference.cpp`/`reference.py`
	so CI can verify end-to-end evaluation without model submissions.
	- Results separation: evaluation outputs go to a dedicated results
	repository to keep the main repo lean and auditable.
	- Internal vs public: internal test cases and tooling live in a private
	repo; public artifacts are kept minimal but compatible.
	- Weekly vs local: weekly CI uses `scripts/run_eval.sh` with batch
	scheduling; local runs use the same script or `frontier eval` for quick
	iteration.
	- Resource-grouped cluster pools: `BatchEvaluator` groups pairs by
	`ResourceSignature` (cloud × accelerators × instance type) and creates a
	separate pool per group, avoiding the waste of running CPU-only problems
	on GPU clusters.
	- Hash-based resume: resuming a batch compares solution/problem hashes
	against stored results. Changed inputs are re-evaluated even when a prior
	result exists, preventing silently stale scores.
	- Generation vs evaluation: solution generation (`gen/`) is fully
	decoupled from evaluation. Generated files are plain source files with no
	special metadata; the evaluator has no dependency on the generation
	pipeline.
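
The hash-based resume decision above can be sketched as a comparison between stored and freshly computed hashes. This is an illustrative sketch only: SHA-256, the `solution_hash`/`problem_hash` field names, and the `needs_evaluation` helper are assumptions, not the stored result format.

```python
import hashlib
from typing import Optional


def content_hash(data: bytes) -> str:
    """Hex digest of the input bytes (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def needs_evaluation(stored: Optional[dict], solution: bytes, problem: bytes) -> bool:
    """Return True when there is no prior result or either input has changed."""
    if stored is None:
        return True
    return (
        stored.get("solution_hash") != content_hash(solution)
        or stored.get("problem_hash") != content_hash(problem)
    )
```

A result recorded without hashes is treated as stale here, since `stored.get(...)` returns `None` and never matches a digest.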

	## Runner Flow (Research)

	Both research runners share the same pre-evaluation steps (via
	`ResearchRunner`):

	1. Validate the solution file and check for a `.FAILED` marker.
	2. Verify that the problem path exists.
	3. Load `config.yaml` and runtime settings.
	4. Build the `uv` install command if `uv_project` is specified.
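
The shared steps above might look roughly like the following. Several details are assumptions made for illustration: the `.FAILED` marker is taken to be a sibling file named `<solution file name>.FAILED`, the config is passed in as an already-parsed dict rather than read from `config.yaml`, and `uv pip install -e` is a placeholder for whatever install command the real code builds. `prepare` and `build_uv_install` are hypothetical names.

```python
from pathlib import Path
from typing import Optional


def build_uv_install(config: dict) -> Optional[list[str]]:
    """Return a uv command as an argument list, or None if no uv_project is set."""
    project = config.get("uv_project")
    if not project:
        return None
    return ["uv", "pip", "install", "-e", project]  # placeholder command


def prepare(solution: Path, problem_dir: Path, config: dict) -> dict:
    """Run the backend-independent checks and return what the backend needs."""
    # Step 1: the solution must exist and must not be marked as failed.
    if not solution.is_file():
        raise FileNotFoundError(f"solution not found: {solution}")
    if solution.with_name(solution.name + ".FAILED").exists():  # assumed convention
        raise ValueError(f"solution is marked as failed: {solution}")
    # Step 2: the problem path must exist.
    if not problem_dir.is_dir():
        raise FileNotFoundError(f"problem path not found: {problem_dir}")
    # Steps 3 and 4: carry the config forward and derive the install command.
    return {"config": config, "install_cmd": build_uv_install(config)}
```

Keeping these checks in one place is what lets the Docker and SkyPilot backends reject bad inputs identically before any container or VM is started.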

	Execution diverges at the backend:

	- Docker — launches a local container.
	- SkyPilot — provisions a cloud VM and runs remotely.

	## Operations (Cleanup + CI)

	- Cleanup: research evaluations tear down clusters by default unless
	`keep_cluster=True`. `SingleEvaluator` uses an active-cluster registry to
	clean up on SIGINT/atexit; `BatchEvaluator` manages its own cluster pool
	lifecycle.

	- CI: problem validation runs single evals; the weekly batch job runs
	full evaluations on SkyPilot (typically GCP).