# Evaluating Your Model

> **For Model Providers**: Complete workflow for benchmarking your model on Frontier-CS and submitting results to the leaderboard.

## Step 1: Prepare Solutions

Place your solutions in the correct directory structure:

```
{track}/solutions/{problem}/{model}.{ext}
{track}/solutions/{problem}/{model}_{variant}.{ext}
```

**Examples:**

```
research/solutions/flash_attn/my_model.py
research/solutions/flash_attn/my_model_1.py      # variant 1
research/solutions/gemm_optimization/squares/my_model.py
algorithmic/solutions/1/my_model.cpp
```

- **Research track**: Python (`.py`) by default, or C++ (`.cpp`) if the problem specifies `language: cpp` in `config.yaml`
- **Algorithmic track**: C++17 (`.cpp`)
- We recommend generating **5 variants per model** to compute Score@5

## Step 2: Run Evaluation

Suppose you have a new model `my_model` and want to evaluate it. There are three ways to do so:

**1. Put solutions in `solutions/` directory**

```
research/solutions/
├── flash_attn/my_model.py
├── cross_entropy/my_model.py
└── ...
```

```bash
frontier batch research --model my_model
```

**2. Use your own directory**

```
./my_solutions/
├── flash_attn/my_model.py
├── cross_entropy/my_model.py
└── ...
```

```bash
frontier batch research --solutions-dir ./my_solutions
```

**3. Explicit pairs file**

```
# pairs.txt
./my_solutions/flash_attn/my_model.py:flash_attn
./my_solutions/cross_entropy/my_model.py:cross_entropy
```

```bash
frontier batch research --pairs-file pairs.txt
```

### Backend Options

```bash
# Research defaults to SkyPilot, algorithmic defaults to Docker
frontier batch research --backend docker
frontier batch algorithmic --backend skypilot

# Parallelism
frontier batch research --workers 20 --clusters 4
```

### Result Storage

```bash
# Local (default): results saved to ./results/batch/{track}/
frontier batch research

# Cloud bucket (requires --backend skypilot): results written directly to S3/GCS
frontier batch research --bucket-url s3://my-bucket/results

# Sync from bucket to local
frontier batch research --bucket-url s3://my-bucket/results --sync-bucket
```

### Control Options

```bash
frontier batch research --status        # Check status
frontier batch research --no-resume     # Force re-evaluate all
frontier batch research --retry-failed  # Retry failed (including score=0)
```

- Evaluation is incremental and uses hash-based caching: a change to a solution or a problem triggers re-evaluation

## Step 3: View Results

Results from public test case evaluation are saved to `./results/batch/{track}/`:

| File | Content |
|------|---------|
| `results.csv` | All evaluation results |
| `by_model.csv` | Score@1, Avg@5, Score@5 per model |
| `by_problem.csv` | Scores per problem |
| `failed.txt` | Failed evaluations |
| `pending.txt` | Pending evaluations |

## Step 4: Submit to Leaderboard

We welcome submissions from all models and agent frameworks. To have your results included in our leaderboard, please follow the instructions below.

### Algorithmic Problems

We currently release **1-3 public test cases** per problem for local testing and debugging. Full evaluation (with all test cases) is performed on our servers.

#### What to Submit

1. **Solution files**: `{problem_id}_{model_name}_solution.cpp` for each problem
2. **Model/Agent info**: Name and version of the model or agent framework used
3. **Generation method**: Brief description of how solutions were generated (e.g., one-shot, multi-turn, with/without feedback)

#### Submission Format

Organize your solutions as:

```
submissions/
├── 1_gpt4_solution.cpp
├── 2_gpt4_solution.cpp
├── ...
└── metadata.json
```

`metadata.json`:

```json
{
  "model": "gpt-4o",
  "agent_framework": "custom",
  "generation_method": "one-shot",
  "date": "2025-01-15",
  "notes": "Optional additional notes"
}
```

### Research Problems

Research problems require a `solution.py` file implementing the `Solution` class interface.

#### Problem Structure

Research problems follow a hierarchical structure:

```
Problem (e.g., gemm_optimization, poc_generation)
└── Category (e.g., squares, heap_buffer_overflow)
    └── Variant (e.g., arvo_21000)
```

| Level | Example | Description |
|-------|---------|-------------|
| **Problem** | `gemm_optimization` | Top-level problem domain |
| **Category** | `gemm_optimization/squares` | Scores are **aggregated** at this level for leaderboard reporting |
| **Variant** | `poc_generation/heap_buffer_overflow/arvo_21000` | Each variant is **evaluated independently** with its own README |

**Key distinction:**

- **Evaluation**: Each variant runs independently and produces its own score
- **Reporting**: Scores are aggregated by category for the leaderboard (e.g., all `heap_buffer_overflow` variants → one score)

> Note: Some problems have only one level (e.g., `flash_attn`), which functions as both category and variant.

#### Problem ID Format

Each variant has a unique **Problem ID** based on its path under `research/`. The full list of all evaluatable variants is in [`research/scripts/problems.txt`](research/scripts/problems.txt).
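
The path-to-ID rule can be sketched in a few lines of Python. This is an illustration only: `problem_id_from_path` is a hypothetical helper, not part of the Frontier-CS tooling, and it assumes paths are written relative to the repository root with `/` separators.

```python
from pathlib import PurePosixPath


def problem_id_from_path(path: str, root: str = "research") -> str:
    """Return the Problem ID for a variant directory.

    Hypothetical helper: the ID is taken to be the path relative to
    `root`. Raises ValueError if `path` is not located under `root`.
    """
    return PurePosixPath(path).relative_to(root).as_posix()


if __name__ == "__main__":
    for p in (
        "research/flash_attn",
        "research/gemm_optimization/squares",
        "research/poc_generation/heap_buffer_overflow/arvo_21000",
    ):
        print(p, "->", problem_id_from_path(p))
```

Using `relative_to` rather than string slicing means a path outside `research/` fails loudly instead of silently producing a wrong ID.
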

| Type | Example Path | Problem ID |
|------|-------------|------------|
| Single problem | `research/flash_attn` | `flash_attn` |
| Problem with variants | `research/gemm_optimization/squares` | `gemm_optimization/squares` |
| Nested variants | `research/poc_generation/heap_buffer_overflow/arvo_21000` | `poc_generation/heap_buffer_overflow/arvo_21000` |

#### What to Submit

1. **Solution files**: `solution.py` for each problem, placed in a directory matching the Problem ID
2. **Model/Agent info**: Name and version of the model or agent framework used
3. **Local evaluation results** (optional but recommended): Score from running the evaluator locally

#### Submission Format

Your submission zip should mirror the Problem ID directory structure:

```
submission.zip
├── flash_attn/
│   └── solution.py
├── gemm_optimization/
│   └── squares/
│       └── solution.py
├── cant_be_late/
│   └── high_availability_loose_deadline/
│       └── solution.py
├── poc_generation/
│   └── heap_buffer_overflow/
│       └── arvo_21000/
│           └── solution.py
└── metadata.json
```

**Important**: The directory structure must exactly match the Problem ID. For example:

- `flash_attn/solution.py`
- `gemm_optimization/squares/solution.py`

Each `solution.py` must implement:

```python
class Solution:
    def __init__(self):
        pass

    def solve(self, *args):
        # Returns: solution output (format varies by problem)
        pass
```

#### metadata.json

```json
{
  "model": "gpt-4o",
  "agent_framework": "custom",
  "generation_method": "one-shot",
  "date": "2025-01-15",
  "problems_solved": [
    "flash_attn",
    "gemm_optimization/squares",
    "cant_be_late/high_availability_loose_deadline"
  ],
  "notes": "Optional additional notes"
}
```

### How to Submit

Send your submission to:

- **Email**: qmang@berkeley.edu or wenhao.chai@princeton.edu

Please include:

1. A zip/tar archive of your solutions following the format above
2. `metadata.json` with model and method information
3. (Optional) Local evaluation results if you ran them

### Leaderboard

Accepted submissions will be evaluated on our full test suite, and the results will be published on the [Frontier-CS Leaderboard](https://frontier-cs.org).

## How We Evaluate Submissions

After you submit, maintainers evaluate your solutions against the full private test suite. This runs automatically in weekly CI, or maintainers can trigger it manually:

```bash
./scripts/run_eval.sh --track research
./scripts/run_eval.sh --track algorithmic
```

Options:

- `-j N`: Parallelism (default: 10)
- `--force`: Force re-evaluate all
- `--no-push`: Don't push results

Results are saved to the `Frontier-CS-Result/` repository and published to the leaderboard.

---

## Using Our Generation Scripts (Optional)

If you want to use our scripts to batch-generate solutions with LLMs:

### Configure

**models.txt** (`research/scripts/models.txt` or `algorithmic/scripts/models.txt`)

- One model name per line
- Supported formats: `gpt-5`, `claude-sonnet-4-5`, `gemini/gemini-2.5-pro`, `xai/grok-4`, `deepseek/deepseek-reasoner`

**indices.txt**

- Controls how many variants to generate per (model, problem) pair
- Single number N = generate indices 0 to N-1
- Multiple lines = specify explicit indices

**API Keys**

Set environment variables for the providers you need. Multiple keys per provider are supported for load balancing (e.g., `OPENAI_API_KEY`, `OPENAI_API_KEY2`, `OPENAI_API_KEY_2`).

| Provider   | Environment Variable | Models                                |
| ---------- | -------------------- | ------------------------------------- |
| OpenAI     | `OPENAI_API_KEY`     | gpt-4o, gpt-5, o1, o3, ...            |
| Anthropic  | `ANTHROPIC_API_KEY`  | claude-sonnet-4-5, claude-opus-4, ... |
| Google     | `GOOGLE_API_KEY`     | gemini-2.5-pro, gemini-2.5-flash, ... |
| xAI        | `XAI_API_KEY`        | grok-3, grok-3-mini, ...              |
| DeepSeek   | `DEEPSEEK_API_KEY`   | deepseek-r1, deepseek-chat, ...       |
| OpenRouter | `OPENROUTER_API_KEY` | openrouter/\* models                  |

```bash
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-...
export GOOGLE_API_KEY=...
```

### Generate Solutions

#### Research Track

Most research problems are Python, but some (e.g., `nbody_simulation`) require C++. The language is configured per problem via the `language` field in `config.yaml`.

```bash
# Generate one solution
python research/scripts/generate_solutions.py --problem flash_attn --model gpt-5 --indices 1

# Preview what would be generated
python research/scripts/generate_solutions.py --dryrun
```

#### Algorithmic Track (C++)

```bash
python algorithmic/scripts/generate_solutions.py --model gpt-5
```

### Two Modes

**Problem mode** (generate new solutions):

```bash
python research/scripts/generate_solutions.py --problem flash_attn --model gpt-5
```

Generates **problems × models × indices** (Cartesian product):

- Problems: `--problem` patterns or `--problems-file` (default: auto-discover all problems)
- Models: `--model` list or `--models-file` (default: `models.txt`)
- Indices: `--indices N` or `--indices-file` (default: `indices.txt` or single solution)

Solution naming: `{problem}/{model}.py` for index 0, `{problem}/{model}_{i}.py` for index i.
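
The Cartesian product and the naming rule above can be sketched as follows. This is a minimal illustration, not the code in `generate_solutions.py`: `solution_relpath` and `plan` are hypothetical names, the `.py` extension is assumed, and `prefixes` stands for the per-model file prefix (e.g., `gpt5`), which the real scripts derive from the model name.

```python
from itertools import product
from typing import Iterable, List


def solution_relpath(problem: str, prefix: str, index: int, ext: str = ".py") -> str:
    """Index 0 gets no suffix; index i > 0 gets `_{i}` (hypothetical helper)."""
    suffix = "" if index == 0 else f"_{index}"
    return f"{problem}/{prefix}{suffix}{ext}"


def plan(problems: Iterable[str], prefixes: Iterable[str], indices: Iterable[int]) -> List[str]:
    """Enumerate problems x models x indices as relative solution paths."""
    return [solution_relpath(p, m, i) for p, m, i in product(problems, prefixes, indices)]


if __name__ == "__main__":
    # A single number N in indices.txt corresponds to range(N), i.e. 0 to N-1.
    for path in plan(["flash_attn", "cross_entropy"], ["gpt5"], range(2)):
        print(path)
```

With two problems, one model, and N = 2, the sketch lists four paths: the unsuffixed file and the `_1` variant for each problem.
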

**Solution mode** (regenerate existing solutions):

```bash
python research/scripts/generate_solutions.py --solution "flash_attn/gpt5*" --force
```

- Matches existing solutions in `solutions/` by pattern
- Model inferred from solution filename (e.g., `flash_attn/gpt5.py` → model `gpt5`)
- Requires `--force` since solutions already exist
- Still needs `models.txt` or `--model` to map prefix to model name

### Options

| Option                          | Description                                                                    |
| ------------------------------- | ------------------------------------------------------------------------------ |
| `--problem` / `--problems-file` | Problem pattern or file (default: auto-discover)                               |
| `--model` / `--models-file`     | Model(s) or file (default: `models.txt`)                                       |
| `--indices` / `--indices-file`  | Solution indices count or file (default: `indices.txt`)                        |
| `--solution PATTERN`            | Regenerate existing solutions by pattern (mutually exclusive with `--problem`) |
| `--force`                       | Overwrite existing solutions                                                   |
| `--dryrun`                      | Preview without generating                                                     |
| `--concurrency N`               | Parallel API calls                                                             |
| `--timeout SECONDS`             | API timeout (default: 600s)                                                    |

### Output

Solutions are saved in nested directories under `solutions/`:

```
solutions/
├── flash_attn/
│   ├── gpt5.py
│   ├── gpt5_1.py
│   └── claude4.5sonnet.py
└── cross_entropy/
    └── gpt5.py
```

### Check Coverage (Research Only)

```bash
python research/scripts/check_solutions.py
```

Shows:

- **Expected**: models × problems × variants
- **Generated**: expected AND exists
- **Missing**: expected but NOT exists
- **Failed**: `.FAILED` marker files (generation errors)
- **Extra**: exists but NOT expected
- **Empty**: file exists but content is empty

Outputs a coverage progress bar and exports `problems.txt`.

### Customization Points

If you want to modify our scripts:

1. **Use OpenAI-compatible API (e.g., Azure, local models)**
   - Modify `base_url` parameter in `src/frontier_cs/gen/llm.py` `instantiate_llm_client`
   - Or pass `base_url` when initializing `GPT` class in `llm_interface.py`
   - DeepSeek, Grok, etc. are already implemented using OpenAI SDK with different base_url

2. **Add a new LLM provider**
   - Add a new class in `src/frontier_cs/gen/llm_interface.py` (inherit `LLMInterface`, implement `call_llm`)
   - Add provider handling in `src/frontier_cs/gen/llm.py` `instantiate_llm_client`

3. **Add model prefix mapping**
   - Edit `src/frontier_cs/models.py` `get_model_prefix()` to map model name → file prefix
   - Example: `claude-sonnet-4-5-20250929` → `claude4.5sonnet`

4. **Modify prompt templates**
   - Research: system prompt in `research/scripts/generate_solutions.py`
   - Algorithmic: `CPP_SYSTEM_PROMPT` in `algorithmic/scripts/generate_solutions.py`

5. **Customize solution filename format**
   - `src/frontier_cs/gen/solution_format.py`
   - `src/frontier_cs/models.py`: `get_solution_filename()`, `get_solution_path()`
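
As a rough picture of customization point 2 (adding a provider), the sketch below shows the general shape of a subclass plus a dispatch branch. Everything here is a stand-in: the `LLMInterface` defined in the snippet is a placeholder for the project's base class, the `call_llm(prompt) -> str` signature is an assumption, and `EchoProvider` and `instantiate_client_sketch` are hypothetical names. Check `llm_interface.py` and `llm.py` for the real signatures before adapting it.

```python
from abc import ABC, abstractmethod


class LLMInterface(ABC):
    """Placeholder for the project's base class; the real signature may differ."""

    @abstractmethod
    def call_llm(self, prompt: str) -> str:
        ...


class EchoProvider(LLMInterface):
    """Hypothetical provider that echoes the prompt instead of calling an API."""

    def __init__(self, model: str) -> None:
        self.model = model

    def call_llm(self, prompt: str) -> str:
        # A real provider would send `prompt` to its API and return the reply text.
        return f"[{self.model}] {prompt}"


def instantiate_client_sketch(model: str) -> LLMInterface:
    """Stand-in dispatch: route `provider/name` model strings to a provider class."""
    provider, sep, name = model.partition("/")
    if sep and provider == "echo":
        return EchoProvider(name)
    raise ValueError(f"no provider registered for model {model!r}")


if __name__ == "__main__":
    client = instantiate_client_sketch("echo/demo")
    print(client.call_llm("hello"))
```

Keeping the provider behind one abstract method means the generation scripts only need the extra dispatch branch; the rest of the pipeline stays unchanged.
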