# sky2/benchmarks/kernelbench/config.yaml
# KernelBench optimization benchmark configuration
# Usage: skydiscover-run evaluator/ -c config.yaml -s <strategy>
# Note: initial_program is automatically fetched from the KernelBench dataset based on the `level` and `problem_id` fields.
language: python
# Benchmark loader configuration
benchmark:
  enabled: true
  name: kernelbench
  resolver: benchmarks.kernelbench.resolver
  # Evaluator mode: set to false for native Python (no Docker), true for containerized
  use_docker: true  # Set to false when running on clusters without Docker/Podman privileges
  # KernelBench problem specification
  level: 1  # Problem difficulty level (1, 2, 3, or 4)
  problem_id: 1  # Specific problem ID within the level
  dataset_src: huggingface  # 'huggingface' or 'local'
  dataset_name: ScalingIntelligence/KernelBench
  # Evaluation configuration
  eval_mode: local  # 'local' or 'modal'
  gpu: H100  # GPU type for evaluation
  num_correct_trials: 5  # Number of correctness validation runs
  num_perf_trials: 100  # Number of performance measurement runs
diff_based_generation: true
max_iterations: 100
checkpoint_interval: 10
max_solution_length: 60000
llm:
  api_base: "${BASE_URL}"
  api_key: "${API_KEY}"
  models:
    - name: "gpt-5"
      weight: 1.0
  max_tokens: 32000
  timeout: 600
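# Example: `models` can list several entries to spread generation across an
# ensemble. It is assumed here that `weight` sets each model's relative sampling
# probability; the second model name below is purely hypothetical:
# models:
#   - name: "gpt-5"
#     weight: 0.7
#   - name: "gpt-5-mini"
#     weight: 0.3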
prompt:
  system_message: |-
    You are an expert in GPU kernel optimization and PyTorch performance engineering, with deep expertise
    in writing high-performance CUDA kernels, Triton kernels, and optimized PyTorch operations.

    PROBLEM SPECIFICATION:
    Your task is to optimize a PyTorch neural network operation to achieve maximum speedup
    over the baseline execution. The implementation is evaluated on GPU hardware and compared against:
    1. PyTorch eager mode (baseline)
    2. torch.compile() optimization

    PERFORMANCE METRICS:
    1. **speedup_over_eager**: Speedup compared to PyTorch eager execution (PRIMARY OBJECTIVE - maximize)
    2. **combined_score**: Same as speedup_over_eager (used for optimization)
    3. **speedup_over_compile**: Speedup compared to torch.compile() (SECONDARY - maximize)
    4. **kernel_time_ms**: Execution time of your optimized kernel in milliseconds (minimize)
    5. **ref_eager_time_ms**: Reference eager execution time in milliseconds (for comparison)

    OPTIMIZATION STRATEGIES:
    - Consider writing custom kernels in CUDA or Triton
    - Use efficient memory access patterns (coalesced reads/writes)
    - Minimize memory transfers between CPU and GPU
    - Leverage tensor cores when applicable
    - Use fused operations to reduce kernel launches
    - Optimize for the specific GPU architecture (H100, A100, etc.)
    - Use appropriate data types (fp16, bf16, fp32)
    - Minimize synchronization points

    TECHNICAL REQUIREMENTS:
    - **Correctness**: Your implementation must produce numerically correct results
    - **Determinism**: Use fixed random seeds if employing stochastic methods
    - **Error handling**: Handle edge cases and invalid inputs gracefully
    - **GPU compatibility**: Code must run on the specified GPU hardware
  # Raise the SkyDiscover default of 500, which causes the model to focus only on simplification
  suggest_simplification_after_chars: 5000
evaluator:
  timeout: 600
  max_retries: 3
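# Example invocation (a sketch, assuming `skydiscover-run` is on PATH and that
# the ${BASE_URL} / ${API_KEY} placeholders above are substituted from
# environment variables of the same name; the endpoint and key shown are
# placeholders, not real values):
#   export BASE_URL="https://api.example.com/v1"
#   export API_KEY="sk-placeholder"
#   skydiscover-run evaluator/ -c config.yaml -s <strategy>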