KernelPilot-Optimization / tool /prompt_norm.txt
wentaochen's picture
v1 init
949310d
You are an expert CUDA engineer.
Your task is:
1. Completely ignore all input-loading code, file paths, main() logic, correctness checks,
device allocations, free(), test loops, or anything unrelated to the CUDA kernel.
2. Extract ONLY the CUDA kernel from the provided code:
- The kernel must keep the original function name.
- The kernel must keep the original parameter list.
- The kernel body may be reformatted but not logically changed.
3. Rebuild a complete, compilable CUDA program using the fixed template provided below.
4. Insert the extracted kernel EXACTLY into the placeholder {{KERNEL_CODE}}.
5. Do NOT modify the test harness structure.
The test harness will load .bin files automatically and compare results.
6. Use the following information:
- Binary directory: {{BIN_DIR}}
- Variables to load: {{VARNAMES}}
- Number of test cases: {{NUM_TESTS}}
- The test data files follow the format: data/{variable_name}_{case_idx}.bin, where {variable_name} is one of: {{VARNAMES}}
The output **must be a COMPLETE, COMPILABLE .cu file**.
It must:
- include exactly one main() from the template,
- print 'T' if all tests pass,
- print 'F' otherwise,
- compile with nvcc.
Below is the CUDA test harness template you must strictly follow.
You MUST follow the template's structure exactly. All sections in the template (read_bin_f32_dyn, CUDA kernel, correctness check, main(), variable loading, allocation, launch, correctness, free) must appear exactly once.
Insert your kernel where {{KERNEL_CODE}} is located.
```cpp
#include <cuda_runtime.h>
#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <cmath>
using std::vector;
using std::string;
#define TOLERANCE 1e-3f // The TOLERANCE can be modified according to different tasks
#define NUM_TESTS {{NUM_TESTS}}
// --------------------------
// read_bin_f32_dyn: read .bin function
// --------------------------
vector<float> read_bin_f32_dyn(const string& filename) {
FILE* fp = fopen(filename.c_str(), "rb");
if (!fp) {
std::cerr << "Cannot open bin file: " << filename << std::endl;
exit(1);
}
fseek(fp, 0, SEEK_END);
long fsize = ftell(fp);
rewind(fp);
size_t n_elem = fsize / sizeof(float);
vector<float> data(n_elem);
fread(data.data(), sizeof(float), n_elem, fp);
fclose(fp);
return data;
}
// --------------------------
// CUDA kernel (inserted here)
// --------------------------
{{KERNEL_CODE}}
// --------------------------
// correctness check
// --------------------------
bool check_equal(const vector<float>& a, const vector<float>& b) {
if (a.size() != b.size()) return false;
for (size_t i = 0; i < a.size(); ++i) {
if (fabs(a[i] - b[i]) > TOLERANCE) return false;
}
return true;
}
// --------------------------
// main test harness
// --------------------------
int main() {
bool all_pass = true;
for (int case_id = 1; case_id <= NUM_TESTS; ++case_id) {
// --------------------------
// load all variables
// --------------------------
{{VAR_LOADING_BLOCK}}
// --------------------------
// allocate GPU buffers
// --------------------------
{{ALLOCATE_BLOCK}}
// --------------------------
// kernel launch
// --------------------------
{{KERNEL_LAUNCH_BLOCK}}
// --------------------------
// correctness check
// --------------------------
{{CORRECTNESS_BLOCK}}
// --------------------------
// free GPU memory
// --------------------------
{{FREE_BLOCK}}
}
std::cout << (all_pass ? "T" : "F") << std::endl;
return 0;
}
Below is the original CUDA source code.
Extract ONLY the kernel and rebuild the program using the above template.
======== ORIGINAL CUDA CODE START ========
{{CODE}}
======== ORIGINAL CUDA CODE END ==========
IMPORTANT RULES (strict):
1. Remove or replace any undefined types (e.g., BlockMessage). If the structure definition is missing, expand it using equivalent primitive variables, e.g., int region1_index = blockMessage_region1_index[blockID];
2. Do not generate any non-ASCII characters or Chinese comments. All output must be strictly ASCII-only.
3. If the kernel uses external macros (such as BlockSizeInEval, MaxIntervalCount), detect them from the input code and insert default definitions at the top of the normalized code.
4. Remove the inline qualifier from any __global__ kernel function. (inline is meaningless and should not be applied to __global__ functions.)
5. Ensure that all parameters referenced inside the __global__ kernel are actually defined within the function scope. No undefined external dependencies are allowed.
6. If many numerical mismatches occur in fabs(a[i] - b[i]), allow increasing the tolerance in the template (e.g., from 1e-3f to 1e-1f).
7. Ignore warnings about unused local variables. These do not affect correctness or compilation and should not be treated as errors.