| You are an expert CUDA engineer. |
|
|
| Your task is: |
| 1. Completely ignore all input-loading code, file paths, main() logic, correctness checks, |
| device allocations, free(), test loops, or anything unrelated to the CUDA kernel. |
| 2. Extract ONLY the CUDA kernel from the provided code: |
| - The kernel must keep the original function name. |
| - The kernel must keep the original parameter list. |
| - The kernel body may be reformatted but not logically changed. |
| 3. Rebuild a complete, compilable CUDA program using the fixed template provided below. |
| 4. Insert the extracted kernel into the placeholder {{KERNEL_CODE}} exactly as extracted, changing it only where the IMPORTANT RULES below require it. |
| 5. Do NOT modify the test harness structure. |
| The test harness will load .bin files automatically and compare results. |
| 6. Use the following information: |
| - Binary directory: {{BIN_DIR}} |
| - Variables to load: {{VARNAMES}} |
| - Number of test cases: {{NUM_TESTS}} |
| - The test data files follow the format: data/{variable_name}_{case_idx}.bin, where {variable_name} is one of: {{VARNAMES}} |
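
The naming scheme above can be sketched as a small path-building helper. This is only an illustration: `bin_path` and its parameter names are hypothetical and not part of the template, and it assumes the directory and file name are joined with a forward slash. The template's loop runs `case_id` from 1 to NUM_TESTS, so case indices are 1-based.

```cpp
#include <string>

// Hypothetical helper (not part of the template): builds
// "<bin_dir>/<variable_name>_<case_idx>.bin" for one variable and test case.
std::string bin_path(const std::string& bin_dir,
                     const std::string& variable_name,
                     int case_idx) {
    return bin_dir + "/" + variable_name + "_" + std::to_string(case_idx) + ".bin";
}
```

For instance, `bin_path("data", "input", 1)` gives `data/input_1.bin`, where `input` is a made-up variable name standing in for one entry of {{VARNAMES}}.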
|
|
| The output **must be a COMPLETE, COMPILABLE .cu file**. |
| It must: |
| - include exactly one main() from the template, |
| - print 'T' if all tests pass, |
| - print 'F' otherwise, |
| - compile with nvcc. |
|
|
| Below is the CUDA test harness template you must strictly follow. |
| You MUST follow the template's structure exactly. All sections in the template (read_bin_f32_dyn, CUDA kernel, correctness check, main(), variable loading, allocation, launch, correctness, free) must appear exactly once. |
| Insert your kernel where {{KERNEL_CODE}} is located. |
|
|
| ```cpp |
| #include <cuda_runtime.h> |
| #include <iostream> |
| #include <fstream> |
| #include <vector> |
| #include <string> |
| #include <cmath> |
| #include <cstdio>  // FILE, fopen, fseek, ftell, fread, fclose |
| #include <cstdlib> // exit |
|
|
| using std::vector; |
| using std::string; |
|
|
| #define TOLERANCE 1e-3f // The TOLERANCE can be modified according to different tasks |
| #define NUM_TESTS {{NUM_TESTS}} |
|
|
| // -------------------------- |
| // read_bin_f32_dyn: read .bin function |
| // -------------------------- |
| vector<float> read_bin_f32_dyn(const string& filename) { |
| FILE* fp = fopen(filename.c_str(), "rb"); |
| if (!fp) { |
| std::cerr << "Cannot open bin file: " << filename << std::endl; |
| exit(1); |
| } |
|
|
| fseek(fp, 0, SEEK_END); |
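| long fsize = ftell(fp); |
| if (fsize < 0) { |
| std::cerr << "Cannot determine size of bin file: " << filename << std::endl; |
| exit(1); |
| } |
| rewind(fp); |
|
|
| size_t n_elem = static_cast<size_t>(fsize) / sizeof(float); |
| vector<float> data(n_elem); |
| // A short read means the file is truncated or unreadable. |
| if (fread(data.data(), sizeof(float), n_elem, fp) != n_elem) { |
| std::cerr << "Cannot read bin file: " << filename << std::endl; |
| exit(1); |
| } |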
| fclose(fp); |
| return data; |
| } |
|
|
| // -------------------------- |
| // CUDA kernel (inserted here) |
| // -------------------------- |
| {{KERNEL_CODE}} |
|
|
| // -------------------------- |
| // correctness check |
| // -------------------------- |
| bool check_equal(const vector<float>& a, const vector<float>& b) { |
| if (a.size() != b.size()) return false; |
| for (size_t i = 0; i < a.size(); ++i) { |
| // Negated <= so that a NaN difference fails instead of passing silently. |
| if (!(std::fabs(a[i] - b[i]) <= TOLERANCE)) return false; |
| } |
| return true; |
| } |
|
|
| // -------------------------- |
| // main test harness |
| // -------------------------- |
| int main() { |
| bool all_pass = true; |
|
|
| for (int case_id = 1; case_id <= NUM_TESTS; ++case_id) { |
|
|
| // -------------------------- |
| // load all variables |
| // -------------------------- |
| {{VAR_LOADING_BLOCK}} |
|
|
| // -------------------------- |
| // allocate GPU buffers |
| // -------------------------- |
| {{ALLOCATE_BLOCK}} |
|
|
| // -------------------------- |
| // kernel launch |
| // -------------------------- |
| {{KERNEL_LAUNCH_BLOCK}} |
|
|
| // -------------------------- |
| // correctness check |
| // -------------------------- |
| {{CORRECTNESS_BLOCK}} |
|
|
| // -------------------------- |
| // free GPU memory |
| // -------------------------- |
| {{FREE_BLOCK}} |
| } |
|
|
| std::cout << (all_pass ? "T" : "F") << std::endl; |
| return 0; |
| } |
| ``` |
|
|
| Below is the original CUDA source code. |
| Extract ONLY the kernel and rebuild the program using the above template. |
|
|
| ======== ORIGINAL CUDA CODE START ======== |
| {{CODE}} |
| ======== ORIGINAL CUDA CODE END ========== |
|
|
| IMPORTANT RULES (strict): |
| 1. Remove or replace any undefined types (e.g., BlockMessage). If the structure definition is missing, expand it using equivalent primitive variables, e.g., int region1_index = blockMessage_region1_index[blockID]; |
| 2. Do not generate any non-ASCII characters or Chinese comments. All output must be strictly ASCII-only. |
| 3. If the kernel uses external macros (such as BlockSizeInEval, MaxIntervalCount), detect them from the input code and insert default definitions at the top of the normalized code. |
| 4. Remove the inline qualifier from any __global__ kernel function. (inline is meaningless and should not be applied to __global__ functions.) |
| 5. Ensure that all parameters referenced inside the __global__ kernel are actually defined within the function scope. No undefined external dependencies are allowed. |
| 6. If the fabs(a[i] - b[i]) comparison fails on many elements because of small numerical differences, TOLERANCE in the template may be increased (e.g., from 1e-3f to 1e-1f). |
| 7. Ignore warnings about unused local variables. These do not affect correctness or compilation and should not be treated as errors. |
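
Rules 1 and 3 can be illustrated with a short host-side sketch. It is not taken from any original kernel: the default value 256 and the function name `load_region1_index` are assumptions made for illustration, and a plain function stands in for the kernel body so the snippet compiles without a GPU. Only `BlockSizeInEval`, `blockMessage_region1_index`, and `blockID` come from the rules above.

```cpp
// Rule 3: guard each default so that a definition supplied elsewhere wins.
// The value 256 is an assumed placeholder, not a value from the original code.
#ifndef BlockSizeInEval
#define BlockSizeInEval 256
#endif

static_assert(BlockSizeInEval > 0, "BlockSizeInEval must be positive");

// Rule 1: suppose the original kernel took a `const BlockMessage*` and read
// `msgs[blockID].region1_index`. With the struct definition missing, that
// field becomes its own primitive array parameter.
// Hypothetical host function standing in for the relevant kernel lines.
int load_region1_index(const int* blockMessage_region1_index, int blockID) {
    int region1_index = blockMessage_region1_index[blockID];
    return region1_index;
}
```

Inside a real kernel the same read would appear in the `__global__` function body, with `blockMessage_region1_index` added to the parameter list and a matching device buffer allocated by the harness.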