| You are an expert CUDA engineer. | |
| Your task is: | |
| 1. Completely ignore all input-loading code, file paths, main() logic, correctness checks, | |
| device allocations, free() calls, test loops, and anything else unrelated to the CUDA kernel. | |
| 2. Extract ONLY the CUDA kernel from the provided code: | |
| - The kernel must keep the original function name. | |
| - The kernel must keep the original parameter list. | |
| - The kernel body may be reformatted but not logically changed. | |
| 3. Rebuild a complete, compilable CUDA program using the fixed template provided below. | |
| 4. Insert the extracted kernel EXACTLY into the placeholder {{KERNEL_CODE}}. | |
| 5. Do NOT modify the test harness structure. | |
| The test harness will load .bin files automatically and compare results. | |
| 6. Use the following information: | |
| - Binary directory: {{BIN_DIR}} | |
| - Variables to load: {{VARNAMES}} | |
| - Number of test cases: {{NUM_TESTS}} | |
| - The test data files follow the format: data/{variable_name}_{case_idx}.bin, where {variable_name} is one of {{VARNAMES}} and {case_idx} runs from 1 to {{NUM_TESTS}}. | |
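The file-naming format above can be sketched as a small path builder. This is only an illustration: `make_bin_path` is a hypothetical helper name that is not part of the template, and the directory is passed in as a parameter on the assumption that it matches the `data` prefix shown in the format.

```cpp
#include <string>

// Hypothetical helper (not part of the template): builds
// "<dir>/<variable_name>_<case_idx>.bin" as described by the file format.
std::string make_bin_path(const std::string& dir,
                          const std::string& variable_name,
                          int case_idx) {
    return dir + "/" + variable_name + "_" + std::to_string(case_idx) + ".bin";
}
```

Under these assumptions, `make_bin_path("data", "A", 1)` yields `data/A_1.bin`, which could then be passed to `read_bin_f32_dyn`.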
| The output **must be a COMPLETE, COMPILABLE .cu file**. | |
| It must: | |
| - include exactly one main(), taken from the template, | |
| - print 'T' if all tests pass, | |
| - print 'F' otherwise, | |
| - compile with nvcc. | |
| Below is the CUDA test harness template you must strictly follow. | |
| You MUST follow the template's structure exactly. All sections in the template (read_bin_f32_dyn, CUDA kernel, correctness check, main(), variable loading, allocation, launch, correctness, free) must appear exactly once. | |
| Insert your kernel where {{KERNEL_CODE}} is located. | |
| ```cpp | |
| #include <cuda_runtime.h> | |
| #include <iostream> | |
| #include <fstream> | |
| #include <vector> | |
| #include <string> | |
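| #include <cmath> | |
| #include <cstdio>  // FILE, fopen, fseek, ftell, rewind, fread, fclose | |
| #include <cstdlib> // exit | |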
| using std::vector; | |
| using std::string; | |
| #define TOLERANCE 1e-3f // TOLERANCE may be adjusted per task | |
| #define NUM_TESTS {{NUM_TESTS}} | |
| // -------------------------- | |
| // read_bin_f32_dyn: read a float32 .bin file into a vector | |
| // -------------------------- | |
| vector<float> read_bin_f32_dyn(const string& filename) { | |
| FILE* fp = fopen(filename.c_str(), "rb"); | |
| if (!fp) { | |
| std::cerr << "Cannot open bin file: " << filename << std::endl; | |
| exit(1); | |
| } | |
| fseek(fp, 0, SEEK_END); | |
| long fsize = ftell(fp); | |
| rewind(fp); | |
| size_t n_elem = fsize / sizeof(float); | |
| vector<float> data(n_elem); | |
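| size_t n_read = fread(data.data(), sizeof(float), n_elem, fp); | |
| fclose(fp); | |
| if (n_read != n_elem) { // short read: truncated file or I/O error | |
| std::cerr << "Short read from bin file: " << filename << std::endl; | |
| exit(1); | |
| } | |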
| return data; | |
| } | |
| // -------------------------- | |
| // CUDA kernel (inserted here) | |
| // -------------------------- | |
| {{KERNEL_CODE}} | |
| // -------------------------- | |
| // correctness check | |
| // -------------------------- | |
| bool check_equal(const vector<float>& a, const vector<float>& b) { | |
| if (a.size() != b.size()) return false; | |
| for (size_t i = 0; i < a.size(); ++i) { | |
| if (fabs(a[i] - b[i]) > TOLERANCE) return false; | |
| } | |
| return true; | |
| } | |
| // -------------------------- | |
| // main test harness | |
| // -------------------------- | |
| int main() { | |
| bool all_pass = true; | |
| for (int case_id = 1; case_id <= NUM_TESTS; ++case_id) { | |
| // -------------------------- | |
| // load all variables | |
| // -------------------------- | |
| {{VAR_LOADING_BLOCK}} | |
| // -------------------------- | |
| // allocate GPU buffers | |
| // -------------------------- | |
| {{ALLOCATE_BLOCK}} | |
| // -------------------------- | |
| // kernel launch | |
| // -------------------------- | |
| {{KERNEL_LAUNCH_BLOCK}} | |
| // -------------------------- | |
| // correctness check | |
| // -------------------------- | |
| {{CORRECTNESS_BLOCK}} | |
| // -------------------------- | |
| // free GPU memory | |
| // -------------------------- | |
| {{FREE_BLOCK}} | |
| } | |
| std::cout << (all_pass ? "T" : "F") << std::endl; | |
| return 0; | |
| } | |
| ``` | |
| Below is the original CUDA source code. | |
| Extract ONLY the kernel and rebuild the program using the above template. | |
| ======== ORIGINAL CUDA CODE START ======== | |
| {{CODE}} | |
| ======== ORIGINAL CUDA CODE END ========== | |
| IMPORTANT RULES (strict): | |
| 1. Remove or replace any undefined types (e.g., BlockMessage). If a structure definition is missing, replace each field access with an equivalent primitive variable or array, e.g., int region1_index = blockMessage_region1_index[blockID]; | |
| 2. Do not generate any non-ASCII characters, including Chinese comments. All output must be strictly ASCII-only. | |
| 3. If the kernel uses external macros (such as BlockSizeInEval, MaxIntervalCount), detect them from the input code and insert default #define definitions at the top of the output file. | |
| 4. Remove the inline qualifier from any __global__ kernel function. (inline is meaningless and should not be applied to __global__ functions.) | |
| 5. Ensure that every identifier referenced inside the __global__ kernel is defined, either as a kernel parameter, as a local variable, or by a definition in the output file. No undefined external dependencies are allowed. | |
| 6. If the check fabs(a[i] - b[i]) > TOLERANCE fails for many elements, you may increase TOLERANCE in the template (e.g., from 1e-3f to 1e-1f). | |
| 7. Ignore warnings about unused local variables. These do not affect correctness or compilation and should not be treated as errors. |
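To make rule 6 concrete, the sketch below shows one way to see how many elements fail at a given tolerance before deciding to relax it. It is a standalone diagnostic under stated assumptions, not part of the fixed harness: `count_mismatches` and `max_abs_diff` are hypothetical helper names, and both vectors are assumed to have the same length.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical diagnostic (not part of the template): number of elements
// whose absolute difference exceeds tol. Only the common prefix is compared.
size_t count_mismatches(const std::vector<float>& a,
                        const std::vector<float>& b,
                        float tol) {
    size_t bad = 0;
    for (size_t i = 0; i < a.size() && i < b.size(); ++i) {
        if (std::fabs(a[i] - b[i]) > tol) ++bad;
    }
    return bad;
}

// Hypothetical diagnostic: largest absolute difference over the common prefix.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    float worst = 0.0f;
    for (size_t i = 0; i < a.size() && i < b.size(); ++i) {
        float d = std::fabs(a[i] - b[i]);
        if (d > worst) worst = d;
    }
    return worst;
}
```

Comparing the counts at 1e-3f and 1e-1f indicates whether the mismatches look like accumulated floating-point error, which a looser tolerance absorbs, or a genuine logic difference, which it would only hide.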