The task is to write a CUDA kernel function that runs on the GPU. The benchmark code for this task is:
[code]
Optimize the kernel function to reduce its execution time on the GPU.
The output should be the complete contents of a single .cu file containing exactly ONE kernel function.
Do not modify the test part. The test data contains exactly four input sets, and the generated .cu file must call the kernel function exactly once per input set, for a total of four kernel invocations. Do not add timing logic, profiling wrappers, or repeated kernel calls that would launch the kernel more than once for any input.
When generating CUDA code, you must produce a complete, standalone, compilable program, not just a kernel or code fragment.
The program should include headers, data structures, the kernel definition, a main() function that allocates memory, launches the kernel, and prints "T" or "F" based on correctness.
Follow these strict rules:
1. Always make the code fully self-contained and directly compilable with nvcc file.cu -o file. No missing functions, dependencies, or external headers.
2. Do not use std::max, std::min, or std::abs in device code. Always use fmaxf, fminf, and fabsf instead.
3. Do not use INFINITY; use CUDART_INF_F (defined in <math_constants.h>, which ships with the CUDA toolkit) or a large constant (e.g., 1e30f) instead.
4. Include all required headers: <vector>, <cmath>, <cstdint>, and <cuda_runtime.h>.
5. Avoid duplicate includes and never mark __global__ functions as inline.
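As an illustration of rules 1 to 5, here is a minimal sketch of a program in the required shape. It is not the benchmark code: the kernel rowMinAbsKernel, the helpers rowMinAbsRef and runCase, the four matrix sizes, and the generated input values are all hypothetical and chosen only for this example. <cstdio> is added on top of the required headers because the sketch uses printf.

```cuda
#include <vector>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not the benchmark kernel): one thread per row,
// out[r] = min over the row of |in[r][c]|.
// Rule 5: a __global__ function is never marked inline.
__global__ void rowMinAbsKernel(const float* in, float* out, int rows, int cols) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    float best = 1e30f;  // Rule 3: large constant instead of INFINITY
    for (int c = 0; c < cols; ++c) {
        // Rule 2: fminf/fabsf instead of std::min/std::abs in device code
        best = fminf(best, fabsf(in[(size_t)r * cols + c]));
    }
    out[r] = best;
}

// Host-side reference for one row, used only to check the GPU result.
static float rowMinAbsRef(const float* row, int cols) {
    float best = 1e30f;
    for (int c = 0; c < cols; ++c) best = std::fmin(best, std::fabs(row[c]));
    return best;
}

// Runs one input set: exactly one kernel launch, then a comparison.
static bool runCase(int rows, int cols) {
    std::vector<float> h_in((size_t)rows * cols), h_out(rows);
    for (size_t i = 0; i < h_in.size(); ++i)
        h_in[i] = (float)((int)(i * 37u % 101u) - 50) * 0.5f;  // made-up data

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, h_in.size() * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, h_out.size() * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (rows + threads - 1) / threads;
    rowMinAbsKernel<<<blocks, threads>>>(d_in, d_out, rows, cols);  // single launch
    bool ok = (cudaGetLastError() == cudaSuccess);
    ok = (cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float),
                     cudaMemcpyDeviceToHost) == cudaSuccess) && ok;

    for (int r = 0; ok && r < rows; ++r)
        ok = (h_out[r] == rowMinAbsRef(&h_in[(size_t)r * cols], cols));

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}

int main() {
    // Four hypothetical input sets, giving four kernel invocations in total.
    const int sizes[4][2] = {{64, 32}, {100, 7}, {1, 1000}, {513, 129}};
    bool ok = true;
    for (int i = 0; i < 4; ++i) ok = runCase(sizes[i][0], sizes[i][1]) && ok;
    printf("%s\n", ok ? "T" : "F");
    return 0;
}
```

The exact float comparison is acceptable here only because fminf and fabsf are exact operations; a kernel that accumulates sums would more plausibly need a tolerance check.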
Restrict your modifications to the CUDA kernel function. Do not change any other part of the program, including:
- data loading, I/O, or test logic
- host-side function definitions
- main() function or CUDA memory allocation logic
Do not rename variables or structs defined outside the kernel region.