File size: 5,117 Bytes
61ba51e | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 | # sgl-kernel
[Kernel Library](https://github.com/sgl-project/sglang/tree/main/sgl-kernel) for LLM inference engines
<div align="center">
[](https://github.com/sgl-project/sglang/blob/main/LICENSE)
[](https://pypi.org/project/sgl-kernel)
</div>
sgl-kernel provides optimized compute primitives for LLM inference engines, enabling efficient inference for large language models and vision-language models through custom kernel operations. It has been used by [LightLLM](https://github.com/ModelTC/LightLLM), [SGLang](https://github.com/sgl-project/sglang) and so on.
## Installation
Requires torch == 2.9.1
```bash
# Latest version
pip3 install sgl-kernel --upgrade
```
## Building from Source
Requires
- CMake ≥3.31,
- Python ≥3.10
- scikit-build-core
- ninja(optional)
### Use Makefile to build sgl-kernel
```bash
make build
```
### Limit build resource usage (CPU / parallelism)
By default, `make build` uses all available CPU cores. You can override build parallelism and NVCC compile threads:
```bash
# Limit parallel jobs (controls both make and cmake parallelism)
make build MAX_JOBS=2
# Additionally limit NVCC internal threads (reduces CPU and peak memory)
make build MAX_JOBS=2 CMAKE_ARGS="-DSGL_KERNEL_COMPILE_THREADS=1"
```
## Contribution
### Steps to add a new kernel:
1. Implement the kernel in [csrc](https://github.com/sgl-project/sglang/tree/main/sgl-kernel/csrc)
2. Expose the interface in [include/sgl_kernel_ops.h](https://github.com/sgl-project/sglang/blob/main/sgl-kernel/include/sgl_kernel_ops.h)
3. Create torch extension in [csrc/common_extension.cc](https://github.com/sgl-project/sglang/blob/main/sgl-kernel/csrc/common_extension.cc)
4. Update [CMakeLists.txt](https://github.com/sgl-project/sglang/blob/main/sgl-kernel/CMakeLists.txt) to include new CUDA source
5. Expose Python interface in [python](https://github.com/sgl-project/sglang/blob/main/sgl-kernel/python/sgl_kernel)
6. Add test and benchmark
### Development Tips
1. When creating torch extensions, add the function definition with `m.def`, and device binding with `m.impl`:
- How to write schema: [Schema reference](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/README.md#func)
```cpp
// We need def with schema here for torch.compile
m.def(
"bmm_fp8(Tensor A, Tensor B, Tensor! D, Tensor A_scale, Tensor B_scale, Tensor workspace_buffer, "
"int cublas_handle) -> ()");
m.impl("bmm_fp8", torch::kCUDA, &bmm_fp8);
```
### Adapting C++ Native Types for Torch Compatibility
Third-party C++ libraries often use int and float, but PyTorch bindings require int64_t and double due to Python's type mapping.
Use make_pytorch_shim from sgl_kernel_torch_shim.h to handle conversions automatically:
```cpp
// Add type conversion for int -> int64_t
template <>
struct pytorch_library_compatible_type<int> {
using type = int64_t;
static int convert_from_type(int64_t arg) {
TORCH_CHECK(arg <= std::numeric_limits<int>::max(), "value too large");
TORCH_CHECK(arg >= std::numeric_limits<int>::min(), "value too small");
return arg;
}
};
```
```cpp
// Wrap your function
m.impl("fwd", torch::kCUDA, make_pytorch_shim(&mha_fwd));
```
### Testing & Benchmarking
1. Add pytest tests in [tests/](https://github.com/sgl-project/sglang/tree/main/sgl-kernel/tests), if you need to skip some test, please use `@pytest.mark.skipif`
```python
@pytest.mark.skipif(
skip_condition, reason="Nvfp4 Requires compute capability of 10 or above."
)
```
2. Add benchmarks using [triton benchmark](https://triton-lang.org/main/python-api/generated/triton.testing.Benchmark.html) in [benchmark/](https://github.com/sgl-project/sglang/tree/main/sgl-kernel/benchmark)
**We recommend using `triton.testing.do_bench_cudagraph` for kernel benchmarking**:
Compared to `triton.testing.do_bench`, `do_bench_cudagraph` provides:
- Reduced CPU overhead impact for more accurate kernel performance measurements
- Incorporation of PDL (Programmatic Dependent Launch) effects into individual kernel results
- More realistic performance data on PDL-supported architectures (SM >= 90)
3. Run test suite
## Kernel Size Analysis
Analyze CUDA kernel sizes in compiled wheel files to identify oversized kernels and template-instantiation bloat:
This tool requires `cubloaty` (install with `pip install cubloaty`) to work.
```bash
# Install cubloaty
pip install cubloaty
# Analyze a wheel file
python analyze_whl_kernel_sizes.py path/to/sgl_kernel-*.whl
# Custom output file
python analyze_whl_kernel_sizes.py path/to/sgl_kernel-*.whl --output my_analysis.txt
```
The tool generates:
- A text report with:
- Kernel groups (by name prefix)
- Individual kernel sizes (sorted by size)
Use this to identify large kernels and potential template instantiation bloat.
## FAQ
- Q: Segmentation fault with CUDA 12.6
- A: Update ptxas to 12.8, reference: [segment fault error](https://github.com/Dao-AILab/flash-attention/issues/1453)
|