# DeepGEMM

DeepGEMM kernel for the [Hugging Face kernel-builder](https://github.com/huggingface/kernels) infrastructure.

This package provides FP8/FP4/BF16 GEMM kernels, einsum, attention, and hyperconnection operations from [DeepSeek-AI/DeepGEMM](https://github.com/DeepSeek-AI/DeepGEMM), adapted to the kernels-community build structure with torch library bindings.

## Features

- **FP8/FP4 GEMMs**: NT, NN, TN, TT variants with M-grouped and K-grouped support
- **BF16 GEMMs**: NT, NN, TN, TT variants with M-grouped and K-grouped support
- **cuBLASLt GEMMs**: NT, NN, TN, TT wrappers
- **Einsum**: `bmk,bnk->mn`, `bhr,hdr->bhd`, `bhd,hdr->bhr` expressions (BF16 and FP8)
- **Attention**: FP8 MQA logits (regular and paged)
- **Hyperconnection**: TF32 prenorm GEMM
- **Layout utilities**: scaling-factor transformations, TMA alignment

## Architecture Support

- SM 9.0a (Hopper / H100)
- SM 10.0a (Blackwell / B200)

## Requirements

- CUDA >= 12.1
- PyTorch >= 2.1
- CUTLASS 3.9+
- NVRTC (part of the CUDA Toolkit)

## Installation

```bash
pip install kernels
```

```python
import kernels

kernels.install("kernels-community/DeepGEMM")
```

## Usage

```python
import deep_gemm

# FP8 GEMM: D = A @ B.T
deep_gemm.fp8_gemm_nt((a_fp8, sfa), (b_fp8, sfb), d)

# BF16 GEMM: D = A @ B.T
deep_gemm.bf16_gemm_nt(a_bf16, b_bf16, d)

# cuBLASLt GEMM
deep_gemm.cublaslt_gemm_nt(a, b, d)
```

## JIT Compilation

DeepGEMM uses Just-In-Time (JIT) compilation for its CUDA kernels. The kernel templates (`.cuh` files in `include/deep_gemm/`) are compiled at runtime using NVCC or NVRTC. First invocations may be slower due to compilation; results are cached in `~/.deep_gemm/` for subsequent calls.

### CUTLASS Runtime Dependency

The JIT-compiled kernels depend on CUTLASS headers (`cute/`, `cutlass/`) at runtime. The package automatically searches for CUTLASS in these locations, in order:

1. `DG_CUTLASS_INCLUDE` environment variable (direct path to the include directory)
2. `CUTLASS_HOME` environment variable (`$CUTLASS_HOME/include`)
3. Headers bundled in the package's `include/` directory
4. `$CUDA_HOME/include` (some CUDA 12.8+ installs bundle `cute/`)
5. The `nvidia-cutlass` Python package

If JIT compilation fails with missing CUTLASS headers, set one of these:

```bash
export CUTLASS_HOME=/path/to/cutlass
# or
export DG_CUTLASS_INCLUDE=/path/to/cutlass/include
```