hotdog: highly optimised Toeplitz/Cauchy/displacement-rank GPU kernels

HOTDOG (Highly Optimised Toeplitz-Cauchy with Displacement Rank Acceleration On GPUs) is a CUDA kernel bundle for the structured linear-algebra primitives that underpin state-space models (S4/S5, Mamba), long-convolution networks (Hyena, H3), and low-displacement-rank layers.

The kernel exposes the usual Toeplitz/Cauchy ops β€” with full autograd, BF16/FP16/complex-dtype paths, stream-aware dispatch, and torch.compile compliance via pt2_compliant_tag + needs_fixed_stride_order.

Load from the Hugging Face Hub

pip install kernels torch
import torch
from kernels import get_kernel

hotdog = get_kernel("chrisvoncsefalvay/hotdog")

B, N, L = 4, 1024, 1024
v = torch.randn(B, N, device="cuda", dtype=torch.float32)
omega = torch.linspace(0.5, 1.5, L, device="cuda", dtype=torch.float32)
lambd = torch.linspace(-1.5, -0.5, N, device="cuda", dtype=torch.float32)

out = torch.empty(B, L, device="cuda", dtype=torch.float32)
hotdog.cauchy_kernel_forward(out, v, omega, lambd)

The kernels client picks the pre-built binary that matches your Python, PyTorch, and CUDA versions. For day-to-day PyTorch use prefer the top-level Python API (see below) rather than calling the raw ops.

Public API

hotdog.cauchy_kernel(v, omega, lambd)            # Cauchy kernel evaluation
hotdog.toeplitz_matvec(c, r, x)                  # y = T(c, r) @ x
hotdog.toeplitz_matvec_transpose(c, r, x)        # y = T(c, r)^T @ x
hotdog.toeplitz_solve(c, r, b)                   # T(c, r) x = b
hotdog.toeplitz_solve_symmetric(c, b)            # T(c, c) x = b  (fast path)
hotdog.toeplitz_inverse_apply(gs_u, gs_v, x)     # T^{-1} x via Gohberg-Semencul
hotdog.gs_compute_generators(c, r)               # Gohberg-Semencul generators
hotdog.toeplitz_to_cauchy(c, r)                  # C = F T F^{-1}
hotdog.displacement_matvec(G, H, x, disp_type=0) # y = A x, generator-aware
hotdog.cauchy_pin_l2(omega, lambd)               # L2 persistence pin

Every entry accepts an optional stream= keyword argument; forward-only entries additionally accept an out= buffer. Calls broadcast leading batch dimensions and stay correct under torch.compile and torch.cuda.graph capture. Only Sylvester-type displacement (disp_type=0) is supported today. nn.Module wrappers (CauchyKernel, ToeplitzMV, ToeplitzSolve, ToeplitzInverse, DisplacementMV) are provided for parameterised layers that store structured matrices as buffers.

What's included

  • Cauchy: tiled shared-memory reduction with Kahan-Babushka compensated summation, __nv_bfloat162 pair loads on the BF16 path, $+1$ bank-padded shared tiles, and a dedicated complex (float2) kernel for S4 poles. Custom CUDA backward over $v$, $\omega$, $\lambda$.
  • Toeplitz matvec: single-block shared-memory radix-2 FFT for padded $N\leq 1024$; cuFFT-backed circulant embedding for larger sizes. Explicit transpose entry; BF16/FP16 forward via promote-on-load.
  • Toeplitz solve: block-cooperative Levinson-Durbin with shared-memory linear-prediction coefficients, warp-shuffle reductions, and opt-in large-shmem via cudaFuncSetAttribute up to the device's sharedMemPerBlockOptin ceiling.
  • Gohberg-Semencul inverse application: two-stage batched fused path ([U(Jv)|U(v)] then [L(u)|L(Ju)]) with shared $x$ FFT, record_stream discipline on every intermediate, and a small-$N$ shared-memory FFT route.
  • Displacement-rank matvec: generator-aware frequency-domain Cauchy-sum kernel; $O(B\cdot r\cdot n)$ memory, pseudoinverse handling at the Sylvester diagonal pole.
  • cauchy_pin_l2: pins $\omega$ and $\lambda$ into the L2 persistence window on the current stream for repeated-call workloads (Ampere+).

Performance summary

NVIDIA GB10 (sm_121), PyTorch 2.10, CUDA 13.0:

Operation Shape ($B\times n$) HOTDOG vs reference Achieved
cauchy_kernel 16x4096 0.55 ms 25x naive 1.87 TF/s
cauchy_kernel 16x1024 0.08 ms 8.3x naive 855 GF/s
toeplitz_matvec 4x256 0.031 ms 1.6x FFT 9 GF/s
toeplitz_solve_symmetric 4x64 0.084 ms 54.7x SciPy β€”
toeplitz_solve_symmetric 4x256 0.278 ms 16.5x SciPy β€”
toeplitz_inverse_apply 4x256 0.344 ms 28.4x over 4x FFT β€”
displacement_matvec ($r=2$) 4x256 0.101 ms 1.9x dense 44 GF/s

At $N\geq 1024$ the Cauchy kernel runs at $\approx 60%$ of the GB10 fp32 peak. Small-shape rows are launch-overhead bound, not kernel-bound; closing that gap is an orchestration concern (torch.cuda.graph, op fusion, cauchy_pin_l2), not a kernel choice.

Techniques

  1. Tiled shared-memory Cauchy reduction with Kahan-Babushka compensation against $|\omega-\lambda|$ cancellations.
  2. Architecture-aware Toeplitz dispatch between in-kernel shared-memory FFT and cuFFT-backed embedding.
  3. Block-cooperative Levinson-Durbin with a shared-memory-heavy L1 carveout and opt-in dynamic shmem past the 48 KB default.
  4. Generator-aware displacement matvec that never materialises the dense $[B,n,n]$ matrix; memory is $O(B\cdot r\cdot n)$.
  5. Fused Gohberg-Semencul apply folding four triangular Toeplitz-matvecs into two batched stages with shared $x$ FFT and record_stream discipline.
  6. Cache-aware loads and stores (__ldca, __ldcs, __stwt) plus bank-conflict padding on shared tiles.
  7. Principled pole handling at $\omega_i=\omega_j^{-1}$: minimum-norm pseudoinverse rather than epsilon regularisation.
  8. PT2 compliance: pt2_compliant_tag, needs_fixed_stride_order, and Meta (FakeTensor) implementations on every op.

Supported dtypes

Op float32 bfloat16 float16 complex64
cauchy_kernel (forward) βœ“ βœ“ β€” βœ“
cauchy_kernel (backward) βœ“ β€” β€” β€”
toeplitz_matvec βœ“ βœ“ βœ“ β€”
toeplitz_solve_* βœ“ β€” β€” β€”
toeplitz_inverse_apply βœ“ β€” β€” β€”
displacement_matvec βœ“ β€” β€” β€”
toeplitz_to_cauchy βœ“ β€” β€” β€”

Practical constraints

  • Toeplitz FFT paths zero-pad non-power-of-two inputs to the next power of two.
  • The shared-memory FFT path is enabled for padded $N\leq 1024$; larger sizes use cuFFT.
  • The symmetric solve fast path falls back to the structured general solver when its residual check does not pass or when $c\neq r$.
  • Levinson recursion requires non-singular principal minors; unstable slices are routed to a dense solve automatically.
  • The fast Gohberg-Semencul application assumes symmetric generators ($v\approx\text{reverse}(u)$); non-symmetric or gradient-bearing calls take the exact recovery path.

Build targets

Pre-built Hub variants cover sm_80, sm_86, sm_89, sm_90, and sm_120+PTX (Ampere through Blackwell) on Linux x86_64 and aarch64 for the Python and PyTorch versions kernels-community publishes against. For local development and additional arches, build with kernel-builder:

nix run -L --max-jobs 1 --cores 4 .#build-and-copy

Links

Author

I'm Chris von Csefalvay, an AI researcher specialising in post-training, and the author of Post-Training: A Practical Guide for AI Engineers and Developers (No Starch Press, 2026). I also write Post-Slop, a periodic diatribe about AI, and what it's doing for society. You can also find me on LinkedIn and X.

License

MIT

Downloads last month
9
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support