/*
* CuTe DSL decode kernels for Mamba-3 autoregressive generation.
*
* Phase 1: Not needed (training only, no generation).
* Phase 2: Optimized single-token SSM step for inference.
*
* Fuses: input_proj + conv_step + ssm_step + output_proj
* into a single kernel launch for minimal latency.
*/
// Stub: the fused decode kernel is not implemented yet; it is planned for Phase 2.
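
The header comment lists four stages to fuse for one decoded token: input_proj, conv_step, ssm_step, and output_proj. As a rough illustration of that data flow, here is a minimal host-side C++ reference, not the planned kernel. Everything in it is an assumption made for the sketch: the names `DecodeDims`, `DecodeWeights`, `DecodeState`, and `decode_step_reference` are hypothetical, the recurrence is a generic diagonal SSM with fixed `a`, `b`, and `c` (a real Mamba-style layer would typically derive some of these from the input), and activations and gating are left out. A reference like this could serve as a correctness baseline once a fused kernel exists.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layer sizes; none of these names come from the original file.
struct DecodeDims { std::size_t d_model, d_inner, d_conv, d_state; };

// Hypothetical weights, all row-major. a, b, c are fixed here for simplicity.
struct DecodeWeights {
    std::vector<float> w_in;    // [d_inner][d_model] input projection
    std::vector<float> w_conv;  // [d_inner][d_conv]  depthwise conv taps
    std::vector<float> a;       // [d_inner][d_state] per-element state decay
    std::vector<float> b;       // [d_state]          input-to-state weights
    std::vector<float> c;       // [d_state]          state-to-output weights
    std::vector<float> w_out;   // [d_model][d_inner] output projection
};

// State carried from one decoded token to the next.
struct DecodeState {
    std::vector<float> conv_buf;  // [d_inner][d_conv], oldest sample first
    std::vector<float> h;         // [d_inner][d_state] SSM hidden state
};

// Processes one token x of size d_model, updates s in place, returns y of size d_model.
std::vector<float> decode_step_reference(const DecodeDims& d, const DecodeWeights& w,
                                         DecodeState& s, const std::vector<float>& x) {
    assert(d.d_conv >= 1);
    assert(x.size() == d.d_model);
    assert(s.conv_buf.size() == d.d_inner * d.d_conv);
    assert(s.h.size() == d.d_inner * d.d_state);

    std::vector<float> z(d.d_inner, 0.0f);
    for (std::size_t i = 0; i < d.d_inner; ++i) {
        // input_proj: one row of w_in dotted with the token.
        float u = 0.0f;
        for (std::size_t j = 0; j < d.d_model; ++j) u += w.w_in[i * d.d_model + j] * x[j];

        // conv_step: slide this channel's window by one sample, then apply the taps.
        float* win = &s.conv_buf[i * d.d_conv];
        for (std::size_t k = 0; k + 1 < d.d_conv; ++k) win[k] = win[k + 1];
        win[d.d_conv - 1] = u;
        float v = 0.0f;
        for (std::size_t k = 0; k < d.d_conv; ++k) v += w.w_conv[i * d.d_conv + k] * win[k];

        // ssm_step: h <- a * h + b * v, then read out z = sum_n c[n] * h[n].
        for (std::size_t n = 0; n < d.d_state; ++n) {
            float& h = s.h[i * d.d_state + n];
            h = w.a[i * d.d_state + n] * h + w.b[n] * v;
            z[i] += w.c[n] * h;
        }
    }

    // output_proj: map the inner channels back to the model width.
    std::vector<float> y(d.d_model, 0.0f);
    for (std::size_t m = 0; m < d.d_model; ++m)
        for (std::size_t i = 0; i < d.d_inner; ++i) y[m] += w.w_out[m * d.d_inner + i] * z[i];
    return y;
}
```

The per-channel loop body touches only that channel's slice of `conv_buf` and `h`, which is one reason the first three stages are natural candidates for a single launch; only the final projection needs values from every channel.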