/*
 * CuTe DSL decode kernels for Mamba-3 autoregressive generation.
 *
 * Phase 1: No decode kernel needed (training only, no generation).
 * Phase 2: Optimized single-token SSM step for inference.
 *
 * Fuses: input_proj + conv_step + ssm_step + output_proj
 * into a single kernel launch for minimal latency.
 */
// Stub: the Phase 2 implementation is not written yet.
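
// Reference sketch (NOT the fused Phase 2 kernel): a scalar, host-side version of
// the conv_step and ssm_step stages named in the header comment. It might serve as
// a correctness oracle when the fused kernel is eventually written.
//
// ASSUMPTIONS (none of these come from this file):
//   - conv_step is taken to be a depthwise causal convolution over a rolling
//     window of the last K inputs for one channel.
//   - ssm_step is taken to be a generic diagonal selective-SSM update,
//       h[n] = exp(dt * A[n]) * h[n] + dt * B[n] * x,   y = sum_n C[n] * h[n].
//     The actual Mamba-3 recurrence may differ from this form.
//   - The names conv_step_ref and ssm_step_ref are hypothetical helpers.
//   - input_proj and output_proj are plain matrix-vector products and are omitted.
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: one conv step for a single channel.
// `window` holds the last K inputs (oldest first) and is updated in place.
inline float conv_step_ref(std::vector<float>& window,
                           const std::vector<float>& weight,
                           float x_new) {
    // Shift the window left by one and append the new input.
    for (std::size_t k = 0; k + 1 < window.size(); ++k) {
        window[k] = window[k + 1];
    }
    if (!window.empty()) {
        window.back() = x_new;
    }
    // Dot product of the window with the per-channel filter taps.
    float acc = 0.0f;
    for (std::size_t k = 0; k < window.size(); ++k) {
        acc += weight[k] * window[k];
    }
    return acc;
}

// Hypothetical helper: one SSM step for a single channel with state size N.
// `h` is the recurrent state and is updated in place; the return value is y.
inline float ssm_step_ref(std::vector<float>& h,
                          const std::vector<float>& A,
                          const std::vector<float>& B,
                          const std::vector<float>& C,
                          float dt,
                          float x) {
    float y = 0.0f;
    for (std::size_t n = 0; n < h.size(); ++n) {
        h[n] = std::exp(dt * A[n]) * h[n] + dt * B[n] * x;  // state update
        y += C[n] * h[n];                                   // readout
    }
    return y;
}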