Technical Specification: SwiGLU-Attention (SWA) Fusion Kernel
Document ID: CMS-SWA-2025-001
Status: Proprietary / High Performance
Abstract
In a standard transformer, the Multi-Head Attention (MHA) block and the Feed-Forward Network (FFN) block are executed as discrete kernels, forcing the activation tensor through multiple round trips to Global Memory (VRAM). SWA Fusion merges the two into a single computational pass.
Computational Logic
The kernel computes the $Q$, $K$, $V$ projections in the same pipeline as the $W_1$ and $W_3$ projections of the SwiGLU layer:
- Shared Input Latches: Input $x$ is loaded into Shared Memory (SRAM) once and reused by both the attention and FFN paths.
- Parallel Projections: with $Q = xW_Q$, $K = xW_K$, $V = xW_V$, $$Y_{\text{attn}} = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$ $$Y_{\text{ffn}} = \left(\mathrm{SiLU}(xW_1) \odot xW_3\right)W_2$$ where $\odot$ denotes the element-wise (Hadamard) product.
- Fused Accumulation: $x_{\text{out}} = x + Y_{\text{attn}} + Y_{\text{ffn}}$ is computed in a single thread block before writing back to HBM. Note that this is the parallel-block formulation: both sub-layers read the same input $x$, which is what makes a single-pass fusion possible (see the kernel sketch below).
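The following is a minimal illustrative sketch of the data flow above, not the production SWA kernel: one head, one thread block, one thread per token, plain FMA arithmetic instead of matrix-core instructions, and no tiling. All names and dimensions here (`fused_swa_block`, `N_TOK`, `D_MODEL`, `D_FF`, the row-major weight layouts) are assumptions made for the example. It is written in CUDA C++ for readability; the same structure ports to HIP for the CDNA3 target.

```cuda
#include <cuda_runtime.h>

// Hypothetical toy dimensions chosen so everything fits in static shared
// memory; a production kernel would tile over the sequence instead.
#define N_TOK   64  // sequence length == threads per block
#define D_MODEL 32  // model width == head dim (single head)
#define D_FF    64  // SwiGLU hidden width

__global__ void fused_swa_block(const float* __restrict__ x,   // [N_TOK][D_MODEL]
                                const float* __restrict__ Wq,  // [D_MODEL][D_MODEL]
                                const float* __restrict__ Wk,
                                const float* __restrict__ Wv,
                                const float* __restrict__ W1,  // [D_MODEL][D_FF]
                                const float* __restrict__ W3,  // [D_MODEL][D_FF]
                                const float* __restrict__ W2,  // [D_FF][D_MODEL]
                                float* __restrict__ out) {
    // Shared input latch: x is staged in SRAM once and reused by both the
    // attention path and the SwiGLU path, so HBM sees a single activation read.
    __shared__ float xs[N_TOK][D_MODEL];
    __shared__ float Ks[N_TOK][D_MODEL];
    __shared__ float Vs[N_TOK][D_MODEL];
    const int t = threadIdx.x;  // this thread owns token t
    for (int i = 0; i < D_MODEL; ++i) xs[t][i] = x[t * D_MODEL + i];
    __syncthreads();

    // Pipelined projections: q stays in registers; K and V go to SRAM
    // because every token attends over them.
    float q[D_MODEL];
    for (int j = 0; j < D_MODEL; ++j) {
        float aq = 0.f, ak = 0.f, av = 0.f;
        for (int i = 0; i < D_MODEL; ++i) {
            const float xi = xs[t][i];
            aq += xi * Wq[i * D_MODEL + j];
            ak += xi * Wk[i * D_MODEL + j];
            av += xi * Wv[i * D_MODEL + j];
        }
        q[j] = aq; Ks[t][j] = ak; Vs[t][j] = av;
    }
    __syncthreads();

    // Attention row for token t: Softmax(q K^T / sqrt(d_k)) V,
    // with the usual max-subtraction for numerical stability.
    float scores[N_TOK], m = -1e30f;
    for (int s = 0; s < N_TOK; ++s) {
        float acc = 0.f;
        for (int i = 0; i < D_MODEL; ++i) acc += q[i] * Ks[s][i];
        scores[s] = acc * rsqrtf((float)D_MODEL);
        m = fmaxf(m, scores[s]);
    }
    float denom = 0.f;
    for (int s = 0; s < N_TOK; ++s) { scores[s] = expf(scores[s] - m); denom += scores[s]; }
    float y_attn[D_MODEL] = {0.f};
    for (int s = 0; s < N_TOK; ++s) {
        const float p = scores[s] / denom;
        for (int i = 0; i < D_MODEL; ++i) y_attn[i] += p * Vs[s][i];
    }

    // SwiGLU path reuses the latched xs: (SiLU(x W1) * x W3) W2.
    float y_ffn[D_MODEL] = {0.f};
    for (int h = 0; h < D_FF; ++h) {
        float a = 0.f, b = 0.f;
        for (int i = 0; i < D_MODEL; ++i) {
            a += xs[t][i] * W1[i * D_FF + h];
            b += xs[t][i] * W3[i * D_FF + h];
        }
        const float g = (a / (1.f + expf(-a))) * b;  // SiLU(a) * b
        for (int i = 0; i < D_MODEL; ++i) y_ffn[i] += g * W2[h * D_MODEL + i];
    }

    // Fused accumulation: the residual add happens in registers; the only
    // HBM write is the final x_out.
    for (int i = 0; i < D_MODEL; ++i)
        out[t * D_MODEL + i] = xs[t][i] + y_attn[i] + y_ffn[i];
}
```

A single-block launch such as `fused_swa_block<<<1, N_TOK>>>(x, Wq, Wk, Wv, W1, W3, W2, out)` covers one sequence tile. Biases, multi-head splitting, the attention output projection, and normalization are omitted to keep the data flow visible, matching the equations above.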
Performance Target
- Memory Bandwidth Reduction: target of 30% lower VRAM traffic than the unfused discrete-kernel baseline (a rough accounting follows this list).
- Hardware Target: Optimized for AMD CDNA3 (MI300X) Matrix Cores.
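As a rough illustration of where the savings come from (assumptions mine: the parallel-block formulation above, activations of size $n \times d$ in a given precision, weight and attention-score traffic excluded), count the activation tensors crossing HBM per layer. Unfused, the attention kernel reads $x$ and writes $Y_{\text{attn}}$, the FFN kernel reads $x$ and writes $Y_{\text{ffn}}$, and the residual-add kernel reads all three and writes $x_{\text{out}}$; fused, the block reads $x$ and writes $x_{\text{out}}$ exactly once:
$$\text{Traffic}_{\text{unfused}} = \underbrace{2nd}_{\text{attn}} + \underbrace{2nd}_{\text{FFN}} + \underbrace{4nd}_{\text{residual}} = 8nd \qquad\qquad \text{Traffic}_{\text{fused}} = 2nd$$
The realized end-to-end figure is smaller than this activation-level ratio because weight and KV traffic are unchanged; the 30% target above refers to total VRAM traffic.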