
Technical Specification: SwiGLU-Attention (SWA) Fusion Kernel

Document ID: CMS-SWA-2025-001
Status: Proprietary / High Performance

Abstract

In standard transformers, the Multi-Head Attention (MHA) and Feed-Forward Network (FFN) are executed as discrete kernels, forcing multiple round-trips to Global Memory (VRAM). SWA Fusion merges these into a single computational pass.

Computational Logic

The kernel computes the $Q, K, V$ projections concurrently with the $W_1$ and $W_3$ projections of the SwiGLU layer, pipelining both branches over the same input.

  1. Shared Input Latches: Input $x$ is loaded into Shared Memory (SRAM) once.
  2. Parallel Projections: $$Y_{attn} = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ $$Y_{ffn} = \left(\text{SiLU}(xW_1) \otimes xW_3\right)W_2$$
  3. Fused Accumulation: $x_{out} = x + Y_{attn} + Y_{ffn}$ is computed in a single thread block before writing back to HBM.
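The three steps above can be sketched as a NumPy reference of the math the kernel fuses. This is a single-head functional model, not the kernel itself: the shapes, the single-head simplification, and the function name `swa_fused_reference` are illustrative assumptions, and the SRAM/HBM staging described in steps 1 and 3 appears here only as comments.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def silu(z):
    return z / (1.0 + np.exp(-z))  # SiLU(z) = z * sigmoid(z)

def swa_fused_reference(x, Wq, Wk, Wv, W1, W2, W3):
    """Single-head reference for the fused pass: x is read once,
    both branches are computed from it, and the residual sum is
    formed before any (conceptual) write-back to HBM."""
    # Step 1/2: parallel projections off the same shared input x
    Q, K, V = (x @ Wq, x @ Wk, x @ Wv)
    d_k = Q.shape[-1]
    Y_attn = softmax(Q @ K.T / np.sqrt(d_k)) @ V
    Y_ffn = (silu(x @ W1) * (x @ W3)) @ W2  # SwiGLU: elementwise gate, then W2
    # Step 3: fused accumulation of residual and both branch outputs
    return x + Y_attn + Y_ffn

# Example shapes (assumed): seq_len=4, d_model=8, d_ff=16
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
W1, W3 = (rng.standard_normal((8, 16)) for _ in range(2))
W2 = rng.standard_normal((16, 8))
out = swa_fused_reference(x, Wq, Wk, Wv, W1, W2, W3)
```

A real implementation would tile these matmuls over shared memory and keep the accumulator in registers; this reference only fixes the expected numerics.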

Performance Target

  • Memory Bandwidth Reduction: targets 30% lower VRAM traffic relative to the unfused (separate MHA + FFN kernels) baseline.
  • Hardware Target: Optimized for AMD CDNA3 (MI300X) Matrix Cores.
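As a hedged back-of-envelope, the savings come from eliminating activation round-trips between the attention and FFN kernels. The round-trip counts and tensor sizes below are illustrative assumptions, not measurements; the activation-only reduction is larger than the 30% end-to-end target because weight traffic is unchanged by the fusion.

```python
# Back-of-envelope activation traffic for one transformer block.
# Hypothetical config (assumed): seq_len=8192, d_model=8192, fp16.
seq_len, d_model, bytes_per_elem = 8192, 8192, 2
act = seq_len * d_model * bytes_per_elem  # one activation tensor, in bytes

# Unfused (assumed round-trips): attention reads x and writes its output;
# the FFN re-reads that output and writes its own; the residual path
# re-reads and re-writes x. Six activation-sized transfers in total.
unfused_traffic = 6 * act

# Fused: x is read once into SRAM, both branches and the residual
# accumulate on-chip, and only x_out is written back to HBM.
fused_traffic = 2 * act

reduction = 1 - fused_traffic / unfused_traffic
print(f"activation traffic reduction: {reduction:.0%}")
```

Under these assumptions the activation traffic drops by two thirds; folding in the (untouched) weight traffic is what brings the end-to-end figure down toward the stated 30% target.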