kgrabko committed · verified
Commit c2cdec3 · Parent(s): ef962ee

Upload 2 files

Files changed (2):
  1. BRE_memory_routing.md +11 -0
  2. SWA_fusion_spec.md +19 -0
BRE_memory_routing.md ADDED
@@ -0,0 +1,11 @@
# Buffered Routing Embedding (BRE) Algorithm
**Inventor:** Konstantin Vladimirovich Grabko

### Problem Statement
Ultra-scale models (140B+ parameters) suffer from a "memory wall" bottleneck: the GPU's compute units stall while embedding weights are fetched from HBM.

### The BRE Solution
BRE implements a predictive pre-fetching ring buffer:
1. **Token Prediction Window:** A lightweight heuristic monitors the last $N$ tokens to predict high-probability future embeddings.
2. **HBM Routing:** Predicted weights are moved from standard HBM into a specialized "High-Speed Buffer" partition (L3 cache / shared memory) *before* the attention computation begins.
3. **Synchronous Paging:** BRE uses peer-to-peer (P2P) DMA transfers across ROCm / Infinity Fabric so that the weights for the next 4 layers are already local to the current GPU.
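The steps above can be sketched in host-side Python. This is a minimal illustration of the ring-buffer bookkeeping only, not the spec's DMA machinery: the class name, the frequency-count heuristic, and the dict-backed "HBM" and buffer are all assumptions for the sketch.

```python
from collections import Counter, deque

class BREPrefetcher:
    """Illustrative sketch of a predictive pre-fetching ring buffer.

    `hbm` stands in for slow embedding storage; `buffer` plays the role of
    the fast "High-Speed Buffer" partition, bounded like a ring buffer.
    Names and heuristic are hypothetical, not the spec's implementation.
    """

    def __init__(self, hbm, window=8, buffer_slots=4):
        self.hbm = hbm                       # token_id -> embedding weights
        self.window = deque(maxlen=window)   # last N tokens (prediction window)
        self.buffer = {}                     # fast partition contents
        self.buffer_slots = buffer_slots
        self.order = deque()                 # eviction order for the ring

    def observe(self, token_id):
        """Record a token, then pre-fetch likely future embeddings."""
        self.window.append(token_id)
        # Lightweight heuristic: treat the most frequent tokens in the
        # window as high-probability future embeddings.
        for tok, _ in Counter(self.window).most_common(self.buffer_slots):
            self._prefetch(tok)

    def _prefetch(self, token_id):
        if token_id in self.buffer:
            return
        if len(self.order) == self.buffer_slots:   # ring full: evict oldest
            self.buffer.pop(self.order.popleft())
        self.buffer[token_id] = self.hbm[token_id]
        self.order.append(token_id)

    def fetch(self, token_id):
        """Serve from the fast buffer on a hit; fall back to HBM on a miss."""
        return self.buffer.get(token_id, self.hbm[token_id])
```

In a real kernel the `_prefetch` step would be an asynchronous P2P DMA transfer rather than a dict copy; the sketch only shows the routing decision.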
SWA_fusion_spec.md ADDED
@@ -0,0 +1,19 @@
# Technical Specification: SwiGLU-Attention (SWA) Fusion Kernel
**Document ID:** CMS-SWA-2025-001
**Status:** Proprietary / High Performance

### Abstract
In standard transformers, the Multi-Head Attention (MHA) and Feed-Forward Network (FFN) blocks are executed as discrete kernels, forcing multiple round-trips to global memory (VRAM). SWA Fusion merges the two into a single computational pass.

### Computational Logic
The kernel pipelines the $Q, K, V$ projections simultaneously with the $W_1$ and $W_3$ projections of the SwiGLU layer:
1. **Shared Input Latches:** The input $x$ is loaded into shared memory (SRAM) once.
2. **Parallel Projections:**
$$Y_{attn} = \text{Softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
$$Y_{ffn} = (\text{SiLU}(xW_1) \otimes xW_3)\,W_2$$
3. **Fused Accumulation:** $x_{out} = x + Y_{attn} + Y_{ffn}$ is computed in a single thread block before writing back to HBM.
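The computational logic above can be checked with an unfused NumPy reference. This sketch only verifies the math of the two branches and the fused accumulation; the weight names are illustrative, and the actual kernel would compute both branches from a single SRAM-resident copy of $x$ rather than separate passes.

```python
import numpy as np

def swa_fusion_reference(x, Wq, Wk, Wv, W1, W2, W3):
    """Unfused reference for the SWA output (hypothetical helper)."""
    d_k = Wq.shape[1]
    Q, K, V = x @ Wq, x @ Wk, x @ Wv

    # Attention branch: Softmax(Q K^T / sqrt(d_k)) V
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    attn = weights / weights.sum(axis=-1, keepdims=True)
    y_attn = attn @ V

    # SwiGLU branch: (SiLU(x W1) * x W3) W2, with SiLU(z) = z * sigmoid(z)
    a = x @ W1
    silu = a / (1.0 + np.exp(-a))
    y_ffn = (silu * (x @ W3)) @ W2

    # Fused accumulation (parallel residual): x + Y_attn + Y_ffn
    return x + y_attn + y_ffn
```

Note the design choice the spec implies: unlike the sequential residual of a standard transformer block, both branches read the same input $x$, which is what allows their projections to be pipelined in one pass.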

### Performance Target
- **Memory Bandwidth Reduction:** 30% lower VRAM traffic.
- **Hardware Target:** Optimized for AMD CDNA3 (MI300X) Matrix Cores.