BRE_memory_routing.md
# Buffered Routing Embedding (BRE) Algorithm
**Inventor:** Konstantin Vladimirovich Grabko
### Problem Statement
Ultra-scale models (140B+) suffer from a "Memory Wall" bottleneck: the GPU sits idle while embedding weights are fetched from HBM.
### The BRE Solution
BRE implements a predictive pre-fetching ring buffer; a host-side sketch follows the steps below.
1. **Token Prediction Window:** A lightweight heuristic monitors the last $N$ tokens to predict high-probability future embeddings.
2. **HBM Routing:** Predicted weights are moved from standard HBM to a specialized "High-Speed Buffer" partition (L3 Cache/Shared Memory) *before* the attention computation begins.
3. **Synchronous Paging:** BRE uses Peer-to-Peer (P2P) DMA transfers over Infinity Fabric (issued through ROCm) to ensure that the weights for the next 4 layers are already local to the current GPU.
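
For concreteness, here is a minimal host-side sketch of the ring-buffer mechanics in Python. Everything in it is an illustrative assumption rather than BRE's actual API: `EmbeddingRingBuffer`, `predict_next_tokens`, the window size `N`, and the frequency heuristic are stand-ins, and a real implementation would replace the body of `prefetch` with an asynchronous HBM-to-buffer copy (e.g., a P2P DMA issued through ROCm).

```python
# Host-side sketch of BRE's predictive pre-fetch ring buffer.
# All names here are hypothetical stand-ins, not BRE's real interface.
from collections import Counter, deque


class EmbeddingRingBuffer:
    """Simulates the "High-Speed Buffer" partition holding prefetched embedding rows."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.slots = deque()   # token ids currently resident, in FIFO order
        self.resident = set()  # fast membership test for hit/miss checks

    def prefetch(self, token_id: int) -> None:
        if token_id in self.resident:
            return  # already staged; no transfer needed
        if len(self.slots) == self.capacity:
            self.resident.discard(self.slots.popleft())  # FIFO eviction
        self.slots.append(token_id)
        self.resident.add(token_id)  # stand-in for an async HBM -> buffer copy


def predict_next_tokens(history: deque, top_k: int = 8) -> list:
    """Lightweight heuristic (step 1): rank candidates by frequency in the last N tokens."""
    return [tok for tok, _ in Counter(history).most_common(top_k)]


N = 16  # prediction window size from step 1
window = deque(maxlen=N)
buffer = EmbeddingRingBuffer(capacity=32)

for token in [5, 3, 5, 9, 5, 3]:  # toy token stream
    window.append(token)
    for candidate in predict_next_tokens(window):
        buffer.prefetch(candidate)  # stage high-probability embeddings early (step 2)
```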
SWA_fusion_spec.md
# Technical Specification: SwiGLU-Attention (SWA) Fusion Kernel
**Document ID:** CMS-SWA-2025-001
**Status:** Proprietary / High Performance
### Abstract
In standard transformers, the Multi-Head Attention (MHA) and Feed-Forward Network (FFN) blocks are executed as discrete kernels, forcing multiple round-trips to Global Memory (VRAM). SWA Fusion merges them into a single computational pass.
### Computational Logic
The kernel pipelines the $Q, K, V$ projections alongside the $W_1$ and $W_3$ projections of the SwiGLU layer; a reference sketch of the fused math follows the steps below.
1. **Shared Input Latches:** Input $x$ is loaded into Shared Memory (SRAM) once.
2. **Parallel Projections:**
   $$Y_{attn} = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
   $$Y_{ffn} = \left(\text{SiLU}(xW_1) \odot xW_3\right)W_2$$
3. **Fused Accumulation:** $x_{out} = x + Y_{attn} + Y_{ffn}$ is computed in a single thread block before writing back to HBM.
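
The NumPy sketch below reproduces steps 1-3 numerically for a single head with toy dimensions. It is a reference for the math only, not the fused kernel itself; the `swa_block` function, the weight names, and the shapes are illustrative assumptions.

```python
# NumPy reference for the SWA math (single head, no batching).
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def silu(z):
    return z / (1.0 + np.exp(-z))  # SiLU(z) = z * sigmoid(z)


def swa_block(x, Wq, Wk, Wv, W1, W2, W3):
    # Step 1: both branches read the same input x (loaded into SRAM once in the kernel).
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = Q.shape[-1]
    # Step 2: parallel projections.
    Y_attn = softmax(Q @ K.T / np.sqrt(d_k)) @ V
    Y_ffn = (silu(x @ W1) * (x @ W3)) @ W2  # elementwise gate, then down-projection
    # Step 3: fused accumulation.
    return x + Y_attn + Y_ffn


rng = np.random.default_rng(0)
T, d, d_ff = 4, 8, 16  # toy sequence length, model width, FFN width
x = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
W1, W3 = rng.standard_normal((d, d_ff)), rng.standard_normal((d, d_ff))
W2 = rng.standard_normal((d_ff, d))
print(swa_block(x, Wq, Wk, Wv, W1, W2, W3).shape)  # -> (4, 8)
```

Note that both $Y_{attn}$ and $Y_{ffn}$ are computed from the same input $x$ (a parallel residual block), which is exactly what allows the kernel to share a single SRAM load of $x$ across both branches.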
### Performance Target
- **Memory Bandwidth Reduction:** 30% lower VRAM traffic.
- **Hardware Target:** Optimized for AMD CDNA3 (MI300X) Matrix Cores.
|