Upload BRE_memory_routing.md
# Buffered Routing Embedding (BRE) Algorithm
**Inventor:** Konstantin Vladimirovich Grabko
### Problem Statement
Ultra-scale models (140B+ parameters) hit the "Memory Wall": the GPU's compute units sit idle while embedding weights are fetched from HBM.
### The BRE Solution
BRE implements a predictive pre-fetching ring buffer.
1. **Token Prediction Window:** A lightweight heuristic monitors the last $N$ tokens to predict high-probability future embeddings.
2. **HBM Routing:** Predicted weights are moved from standard HBM to a dedicated "High-Speed Buffer" partition (L3 cache or shared memory) *before* the attention computation begins.
3. **Synchronous Paging:** BRE uses Peer-to-Peer (P2P) DMA transfers over AMD Infinity Fabric, driven through ROCm, so that the weights for the next 4 layers are already local to the current GPU.
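
The first two steps can be illustrated with a small CPU-only simulation. This is a minimal sketch under stated assumptions, not the BRE implementation: the note does not specify the prediction heuristic, so a simple frequency count over the last $N$ tokens stands in for it, and an in-memory dictionary stands in for the HBM-resident embedding table. The names `TokenWindowPredictor`, `PrefetchRingBuffer`, and `run` are hypothetical.

```python
from collections import Counter, OrderedDict, deque


class TokenWindowPredictor:
    """Assumed heuristic: rank token IDs by frequency within the last N tokens."""

    def __init__(self, window_size):
        # deque with maxlen drops the oldest token automatically
        self.window = deque(maxlen=window_size)

    def observe(self, token_id):
        self.window.append(token_id)

    def predict(self, k):
        # Return up to k token IDs, most frequent in the window first
        return [tok for tok, _ in Counter(self.window).most_common(k)]


class PrefetchRingBuffer:
    """Fixed-capacity buffer standing in for the "High-Speed Buffer" partition."""

    def __init__(self, capacity, slow_store):
        self.capacity = capacity
        self.slow_store = slow_store   # stand-in for the embedding table in HBM
        self.slots = OrderedDict()     # token_id -> embedding, oldest entry first
        self.hits = 0
        self.misses = 0

    def prefetch(self, token_ids):
        for tok in token_ids:
            if tok in self.slots:
                continue               # already buffered
            if len(self.slots) >= self.capacity:
                # Ring behaviour: the oldest slot is overwritten first
                self.slots.popitem(last=False)
            self.slots[tok] = self.slow_store[tok]

    def lookup(self, token_id):
        if token_id in self.slots:
            self.hits += 1
            return self.slots[token_id]
        # Miss: fall back to the slow path (a real system would stall here)
        self.misses += 1
        return self.slow_store[token_id]


def run(tokens, window_size=4, k=2, capacity=2):
    """Replay a token stream and report (hits, misses) for the buffer."""
    table = {tok: [float(tok)] * 2 for tok in set(tokens)}  # toy embeddings
    predictor = TokenWindowPredictor(window_size)
    buffer = PrefetchRingBuffer(capacity, table)
    for tok in tokens:
        buffer.lookup(tok)                     # embedding needed for this step
        predictor.observe(tok)
        buffer.prefetch(predictor.predict(k))  # stage likely future embeddings
    return buffer.hits, buffer.misses
```

With the repeating stream `[1, 2, 1, 2, 1, 2]` and the default parameters, `run` returns `(4, 2)`: the first occurrence of each token misses, and every later lookup is served from the buffer. The sketch only models hit and miss accounting; it says nothing about the actual latency gains, which depend on the hardware and on how well the real heuristic predicts upcoming tokens.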