Buffered Routing Embedding (BRE) Algorithm
Inventor: Konstantin Vladimirovich Grabko
Problem Statement
Ultra-scale models (140B+ parameters) run into a "memory wall": GPU compute units stall while embedding weights are fetched from HBM.
The BRE Solution
BRE implements a predictive pre-fetching ring buffer.
- Token Prediction Window: A lightweight heuristic monitors the last $N$ tokens to predict which embeddings are most likely to be needed next.
- HBM Routing: Predicted weights are moved from standard HBM to a specialized "High-Speed Buffer" partition (L3 Cache/Shared Memory) before the attention computation begins.
- Synchronous Paging: BRE uses Peer-to-Peer (P2P) DMA transfers over Infinity Fabric (driven through ROCm) so that weights for the next 4 layers are already local to the current GPU.
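
The interaction between the prediction window and the ring buffer can be illustrated with a small CPU-only sketch. It is not the BRE implementation. The names `PrefetchRingBuffer`, `predict_next`, and `run` are hypothetical, the frequency-count heuristic is an assumed stand-in for the unspecified "lightweight heuristic", and a plain Python dict stands in for weights held in HBM. No real DMA or GPU memory is involved.

```python
from collections import Counter, deque


class PrefetchRingBuffer:
    """Fixed-capacity buffer; the oldest slot is overwritten when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity  # token id stored in each slot
        self.data = {}                  # token id -> embedding row
        self.head = 0                   # next slot to overwrite

    def put(self, token_id, row):
        if token_id in self.data:
            return                      # already resident, nothing to do
        evicted = self.slots[self.head]
        if evicted is not None:
            del self.data[evicted]      # drop the oldest entry
        self.slots[self.head] = token_id
        self.data[token_id] = row
        self.head = (self.head + 1) % self.capacity

    def get(self, token_id):
        return self.data.get(token_id)  # None on a miss


def predict_next(window, top_k):
    """Assumed heuristic: the most frequent tokens in the recent window."""
    return [tok for tok, _ in Counter(window).most_common(top_k)]


def run(tokens, table, window_size=4, top_k=2, capacity=4):
    """Replay a token stream and count buffer hits and misses."""
    window = deque(maxlen=window_size)  # the last N tokens
    buf = PrefetchRingBuffer(capacity)
    hits = misses = 0
    for tok in tokens:
        row = buf.get(tok)
        if row is None:
            misses += 1
            row = table[tok]            # slow path: fetch from the full table
        else:
            hits += 1
        window.append(tok)
        # Prefetch predicted rows before the next token arrives.
        for cand in predict_next(window, top_k):
            buf.put(cand, table[cand])
    return hits, misses
```

With a toy table such as `{i: [float(i)] for i in range(10)}`, calling `run([1, 1, 1, 1], table)` misses only on the first token, since every later lookup finds the row already prefetched. In a real system the `put` call would correspond to an asynchronous copy into the fast partition rather than a dict assignment.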