
Buffered Routing Embedding (BRE) Algorithm

Inventor: Konstantin Vladimirovich Grabko

Problem Statement

Ultra-scale models (140B+ parameters) hit a "memory wall": the GPU stalls while embedding weights are fetched from high-bandwidth memory (HBM).

The BRE Solution

BRE implements a predictive pre-fetching ring buffer.

  1. Token Prediction Window: A lightweight heuristic monitors the last $N$ tokens to predict which embeddings are most likely to be needed next.
  2. HBM Routing: Predicted weights are moved from standard HBM to a specialized "High-Speed Buffer" partition (L3 Cache/Shared Memory) before the attention computation begins.
  3. Synchronous Paging: BRE uses Peer-to-Peer (P2P) DMA transfers over AMD Infinity Fabric (through the ROCm stack) so that weights for the next 4 layers are already local to the current GPU.
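
The first two steps above can be sketched as a small host-side simulation. This is only an illustration under stated assumptions, not the BRE implementation. The source does not specify the prediction heuristic, so a simple frequency count over the last $N$ tokens is assumed here. The class and method names (`PrefetchRingBuffer`, `predict`, `prefetch`, `lookup`) are hypothetical. A plain Python dict stands in for the embedding table in HBM, a fixed-size list stands in for the "High-Speed Buffer", and no GPU memory or DMA transfer is involved.

```python
from collections import Counter, deque


class PrefetchRingBuffer:
    """Toy model of a predictive prefetch ring buffer for embedding rows."""

    def __init__(self, table, capacity, window):
        self.table = table                  # stand-in for embeddings in HBM
        self.slots = [None] * capacity      # token id held by each ring slot
        self.rows = [None] * capacity       # embedding row held by each slot
        self.index = {}                     # token id -> slot number
        self.head = 0                       # next slot to overwrite
        self.recent = deque(maxlen=window)  # the last N token ids seen
        self.hits = 0
        self.misses = 0

    def predict(self, k):
        """Assumed heuristic: the k most frequent ids in the recent window."""
        return [tok for tok, _ in Counter(self.recent).most_common(k)]

    def _store(self, tok):
        """Write one row at the head slot, evicting whatever was there."""
        if tok in self.index:
            return  # already buffered, nothing to move
        evicted = self.slots[self.head]
        if evicted is not None:
            del self.index[evicted]
        self.slots[self.head] = tok
        self.rows[self.head] = self.table[tok]
        self.index[tok] = self.head
        self.head = (self.head + 1) % len(self.slots)  # wrap around

    def prefetch(self, k):
        """Copy the predicted rows into the ring before they are requested."""
        for tok in self.predict(k):
            self._store(tok)

    def lookup(self, tok):
        """Return the row for tok, counting buffer hits and misses."""
        if tok in self.index:
            self.hits += 1
            row = self.rows[self.index[tok]]
        else:
            self.misses += 1
            row = self.table[tok]  # slow path: fetch from the backing table
        self.recent.append(tok)
        return row


if __name__ == "__main__":
    table = {i: [float(i)] * 4 for i in range(10)}
    buf = PrefetchRingBuffer(table, capacity=2, window=4)
    for tok in [3, 3, 5, 3]:
        buf.lookup(tok)   # cold buffer: every lookup falls through to the table
    buf.prefetch(2)       # predicted rows are now resident in the ring
    buf.lookup(3)         # served from the ring
    print(buf.hits, buf.misses)
```

In a real system the `_store` step would correspond to an asynchronous device-side copy issued ahead of the attention computation, and the hit rate would depend entirely on how well the chosen heuristic matches the token stream.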