kgrabko committed · verified
Commit c2cdec3 · Parent(s): ef962ee

Upload 2 files

Files changed (2):
  1. BRE_memory_routing.md +11 -0
  2. SWA_fusion_spec.md +19 -0
BRE_memory_routing.md ADDED
@@ -0,0 +1,11 @@
# Buffered Routing Embedding (BRE) Algorithm
**Inventor:** Konstantin Vladimirovich Grabko

### Problem Statement
Ultra-scale models (140B+ parameters) suffer from a "memory wall" bottleneck: the GPU's compute units stall while embedding weights are fetched from HBM.

### The BRE Solution
BRE implements a predictive pre-fetching ring buffer:
1. **Token Prediction Window:** A lightweight heuristic monitors the last $N$ tokens to predict high-probability future embeddings.
2. **HBM Routing:** Predicted weights are moved from standard HBM into a specialized "High-Speed Buffer" partition (L3 cache / shared memory) *before* the attention computation begins.
3. **Synchronous Paging:** BRE uses peer-to-peer (P2P) DMA transfers across ROCm / Infinity Fabric so that the weights for the next 4 layers are already local to the current GPU.
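The steps above can be sketched in host-side Python. This is a minimal illustration of the ring-buffer bookkeeping only, not the spec's DMA machinery: the class name, the frequency-count heuristic, and the dict-backed "HBM" and buffer are all assumptions for the sketch.

```python
from collections import Counter, deque

class BREPrefetcher:
    """Illustrative sketch of a predictive pre-fetching ring buffer.

    `hbm` stands in for slow embedding storage; `buffer` plays the role of
    the fast "High-Speed Buffer" partition, bounded like a ring buffer.
    Names and heuristic are hypothetical, not the spec's implementation.
    """

    def __init__(self, hbm, window=8, buffer_slots=4):
        self.hbm = hbm                       # token_id -> embedding weights
        self.window = deque(maxlen=window)   # last N tokens (prediction window)
        self.buffer = {}                     # fast partition contents
        self.buffer_slots = buffer_slots
        self.order = deque()                 # eviction order for the ring

    def observe(self, token_id):
        """Record a token, then pre-fetch likely future embeddings."""
        self.window.append(token_id)
        # Lightweight heuristic: treat the most frequent tokens in the
        # window as high-probability future embeddings.
        for tok, _ in Counter(self.window).most_common(self.buffer_slots):
            self._prefetch(tok)

    def _prefetch(self, token_id):
        if token_id in self.buffer:
            return
        if len(self.order) == self.buffer_slots:   # ring full: evict oldest
            self.buffer.pop(self.order.popleft())
        self.buffer[token_id] = self.hbm[token_id]
        self.order.append(token_id)

    def fetch(self, token_id):
        """Serve from the fast buffer on a hit; fall back to HBM on a miss."""
        return self.buffer.get(token_id, self.hbm[token_id])
```

In a real kernel the `_prefetch` step would be an asynchronous P2P DMA transfer rather than a dict copy; the sketch only shows the routing decision.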
SWA_fusion_spec.md ADDED
@@ -0,0 +1,19 @@
# Technical Specification: SwiGLU-Attention (SWA) Fusion Kernel
**Document ID:** CMS-SWA-2025-001
**Status:** Proprietary / High Performance

### Abstract
In standard transformers, the Multi-Head Attention (MHA) and Feed-Forward Network (FFN) blocks are executed as discrete kernels, forcing multiple round-trips to global memory (VRAM). SWA Fusion merges the two into a single computational pass.

### Computational Logic
The kernel pipelines the $Q, K, V$ projections simultaneously with the $W_1$ and $W_3$ projections of the SwiGLU layer:
1. **Shared Input Latches:** The input $x$ is loaded into shared memory (SRAM) once.
2. **Parallel Projections:**
$$Y_{attn} = \text{Softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
$$Y_{ffn} = (\text{SiLU}(xW_1) \otimes xW_3)\,W_2$$
3. **Fused Accumulation:** $x_{out} = x + Y_{attn} + Y_{ffn}$ is computed in a single thread block before writing back to HBM.
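The computational logic above can be checked with an unfused NumPy reference. This sketch only verifies the math of the two branches and the fused accumulation; the weight names are illustrative, and the actual kernel would compute both branches from a single SRAM-resident copy of $x$ rather than separate passes.

```python
import numpy as np

def swa_fusion_reference(x, Wq, Wk, Wv, W1, W2, W3):
    """Unfused reference for the SWA output (hypothetical helper)."""
    d_k = Wq.shape[1]
    Q, K, V = x @ Wq, x @ Wk, x @ Wv

    # Attention branch: Softmax(Q K^T / sqrt(d_k)) V
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    attn = weights / weights.sum(axis=-1, keepdims=True)
    y_attn = attn @ V

    # SwiGLU branch: (SiLU(x W1) * x W3) W2, with SiLU(z) = z * sigmoid(z)
    a = x @ W1
    silu = a / (1.0 + np.exp(-a))
    y_ffn = (silu * (x @ W3)) @ W2

    # Fused accumulation (parallel residual): x + Y_attn + Y_ffn
    return x + y_attn + y_ffn
```

Note the design choice the spec implies: unlike the sequential residual of a standard transformer block, both branches read the same input $x$, which is what allows their projections to be pipelined in one pass.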

### Performance Target
- **Memory Bandwidth Reduction:** 30% lower VRAM traffic.
- **Hardware Target:** Optimized for AMD CDNA3 (MI300X) Matrix Cores.