--- library_name: fireecho tags: - inference - triton - quantization - moe - fp4 - fp8 - int2 - single-gpu - blackwell - hebbian - speculative-decoding - custom-kernel license: cc-by-nc-4.0 pipeline_tag: text-generation datasets: - Qwen/Qwen3-Omni-30B-A3B-Instruct model-index: - name: FireEcho Engine results: - task: type: text-generation name: Inference Throughput dataset: name: FireEcho Full-Stack Benchmark (8 diverse prompts, 200 tok each) type: custom metrics: - name: Decode Speed (Baseline FP4) type: tokens_per_second value: 43.3 verified: false - name: Decode Speed (Full-Stack + CUDA Graph) type: tokens_per_second value: 57.1 verified: false - name: Speedup vs Naive PyTorch type: speedup value: 124 verified: false - name: VRAM Usage (Model) type: gpu_memory_gb value: 20.0 verified: false - name: VRAM Usage (Peak) type: gpu_memory_gb value: 21.5 verified: false - name: Model Load Time (seconds) type: latency value: 110 verified: false - name: Compression Ratio (BF16 to FP4) type: compression value: 4.0 verified: false - task: type: text-generation name: Optimization Stack Ablation dataset: name: FireEcho Ablation (RTX 5090, 200 tok/prompt, greedy) type: custom metrics: - name: "L0: Baseline (FP4 + Packed MoE + Flat KV)" type: tokens_per_second value: 43.3 - name: "L1: + FP8 KV Cache" type: tokens_per_second value: 41.8 - name: "L2: + L2 Layer Prefetch" type: tokens_per_second value: 41.4 - name: "L3: + Atlas Ban & Pick + MoDES" type: tokens_per_second value: 39.7 - name: "L4: + FE-XC Cold Experts (518 demoted)" type: tokens_per_second value: 37.4 - name: "L5: + INT2 Coldest Experts (399 demoted)" type: tokens_per_second value: 37.4 - name: "L6: + CUDA Graph Decode" type: tokens_per_second value: 57.1 - task: type: text-generation name: Optimization History (0.4 to 49.4 tok/s) dataset: name: FireEcho Kernel Optimization Log (RTX 5090) type: custom metrics: - name: "Step 0: Naive Python Loop (128 experts)" type: tokens_per_second value: 0.4 - name: "Step 1: 
Grouped Dispatch + TF32" type: tokens_per_second value: 7.7 - name: "Step 2: Fused gate_up_proj" type: tokens_per_second value: 9.5 - name: "Step 3: Single-Token Decode Fast Path" type: tokens_per_second value: 12.6 - name: "Step 4: Multi-Expert Goliath Kernel" type: tokens_per_second value: 18.8 - name: "Step 5: Packed MoE (Contiguous Buffer)" type: tokens_per_second value: 30.8 - name: "Step 6: Flat KV Cache (Zero torch.cat)" type: tokens_per_second value: 40.9 - name: "Step 7: CUDA Graph + FlashDecode" type: tokens_per_second value: 49.4 --- # FireEcho Engine **High-performance single-GPU inference kernel for 30B+ MoE models** Created by [Luis E. Davila Flores](https://x.com/Joysulem) ## What is FireEcho? FireEcho is a from-scratch inference engine that runs **Qwen3-Omni-30B** (30.5 billion parameters, 128-expert MoE) on a **single RTX 5090** at **45+ tokens/second** using only **20 GB VRAM**. It achieves this through custom Triton kernels that fuse dequantization inside the matmul loop — no separate dequantization step, no global memory writes, no NVIDIA proprietary libraries. 
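The 4x compression figure follows from storing two 4-bit weights per byte. A minimal NumPy sketch of that packing scheme, assuming the standard E2M1 FP4 code points and a per-block scale (the names `pack_fp4`/`unpack_fp4` are illustrative; the real engine performs this inside Triton registers, not in NumPy):

```python
import numpy as np

# The 16 E2M1 (FP4) code points: 3-bit magnitude {0, 0.5, 1, 1.5, 2, 3, 4, 6} plus a sign bit.
FP4_LUT = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                    -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
                   dtype=np.float32)

def pack_fp4(codes: np.ndarray) -> np.ndarray:
    """Pack pairs of 4-bit codes (0-15) into single bytes: low nibble first."""
    codes = codes.astype(np.uint8)
    return (codes[0::2] & 0xF) | ((codes[1::2] & 0xF) << 4)

def unpack_fp4(packed: np.ndarray, scale: float) -> np.ndarray:
    """Dequantize via table lookup, then apply the per-block scale."""
    lo = FP4_LUT[packed & 0xF]          # low nibble -> FP4 value
    hi = FP4_LUT[(packed >> 4) & 0xF]   # high nibble -> FP4 value
    out = np.empty(packed.size * 2, dtype=np.float32)
    out[0::2], out[1::2] = lo, hi
    return out * scale

codes = np.array([1, 2, 3, 10], dtype=np.uint8)   # codes for 0.5, 1.0, 1.5, -1.0
packed = pack_fp4(codes)                           # 2 bytes instead of 4
print(unpack_fp4(packed, scale=2.0))               # [ 1.  2.  3. -2.]
```

Two weights per byte plus a small per-block scale is what turns the 61 GB BF16 checkpoint into a ~20 GB FP4 one.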
## Key Results

| Metric | Value |
|--------|-------|
| Model | Qwen3-Omni-30B-A3B-Instruct |
| Parameters | 30.5B total, ~3.3B active/token |
| GPU | NVIDIA RTX 5090 (32 GB, Blackwell) |
| VRAM Usage | 20.0 GB (model) + 3.1 GB (KV cache) |
| Decode Speed | **45+ tok/s** (single user, greedy) |
| Compression | 4x (BF16 61 GB -> FP4 20 GB) |
| Load Time | 110 seconds (streaming, 3.1 GB CPU RAM) |
| Speedup | **124x** over naive PyTorch baseline |

## Benchmark Results (RTX 5090, 200 tokens/prompt, 8 diverse prompts)

| Configuration | tok/s | vs Base | Notes |
|--------------|-------|---------|-------|
| L0: Baseline (FP4 + Packed MoE + Flat KV) | 43.3 | 1.00x | Core engine, all FP4 experts |
| L1: + FP8 KV cache | 41.8 | 0.97x | 50% KV VRAM savings |
| L2: + L2 layer prefetch | 41.4 | 0.96x | Pins next layer in L2 cache |
| L3: + Atlas gatekeeper | 39.7 | 0.92x | Expert banning + MoDES skip |
| L4: + FE-XC cold experts (518 demoted) | 37.4 | 0.86x | Codebook 2-bit cold experts |
| L5: + INT2 coldest experts (399 demoted) | 37.4 | 0.86x | Scalar 2-bit coldest experts |
| **L6: + CUDA Graph decode** | **57.1** | **1.32x** | **Graph-captured 48-layer forward** |

**Peak VRAM**: 21.5 GB | **Baseline forward**: ~23.1 ms/tok | **Full-stack forward**: ~17.5 ms/tok

**Note**: L1-L5 show slight overhead versus L0 because each adds dispatch logic on the Python side. CUDA Graph (L6) eliminates that overhead entirely by capturing the full 48-layer forward as a single graph replay. The compression layers (FE-XC/INT2) cut memory-bandwidth demand, which compounds with speculative decoding: with a trained EAGLE-3 head at 70% acceptance, projected throughput is **~457 tok/s**.
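The L6 jump comes from replaying one pre-captured graph per token instead of re-launching every kernel from Python. A minimal sketch of the capture/replay pattern with PyTorch's `torch.cuda.CUDAGraph` (the helper name and the single static-buffer interface are illustrative, not FireEcho's actual API):

```python
import torch

@torch.no_grad()
def build_graphed_decode(model, hidden_size: int, device="cuda"):
    """Capture one decode step into a CUDA graph, then replay it.

    `model` is any module mapping [1, hidden] -> [1, hidden]. Static
    input/output buffers are required: graph replay reuses the exact
    memory addresses recorded at capture time.
    """
    static_in = torch.zeros(1, hidden_size, device=device)

    # Warm up on a side stream so lazy initialization is not captured.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_in)
    torch.cuda.current_stream().wait_stream(s)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = model(static_in)   # recorded once, replayed every token

    def decode_step(x: torch.Tensor) -> torch.Tensor:
        static_in.copy_(x)   # write the new token state into the captured buffer
        graph.replay()       # whole forward as one launch-free replay
        return static_out.clone()

    return decode_step
```

The same pattern scales from this toy module to a 48-layer forward, provided every tensor the graph touches lives in a fixed buffer (which is exactly what the flat KV cache provides).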
## Speed Optimization History

Starting from a naive Python loop over 128 MoE experts (0.4 tok/s), each optimization layer compounds:

| Step | Optimization | tok/s | Cumulative Speedup |
|------|-------------|-------|--------------------|
| 0 | Baseline (128-expert Python loop) | 0.4 | 1x |
| 1 | Grouped dispatch + TF32 + autotune | 7.7 | 19x |
| 2 | Fused gate_up_proj (2->1 matmul) | 9.5 | 24x |
| 3 | Single-token decode fast path | 12.6 | 32x |
| 4 | Multi-expert Goliath kernel | 18.8 | 47x |
| 5 | Packed MoE (contiguous buffer) | 30.8 | 77x |
| 6 | Flat KV cache (zero torch.cat) | 40.9 | 102x |
| 7 | CUDA Graph + FlashDecode | 49.4 | **124x** |

## The Goliath Kernel: Why It's Fast

Standard quantized inference dequantizes weights to BF16 in global memory, then runs a matmul. This doubles memory traffic. **Goliath FP4** dequantizes **inside** the Triton matmul tile loop, in registers, with zero global memory writes:

```python
# Simplified Goliath FP4 inner loop (Triton pseudocode)
for k_block in range(0, K, BLOCK_K):
    a_tile = tl.load(a_ptr + a_offsets)        # Load activation tile
    w_packed = tl.load(weight_ptr + offsets)   # Load FP4 packed bytes
    w_lo = (w_packed & 0xF) * scale            # Dequant low nibble in-register
    w_hi = (w_packed >> 4) * scale             # Dequant high nibble in-register
    w_tile = tl.interleave(w_lo, w_hi)         # Reassemble weight tile, still in registers
    acc += tl.dot(a_tile, w_tile)              # Tensor-core matmul
```

**Packed MoE** eliminates the Python expert loop entirely. All 128 experts are packed into one contiguous `[128, K//2, N]` buffer. A single Triton kernel launch reads expert IDs from a GPU tensor and indexes into the buffer: zero `.item()` calls, zero CPU-GPU synchronization.
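The packed-dispatch idea can be sketched at the PyTorch level: gather the chosen experts from the contiguous buffer with tensor indexing so expert IDs never touch the CPU. The function name, weight layout, and SiLU gating here are illustrative assumptions; the real path is a single fused Triton kernel, not einsums:

```python
import torch

def packed_moe_ffn(x, w_gate_up, w_down, topk_ids, topk_weights):
    """Route one token through its top-k experts with pure tensor indexing.

    x:            [hidden]            token activation
    w_gate_up:    [E, hidden, 2*ff]   all experts' fused gate+up weights
    w_down:       [E, ff, hidden]     all experts' down-projection weights
    topk_ids:     [k] int64           chosen expert indices (stays on GPU)
    topk_weights: [k] float           router probabilities
    """
    gu = w_gate_up[topk_ids]                 # [k, hidden, 2*ff] gather, no .item()
    h = torch.einsum('d,kdf->kf', x, gu)     # fused gate_up_proj: one matmul
    gate, up = h.chunk(2, dim=-1)            # split the fused projection
    act = torch.nn.functional.silu(gate) * up
    out = torch.einsum('kf,kfd->kd', act, w_down[topk_ids])
    return (topk_weights.unsqueeze(-1) * out).sum(0)   # weighted combine
```

Because `topk_ids` is a GPU tensor used only for indexing, no synchronization point is introduced: the whole routing decision stays on-device, which is what makes the step graph-capturable later.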
## Quantization Formats | Format | Bits | Compression | Quality | Used For | |--------|------|-------------|---------|----------| | BF16 | 16 | 1x | Perfect | Attention Q/K/V/O | | Goliath FP4 | 4 | 4x | Near-perfect | Hot MoE experts | | FE-XC | 2 | 8x | Very good (codebook) | Cold MoE experts | | INT2 | 2 | 8x | Acceptable (scalar) | Coldest MoE experts | | Goliath FP8 | 8 | 2x | Excellent | FP8 KV cache | | FE-MX | 4-8 | 2-4x | Adaptive | Hebbian memory | ## Unique Features ### Hebbian Memory Biologically-inspired fast weights that **learn during inference** (no backpropagation). Implements competitive learning, STDP traces, intrinsic plasticity, PMI correction, and GHA decorrelation. Papers: Lansner BCPNN, Triesch 2005, Sanger's GHA. ### Atlas Gatekeeper (FE-AGK) Runtime expert management: - **Ban & Pick**: Profiles expert impact, bans bottom 25% per layer (8->6 effective experts) - **MoDES**: Skips entire MoE computation for uncertain tokens (saves ~50% compute on many layers) ### FE-XC / INT2 Cold Expert Demotion Automatically compresses rarely-used experts to 2-bit: - FE-XC: Codebook 2-bit (2x8 AQLM-style, near-FP16 quality, 5.3x faster kernel) - INT2: Scalar 2-bit (simple but lower quality) - Age-adaptive: hot->FP4, cold->FE-XC, coldest->INT2 ### FlashDecode Custom Triton attention kernel for M=1 decoding: - Online softmax (no separate softmax pass) - Reads only valid KV positions (no padding waste) - GQA support (4 KV heads -> 32 query heads) - 15.8ms per token (48 layers, 4096-token context) ### EAGLE-3 Speculative Decoding (infrastructure ready) Draft-then-verify acceleration. Draft head predicts K=5 tokens, target model verifies all 6 in one forward pass. Infrastructure complete, draft head training in progress. 
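For the greedy case, the draft-then-verify acceptance rule above can be sketched as follows (the function name is illustrative; EAGLE-3's actual sampler also handles stochastic acceptance, which this sketch omits):

```python
import torch

def verify_draft(draft_tokens, target_logits):
    """Greedy draft-then-verify acceptance.

    draft_tokens:  [K] int64   tokens proposed by the draft head
    target_logits: [K+1, V]    target-model logits at the draft positions
                               plus one bonus position

    Returns the longest prefix of the draft that the target would also
    pick greedily, plus one token from the target itself, so a single
    verify pass always yields at least one token.
    """
    target_picks = target_logits.argmax(dim=-1)        # [K+1] target's choices
    K = draft_tokens.numel()
    n = 0
    while n < K and draft_tokens[n] == target_picks[n]:
        n += 1                                          # accept matching prefix
    return torch.cat([draft_tokens[:n], target_picks[n:n + 1]])
```

With K=5 drafts, one target forward can emit anywhere from 1 to 6 tokens, which is where the projected throughput gain comes from once the draft head is trained.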
## Quick Start ```python from fireecho_kernel import FireEchoEngine # Load (streams layer-by-layer, 110s, 20 GB VRAM) engine = FireEchoEngine.from_pretrained("path/to/Qwen3-Omni-30B") # Enable optimizations engine.enable_flat_decode(kv_dtype='fp8') # FP8 KV cache engine.enable_cuda_graph_decode() # CUDA Graph # Generate input_ids = engine.tokenizer.encode("Hello, world!", return_tensors='pt').cuda() output = engine.generate(input_ids, max_new_tokens=200, temperature=0.7) print(engine.tokenizer.decode(output[0], skip_special_tokens=True)) ``` ## Requirements - **GPU**: RTX 4090 (24 GB) minimum, RTX 5090 (32 GB) recommended - **CUDA**: 12.4+ - **Python**: 3.10-3.12 - **PyTorch**: 2.4.0+ - **Triton**: 3.0+ - **OS**: Linux x86_64 ## Hardware Independence FireEcho uses **zero NVIDIA proprietary libraries**: - No cuQuantizer, CUTLASS, TensorRT, cuBLAS (except via torch.matmul for attention) - All custom kernels are pure **Triton** (compiles to NVIDIA CUDA, AMD ROCm, Intel XPU) - Runs anywhere Triton runs ## Architecture ``` FireEcho Engine ├── fireecho_kernel.py # Main engine (9000+ lines) │ ├── FireEchoEngine # Load, generate, speculate │ ├── MoEFFN # Packed MoE with fused dispatch │ ├── HebbianMemory # Fast weights (learn at inference) │ ├── FlashDecode # Triton M=1 GQA attention │ └── CUDA Graph # Graph-captured decode ├── goliath_kernel.py # Quantized GEMM kernels (3000+ lines) │ ├── GoliathFP4 # FP4 fused dequant-matmul │ ├── GoliathFP8 # FP8 fused dequant-matmul │ ├── GoliathINT2 # INT2 scalar quantization │ ├── GoliathFEXC # FE-XC codebook 2-bit │ └── Packed MoE # Contiguous expert buffers ├── triton_hebbian.py # Fused Hebbian kernels ├── femx_storage.py # Block floating point storage └── persistent_memory.py # AGI-like persistent memory ``` ## License CC BY-NC 4.0 — Free for research and non-commercial use with attribution. 
For commercial licensing: [@Joysulem on X/Twitter](https://x.com/Joysulem) ## Citation ```bibtex @software{fireecho2026, author = {Davila Flores, Luis E.}, title = {FireEcho Engine: High-Performance Single-GPU Inference for 30B+ MoE Models}, year = {2026}, url = {https://github.com/Joysulem/FireEcho} } ```