# FireEcho Engine

High-performance single-GPU inference kernel for 30B+ MoE models

Created by Luis E. Davila Flores
## What is FireEcho?
FireEcho is a from-scratch inference engine that runs Qwen3-Omni-30B (30.5 billion parameters, 128-expert MoE) on a single RTX 5090 at 45+ tokens/second using only 20 GB VRAM.
It achieves this through custom Triton kernels that fuse dequantization inside the matmul loop — no separate dequantization step, no global memory writes, no NVIDIA proprietary libraries.
## Key Results
| Metric | Value |
|---|---|
| Model | Qwen3-Omni-30B-A3B-Instruct |
| Parameters | 30.5B total, ~3.3B active/token |
| GPU | NVIDIA RTX 5090 (32 GB, Blackwell) |
| VRAM Usage | 20.0 GB (model) + 3.1 GB (KV cache) |
| Decode Speed | 45+ tok/s (single user, greedy) |
| Compression | 4x (BF16 61 GB -> FP4 20 GB) |
| Load Time | 110 seconds (streaming, 3.1 GB CPU RAM) |
| Speedup | 124x over naive PyTorch baseline |
### Benchmark Results (RTX 5090, 200 tokens/prompt, 8 diverse prompts)
| Configuration | tok/s | vs Base | Notes |
|---|---|---|---|
| L0: Baseline (FP4 + Packed MoE + Flat KV) | 43.3 | 1.00x | Core engine, all FP4 experts |
| L1: + FP8 KV cache | 41.8 | 0.97x | 50% KV VRAM savings |
| L2: + L2 layer prefetch | 41.4 | 0.96x | Pins next layer in L2 cache |
| L3: + Atlas gatekeeper | 39.7 | 0.92x | Expert banning + MoDES skip |
| L4: + FE-XC cold experts (518 demoted) | 37.4 | 0.86x | Codebook 2-bit cold experts |
| L5: + INT2 coldest experts (399 demoted) | 37.4 | 0.86x | Scalar 2-bit coldest experts |
| L6: + CUDA Graph decode | 57.1 | 1.32x | Graph-captured 48-layer forward |
Peak VRAM: 21.5 GB | Baseline forward: ~23.1ms/tok | Full-stack forward: ~17.5ms/tok
Note: L1-L5 show slight overhead vs L0 due to additional dispatch logic. CUDA Graph (L6) eliminates all Python overhead and captures the full 48-layer forward as a single graph replay. The compression layers (FE-XC/INT2) reduce memory bandwidth which compounds with speculative decoding — with a trained EAGLE-3 head at 70% acceptance, projected throughput is ~457 tok/s.
## Speed Optimization History
Starting from a naive Python loop over 128 MoE experts (0.4 tok/s), each optimization layer compounds:
| Step | Optimization | tok/s | Cumulative Speedup |
|---|---|---|---|
| 0 | Baseline (128-expert Python loop) | 0.4 | 1x |
| 1 | Grouped dispatch + TF32 + autotune | 7.7 | 19x |
| 2 | Fused gate_up_proj (2->1 matmul) | 9.5 | 24x |
| 3 | Single-token decode fast path | 12.6 | 32x |
| 4 | Multi-expert Goliath kernel | 18.8 | 47x |
| 5 | Packed MoE (contiguous buffer) | 30.8 | 77x |
| 6 | Flat KV cache (zero torch.cat) | 40.9 | 102x |
| 7 | CUDA Graph + FlashDecode | 49.4 | 124x |
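Step 6 replaces per-token `torch.cat` growth of the KV cache with a buffer allocated once up front. A minimal sketch of the idea (variable names and shapes are illustrative, not the engine's):

```python
import torch

max_len, n_kv_heads, head_dim = 4096, 4, 128

# Allocate the whole cache once; appending a token is an in-place write, not a cat
k_cache = torch.zeros(max_len, n_kv_heads, head_dim)
v_cache = torch.zeros(max_len, n_kv_heads, head_dim)
pos = 0

def append_kv(k, v):
    """O(1) append: write at the current position, return zero-copy views."""
    global pos
    k_cache[pos], v_cache[pos] = k, v
    pos += 1
    return k_cache[:pos], v_cache[:pos]

k_view, v_view = append_kv(torch.randn(n_kv_heads, head_dim),
                           torch.randn(n_kv_heads, head_dim))
```

With `torch.cat`, every decoded token reallocates and copies the whole cache; the flat buffer makes each step a constant-time write.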
## The Goliath Kernel: Why It's Fast
Standard quantized inference dequantizes weights to BF16 in global memory, then runs a matmul. This doubles memory traffic.
Goliath FP4 dequantizes inside the Triton matmul tile loop — in registers, with zero global memory writes:
```python
# Simplified Goliath FP4 inner loop (Triton pseudocode)
for k_block in range(0, K, BLOCK_K):
    w_packed = tl.load(weight_ptr + offsets)         # Load FP4 packed bytes
    w_lo = (w_packed & 0xF).to(tl.bfloat16) * scale  # Dequant low nibble in-register
    w_hi = (w_packed >> 4).to(tl.bfloat16) * scale   # Dequant high nibble in-register
    w_tile = tl.interleave(w_lo, w_hi)               # Reassemble the dequantized tile
    acc += tl.dot(a_tile, w_tile)                    # Tensor core matmul, zero global writes
```
Packed MoE eliminates the Python expert loop entirely. All 128 experts are packed into one contiguous [128, K//2, N] buffer. A single Triton kernel launch reads expert IDs from a GPU tensor and indexes into the buffer — zero .item() calls, zero CPU-GPU synchronization.
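The indexing pattern can be sketched in plain PyTorch (an illustration of the idea only, not the Triton kernel; shapes and top-k are hypothetical, and the real kernel reads the packed buffer inside the launch rather than materializing the gather):

```python
import torch

num_experts, K, N, top_k = 128, 64, 32, 2
packed = torch.randn(num_experts, K, N)        # all experts in one contiguous buffer
x = torch.randn(4, K)                          # 4 tokens
router_logits = torch.randn(4, num_experts)

# Routing stays on-GPU: expert IDs live in a tensor, never touch Python ints
weights, expert_ids = router_logits.softmax(dim=-1).topk(top_k, dim=-1)

w = packed[expert_ids]                         # gather by tensor index: [4, top_k, K, N]
y = torch.einsum('tk,tekn->ten', x, w)         # per-token, per-expert matmul
out = (weights.unsqueeze(-1) * y).sum(dim=1)   # weighted mix of expert outputs -> [4, N]
```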
## Quantization Formats
| Format | Bits | Compression | Quality | Used For |
|---|---|---|---|---|
| BF16 | 16 | 1x | Perfect | Attention Q/K/V/O |
| Goliath FP4 | 4 | 4x | Near-perfect | Hot MoE experts |
| FE-XC | 2 | 8x | Very good (codebook) | Cold MoE experts |
| INT2 | 2 | 8x | Acceptable (scalar) | Coldest MoE experts |
| Goliath FP8 | 8 | 2x | Excellent | FP8 KV cache |
| FE-MX | 4-8 | 2-4x | Adaptive | Hebbian memory |
## Unique Features
### Hebbian Memory
Biologically-inspired fast weights that learn during inference (no backpropagation). Implements competitive learning, STDP traces, intrinsic plasticity, PMI correction, and GHA decorrelation. Papers: Lansner BCPNN, Triesch 2005, Sanger's GHA.
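The core fast-weight idea can be sketched as a decaying outer-product update applied during the forward pass, with no gradients (a minimal illustration; the constants are ours, and the engine layers competitive learning, STDP traces, and the other mechanisms above on top of this):

```python
import torch

d_in, d_out = 16, 8
lr, decay = 0.1, 0.95                  # hypothetical learning rate and decay
W_fast = torch.zeros(d_out, d_in)      # fast weights start empty

def hebbian_step(pre, post):
    """Decay-then-update Hebbian rule: W <- decay*W + lr*(post x pre). No backprop."""
    global W_fast
    W_fast = decay * W_fast + lr * torch.outer(post, pre)

# During inference, each forward pass also updates the fast weights
x = torch.randn(d_in)                  # presynaptic activity (layer input)
y = torch.tanh(torch.randn(d_out))     # postsynaptic activity (stand-in activation)
hebbian_step(x, y)
correction = W_fast @ x                # fast-weight contribution to the output
```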
### Atlas Gatekeeper (FE-AGK)
Runtime expert management:
- Ban & Pick: Profiles expert impact, bans bottom 25% per layer (8->6 effective experts)
- MoDES: Skips entire MoE computation for uncertain tokens (saves ~50% compute on many layers)
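The Ban & Pick mechanism amounts to masking banned experts' router scores before top-k selection; a hedged sketch (the banned IDs here are arbitrary, and the real profiling criterion is the engine's):

```python
import torch

num_experts, top_k = 8, 2
router_logits = torch.randn(3, num_experts)    # router scores for 3 tokens

# Experts banned for this layer — the bottom 25% by profiled impact (IDs hypothetical)
banned = [5, 7]
masked = router_logits.clone()
masked[:, banned] = float('-inf')              # banned experts can never win top-k

weights, expert_ids = masked.softmax(dim=-1).topk(top_k, dim=-1)
```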
### FE-XC / INT2 Cold Expert Demotion
Automatically compresses rarely-used experts to 2-bit:
- FE-XC: Codebook 2-bit (2x8 AQLM-style, near-FP16 quality, 5.3x faster kernel)
- INT2: Scalar 2-bit (simple but lower quality)
- Age-adaptive: hot->FP4, cold->FE-XC, coldest->INT2
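A minimal sketch of what scalar 2-bit (INT2) storage looks like — four weights per byte plus one scale (our illustration; the engine's group sizes and quantization levels may differ):

```python
import torch

def int2_quantize(w):
    """Map each weight to one of 4 levels ({-1.5,-0.5,0.5,1.5}*scale), 4 per byte."""
    scale = w.abs().max() / 1.5
    q = torch.clamp(torch.round(w / scale + 1.5), 0, 3).to(torch.uint8)
    packed = q[0::4] | (q[1::4] << 2) | (q[2::4] << 4) | (q[3::4] << 6)
    return packed, scale

def int2_dequantize(packed, scale, n):
    """Unpack 2-bit codes from each byte and map back to the 4 levels."""
    idx = torch.stack([(packed >> s) & 0x3 for s in (0, 2, 4, 6)], dim=1).flatten()
    return (idx[:n].float() - 1.5) * scale

w = torch.randn(64)
packed, scale = int2_quantize(w)       # 64 weights -> 16 bytes (8x vs BF16)
w_hat = int2_dequantize(packed, scale, 64)
```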
### FlashDecode
Custom Triton attention kernel for M=1 decoding:
- Online softmax (no separate softmax pass)
- Reads only valid KV positions (no padding waste)
- GQA support (4 KV heads -> 32 query heads)
- 15.8ms per token (48 layers, 4096-token context)
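The online softmax processes the KV cache block by block, carrying a running max and normalizer so the full attention row is never materialized. A single-head reference sketch (ours, not the Triton kernel):

```python
import torch

def online_softmax_attention(q, K, V, block=4):
    """One query vector vs. cached K/V, processed in blocks (M=1 decode)."""
    m = torch.tensor(float('-inf'))   # running max of scores
    l = torch.tensor(0.0)             # running softmax normalizer
    acc = torch.zeros(V.shape[1])     # running weighted sum of V
    scale = q.shape[0] ** -0.5
    for i in range(0, K.shape[0], block):
        s = (K[i:i + block] @ q) * scale       # scores for this block
        m_new = torch.maximum(m, s.max())
        p = torch.exp(s - m_new)
        corr = torch.exp(m - m_new)            # rescale old state to the new max
        l = l * corr + p.sum()
        acc = acc * corr + p @ V[i:i + block]
        m = m_new
    return acc / l

q = torch.randn(8)
K, V = torch.randn(10, 8), torch.randn(10, 8)
out = online_softmax_attention(q, K, V)
ref = torch.softmax((K @ q) * 8 ** -0.5, dim=0) @ V   # one-shot softmax for comparison
```

Because only the 10 valid positions are read, there is no padding waste, and no intermediate `[1, seq_len]` softmax tensor is ever written out.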
### EAGLE-3 Speculative Decoding (infrastructure ready)
Draft-then-verify acceleration. Draft head predicts K=5 tokens, target model verifies all 6 in one forward pass. Infrastructure complete, draft head training in progress.
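For greedy decoding, the verification rule is simple: accept the longest draft prefix the target agrees with, and replace the first mismatch with the target's own token. A hedged sketch (the draft head is omitted; the toy logits are ours):

```python
import torch

def verify_greedy(draft_tokens, target_logits):
    """Greedy verification: keep draft tokens while the target's argmax agrees;
    on the first disagreement, emit the target's token and stop."""
    target_tokens = target_logits.argmax(dim=-1)   # [K+1] target predictions
    accepted = []
    for i, t in enumerate(draft_tokens):           # K drafted tokens
        if t == target_tokens[i].item():
            accepted.append(t)                     # match -> keep the draft token
        else:
            accepted.append(target_tokens[i].item())  # mismatch -> correct and stop
            return accepted
    accepted.append(target_tokens[len(draft_tokens)].item())  # all matched: bonus token
    return accepted

K, vocab = 5, 100
draft = [3, 7, 7, 2, 9]
logits = torch.zeros(K + 1, vocab)
logits[torch.arange(K + 1), torch.tensor([3, 7, 5, 1, 1, 4])] = 1.0  # target argmaxes
print(verify_greedy(draft, logits))   # -> [3, 7, 5]: two accepted, third corrected
```

Every accepted draft token is a decode step saved, which is why lower memory bandwidth per verify pass compounds with the acceptance rate.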
## Quick Start
```python
from fireecho_kernel import FireEchoEngine

# Load (streams layer-by-layer, ~110 s, 20 GB VRAM)
engine = FireEchoEngine.from_pretrained("path/to/Qwen3-Omni-30B")

# Enable optimizations
engine.enable_flat_decode(kv_dtype='fp8')   # FP8 KV cache
engine.enable_cuda_graph_decode()           # CUDA Graph

# Generate
input_ids = engine.tokenizer.encode("Hello, world!", return_tensors='pt').cuda()
output = engine.generate(input_ids, max_new_tokens=200, temperature=0.7)
print(engine.tokenizer.decode(output[0], skip_special_tokens=True))
```
## Requirements
- GPU: RTX 4090 (24 GB) minimum, RTX 5090 (32 GB) recommended
- CUDA: 12.4+
- Python: 3.10-3.12
- PyTorch: 2.4.0+
- Triton: 3.0+
- OS: Linux x86_64
## Hardware Independence
FireEcho uses zero NVIDIA proprietary libraries:
- No cuQuantizer, CUTLASS, TensorRT, cuBLAS (except via torch.matmul for attention)
- All custom kernels are pure Triton (compiles to NVIDIA CUDA, AMD ROCm, Intel XPU)
- Runs anywhere Triton runs
## Architecture

```
FireEcho Engine
├── fireecho_kernel.py      # Main engine (9000+ lines)
│   ├── FireEchoEngine      #   Load, generate, speculate
│   ├── MoEFFN              #   Packed MoE with fused dispatch
│   ├── HebbianMemory       #   Fast weights (learn at inference)
│   ├── FlashDecode         #   Triton M=1 GQA attention
│   └── CUDA Graph          #   Graph-captured decode
├── goliath_kernel.py       # Quantized GEMM kernels (3000+ lines)
│   ├── GoliathFP4          #   FP4 fused dequant-matmul
│   ├── GoliathFP8          #   FP8 fused dequant-matmul
│   ├── GoliathINT2         #   INT2 scalar quantization
│   ├── GoliathFEXC         #   FE-XC codebook 2-bit
│   └── Packed MoE          #   Contiguous expert buffers
├── triton_hebbian.py       # Fused Hebbian kernels
├── femx_storage.py         # Block floating point storage
└── persistent_memory.py    # Persistent memory store
```
## License
CC BY-NC 4.0 — Free for research and non-commercial use with attribution.
For commercial licensing: @Joysulem on X/Twitter
## Citation

```bibtex
@software{fireecho2026,
  author = {Davila Flores, Luis E.},
  title  = {FireEcho Engine: High-Performance Single-GPU Inference for 30B+ MoE Models},
  year   = {2026},
  url    = {https://github.com/Joysulem/FireEcho}
}
```