FireEcho Engine

High-performance single-GPU inference kernel for 30B+ MoE models

Created by Luis E. Davila Flores

What is FireEcho?

FireEcho is a from-scratch inference engine that runs Qwen3-Omni-30B (30.5 billion parameters, 128-expert MoE) on a single RTX 5090 at 45+ tokens/second using only 20 GB VRAM.

It achieves this through custom Triton kernels that fuse dequantization inside the matmul loop — no separate dequantization step, no global memory writes, no NVIDIA proprietary libraries.

Key Results

Metric       | Value
------------ | -----------------------------------
Model        | Qwen3-Omni-30B-A3B-Instruct
Parameters   | 30.5B total, ~3.3B active/token
GPU          | NVIDIA RTX 5090 (32 GB, Blackwell)
VRAM Usage   | 20.0 GB (model) + 3.1 GB (KV cache)
Decode Speed | 45+ tok/s (single user, greedy)
Compression  | 4x (BF16 61 GB -> FP4 20 GB)
Load Time    | 110 seconds (streaming, 3.1 GB CPU RAM)
Speedup      | 124x over naive PyTorch baseline

Benchmark Results (RTX 5090, 200 tokens/prompt, 8 diverse prompts)

Configuration                             | tok/s | vs Base | Notes
----------------------------------------- | ----- | ------- | -----------------------------
L0: Baseline (FP4 + Packed MoE + Flat KV) | 43.3  | 1.00x   | Core engine, all FP4 experts
L1: + FP8 KV cache                        | 41.8  | 0.97x   | 50% KV VRAM savings
L2: + L2 layer prefetch                   | 41.4  | 0.96x   | Pins next layer in L2 cache
L3: + Atlas gatekeeper                    | 39.7  | 0.92x   | Expert banning + MoDES skip
L4: + FE-XC cold experts (518 demoted)    | 37.4  | 0.86x   | Codebook 2-bit cold experts
L5: + INT2 coldest experts (399 demoted)  | 37.4  | 0.86x   | Scalar 2-bit coldest experts
L6: + CUDA Graph decode                   | 57.1  | 1.32x   | Graph-captured 48-layer forward

Peak VRAM: 21.5 GB | Baseline forward: ~23.1ms/tok | Full-stack forward: ~17.5ms/tok

Note: L1-L5 show slight overhead vs L0 due to the extra dispatch logic each layer adds. CUDA Graph (L6) removes the per-step Python launch overhead by capturing the full 48-layer forward as a single graph replay. The compression layers (FE-XC/INT2) cut memory traffic, which compounds with speculative decoding: with a trained EAGLE-3 head at 70% acceptance, projected throughput is ~457 tok/s.

Speed Optimization History

Starting from a naive Python loop over 128 MoE experts (0.4 tok/s), each optimization layer compounds:

Step | Optimization                       | tok/s | Cumulative Speedup
---- | ---------------------------------- | ----- | ------------------
0    | Baseline (128-expert Python loop)  | 0.4   | 1x
1    | Grouped dispatch + TF32 + autotune | 7.7   | 19x
2    | Fused gate_up_proj (2->1 matmul)   | 9.5   | 24x
3    | Single-token decode fast path      | 12.6  | 32x
4    | Multi-expert Goliath kernel        | 18.8  | 47x
5    | Packed MoE (contiguous buffer)     | 30.8  | 77x
6    | Flat KV cache (zero torch.cat)     | 40.9  | 102x
7    | CUDA Graph + FlashDecode           | 49.4  | 124x

The Goliath Kernel: Why It's Fast

Standard quantized inference dequantizes weights to BF16 in global memory, then runs a matmul. This doubles memory traffic.

Goliath FP4 dequantizes inside the Triton matmul tile loop — in registers, with zero global memory writes:

# Simplified Goliath FP4 inner loop
for k_block in range(0, K, BLOCK_K):
    w_packed = tl.load(weight_ptr + offsets)     # Load FP4 packed bytes
    w_lo = (w_packed & 0xF) * scale              # Dequant low nibble in-register
    w_hi = (w_packed >> 4) * scale               # Dequant high nibble in-register
    w_tile = tl.interleave(w_lo, w_hi)           # Reassemble full tile in registers
    acc += tl.dot(a_tile, w_tile)                # Tensor core matmul

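The nibble arithmetic can be checked end-to-end on the host. A minimal sketch in plain PyTorch (illustrative helper names, not the engine's actual packing code) of packing two 4-bit codes per byte and unpacking them with the same mask/shift the kernel uses:

```python
import torch

def pack_fp4(codes):
    """Pack pairs of 4-bit codes (values 0-15) into single bytes.

    codes: uint8 tensor of shape [..., K], K even. Returns uint8 [..., K // 2].
    """
    lo = codes[..., 0::2]                  # even positions -> low nibble
    hi = codes[..., 1::2]                  # odd positions  -> high nibble
    return lo | (hi << 4)

def unpack_fp4(packed):
    """Recover the 4-bit codes with the same mask/shift as the kernel."""
    lo = packed & 0xF
    hi = packed >> 4
    return torch.stack((lo, hi), dim=-1).flatten(-2)

codes = torch.randint(0, 16, (4, 8), dtype=torch.uint8)
assert torch.equal(unpack_fp4(pack_fp4(codes)), codes)   # lossless round trip
```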
Packed MoE eliminates the Python expert loop entirely. All 128 experts are packed into one contiguous [128, K//2, N] buffer. A single Triton kernel launch reads expert IDs from a GPU tensor and indexes into the buffer — zero .item() calls, zero CPU-GPU synchronization.
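The gather can be sketched in plain PyTorch (toy sizes and names; the real engine does this inside a single Triton launch on the FP4-packed buffer):

```python
import torch

E, K, N = 8, 16, 32                      # toy sizes (real engine: 128 experts)
packed = torch.randn(E, K, N)            # one contiguous [E, K, N] expert buffer
x = torch.randn(4, K)                    # 4 routed tokens
topk_ids = torch.tensor([1, 5, 1, 3])    # router output, stays on-device

# Advanced indexing gathers each token's expert weights directly from the
# packed buffer -- no Python loop, no .item(), no host synchronization.
w = packed[topk_ids]                                  # [4, K, N]
y = torch.bmm(x.unsqueeze(1), w).squeeze(1)           # [4, N]
```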

Quantization Formats

Format      | Bits | Compression | Quality              | Used For
----------- | ---- | ----------- | -------------------- | -------------------
BF16        | 16   | 1x          | Perfect              | Attention Q/K/V/O
Goliath FP4 | 4    | 4x          | Near-perfect         | Hot MoE experts
FE-XC       | 2    | 8x          | Very good (codebook) | Cold MoE experts
INT2        | 2    | 8x          | Acceptable (scalar)  | Coldest MoE experts
Goliath FP8 | 8    | 2x          | Excellent            | FP8 KV cache
FE-MX       | 4-8  | 2-4x        | Adaptive             | Hebbian memory

Unique Features

Hebbian Memory

Biologically-inspired fast weights that learn during inference (no backpropagation). Implements competitive learning, STDP traces, intrinsic plasticity, PMI correction, and GHA decorrelation. Papers: Lansner BCPNN, Triesch 2005, Sanger's GHA.
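
As a toy illustration of the fast-weight idea (an assumed outer-product rule with decay, far simpler than the engine's fused BCPNN/STDP kernels):

```python
import torch

def hebbian_step(W, pre, post, lr=0.01, decay=0.001):
    """One Hebbian fast-weight update: strengthen co-active pairs, slowly
    forget the rest. A local rule -- no gradients, no backpropagation.

    W: [out, in] fast weights, pre: [in], post: [out].
    """
    with torch.no_grad():
        W.mul_(1.0 - decay)                    # passive decay (forgetting)
        W.add_(lr * torch.outer(post, pre))    # Hebb: co-activation -> growth
    return W

W = torch.zeros(4, 3)
pre = torch.tensor([1.0, 0.0, 1.0])
post = torch.tensor([0.0, 1.0, 0.0, 1.0])
hebbian_step(W, pre, post)   # only entries where pre and post co-fired grow
```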

Atlas Gatekeeper (FE-AGK)

Runtime expert management:

  • Ban & Pick: Profiles expert impact, bans bottom 25% per layer (8->6 effective experts)
  • MoDES: Skips entire MoE computation for uncertain tokens (saves ~50% compute on many layers)
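
Ban & Pick can be sketched as masking router logits before top-k selection (hypothetical names and sizes; the real gatekeeper chooses the ban set by profiling expert impact at runtime):

```python
import torch

E, TOP_K = 8, 2
router_logits = torch.randn(5, E)              # [tokens, experts] router scores
banned = torch.tensor([0, 6])                  # bottom-25% experts this layer

masked = router_logits.clone()
masked[:, banned] = float('-inf')              # banned experts can never win top-k
weights, expert_ids = masked.topk(TOP_K, dim=-1)
assert not torch.isin(expert_ids, banned).any()
```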

FE-XC / INT2 Cold Expert Demotion

Automatically compresses rarely-used experts to 2-bit:

  • FE-XC: Codebook 2-bit (2x8 AQLM-style, near-FP16 quality, 5.3x faster kernel)
  • INT2: Scalar 2-bit (simple but lower quality)
  • Age-adaptive: hot->FP4, cold->FE-XC, coldest->INT2
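
A toy nearest-centroid sketch of codebook 2-bit quantization (an assumed 4-entry scalar codebook, much simpler than the actual 2x8 AQLM-style FE-XC format):

```python
import torch

def quantize_2bit(w, codebook):
    """Map each weight to the nearest of 4 centroids -> a 2-bit index."""
    dist = (w.reshape(-1, 1) - codebook.reshape(1, -1)).abs()
    return dist.argmin(dim=-1).to(torch.uint8)     # values 0..3

def dequantize_2bit(idx, codebook, shape):
    return codebook[idx.long()].reshape(shape)

codebook = torch.tensor([-1.5, -0.5, 0.5, 1.5])    # assumed 4-entry codebook
w = torch.randn(16, 16)
idx = quantize_2bit(w, codebook)
w_hat = dequantize_2bit(idx, codebook, w.shape)
# Packed four-to-a-byte, the 2-bit indices yield the 8x compression above.
```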

FlashDecode

Custom Triton attention kernel for M=1 decoding:

  • Online softmax (no separate softmax pass)
  • Reads only valid KV positions (no padding waste)
  • GQA support (4 KV heads -> 32 query heads)
  • 15.8ms per token (48 layers, 4096-token context)
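
The online-softmax recurrence can be sketched for a single query row in plain PyTorch (block-streaming over the KV cache; FlashDecode fuses this into one Triton kernel with GQA support):

```python
import torch

def online_softmax_attention(q, K, V, block=64):
    """Single-query (M=1) attention with a streaming online softmax.

    q: [d], K: [T, d], V: [T, d]. Walks the KV cache block by block,
    keeping only a running max m, denominator l, and weighted sum acc --
    the full [T]-sized softmax is never materialized.
    """
    d = q.shape[0]
    m = torch.tensor(float('-inf'))             # running max of scores
    l = torch.tensor(0.0)                       # running softmax denominator
    acc = torch.zeros(d)                        # running weighted V sum
    for start in range(0, K.shape[0], block):
        s = (K[start:start + block] @ q) / d ** 0.5
        m_new = torch.maximum(m, s.max())
        correction = torch.exp(m - m_new)       # rescale previous state
        p = torch.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ V[start:start + block]
        m = m_new
    return acc / l

q, K, V = torch.randn(8), torch.randn(200, 8), torch.randn(200, 8)
ref = torch.softmax((K @ q) / 8 ** 0.5, dim=0) @ V
assert torch.allclose(online_softmax_attention(q, K, V), ref, atol=1e-4)
```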

EAGLE-3 Speculative Decoding (infrastructure ready)

Draft-then-verify acceleration. Draft head predicts K=5 tokens, target model verifies all 6 in one forward pass. Infrastructure complete, draft head training in progress.
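
The draft-then-verify loop can be sketched with toy stand-ins for the draft head and target model (hypothetical `draft_fn`/`target_fn`; real EAGLE-3 drafts from the target's hidden states):

```python
def speculative_step(draft_fn, target_fn, ctx, k=5):
    """One draft-then-verify step.

    draft_fn(seq)  -> one cheap next-token guess (the draft head)
    target_fn(seq) -> the target's greedy token after every prefix from ctx
                      onward (one forward pass covering all k+1 positions)
    Returns the tokens accepted this step: the longest verified prefix,
    plus one target token (a correction, or a free bonus if all k match).
    """
    draft, seq = [], list(ctx)
    for _ in range(k):                          # k cheap draft steps
        t = draft_fn(seq)
        draft.append(t)
        seq.append(t)
    target_out = target_fn(seq)                 # single verifying forward
    accepted = []
    for i, t in enumerate(draft):
        if target_out[i] != t:                  # first mismatch: take the
            accepted.append(target_out[i])      # target's token and stop
            return accepted
        accepted.append(t)
    accepted.append(target_out[k])              # all matched: bonus token
    return accepted

# Toy models: "next token is previous + 1 (mod 10)", so every draft verifies.
ctx = [1, 2, 3]
draft_fn = lambda s: (s[-1] + 1) % 10
target_fn = lambda s: [(s[i] + 1) % 10 for i in range(len(ctx) - 1, len(s))]
assert speculative_step(draft_fn, target_fn, ctx, k=3) == [4, 5, 6, 7]
```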

Quick Start

from fireecho_kernel import FireEchoEngine

# Load (streams layer-by-layer, 110s, 20 GB VRAM)
engine = FireEchoEngine.from_pretrained("path/to/Qwen3-Omni-30B")

# Enable optimizations
engine.enable_flat_decode(kv_dtype='fp8')    # FP8 KV cache
engine.enable_cuda_graph_decode()            # CUDA Graph

# Generate
input_ids = engine.tokenizer.encode("Hello, world!", return_tensors='pt').cuda()
output = engine.generate(input_ids, max_new_tokens=200, temperature=0.7)
print(engine.tokenizer.decode(output[0], skip_special_tokens=True))

Requirements

  • GPU: RTX 4090 (24 GB) minimum, RTX 5090 (32 GB) recommended
  • CUDA: 12.4+
  • Python: 3.10-3.12
  • PyTorch: 2.4.0+
  • Triton: 3.0+
  • OS: Linux x86_64

Hardware Independence

FireEcho uses zero NVIDIA proprietary libraries:

  • No cuQuantizer, CUTLASS, TensorRT, cuBLAS (except via torch.matmul for attention)
  • All custom kernels are pure Triton (compiles to NVIDIA CUDA, AMD ROCm, Intel XPU)
  • Runs anywhere Triton runs

Architecture

FireEcho Engine
├── fireecho_kernel.py      # Main engine (9000+ lines)
│   ├── FireEchoEngine      # Load, generate, speculate
│   ├── MoEFFN              # Packed MoE with fused dispatch
│   ├── HebbianMemory       # Fast weights (learn at inference)
│   ├── FlashDecode         # Triton M=1 GQA attention
│   └── CUDA Graph          # Graph-captured decode
├── goliath_kernel.py       # Quantized GEMM kernels (3000+ lines)
│   ├── GoliathFP4          # FP4 fused dequant-matmul
│   ├── GoliathFP8          # FP8 fused dequant-matmul
│   ├── GoliathINT2         # INT2 scalar quantization
│   ├── GoliathFEXC         # FE-XC codebook 2-bit
│   └── Packed MoE          # Contiguous expert buffers
├── triton_hebbian.py       # Fused Hebbian kernels
├── femx_storage.py         # Block floating point storage
└── persistent_memory.py    # AGI-like persistent memory

License

CC BY-NC 4.0 — Free for research and non-commercial use with attribution.

For commercial licensing: @Joysulem on X/Twitter

Citation

@software{fireecho2026,
  author = {Davila Flores, Luis E.},
  title = {FireEcho Engine: High-Performance Single-GPU Inference for 30B+ MoE Models},
  year = {2026},
  url = {https://github.com/Joysulem/FireEcho}
}