# FireEcho Engine

High-performance single-GPU inference kernel for 30B+ MoE models

Created by Luis E. Davila Flores
## What is FireEcho?
FireEcho is a from-scratch inference engine that runs Qwen3-Omni-30B (30.5 billion parameters, 128-expert MoE) on a single RTX 5090 at 45+ tokens/second using only 20 GB VRAM.
It achieves this through custom Triton kernels that fuse dequantization inside the matmul loop — no separate dequantization step, no global memory writes, no NVIDIA proprietary libraries.
## Key Results
| Metric | Value |
|---|---|
| Model | Qwen3-Omni-30B-A3B-Instruct |
| Parameters | 30.5B total, ~3.3B active/token |
| GPU | NVIDIA RTX 5090 (32 GB, Blackwell) |
| VRAM Usage | 20.0 GB (model) + 3.1 GB (KV cache) |
| Decode Speed | 45+ tok/s (single user, greedy) |
| Compression | 4x (BF16 61 GB -> FP4 20 GB) |
| Load Time | 110 seconds (streaming, 3.1 GB CPU RAM) |
| Speedup | 124x over naive PyTorch baseline |
### Benchmark Results (RTX 5090, 200 tokens/prompt, 8 diverse prompts)
| Configuration | tok/s | vs Base | Notes |
|---|---|---|---|
| L0: Baseline (FP4 + Packed MoE + Flat KV) | 43.3 | 1.00x | Core engine, all FP4 experts |
| L1: + FP8 KV cache | 41.8 | 0.97x | 50% KV VRAM savings |
| L2: + L2 layer prefetch | 41.4 | 0.96x | Pins next layer in L2 cache |
| L3: + Atlas gatekeeper | 39.7 | 0.92x | Expert banning + MoDES skip |
| L4: + FE-XC cold experts (518 demoted) | 37.4 | 0.86x | Codebook 2-bit cold experts |
| L5: + INT2 coldest experts (399 demoted) | 37.4 | 0.86x | Scalar 2-bit coldest experts |
| L6: + CUDA Graph decode | 57.1 | 1.32x | Graph-captured 48-layer forward |
Peak VRAM: 21.5 GB | Baseline forward: ~23.1ms/tok | Full-stack forward: ~17.5ms/tok
Note: L1-L5 show slight overhead vs L0 due to additional dispatch logic. CUDA Graph (L6) eliminates all Python overhead and captures the full 48-layer forward as a single graph replay. The compression layers (FE-XC/INT2) reduce memory bandwidth which compounds with speculative decoding — with a trained EAGLE-3 head at 70% acceptance, projected throughput is ~457 tok/s.
## Speed Optimization History
Starting from a naive Python loop over 128 MoE experts (0.4 tok/s), each optimization layer compounds:
| Step | Optimization | tok/s | Cumulative Speedup |
|---|---|---|---|
| 0 | Baseline (128-expert Python loop) | 0.4 | 1x |
| 1 | Grouped dispatch + TF32 + autotune | 7.7 | 19x |
| 2 | Fused gate_up_proj (2->1 matmul) | 9.5 | 24x |
| 3 | Single-token decode fast path | 12.6 | 32x |
| 4 | Multi-expert Goliath kernel | 18.8 | 47x |
| 5 | Packed MoE (contiguous buffer) | 30.8 | 77x |
| 6 | Flat KV cache (zero torch.cat) | 40.9 | 102x |
| 7 | CUDA Graph + FlashDecode | 49.4 | 124x |
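Step 6 replaces per-token `torch.cat` growth of the KV cache with a buffer allocated once up front. A minimal sketch of the idea (variable names and shapes are illustrative, not the engine's):

```python
import torch

max_len, n_kv_heads, head_dim = 4096, 4, 128

# Allocate the whole cache once; appending a token is an in-place write, not a cat
k_cache = torch.zeros(max_len, n_kv_heads, head_dim)
v_cache = torch.zeros(max_len, n_kv_heads, head_dim)
pos = 0

def append_kv(k, v):
    """O(1) append: write at the current position, return zero-copy views."""
    global pos
    k_cache[pos], v_cache[pos] = k, v
    pos += 1
    return k_cache[:pos], v_cache[:pos]

k_view, v_view = append_kv(torch.randn(n_kv_heads, head_dim),
                           torch.randn(n_kv_heads, head_dim))
```

With `torch.cat`, every decoded token reallocates and copies the whole cache; the flat buffer makes each step a constant-time write.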
## The Goliath Kernel: Why It's Fast
Standard quantized inference dequantizes weights to BF16 in global memory, then runs a matmul. This doubles memory traffic.
Goliath FP4 dequantizes inside the Triton matmul tile loop — in registers, with zero global memory writes:
```python
# Simplified Goliath FP4 inner loop (Triton pseudocode)
for k_block in range(0, K, BLOCK_K):
    w_packed = tl.load(weight_ptr + offsets)         # Load FP4 packed bytes
    w_lo = (w_packed & 0xF).to(tl.bfloat16) * scale  # Dequant low nibble in-register
    w_hi = (w_packed >> 4).to(tl.bfloat16) * scale   # Dequant high nibble in-register
    w_tile = tl.interleave(w_lo, w_hi)               # Reassemble the dequantized tile
    acc += tl.dot(a_tile, w_tile)                    # Tensor core matmul, zero global writes
```
Packed MoE eliminates the Python expert loop entirely. All 128 experts are packed into one contiguous [128, K//2, N] buffer. A single Triton kernel launch reads expert IDs from a GPU tensor and indexes into the buffer — zero .item() calls, zero CPU-GPU synchronization.
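The indexing pattern can be sketched in plain PyTorch (an illustration of the idea only, not the Triton kernel; shapes and top-k are hypothetical, and the real kernel reads the packed buffer inside the launch rather than materializing the gather):

```python
import torch

num_experts, K, N, top_k = 128, 64, 32, 2
packed = torch.randn(num_experts, K, N)        # all experts in one contiguous buffer
x = torch.randn(4, K)                          # 4 tokens
router_logits = torch.randn(4, num_experts)

# Routing stays on-GPU: expert IDs live in a tensor, never touch Python ints
weights, expert_ids = router_logits.softmax(dim=-1).topk(top_k, dim=-1)

w = packed[expert_ids]                         # gather by tensor index: [4, top_k, K, N]
y = torch.einsum('tk,tekn->ten', x, w)         # per-token, per-expert matmul
out = (weights.unsqueeze(-1) * y).sum(dim=1)   # weighted mix of expert outputs -> [4, N]
```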
## Quantization Formats
| Format | Bits | Compression | Quality | Used For |
|---|---|---|---|---|
| BF16 | 16 | 1x | Perfect | Attention Q/K/V/O |
| Goliath FP4 | 4 | 4x | Near-perfect | Hot MoE experts |
| FE-XC | 2 | 8x | Very good (codebook) | Cold MoE experts |
| INT2 | 2 | 8x | Acceptable (scalar) | Coldest MoE experts |
| Goliath FP8 | 8 | 2x | Excellent | FP8 KV cache |
| FE-MX | 4-8 | 2-4x | Adaptive | Hebbian memory |
## Unique Features
### Hebbian Memory
Biologically-inspired fast weights that learn during inference (no backpropagation). Implements competitive learning, STDP traces, intrinsic plasticity, PMI correction, and GHA decorrelation. Papers: Lansner BCPNN, Triesch 2005, Sanger's GHA.
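The core fast-weight idea can be sketched as a decaying outer-product update applied during the forward pass, with no gradients (a minimal illustration; the constants are ours, and the engine layers competitive learning, STDP traces, and the other mechanisms above on top of this):

```python
import torch

d_in, d_out = 16, 8
lr, decay = 0.1, 0.95                  # hypothetical learning rate and decay
W_fast = torch.zeros(d_out, d_in)      # fast weights start empty

def hebbian_step(pre, post):
    """Decay-then-update Hebbian rule: W <- decay*W + lr*(post x pre). No backprop."""
    global W_fast
    W_fast = decay * W_fast + lr * torch.outer(post, pre)

# During inference, each forward pass also updates the fast weights
x = torch.randn(d_in)                  # presynaptic activity (layer input)
y = torch.tanh(torch.randn(d_out))     # postsynaptic activity (stand-in activation)
hebbian_step(x, y)
correction = W_fast @ x                # fast-weight contribution to the output
```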
### Atlas Gatekeeper (FE-AGK)
Runtime expert management:
- Ban & Pick: Profiles expert impact, bans bottom 25% per layer (8->6 effective experts)
- MoDES: Skips entire MoE computation for uncertain tokens (saves ~50% compute on many layers)
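The Ban & Pick mechanism amounts to masking banned experts' router scores before top-k selection; a hedged sketch (the banned IDs here are arbitrary, and the real profiling criterion is the engine's):

```python
import torch

num_experts, top_k = 8, 2
router_logits = torch.randn(3, num_experts)    # router scores for 3 tokens

# Experts banned for this layer — the bottom 25% by profiled impact (IDs hypothetical)
banned = [5, 7]
masked = router_logits.clone()
masked[:, banned] = float('-inf')              # banned experts can never win top-k

weights, expert_ids = masked.softmax(dim=-1).topk(top_k, dim=-1)
```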
### FE-XC / INT2 Cold Expert Demotion
Automatically compresses rarely-used experts to 2-bit:
- FE-XC: Codebook 2-bit (2x8 AQLM-style, near-FP16 quality, 5.3x faster kernel)
- INT2: Scalar 2-bit (simple but lower quality)
- Age-adaptive: hot->FP4, cold->FE-XC, coldest->INT2
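A minimal sketch of what scalar 2-bit (INT2) storage looks like — four weights per byte plus one scale (our illustration; the engine's group sizes and quantization levels may differ):

```python
import torch

def int2_quantize(w):
    """Map each weight to one of 4 levels ({-1.5,-0.5,0.5,1.5}*scale), 4 per byte."""
    scale = w.abs().max() / 1.5
    q = torch.clamp(torch.round(w / scale + 1.5), 0, 3).to(torch.uint8)
    packed = q[0::4] | (q[1::4] << 2) | (q[2::4] << 4) | (q[3::4] << 6)
    return packed, scale

def int2_dequantize(packed, scale, n):
    """Unpack 2-bit codes from each byte and map back to the 4 levels."""
    idx = torch.stack([(packed >> s) & 0x3 for s in (0, 2, 4, 6)], dim=1).flatten()
    return (idx[:n].float() - 1.5) * scale

w = torch.randn(64)
packed, scale = int2_quantize(w)       # 64 weights -> 16 bytes (8x vs BF16)
w_hat = int2_dequantize(packed, scale, 64)
```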
### FlashDecode
Custom Triton attention kernel for M=1 decoding:
- Online softmax (no separate softmax pass)
- Reads only valid KV positions (no padding waste)
- GQA support (4 KV heads -> 32 query heads)
- 15.8ms per token (48 layers, 4096-token context)
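The online softmax processes the KV cache block by block, carrying a running max and normalizer so the full attention row is never materialized. A single-head reference sketch (ours, not the Triton kernel):

```python
import torch

def online_softmax_attention(q, K, V, block=4):
    """One query vector vs. cached K/V, processed in blocks (M=1 decode)."""
    m = torch.tensor(float('-inf'))   # running max of scores
    l = torch.tensor(0.0)             # running softmax normalizer
    acc = torch.zeros(V.shape[1])     # running weighted sum of V
    scale = q.shape[0] ** -0.5
    for i in range(0, K.shape[0], block):
        s = (K[i:i + block] @ q) * scale       # scores for this block
        m_new = torch.maximum(m, s.max())
        p = torch.exp(s - m_new)
        corr = torch.exp(m - m_new)            # rescale old state to the new max
        l = l * corr + p.sum()
        acc = acc * corr + p @ V[i:i + block]
        m = m_new
    return acc / l

q = torch.randn(8)
K, V = torch.randn(10, 8), torch.randn(10, 8)
out = online_softmax_attention(q, K, V)
ref = torch.softmax((K @ q) * 8 ** -0.5, dim=0) @ V   # one-shot softmax for comparison
```

Because only the 10 valid positions are read, there is no padding waste, and no intermediate `[1, seq_len]` softmax tensor is ever written out.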
### EAGLE-3 Speculative Decoding (infrastructure ready)
Draft-then-verify acceleration. Draft head predicts K=5 tokens, target model verifies all 6 in one forward pass. Infrastructure complete, draft head training in progress.
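For greedy decoding, the verification rule is simple: accept the longest draft prefix the target agrees with, and replace the first mismatch with the target's own token. A hedged sketch (the draft head is omitted; the toy logits are ours):

```python
import torch

def verify_greedy(draft_tokens, target_logits):
    """Greedy verification: keep draft tokens while the target's argmax agrees;
    on the first disagreement, emit the target's token and stop."""
    target_tokens = target_logits.argmax(dim=-1)   # [K+1] target predictions
    accepted = []
    for i, t in enumerate(draft_tokens):           # K drafted tokens
        if t == target_tokens[i].item():
            accepted.append(t)                     # match -> keep the draft token
        else:
            accepted.append(target_tokens[i].item())  # mismatch -> correct and stop
            return accepted
    accepted.append(target_tokens[len(draft_tokens)].item())  # all matched: bonus token
    return accepted

K, vocab = 5, 100
draft = [3, 7, 7, 2, 9]
logits = torch.zeros(K + 1, vocab)
logits[torch.arange(K + 1), torch.tensor([3, 7, 5, 1, 1, 4])] = 1.0  # target argmaxes
print(verify_greedy(draft, logits))   # -> [3, 7, 5]: two accepted, third corrected
```

Every accepted draft token is a decode step saved, which is why lower memory bandwidth per verify pass compounds with the acceptance rate.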
## Quick Start
```python
from fireecho_kernel import FireEchoEngine

# Load (streams layer-by-layer, ~110 s, 20 GB VRAM)
engine = FireEchoEngine.from_pretrained("path/to/Qwen3-Omni-30B")

# Enable optimizations
engine.enable_flat_decode(kv_dtype='fp8')   # FP8 KV cache
engine.enable_cuda_graph_decode()           # CUDA Graph

# Generate
input_ids = engine.tokenizer.encode("Hello, world!", return_tensors='pt').cuda()
output = engine.generate(input_ids, max_new_tokens=200, temperature=0.7)
print(engine.tokenizer.decode(output[0], skip_special_tokens=True))
```
## Requirements
- GPU: RTX 4090 (24 GB) minimum, RTX 5090 (32 GB) recommended
- CUDA: 12.4+
- Python: 3.10-3.12
- PyTorch: 2.4.0+
- Triton: 3.0+
- OS: Linux x86_64
## Hardware Independence
FireEcho uses zero NVIDIA proprietary libraries:
- No cuQuantizer, CUTLASS, TensorRT, cuBLAS (except via torch.matmul for attention)
- All custom kernels are pure Triton (compiles to NVIDIA CUDA, AMD ROCm, Intel XPU)
- Runs anywhere Triton runs
## Architecture

```
FireEcho Engine
├── fireecho_kernel.py      # Main engine (9000+ lines)
│   ├── FireEchoEngine      #   Load, generate, speculate
│   ├── MoEFFN              #   Packed MoE with fused dispatch
│   ├── HebbianMemory       #   Fast weights (learn at inference)
│   ├── FlashDecode         #   Triton M=1 GQA attention
│   └── CUDA Graph          #   Graph-captured decode
├── goliath_kernel.py       # Quantized GEMM kernels (3000+ lines)
│   ├── GoliathFP4          #   FP4 fused dequant-matmul
│   ├── GoliathFP8          #   FP8 fused dequant-matmul
│   ├── GoliathINT2         #   INT2 scalar quantization
│   ├── GoliathFEXC         #   FE-XC codebook 2-bit
│   └── Packed MoE          #   Contiguous expert buffers
├── triton_hebbian.py       # Fused Hebbian kernels
├── femx_storage.py         # Block floating point storage
└── persistent_memory.py    # Persistent memory store
```
## License
CC BY-NC 4.0 — Free for research and non-commercial use with attribution.
For commercial licensing: @Joysulem on X/Twitter
## Citation

```bibtex
@software{fireecho2026,
  author = {Davila Flores, Luis E.},
  title  = {FireEcho Engine: High-Performance Single-GPU Inference for 30B+ MoE Models},
  year   = {2026},
  url    = {https://github.com/Joysulem/FireEcho}
}
```