---
library_name: fireecho
tags:
  - inference
  - triton
  - quantization
  - moe
  - fp4
  - fp8
  - int2
  - single-gpu
  - blackwell
  - hebbian
  - speculative-decoding
  - custom-kernel
license: cc-by-nc-4.0
pipeline_tag: text-generation
base_model:
  - Qwen/Qwen3-Omni-30B-A3B-Instruct
model-index:
  - name: FireEcho Engine
    results:
      - task:
          type: text-generation
          name: Inference Throughput
        dataset:
          name: FireEcho Full-Stack Benchmark (8 diverse prompts, 200 tok each)
          type: custom
        metrics:
          - name: Decode Speed (Baseline FP4)
            type: tokens_per_second
            value: 43.3
            verified: false
          - name: Decode Speed (Full-Stack + CUDA Graph)
            type: tokens_per_second
            value: 57.1
            verified: false
          - name: Speedup vs Naive PyTorch
            type: speedup
            value: 124
            verified: false
          - name: VRAM Usage (Model)
            type: gpu_memory_gb
            value: 20
            verified: false
          - name: VRAM Usage (Peak)
            type: gpu_memory_gb
            value: 21.5
            verified: false
          - name: Model Load Time (seconds)
            type: latency
            value: 110
            verified: false
          - name: Compression Ratio (BF16 to FP4)
            type: compression
            value: 4
            verified: false
      - task:
          type: text-generation
          name: Optimization Stack Ablation
        dataset:
          name: FireEcho Ablation (RTX 5090, 200 tok/prompt, greedy)
          type: custom
        metrics:
          - name: 'L0: Baseline (FP4 + Packed MoE + Flat KV)'
            type: tokens_per_second
            value: 43.3
          - name: 'L1: + FP8 KV Cache'
            type: tokens_per_second
            value: 41.8
          - name: 'L2: + L2 Layer Prefetch'
            type: tokens_per_second
            value: 41.4
          - name: 'L3: + Atlas Ban & Pick + MoDES'
            type: tokens_per_second
            value: 39.7
          - name: 'L4: + FE-XC Cold Experts (518 demoted)'
            type: tokens_per_second
            value: 37.4
          - name: 'L5: + INT2 Coldest Experts (399 demoted)'
            type: tokens_per_second
            value: 37.4
          - name: 'L6: + CUDA Graph Decode'
            type: tokens_per_second
            value: 57.1
      - task:
          type: text-generation
          name: Optimization History (0.4 to 49.4 tok/s)
        dataset:
          name: FireEcho Kernel Optimization Log (RTX 5090)
          type: custom
        metrics:
          - name: 'Step 0: Naive Python Loop (128 experts)'
            type: tokens_per_second
            value: 0.4
          - name: 'Step 1: Grouped Dispatch + TF32'
            type: tokens_per_second
            value: 7.7
          - name: 'Step 2: Fused gate_up_proj'
            type: tokens_per_second
            value: 9.5
          - name: 'Step 3: Single-Token Decode Fast Path'
            type: tokens_per_second
            value: 12.6
          - name: 'Step 4: Multi-Expert Goliath Kernel'
            type: tokens_per_second
            value: 18.8
          - name: 'Step 5: Packed MoE (Contiguous Buffer)'
            type: tokens_per_second
            value: 30.8
          - name: 'Step 6: Flat KV Cache (Zero torch.cat)'
            type: tokens_per_second
            value: 40.9
          - name: 'Step 7: CUDA Graph + FlashDecode'
            type: tokens_per_second
            value: 49.4
---

# FireEcho Engine

**High-performance single-GPU inference kernel for 30B+ MoE models**

Created by Luis E. Davila Flores

## What is FireEcho?

FireEcho is a from-scratch inference engine that runs Qwen3-Omni-30B (30.5 billion parameters, 128-expert MoE) on a single RTX 5090 at 45+ tokens/second using only 20 GB VRAM.

It achieves this through custom Triton kernels that fuse dequantization inside the matmul loop: no separate dequantization step, no global memory writes, no NVIDIA proprietary libraries.

## Key Results

| Metric | Value |
|---|---|
| Model | Qwen3-Omni-30B-A3B-Instruct |
| Parameters | 30.5B total, ~3.3B active/token |
| GPU | NVIDIA RTX 5090 (32 GB, Blackwell) |
| VRAM Usage | 20.0 GB (model) + 3.1 GB (KV cache) |
| Decode Speed | 45+ tok/s (single user, greedy) |
| Compression | 4x (BF16 61 GB -> FP4 20 GB) |
| Load Time | 110 seconds (streaming, 3.1 GB CPU RAM) |
| Speedup | 124x over naive PyTorch baseline |

## Benchmark Results (RTX 5090, 200 tokens/prompt, 8 diverse prompts)

| Configuration | tok/s | vs Base | Notes |
|---|---|---|---|
| L0: Baseline (FP4 + Packed MoE + Flat KV) | 43.3 | 1.00x | Core engine, all FP4 experts |
| L1: + FP8 KV cache | 41.8 | 0.97x | 50% KV VRAM savings |
| L2: + L2 layer prefetch | 41.4 | 0.96x | Pins next layer in L2 cache |
| L3: + Atlas gatekeeper | 39.7 | 0.92x | Expert banning + MoDES skip |
| L4: + FE-XC cold experts (518 demoted) | 37.4 | 0.86x | Codebook 2-bit cold experts |
| L5: + INT2 coldest experts (399 demoted) | 37.4 | 0.86x | Scalar 2-bit coldest experts |
| L6: + CUDA Graph decode | 57.1 | 1.32x | Graph-captured 48-layer forward |

Peak VRAM: 21.5 GB | Baseline forward: ~23.1ms/tok | Full-stack forward: ~17.5ms/tok

Note: L1-L5 show slight overhead vs L0 due to additional dispatch logic. CUDA Graph (L6) eliminates all Python overhead and captures the full 48-layer forward as a single graph replay. The compression layers (FE-XC/INT2) reduce memory-bandwidth demand, which compounds with speculative decoding: with a trained EAGLE-3 head at 70% acceptance, projected throughput is ~457 tok/s.

## Speed Optimization History

Starting from a naive Python loop over 128 MoE experts (0.4 tok/s), each optimization layer compounds:

| Step | Optimization | tok/s | Cumulative Speedup |
|---|---|---|---|
| 0 | Baseline (128-expert Python loop) | 0.4 | 1x |
| 1 | Grouped dispatch + TF32 + autotune | 7.7 | 19x |
| 2 | Fused gate_up_proj (2->1 matmul) | 9.5 | 24x |
| 3 | Single-token decode fast path | 12.6 | 32x |
| 4 | Multi-expert Goliath kernel | 18.8 | 47x |
| 5 | Packed MoE (contiguous buffer) | 30.8 | 77x |
| 6 | Flat KV cache (zero torch.cat) | 40.9 | 102x |
| 7 | CUDA Graph + FlashDecode | 49.4 | 124x |

## The Goliath Kernel: Why It's Fast

Standard quantized inference dequantizes weights to BF16 in global memory, then runs a matmul. This doubles memory traffic.

Goliath FP4 dequantizes inside the Triton matmul tile loop, in registers, with zero global memory writes:

```python
# Simplified Goliath FP4 inner loop (Triton)
for k_block in range(0, K, BLOCK_K):
    w_packed = tl.load(weight_ptr + offsets)     # Load FP4 packed bytes
    w_lo = (w_packed & 0xF) * scale              # Dequant low nibble in-register
    w_hi = (w_packed >> 4) * scale               # Dequant high nibble in-register
    w_tile = tl.interleave(w_lo, w_hi)           # Reassemble the weight tile
    acc += tl.dot(a_tile, w_tile)                # Tensor-core matmul, no global writes
```

Packed MoE eliminates the Python expert loop entirely. All 128 experts are packed into one contiguous `[128, K//2, N]` buffer. A single Triton kernel launch reads expert IDs from a GPU tensor and indexes into the buffer: zero `.item()` calls, zero CPU-GPU synchronization.
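
The dispatch idea above can be sketched in a few lines of NumPy (shapes and names here are illustrative, not the engine's actual API): one contiguous buffer holds every expert, and routed experts are selected by array indexing rather than a per-expert Python loop.

```python
import numpy as np

# Toy shapes (hypothetical): 8 experts, hidden dim 4, FFN dim 6.
E, K, N = 8, 4, 6
rng = np.random.default_rng(0)

# All expert weights live in ONE contiguous buffer, as in Packed MoE.
packed_w = rng.standard_normal((E, K, N)).astype(np.float32)

def packed_moe(x, expert_ids, gate_w):
    """Dispatch token x to its routed experts by indexing the packed buffer.

    expert_ids / gate_w come straight from the router as arrays -- no
    per-expert loop over all E experts, no .item() calls.
    """
    w = packed_w[expert_ids]                 # gather routed tiles: (top_k, K, N)
    outs = np.einsum('k,ekn->en', x, w)      # one batched matmul
    return (gate_w[:, None] * outs).sum(axis=0)

x = rng.standard_normal(K).astype(np.float32)
ids = np.array([2, 5])                       # router picked experts 2 and 5
gate = np.array([0.7, 0.3], dtype=np.float32)

y = packed_moe(x, ids, gate)

# Reference: explicit loop over the routed experts.
ref = 0.7 * (x @ packed_w[2]) + 0.3 * (x @ packed_w[5])
assert np.allclose(y, ref)
```

In the real kernel the gather and matmul happen inside one Triton launch, so the expert IDs never leave the GPU.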

## Quantization Formats

| Format | Bits | Compression | Quality | Used For |
|---|---|---|---|---|
| BF16 | 16 | 1x | Perfect | Attention Q/K/V/O |
| Goliath FP4 | 4 | 4x | Near-perfect | Hot MoE experts |
| FE-XC | 2 | 8x | Very good (codebook) | Cold MoE experts |
| INT2 | 2 | 8x | Acceptable (scalar) | Coldest MoE experts |
| Goliath FP8 | 8 | 2x | Excellent | FP8 KV cache |
| FE-MX | 4-8 | 2-4x | Adaptive | Hebbian memory |
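
To make the 4x figure concrete, here is a minimal pack/round-trip sketch of nibble packing: two 4-bit codes per byte, versus two bytes per value in BF16. It assumes simple unsigned codes with one scale; the actual Goliath FP4 code points and group-scale layout may differ.

```python
import numpy as np

def pack_fp4(w, scale):
    """Quantize to 4-bit codes (0..15) and pack two codes per byte."""
    q = np.clip(np.round(w / scale), 0, 15).astype(np.uint8)
    lo, hi = q[0::2], q[1::2]
    return (hi << 4) | lo                     # half as many bytes as codes

def unpack_fp4(packed, scale):
    """Mirror of the in-kernel dequant: low nibble, high nibble, times scale."""
    lo = (packed & 0xF).astype(np.float32) * scale
    hi = (packed >> 4).astype(np.float32) * scale
    out = np.empty(packed.size * 2, dtype=np.float32)
    out[0::2], out[1::2] = lo, hi
    return out

w = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5], dtype=np.float32)
scale = 0.25
packed = pack_fp4(w, scale)

# 8 values -> 4 bytes, vs 16 bytes in BF16: the 4x compression in the table.
assert packed.nbytes == 4
assert np.allclose(unpack_fp4(packed, scale), w)
```

The unpack function is exactly the mask-and-shift pattern the Goliath kernel runs in registers.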

## Unique Features

### Hebbian Memory

Biologically inspired fast weights that learn during inference (no backpropagation). Implements competitive learning, STDP traces, intrinsic plasticity, PMI correction, and GHA decorrelation. References: Lansner's BCPNN, Triesch (2005), Sanger's GHA.
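
As a minimal sketch of the core idea (the engine's full rule adds STDP traces, PMI correction, and GHA on top of this), a Hebbian fast-weight update strengthens connections between co-active units via an outer product, with decay for stability, and needs no gradients:

```python
import numpy as np

def hebbian_update(W, pre, post, lr=0.1, decay=0.01):
    """One Hebbian fast-weight step: 'fire together, wire together'.

    W grows along the outer product of post- and pre-synaptic activity,
    with multiplicative decay so weights stay bounded. No gradients,
    so it can run during inference.
    """
    return (1.0 - decay) * W + lr * np.outer(post, pre)

W = np.zeros((3, 4))
pre = np.array([1.0, 0.0, 1.0, 0.0])    # active input units
post = np.array([0.0, 1.0, 0.0])        # active output unit

for _ in range(5):
    W = hebbian_update(W, pre, post)

# Only the co-active (post, pre) pairs were strengthened.
assert W[1, 0] > 0 and W[1, 2] > 0
assert np.allclose(W[0], 0) and np.allclose(W[2], 0)
```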

### Atlas Gatekeeper (FE-AGK)

Runtime expert management:

- **Ban & Pick**: Profiles expert impact, bans the bottom 25% per layer (8 -> 6 effective experts)
- **MoDES**: Skips the entire MoE computation for uncertain tokens (saves ~50% of compute on many layers)
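
A toy version of both mechanisms can be sketched as follows (the function name, entropy criterion, and threshold are illustrative assumptions, not the engine's actual heuristics): banned experts are masked out of the router logits, and a high-entropy (uncertain) routing distribution signals that the MoE block should be skipped and the residual passed through.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route_with_modes(logits, banned, entropy_thresh=1.5, top_k=2):
    """Sketch of the gatekeeper: ban low-impact experts, skip uncertain tokens.

    banned: boolean mask of experts profiled as low-impact (Ban & Pick).
    If router entropy exceeds the threshold (the token is 'uncertain'),
    return None to signal that the MoE block is skipped entirely (MoDES).
    """
    logits = np.where(banned, -np.inf, logits)
    p = softmax(logits)
    entropy = -(p[p > 0] * np.log(p[p > 0])).sum()
    if entropy > entropy_thresh:
        return None                               # skip MoE for this token
    ids = np.argsort(p)[-top_k:][::-1]            # top-k surviving experts
    return ids, p[ids] / p[ids].sum()

banned = np.zeros(8, dtype=bool)
banned[[6, 7]] = True                             # bottom 25% banned per layer

confident = np.array([4.0, 0.1, 0.1, 0.1, 0.1, 0.1, 5.0, 0.1])
ids, w = route_with_modes(confident, banned)
assert 6 not in ids                               # banned expert never picked

uncertain = np.zeros(8)                           # flat router: high entropy
assert route_with_modes(uncertain, banned) is None
```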

### FE-XC / INT2 Cold Expert Demotion

Automatically compresses rarely-used experts to 2-bit:

- **FE-XC**: Codebook 2-bit (2x8 AQLM-style, near-FP16 quality, 5.3x faster kernel)
- **INT2**: Scalar 2-bit (simple but lower quality)
- **Age-adaptive**: hot -> FP4, cold -> FE-XC, coldest -> INT2
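
The tier assignment can be sketched as a ranking by routing frequency (the function and the fractions below are illustrative; the engine reports 518 FE-XC and 399 INT2 demotions across its expert population):

```python
import numpy as np

def assign_tiers(hit_counts, cold_frac=0.40, coldest_frac=0.30):
    """Rank experts by routing frequency and assign quantization tiers."""
    order = np.argsort(hit_counts)                # coldest first
    n = len(hit_counts)
    tiers = np.full(n, 'fp4', dtype=object)       # hot experts stay FP4
    tiers[order[:int(n * cold_frac)]] = 'fe-xc'   # cold -> codebook 2-bit
    tiers[order[:int(n * coldest_frac)]] = 'int2' # coldest -> scalar 2-bit
    return tiers

# Hypothetical per-expert routing counts from a profiling pass.
counts = np.array([900, 5, 40, 700, 2, 300, 55, 1, 600, 10])
tiers = assign_tiers(counts)

assert tiers[np.argmax(counts)] == 'fp4'          # hottest expert stays FP4
assert tiers[np.argmin(counts)] == 'int2'         # coldest goes INT2
```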

### FlashDecode

Custom Triton attention kernel for M=1 decoding:

- Online softmax (no separate softmax pass)
- Reads only valid KV positions (no padding waste)
- GQA support (4 KV heads -> 32 query heads)
- 15.8 ms per token (48 layers, 4096-token context)
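
The online-softmax trick is the heart of single-pass decoding. A minimal single-head NumPy sketch (block size and shapes are illustrative): KV is consumed block by block while a running max and normalizer rescale earlier partial sums, so the full score vector is never materialized.

```python
import numpy as np

def online_softmax_attention(q, K, V, block=4):
    """Single-pass M=1 attention with the online-softmax running rescale.

    Keeps a running max `m` and normalizer `l` across KV blocks, so no
    separate softmax pass (and no full score vector) is ever stored.
    """
    m, l = -np.inf, 0.0
    acc = np.zeros(V.shape[1])
    for s in range(0, K.shape[0], block):
        scores = K[s:s+block] @ q                 # block of attention logits
        m_new = max(m, scores.max())
        p = np.exp(scores - m_new)
        correction = np.exp(m - m_new)            # rescale previous partials
        l = l * correction + p.sum()
        acc = acc * correction + p @ V[s:s+block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K_cache = rng.standard_normal((10, 8))            # 10 valid KV positions
V_cache = rng.standard_normal((10, 8))

out = online_softmax_attention(q, K_cache, V_cache)

# Reference: naive two-pass softmax attention.
s = K_cache @ q
p = np.exp(s - s.max()); p /= p.sum()
assert np.allclose(out, p @ V_cache)
```

The Triton kernel applies the same recurrence per GQA head group, reading only the valid KV length.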

### EAGLE-3 Speculative Decoding (infrastructure ready)

Draft-then-verify acceleration: a draft head predicts K=5 tokens, and the target model verifies all 6 positions in one forward pass. The infrastructure is complete; draft-head training is in progress.
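
The greedy acceptance rule can be sketched as follows (function name and token values are illustrative): accept the longest prefix where the draft matches the target's argmax, then append the target's own prediction, so each verify pass yields accepted + 1 tokens.

```python
def accept_draft(draft_tokens, target_argmax):
    """Greedy draft-then-verify acceptance.

    target_argmax[i] is the target model's prediction at position i of the
    single batched verify pass (length K+1), so one forward pass always
    yields at least one guaranteed-correct token.
    """
    n = 0
    for d, t in zip(draft_tokens, target_argmax):
        if d != t:
            break
        n += 1
    return draft_tokens[:n] + [target_argmax[n]]  # n accepted + 1 free token

# K=5 drafted tokens; the target disagrees at position 3.
out = accept_draft([11, 22, 33, 44, 55], [11, 22, 33, 99, 55, 77])
assert out == [11, 22, 33, 99]                    # 3 accepted + target's fix
```

At 70% per-token acceptance this amortizes the target forward over several emitted tokens, which is where the projected throughput gain comes from.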

## Quick Start

```python
from fireecho_kernel import FireEchoEngine

# Load (streams layer-by-layer, 110s, 20 GB VRAM)
engine = FireEchoEngine.from_pretrained("path/to/Qwen3-Omni-30B")

# Enable optimizations
engine.enable_flat_decode(kv_dtype='fp8')    # FP8 KV cache
engine.enable_cuda_graph_decode()            # CUDA Graph

# Generate
input_ids = engine.tokenizer.encode("Hello, world!", return_tensors='pt').cuda()
output = engine.generate(input_ids, max_new_tokens=200, temperature=0.7)
print(engine.tokenizer.decode(output[0], skip_special_tokens=True))
```

## Requirements

- **GPU**: RTX 4090 (24 GB) minimum, RTX 5090 (32 GB) recommended
- **CUDA**: 12.4+
- **Python**: 3.10-3.12
- **PyTorch**: 2.4.0+
- **Triton**: 3.0+
- **OS**: Linux x86_64

## Hardware Independence

FireEcho uses zero NVIDIA proprietary libraries:

- No cuQuantizer, CUTLASS, TensorRT, or cuBLAS (except via torch.matmul for attention)
- All custom kernels are pure Triton (compiles to NVIDIA CUDA, AMD ROCm, Intel XPU)
- Runs anywhere Triton runs

## Architecture

```
FireEcho Engine
├── fireecho_kernel.py      # Main engine (9000+ lines)
│   ├── FireEchoEngine      # Load, generate, speculate
│   ├── MoEFFN              # Packed MoE with fused dispatch
│   ├── HebbianMemory       # Fast weights (learn at inference)
│   ├── FlashDecode         # Triton M=1 GQA attention
│   └── CUDA Graph          # Graph-captured decode
├── goliath_kernel.py       # Quantized GEMM kernels (3000+ lines)
│   ├── GoliathFP4          # FP4 fused dequant-matmul
│   ├── GoliathFP8          # FP8 fused dequant-matmul
│   ├── GoliathINT2         # INT2 scalar quantization
│   ├── GoliathFEXC         # FE-XC codebook 2-bit
│   └── Packed MoE          # Contiguous expert buffers
├── triton_hebbian.py       # Fused Hebbian kernels
├── femx_storage.py         # Block floating point storage
└── persistent_memory.py    # AGI-like persistent memory
```

## License

CC BY-NC 4.0: free for research and non-commercial use with attribution.

For commercial licensing: @Joysulem on X/Twitter

## Citation

```bibtex
@software{fireecho2026,
  author = {Davila Flores, Luis E.},
  title  = {FireEcho Engine: High-Performance Single-GPU Inference for 30B+ MoE Models},
  year   = {2026},
  url    = {https://github.com/Joysulem/FireEcho}
}
```