---
library_name: fireecho
tags:
- inference
- triton
- quantization
- moe
- fp4
- fp8
- int2
- single-gpu
- blackwell
- hebbian
- speculative-decoding
- custom-kernel
license: cc-by-nc-4.0
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-Omni-30B-A3B-Instruct
model-index:
- name: FireEcho Engine
  results:
  - task:
      type: text-generation
      name: Inference Throughput
    dataset:
      name: FireEcho Full-Stack Benchmark (8 diverse prompts, 200 tok each)
      type: custom
    metrics:
    - name: Decode Speed (Baseline FP4)
      type: tokens_per_second
      value: 43.3
      verified: false
    - name: Decode Speed (Full-Stack + CUDA Graph)
      type: tokens_per_second
      value: 57.1
      verified: false
    - name: Speedup vs Naive PyTorch
      type: speedup
      value: 124
      verified: false
    - name: VRAM Usage (Model)
      type: gpu_memory_gb
      value: 20.0
      verified: false
    - name: VRAM Usage (Peak)
      type: gpu_memory_gb
      value: 21.5
      verified: false
    - name: Model Load Time (seconds)
      type: latency
      value: 110
      verified: false
    - name: Compression Ratio (BF16 to FP4)
      type: compression
      value: 4.0
      verified: false
  - task:
      type: text-generation
      name: Optimization Stack Ablation
    dataset:
      name: FireEcho Ablation (RTX 5090, 200 tok/prompt, greedy)
      type: custom
    metrics:
    - name: "L0: Baseline (FP4 + Packed MoE + Flat KV)"
      type: tokens_per_second
      value: 43.3
    - name: "L1: + FP8 KV Cache"
      type: tokens_per_second
      value: 41.8
    - name: "L2: + L2 Layer Prefetch"
      type: tokens_per_second
      value: 41.4
    - name: "L3: + Atlas Ban & Pick + MoDES"
      type: tokens_per_second
      value: 39.7
    - name: "L4: + FE-XC Cold Experts (518 demoted)"
      type: tokens_per_second
      value: 37.4
    - name: "L5: + INT2 Coldest Experts (399 demoted)"
      type: tokens_per_second
      value: 37.4
    - name: "L6: + CUDA Graph Decode"
      type: tokens_per_second
      value: 57.1
  - task:
      type: text-generation
      name: Optimization History (0.4 to 49.4 tok/s)
    dataset:
      name: FireEcho Kernel Optimization Log (RTX 5090)
      type: custom
    metrics:
    - name: "Step 0: Naive Python Loop (128 experts)"
      type: tokens_per_second
      value: 0.4
    - name: "Step 1: Grouped Dispatch + TF32"
      type: tokens_per_second
      value: 7.7
    - name: "Step 2: Fused gate_up_proj"
      type: tokens_per_second
      value: 9.5
    - name: "Step 3: Single-Token Decode Fast Path"
      type: tokens_per_second
      value: 12.6
    - name: "Step 4: Multi-Expert Goliath Kernel"
      type: tokens_per_second
      value: 18.8
    - name: "Step 5: Packed MoE (Contiguous Buffer)"
      type: tokens_per_second
      value: 30.8
    - name: "Step 6: Flat KV Cache (Zero torch.cat)"
      type: tokens_per_second
      value: 40.9
    - name: "Step 7: CUDA Graph + FlashDecode"
      type: tokens_per_second
      value: 49.4
---

# FireEcho Engine

**High-performance single-GPU inference kernel for 30B+ MoE models**

Created by [Luis E. Davila Flores](https://x.com/Joysulem)

## What is FireEcho?

FireEcho is a from-scratch inference engine that runs **Qwen3-Omni-30B** (30.5 billion parameters, 128-expert MoE) on a **single RTX 5090** at **45+ tokens/second** using only **20 GB of VRAM**.

It achieves this with custom Triton kernels that fuse dequantization inside the matmul loop: no separate dequantization pass, no global-memory writes, no NVIDIA proprietary libraries.

## Key Results

| Metric | Value |
|--------|-------|
| Model | Qwen3-Omni-30B-A3B-Instruct |
| Parameters | 30.5B total, ~3.3B active/token |
| GPU | NVIDIA RTX 5090 (32 GB, Blackwell) |
| VRAM Usage | 20.0 GB (model) + 3.1 GB (KV cache) |
| Decode Speed | **45+ tok/s** (single user, greedy) |
| Compression | 4x (BF16 61 GB -> FP4 20 GB) |
| Load Time | 110 seconds (streaming, 3.1 GB CPU RAM) |
| Speedup | **124x** over naive PyTorch baseline |
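
A quick back-of-envelope check of the compression and VRAM rows (decimal GB; the parameter count comes from the table, the variable names are ours, for illustration only):

```python
# Sanity-check the headline sizes: 30.5B parameters in BF16 vs. 4-bit.
total_params = 30.5e9                  # Qwen3-Omni-30B total parameters

bf16_gb = total_params * 2 / 1e9       # BF16 stores 2 bytes per parameter
fp4_gb = bf16_gb / 4                   # 4x compression at 4 bits per weight

print(round(bf16_gb, 1))   # 61.0  -> matches the "BF16 61 GB" figure
print(round(fp4_gb, 2))    # 15.25 -> the 4-bit weights alone
```

The on-device footprint (20.0 GB) sits above the pure 4-bit figure because the attention Q/K/V/O weights stay in BF16 (see the Quantization Formats table).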

## Benchmark Results (RTX 5090, 200 tokens/prompt, 8 diverse prompts)

| Configuration | tok/s | vs Base | Notes |
|--------------|-------|---------|-------|
| L0: Baseline (FP4 + Packed MoE + Flat KV) | 43.3 | 1.00x | Core engine, all FP4 experts |
| L1: + FP8 KV cache | 41.8 | 0.97x | 50% KV VRAM savings |
| L2: + L2 layer prefetch | 41.4 | 0.96x | Pins next layer in L2 cache |
| L3: + Atlas gatekeeper | 39.7 | 0.92x | Expert banning + MoDES skip |
| L4: + FE-XC cold experts (518 demoted) | 37.4 | 0.86x | Codebook 2-bit cold experts |
| L5: + INT2 coldest experts (399 demoted) | 37.4 | 0.86x | Scalar 2-bit coldest experts |
| **L6: + CUDA Graph decode** | **57.1** | **1.32x** | **Graph-captured 48-layer forward** |

**Peak VRAM**: 21.5 GB | **Baseline forward**: ~23.1 ms/tok | **Full-stack forward**: ~17.5 ms/tok

**Note**: L1-L5 show slight overhead versus L0 due to the extra dispatch logic. CUDA Graph (L6) eliminates that Python overhead by capturing the full 48-layer forward as a single graph replay. The compression layers (FE-XC/INT2) cut memory traffic, which compounds with speculative decoding: with a trained EAGLE-3 head at 70% acceptance, projected throughput is **~457 tok/s**.

## Speed Optimization History

Starting from a naive Python loop over 128 MoE experts (0.4 tok/s), each optimization layer compounds:

| Step | Optimization | tok/s | Cumulative Speedup |
|------|-------------|-------|--------------------|
| 0 | Baseline (128-expert Python loop) | 0.4 | 1x |
| 1 | Grouped dispatch + TF32 + autotune | 7.7 | 19x |
| 2 | Fused gate_up_proj (2->1 matmul) | 9.5 | 24x |
| 3 | Single-token decode fast path | 12.6 | 32x |
| 4 | Multi-expert Goliath kernel | 18.8 | 47x |
| 5 | Packed MoE (contiguous buffer) | 30.8 | 77x |
| 6 | Flat KV cache (zero torch.cat) | 40.9 | 102x |
| 7 | CUDA Graph + FlashDecode | 49.4 | **124x** |

## The Goliath Kernel: Why It's Fast

Standard quantized inference dequantizes weights to BF16 in global memory, then runs a matmul. This doubles memory traffic.

**Goliath FP4** dequantizes **inside** the Triton matmul tile loop, in registers, with zero global-memory writes:

```python
# Simplified Goliath FP4 inner loop (illustrative; pointer math elided)
for k_block in range(0, K, BLOCK_K):
    w_packed = tl.load(weight_ptr + offsets)           # packed FP4 bytes, 2 weights/byte
    w_lo = (w_packed & 0xF).to(tl.bfloat16) * scale    # dequant low nibble in-register
    w_hi = (w_packed >> 4).to(tl.bfloat16) * scale     # dequant high nibble in-register
    w_tile = tl.interleave(w_lo, w_hi)                 # reassemble the BF16 weight tile
    acc += tl.dot(a_tile, w_tile)                      # tensor-core matmul, no global writes
```

**Packed MoE** eliminates the Python expert loop entirely. All 128 experts are packed into one contiguous `[128, K//2, N]` buffer. A single Triton kernel launch reads expert IDs from a GPU tensor and indexes into the buffer: zero `.item()` calls, zero CPU-GPU synchronization.
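
The layout arithmetic behind that buffer is simple enough to sketch in plain Python (the dimensions here are hypothetical, chosen only to illustrate; two FP4 weights share one byte along K):

```python
# Packed MoE buffer layout: every expert's nibble-packed weight tile sits at
# a fixed, computable offset, so a kernel can index it straight from a
# GPU-resident expert-ID tensor -- no per-expert Python loop, no .item().
NUM_EXPERTS, K, N = 128, 2048, 768       # hypothetical expert dimensions
bytes_per_expert = (K // 2) * N          # FP4: two weights per byte along K

def expert_offset(expert_id):
    """Byte offset of `expert_id`'s tile in the [128, K//2, N] buffer."""
    return expert_id * bytes_per_expert

total_bytes = NUM_EXPERTS * bytes_per_expert
print(bytes_per_expert)                          # 786432
print(expert_offset(7) == 7 * bytes_per_expert)  # True
```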

## Quantization Formats

| Format | Bits | Compression | Quality | Used For |
|--------|------|-------------|---------|----------|
| BF16 | 16 | 1x | Perfect | Attention Q/K/V/O |
| Goliath FP4 | 4 | 4x | Near-perfect | Hot MoE experts |
| FE-XC | 2 | 8x | Very good (codebook) | Cold MoE experts |
| INT2 | 2 | 8x | Acceptable (scalar) | Coldest MoE experts |
| Goliath FP8 | 8 | 2x | Excellent | FP8 KV cache |
| FE-MX | 4-8 | 2-4x | Adaptive | Hebbian memory |
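
The 4-bit rows above come down to nibble packing: two 4-bit codes share one byte, with a scale stored per group. A minimal pure-Python sketch of the packing itself (the code assignment and bit order are illustrative, not the actual Goliath FP4 encoding):

```python
def pack_nibbles(codes):
    """Pack 4-bit codes (0..15) into bytes, low nibble first."""
    assert len(codes) % 2 == 0
    return bytes((codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
                 for i in range(0, len(codes), 2))

def unpack_nibbles(packed):
    """Invert pack_nibbles: each byte yields two 4-bit codes."""
    out = []
    for b in packed:
        out.append(b & 0xF)   # low nibble first
        out.append(b >> 4)    # then high nibble
    return out

codes = [3, 12, 0, 15]
assert unpack_nibbles(pack_nibbles(codes)) == codes   # round-trips exactly
print(len(pack_nibbles(codes)))   # 2 -> four weights in two bytes, 4 bits each
```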

## Unique Features

### Hebbian Memory
Biologically inspired fast weights that **learn during inference** (no backpropagation). Implements competitive learning, STDP traces, intrinsic plasticity, PMI correction, and GHA decorrelation. Papers: Lansner's BCPNN, Triesch 2005, Sanger's GHA.
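
The core fast-weight mechanism is an outer-product update applied during decoding. A minimal sketch (illustrative only; the engine's rule adds the STDP, plasticity, PMI, and GHA terms listed above):

```python
# Hebbian fast-weight update: strengthen W[i][j] when post-synaptic y[i]
# and pre-synaptic x[j] are co-active, with weight decay. No gradients.
def hebbian_update(W, x, y, lr=0.1, decay=0.01):
    for i in range(len(y)):
        for j in range(len(x)):
            W[i][j] = (1.0 - decay) * W[i][j] + lr * y[i] * x[j]
    return W

W = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, 0.5]                  # pre-synaptic activity
y = [0.2, -0.4]                 # post-synaptic activity
W = hebbian_update(W, x, y)
print([[round(v, 4) for v in row] for row in W])   # [[0.02, 0.01], [-0.04, -0.02]]
```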

### Atlas Gatekeeper (FE-AGK)
Runtime expert management:
- **Ban & Pick**: Profiles expert impact, bans the bottom 25% per layer (8->6 effective experts)
- **MoDES**: Skips the entire MoE computation for uncertain tokens (saves ~50% of compute on many layers)
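
Ban & Pick is straightforward to sketch. The scores, impact values, and function below are hypothetical; the engine profiles real expert impact at runtime:

```python
# Ban the bottom 25% of experts per layer by measured impact, then route
# top-k only among the survivors.
def ban_and_pick(router_scores, impact, ban_frac=0.25, top_k=8):
    n = len(router_scores)
    banned = set(sorted(range(n), key=lambda e: impact[e])[:int(n * ban_frac)])
    allowed = sorted(((s, e) for e, s in enumerate(router_scores)
                      if e not in banned), reverse=True)
    return [e for _, e in allowed[:top_k]]

scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4]   # router scores (toy)
impact = [5.0, 0.1, 4.0, 0.2, 3.0, 2.0, 1.0, 0.5]   # experts 1 and 3 weakest
print(ban_and_pick(scores, impact, top_k=4))   # [0, 2, 4, 6] -- 1 and 3 banned
```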

### FE-XC / INT2 Cold Expert Demotion
Automatically compresses rarely used experts to 2-bit:
- FE-XC: Codebook 2-bit (2x8 AQLM-style, near-FP16 quality, 5.3x faster kernel)
- INT2: Scalar 2-bit (simpler, lower quality)
- Age-adaptive: hot->FP4, cold->FE-XC, coldest->INT2
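
The codebook idea can be shown with a scalar sketch. FE-XC actually uses a 2x8 AQLM-style codebook; the 4-entry codebook and helper here are hypothetical:

```python
# Codebook 2-bit quantization: each weight becomes the index (0..3) of the
# nearest shared centroid, so it stores in 2 bits plus the shared codebook.
def quantize_2bit(weights, codebook):
    assert len(codebook) == 4            # 2 bits -> 4 codebook entries
    codes = [min(range(4), key=lambda i: abs(w - codebook[i])) for w in weights]
    dequant = [codebook[c] for c in codes]
    return codes, dequant

codebook = [-1.0, -0.25, 0.25, 1.0]
codes, deq = quantize_2bit([0.9, -0.3, 0.1, -1.2], codebook)
print(codes)   # [3, 1, 2, 0]
print(deq)     # [1.0, -0.25, 0.25, -1.0]
```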

### FlashDecode
Custom Triton attention kernel for M=1 decoding:
- Online softmax (no separate softmax pass)
- Reads only valid KV positions (no padding waste)
- GQA support (4 KV heads -> 32 query heads)
- 15.8 ms per token (48 layers, 4096-token context)
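
The online softmax is what removes the separate softmax pass: keep a running max and rescale the partial sums whenever it grows. A scalar Python sketch of one query attending over a KV stream:

```python
import math

# One-pass attention via online softmax: numerically stable, never
# materializes the full score vector, matches two-pass softmax exactly.
def online_softmax_attend(scores, values):
    m, denom, acc = float("-inf"), 0.0, 0.0
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new) if m > float("-inf") else 0.0
        denom = denom * scale + math.exp(s - m_new)   # rescale running sum
        acc = acc * scale + math.exp(s - m_new) * v   # rescale running output
        m = m_new
    return acc / denom

scores, values = [0.5, 2.0, -1.0], [1.0, 2.0, 3.0]
w = [math.exp(s) for s in scores]                     # two-pass reference
ref = sum(wi * vi for wi, vi in zip(w, values)) / sum(w)
assert abs(online_softmax_attend(scores, values) - ref) < 1e-12
```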

### EAGLE-3 Speculative Decoding (infrastructure ready)
Draft-then-verify acceleration: the draft head predicts K=5 tokens, and the target model verifies all 6 positions in one forward pass. Infrastructure is complete; draft-head training is in progress.
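
In the greedy case, verification reduces to a prefix match against the target's argmax picks. A sketch with hypothetical token IDs (K=5 as above):

```python
# Greedy draft-then-verify: accept drafted tokens while they match the
# target's argmax at the same position, then append the target's own token.
def accept_greedy(draft_tokens, target_argmax):
    """target_argmax has len(draft_tokens)+1 entries, from one verify pass."""
    accepted = []
    for d, t in zip(draft_tokens, target_argmax):
        if d != t:
            accepted.append(t)            # target's token at first mismatch
            return accepted
        accepted.append(d)
    accepted.append(target_argmax[len(draft_tokens)])   # bonus token
    return accepted

draft = [11, 42, 42, 7, 99]               # K=5 drafted token IDs (toy)
target = [11, 42, 13, 7, 99, 5]           # K+1 target argmax picks (toy)
print(accept_greedy(draft, target))       # [11, 42, 13] -> 2 accepted + fix
```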

## Quick Start

```python
from fireecho_kernel import FireEchoEngine

# Load (streams layer-by-layer, ~110 s, 20 GB VRAM)
engine = FireEchoEngine.from_pretrained("path/to/Qwen3-Omni-30B")

# Enable optimizations
engine.enable_flat_decode(kv_dtype='fp8')  # FP8 KV cache
engine.enable_cuda_graph_decode()          # CUDA Graph

# Generate
input_ids = engine.tokenizer.encode("Hello, world!", return_tensors='pt').cuda()
output = engine.generate(input_ids, max_new_tokens=200, temperature=0.7)
print(engine.tokenizer.decode(output[0], skip_special_tokens=True))
```

## Requirements

- **GPU**: RTX 4090 (24 GB) minimum, RTX 5090 (32 GB) recommended
- **CUDA**: 12.4+
- **Python**: 3.10-3.12
- **PyTorch**: 2.4.0+
- **Triton**: 3.0+
- **OS**: Linux x86_64

## Hardware Independence

FireEcho uses **zero NVIDIA proprietary libraries**:
- No cuQuantizer, CUTLASS, TensorRT, or cuBLAS (attention still goes through `torch.matmul`, which may dispatch to cuBLAS)
- All custom kernels are pure **Triton** (compiles for NVIDIA CUDA, AMD ROCm, and Intel XPU)
- Runs anywhere Triton runs

## Architecture

```
FireEcho Engine
├── fireecho_kernel.py     # Main engine (9000+ lines)
│   ├── FireEchoEngine     # Load, generate, speculate
│   ├── MoEFFN             # Packed MoE with fused dispatch
│   ├── HebbianMemory      # Fast weights (learn at inference)
│   ├── FlashDecode        # Triton M=1 GQA attention
│   └── CUDA Graph         # Graph-captured decode
├── goliath_kernel.py      # Quantized GEMM kernels (3000+ lines)
│   ├── GoliathFP4         # FP4 fused dequant-matmul
│   ├── GoliathFP8         # FP8 fused dequant-matmul
│   ├── GoliathINT2        # INT2 scalar quantization
│   ├── GoliathFEXC        # FE-XC codebook 2-bit
│   └── Packed MoE         # Contiguous expert buffers
├── triton_hebbian.py      # Fused Hebbian kernels
├── femx_storage.py        # Block floating point storage
└── persistent_memory.py   # AGI-like persistent memory
```

## License

CC BY-NC 4.0: free for research and non-commercial use with attribution.

For commercial licensing: [@Joysulem on X/Twitter](https://x.com/Joysulem)

## Citation

```bibtex
@software{fireecho2026,
  author = {Davila Flores, Luis E.},
  title = {FireEcho Engine: High-Performance Single-GPU Inference for 30B+ MoE Models},
  year = {2026},
  url = {https://github.com/Joysulem/FireEcho}
}
```