---
library_name: fireecho
tags:
- inference
- triton
- quantization
- moe
- fp4
- fp8
- int2
- single-gpu
- blackwell
- hebbian
- speculative-decoding
- custom-kernel
license: cc-by-nc-4.0
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-Omni-30B-A3B-Instruct
model-index:
- name: FireEcho Engine
  results:
  - task:
      type: text-generation
      name: Inference Throughput
    dataset:
      name: FireEcho Full-Stack Benchmark (8 diverse prompts, 200 tok each)
      type: custom
    metrics:
    - name: Decode Speed (Baseline FP4)
      type: tokens_per_second
      value: 43.3
      verified: false
    - name: Decode Speed (Full-Stack + CUDA Graph)
      type: tokens_per_second
      value: 57.1
      verified: false
    - name: Speedup vs Naive PyTorch
      type: speedup
      value: 124
      verified: false
    - name: VRAM Usage (Model)
      type: gpu_memory_gb
      value: 20.0
      verified: false
    - name: VRAM Usage (Peak)
      type: gpu_memory_gb
      value: 21.5
      verified: false
    - name: Model Load Time (seconds)
      type: latency
      value: 110
      verified: false
    - name: Compression Ratio (BF16 to FP4)
      type: compression
      value: 4.0
      verified: false
  - task:
      type: text-generation
      name: Optimization Stack Ablation
    dataset:
      name: FireEcho Ablation (RTX 5090, 200 tok/prompt, greedy)
      type: custom
    metrics:
    - name: "L0: Baseline (FP4 + Packed MoE + Flat KV)"
      type: tokens_per_second
      value: 43.3
    - name: "L1: + FP8 KV Cache"
      type: tokens_per_second
      value: 41.8
    - name: "L2: + L2 Layer Prefetch"
      type: tokens_per_second
      value: 41.4
    - name: "L3: + Atlas Ban & Pick + MoDES"
      type: tokens_per_second
      value: 39.7
    - name: "L4: + FE-XC Cold Experts (518 demoted)"
      type: tokens_per_second
      value: 37.4
    - name: "L5: + INT2 Coldest Experts (399 demoted)"
      type: tokens_per_second
      value: 37.4
    - name: "L6: + CUDA Graph Decode"
      type: tokens_per_second
      value: 57.1
  - task:
      type: text-generation
      name: Optimization History (0.4 to 49.4 tok/s)
    dataset:
      name: FireEcho Kernel Optimization Log (RTX 5090)
      type: custom
    metrics:
    - name: "Step 0: Naive Python Loop (128 experts)"
      type: tokens_per_second
      value: 0.4
    - name: "Step 1: Grouped Dispatch + TF32"
      type: tokens_per_second
      value: 7.7
    - name: "Step 2: Fused gate_up_proj"
      type: tokens_per_second
      value: 9.5
    - name: "Step 3: Single-Token Decode Fast Path"
      type: tokens_per_second
      value: 12.6
    - name: "Step 4: Multi-Expert Goliath Kernel"
      type: tokens_per_second
      value: 18.8
    - name: "Step 5: Packed MoE (Contiguous Buffer)"
      type: tokens_per_second
      value: 30.8
    - name: "Step 6: Flat KV Cache (Zero torch.cat)"
      type: tokens_per_second
      value: 40.9
    - name: "Step 7: CUDA Graph + FlashDecode"
      type: tokens_per_second
      value: 49.4
---

# FireEcho Engine

**High-performance single-GPU inference kernel for 30B+ MoE models**

Created by [Luis E. Davila Flores](https://x.com/Joysulem)

## What is FireEcho?

FireEcho is a from-scratch inference engine that runs **Qwen3-Omni-30B** (30.5 billion parameters, 128-expert MoE) on a **single RTX 5090** at **45+ tokens/second** using only **20 GB of VRAM**.

It achieves this with custom Triton kernels that fuse dequantization inside the matmul loop: no separate dequantization pass, no global-memory writes, no NVIDIA proprietary libraries.

## Key Results

| Metric | Value |
|--------|-------|
| Model | Qwen3-Omni-30B-A3B-Instruct |
| Parameters | 30.5B total, ~3.3B active/token |
| GPU | NVIDIA RTX 5090 (32 GB, Blackwell) |
| VRAM Usage | 20.0 GB (model) + 3.1 GB (KV cache) |
| Decode Speed | **45+ tok/s** (single user, greedy) |
| Compression | 4x (BF16 61 GB -> FP4 20 GB) |
| Load Time | 110 seconds (streaming, 3.1 GB CPU RAM) |
| Speedup | **124x** over naive PyTorch baseline |
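
A quick back-of-envelope check of the compression and VRAM rows (decimal GB; the parameter count comes from the table, the variable names are ours, for illustration only):

```python
# Sanity-check the headline sizes: 30.5B parameters in BF16 vs. 4-bit.
total_params = 30.5e9                  # Qwen3-Omni-30B total parameters

bf16_gb = total_params * 2 / 1e9       # BF16 stores 2 bytes per parameter
fp4_gb = bf16_gb / 4                   # 4x compression at 4 bits per weight

print(round(bf16_gb, 1))   # 61.0  -> matches the "BF16 61 GB" figure
print(round(fp4_gb, 2))    # 15.25 -> the 4-bit weights alone
```

The on-device footprint (20.0 GB) sits above the pure 4-bit figure because the attention Q/K/V/O weights stay in BF16 (see the Quantization Formats table).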

## Benchmark Results (RTX 5090, 200 tokens/prompt, 8 diverse prompts)

| Configuration | tok/s | vs Base | Notes |
|--------------|-------|---------|-------|
| L0: Baseline (FP4 + Packed MoE + Flat KV) | 43.3 | 1.00x | Core engine, all FP4 experts |
| L1: + FP8 KV cache | 41.8 | 0.97x | 50% KV VRAM savings |
| L2: + L2 layer prefetch | 41.4 | 0.96x | Pins next layer in L2 cache |
| L3: + Atlas gatekeeper | 39.7 | 0.92x | Expert banning + MoDES skip |
| L4: + FE-XC cold experts (518 demoted) | 37.4 | 0.86x | Codebook 2-bit cold experts |
| L5: + INT2 coldest experts (399 demoted) | 37.4 | 0.86x | Scalar 2-bit coldest experts |
| **L6: + CUDA Graph decode** | **57.1** | **1.32x** | **Graph-captured 48-layer forward** |

**Peak VRAM**: 21.5 GB | **Baseline forward**: ~23.1 ms/tok | **Full-stack forward**: ~17.5 ms/tok

**Note**: L1-L5 show slight overhead versus L0 due to the extra dispatch logic. CUDA Graph (L6) eliminates that Python overhead by capturing the full 48-layer forward as a single graph replay. The compression layers (FE-XC/INT2) cut memory traffic, which compounds with speculative decoding: with a trained EAGLE-3 head at 70% acceptance, projected throughput is **~457 tok/s**.

## Speed Optimization History

Starting from a naive Python loop over 128 MoE experts (0.4 tok/s), each optimization layer compounds:

| Step | Optimization | tok/s | Cumulative Speedup |
|------|-------------|-------|--------------------|
| 0 | Baseline (128-expert Python loop) | 0.4 | 1x |
| 1 | Grouped dispatch + TF32 + autotune | 7.7 | 19x |
| 2 | Fused gate_up_proj (2->1 matmul) | 9.5 | 24x |
| 3 | Single-token decode fast path | 12.6 | 32x |
| 4 | Multi-expert Goliath kernel | 18.8 | 47x |
| 5 | Packed MoE (contiguous buffer) | 30.8 | 77x |
| 6 | Flat KV cache (zero torch.cat) | 40.9 | 102x |
| 7 | CUDA Graph + FlashDecode | 49.4 | **124x** |

## The Goliath Kernel: Why It's Fast

Standard quantized inference dequantizes weights to BF16 in global memory, then runs a matmul. This doubles memory traffic.

**Goliath FP4** dequantizes **inside** the Triton matmul tile loop, in registers, with zero global-memory writes:

```python
# Simplified Goliath FP4 inner loop (illustrative; pointer math elided)
for k_block in range(0, K, BLOCK_K):
    w_packed = tl.load(weight_ptr + offsets)           # packed FP4 bytes, 2 weights/byte
    w_lo = (w_packed & 0xF).to(tl.bfloat16) * scale    # dequant low nibble in-register
    w_hi = (w_packed >> 4).to(tl.bfloat16) * scale     # dequant high nibble in-register
    w_tile = tl.interleave(w_lo, w_hi)                 # reassemble the BF16 weight tile
    acc += tl.dot(a_tile, w_tile)                      # tensor-core matmul, no global writes
```

**Packed MoE** eliminates the Python expert loop entirely. All 128 experts are packed into one contiguous `[128, K//2, N]` buffer. A single Triton kernel launch reads expert IDs from a GPU tensor and indexes into the buffer: zero `.item()` calls, zero CPU-GPU synchronization.
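
The layout arithmetic behind that buffer is simple enough to sketch in plain Python (the dimensions here are hypothetical, chosen only to illustrate; two FP4 weights share one byte along K):

```python
# Packed MoE buffer layout: every expert's nibble-packed weight tile sits at
# a fixed, computable offset, so a kernel can index it straight from a
# GPU-resident expert-ID tensor -- no per-expert Python loop, no .item().
NUM_EXPERTS, K, N = 128, 2048, 768       # hypothetical expert dimensions
bytes_per_expert = (K // 2) * N          # FP4: two weights per byte along K

def expert_offset(expert_id):
    """Byte offset of `expert_id`'s tile in the [128, K//2, N] buffer."""
    return expert_id * bytes_per_expert

total_bytes = NUM_EXPERTS * bytes_per_expert
print(bytes_per_expert)                          # 786432
print(expert_offset(7) == 7 * bytes_per_expert)  # True
```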

## Quantization Formats

| Format | Bits | Compression | Quality | Used For |
|--------|------|-------------|---------|----------|
| BF16 | 16 | 1x | Perfect | Attention Q/K/V/O |
| Goliath FP4 | 4 | 4x | Near-perfect | Hot MoE experts |
| FE-XC | 2 | 8x | Very good (codebook) | Cold MoE experts |
| INT2 | 2 | 8x | Acceptable (scalar) | Coldest MoE experts |
| Goliath FP8 | 8 | 2x | Excellent | FP8 KV cache |
| FE-MX | 4-8 | 2-4x | Adaptive | Hebbian memory |
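
The 4-bit rows above come down to nibble packing: two 4-bit codes share one byte, with a scale stored per group. A minimal pure-Python sketch of the packing itself (the code assignment and bit order are illustrative, not the actual Goliath FP4 encoding):

```python
def pack_nibbles(codes):
    """Pack 4-bit codes (0..15) into bytes, low nibble first."""
    assert len(codes) % 2 == 0
    return bytes((codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
                 for i in range(0, len(codes), 2))

def unpack_nibbles(packed):
    """Invert pack_nibbles: each byte yields two 4-bit codes."""
    out = []
    for b in packed:
        out.append(b & 0xF)   # low nibble first
        out.append(b >> 4)    # then high nibble
    return out

codes = [3, 12, 0, 15]
assert unpack_nibbles(pack_nibbles(codes)) == codes   # round-trips exactly
print(len(pack_nibbles(codes)))   # 2 -> four weights in two bytes, 4 bits each
```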

## Unique Features

### Hebbian Memory
Biologically inspired fast weights that **learn during inference** (no backpropagation). Implements competitive learning, STDP traces, intrinsic plasticity, PMI correction, and GHA decorrelation. Papers: Lansner's BCPNN, Triesch 2005, Sanger's GHA.
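
The core fast-weight mechanism is an outer-product update applied during decoding. A minimal sketch (illustrative only; the engine's rule adds the STDP, plasticity, PMI, and GHA terms listed above):

```python
# Hebbian fast-weight update: strengthen W[i][j] when post-synaptic y[i]
# and pre-synaptic x[j] are co-active, with weight decay. No gradients.
def hebbian_update(W, x, y, lr=0.1, decay=0.01):
    for i in range(len(y)):
        for j in range(len(x)):
            W[i][j] = (1.0 - decay) * W[i][j] + lr * y[i] * x[j]
    return W

W = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, 0.5]                  # pre-synaptic activity
y = [0.2, -0.4]                 # post-synaptic activity
W = hebbian_update(W, x, y)
print([[round(v, 4) for v in row] for row in W])   # [[0.02, 0.01], [-0.04, -0.02]]
```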

### Atlas Gatekeeper (FE-AGK)
Runtime expert management:
- **Ban & Pick**: Profiles expert impact, bans the bottom 25% per layer (8->6 effective experts)
- **MoDES**: Skips the entire MoE computation for uncertain tokens (saves ~50% of compute on many layers)
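
Ban & Pick is straightforward to sketch. The scores, impact values, and function below are hypothetical; the engine profiles real expert impact at runtime:

```python
# Ban the bottom 25% of experts per layer by measured impact, then route
# top-k only among the survivors.
def ban_and_pick(router_scores, impact, ban_frac=0.25, top_k=8):
    n = len(router_scores)
    banned = set(sorted(range(n), key=lambda e: impact[e])[:int(n * ban_frac)])
    allowed = sorted(((s, e) for e, s in enumerate(router_scores)
                      if e not in banned), reverse=True)
    return [e for _, e in allowed[:top_k]]

scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4]   # router scores (toy)
impact = [5.0, 0.1, 4.0, 0.2, 3.0, 2.0, 1.0, 0.5]   # experts 1 and 3 weakest
print(ban_and_pick(scores, impact, top_k=4))   # [0, 2, 4, 6] -- 1 and 3 banned
```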

### FE-XC / INT2 Cold Expert Demotion
Automatically compresses rarely used experts to 2-bit:
- FE-XC: Codebook 2-bit (2x8 AQLM-style, near-FP16 quality, 5.3x faster kernel)
- INT2: Scalar 2-bit (simpler, lower quality)
- Age-adaptive: hot->FP4, cold->FE-XC, coldest->INT2
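
The codebook idea can be shown with a scalar sketch. FE-XC actually uses a 2x8 AQLM-style codebook; the 4-entry codebook and helper here are hypothetical:

```python
# Codebook 2-bit quantization: each weight becomes the index (0..3) of the
# nearest shared centroid, so it stores in 2 bits plus the shared codebook.
def quantize_2bit(weights, codebook):
    assert len(codebook) == 4            # 2 bits -> 4 codebook entries
    codes = [min(range(4), key=lambda i: abs(w - codebook[i])) for w in weights]
    dequant = [codebook[c] for c in codes]
    return codes, dequant

codebook = [-1.0, -0.25, 0.25, 1.0]
codes, deq = quantize_2bit([0.9, -0.3, 0.1, -1.2], codebook)
print(codes)   # [3, 1, 2, 0]
print(deq)     # [1.0, -0.25, 0.25, -1.0]
```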

### FlashDecode
Custom Triton attention kernel for M=1 decoding:
- Online softmax (no separate softmax pass)
- Reads only valid KV positions (no padding waste)
- GQA support (4 KV heads -> 32 query heads)
- 15.8 ms per token (48 layers, 4096-token context)
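
The online softmax is what removes the separate softmax pass: keep a running max and rescale the partial sums whenever it grows. A scalar Python sketch of one query attending over a KV stream:

```python
import math

# One-pass attention via online softmax: numerically stable, never
# materializes the full score vector, matches two-pass softmax exactly.
def online_softmax_attend(scores, values):
    m, denom, acc = float("-inf"), 0.0, 0.0
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new) if m > float("-inf") else 0.0
        denom = denom * scale + math.exp(s - m_new)   # rescale running sum
        acc = acc * scale + math.exp(s - m_new) * v   # rescale running output
        m = m_new
    return acc / denom

scores, values = [0.5, 2.0, -1.0], [1.0, 2.0, 3.0]
w = [math.exp(s) for s in scores]                     # two-pass reference
ref = sum(wi * vi for wi, vi in zip(w, values)) / sum(w)
assert abs(online_softmax_attend(scores, values) - ref) < 1e-12
```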

### EAGLE-3 Speculative Decoding (infrastructure ready)
Draft-then-verify acceleration: the draft head predicts K=5 tokens, and the target model verifies all 6 positions in one forward pass. Infrastructure is complete; draft-head training is in progress.
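
In the greedy case, verification reduces to a prefix match against the target's argmax picks. A sketch with hypothetical token IDs (K=5 as above):

```python
# Greedy draft-then-verify: accept drafted tokens while they match the
# target's argmax at the same position, then append the target's own token.
def accept_greedy(draft_tokens, target_argmax):
    """target_argmax has len(draft_tokens)+1 entries, from one verify pass."""
    accepted = []
    for d, t in zip(draft_tokens, target_argmax):
        if d != t:
            accepted.append(t)            # target's token at first mismatch
            return accepted
        accepted.append(d)
    accepted.append(target_argmax[len(draft_tokens)])   # bonus token
    return accepted

draft = [11, 42, 42, 7, 99]               # K=5 drafted token IDs (toy)
target = [11, 42, 13, 7, 99, 5]           # K+1 target argmax picks (toy)
print(accept_greedy(draft, target))       # [11, 42, 13] -> 2 accepted + fix
```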

## Quick Start

```python
from fireecho_kernel import FireEchoEngine

# Load (streams layer-by-layer, ~110 s, 20 GB VRAM)
engine = FireEchoEngine.from_pretrained("path/to/Qwen3-Omni-30B")

# Enable optimizations
engine.enable_flat_decode(kv_dtype='fp8')  # FP8 KV cache
engine.enable_cuda_graph_decode()          # CUDA Graph

# Generate
input_ids = engine.tokenizer.encode("Hello, world!", return_tensors='pt').cuda()
output = engine.generate(input_ids, max_new_tokens=200, temperature=0.7)
print(engine.tokenizer.decode(output[0], skip_special_tokens=True))
```

## Requirements

- **GPU**: RTX 4090 (24 GB) minimum, RTX 5090 (32 GB) recommended
- **CUDA**: 12.4+
- **Python**: 3.10-3.12
- **PyTorch**: 2.4.0+
- **Triton**: 3.0+
- **OS**: Linux x86_64

## Hardware Independence

FireEcho uses **zero NVIDIA proprietary libraries**:
- No cuQuantizer, CUTLASS, TensorRT, or cuBLAS (attention still goes through `torch.matmul`, which may dispatch to cuBLAS)
- All custom kernels are pure **Triton** (compiles for NVIDIA CUDA, AMD ROCm, and Intel XPU)
- Runs anywhere Triton runs

## Architecture

```
FireEcho Engine
├── fireecho_kernel.py     # Main engine (9000+ lines)
│   ├── FireEchoEngine     # Load, generate, speculate
│   ├── MoEFFN             # Packed MoE with fused dispatch
│   ├── HebbianMemory      # Fast weights (learn at inference)
│   ├── FlashDecode        # Triton M=1 GQA attention
│   └── CUDA Graph         # Graph-captured decode
├── goliath_kernel.py      # Quantized GEMM kernels (3000+ lines)
│   ├── GoliathFP4         # FP4 fused dequant-matmul
│   ├── GoliathFP8         # FP8 fused dequant-matmul
│   ├── GoliathINT2        # INT2 scalar quantization
│   ├── GoliathFEXC        # FE-XC codebook 2-bit
│   └── Packed MoE         # Contiguous expert buffers
├── triton_hebbian.py      # Fused Hebbian kernels
├── femx_storage.py        # Block floating point storage
└── persistent_memory.py   # AGI-like persistent memory
```

## License

CC BY-NC 4.0: free for research and non-commercial use with attribution.

For commercial licensing: [@Joysulem on X/Twitter](https://x.com/Joysulem)

## Citation

```bibtex
@software{fireecho2026,
  author = {Davila Flores, Luis E.},
  title = {FireEcho Engine: High-Performance Single-GPU Inference for 30B+ MoE Models},
  year = {2026},
  url = {https://github.com/Joysulem/FireEcho}
}
```