Upload 3258 files
This view is limited to 50 files because it contains too many changes. See raw diff
- .gitattributes +7 -0
- FireEcho Engine/About FireEcho.md +33 -0
- FireEcho Engine/Bible Readme.txt +667 -0
- FireEcho Engine/__pycache__/cutlass_kernels.cpython-312.pyc +3 -0
- FireEcho Engine/__pycache__/dsmem_ops.cpython-312.pyc +0 -0
- FireEcho Engine/__pycache__/femx_storage.cpython-312.pyc +0 -0
- FireEcho Engine/__pycache__/fireecho_kernel.cpython-312.pyc +3 -0
- FireEcho Engine/__pycache__/goliath_kernel.cpython-312.pyc +3 -0
- FireEcho Engine/__pycache__/hebbian_finetune_demo.cpython-312.pyc +3 -0
- FireEcho Engine/__pycache__/triton_hebbian.cpython-312.pyc +0 -0
- FireEcho Engine/bench_fusion.py +39 -0
- FireEcho Engine/benchmark_eagle.py +231 -0
- FireEcho Engine/benchmark_fullstack.py +323 -0
- FireEcho Engine/benchmark_perplexity.py +358 -0
- FireEcho Engine/calibrate_fexc.py +173 -0
- FireEcho Engine/calibrate_fexvq.py +227 -0
- FireEcho Engine/csrc/cluster_launch.cpp +53 -0
- FireEcho Engine/csrc/cluster_launch.h +194 -0
- FireEcho Engine/csrc/dsmem_cluster.cuh +344 -0
- FireEcho Engine/csrc/femx_bindings.cpp +48 -0
- FireEcho Engine/csrc/femx_kernels.cu +422 -0
- FireEcho Engine/csrc/fireecho_preproc.cpp +54 -0
- FireEcho Engine/csrc/fireecho_preproc_cuda.cu +316 -0
- FireEcho Engine/cutlass_kernels.py +2418 -0
- FireEcho Engine/debug_acceptance.log +92 -0
- FireEcho Engine/debug_acceptance.py +152 -0
- FireEcho Engine/debug_bisect.log +78 -0
- FireEcho Engine/debug_bisect.py +149 -0
- FireEcho Engine/debug_d8_isolate.log +79 -0
- FireEcho Engine/debug_d8_isolate.py +156 -0
- FireEcho Engine/debug_eval_flow.log +75 -0
- FireEcho Engine/debug_eval_flow.py +186 -0
- FireEcho Engine/debug_nan_isolate.log +57 -0
- FireEcho Engine/debug_nan_isolate.py +174 -0
- FireEcho Engine/debug_promptlen.py +110 -0
- FireEcho Engine/debug_seqlen.py +65 -0
- FireEcho Engine/debug_seqlen_threshold.py +61 -0
- FireEcho Engine/debug_specgen_trace.py +171 -0
- FireEcho Engine/dsmem_ops.py +789 -0
- FireEcho Engine/eagle_data_codemix_cache.pt +3 -0
- FireEcho Engine/eagle_data_codemix_cache.pt.bak +3 -0
- FireEcho Engine/eagle_data_codemix_cache_old.pt +3 -0
- FireEcho Engine/eagle_data_selfgen_cache.pt +3 -0
- FireEcho Engine/eagle_data_selfgen_cache.pt.old +3 -0
- FireEcho Engine/eagle_precompute.log +0 -0
- FireEcho Engine/eagle_precompute_goddess.log +0 -0
- FireEcho Engine/eagle_precompute_v2.log +1220 -0
- FireEcho Engine/eagle_test.py +164 -0
- FireEcho Engine/eagle_train_d8.log +212 -0
- FireEcho Engine/eagle_train_goddess.log +973 -0
.gitattributes
CHANGED

@@ -33,3 +33,10 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+FireEcho[[:space:]]Engine/__pycache__/cutlass_kernels.cpython-312.pyc filter=lfs diff=lfs merge=lfs -text
+FireEcho[[:space:]]Engine/__pycache__/fireecho_kernel.cpython-312.pyc filter=lfs diff=lfs merge=lfs -text
+FireEcho[[:space:]]Engine/__pycache__/goliath_kernel.cpython-312.pyc filter=lfs diff=lfs merge=lfs -text
+FireEcho[[:space:]]Engine/__pycache__/hebbian_finetune_demo.cpython-312.pyc filter=lfs diff=lfs merge=lfs -text
+FireEcho[[:space:]]Engine/eagle_data_codemix_cache.pt.bak filter=lfs diff=lfs merge=lfs -text
+FireEcho[[:space:]]Engine/eagle_data_selfgen_cache.pt.old filter=lfs diff=lfs merge=lfs -text
+FireEcho[[:space:]]Engine/yay/src/gopath/pkg/mod/github.com/deckarep/golang-set/v2@v2.8.0/new_improved.jpeg filter=lfs diff=lfs merge=lfs -text
FireEcho Engine/About FireEcho.md
ADDED

@@ -0,0 +1,33 @@
About FireEcho:

FireEcho is not a base model. She is a work-in-progress fast engine that connects to LLMs on the path to AGI, with short- and long-term memory that never forgets. She helps humanity progress faster: calculate, adapt, learn, take notes, and discover new ideas. Goals: reduced VRAM, no accuracy loss, high speed, zero hallucinations, zero drift. 30B → 20 GB VRAM.

//////////////////////////// FE quantization names: ////////////////////////////

1. FE-MX — FireEcho Mixed-Exponent (block floating point, femx_storage.py)
2. FE-XC — FireEcho Xtreme Compress (codebook 2-bit, AQLM-style k-means, CodeGEMM psumbook kernel, goliath_kernel.py)
3. FE-XVQ — FireEcho Xtreme Vector Quant (VPTQ-inspired, Hessian-weighted codebooks)
4. FE-XT — FireEcho Xturbo (tree speculative decoding with dynamic branch tuning, Scylla-inspired)
5. FE-H — FireEcho Hayabusa (async prefetch/offload for scaling draft-head layers to CPU, SP-MoE-inspired)



//////////////////////// FireEcho Quantization Stack: /////////////////////////

FE-MX  = FireEcho Mixed-Exponent (adaptive precision: cold→FP4, warm→FP6, hot→FP8)
FE-XC  = FireEcho Xtreme Compress (codebook 2-bit, AQLM-style)
FE-XVQ = FireEcho Xtreme Vector Quant (Hessian-weighted codebook, 2-bit)
FE-XT  = FireEcho Xturbo (tree speculative decoding)
FE-H   = FireEcho Hayabusa (async prefetch offload)

FE-MX (Mixed-Exponent) — Adaptive block floating-point precision. Hot experts (frequently used) stay at FP8, warm ones at FP6, and cold ones drop to FP4. Uses shared block exponents per group — like HDR for weights: more precision where activity demands it.

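The shared-block-exponent idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the femx_storage.py implementation; the group size and mantissa width below are assumptions for the example.

```python
import numpy as np

def femx_quantize(w, group=32, mant_bits=3):
    """Toy block floating point: one shared exponent per group of weights,
    with `mant_bits`-bit signed mantissas (FP4-like when mant_bits=3)."""
    w = w.reshape(-1, group)
    max_abs = np.abs(w).max(axis=1, keepdims=True) + 1e-12
    exponent = np.ceil(np.log2(max_abs))      # shared exponent per group
    scale = 2.0 ** exponent
    qmax = 2 ** (mant_bits - 1) - 1           # mantissa range: [-qmax, qmax]
    mantissa = np.clip(np.round(w / scale * qmax), -qmax, qmax)
    return mantissa.astype(np.int8), scale, qmax

def femx_dequantize(mantissa, scale, qmax):
    """Recover approximate weights from mantissas and per-group scales."""
    return mantissa.astype(np.float32) * scale / qmax
```

Groups that contain large weights get a large shared exponent; quiet groups get a small one, so the few mantissa bits are always spent near the group's actual dynamic range.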
FE-XVQ (Xtreme Vector Quant) — Hessian-weighted 2-bit codebook. Like FE-XC, but uses second-order gradient information (the Hessian) to learn smarter codebooks — weight groups that affect the output most get more precise codebook entries. Same 2 bits/weight, but better quality through calibration-data-driven optimization.

FE-XC (Xtreme Compress) — Learned codebook 2-bit quantization. Instead of crudely rounding to 2-bit integers (which destroys quality), it learns 256 codebook vectors via k-means, then stores 2 uint8 indices per weight group. Same 2 bits/weight storage as INT2, but much better quality. Uses a "psumbook" trick: it precomputes dot(codebook, input) once per token, so all 8 active experts do scalar lookups instead of vector math. Result: 5.3x faster than FP4 at the same bandwidth.

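The psumbook trick described above can be illustrated in plain NumPy, simplified to a single codebook with one index per group (the real CodeGEMM kernel and its 2x8 layout are more involved; all shapes here are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
G, V = 8, 256                       # group size, codebook entries
K, N = 64, 16                       # toy input/output dims
codebook = rng.standard_normal((V, G)).astype(np.float32)
# One codebook index per 8-weight group, per output column
idx = rng.integers(0, V, size=(K // G, N))
x = rng.standard_normal(K).astype(np.float32)

# Psumbook: dot(codebook, input-group) computed ONCE per token ...
psums = np.stack([codebook @ x[g*G:(g+1)*G] for g in range(K // G)])   # [K/G, V]

# ... so each expert matvec reduces to scalar lookups + adds
y_fast = np.array([psums[np.arange(K // G), idx[:, n]].sum() for n in range(N)])

# Reference: materialize the weight matrix, then do the full matvec
W = np.stack([np.concatenate([codebook[idx[g, n]] for g in range(K // G)])
              for n in range(N)], axis=1)                              # [K, N]
y_ref = x @ W
```

The point is that `psums` is shared across every expert and every output column, so the per-expert cost collapses to index lookups and additions.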
FE-XT (Xturbo) — Tree speculative decoding. Instead of predicting a single token chain, the draft head explores b = 4-16 branches simultaneously (a tree). The target model verifies all branches in a single batched forward pass and accepts the longest correct branch. Dynamic b tuning adjusts the branch width based on the acceptance rate (Scylla Eq. 4). Target: 3-5x speedup over standard speculation.

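The dynamic-b idea can be sketched as a tiny feedback controller. This is an illustrative heuristic only, not the Scylla Eq. 4 rule itself; the thresholds are assumptions.

```python
def tune_branch_width(b, accept_rate, lo=4, hi=16):
    """Toy controller for tree width b: widen when drafted branches are
    usually accepted, narrow when most draft work is wasted."""
    if accept_rate > 0.7:
        return min(hi, b * 2)       # high acceptance: explore wider
    if accept_rate < 0.3:
        return max(lo, b // 2)      # low acceptance: stop wasting drafts
    return b
```

Called once per verification step with the observed acceptance rate, it keeps the tree width near the point where drafted branches are actually worth verifying.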
FE-H (Hayabusa) — Async CPU offload for the draft head. When the draft head gets large (D = 8-16 layers, 357M-1.2B params), it no longer fits in VRAM alongside the 20 GB model. Hayabusa offloads deep draft layers to CPU RAM and JIT-prefetches them to the GPU during the verification step (when the GPU is busy with the target model anyway). Overlapping the CPU→GPU transfer with GPU compute makes the memory savings essentially free.

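The overlap pattern can be mimicked with a background thread and a bounded queue. This is an illustrative, CPU-only stand-in; the real engine overlaps pinned-host-to-GPU CUDA copies with kernel execution, and the names here are invented for the sketch.

```python
import queue
import threading

def hayabusa_stream(draft_layers, depth=2):
    """Toy Hayabusa-style prefetcher: a background thread 'uploads' upcoming
    draft layers while the caller is busy computing, so transfer overlaps
    compute. The small queue bounds how much staging memory is in flight."""
    q = queue.Queue(maxsize=depth)

    def uploader():
        for layer in draft_layers:
            q.put(f"on_gpu:{layer}")   # stands in for a host->device copy
        q.put(None)                    # sentinel: nothing left to prefetch

    threading.Thread(target=uploader, daemon=True).start()
    while (item := q.get()) is not None:
        yield item
```

While the consumer processes one layer, the uploader is already staging the next, which is exactly the compute/transfer overlap the paragraph describes.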
FireEcho Engine/Bible Readme.txt
ADDED

@@ -0,0 +1,667 @@
================================================================================
FIREECHO ENGINE
High-Performance Single-GPU Inference Kernel for 30B+ MoE Models
================================================================================

Creator & Sole Author: Luis E. Davila Flores (@Joysulem)
License: CC BY-NC 4.0 (free for research, attribution required)
Status: Production-quality single-GPU decode, research extensions active

================================================================================
WHAT IS FIREECHO?
================================================================================

FireEcho is a custom inference engine that runs a 30 BILLION parameter
Mixture-of-Experts model (Qwen3-Omni-30B) on a SINGLE consumer GPU at
45+ tokens/second — using only 20 GB VRAM.

No multi-GPU. No cloud. No NVIDIA proprietary libraries.
Just Triton + PyTorch + one GPU.

Key numbers:
- 30.5B total params, ~3.3B active per token (128 experts, top-8 routing)
- 4x compression via Goliath FP4 fused dequant-matmul (61 GB -> 20 GB)
- 124x speedup from baseline (0.4 -> 49.4 tok/s) through 7 optimization layers
- Zero NVIDIA proprietary dependencies (no cuQuantizer, CUTLASS, TensorRT)
- Runs anywhere Triton compiles: NVIDIA CUDA, AMD ROCm, Intel XPU

What makes FireEcho different from vLLM/TGI/llama.cpp:
- Goliath kernel: FP4 dequantization INSIDE the Triton matmul loop (no separate
  dequant step, no global-memory materialization)
- Packed MoE: All 128 experts packed into one contiguous buffer per layer;
  expert IDs stay on GPU — zero CPU-GPU sync during decode
- FlashDecode: Custom Triton attention kernel with online softmax for M=1 GQA
- Hebbian Memory: Biologically inspired fast weights that learn at inference time
- FE-XC/INT2: Cold experts auto-demote to 2-bit (codebook or scalar) — further
  bandwidth savings without touching hot experts
- CUDA Graph decode: Entire decode step captured as one graph, ~15.8 ms/step

================================================================================
CURRENT STATUS & REALISTIC EXPECTATIONS
================================================================================

WHAT WORKS (production-quality):
[x] Full Qwen3-Omni-30B inference at 45+ tok/s on RTX 5090
[x] Goliath FP4 quantization (20 GB VRAM, FP16-quality output)
[x] Packed MoE with fused dequant-matmul (zero CPU sync)
[x] FlashDecode attention (online softmax, valid_len masking)
[x] CUDA Graph decode (graph-captured forward pass)
[x] Flat KV cache (pre-allocated, zero allocation per token)
[x] FP8 KV cache (50% VRAM savings on attention)
[x] FE-XC cold expert demotion (codebook 2-bit, 5.3x faster kernel)
[x] INT2 ultra-cold expert demotion (scalar 2-bit)
[x] Hebbian persistent memory (learns during inference)
[x] Atlas gatekeeper (expert banning + MoDES skipping)
[x] Streaming shard loader (110 s cold start, 3.1 GB CPU RAM)

WHAT'S RESEARCH/EXPERIMENTAL:
[ ] EAGLE-3 speculative decoding (infrastructure done, head needs training)
[ ] FE-XT tree speculation (code done, needs trained draft head)
[ ] FE-H Hayabusa async prefetch (code done, needs benchmarking)
[ ] Batched speculative decode (infrastructure done)
[ ] Multi-GPU (not implemented — single-GPU is the design philosophy)

WILL NOT WORK ON:
- GPUs with < 24 GB VRAM (model is 20 GB + KV cache)
- CUDA < 12.4 (BF16 atomics, FP8 support needed)
- CPU-only (Triton compiles to GPU targets)

================================================================================
HARDWARE & SOFTWARE REQUIREMENTS
================================================================================

Component           Minimum               Recommended
─────────────────   ───────────────────   ────────────────────────
GPU                 RTX 4090 (24 GB)*     RTX 5090 (32 GB)
GPU VRAM            24 GB                 32 GB
CPU                 Any modern x86_64     Ryzen 9 9950X / i9-14900K
System RAM          32 GB                 64 GB
CUDA                12.4+                 12.8+
Python              3.10 - 3.12           3.12
PyTorch             2.4.0+                2.6.0+cu128
Triton              3.0+                  3.2+
OS                  Linux (x86_64)        Arch Linux / Ubuntu 22.04+

* RTX 4090: Works, but FP4 kernels may be slower (no Blackwell tensor cores)
* RTX 3090: Marginal — 24 GB VRAM is tight and FP8 is not supported
* AMD GPUs: Triton compiles to ROCm — untested but architecturally supported

Tested configuration (author's machine):
AMD Ryzen 9 9950X + NVIDIA RTX 5090 32 GB + 64 GB DDR5
Arch Linux, CUDA 12.8, Python 3.12, PyTorch 2.6.0+cu128, Triton 3.2

================================================================================
INSTALLATION
================================================================================

Step 1: Clone the repository
─────────────────────────────
git clone https://github.com/Joysulem/FireEcho.git
cd FireEcho

Step 2: Create a Python virtual environment
────────────────────────────────────────────
python3.12 -m venv .venv
source .venv/bin/activate

Step 3: Install dependencies
─────────────────────────────
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install triton transformers tokenizers safetensors sentencepiece

Step 4: Verify installation
────────────────────────────
python -c "import torch; print('CUDA:', torch.cuda.is_available(), '|', torch.version.cuda)"
python -c "import triton; print('Triton:', triton.__version__)"

Expected output:
CUDA: True | 12.8
Triton: 3.2.0

Step 5: Download a model (Qwen3-Omni-30B recommended)
──────────────────────────────────────────────────────
# Option A: Via huggingface-cli
pip install huggingface-hub
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Instruct --local-dir ./model/Qwen3-Omni

# Option B: Via git lfs
git lfs install
git clone https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct ./model/Qwen3-Omni

================================================================================
QUICK SMOKE TEST (run this first!)
================================================================================

cd FireEcho/kernel/FireEcho\ Engine/

python -c "
from fireecho_kernel import FireEchoEngine
import torch

# Load model (takes ~110 seconds, streams layer-by-layer)
engine = FireEchoEngine.from_pretrained('./model/Qwen3-Omni')

# Quick generation test
tokens = engine.tokenizer.encode('The capital of France is', return_tensors='pt').cuda()
output = engine.generate(tokens, max_new_tokens=20, temperature=0.0)
print(engine.tokenizer.decode(output[0]))
print(f'VRAM used: {torch.cuda.max_memory_allocated()/1e9:.1f} GB')
"

Expected output:
The capital of France is Paris. Paris is the largest city in France...
VRAM used: 23.1 GB

If this works, your setup is correct. If not, check:
- The CUDA version matches the PyTorch build (torch.version.cuda)
- The GPU has enough VRAM (nvidia-smi)
- The model path is correct

================================================================================
BASIC INFERENCE USAGE
================================================================================

─── Minimal example ───

from fireecho_kernel import FireEchoEngine

# Load model with FP4 quantization (automatic for Qwen3-Omni)
engine = FireEchoEngine.from_pretrained("/path/to/Qwen3-Omni-30B")

# Encode input
input_ids = engine.tokenizer.encode(
    "Explain quantum computing in simple terms",
    return_tensors='pt'
).cuda()

# Generate
output = engine.generate(
    input_ids,
    max_new_tokens=200,
    temperature=0.7,
    top_p=0.9,
    use_cache=True
)

# Decode and print
print(engine.tokenizer.decode(output[0], skip_special_tokens=True))


─── High-performance decode (all optimizations) ───

engine = FireEchoEngine.from_pretrained("/path/to/Qwen3-Omni-30B")

# Enable flat KV cache (eliminates torch.cat overhead)
engine.enable_flat_decode()  # +403 MB VRAM, BF16 KV

# Or FP8 KV cache (half the VRAM, same speed)
engine.enable_flat_decode(kv_dtype='fp8')  # +208 MB VRAM

# Enable CUDA Graph decode (captures forward pass as graph)
engine.enable_cuda_graph_decode()  # +0 VRAM, ~10% faster

# Enable Atlas gatekeeper (prunes cold experts at runtime)
engine.enable_atlas(
    profile_prompts=8,
    ban_pct=0.25,         # Ban bottom 25% of experts per layer
    modes_threshold=2.0   # MoDES: skip MoE for uncertain tokens
)

# Enable FE-XC cold expert demotion (2-bit codebook)
engine.enable_auto_fexc_demotion(cold_threshold=0.10)

# Enable INT2 ultra-cold expert demotion
engine.enable_auto_int2_demotion(cold_threshold=0.05)

# Generate with everything enabled
output = engine.generate(input_ids, max_new_tokens=500)


─── Interactive chat loop ───

engine = FireEchoEngine.from_pretrained("/path/to/Qwen3-Omni-30B")
engine.enable_flat_decode()
engine.enable_cuda_graph_decode()

print("FireEcho Chat (type 'quit' to exit)")
while True:
    user_input = input("\nYou: ")
    if user_input.lower() == 'quit':
        break

    # Format as chat (Qwen3 format)
    prompt = f"<|im_start|>user\n{user_input}<|im_end|>\n<|im_start|>assistant\n"
    input_ids = engine.tokenizer.encode(prompt, return_tensors='pt').cuda()

    output = engine.generate(
        input_ids,
        max_new_tokens=500,
        temperature=0.7,
        top_p=0.9
    )

    response = engine.tokenizer.decode(
        output[0][input_ids.shape[1]:],
        skip_special_tokens=True
    )
    print(f"\nFireEcho: {response}")

================================================================================
BENCHMARKING
================================================================================

─── Quick speed test ───

python benchmark_fullstack.py

This runs 7 optimization layers, stacking each one:
  L0: Baseline (FP4 + packed MoE + flat KV BF16)   ~45 tok/s
  L1: + FP8 KV cache                               ~42 tok/s
  L2: + L2 layer prefetch                          ~42 tok/s
  L3: + Atlas Ban & Pick (8 -> ~5 experts)         ~40 tok/s
  L4: + FE-XC cold experts (2-bit codebook)        ~39 tok/s
  L5: + INT2 coldest experts (2-bit scalar)        ~38 tok/s
  L6: + CUDA Graph decode                          ~TBD

Note: L1-L5 are slightly slower than L0 due to overhead from
additional dispatch logic. The REAL benefit comes when combined
with speculative decoding (EAGLE-3) — the bandwidth savings from
FE-XC/INT2 allow more tokens to be verified per unit time.


─── EAGLE-3 benchmark (speculative decode) ───

python benchmark_eagle.py --checkpoint eagle_checkpoints/eagle_best.pt

Note: Requires a trained draft head. See "EAGLE-3 Training" section.

================================================================================
FEATURE REFERENCE (Cheat Sheet)
================================================================================

Feature                  How to enable                        VRAM cost
───────────────────────  ───────────────────────────────────  ──────────
Flat KV cache (BF16)     engine.enable_flat_decode()          +403 MB
Flat KV cache (FP8)      engine.enable_flat_decode('fp8')     +208 MB
CUDA Graph decode        engine.enable_cuda_graph_decode()    ~0
Atlas gatekeeper         engine.enable_atlas()                ~0
FE-XC cold demotion      engine.enable_auto_fexc_demotion()   ~0*
INT2 cold demotion       engine.enable_auto_int2_demotion()   ~0*
L2 layer prefetch        engine.enable_l2_prefetch()          ~0
Hebbian memory           engine.enable_hebbian()              +50 MB
EAGLE-3 speculation      engine.enable_eagle(checkpoint)      +200 MB

* FE-XC/INT2 actually SAVES VRAM by compressing cold expert weights

Quantization formats available:
- Goliath FP4: 4-bit fused dequant (default for MoE experts)
- Goliath FP8: 8-bit fused dequant (optional for attention)
- Goliath INT2: 2-bit scalar quantization (coldest experts)
- FE-XC: 2-bit codebook (2x8 AQLM-style, near-FP16 quality)
- FE-XVQ: Hessian-weighted 2-bit codebook (VPTQ-inspired)
- FE-MX: Block floating point (FEMX4/FEMX6/FEMX8 for Hebbian)

================================================================================
HOW THE ENGINE WORKS (Architecture Overview)
================================================================================

FireEcho loads a model and replaces standard PyTorch operations with
custom Triton kernels at every level:

1. LOADING (from_pretrained)
   - Streams model shards one layer at a time (3.1 GB CPU RAM peak)
   - Quantizes each layer to Goliath FP4 on GPU as it loads
   - Packs all 128 MoE experts into contiguous buffers per layer
   - Total: 61 GB BF16 -> 20 GB FP4 in 110 seconds

2. PREFILL (processing the input prompt)
   - Standard attention + MoE forward pass
   - Uses FlashAttention-style Triton kernel for long sequences
   - Builds KV cache for all layers

3. DECODE (generating tokens one at a time)
   - Each token goes through 48 transformer layers:

     For each layer:
     a) RMSNorm
     b) Attention: Q/K/V projection (BF16 matmul) -> RoPE -> FlashDecode
        (custom Triton kernel, M=1, online softmax, reads only valid KV)
     c) RMSNorm
     d) MoE Router: softmax over 128 experts -> top-8 selection
     e) Expert FFN: Goliath FP4 packed matmul (gate_up + down)
        - Hot experts: FP4 (highest quality)
        - Cold experts: FE-XC 2-bit codebook (5.3x faster kernel)
        - Coldest experts: INT2 2-bit scalar
     f) Residual connection

   - With CUDA Graph: the entire 48-layer forward is captured as one graph
     launch -> ~15.8 ms per token

4. SPECULATIVE DECODE (EAGLE-3, when the draft head is trained)
   - Draft head predicts the next K tokens (K=5 default)
   - Target model verifies all K+1 tokens in one forward pass
   - Accepts matching tokens, rejects and rolls back on mismatch
   - Expected: 3-5x speedup at a 70%+ acceptance rate
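The accept/reject step in (4) can be sketched for the greedy case with toy token lists. This is an illustration of the general speculative-decoding acceptance rule, not the engine's actual implementation:

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy speculative verification: accept the longest prefix of the
    draft that matches the target model's own predictions. On the first
    mismatch, take the target's token instead; if everything matches,
    the target's extra (K+1)-th prediction is a free bonus token."""
    accepted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted == verified:
            accepted.append(drafted)
        else:
            accepted.append(verified)   # target's correction replaces the miss
            break
    else:
        # All K drafts accepted: keep the target's bonus token too
        accepted.append(target_tokens[len(draft_tokens)])
    return accepted
```

Every call emits at least one token (the target's own prediction), so speculation never produces worse output than plain decoding; it only changes how many tokens each target forward pass yields.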

Memory layout during decode:
┌──────────────────────────────────────────────────┐
│ GPU VRAM (32 GB total)                           │
├──────────────────────────────────────────────────┤
│ Model weights (FP4 quantized)          19.6 GB   │
│ KV cache (flat, FP8)                    0.2 GB   │
│ Hebbian memory                          0.05 GB  │
│ CUDA Graph buffers                      0.1 GB   │
│ Activations + workspace                 1.0 GB   │
│ ──────────────────────────────────────────────   │
│ Total                                 ~21.0 GB   │
│ Free                                  ~11.0 GB   │
└──────────────────────────────────────────────────┘
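The KV-cache line in the box can be sanity-checked with the usual sizing formula. The model dimensions below are assumptions for illustration (only the 48-layer count comes from this document), not Qwen3-Omni's actual config:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, max_seq, bytes_per_elem):
    """Size of a flat KV cache: a K plane and a V plane for every layer."""
    return 2 * layers * kv_heads * head_dim * max_seq * bytes_per_elem

# Example: 48 layers, 4 GQA KV heads of dim 128 (assumed), 4096-token window, FP8
size = kv_cache_bytes(48, 4, 128, 4096, 1)
print(f"{size / 1e6:.0f} MB")   # ~201 MB, in the ballpark of the 0.2 GB above
```

Halving `bytes_per_elem` from 2 (BF16) to 1 (FP8) is exactly the 50% KV-cache saving the status list claims.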

================================================================================
FILE STRUCTURE
================================================================================

FireEcho Engine/
├── fireecho_kernel.py       Main engine (9000+ lines)
│     - FireEchoEngine: load, generate, speculate
│     - FireEchoConfig: model configuration
│     - MoEFFN: mixture-of-experts with packed dispatch
│     - HebbianMemory: biologically-inspired fast weights
│     - FireEchoEagleHead: EAGLE-3 draft head
│     - FlashDecode Triton kernel
│     - CUDA Graph capture/replay
│
├── goliath_kernel.py        Quantized GEMM kernels (3000+ lines)
│     - GoliathFP4Weights: FP4 fused dequant-matmul
│     - GoliathFP8Weights: FP8 fused dequant-matmul
│     - GoliathINT2Weights: INT2 scalar quantization
│     - GoliathFEXCWeights: FE-XC codebook 2-bit
│     - GoliathFEXVQWeights: Hessian-weighted codebook
│     - Packed MoE kernels (FP4, INT2, FE-XC)
│     - Fused SwiGLU+Down kernel
│     - GoliathQuantumLinear (training)
│
├── triton_hebbian.py        Fused Triton kernels for Hebbian memory
│     - fused_competition, fused_soft_hebbian
│     - fused_traces_update, fused_gate_output
│
├── femx_storage.py          FE-MX block floating point storage
│     - FEMX2, FEMX4, FEMX6, FEMX8 tiers
│     - Stochastic rounding, age-adaptive precision
│
├── persistent_memory.py     AGI-like persistent memory
│     - EpisodicLog: raw experience buffer
│     - SemanticJournal: compressed knowledge
│     - ReflectionEngine: self-evaluation
│
├── benchmark_fullstack.py   Full-stack benchmark (L0-L6)
├── benchmark_eagle.py       EAGLE-3 speculative decode benchmark
├── train_eagle_head.py      EAGLE-3 draft head training script
└── calibrate_fexc.py        FE-XC codebook calibration
| 402 |
+
================================================================================
THE GOLIATH KERNEL (What Makes It Fast)
================================================================================

Standard quantized inference:
  1. Load FP4 weights from VRAM
  2. Dequantize to BF16 in global memory (writes 61 GB!)
  3. Run the matmul on the BF16 weights
  Problem: Step 2 doubles memory traffic and VRAM usage

Goliath approach:
  1. Load FP4 weights directly into Triton registers
  2. Dequantize INSIDE the matmul tile loop (in registers, zero global writes)
  3. Accumulate in FP32
  Problem: None. This is strictly better.

Code path (simplified):
    for k_block in range(0, K, BLOCK_K):
        # Load FP4 packed bytes (2 values per byte)
        w_packed = tl.load(weight_ptr + offsets)

        # Dequantize in-register
        w_lo = (w_packed & 0xF).to(tl.float32) * scale   # low nibble
        w_hi = (w_packed >> 4).to(tl.float32) * scale    # high nibble

        # Matmul tile (tensor cores); w_lo/w_hi are reassembled into w_tile
        acc += tl.dot(a_tile, w_tile)

Result: 4x less memory traffic, same numerical quality.
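The nibble unpacking above can be sketched outside Triton with plain tensor ops. This is illustrative only: it treats each 4-bit code as an unsigned integer times a scale, whereas the real kernel maps FP4 (e2m1) codes to their float values and never leaves registers.

```python
import torch

def unpack_fp4_nibbles(w_packed: torch.Tensor, scale: float) -> torch.Tensor:
    """Unpack a uint8 tensor holding two 4-bit values per byte.

    Returns a float32 tensor twice as long: low nibble then high nibble
    per byte, each multiplied by the per-group scale.
    """
    w_lo = (w_packed & 0xF).to(torch.float32) * scale   # low nibble
    w_hi = (w_packed >> 4).to(torch.float32) * scale    # high nibble
    return torch.stack([w_lo, w_hi], dim=-1).flatten(-2)

packed = torch.tensor([0x21, 0x43], dtype=torch.uint8)  # nibbles 1,2 and 3,4
print(unpack_fp4_nibbles(packed, scale=1.0))            # tensor([1., 2., 3., 4.])
```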

Packed MoE:
  Standard approach: Loop over 8 active experts, one matmul each = 16 kernel
  launches per layer (gate_up + down per expert).

  Goliath Packed MoE: All 128 experts packed into one [128, K//2, N] buffer.
  A single kernel launch reads expert_id from a GPU tensor and indexes into
  the buffer. Result: 2 kernel launches per layer (gate_up + down), and
  expert selection stays entirely on GPU.
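The packed-buffer indexing idea can be sketched in plain PyTorch (names and toy shapes are assumptions; the real kernel does the gather and matmul inside one Triton launch, on quantized weights):

```python
import torch

E, K, N = 8, 16, 32            # experts, in-dim, out-dim (toy sizes)
packed = torch.randn(E, K, N)  # all expert weights in ONE contiguous buffer

def packed_moe_matmul(x: torch.Tensor, expert_ids: torch.Tensor) -> torch.Tensor:
    """x: [T, K] tokens; expert_ids: [T] routing decision, kept on device.

    Indexing the packed buffer replaces a Python loop over experts,
    so expert selection never round-trips to the CPU.
    """
    w = packed[expert_ids]                            # [T, K, N] gather by id
    return torch.bmm(x.unsqueeze(1), w).squeeze(1)    # [T, N]

x = torch.randn(4, K)
ids = torch.tensor([0, 3, 3, 7])
out = packed_moe_matmul(x, ids)
print(out.shape)   # torch.Size([4, 32])
```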

================================================================================
HEBBIAN MEMORY (What Makes It Smart)
================================================================================

Standard LLMs: Frozen weights after training. The context window is the only
memory.

FireEcho Hebbian Memory:
  - Fast weights that update DURING inference (no backpropagation)
  - Inspired by biological synaptic plasticity (Hebb's rule: "neurons that
    fire together wire together")
  - Stores patterns from the current conversation
  - Retrieves relevant patterns to augment generation

How it works:
  1. The input token embedding is projected to query/key/value
  2. The query matches against stored memory slots (competitive retrieval)
  3. The top-K most relevant memories are retrieved
  4. The retrieved context is mixed with the transformer hidden state
  5. Memory slots are updated via a Hebbian learning rule

Updates use:
  - Soft competitive learning (winner-take-most)
  - Three-factor STDP (spike-timing-dependent plasticity)
  - Intrinsic plasticity (per-slot gain adaptation)
  - PMI correction (pointwise mutual information bias)
  - GHA decorrelation (prevents redundant memories)
  - Kappa switching (amplified encoding for novel patterns)
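The retrieve-then-update loop can be sketched as a tiny fast-weight memory. This is a deliberately minimal version under stated assumptions: slot count, learning rate, and the plain winner-take-most Hebbian update are illustrative, and the engine layers the STDP/PMI/GHA terms listed above on top of it.

```python
import torch
import torch.nn.functional as F

class TinyHebbianMemory:
    def __init__(self, n_slots: int = 64, dim: int = 32, lr: float = 0.1):
        self.keys = torch.zeros(n_slots, dim)    # slot addresses
        self.vals = torch.zeros(n_slots, dim)    # slot contents
        self.lr = lr

    def retrieve(self, query: torch.Tensor, top_k: int = 4) -> torch.Tensor:
        scores = self.keys @ query                 # competitive match
        top = scores.topk(top_k)
        weights = F.softmax(top.values, dim=0)     # winner-take-most mixing
        return weights @ self.vals[top.indices]    # retrieved context

    def update(self, key: torch.Tensor, value: torch.Tensor) -> None:
        # Hebbian rule: pull the winning slot toward the observed pattern
        winner = (self.keys @ key).argmax()
        self.keys[winner] += self.lr * (key - self.keys[winner])
        self.vals[winner] += self.lr * (value - self.vals[winner])

mem = TinyHebbianMemory()
k, v = torch.randn(32), torch.randn(32)
for _ in range(50):                 # a repeated pattern gets wired in
    mem.update(k, v)
print(F.cosine_similarity(mem.retrieve(k), v, dim=0) > 0.9)   # tensor(True)
```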

Enable:
    engine.enable_hebbian()

The memory persists within a session and can be saved/loaded:
    engine.save_persistent_memory("memory.pt")
    engine.load_persistent_memory("memory.pt")

================================================================================
COMPRESSION STACK (Why 30B Fits in 20 GB)
================================================================================

Level    Format       Bits  Compression  Quality       Used For
───────  ───────────  ────  ───────────  ────────────  ───────────────────
Base     BF16         16    1x           Perfect       Attention Q/K/V/O
Hot      Goliath FP4   4    4x           Near-perfect  Active MoE experts
Cold     FE-XC         2    8x           Very good     Rarely-used experts
         (codebook)
Coldest  INT2          2    8x           Acceptable    Least-used experts
         (scalar)

Combined with MoE sparsity (8/128 active = 6.25%):
  Effective model size per token:
    Attention: 8 × (4 projections × 2048 × 128 × 2 bytes) = 16 MB
    MoE:       8 experts × 3 projections × 768 × 2048 × 0.5 bytes = 18.9 MB
    Other:     embeddings, norms, router = ~13 MB
    Total per token: ~48 MB

RTX 5090 bandwidth: 1.79 TB/s
  Theoretical max: 1,790,000 / 48 = 37,291 tok/s (bandwidth-bound limit)
  Practical (30% utilization): ~45 tok/s (memory-bound, current result)

With FE-XC/INT2 cold experts replacing 80%+ of inactive expert weights:
  MoE bandwidth: 18.9 MB * 0.5 (half are 2-bit) = ~10 MB
  Total per token: ~39 MB
  At 30% utilization: ~55 tok/s

With EAGLE-3 (70% acceptance, K=5 draft):
  Effective throughput: 55 * 3.5 (average accepted tokens per verify) = ~193 tok/s
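The bandwidth arithmetic above can be checked in a few lines. The numbers are taken directly from the text; the 3.5x EAGLE factor is the text's assumed average accepted run length at ~70% acceptance over K=5 drafts.

```python
# Memory-bandwidth ceiling: bytes moved per token vs. available bandwidth
bandwidth_mb_s = 1_790_000      # RTX 5090: 1.79 TB/s expressed in MB/s
bytes_per_token_mb = 48         # attention 16 + MoE 18.9 + other ~13

ceiling = bandwidth_mb_s / bytes_per_token_mb
print(int(ceiling))             # 37291 -- the theoretical max quoted above

# EAGLE-3 multiplies decode throughput by the average accepted run length
eagle_factor = 3.5              # ~70% acceptance over K=5 drafts
print(55 * eagle_factor)        # 192.5 -- the "~193 tok/s" figure
```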

================================================================================
EAGLE-3 SPECULATIVE DECODING
================================================================================

EAGLE-3 is a draft-then-verify acceleration technique:

  Normal decode: 1 token per forward pass through 48 MoE layers
  EAGLE-3: The draft head predicts 5 tokens cheaply, then the target model
  verifies all 6 in one forward pass. If 4/5 match -> 5 tokens for the
  cost of ~2.

Architecture of the draft head:
  - Takes hidden states from layers 8, 24, 47 + the token embedding
  - Compresses via an FC layer (8192 -> 2048)
  - Passes through D transformer layers (D=2 to D=50)
  - Shares lm_head with the target model
  - Total params: 115M (D=2) to 2.12B (D=50)
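Greedy draft-then-verify can be sketched with toy logits. This is illustrative only: the shapes and the greedy acceptance rule are assumptions for the sketch, while the real EAGLE-3 path also reuses hidden states and supports sampled verification.

```python
import torch

def verify_greedy(draft_tokens: torch.Tensor, target_logits: torch.Tensor):
    """draft_tokens: [K] tokens proposed by the draft head.
    target_logits: [K+1, V] target-model logits for the draft positions
    plus one extra "bonus" position.

    Accept the longest prefix where the target's greedy choice agrees with
    the draft, then append the target's own next token for free.
    """
    target_choice = target_logits.argmax(dim=-1)            # [K+1]
    agree = (target_choice[:-1] == draft_tokens).long()
    n_accept = int(agree.cumprod(dim=0).sum())              # agreeing prefix
    accepted = draft_tokens[:n_accept]
    bonus = target_choice[n_accept:n_accept + 1]            # corrected token
    return torch.cat([accepted, bonus])

V = 10
draft = torch.tensor([3, 7, 2, 5, 5])
logits = torch.zeros(6, V)
for i, t in enumerate([3, 7, 2, 1, 5, 4]):   # target disagrees at position 3
    logits[i, t] = 1.0
print(verify_greedy(draft, logits))          # tensor([3, 7, 2, 1])
```

So one verify pass yields 4 tokens here: the 3 agreeing drafts plus the target's corrected token.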

Training:
    python train_eagle_head.py \
        --offline \              # Use precomputed hidden states
        --num_head_layers 50 \   # D=50 layers
        --draft_depth 5 \        # K=5 draft steps
        --lr 5e-4 \              # Learning rate
        --epochs 5 \             # Training epochs
        --loss_type ce \         # Cross-entropy loss
        --batch_positions \      # Batched M=64 (10x faster)
        --use_quantum_linear \   # Goliath FP8 forward + Quantum Gold backward
        --compile                # torch.compile the head

Usage after training:
    engine.enable_eagle("eagle_checkpoints/eagle_best.pt")
    output = engine.speculative_generate(input_ids, max_new_tokens=500)

================================================================================
SPEED OPTIMIZATION HISTORY
================================================================================

Step  Optimization                                tok/s  Speedup
────  ─────────────────────────────────────────  ─────  ───────
  0   Baseline (128-expert Python loop)            0.4      1x
  1   Grouped dispatch + TF32 + Triton autotune    7.7     19x
  2   Fused gate_up_proj (2->1 matmul/expert)      9.5     24x
  3   Single-token decode fast path               12.6     32x
  4   Multi-expert Goliath kernel (2 launches)    18.8     47x
  5   Packed MoE (contiguous buffer, GPU IDs)     30.8     77x
  6   Flat decode KV cache (zero torch.cat)       40.9    102x
  7   CUDA Graph + FlashDecode                    49.4    124x

Where the time goes at 45 tok/s (22 ms per token):
  Attention (FlashDecode): 0.28 ms/layer x 48 = 13.4 ms  (61%)
  MoE (Goliath FP4):       0.17 ms/layer x 48 =  8.2 ms  (37%)
  Other (norms, router):                          0.4 ms   (2%)
================================================================================
KNOWN LIMITATIONS & GOTCHAS
================================================================================

- Single-GPU only (by design — multi-GPU adds complexity for marginal gain)
- Minimum 24 GB VRAM (the model alone is 20 GB)
- FP4 quantization has ~0.05-0.15 relative error vs BF16 (negligible in practice)
- The first 10+ forward passes are slow (Triton kernel compilation/autotuning)
- CUDA Graph capture requires fixed tensor shapes (only decode, not prefill)
- Hebbian memory adds ~50 MB VRAM and slight latency
- FE-XC codebook learning takes 1-2 minutes on first enable
- No pip package yet (source install only)
- Tested primarily on RTX 5090 — other GPUs may need a Triton autotune re-run
- MoDES expert skipping can hurt quality if the threshold is too aggressive

================================================================================
TROUBLESHOOTING
================================================================================

Problem: "CUDA out of memory"
Fix: Check nvidia-smi for other processes using VRAM and kill them,
     or reduce max_kv_blocks in the config (default 256 = 4K tokens = 3.1 GB).

Problem: Very slow first few generations
Fix: Normal — Triton is compiling and autotuning kernels. Wait ~10 forward
     passes for warmup. Subsequent runs use cached kernels.

Problem: "No module named 'triton'"
Fix: pip install triton (requires the CUDA toolkit to be installed)

Problem: "RuntimeError: Triton compilation failed"
Fix: Check that the CUDA version matches PyTorch:
         python -c "import torch; print(torch.version.cuda)"
     Triton 3.0+ needs CUDA 12.0+.

Problem: NaN in output
Fix: Check if you are using prefill with >20 tokens (the packed MoE kernel
     needs a 3D grid). This was a fixed bug — update to the latest code.

Problem: CUDA Graph capture crashes
Fix: Atlas .item() calls conflict with graph capture. The engine auto-skips
     these during capture (fixed). Update to the latest code.

================================================================================
RESEARCH PAPERS & REFERENCES
================================================================================

FireEcho builds on ideas from:

Quantization:
  - AQLM (arXiv 2401.06118): Additive quantization for LLMs -> FE-XC codebook
  - VPTQ (Hessian-weighted): Second-order optimal codebooks -> FE-XVQ
  - FP4 Training (arXiv 2501.17116): Gradient flow through FP4

Speculative Decoding:
  - EAGLE-3 (Li et al.): Draft-then-verify with a shared lm_head
  - Scylla (arXiv 2505.07858): Tree-based multi-candidate verification -> FE-XT
  - Medusa: Multi-head parallel drafting

MoE Optimization:
  - SP-MoE (arXiv 2510.10302): Async expert prefetch -> FE-H Hayabusa
  - MoE-Inference-Bench: Expert sizing analysis

Hebbian/Neuroscience:
  - Lansner BCPNN: Bayesian confidence propagation neural networks
  - Triesch 2005: Intrinsic plasticity
  - Sanger's GHA: Generalized Hebbian algorithm
  - McClelland et al. 1995: Complementary learning systems

Tensor Decomposition:
  - MPS/TT decomposition: Quantum-inspired weight compression

================================================================================
WHERE TO GET HELP
================================================================================

GitHub Issues: https://github.com/Joysulem/FireEcho/issues
  Include: GPU model, CUDA version, PyTorch version, full error traceback

X / Twitter: @Joysulem
  Tag me with questions, benchmarks, or usage reports

Email: floresluise1988@gmail.com

================================================================================
LICENSE
================================================================================

Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

You are free to:
  - Share: copy and redistribute the material in any medium or format
  - Adapt: remix, transform, and build upon the material

Under the following terms:
  - Attribution: You must give appropriate credit to Luis E. Davila Flores,
    provide a link to the license, and indicate if changes were made.
  - NonCommercial: You may not use the material for commercial purposes.

Full license: https://creativecommons.org/licenses/by-nc/4.0/

For commercial licensing inquiries, contact: @Joysulem on X/Twitter

================================================================================
FireEcho Engine — Created by Luis E. Davila Flores
"One GPU. One file. One import. Full pipeline."
================================================================================
FireEcho Engine/__pycache__/cutlass_kernels.cpython-312.pyc  ADDED
    @@ -0,0 +1,3 @@
    version https://git-lfs.github.com/spec/v1
    oid sha256:a4baeab19c5823d68cfa41ebbb0754cf7aeedc546d25247acdfcef8b75c5c383
    size 104083

FireEcho Engine/__pycache__/dsmem_ops.cpython-312.pyc  ADDED
    Binary file (26.1 kB)

FireEcho Engine/__pycache__/femx_storage.cpython-312.pyc  ADDED
    Binary file (21.7 kB)

FireEcho Engine/__pycache__/fireecho_kernel.cpython-312.pyc  ADDED
    @@ -0,0 +1,3 @@
    version https://git-lfs.github.com/spec/v1
    oid sha256:34de898847f5fd027b2726515d35b46da6c694ca651a99f827992062af8b4b7f
    size 703662

FireEcho Engine/__pycache__/goliath_kernel.cpython-312.pyc  ADDED
    @@ -0,0 +1,3 @@
    version https://git-lfs.github.com/spec/v1
    oid sha256:996c50c408ca615417071331d98d070fa0557d35ef1f63fff51792ba27ae84fb
    size 126662

FireEcho Engine/__pycache__/hebbian_finetune_demo.cpython-312.pyc  ADDED
    @@ -0,0 +1,3 @@
    version https://git-lfs.github.com/spec/v1
    oid sha256:68631075853c27682a27cc8d2d202408148f220de706cede76ddd77cf371ff84
    size 148146

FireEcho Engine/__pycache__/triton_hebbian.cpython-312.pyc  ADDED
    Binary file (33.9 kB)
FireEcho Engine/bench_fusion.py  ADDED
@@ -0,0 +1,39 @@
#!/usr/bin/env python3
"""
FireEcho Fusion Benchmark — Goliath vs Legacy FFN
===================================================
Part of the FireEcho Engine — Custom inference kernel for NVIDIA Blackwell
Copyright (c) 2025-2026 Echo (FireEcho Project). All rights reserved.

Quick benchmark: Goliath fusion vs legacy in FusedFFN.
"""
import torch, time
from fireecho_kernel import FusedFFN, _GOLIATH_AVAILABLE, _CUTLASS_AVAILABLE

print("GPU:", torch.cuda.get_device_name(0))
print("Goliath:", _GOLIATH_AVAILABLE, " CUTLASS:", _CUTLASS_AVAILABLE)
print()

dim, ffn_dim, B, S = 4096, 11008, 4, 64
x = torch.randn(B, S, dim, device="cuda", dtype=torch.bfloat16)
warmup, iters = 10, 50
total_flops = 3 * 2 * B * S * dim * ffn_dim

for name, bits, goliath in [
    ("Goliath FP4", 4, True),
    ("Goliath FP8", 8, True),
    ("Legacy quant", 4, False),
    ("BF16 no-quant", 4, False),
]:
    use_q = name != "BF16 no-quant"
    ffn = FusedFFN(dim, ffn_dim, use_nvfp4=use_q, goliath_bits=bits, use_goliath=goliath).cuda().eval()
    with torch.no_grad():
        for _ in range(warmup):
            ffn(x)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            ffn(x)
        torch.cuda.synchronize()
        t = (time.perf_counter() - t0) / iters
    print(f"  {name:16s}: {t*1000:.2f}ms  ({total_flops/t/1e12:.1f} TFLOPS)")
FireEcho Engine/benchmark_eagle.py  ADDED
@@ -0,0 +1,231 @@
#!/usr/bin/env python3
# =============================================================================
# Copyright (c) 2024-2026 Luis E. Davila Flores. All rights reserved.
#
# FireEcho Engine — High-Performance Inference Kernel
# Creator & Sole Author: Luis E. Davila Flores
#
# Licensed under Creative Commons Attribution-NonCommercial 4.0 International
# (CC BY-NC 4.0). You may share and adapt this work for non-commercial
# purposes with proper attribution. Full license terms:
# https://creativecommons.org/licenses/by-nc/4.0/
# =============================================================================
"""
FireEcho EAGLE-3 Benchmark — Speculative vs Standard Decode
=============================================================
Part of the FireEcho Engine — Custom inference kernel for NVIDIA Blackwell
Copyright (c) 2025-2026 Echo (FireEcho Project). All rights reserved.

Benchmark EAGLE-3 speculative decoding vs standard decode.

Compares:
  1. Standard generate() (baseline tok/s)
  2. Speculative generate() with trained EAGLE head
  3. Reports acceptance rate, speedup, tok/s

Usage:
    PYTHONUNBUFFERED=1 python benchmark_eagle.py [--checkpoint eagle_best.pt]
"""

import sys, os, time, argparse, torch
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

from hebbian_finetune_demo import load_engine

MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
EAGLE_DIR = os.path.join(os.path.dirname(__file__), "eagle_checkpoints")

TEST_PROMPTS = [
    "Explain the theory of general relativity in simple terms.",
    "Write a Python function to find the longest palindromic substring.",
    "What are the main differences between TCP and UDP protocols?",
    "Describe the process of photosynthesis step by step.",
    "What caused the fall of the Roman Empire?",
]


def load_benchmark_engine():
    """Load Qwen3-Omni with Goliath FP4 quantization via load_engine()."""
    print("=" * 60)
    print("Loading Qwen3-Omni engine...")
    print("=" * 60)

    engine, tokenizer, config = load_engine(
        MODEL_PATH, max_seq_len=4096, device="cuda",
    )
    engine.pack_all_experts()
    engine.kv_cache.enable_flat_decode()
    engine.eval()

    return engine, tokenizer


def benchmark_standard(engine, tokenizer, prompts, max_tokens=100, warmup=2):
    """Benchmark standard generate()."""
    print("\n" + "=" * 60)
    print("Benchmark: Standard generate()")
    print("=" * 60)

    # Warmup
    for i in range(warmup):
        ids = tokenizer.encode(prompts[0], return_tensors='pt').cuda()
        engine.generate(ids, max_new_tokens=20, temperature=0.0, top_k=0, top_p=1.0)
        print(f"  Warmup {i+1}/{warmup}")

    results = []
    for prompt in prompts:
        input_ids = tokenizer.encode(prompt, return_tensors='pt').cuda()
        prompt_len = input_ids.shape[1]

        torch.cuda.synchronize()
        t0 = time.perf_counter()

        output = engine.generate(
            input_ids, max_new_tokens=max_tokens, temperature=0.0,
            top_k=0, top_p=1.0)  # Pure greedy for fair comparison

        torch.cuda.synchronize()
        elapsed = time.perf_counter() - t0

        gen_len = output.shape[1] - prompt_len
        tok_s = gen_len / elapsed

        text = tokenizer.decode(output[0, prompt_len:], skip_special_tokens=True)
        results.append({
            'prompt': prompt[:50],
            'gen_len': gen_len,
            'elapsed': elapsed,
            'tok_s': tok_s,
        })
        print(f"  [{gen_len:3d} tok] {tok_s:6.1f} tok/s | {prompt[:50]}...")

    avg_tok_s = sum(r['tok_s'] for r in results) / len(results)
    avg_gen = sum(r['gen_len'] for r in results) / len(results)
    print(f"\n  Standard avg: {avg_tok_s:.1f} tok/s, {avg_gen:.0f} tokens/prompt")
    return avg_tok_s, results


def benchmark_speculative(engine, tokenizer, prompts, checkpoint_path,
                          max_tokens=100, warmup=2, draft_depth=5,
                          num_head_layers=2):
    """Benchmark speculative generate() with EAGLE head."""
    print("\n" + "=" * 60)
    print(f"Benchmark: Speculative generate() (depth={draft_depth}, D={num_head_layers})")
    print(f"  Checkpoint: {os.path.basename(checkpoint_path)}")
    print("=" * 60)

    # Enable EAGLE
    engine.enable_eagle(capture_layers=(8, 24, 47), draft_depth=draft_depth,
                        num_head_layers=num_head_layers)

    # Load checkpoint to CPU first (avoid OOM from double-loading to GPU)
    ckpt = torch.load(checkpoint_path, weights_only=False, map_location='cpu')
    engine.eagle_head.load_state_dict(ckpt['eagle_head'], strict=False)
    step = ckpt.get('step', '?')
    loss = ckpt.get('loss', '?')
    del ckpt  # Free CPU copy immediately
    print(f"  Loaded step {step}, loss={loss}")

    # Warmup (also warms Triton kernels)
    for i in range(warmup):
        ids = tokenizer.encode(prompts[0], return_tensors='pt').cuda()
        engine.speculative_generate(ids, max_new_tokens=20, temperature=0.0)
        print(f"  Warmup {i+1}/{warmup}")

    results = []
    total_drafted = 0
    total_accepted = 0

    for prompt in prompts:
        input_ids = tokenizer.encode(prompt, return_tensors='pt').cuda()
        prompt_len = input_ids.shape[1]

        torch.cuda.synchronize()
        t0 = time.perf_counter()

        output = engine.speculative_generate(
            input_ids, max_new_tokens=max_tokens, temperature=0.0,
            draft_depth=draft_depth)

        torch.cuda.synchronize()
        elapsed = time.perf_counter() - t0

        gen_len = output.shape[1] - prompt_len
        tok_s = gen_len / elapsed

        results.append({
            'prompt': prompt[:50],
            'gen_len': gen_len,
            'elapsed': elapsed,
            'tok_s': tok_s,
        })
        print(f"  [{gen_len:3d} tok] {tok_s:6.1f} tok/s | {prompt[:50]}...")

    avg_tok_s = sum(r['tok_s'] for r in results) / len(results)
    avg_gen = sum(r['gen_len'] for r in results) / len(results)
    print(f"\n  Speculative avg: {avg_tok_s:.1f} tok/s, {avg_gen:.0f} tokens/prompt")
    return avg_tok_s, results


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--checkpoint', default='eagle_best.pt',
                        help='EAGLE checkpoint filename')
    parser.add_argument('--max-tokens', type=int, default=100)
    parser.add_argument('--warmup', type=int, default=3)
    parser.add_argument('--depth', type=int, default=5)
    parser.add_argument('--num_head_layers', type=int, default=2,
                        help='Number of layers in eagle head (D)')
    args = parser.parse_args()

    checkpoint_path = os.path.join(EAGLE_DIR, args.checkpoint)
    if not os.path.exists(checkpoint_path):
        print(f"ERROR: Checkpoint not found: {checkpoint_path}")
        sys.exit(1)

    # Load engine + tokenizer
    engine, tokenizer = load_benchmark_engine()

    # Benchmark standard
    std_tok_s, std_results = benchmark_standard(
        engine, tokenizer, TEST_PROMPTS,
        max_tokens=args.max_tokens, warmup=args.warmup)

    # Benchmark speculative
    spec_tok_s, spec_results = benchmark_speculative(
        engine, tokenizer, TEST_PROMPTS, checkpoint_path,
        max_tokens=args.max_tokens, warmup=args.warmup,
        draft_depth=args.depth,
        num_head_layers=args.num_head_layers)

    # Also try depth=3 (less wasted compute with low acceptance)
    spec3_tok_s, _ = benchmark_speculative(
        engine, tokenizer, TEST_PROMPTS, checkpoint_path,
        max_tokens=args.max_tokens, warmup=1,
        draft_depth=3,
        num_head_layers=args.num_head_layers)

    # Read checkpoint step for summary
    ckpt_info = torch.load(checkpoint_path, weights_only=False, map_location='cpu')
    ckpt_step = ckpt_info.get('step', '?')
    del ckpt_info

    # Summary
    print("\n" + "=" * 60)
    print("SUMMARY")
    print("=" * 60)
    print(f"  Standard generate():    {std_tok_s:6.1f} tok/s")
    print(f"  Speculative (depth=5):  {spec_tok_s:6.1f} tok/s "
          f"({spec_tok_s/std_tok_s:.2f}x)")
    print(f"  Speculative (depth=3):  {spec3_tok_s:6.1f} tok/s "
          f"({spec3_tok_s/std_tok_s:.2f}x)")
    print(f"  Checkpoint: {args.checkpoint} (step {ckpt_step})")
    print("=" * 60)

    # VRAM
    vram_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"  Peak VRAM: {vram_gb:.2f} GB")


if __name__ == '__main__':
    main()
FireEcho Engine/benchmark_fullstack.py
ADDED
|
@@ -0,0 +1,323 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
#!/usr/bin/env python3
# =============================================================================
# Copyright (c) 2024-2026 Luis E. Davila Flores. All rights reserved.
#
# FireEcho Engine — High-Performance Inference Kernel
# Creator & Sole Author: Luis E. Davila Flores
#
# Licensed under Creative Commons Attribution-NonCommercial 4.0 International
# (CC BY-NC 4.0). You may share and adapt this work for non-commercial
# purposes with proper attribution. Full license terms:
# https://creativecommons.org/licenses/by-nc/4.0/
# =============================================================================
"""
FireEcho Full-Stack Benchmark — Path B: Every Optimization Stacked
===================================================================
Part of the FireEcho Engine — Custom inference kernel for NVIDIA Blackwell
Copyright (c) 2025-2026 Echo (FireEcho Project). All rights reserved.

Stacks ALL FireEcho architecture optimizations and benchmarks each layer:

Already in baseline:
- Goliath FP4 packed MoE (dequant-matmul Triton kernels)
- Fused SwiGLU+Down (1 kernel launch, not 3)
- FlashDecode attention (Triton online softmax)
- Flat KV cache (zero torch.cat, pre-allocated)

Layer 0: Baseline (all above) — current ~37 tok/s
Layer 1: + FP8 KV cache (half attention bandwidth)
Layer 2: + L2 prefetch (next layer pre-staged in L2 cache)
Layer 3: + Atlas Ban & Pick + MoDES (8→~5 experts + skip easy tokens)
Layer 4: + FE-XC cold expert demotion (5.3x faster 2-bit codebook kernel)
Layer 5: + INT2 coldest expert demotion (2-bit scalar)
Layer 6: + CUDA Graph decode (zero Python overhead, single graph replay)

Target: 15.8ms → ~8ms base forward = 125+ tok/s (no speculation)

Usage:
    PYTHONUNBUFFERED=1 python benchmark_fullstack.py
"""

import sys, os, time, argparse, torch
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

from hebbian_finetune_demo import load_engine

MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"

TEST_PROMPTS = [
    "Explain the theory of general relativity in simple terms.",
    "Write a Python function to find the longest palindromic substring.",
    "What are the main differences between TCP and UDP protocols?",
    "Describe the process of photosynthesis step by step.",
    "What caused the fall of the Roman Empire?",
    "How does a compiler optimize code?",
    "Explain how public key cryptography works.",
    "What is the difference between a stack and a queue?",
]


def benchmark_generate(engine, tokenizer, prompts, max_tokens=100, warmup=3,
                       label="Standard"):
    """Benchmark generate() with current engine config."""
    print(f"\n{'=' * 60}")
    print(f"Benchmark: {label}")
    print(f"{'=' * 60}")

    # Warmup (critical for Triton kernel compilation)
    for i in range(warmup):
        ids = tokenizer.encode(prompts[0], return_tensors='pt').cuda()
        engine.generate(ids, max_new_tokens=20, temperature=0.0, top_k=0, top_p=1.0)
        print(f"  Warmup {i+1}/{warmup}")

    results = []
    for prompt in prompts:
        input_ids = tokenizer.encode(prompt, return_tensors='pt').cuda()
        prompt_len = input_ids.shape[1]

        torch.cuda.synchronize()
        t0 = time.perf_counter()

        output = engine.generate(
            input_ids, max_new_tokens=max_tokens, temperature=0.0,
            top_k=0, top_p=1.0)

        torch.cuda.synchronize()
        elapsed = time.perf_counter() - t0

        gen_len = output.shape[1] - prompt_len
        tok_s = gen_len / elapsed

        results.append({
            'prompt': prompt[:50],
            'gen_len': gen_len,
            'elapsed': elapsed,
            'tok_s': tok_s,
        })
        print(f"  [{gen_len:3d} tok] {tok_s:6.1f} tok/s | {prompt[:50]}...")

    avg_tok_s = sum(r['tok_s'] for r in results) / len(results)
    avg_gen = sum(r['gen_len'] for r in results) / len(results)
    print(f"\n  >> {label}: {avg_tok_s:.1f} tok/s avg, {avg_gen:.0f} tokens/prompt")
    return avg_tok_s


def main():
    parser = argparse.ArgumentParser(description="FireEcho Full-Stack Benchmark")
    parser.add_argument('--max-tokens', type=int, default=200)
    parser.add_argument('--warmup', type=int, default=3)
    parser.add_argument('--atlas-prompts', type=int, default=50,
                        help='Number of prompts for Atlas profiling')
    parser.add_argument('--ban-ratio', type=float, default=0.25,
                        help='Atlas Ban & Pick: fraction of experts to ban')
    parser.add_argument('--modes-threshold', type=float, default=2.0,
                        help='Atlas MoDES: multiplier on uniform baseline (2.0 = skip when max_prob < 2/128)')
    parser.add_argument('--fexc-cold-pct', type=float, default=0.10,
                        help='FE-XC: fraction of experts to demote to 2-bit codebook')
    parser.add_argument('--int2-cold-pct', type=float, default=0.05,
                        help='INT2: fraction of coldest experts to demote to 2-bit scalar')
    args = parser.parse_args()

    summary = {}

    # =====================================================================
    # Load engine — baseline config (Goliath FP4 + packed MoE + flat KV BF16)
    # =====================================================================
    print("=" * 60)
    print("FireEcho Full-Stack Benchmark — Path B")
    print("Stacking ALL optimizations, measuring each layer")
    print("=" * 60)
    print("\nLoading Qwen3-Omni engine...")

    engine, tokenizer, config = load_engine(
        MODEL_PATH, max_seq_len=4096, device="cuda",
    )
    engine.pack_all_experts()
    engine.kv_cache.enable_flat_decode()  # BF16 flat KV (baseline)
    engine.eval()

    # Suppress FE-MX tier updates during benchmarking (prints + overhead kill GPU perf)
    # Set tier interval to effectively infinite so the modulo check never triggers
    for layer in engine.layers:
        if hasattr(layer, 'ffn'):
            layer.ffn._quiet = True
            layer.ffn.femx_tier_interval = 10_000_000  # Never trigger during benchmark

    vram_base = torch.cuda.max_memory_allocated() / 1e9
    print(f"  Base VRAM: {vram_base:.2f} GB")

    # =====================================================================
    # Layer 0: Baseline
    # =====================================================================
    tok_s = benchmark_generate(engine, tokenizer, TEST_PROMPTS,
                               max_tokens=args.max_tokens, warmup=args.warmup,
                               label="Layer 0: Baseline (FP4 + packed MoE + flat KV BF16)")
    summary['L0_baseline'] = tok_s

    # =====================================================================
    # Layer 1: FP8 KV cache
    # =====================================================================
    print("\n>> Enabling FP8 KV cache...")
    engine.kv_cache.enable_flat_decode(kv_dtype='fp8')
    print("  [FP8 KV] Enabled — 50% attention bandwidth reduction")

    tok_s = benchmark_generate(engine, tokenizer, TEST_PROMPTS,
                               max_tokens=args.max_tokens, warmup=args.warmup,
                               label="Layer 1: + FP8 KV cache")
    summary['L1_fp8_kv'] = tok_s

    # =====================================================================
    # Layer 2: L2 prefetch
    # =====================================================================
    print("\n>> Enabling L2 layer-ahead prefetch...")
    engine.enable_l2_prefetch()

    tok_s = benchmark_generate(engine, tokenizer, TEST_PROMPTS,
                               max_tokens=args.max_tokens, warmup=args.warmup,
                               label="Layer 2: + L2 prefetch")
    summary['L2_l2_prefetch'] = tok_s

    # =====================================================================
    # Layer 3: Atlas Ban & Pick (requires profiling first)
    # =====================================================================
    print("\n>> Enabling Atlas the Gatekeeper (Ban & Pick)...")
    engine.enable_atlas(ban_threshold=0.01, modes_threshold=args.modes_threshold)
    engine.atlas_profile(tokenizer, num_prompts=args.atlas_prompts)
    engine.atlas_ban(ban_ratio=args.ban_ratio)
    engine.atlas_stats()

    tok_s = benchmark_generate(engine, tokenizer, TEST_PROMPTS,
                               max_tokens=args.max_tokens, warmup=args.warmup,
                               label="Layer 3: + Atlas Ban & Pick (8→~5 experts)")
    summary['L3_atlas_ban'] = tok_s

    # =====================================================================
    # Layer 4: FE-XC cold expert demotion
    # =====================================================================
    print("\n>> Enabling FE-XC cold expert demotion...")
    engine.enable_auto_fexc_demotion(cold_threshold_pct=args.fexc_cold_pct)

    # Build up expert usage statistics with enough tokens to establish cold/hot
    # Need usage > femx_cold_threshold(50) for hot experts, so run ~1000 tokens
    print("  Building expert usage statistics (8 prompts × 50 tokens)...")
    for prompt in TEST_PROMPTS:
        ids = tokenizer.encode(prompt, return_tensors='pt').cuda()
        with torch.no_grad():
            engine.generate(ids, max_new_tokens=50, temperature=0.0,
                            top_k=0, top_p=1.0)

    # Trigger tier updates + FE-XC demotion on each MoE layer
    # This may take a few seconds as codebooks are learned per-layer
    print("  Triggering FE-XC demotion (learning codebooks)...")
    fexc_count = 0
    for layer in engine.layers:
        if hasattr(layer.ffn, 'update_expert_tiers'):
            layer.ffn.update_expert_tiers()
        if hasattr(layer.ffn, '_expert_is_fexc'):
            fexc_count += layer.ffn._expert_is_fexc.sum().item()
    print(f"  [FE-XC] {fexc_count} total experts demoted across all layers")

    tok_s = benchmark_generate(engine, tokenizer, TEST_PROMPTS,
                               max_tokens=args.max_tokens, warmup=args.warmup,
                               label="Layer 4: + FE-XC cold experts (2-bit codebook)")
    summary['L4_fexc'] = tok_s

    # =====================================================================
    # Layer 5: INT2 coldest expert demotion (three-way: FP4/FE-XC/INT2)
    # =====================================================================
    print("\n>> Enabling INT2 coldest expert demotion...")
    engine.enable_auto_int2_demotion(cold_threshold_pct=args.int2_cold_pct)

    # Trigger tier update to demote coldest experts to INT2
    int2_count = 0
    for layer in engine.layers:
        if hasattr(layer.ffn, 'update_expert_tiers'):
            layer.ffn.update_expert_tiers()
        if hasattr(layer.ffn, '_expert_is_int2'):
            int2_count += layer.ffn._expert_is_int2.sum().item()
    print(f"  [INT2] {int2_count} coldest experts demoted across all layers")

    tok_s = benchmark_generate(engine, tokenizer, TEST_PROMPTS,
                               max_tokens=args.max_tokens, warmup=args.warmup,
                               label="Layer 5: + INT2 coldest experts (2-bit scalar)")
    summary['L5_int2'] = tok_s

    # =====================================================================
    # Layer 6: CUDA Graph decode (captures entire 48-layer forward as one graph)
    # Must be LAST — captures the current state of all optimizations
    # =====================================================================
    print("\n>> Enabling CUDA Graph decode...")
    engine.enable_cuda_graph_decode(max_seq_len=4096)
    print("  [CUDA Graph] Capturing full 48-layer decode as single graph replay")

    tok_s = benchmark_generate(engine, tokenizer, TEST_PROMPTS,
                               max_tokens=args.max_tokens, warmup=args.warmup + 2,
                               label="Layer 6: + CUDA Graph (zero Python overhead)")
    summary['L6_cuda_graph'] = tok_s

    # =====================================================================
    # SUMMARY
    # =====================================================================
    vram_final = torch.cuda.max_memory_allocated() / 1e9
    final_key = 'L6_cuda_graph'

    print("\n" + "=" * 70)
    print("FIREECHO FULL-STACK BENCHMARK SUMMARY")
    print("=" * 70)
    print()
    print("  Components already in baseline:")
    print("    - Goliath FP4 packed MoE (Triton dequant-matmul)")
    print("    - Fused SwiGLU+Down (1 kernel launch per expert)")
    print("    - FlashDecode attention (Triton online softmax)")
    print("    - Flat KV cache (zero torch.cat)")
    print()
    print(f"  {'Layer':<55s} {'tok/s':>8s} {'vs base':>8s}")
    print(f"  {'-'*55} {'-'*8} {'-'*8}")

    base = summary['L0_baseline']
    display_order = [
        ('L0_baseline', 'Baseline (Goliath FP4 + packed MoE + fused SwiGLU)'),
        ('L1_fp8_kv', '+ FP8 KV cache (half attention bandwidth)'),
        ('L2_l2_prefetch', '+ L2 layer-ahead prefetch'),
        ('L3_atlas_ban', '+ Atlas Ban & Pick + MoDES (FE-AGK)'),
        ('L4_fexc', '+ FE-XC cold expert demotion (2-bit codebook)'),
        ('L5_int2', '+ INT2 coldest experts (2-bit scalar)'),
        ('L6_cuda_graph', '+ CUDA Graph decode (zero Python overhead)'),
    ]

    for key, name in display_order:
        val = summary[key]
        speedup = val / base if base > 0 else 0
        print(f"  {name:<55s} {val:>7.1f} {speedup:>6.2f}x")

    final = summary[final_key]
    print(f"\n  Base VRAM: {vram_base:.2f} GB")
    print(f"  Peak VRAM: {vram_final:.2f} GB")
    print(f"  Total speedup: {final / base:.2f}x over baseline")
    print(f"\n  Baseline forward: ~{1000/base:.1f}ms/token")
    print(f"  Full-stack forward: ~{1000/final:.1f}ms/token")
    print(f"\n  With 50% speculation acceptance: ~{final * 6:.0f} tok/s (est.)")
    print(f"  With 70% speculation acceptance: ~{final * 8:.0f} tok/s (est.)")
    print("=" * 70)

    # Save results
    results_path = os.path.join(os.path.dirname(__file__), "fullstack_benchmark_results.txt")
    with open(results_path, 'w') as f:
        f.write("FireEcho Full-Stack Benchmark Results\n")
        f.write(f"Date: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")
        f.write("GPU: RTX 5090 32GB\n\n")
        f.write("Components in baseline:\n")
        f.write("  Goliath FP4 packed MoE, Fused SwiGLU+Down,\n")
        f.write("  FlashDecode attention, Flat KV cache\n\n")
        for key, name in display_order:
            val = summary[key]
            speedup = val / base
            f.write(f"{name}: {val:.1f} tok/s ({speedup:.2f}x)\n")
        f.write(f"\nBaseline: {base:.1f} tok/s\n")
        f.write(f"Full-stack: {final:.1f} tok/s\n")
        f.write(f"Speedup: {final/base:.2f}x\n")
        f.write(f"Peak VRAM: {vram_final:.2f} GB\n")
    print(f"\n  Results saved to: {results_path}")


if __name__ == '__main__':
    main()
FireEcho Engine/benchmark_perplexity.py
ADDED
@@ -0,0 +1,358 @@
#!/usr/bin/env python3
"""Perplexity benchmark for FireEcho quantization formats.

Evaluates WikiText-2 perplexity across quantization configs:
  1. FP4 baseline (Goliath FP4, all experts)
  2. FE-XC 10% cold (codebook 2-bit, plain k-means)
  3. FE-XVQ 10% cold (codebook 2-bit, Hessian-weighted k-means)
  4. INT2 10% cold (scalar 2-bit)

Each config runs in a SEPARATE SUBPROCESS to guarantee a clean CUDA context
(PyTorch's memory allocator doesn't fully release between del+gc.collect).

Usage:
    python benchmark_perplexity.py [--max_tokens 50000] [--stride 256]

Output: PPL comparison table suitable for the paper.

Copyright (c) 2025-2026 Echo (FireEcho Project). All rights reserved.
"""

import sys
import os
import time
import math
import json
import argparse
import subprocess
import tempfile

sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

MODEL_DIR = '/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct'
FEXVQ_CODEBOOKS = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                               'fexvq_codebooks.pt')
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))


# ===== Worker code (runs in subprocess) =====

def run_single_config(config, max_tokens, stride, max_len, cold_pct, result_file):
    """Run a single config evaluation. Called in subprocess."""
    import torch
    import torch.nn.functional as F

    sys.path.insert(0, SCRIPT_DIR)

    print(f"\n{'=' * 70}")
    print(f" Config: {config.upper()}")
    print(f"{'=' * 70}")

    # Load model
    from fireecho_kernel import FireEchoEngine
    from transformers import AutoTokenizer

    print("[1] Loading model...")
    engine = FireEchoEngine.from_pretrained(MODEL_DIR)
    engine.pack_all_experts()
    engine.eval()
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, trust_remote_code=True)

    # Load WikiText-2
    from datasets import load_dataset
    print("  Loading WikiText-2 test set...")
    ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    text = "\n\n".join([t for t in ds["text"] if t.strip()])
    print(f"  Text length: {len(text):,} chars")
    tokens = tokenizer.encode(text, add_special_tokens=False)
    if max_tokens > 0 and len(tokens) > max_tokens:
        tokens = tokens[:max_tokens]
    print(f"  Tokenized: {len(tokens):,} tokens")
    token_ids = torch.tensor(tokens, dtype=torch.long)

    # Warmup usage counters
    warmup_prompts = [
        "Explain how neural networks learn from data.",
        "Write a Python function that sorts a list.",
        "What are the main causes of climate change?",
        "Describe the architecture of a transformer.",
        "How does public key cryptography work?",
        "What is the halting problem?",
        "Explain quantum computing simply.",
        "Write a recursive Fibonacci function.",
        "What are the fundamental forces in physics?",
        "How does the human immune system work?",
        "Describe the process of photosynthesis.",
        "What is the P vs NP problem?",
        "How does GPS determine your location?",
        "Explain machine learning overfitting.",
        "What are design patterns in software?",
        "How do search engines rank pages?",
        "Describe the lifecycle of a star.",
        "What is Shannon's information theory?",
        "How do operating systems manage memory?",
        "Explain the CAP theorem.",
    ]
    print("  Warming up expert usage (20 prompts)...")
    for prompt in warmup_prompts:
        ids = tokenizer.encode(prompt, return_tensors='pt').cuda()
        engine.reset_cache()
        engine._current_seq_id = 0
        engine.generate(ids, max_new_tokens=32, temperature=0.0)

    ffn = engine.layers[0].ffn
    if hasattr(ffn, 'expert_usage'):
        usage = ffn.expert_usage
        top5 = usage.topk(5)
        bot5 = usage.topk(5, largest=False)
        print(f"  Layer 0 usage: top5={top5.values.tolist()}, bot5={bot5.values.tolist()}")

    # Apply quantization config
    if config == 'fp4':
        print("  [FP4 baseline — no demotion]")
    elif config == 'fexc':
        engine.enable_auto_fexc_demotion(cold_threshold_pct=cold_pct)
        total = 0
        for layer in engine.layers:
            layer.ffn._maybe_demote_to_fexc()
            if hasattr(layer.ffn, '_expert_is_fexc'):
                total += layer.ffn._expert_is_fexc.sum().item()
        print(f"  FE-XC demoted: {total} experts ({total // len(engine.layers)}/layer)")
    elif config == 'fexvq':
        if os.path.exists(FEXVQ_CODEBOOKS):
            print("  Loading pre-calibrated FE-XVQ codebooks...")
            ckpt = torch.load(FEXVQ_CODEBOOKS, weights_only=True)
            codebooks = ckpt['codebooks']
            engine.enable_auto_fexc_demotion(cold_threshold_pct=cold_pct)
            # Force init + inject Hessian-weighted codebooks BEFORE demotion
            for li, layer in enumerate(engine.layers):
                ffn_l = layer.ffn
                if not getattr(ffn_l, '_fexc_enabled', False):
                    ffn_l._init_fexc_buffers()
                if li in codebooks:
                    ffn_l.gu_codebooks = codebooks[li]['gate_up'].cuda().half()
                    ffn_l.dn_codebooks = codebooks[li]['down'].cuda().half()
            total = 0
            for layer in engine.layers:
                layer.ffn._maybe_demote_to_fexc()
                if hasattr(layer.ffn, '_expert_is_fexc'):
                    total += layer.ffn._expert_is_fexc.sum().item()
            print(f"  FE-XVQ demoted: {total} experts ({total // len(engine.layers)}/layer)")
        else:
            print(f"  ERROR: No pre-calibrated codebooks at {FEXVQ_CODEBOOKS}")
            with open(result_file, 'w') as f:
                json.dump({'error': 'no codebooks'}, f)
            return
    elif config == 'int2':
        engine.enable_auto_int2_demotion(cold_threshold_pct=cold_pct)
        total = 0
        for layer in engine.layers:
            layer.ffn._maybe_demote_to_int2()
            if hasattr(layer.ffn, '_expert_is_int2'):
                total += layer.ffn._expert_is_int2.sum().item()
        print(f"  INT2 demoted: {total} experts ({total // len(engine.layers)}/layer)")

    vram_gb = torch.cuda.memory_allocated() / 1e9
    print(f"  VRAM: {vram_gb:.1f} GB")

    # Evaluate perplexity
    print("\n  Evaluating perplexity...")
    t0 = time.time()

    total_nll = 0.0
    total_tokens = 0
    num_windows = 0
    seq_len = token_ids.shape[0]
    num_windows_total = max(1, (seq_len - max_len) // stride + 1)

    for begin in range(0, seq_len - 1, stride):
        end = min(begin + max_len, seq_len)
        input_ids = token_ids[begin:end].unsqueeze(0).cuda()

        engine.reset_cache()
        engine._current_seq_id = 0
        if hasattr(engine.kv_cache, '_graph_mode'):
            engine.kv_cache._graph_mode = False

        with torch.no_grad():
            logits = engine.forward(input_ids, use_cache=False)

        shift_logits = logits[:, :-1, :].contiguous()
        shift_labels = input_ids[:, 1:].contiguous()

        if begin > 0:
            overlap = max_len - stride
            shift_logits = shift_logits[:, overlap:, :]
            shift_labels = shift_labels[:, overlap:]

        if shift_labels.numel() == 0:
            continue

        loss = F.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1),
            reduction='sum'
        )

        total_nll += loss.item()
        total_tokens += shift_labels.numel()
        num_windows += 1

        if num_windows % 20 == 0 or num_windows == 1:
            elapsed = time.time() - t0
            current_ppl = math.exp(total_nll / total_tokens)
            tok_per_s = total_tokens / elapsed
            print(f"  Window {num_windows}/{num_windows_total}: "
                  f"PPL={current_ppl:.2f}, {total_tokens} tok, "
                  f"{tok_per_s:.0f} tok/s eval")

    elapsed = time.time() - t0
    ppl = math.exp(total_nll / total_tokens) if total_tokens > 0 else float('inf')
    print(f"  Final: PPL={ppl:.2f}, {total_tokens} tok, "
          f"{num_windows} windows, {elapsed:.1f}s")

    # Write result
    result = {
        'config': config,
        'ppl': ppl,
        'tokens': total_tokens,
        'vram_gb': vram_gb,
        'time_s': elapsed,
    }
    with open(result_file, 'w') as f:
        json.dump(result, f)


# ===== Main orchestrator =====

def main():
    parser = argparse.ArgumentParser(description='FireEcho Perplexity Benchmark')
    parser.add_argument('--max_tokens', type=int, default=50000,
                        help='Max tokens from WikiText-2 (default: 50000)')
    parser.add_argument('--stride', type=int, default=256,
                        help='Sliding window stride (default: 256)')
    parser.add_argument('--max_len', type=int, default=512,
                        help='Max context per window (default: 512)')
    parser.add_argument('--configs', type=str, default='fp4,fexc,fexvq,int2',
                        help='Comma-separated configs to test (default: fp4,fexc,fexvq,int2)')
    parser.add_argument('--cold_pct', type=float, default=0.10,
                        help='Fraction of experts to demote (default: 0.10)')
    parser.add_argument('--_worker', type=str, default=None,
                        help=argparse.SUPPRESS)  # Internal: run single config
    parser.add_argument('--_result_file', type=str, default=None,
                        help=argparse.SUPPRESS)
    args = parser.parse_args()

    # Worker mode: run single config in subprocess
    if args._worker:
        run_single_config(args._worker, args.max_tokens, args.stride,
                          args.max_len, args.cold_pct, args._result_file)
        return

    # Orchestrator mode: spawn subprocesses
    configs = [c.strip() for c in args.configs.split(',')]

    print("=" * 70)
    print(" FireEcho Perplexity Benchmark")
    print(" WikiText-2 | Qwen3-Omni 30B MoE | RTX 5090")
    print("=" * 70)
    print(f" Max tokens: {args.max_tokens:,}")
    print(f" Window: {args.max_len}, stride: {args.stride}")
    print(f" Cold threshold: {args.cold_pct*100:.0f}%")
    print(f" Configs: {configs}")
    print(" Subprocess isolation: enabled (clean CUDA context per config)")

    results = {}
    script_path = os.path.abspath(__file__)
    python = sys.executable

    for config in configs:
        # Create temp file for result
        fd, result_file = tempfile.mkstemp(suffix='.json', prefix=f'ppl_{config}_')
        os.close(fd)

        try:
            cmd = [
                python, '-u', script_path,
                '--_worker', config,
                '--_result_file', result_file,
                '--max_tokens', str(args.max_tokens),
                '--stride', str(args.stride),
                '--max_len', str(args.max_len),
                '--cold_pct', str(args.cold_pct),
            ]
            ret = subprocess.run(cmd, cwd=SCRIPT_DIR)

            if ret.returncode != 0:
                print(f"\n  SUBPROCESS FAILED for {config.upper()} (exit code {ret.returncode})")
                results[config] = {'error': f'exit code {ret.returncode}'}
                continue

            # Read result
            with open(result_file) as f:
                r = json.load(f)
            results[config] = r
            if 'error' not in r:
                print(f"  >> {config.upper()}: PPL={r['ppl']:.2f}, "
                      f"VRAM={r['vram_gb']:.1f}G, {r['time_s']:.0f}s")

        except Exception as e:
            print(f"\n  ERROR launching {config.upper()}: {e}")
            results[config] = {'error': str(e)}
        finally:
            if os.path.exists(result_file):
                os.unlink(result_file)

    # === Results Table ===
    print(f"\n{'=' * 70}")
    print(" RESULTS — WikiText-2 Perplexity")
    print(f"{'=' * 70}")
    print(f"\n{'Config':<12} {'PPL':>8} {'Δ PPL':>8} {'VRAM':>8} {'Tokens':>10} {'bits/w':>7} {'Time':>7}")
    print('─' * 66)

    baseline_ppl = results.get('fp4', {}).get('ppl', None)
    for config in configs:
        if config not in results:
            continue
        r = results[config]
        if r.get('error'):
            print(f"{config.upper():<12} {'ERROR':>8} {'—':>8} {'—':>8} {'—':>10} {'—':>7} {'—':>7}")
            continue
        delta = f"{r['ppl'] - baseline_ppl:+.2f}" if baseline_ppl and config != 'fp4' else "—"
        bits = {'fp4': '4.0', 'fexc': '~2.2', 'fexvq': '~2.2', 'int2': '2.0'}.get(config, '?')
        time_s = f"{r.get('time_s', 0):.0f}s"
        print(f"{config.upper():<12} {r['ppl']:>8.2f} {delta:>8} {r['vram_gb']:>7.1f}G "
              f"{r['tokens']:>10,} {bits:>7} {time_s:>7}")

    # Ablation analysis: FE-XC vs FE-XVQ
    if (baseline_ppl and 'fexc' in results and 'fexvq' in results
            and not results['fexc'].get('error') and not results['fexvq'].get('error')):
        fexc_delta = results['fexc']['ppl'] - baseline_ppl
        fexvq_delta = results['fexvq']['ppl'] - baseline_ppl
        print("\n  Ablation: Hessian-weighted codebooks (FE-XVQ vs FE-XC)")
        print(f"    FE-XC (plain k-means): +{fexc_delta:.2f} PPL")
        print(f"    FE-XVQ (Hessian-weighted): +{fexvq_delta:.2f} PPL")
        if fexc_delta > 0:
            hessian_gain = (1 - fexvq_delta / fexc_delta) * 100
            print(f"    Hessian reduces {hessian_gain:.0f}% of codebook PPL degradation")

    # FE-XVQ vs INT2
    if (baseline_ppl and 'fexvq' in results and 'int2' in results
            and not results['fexvq'].get('error') and not results['int2'].get('error')):
        fexvq_delta = results['fexvq']['ppl'] - baseline_ppl
        int2_delta = results['int2']['ppl'] - baseline_ppl
        if int2_delta > 0:
            improvement = (1 - fexvq_delta / int2_delta) * 100
            print(f"\n  FE-XVQ recovers {improvement:.0f}% of INT2's PPL degradation")
            print("  (same 2-bit storage, codebook quality advantage)")

    # Note about BF16
    print("\n  Note: BF16 baseline omitted — Qwen3-Omni 30B BF16 = ~61GB,")
    print("  exceeds RTX 5090 32GB. FP4 (Goliath) is the practical baseline.")

    print(f"\n{'=' * 70}")


if __name__ == '__main__':
    main()
|
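The two ablation printouts in the results script share the same normalized-degradation arithmetic. A standalone sketch of that formula (the PPL numbers below are illustrative, not measured):

```python
def degradation_recovered(baseline_ppl, better_ppl, worse_ppl):
    """Percentage of the worse method's PPL degradation removed by the better one."""
    better_delta = better_ppl - baseline_ppl
    worse_delta = worse_ppl - baseline_ppl
    if worse_delta <= 0:
        raise ValueError("worse method must actually degrade PPL")
    return (1 - better_delta / worse_delta) * 100

# Illustrative values only: FP4 baseline 8.00, FE-XVQ 9.00, INT2 12.00
print(f"{degradation_recovered(8.0, 9.0, 12.0):.0f}%")  # -> 75%
```

A result of 100% would mean the better method matches the baseline exactly; 0% means no improvement over the worse method.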
FireEcho Engine/calibrate_fexc.py
ADDED
@@ -0,0 +1,173 @@
#!/usr/bin/env python3
"""FE-XC Offline Calibration — Learn codebooks for all 48 MoE layers.

Reads packed FP4 expert weights from a loaded FireEchoEngine, learns shared
codebooks per layer via residual k-means, then saves them to disk.

This is a one-time offline step (~2-5 minutes on GPU). The saved codebooks are
reused by enable_auto_fexc_demotion() during inference to demote cold experts.

Usage:
    python calibrate_fexc.py [--output fexc_codebooks.pt] [--sample_experts 8] [--n_iters 20]

Copyright (c) 2025-2026 Echo (FireEcho Project). All rights reserved.
"""

import sys
import os
import time
import argparse

sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

import torch
from goliath_kernel import GoliathFP4Weights, GoliathFEXCWeights


def calibrate_layer_codebooks(packed_w, packed_s, packed_ts, shape_K, shape_N,
                              sample_experts=8, n_iters=20, total_experts=128):
    """Learn shared codebooks for one projection type in one MoE layer.

    Args:
        packed_w: [E, K//2, N] uint8 — packed FP4 weights
        packed_s: [E, ...] — block scales
        packed_ts: [E] — tensor scales
        shape_K, shape_N: original weight dimensions
        sample_experts: number of experts to pool for k-means
        n_iters: k-means iterations
        total_experts: total number of experts in the layer

    Returns:
        codebooks: [2, 256, 8] float16 — shared codebooks for this projection
    """
    n_sample = min(sample_experts, total_experts)
    perm = torch.randperm(total_experts, device='cpu')[:n_sample]

    # Dequantize sampled experts and collect weight groups
    groups_list = []
    for e_idx in perm:
        fp4 = GoliathFP4Weights(
            packed=packed_w[e_idx],
            block_scales=packed_s[e_idx],
            tensor_scale=packed_ts[e_idx].item(),
            shape=(shape_K, shape_N),
        )
        w_float = fp4.to_float()  # [K, N] on GPU
        groups_list.append(w_float.view(-1, 8))  # [K*N/8, 8]

    # Pool all groups
    all_groups = torch.cat(groups_list, dim=0)  # [n_sample * K*N/8, 8]

    # Learn codebooks via GoliathFEXCWeights.from_float (residual k-means).
    # NOTE: from_float() expects a [K, N] matrix, so only the first sampled
    # expert is passed as the reference here; the pooled groups collected
    # above are not yet consumed by this path.
    ref_expert = GoliathFP4Weights(
        packed=packed_w[perm[0]],
        block_scales=packed_s[perm[0]],
        tensor_scale=packed_ts[perm[0]].item(),
        shape=(shape_K, shape_N),
    )
    fexc = GoliathFEXCWeights.from_float(ref_expert.to_float(), n_iters=n_iters)
    return fexc.codebooks  # [2, 256, 8] float16


def main():
    parser = argparse.ArgumentParser(description='FE-XC Codebook Calibration')
    parser.add_argument('--output', type=str, default='fexc_codebooks.pt',
                        help='Output path for codebooks (default: fexc_codebooks.pt)')
    parser.add_argument('--sample_experts', type=int, default=8,
                        help='Number of experts to sample per layer for k-means')
    parser.add_argument('--n_iters', type=int, default=20,
                        help='K-means iterations')
    parser.add_argument('--model_dir', type=str, default=None,
                        help='Model directory (default: auto-detect from config)')
    args = parser.parse_args()

    # Lazy import — heavy
    from fireecho_kernel import FireEchoEngine

    print("=" * 70)
    print("FE-XC Codebook Calibration")
    print("=" * 70)

    # Load engine (FP4 quantized)
    model_dir = args.model_dir
    if model_dir is None:
        # Default Qwen3-Omni path
        model_dir = '/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct'

    print(f"Loading model from: {model_dir}")
    engine = FireEchoEngine.from_pretrained(model_dir)
    engine.pack_all_experts()
    print(f"Model loaded. {len(engine.layers)} layers.")

    # Calibrate each MoE layer
    codebooks = {}  # layer_idx -> {'gate_up': [2,256,8], 'down': [2,256,8]}
    total_layers = len(engine.layers)
    t_start = time.time()

    for layer_idx, layer in enumerate(engine.layers):
        ffn = layer.ffn
        if not hasattr(ffn, 'packed_gu_w'):
            print(f"  Layer {layer_idx}: skipping (not MoE or not packed)")
            continue

        K_gu = ffn.packed_gu_w.shape[1] * 2
        N_gu = ffn.packed_gu_w.shape[2]
        K_dn = ffn.packed_dn_w.shape[1] * 2
        N_dn = ffn.packed_dn_w.shape[2]
        n_experts = ffn.packed_gu_w.shape[0]

        t0 = time.time()

        # gate_up codebooks
        gu_cb = calibrate_layer_codebooks(
            ffn.packed_gu_w, ffn.packed_gu_s, ffn.packed_gu_ts,
            K_gu, N_gu,
            sample_experts=args.sample_experts,
            n_iters=args.n_iters,
            total_experts=n_experts)

        # down codebooks
        dn_cb = calibrate_layer_codebooks(
            ffn.packed_dn_w, ffn.packed_dn_s, ffn.packed_dn_ts,
            K_dn, N_dn,
            sample_experts=args.sample_experts,
            n_iters=args.n_iters,
            total_experts=n_experts)

        codebooks[layer_idx] = {
            'gate_up': gu_cb.cpu(),
            'down': dn_cb.cpu(),
        }

        elapsed = time.time() - t0
        print(f"  Layer {layer_idx}/{total_layers}: "
              f"gate_up=[{K_gu}x{N_gu}] down=[{K_dn}x{N_dn}] "
              f"— {elapsed:.1f}s")

    total_time = time.time() - t_start
    print(f"\nCalibration complete: {len(codebooks)} layers in {total_time:.1f}s")

    # Save
    output_path = args.output
    if not os.path.isabs(output_path):
        output_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                                   output_path)
    torch.save({
        'codebooks': codebooks,
        'config': {
            'sample_experts': args.sample_experts,
            'n_iters': args.n_iters,
            'n_centroids': 256,
            'group_size': 8,
            'num_codebooks': 2,
        },
        'num_layers': len(codebooks),
    }, output_path)
    print(f"Saved codebooks to: {output_path}")
    print(f"File size: {os.path.getsize(output_path) / 1024:.1f} KB")


if __name__ == '__main__':
    main()
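`GoliathFEXCWeights.from_float` is referenced in calibrate_fexc.py but its internals are not part of this diff. A generic two-stage residual k-means over 8-wide weight groups — producing the same `[num_codebooks, n_centroids, group]` layout the script saves — can be sketched in NumPy as below (16 centroids instead of the production 256 to keep the sketch fast; the function names here are illustrative, not the engine's API):

```python
import numpy as np

def kmeans(x, k, n_iters=20, seed=0):
    """Plain Lloyd's k-means over the rows of x; returns [k, d] centroids."""
    rng = np.random.default_rng(seed)
    cent = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iters):
        dist = ((x[:, None, :] - cent[None, :, :]) ** 2).sum(-1)  # [n, k]
        assign = dist.argmin(1)
        for j in range(k):
            pts = x[assign == j]
            if len(pts):
                cent[j] = pts.mean(0)
    return cent

def residual_kmeans_codebooks(w, n_centroids=16, group=8, n_iters=10):
    """Stage 1: k-means on the groups. Stage 2: k-means on their residuals."""
    groups = w.reshape(-1, group)
    cb0 = kmeans(groups, n_centroids, n_iters)
    dist = ((groups[:, None, :] - cb0[None, :, :]) ** 2).sum(-1)
    resid = groups - cb0[dist.argmin(1)]          # what codebook 0 misses
    cb1 = kmeans(resid, n_centroids, n_iters)     # codebook 1 encodes it
    return np.stack([cb0, cb1])  # [2, n_centroids, group]

cbs = residual_kmeans_codebooks(np.random.default_rng(1).normal(size=(64, 32)))
print(cbs.shape)  # (2, 16, 8)
```

Each group is then stored as two 1-byte indices (one per codebook), which is where the ~2-bit-per-weight figure in the benchmark table comes from: 2 bytes per 8 weights.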
FireEcho Engine/calibrate_fexvq.py
ADDED
@@ -0,0 +1,227 @@
#!/usr/bin/env python3
"""FE-XVQ Calibration — Hessian-weighted codebook learning for all 48 MoE layers.

Runs calibration prompts through the model, collects Hessian diagonals
(input covariance) at each MoE layer, then learns Hessian-weighted codebooks
via GoliathFEXVQWeights. Saves codebooks to disk for later use.

This is a one-time offline step:
  1. Load model (~2 min)
  2. Run calibration prompts (~2-5 min for 50 prompts)
  3. Learn codebooks (~5-10 min on CPU)
  4. Save to fexvq_codebooks.pt

The codebooks can then be loaded by enable_auto_fexvq_demotion() during
inference to demote cold experts with Hessian-optimal quality.

Usage:
    python calibrate_fexvq.py [--output fexvq_codebooks.pt] [--n_prompts 50]

Copyright (c) 2025-2026 Echo (FireEcho Project). All rights reserved.
"""

import sys
import os
import time
import argparse

sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

import torch
from goliath_kernel import GoliathFP4Weights, GoliathFEXVQWeights

# Calibration prompts — diverse to capture a broad input distribution
CALIBRATION_PROMPTS = [
    "Explain the theory of general relativity in simple terms.",
    "Write a Python function to sort a list using quicksort.",
    "What are the main causes of climate change?",
    "Describe the process of photosynthesis step by step.",
    "How does a neural network learn from data?",
    "What is the difference between TCP and UDP protocols?",
    "Explain quantum computing to a 10 year old.",
    "Write a recursive function to compute Fibonacci numbers.",
    "What were the main events of World War II?",
    "How does the human immune system fight infections?",
    "Describe the architecture of a modern CPU.",
    "What is the significance of the Turing test?",
    "Explain how blockchain technology works.",
    "Write a Python class for a binary search tree.",
    "What are the fundamental forces in physics?",
    "How do vaccines work at the molecular level?",
    "Describe the water cycle and its importance.",
    "What is the P vs NP problem in computer science?",
    "Explain the concept of entropy in thermodynamics.",
    "How does natural language processing work?",
    "What are the principles of object-oriented programming?",
    "Describe the structure of DNA and how it replicates.",
    "What is the significance of Euler's identity?",
    "How do operating systems manage memory?",
    "Explain the concept of dark matter and dark energy.",
    "Write a function to find the shortest path in a graph.",
    "What are the key differences between Python and C++?",
    "How does the internet route packets between networks?",
    "Explain the CAP theorem in distributed systems.",
    "What is the role of mitochondria in cellular respiration?",
    "Describe how a compiler transforms source code to machine code.",
    "What are the main branches of mathematics?",
    "How do electric vehicles work compared to combustion engines?",
    "Explain the concept of recursion with examples.",
    "What is CRISPR and how does it edit genes?",
    "How does public key cryptography ensure security?",
    "Describe the lifecycle of a star from birth to death.",
    "What are design patterns in software engineering?",
    "How does the human brain process visual information?",
    "Explain the concept of containerization in DevOps.",
    "What are the ethical considerations of artificial intelligence?",
    "How do search engines rank web pages?",
    "Describe the process of protein folding.",
    "What is the halting problem and why is it important?",
    "How does GPS determine your location?",
    "Explain the concept of machine learning overfitting.",
    "What are the properties of prime numbers?",
    "How does a quantum computer differ from a classical computer?",
    "Describe the architecture of a transformer neural network.",
    "What is the significance of Shannon's information theory?",
]


def main():
    parser = argparse.ArgumentParser(description='FE-XVQ Hessian Codebook Calibration')
    parser.add_argument('--output', type=str, default='fexvq_codebooks.pt',
                        help='Output path for codebooks (default: fexvq_codebooks.pt)')
    parser.add_argument('--n_prompts', type=int, default=50,
                        help='Number of calibration prompts (default: 50)')
    parser.add_argument('--max_tokens', type=int, default=32,
                        help='Max tokens per calibration prompt (default: 32)')
    parser.add_argument('--n_iters', type=int, default=20,
                        help='K-means iterations (default: 20)')
    parser.add_argument('--model_dir', type=str, default=None,
                        help='Model directory')
    args = parser.parse_args()

    from fireecho_kernel import FireEchoEngine

    print("=" * 70)
    print("FE-XVQ Hessian Codebook Calibration")
    print("=" * 70)

    # Load engine
    model_dir = args.model_dir
    if model_dir is None:
        model_dir = '/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct'

    print(f"Loading model from: {model_dir}")
    engine = FireEchoEngine.from_pretrained(model_dir)
    engine.pack_all_experts()
    print(f"Model loaded. {len(engine.layers)} layers.")

    # Enable Hessian collection
    print(f"\n--- Phase 1: Collecting Hessian ({args.n_prompts} prompts) ---")
    engine.enable_auto_fexvq_demotion(cold_threshold_pct=0.10)

    # Tokenize and run calibration prompts
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

    prompts = CALIBRATION_PROMPTS[:args.n_prompts]
    t_start = time.time()

    for i, prompt in enumerate(prompts):
        input_ids = tokenizer.encode(prompt, return_tensors='pt').cuda()
        with torch.no_grad():
            engine.generate(input_ids, max_new_tokens=args.max_tokens, temperature=0.0)
        if (i + 1) % 10 == 0 or i == 0:
            elapsed = time.time() - t_start
            print(f"  Prompt {i+1}/{len(prompts)} ({elapsed:.1f}s)")

    calib_time = time.time() - t_start
    print(f"  Calibration done: {len(prompts)} prompts in {calib_time:.1f}s")

    # Report Hessian stats
    for li in [0, 1, len(engine.layers) - 1]:
        ffn = engine.layers[li].ffn
        h_gu, h_dn = ffn.get_hessian_diag()
        if h_gu is not None:
            print(f"  Layer {li}: Hessian gu samples={ffn._hessian_samples_gu}, "
                  f"mean={h_gu.mean():.4f}, max/min ratio={h_gu.max()/h_gu.min().clamp(min=1e-10):.1f}")

    # Learn Hessian-weighted codebooks for each layer
    print(f"\n--- Phase 2: Learning Hessian-weighted codebooks ---")
    codebooks = {}
    t_start = time.time()

    for layer_idx, layer in enumerate(engine.layers):
        ffn = layer.ffn
        if not hasattr(ffn, 'packed_gu_w'):
            continue

        goliath_K_gu = ffn.packed_gu_w.shape[1] * 2
        goliath_N_gu = ffn.packed_gu_w.shape[2]
        goliath_K_dn = ffn.packed_dn_w.shape[1] * 2
        goliath_N_dn = ffn.packed_dn_w.shape[2]

        h_gu, h_dn = ffn.get_hessian_diag()

        t0 = time.time()

        # gate_up codebooks (Hessian-weighted)
        perm = torch.randperm(ffn.num_experts)[:1]
        gu_ref = GoliathFEXVQWeights.from_float(
            GoliathFP4Weights(
                packed=ffn.packed_gu_w[perm[0]],
                block_scales=ffn.packed_gu_s[perm[0]],
                tensor_scale=ffn.packed_gu_ts[perm[0]].item(),
                shape=(goliath_K_gu, goliath_N_gu),
            ).to_float().T.contiguous().cpu(),
            hessian_diag=h_gu.cpu() if h_gu is not None else None,
            n_iters=args.n_iters)

        # down codebooks (Hessian-weighted)
        dn_ref = GoliathFEXVQWeights.from_float(
            GoliathFP4Weights(
                packed=ffn.packed_dn_w[perm[0]],
                block_scales=ffn.packed_dn_s[perm[0]],
                tensor_scale=ffn.packed_dn_ts[perm[0]].item(),
                shape=(goliath_K_dn, goliath_N_dn),
            ).to_float().T.contiguous().cpu(),
            hessian_diag=h_dn.cpu() if h_dn is not None else None,
            n_iters=args.n_iters)

        codebooks[layer_idx] = {
            'gate_up': gu_ref.codebooks.cpu(),
            'down': dn_ref.codebooks.cpu(),
            'hessian_diag_gu': h_gu.cpu() if h_gu is not None else None,
            'hessian_diag_dn': h_dn.cpu() if h_dn is not None else None,
        }

        elapsed = time.time() - t0
        if layer_idx % 8 == 0 or layer_idx == len(engine.layers) - 1:
            print(f"  Layer {layer_idx}/{len(engine.layers)}: {elapsed:.1f}s")

    total_time = time.time() - t_start
    print(f"\nCodebook learning complete: {len(codebooks)} layers in {total_time:.1f}s")

    # Save
    output_path = args.output
    if not os.path.isabs(output_path):
        output_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                                   output_path)
    torch.save({
        'codebooks': codebooks,
        'config': {
            'n_prompts': args.n_prompts,
            'max_tokens': args.max_tokens,
            'n_iters': args.n_iters,
            'n_centroids': 256,
            'group_size': 8,
            'num_codebooks': 2,
            'method': 'fexvq_hessian_weighted',
        },
        'num_layers': len(codebooks),
    }, output_path)
    print(f"Saved codebooks to: {output_path}")
    print(f"File size: {os.path.getsize(output_path) / 1024:.1f} KB")


if __name__ == '__main__':
    main()
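For a linear layer, the Hessian of the squared-error objective with respect to the weights is proportional to the input covariance E[x xᵀ], so the per-channel diagonal this script collects reduces to a running mean of x². The actual `ffn.get_hessian_diag()` internals are not in this diff; a minimal accumulator sketch (class and method names here are illustrative):

```python
import numpy as np

class HessianDiag:
    """Running estimate of diag(E[x x^T]) over calibration activations."""
    def __init__(self, dim):
        self.h = np.zeros(dim)
        self.n = 0

    def update(self, x):
        """x: [tokens, dim] activation batch feeding the quantized layer."""
        self.n += x.shape[0]
        self.h += (x ** 2).sum(0)   # accumulate per-channel sum of squares

    def diag(self):
        return self.h / max(self.n, 1)

acc = HessianDiag(4)
acc.update(np.array([[1.0, 2.0, 0.0, 1.0],
                     [3.0, 0.0, 0.0, 1.0]]))
print(acc.diag())  # [5. 2. 0. 1.]
```

Hessian-weighted k-means then scores a group against a codeword with Σⱼ hⱼ·(wⱼ − cⱼ)², so channels with large activation energy dominate the assignment, which is the "Hessian-optimal quality" claim in the docstring.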
FireEcho Engine/csrc/cluster_launch.cpp
ADDED
@@ -0,0 +1,53 @@
/**
 * FireEcho Kernel - SM120 Cluster Launch Implementation
 *
 * Compile with:
 *   nvcc -shared -o libfireecho_cluster.so cluster_launch.cpp \
 *     -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcuda -lcudart \
 *     --compiler-options '-fPIC' -arch=sm_120
 */

#include "cluster_launch.h"
#include <stdio.h>

namespace fireecho {

// Implementation of helper functions that need compilation

void print_cluster_info() {
    if (!supports_clusters()) {
        printf("Thread Block Clusters: NOT SUPPORTED\n");
        return;
    }

    ClusterProperties props = get_cluster_properties();

    printf("=== SM120 Thread Block Cluster Info ===\n");
    printf("Max Cluster Size:    %d\n", props.max_cluster_size);
    printf("Max Blocks/SM:       %d\n", props.max_blocks_per_sm);
    printf("Shared Memory/Block: %d KB\n", props.shared_memory_per_block / 1024);
    printf("Registers/Block:     %d\n", props.registers_per_block);
    printf("Distributed SMEM:    %s\n", props.supports_dshem ? "YES" : "NO");
    printf("========================================\n");
}

}  // namespace fireecho

// Standalone test
#ifdef TEST_CLUSTER_LAUNCH
int main() {
    // Initialize CUDA
    cudaSetDevice(0);

    fireecho::print_cluster_info();

    if (fireecho::supports_clusters()) {
        printf("\n✅ This GPU supports Thread Block Clusters!\n");
        printf("   Max cluster size: %d CTAs\n", fireecho::get_max_cluster_size());
    } else {
        printf("\n❌ This GPU does NOT support Thread Block Clusters.\n");
    }

    return 0;
}
#endif
FireEcho Engine/csrc/cluster_launch.h
ADDED
@@ -0,0 +1,194 @@
| 1 |
+
/**
|
| 2 |
+
* FireEcho Kernel - SM120 Thread Block Cluster Launcher
|
| 3 |
+
*
|
| 4 |
+
* Exposes true Thread Block Cluster APIs for Blackwell (SM 12.0)
|
| 5 |
+
* using the CUDA Driver API's cuLaunchKernelEx with cluster attributes.
|
| 6 |
+
*
|
| 7 |
+
* Requirements:
|
| 8 |
+
* - CUDA 12.8+ (for SM 12.0 support)
|
| 9 |
+
* - Triton 3.6.0+ compiled kernel (CUfunction)
|
| 10 |
+
* - Blackwell GPU (RTX 5090, SM 12.0)
|
| 11 |
+
*
|
| 12 |
+
* Features:
|
| 13 |
+
* - True hardware cluster launch (not just num_ctas hint)
|
| 14 |
+
* - Distributed Shared Memory (dSMEM) access
|
| 15 |
+
* - Cluster barriers for synchronization
|
| 16 |
+
*/
|
| 17 |
+
|
| 18 |
+
#ifndef FIREECHO_CLUSTER_LAUNCH_H
|
| 19 |
+
#define FIREECHO_CLUSTER_LAUNCH_H
|
| 20 |
+
|
| 21 |
+
#include <cuda.h>
|
| 22 |
+
#include <cuda_runtime.h>
|
| 23 |
+
#include <stdexcept>
|
| 24 |
+
#include <string>
|
| 25 |
+
|
| 26 |
+
namespace fireecho {
|
| 27 |
+
|
| 28 |
+
/**
|
| 29 |
+
* Cluster configuration for SM120 kernels.
|
| 30 |
+
*/
|
| 31 |
+
struct ClusterConfig {
|
| 32 |
+
int cluster_dim_x = 2; // Cluster size in X (typically 2 for 2-CTA MMA)
|
| 33 |
+
int cluster_dim_y = 1;
|
| 34 |
+
int cluster_dim_z = 1;
|
| 35 |
+
int max_registers = 240; // Cap for cluster occupancy
|
| 36 |
+
bool enable_dshem = true; // Enable distributed shared memory
|
| 37 |
+
};
|
| 38 |
+
|
| 39 |
+
/**
|
| 40 |
+
* Launch a Triton-compiled kernel with true SM120 cluster support.
|
| 41 |
+
*
|
| 42 |
+
* @param func The compiled CUfunction from Triton
|
| 43 |
+
* @param grid Grid dimensions (in clusters, not blocks)
|
| 44 |
+
* @param block Block dimensions
|
| 45 |
+
* @param args Kernel arguments
|
| 46 |
+
* @param config Cluster configuration
|
| 47 |
+
* @param stream CUDA stream (0 for default)
|
| 48 |
+
*/
|
| 49 |
+
inline CUresult launch_with_cluster(
|
| 50 |
+
CUfunction func,
|
| 51 |
+
dim3 grid,
|
| 52 |
+
dim3 block,
|
| 53 |
+
void** args,
|
| 54 |
+
const ClusterConfig& config = ClusterConfig(),
|
| 55 |
+
CUstream stream = 0
|
| 56 |
+
) {
|
| 57 |
+
// Set up cluster launch attributes for SM120
|
| 58 |
+
CUlaunchAttribute attrs[2];
|
| 59 |
+
int num_attrs = 0;
|
| 60 |
+
|
| 61 |
+
// Cluster dimension attribute
|
| 62 |
+
attrs[num_attrs].id = CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION;
|
| 63 |
+
attrs[num_attrs].value.clusterDim.x = config.cluster_dim_x;
|
| 64 |
+
attrs[num_attrs].value.clusterDim.y = config.cluster_dim_y;
|
| 65 |
+
attrs[num_attrs].value.clusterDim.z = config.cluster_dim_z;
|
| 66 |
+
num_attrs++;
|
| 67 |
+
|
| 68 |
+
// Cluster scheduling policy (optional, for better occupancy)
|
| 69 |
+
attrs[num_attrs].id = CU_LAUNCH_ATTRIBUTE_CLUSTER_SCHEDULING_POLICY_PREFERENCE;
|
| 70 |
+
attrs[num_attrs].value.clusterSchedulingPolicyPreference =
|
| 71 |
+
CU_CLUSTER_SCHEDULING_POLICY_SPREAD; // or CU_CLUSTER_SCHEDULING_POLICY_LOAD_BALANCING
|
| 72 |
+
num_attrs++;
|
| 73 |
+
|
| 74 |
+
// Configure the launch
|
| 75 |
+
CUlaunchConfig launch_config = {};
|
| 76 |
+
launch_config.gridDimX = grid.x;
|
| 77 |
+
launch_config.gridDimY = grid.y;
|
| 78 |
+
launch_config.gridDimZ = grid.z;
|
| 79 |
+
launch_config.blockDimX = block.x;
|
| 80 |
+
launch_config.blockDimY = block.y;
|
| 81 |
+
launch_config.blockDimZ = block.z;
|
| 82 |
+
launch_config.sharedMemBytes = 0; // Triton manages shared memory
|
| 83 |
+
launch_config.hStream = stream;
|
| 84 |
+
launch_config.attrs = attrs;
|
| 85 |
+
launch_config.numAttrs = num_attrs;
|
| 86 |
+
|
| 87 |
+
// Launch with cluster configuration
|
| 88 |
+
return cuLaunchKernelEx(&launch_config, func, args, nullptr);
|
| 89 |
+
}
|
| 90 |
+
|
| 91 |
+
/**
|
| 92 |
+
* Check if the current GPU supports Thread Block Clusters.
|
| 93 |
+
*/
|
| 94 |
+
inline bool supports_clusters() {
|
| 95 |
+
int device;
|
| 96 |
+
cudaGetDevice(&device);
|
| 97 |
+
|
| 98 |
+
cudaDeviceProp props;
|
| 99 |
+
cudaGetDeviceProperties(&props, device);
|
| 100 |
+
|
| 101 |
+
// Clusters require SM 9.0+ (Hopper) or SM 12.0+ (Blackwell)
|
| 102 |
+
return (props.major >= 9) || (props.major == 12);
|
| 103 |
+
}
|
| 104 |
+
|
| 105 |
+
/**
|
| 106 |
+
* Get maximum cluster size for the current GPU.
|
| 107 |
+
*/
|
| 108 |
+
inline int get_max_cluster_size() {
|
| 109 |
+
int device;
|
| 110 |
+
cudaGetDevice(&device);
|
| 111 |
+
|
| 112 |
+
int max_cluster_size = 1;
|
| 113 |
+
cudaDeviceGetAttribute(&max_cluster_size,
|
| 114 |
+
cudaDevAttrClusterLaunch, device);
|
| 115 |
+
|
| 116 |
+
return max_cluster_size;
|
| 117 |
+
}
|
| 118 |
+
|
| 119 |
+
/**
|
| 120 |
+
* Query cluster properties for SM120.
|
| 121 |
+
*/
|
| 122 |
+
struct ClusterProperties {
|
| 123 |
+
int max_cluster_size;
|
| 124 |
+
int max_blocks_per_sm;
|
| 125 |
+
int shared_memory_per_block;
|
    int registers_per_block;
    bool supports_dshem;
};

inline ClusterProperties get_cluster_properties() {
    ClusterProperties props = {};

    int device;
    cudaGetDevice(&device);

    cudaDeviceProp dev_props;
    cudaGetDeviceProperties(&dev_props, device);

    props.max_cluster_size = get_max_cluster_size();
    props.max_blocks_per_sm = dev_props.maxBlocksPerMultiProcessor;
    props.shared_memory_per_block = dev_props.sharedMemPerBlock;
    props.registers_per_block = dev_props.regsPerBlock;
    props.supports_dshem = (dev_props.major >= 9); // SM 9.0+ has dSMEM

    return props;
}

/**
 * Python-compatible wrapper for cluster launch.
 * Can be called from Python via ctypes or pybind11.
 */
extern "C" {

int fireecho_launch_cluster(
    void* func_ptr,
    int grid_x, int grid_y, int grid_z,
    int block_x, int block_y, int block_z,
    void** args,
    int cluster_x, int cluster_y, int cluster_z,
    void* stream_ptr
) {
    CUfunction func = (CUfunction)func_ptr;
    CUstream stream = (CUstream)stream_ptr;

    ClusterConfig config;
    config.cluster_dim_x = cluster_x;
    config.cluster_dim_y = cluster_y;
    config.cluster_dim_z = cluster_z;

    CUresult result = launch_with_cluster(
        func,
        dim3(grid_x, grid_y, grid_z),
        dim3(block_x, block_y, block_z),
        args,
        config,
        stream
    );

    return (int)result;
}

int fireecho_supports_clusters() {
    return supports_clusters() ? 1 : 0;
}

int fireecho_max_cluster_size() {
    return get_max_cluster_size();
}

} // extern "C"

} // namespace fireecho

#endif // FIREECHO_CLUSTER_LAUNCH_H
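The `extern "C"` surface above is meant to be callable from Python via ctypes. A minimal binding sketch, assuming the header is compiled into a shared library named `libfireecho.so` (the library name and path are assumptions, not part of this commit):

```python
# Hypothetical ctypes binding for the extern "C" cluster-launch API above.
# "libfireecho.so" is an assumed name; point it at your actual build output.
import ctypes

# Argument layout of fireecho_launch_cluster: func ptr, grid x/y/z,
# block x/y/z, kernel args, cluster x/y/z, stream ptr — 12 parameters.
LAUNCH_ARGTYPES = [
    ctypes.c_void_p,                              # func_ptr (CUfunction)
    ctypes.c_int, ctypes.c_int, ctypes.c_int,     # grid_x, grid_y, grid_z
    ctypes.c_int, ctypes.c_int, ctypes.c_int,     # block_x, block_y, block_z
    ctypes.POINTER(ctypes.c_void_p),              # args
    ctypes.c_int, ctypes.c_int, ctypes.c_int,     # cluster_x, cluster_y, cluster_z
    ctypes.c_void_p,                              # stream_ptr (CUstream)
]

def bind_fireecho(lib_path="libfireecho.so"):
    """Load the library and attach C signatures to the three entry points."""
    lib = ctypes.CDLL(lib_path)
    lib.fireecho_launch_cluster.argtypes = LAUNCH_ARGTYPES
    lib.fireecho_launch_cluster.restype = ctypes.c_int   # CUresult as int
    lib.fireecho_supports_clusters.restype = ctypes.c_int
    lib.fireecho_max_cluster_size.restype = ctypes.c_int
    return lib
```

The int return of `fireecho_launch_cluster` is the raw `CUresult`, so callers should treat any nonzero value as a launch failure.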
FireEcho Engine/csrc/dsmem_cluster.cuh
ADDED
@@ -0,0 +1,344 @@
/**
 * FireEcho Kernel - Distributed Shared Memory & Cluster Barriers
 *
 * Implements:
 * 1. DSMEM via mapa PTX instruction
 * 2. Cluster barriers via mbarrier PTX
 * 3. Cooperative Groups cluster API
 *
 * Requirements:
 * - CUDA 12.0+ (for Hopper cluster support)
 * - CUDA 12.8+ (for Blackwell SM 12.0)
 * - SM 9.0+ (Hopper) or SM 12.0+ (Blackwell)
 */

#ifndef FIREECHO_DSMEM_CLUSTER_CUH
#define FIREECHO_DSMEM_CLUSTER_CUH

#include <cuda.h>
#include <cuda_runtime.h>
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

namespace fireecho {
namespace dsmem {

// ============================================================================
// 1. DISTRIBUTED SHARED MEMORY (DSMEM) via mapa PTX
// ============================================================================

/**
 * Map a local shared memory address to a remote block's shared memory.
 * Lowers to the mapa PTX instruction for cluster-wide SMEM access.
 *
 * Note: mapa operates on 32-bit shared-state-space addresses, not generic
 * pointers, so we delegate to the Cooperative Groups cluster API, which
 * performs the generic <-> shared address conversions and returns a pointer
 * that can be dereferenced directly.
 *
 * @param local_smem_ptr Local shared memory pointer
 * @param target_rank Target block rank within the cluster (0-indexed)
 * @return Generic pointer accessible across cluster
 */
__device__ __forceinline__ void* map_shared_to_rank(void* local_smem_ptr, int target_rank) {
    auto cluster = cg::this_cluster();
    return (void*)cluster.map_shared_rank((char*)local_smem_ptr, target_rank);
}

/**
 * Typed variant of the mapping (higher-level API).
 */
template<typename T>
__device__ __forceinline__ T* map_shared_rank_cg(T* local_ptr, int target_rank) {
    auto cluster = cg::this_cluster();
    return cluster.map_shared_rank(local_ptr, target_rank);
}

/**
 * Get the current block's rank within the cluster.
 */
__device__ __forceinline__ int get_cluster_rank() {
    auto cluster = cg::this_cluster();
    return cluster.block_rank();
}

/**
 * Get the total number of blocks in the cluster.
 */
__device__ __forceinline__ int get_cluster_size() {
    auto cluster = cg::this_cluster();
    return cluster.num_blocks();
}

// ============================================================================
// 2. CLUSTER BARRIERS via mbarrier PTX
// ============================================================================

/**
 * Cluster-wide barrier object (must live in shared memory).
 * Uses mbarrier for hardware-accelerated synchronization.
 */
struct ClusterBarrier {
    uint64_t barrier_state; // mbarrier state (64-bit)

    // mbarrier instructions address the barrier through a 32-bit
    // shared-state-space address, not a generic pointer.
    __device__ __forceinline__ uint32_t smem_addr() const {
        return (uint32_t)__cvta_generic_to_shared(&barrier_state);
    }

    /**
     * Initialize the barrier for a given number of arrivals.
     * Must be called by exactly one thread before any arrive/wait.
     * (mbarrier.init only accepts the local .shared::cta space.)
     */
    __device__ __forceinline__ void init(int expected_count) {
        asm volatile(
            "mbarrier.init.shared.b64 [%0], %1;"
            :
            : "r"(smem_addr()), "r"(expected_count)
            : "memory"
        );
    }

    /**
     * Arrive at the barrier (signal completion).
     * Returns the phase token for try_wait.
     */
    __device__ __forceinline__ uint64_t arrive() {
        uint64_t phase;
        asm volatile(
            "mbarrier.arrive.shared.b64 %0, [%1];"
            : "=l"(phase)
            : "r"(smem_addr())
            : "memory"
        );
        return phase;
    }

    /**
     * Arrive and expect additional transaction bytes from remote blocks.
     * Used when data is being sent to this block's SMEM.
     */
    __device__ __forceinline__ uint64_t arrive_expect_tx(int tx_count) {
        uint64_t phase;
        asm volatile(
            "mbarrier.arrive.expect_tx.shared.b64 %0, [%1], %2;"
            : "=l"(phase)
            : "r"(smem_addr()), "r"(tx_count)
            : "memory"
        );
        return phase;
    }

    /**
     * Try to wait on the barrier (non-blocking check).
     */
    __device__ __forceinline__ bool try_wait(uint64_t phase) {
        int complete;
        asm volatile(
            "{\n"
            ".reg .pred P;\n"
            "mbarrier.try_wait.shared.b64 P, [%1], %2;\n"
            "selp.s32 %0, 1, 0, P;\n"
            "}\n"
            : "=r"(complete)
            : "r"(smem_addr()), "l"(phase)
            : "memory"
        );
        return complete != 0;
    }

    /**
     * Wait on the barrier (blocking).
     * Spins until all arrivals complete.
     */
    __device__ __forceinline__ void wait(uint64_t phase) {
        while (!try_wait(phase)) {
            // Yield to reduce power consumption while spinning
            __nanosleep(100);
        }
    }
};

/**
 * Simple cluster-wide synchronization.
 * Synchronizes all threads across all blocks in the cluster.
 */
__device__ __forceinline__ void cluster_sync() {
    auto cluster = cg::this_cluster();
    cluster.sync();
}

/**
 * Cluster sync with memory fence.
 * Ensures all DSMEM operations are visible.
 */
__device__ __forceinline__ void cluster_sync_fence() {
    // Memory fence at cluster scope
    asm volatile("fence.acq_rel.cluster;");
    cluster_sync();
    asm volatile("fence.acq_rel.cluster;");
}

// ============================================================================
// 3. DSMEM DATA TRANSFER PRIMITIVES
// ============================================================================

/**
 * Copy a tile from local SMEM into a remote block's SMEM.
 *
 * Note: the bulk-async path (cp.async.bulk.shared::cluster.shared::cta)
 * also requires an mbarrier operand to signal transaction completion; to
 * keep this primitive self-contained it is implemented as a cooperative
 * store loop through the mapped pointer instead.
 */
template<typename T, int SIZE>
__device__ __forceinline__ void copy_to_rank(
    T* dst_smem,        // Local destination pointer (same offset in remote block)
    const T* src_smem,  // Local source pointer
    int target_rank     // Target block rank
) {
    T* remote_dst = (T*)map_shared_to_rank(dst_smem, target_rank);
    for (int i = (int)threadIdx.x; i < SIZE; i += (int)blockDim.x) {
        remote_dst[i] = src_smem[i];
    }
}

/**
 * Load from a remote block's shared memory.
 */
template<typename T>
__device__ __forceinline__ T load_remote_smem(T* local_smem, int target_rank) {
    T* remote = (T*)map_shared_to_rank(local_smem, target_rank);
    return *remote;
}

/**
 * Store to a remote block's shared memory.
 */
template<typename T>
__device__ __forceinline__ void store_remote_smem(T* local_smem, T value, int target_rank) {
    T* remote = (T*)map_shared_to_rank(local_smem, target_rank);
    *remote = value;
}

/**
 * Atomic add to a remote block's shared memory.
 */
template<typename T>
__device__ __forceinline__ T atomic_add_remote_smem(T* local_smem, T value, int target_rank) {
    T* remote = (T*)map_shared_to_rank(local_smem, target_rank);
    return atomicAdd(remote, value);
}

// ============================================================================
// 4. HIGH-LEVEL CLUSTER MATMUL PRIMITIVES
// ============================================================================

/**
 * 2-CTA Cooperative Matrix Multiply using DSMEM.
 *
 * Block 0: Loads A tiles, shares via DSMEM
 * Block 1: Loads B tiles, shares via DSMEM
 * Both: Compute partial C, reduce via DSMEM
 */
template<int BLOCK_M, int BLOCK_N, int BLOCK_K>
struct ClusterMatmul {
    // Shared memory layout for 2-CTA cooperative multiply
    struct SharedStorage {
        __align__(128) float A_tile[BLOCK_M][BLOCK_K];
        __align__(128) float B_tile[BLOCK_K][BLOCK_N];
        __align__(128) float C_partial[BLOCK_M][BLOCK_N];
        ClusterBarrier barrier;
    };

    __device__ static void compute(
        SharedStorage& smem,
        const float* A, const float* B, float* C,
        int M, int N, int K
    ) {
        int rank = get_cluster_rank();
        int tid = threadIdx.x;

        // Each block initializes the barrier in its own SMEM (thread 0 only);
        // every thread of the block will arrive at it once.
        if (tid == 0) {
            smem.barrier.init(blockDim.x);
        }
        cluster_sync();

        // Each block loads different data
        if (rank == 0) {
            // Load A tile
            // ... (tile loading logic)
        } else {
            // Load B tile
            // ... (tile loading logic)
        }

        // Synchronize and share via DSMEM
        uint64_t phase = smem.barrier.arrive();
        smem.barrier.wait(phase);

        // Access partner's data via DSMEM
        auto partner_smem = (SharedStorage*)map_shared_to_rank(&smem, 1 - rank);
        (void)partner_smem; // consumed by the elided MMA loop below

        // Compute using both tiles
        // ... (matrix multiply accumulate)

        // Final reduction
        cluster_sync_fence();
    }
};

} // namespace dsmem

// ============================================================================
// 5. SUPER-CLUSTER FORWARD DECLARATIONS (Vera Rubin / NVL72+)
// ============================================================================

namespace supercluster {

/**
 * Super-Cluster configuration for Vera Rubin NVL72/NVL144.
 *
 * Note: This is a forward-looking API. Full implementation requires:
 * - Vera Rubin hardware (2H 2026)
 * - CUDA 13.0+ with NVLink 6 support
 * - GB200/GR200 NVL72 or NVL144 system
 */
struct SuperClusterConfig {
    int num_gpus = 72;                     // NVL72 default
    int gpus_per_node = 8;                 // Grace-Rubin configuration
    float nvlink_bandwidth_tb_s = 3.6f;    // NVLink 6: ~3.6 TB/s per GPU
    bool use_coherent_memory = true;
};

/**
 * Placeholder for Super-Cluster initialization.
 * Will use NCCL + NVLink 6 for rack-scale coherent memory.
 */
inline void init_super_cluster(const SuperClusterConfig& config) {
    // Vera Rubin: NVL72 acts as single coherent memory space
    // Implementation pending hardware availability
    (void)config;
}

/**
 * Super-Cluster all-reduce (rack-scale).
 * Leverages 3.6 TB/s NVLink 6 bandwidth.
 */
template<typename T>
void all_reduce_super_cluster(T* data, size_t count) {
    // Future: Direct NVLink 6 all-reduce without host involvement
    // For now, falls back to NCCL
    (void)data;
    (void)count;
}

} // namespace supercluster

} // namespace fireecho

#endif // FIREECHO_DSMEM_CLUSTER_CUH
FireEcho Engine/csrc/femx_bindings.cpp
ADDED
@@ -0,0 +1,48 @@
// FE-MX CUDA Kernels — pybind11 bindings
// JIT-compiled via torch.utils.cpp_extension.load()

#include <torch/extension.h>

// Forward declarations from femx_kernels.cu
void femx_quantize_impl(
    torch::Tensor master,
    torch::Tensor tier,
    torch::Tensor packed,
    torch::Tensor scales,
    bool stochastic,
    int64_t seed
);

torch::Tensor femx_dequantize_impl(
    torch::Tensor packed,
    torch::Tensor scales,
    torch::Tensor tier,
    int64_t block_size
);

void femx_sync_impl(
    torch::Tensor master,
    torch::Tensor tier,
    torch::Tensor packed,
    torch::Tensor scales,
    torch::Tensor fast_weight,
    int64_t seed
);

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.doc() = "FE-MX CUDA kernels: fused quantize/dequantize for Hebbian memory";
    m.def("femx_quantize", &femx_quantize_impl,
          "Quantize FP32 master to packed uint8 + E8M0 scales (stochastic rounding)",
          py::arg("master"), py::arg("tier"),
          py::arg("packed"), py::arg("scales"),
          py::arg("stochastic"), py::arg("seed"));
    m.def("femx_dequantize", &femx_dequantize_impl,
          "Dequantize packed uint8 + E8M0 scales to FP32",
          py::arg("packed"), py::arg("scales"),
          py::arg("tier"), py::arg("block_size"));
    m.def("femx_sync", &femx_sync_impl,
          "Fused quantize + dequantize: master FP32 -> packed + BF16 fast_weight",
          py::arg("master"), py::arg("tier"),
          py::arg("packed"), py::arg("scales"),
          py::arg("fast_weight"), py::arg("seed"));
}
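The bindings above are meant to be JIT-compiled. A minimal load sketch, assuming the two `csrc/` files sit relative to the calling script; the paths and the module name `femx_cuda` are assumptions:

```python
# Hypothetical JIT build of the FE-MX extension. torch.utils.cpp_extension.load
# compiles the sources on first call and caches the resulting module.
import os

FEMX_SOURCES = [
    os.path.join("csrc", "femx_bindings.cpp"),
    os.path.join("csrc", "femx_kernels.cu"),
]

def load_femx(verbose=False):
    # Deferred import: needs torch plus a working nvcc toolchain at call time.
    from torch.utils.cpp_extension import load
    return load(
        name="femx_cuda",
        sources=FEMX_SOURCES,
        extra_cuda_cflags=["-O3"],
        verbose=verbose,
    )
```

The returned module exposes `femx_quantize`, `femx_dequantize`, and `femx_sync` exactly as registered in the `PYBIND11_MODULE` block.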
FireEcho Engine/csrc/femx_kernels.cu
ADDED
@@ -0,0 +1,422 @@
// FE-MX CUDA Kernels — Fused quantize/dequantize for Hebbian memory
// Block Floating Point with E8M0 shared exponents, stochastic rounding,
// and age-adaptive precision tiers (FEMX4/FEMX6/FEMX8).
//
// JIT-compiled via torch.utils.cpp_extension.load()
//
// Kernel 1: femx_quantize_kernel   — master FP32 → packed uint8 + E8M0 scales
// Kernel 2: femx_dequantize_kernel — packed uint8 + E8M0 scales → FP32
// Kernel 3: femx_sync_kernel       — fused quantize + dequantize → BF16 writeback

#include <torch/extension.h>
#include <cuda_runtime.h>
#include <cuda_bf16.h>
#include <curand_kernel.h>
#include <math.h>

// ============================================================================
// Constants
// ============================================================================

// Tier mantissa bits: FEMX4=3, FEMX6=5, FEMX8=7
__constant__ int TIER_MBITS[3] = {3, 5, 7};

#define CUDA_CHECK(call) do { \
    cudaError_t err = (call); \
    TORCH_CHECK(err == cudaSuccess, "CUDA error: ", cudaGetErrorString(err)); \
} while(0)

// ============================================================================
// Device helpers
// ============================================================================

// Get mantissa bits and levels from tier
__device__ __forceinline__ void tier_params(int tier, int& mantissa_bits, int& levels) {
    mantissa_bits = (tier == 0) ? 3 : (tier == 1) ? 5 : 7;
    levels = 1 << mantissa_bits;
}

// Compute E8M0 shared exponent: ceil(log2(abs_max)) + 127
// Returns 0 for zero blocks.
__device__ __forceinline__ uint8_t compute_e8m0(float abs_max) {
    if (abs_max == 0.0f) return 0;
    int exp = (int)ceilf(log2f(abs_max)) + 127;
    return (uint8_t)max(0, min(254, exp));
}

// Warp-level max reduction (full warp, 32 threads)
__device__ __forceinline__ float warp_reduce_max(float val) {
    #pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        val = fmaxf(val, __shfl_down_sync(0xFFFFFFFF, val, offset));
    }
    return __shfl_sync(0xFFFFFFFF, val, 0); // broadcast from lane 0
}

// ============================================================================
// Kernel 1: Fused quantize (master FP32 → packed uint8 + E8M0 scales)
//
// Grid:  (num_slots, 1, 1) — one CUDA block per memory slot
// Block: (256, 1, 1) — 8 warps of 32 threads
//
// Each warp processes one block of 32 elements at a time.
// With 96 blocks per slot (dim=3072, block_size=32) and 8 warps,
// each warp handles 12 iterations.
//
// Stochastic rounding uses Philox4_32_10 PRNG (one state per thread).
// ============================================================================
__global__ void femx_quantize_kernel(
    const float* __restrict__ master,   // [S, D] FP32 master copy
    const uint8_t* __restrict__ tier,   // [S] per-slot precision tier
    uint8_t* __restrict__ packed,       // [S, D] output packed uint8
    uint8_t* __restrict__ scales,       // [S, B] output E8M0 exponents
    int num_slots,
    int dim,
    int block_size,                     // 32
    int num_blocks,                     // dim / block_size
    bool stochastic,
    unsigned long long seed
) {
    int slot_idx = blockIdx.x;
    if (slot_idx >= num_slots) return;

    int tid = threadIdx.x;
    int warp_id = tid / 32;
    int lane_id = tid % 32;
    int num_warps = blockDim.x / 32;

    // Read tier for this slot (uniform across all threads in block)
    int t = (int)tier[slot_idx];
    int mantissa_bits, levels;
    tier_params(t, mantissa_bits, levels);

    // Init Philox PRNG per thread (only if stochastic)
    curandStatePhilox4_32_10_t rng;
    if (stochastic) {
        curand_init(seed,
                    (unsigned long long)(slot_idx * blockDim.x + tid),
                    0, &rng);
    }

    // Base pointers for this slot
    const float* slot_m = master + (long long)slot_idx * dim;
    uint8_t* slot_p = packed + (long long)slot_idx * dim;
    uint8_t* slot_s = scales + (long long)slot_idx * num_blocks;

    // Each warp handles blocks in strided order
    for (int blk = warp_id; blk < num_blocks; blk += num_warps) {
        int elem_off = blk * block_size + lane_id;

        // 1. Load one element per lane
        float val = (elem_off < dim) ? slot_m[elem_off] : 0.0f;
        float abs_val = fabsf(val);

        // 2. Warp-level reduction for block abs_max
        float block_max = warp_reduce_max(abs_val);

        // 3. E8M0 shared exponent
        uint8_t e8m0 = compute_e8m0(block_max);

        // 4. Normalize: val / 2^(e8m0 - 127)
        float scale = exp2f((float)e8m0 - 127.0f);
        scale = fmaxf(scale, 1e-38f); // avoid div-by-zero
        float normalized = val / scale;

        // 5. Scale to integer range
        float scaled = normalized * (float)levels;

        // 6. Round (stochastic or deterministic)
        float rounded;
        if (stochastic) {
            float noise = curand_uniform(&rng);
            rounded = floorf(scaled + noise);
        } else {
            rounded = roundf(scaled);
        }

        // 7. Clamp to representable range: [-levels, levels-1]
        rounded = fmaxf((float)(-levels), fminf((float)(levels - 1), rounded));
        int rounded_int = (int)rounded;

        // 8. Sign-magnitude packing
        uint8_t sign_bit = (rounded_int < 0) ? 1 : 0;
        int abs_rounded = abs(rounded_int);
        uint8_t mag = (uint8_t)min(abs_rounded, levels - 1);
        uint8_t packed_val = (sign_bit << mantissa_bits) | mag;

        // 9. Write packed element
        if (elem_off < dim) {
            slot_p[elem_off] = packed_val;
        }

        // 10. Lane 0 writes the shared exponent for this block
        if (lane_id == 0) {
            slot_s[blk] = e8m0;
        }
    }
}

// ============================================================================
// Kernel 2: Fused dequantize (packed uint8 + E8M0 → FP32)
//
// Grid:  ceil(total_elements / 256)
// Block: 256 threads
//
// Simple element-parallel kernel. Each thread dequantizes one element.
// Bandwidth-bound — no shared memory or reductions needed.
// ============================================================================
__global__ void femx_dequantize_kernel(
    const uint8_t* __restrict__ packed,     // [S, D] packed mantissa+sign
    const uint8_t* __restrict__ scales_buf, // [S, B] E8M0 shared exponents
    const uint8_t* __restrict__ tier,       // [S] per-slot tier
    float* __restrict__ output,             // [S, D] FP32 output
    int num_slots,
    int dim,
    int block_size,
    int num_blocks
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = num_slots * dim;
    if (idx >= total) return;

    int slot_idx = idx / dim;
    int elem_idx = idx % dim;
    int blk_idx = elem_idx / block_size;

    // Tier → mantissa bits
    int t = (int)tier[slot_idx];
    int mantissa_bits, levels;
    tier_params(t, mantissa_bits, levels);
    int mask = levels - 1;

    // Unpack sign-magnitude
    uint8_t p = packed[idx];
    uint8_t sign = (p >> mantissa_bits) & 1;
    float mag = (float)(p & mask);

    // Reconstruct normalized value
    float val = sign ? -mag : mag;
    val /= (float)levels;

    // Apply shared exponent
    uint8_t e8m0 = scales_buf[(long long)slot_idx * num_blocks + blk_idx];
    float scale = exp2f((float)e8m0 - 127.0f);

    output[idx] = val * scale;
}
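The quantize/dequantize pair above is easy to sanity-check against a scalar reference. A pure-Python sketch of the same block-floating-point scheme, deterministic rounding only; the function names here are ours, not from the kernels:

```python
# Scalar reference for the FE-MX quantize/dequantize kernels.
# The kernel's stochastic path adds uniform noise before floor(); this
# sketch uses plain rounding (Python's round() is banker's rounding,
# whereas the kernel's roundf rounds half away from zero).
import math

def quantize_block(vals, mantissa_bits=3):
    """One block of values -> sign-magnitude codes + E8M0 shared exponent."""
    levels = 1 << mantissa_bits
    abs_max = max(abs(v) for v in vals)
    # E8M0: ceil(log2(abs_max)) + 127, clamped to [0, 254]; 0 for zero blocks
    e8m0 = 0 if abs_max == 0.0 else max(0, min(254, math.ceil(math.log2(abs_max)) + 127))
    scale = max(2.0 ** (e8m0 - 127), 1e-38)
    codes = []
    for v in vals:
        q = round(v / scale * levels)
        q = max(-levels, min(levels - 1, q))   # clamp to representable range
        sign = 1 if q < 0 else 0
        mag = min(abs(q), levels - 1)
        codes.append((sign << mantissa_bits) | mag)
    return codes, e8m0

def dequantize_block(codes, e8m0, mantissa_bits=3):
    levels = 1 << mantissa_bits
    mask = levels - 1
    scale = 2.0 ** (e8m0 - 127)
    return [(-(c & mask) if (c >> mantissa_bits) & 1 else (c & mask)) / levels * scale
            for c in codes]
```

Round-tripping `[0.5, -0.25, 0.0, 1.0]` at FEMX4 gives `[0.5, -0.25, 0.0, 0.875]`: the 1.0 saturates because the largest representable magnitude is `(levels - 1) / levels` of the block scale, matching the clamp in step 7 of the kernel.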
// ============================================================================
// Kernel 3: Fused sync (quantize master → packed + dequantize → BF16 writeback)
//
// Combines quantize and dequantize in a single pass:
//   1. Read master FP32
//   2. Quantize to packed uint8 (with stochastic rounding)
//   3. Immediately dequantize the quantized value (no memory round-trip)
//   4. Write BF16 to fast_weight
//
// This avoids a separate read pass over packed+scales, saving bandwidth.
// Same grid/block layout as the quantize kernel.
// ============================================================================
__global__ void femx_sync_kernel(
    const float* __restrict__ master,        // [S, D] FP32 input
    const uint8_t* __restrict__ tier,        // [S] per-slot tier
    uint8_t* __restrict__ packed,            // [S, D] packed output
    uint8_t* __restrict__ scales,            // [S, B] E8M0 output
    __nv_bfloat16* __restrict__ fast_weight, // [S, D] BF16 output
    int num_slots,
    int dim,
    int block_size,
    int num_blocks,
    unsigned long long seed
) {
    int slot_idx = blockIdx.x;
    if (slot_idx >= num_slots) return;

    int tid = threadIdx.x;
    int warp_id = tid / 32;
    int lane_id = tid % 32;
    int num_warps = blockDim.x / 32;

    // Tier params (uniform across block)
    int t = (int)tier[slot_idx];
    int mantissa_bits, levels;
    tier_params(t, mantissa_bits, levels);
    int mask = levels - 1;

    // Init Philox PRNG (always stochastic for sync)
    curandStatePhilox4_32_10_t rng;
    curand_init(seed,
                (unsigned long long)(slot_idx * blockDim.x + tid),
                0, &rng);

    // Base pointers
    const float* slot_m = master + (long long)slot_idx * dim;
    uint8_t* slot_p = packed + (long long)slot_idx * dim;
    uint8_t* slot_s = scales + (long long)slot_idx * num_blocks;
    __nv_bfloat16* slot_fw = fast_weight + (long long)slot_idx * dim;

    for (int blk = warp_id; blk < num_blocks; blk += num_warps) {
        int elem_off = blk * block_size + lane_id;

        // === QUANTIZE PASS ===

        // 1. Load master
        float val = (elem_off < dim) ? slot_m[elem_off] : 0.0f;
        float abs_val = fabsf(val);

        // 2. Block abs_max via warp reduction
        float block_max = warp_reduce_max(abs_val);

        // 3. E8M0
        uint8_t e8m0 = compute_e8m0(block_max);
        float scale = exp2f((float)e8m0 - 127.0f);
        scale = fmaxf(scale, 1e-38f);

        // 4. Normalize + quantize with SR
        float normalized = val / scale;
        float scaled = normalized * (float)levels;
        float noise = curand_uniform(&rng);
        float rounded = floorf(scaled + noise);
        rounded = fmaxf((float)(-levels), fminf((float)(levels - 1), rounded));
        int rounded_int = (int)rounded;

        // 5. Pack sign-magnitude
        uint8_t sign_bit = (rounded_int < 0) ? 1 : 0;
        uint8_t mag = (uint8_t)min(abs(rounded_int), levels - 1);
        uint8_t packed_val = (sign_bit << mantissa_bits) | mag;

        // === DEQUANTIZE PASS (in-register, no memory round-trip) ===

        // 6. Unpack what we just packed
        float dq_mag = (float)(packed_val & mask);
        float dq_val = sign_bit ? -dq_mag : dq_mag;
        dq_val /= (float)levels;
        float result = dq_val * scale; // same scale, still in register

        // === WRITE ALL OUTPUTS ===
        if (elem_off < dim) {
            slot_p[elem_off] = packed_val;
            slot_fw[elem_off] = __float2bfloat16(result);
        }
        if (lane_id == 0) {
            slot_s[blk] = e8m0;
        }
    }
}

// ============================================================================
// Host wrapper functions (called from pybind11 bindings)
// ============================================================================

void femx_quantize_impl(
    torch::Tensor master, // [S, D] float32 CUDA
    torch::Tensor tier,   // [S] uint8 CUDA
    torch::Tensor packed, // [S, D] uint8 CUDA (output, pre-allocated)
    torch::Tensor scales, // [S, B] uint8 CUDA (output, pre-allocated)
| 320 |
+
bool stochastic,
|
| 321 |
+
int64_t seed
|
| 322 |
+
) {
|
| 323 |
+
TORCH_CHECK(master.is_cuda(), "master must be on CUDA");
|
| 324 |
+
TORCH_CHECK(master.dtype() == torch::kFloat32, "master must be float32");
|
| 325 |
+
TORCH_CHECK(tier.dtype() == torch::kUInt8, "tier must be uint8");
|
| 326 |
+
TORCH_CHECK(packed.dtype() == torch::kUInt8, "packed must be uint8");
|
| 327 |
+
TORCH_CHECK(scales.dtype() == torch::kUInt8, "scales must be uint8");
|
| 328 |
+
|
| 329 |
+
master = master.contiguous();
|
| 330 |
+
tier = tier.contiguous();
|
| 331 |
+
|
| 332 |
+
int num_slots = master.size(0);
|
| 333 |
+
int dim = master.size(1);
|
| 334 |
+
int num_blocks = scales.size(1);
|
| 335 |
+
int block_size = dim / num_blocks;
|
| 336 |
+
|
| 337 |
+
TORCH_CHECK(dim % block_size == 0, "dim must be divisible by block_size");
|
| 338 |
+
|
| 339 |
+
int threads = 256;
|
| 340 |
+
femx_quantize_kernel<<<num_slots, threads>>>(
|
| 341 |
+
master.data_ptr<float>(),
|
| 342 |
+
tier.data_ptr<uint8_t>(),
|
| 343 |
+
packed.data_ptr<uint8_t>(),
|
| 344 |
+
scales.data_ptr<uint8_t>(),
|
| 345 |
+
num_slots, dim, block_size, num_blocks,
|
| 346 |
+
stochastic, (unsigned long long)seed
|
| 347 |
+
);
|
| 348 |
+
}
|
| 349 |
+
|
| 350 |
+
|
| 351 |
+
torch::Tensor femx_dequantize_impl(
|
| 352 |
+
torch::Tensor packed, // [S, D] uint8 CUDA
|
| 353 |
+
torch::Tensor scales, // [S, B] uint8 CUDA
|
| 354 |
+
torch::Tensor tier, // [S] uint8 CUDA
|
| 355 |
+
int64_t block_size
|
| 356 |
+
) {
|
| 357 |
+
TORCH_CHECK(packed.is_cuda(), "packed must be on CUDA");
|
| 358 |
+
TORCH_CHECK(packed.dtype() == torch::kUInt8, "packed must be uint8");
|
| 359 |
+
TORCH_CHECK(scales.dtype() == torch::kUInt8, "scales must be uint8");
|
| 360 |
+
TORCH_CHECK(tier.dtype() == torch::kUInt8, "tier must be uint8");
|
| 361 |
+
|
| 362 |
+
packed = packed.contiguous();
|
| 363 |
+
scales = scales.contiguous();
|
| 364 |
+
tier = tier.contiguous();
|
| 365 |
+
|
| 366 |
+
int num_slots = packed.size(0);
|
| 367 |
+
int dim = packed.size(1);
|
| 368 |
+
int num_blocks = dim / block_size;
|
| 369 |
+
|
| 370 |
+
auto output = torch::empty({num_slots, dim},
|
| 371 |
+
torch::TensorOptions().dtype(torch::kFloat32).device(packed.device()));
|
| 372 |
+
|
| 373 |
+
int total = num_slots * dim;
|
| 374 |
+
int threads = 256;
|
| 375 |
+
int blocks = (total + threads - 1) / threads;
|
| 376 |
+
|
| 377 |
+
femx_dequantize_kernel<<<blocks, threads>>>(
|
| 378 |
+
packed.data_ptr<uint8_t>(),
|
| 379 |
+
scales.data_ptr<uint8_t>(),
|
| 380 |
+
tier.data_ptr<uint8_t>(),
|
| 381 |
+
output.data_ptr<float>(),
|
| 382 |
+
num_slots, dim, block_size, num_blocks
|
| 383 |
+
);
|
| 384 |
+
|
| 385 |
+
return output;
|
| 386 |
+
}
|
| 387 |
+
|
| 388 |
+
|
| 389 |
+
void femx_sync_impl(
|
| 390 |
+
torch::Tensor master, // [S, D] float32 CUDA
|
| 391 |
+
torch::Tensor tier, // [S] uint8 CUDA
|
| 392 |
+
torch::Tensor packed, // [S, D] uint8 CUDA (output)
|
| 393 |
+
torch::Tensor scales, // [S, B] uint8 CUDA (output)
|
| 394 |
+
torch::Tensor fast_weight, // [S, D] bfloat16 CUDA (output)
|
| 395 |
+
int64_t seed
|
| 396 |
+
) {
|
| 397 |
+
TORCH_CHECK(master.is_cuda(), "master must be on CUDA");
|
| 398 |
+
TORCH_CHECK(master.dtype() == torch::kFloat32, "master must be float32");
|
| 399 |
+
TORCH_CHECK(fast_weight.dtype() == torch::kBFloat16, "fast_weight must be bfloat16");
|
| 400 |
+
TORCH_CHECK(tier.dtype() == torch::kUInt8, "tier must be uint8");
|
| 401 |
+
TORCH_CHECK(packed.dtype() == torch::kUInt8, "packed must be uint8");
|
| 402 |
+
TORCH_CHECK(scales.dtype() == torch::kUInt8, "scales must be uint8");
|
| 403 |
+
|
| 404 |
+
master = master.contiguous();
|
| 405 |
+
tier = tier.contiguous();
|
| 406 |
+
|
| 407 |
+
int num_slots = master.size(0);
|
| 408 |
+
int dim = master.size(1);
|
| 409 |
+
int num_blocks = scales.size(1);
|
| 410 |
+
int block_size = dim / num_blocks;
|
| 411 |
+
|
| 412 |
+
int threads = 256;
|
| 413 |
+
femx_sync_kernel<<<num_slots, threads>>>(
|
| 414 |
+
master.data_ptr<float>(),
|
| 415 |
+
tier.data_ptr<uint8_t>(),
|
| 416 |
+
packed.data_ptr<uint8_t>(),
|
| 417 |
+
scales.data_ptr<uint8_t>(),
|
| 418 |
+
reinterpret_cast<__nv_bfloat16*>(fast_weight.data_ptr<at::BFloat16>()),
|
| 419 |
+
num_slots, dim, block_size, num_blocks,
|
| 420 |
+
(unsigned long long)seed
|
| 421 |
+
);
|
| 422 |
+
}
|
FireEcho Engine/csrc/fireecho_preproc.cpp
ADDED
@@ -0,0 +1,54 @@
// FireEcho Preprocessing — pybind11 bindings (SpeechLib-matched)
// JIT-compiled via torch.utils.cpp_extension.load()

#include <torch/extension.h>

// Forward declarations from fireecho_preproc_cuda.cu
torch::Tensor cuda_stft_impl(
    torch::Tensor audio,
    torch::Tensor window,
    int64_t n_fft,
    int64_t win_length,
    int64_t hop_length,
    double preemph_coeff
);

torch::Tensor cuda_mel_filterbank_impl(
    torch::Tensor power_spec,
    torch::Tensor mel_matrix
);

torch::Tensor cuda_audio_pipeline_impl(
    torch::Tensor audio,
    torch::Tensor window,
    torch::Tensor mel_matrix,
    int64_t n_fft,
    int64_t win_length,
    int64_t hop_length,
    double preemph_coeff
);

torch::Tensor cuda_image_preprocess_impl(
    torch::Tensor image_rgb,
    int64_t crop_size
);

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.doc() = "FireEcho CUDA-accelerated preprocessing (Phase 5, SpeechLib-matched)";
    m.def("cuda_stft", &cuda_stft_impl,
          "Batched STFT with per-frame pre-emphasis + 32768 scaling via cuFFT",
          py::arg("audio"), py::arg("window"),
          py::arg("n_fft"), py::arg("win_length"), py::arg("hop_length"),
          py::arg("preemph_coeff") = 0.97);
    m.def("cuda_mel_filterbank", &cuda_mel_filterbank_impl,
          "Mel filterbank with pre-computed SpeechLib matrix + fused clip+log",
          py::arg("power_spec"), py::arg("mel_matrix"));
    m.def("cuda_audio_pipeline", &cuda_audio_pipeline_impl,
          "Full audio pipeline: STFT + mel in single call",
          py::arg("audio"), py::arg("window"), py::arg("mel_matrix"),
          py::arg("n_fft"), py::arg("win_length"), py::arg("hop_length"),
          py::arg("preemph_coeff") = 0.97);
    m.def("cuda_image_preprocess", &cuda_image_preprocess_impl,
          "Fused bicubic resize + normalize [-1,1] + bf16",
          py::arg("image_rgb"), py::arg("crop_size"));
}
FireEcho Engine/csrc/fireecho_preproc_cuda.cu
ADDED
@@ -0,0 +1,316 @@
// FireEcho Preprocessing CUDA Kernels — Phase 5 (SpeechLib-matched)
// Accelerated audio STFT, mel filterbank, and image preprocessing
// JIT-compiled via torch.utils.cpp_extension
//
// Audio pipeline exactly replicates Phi-4 processing_phi4mm.py:
//   Per-frame pre-emphasis with roll (prev[0]=frame[0]) + scale 32768
//   Hamming window → cuFFT R2C → |z|^2 → mel matmul → clip(1.0) → ln()

#include <torch/extension.h>
#include <cuda_runtime.h>
#include <cufft.h>
#include <math.h>

// ============================================================================
// CUDA error checking
// ============================================================================
#define CUDA_CHECK(call) do { \
    cudaError_t err = (call); \
    TORCH_CHECK(err == cudaSuccess, "CUDA error: ", cudaGetErrorString(err)); \
} while(0)

#define CUFFT_CHECK(call) do { \
    cufftResult err = (call); \
    TORCH_CHECK(err == CUFFT_SUCCESS, "cuFFT error: ", (int)err); \
} while(0)

// ============================================================================
// Kernel 1: Frame extraction + per-frame pre-emphasis + scaling + windowing
//
// Matches SpeechLib / Phi-4 processing_phi4mm.py exactly:
//   prev[0] = frame[0]   (NOT zero — SpeechLib sets prev[:,0] = prev[:,1])
//   prev[i] = frame[i-1] for i > 0
//   output[i] = (frame[i] - coeff * prev[i]) * 32768.0 * window[i]
//
// Each thread-block handles one frame using shared memory.
// ============================================================================
__global__ void frame_extract_preemph_kernel(
    const float* __restrict__ audio,    // [N] raw samples
    const float* __restrict__ window,   // [win_length]
    float* __restrict__ frames,         // [num_frames, n_fft] output
    int N,
    int n_fft,
    int win_length,
    int hop_length,
    int num_frames,
    float preemph_coeff                 // 0.97
) {
    extern __shared__ float sframe[];   // [win_length] shared per block
    int frame_idx = blockIdx.x;
    if (frame_idx >= num_frames) return;

    int start = frame_idx * hop_length;

    // Phase 1: Load raw samples into shared memory
    for (int i = threadIdx.x; i < win_length; i += blockDim.x) {
        int sample_idx = start + i;
        sframe[i] = (sample_idx < N) ? audio[sample_idx] : 0.0f;
    }
    __syncthreads();

    // Phase 2: Per-frame pre-emphasis + 32768 scaling + windowing + zero-pad
    for (int i = threadIdx.x; i < n_fft; i += blockDim.x) {
        float val = 0.0f;
        if (i < win_length) {
            float curr = sframe[i];
            // SpeechLib: prev[0] = frame[0], prev[i] = frame[i-1] for i > 0
            float prev = (i > 0) ? sframe[i - 1] : curr;
            val = (curr - preemph_coeff * prev) * 32768.0f * window[i];
        }
        // Beyond win_length: val stays 0.0 (zero-pad to n_fft)
        frames[frame_idx * n_fft + i] = val;
    }
}

// ============================================================================
// Kernel 2: Power spectrum from complex FFT output
//   |z|^2 = re^2 + im^2
// ============================================================================
__global__ void power_spectrum_kernel(
    const cufftComplex* __restrict__ spectrum,  // [num_frames, n_fft/2+1]
    float* __restrict__ power,                  // [num_frames, n_fft/2+1]
    int total_elements
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= total_elements) return;

    float re = spectrum[idx].x;
    float im = spectrum[idx].y;
    power[idx] = re * re + im * im;
}

// ============================================================================
// Kernel 3: Fused clip(1.0) + natural log
// Applied element-wise to mel-filtered power spectrum
// Matches: np.log(np.clip(spec_power.dot(mel_matrix), 1.0, None))
// ============================================================================
__global__ void clip_log_kernel(
    float* __restrict__ data,   // [T * n_mels] in-place
    int total_elements
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= total_elements) return;

    data[idx] = logf(fmaxf(data[idx], 1.0f));
}

// ============================================================================
// Kernel 4: Fused bicubic resize + normalize for images
// Each thread computes one output pixel (c, y, x).
// Catmull-Rom spline (a = -0.5) matching TorchVision bicubic.
// Output: normalized to [-1, 1] range.
// ============================================================================
__device__ float cubic_weight(float x, float a = -0.5f) {
    x = fabsf(x);
    if (x <= 1.0f) {
        return (a + 2.0f) * x * x * x - (a + 3.0f) * x * x + 1.0f;
    } else if (x < 2.0f) {
        return a * x * x * x - 5.0f * a * x * x + 8.0f * a * x - 4.0f * a;
    }
    return 0.0f;
}

__global__ void image_resize_normalize_kernel(
    const unsigned char* __restrict__ image,  // [H_in, W_in, 3] uint8
    float* __restrict__ output,               // [3, H_out, W_out] float
    int H_in, int W_in,
    int H_out, int W_out
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = 3 * H_out * W_out;
    if (idx >= total) return;

    int c = idx / (H_out * W_out);
    int rem = idx % (H_out * W_out);
    int y_out = rem / W_out;
    int x_out = rem % W_out;

    // Map output coordinate to input coordinate
    float scale_y = (float)H_in / (float)H_out;
    float scale_x = (float)W_in / (float)W_out;
    float y_in = ((float)y_out + 0.5f) * scale_y - 0.5f;
    float x_in = ((float)x_out + 0.5f) * scale_x - 0.5f;

    int y0 = (int)floorf(y_in) - 1;
    int x0 = (int)floorf(x_in) - 1;

    float sum = 0.0f;
    float weight_sum = 0.0f;

    // 4x4 bicubic kernel
    for (int dy = 0; dy < 4; dy++) {
        int yy = y0 + dy;
        float wy = cubic_weight(y_in - (float)yy);
        // Clamp to image bounds
        yy = max(0, min(yy, H_in - 1));

        for (int dx = 0; dx < 4; dx++) {
            int xx = x0 + dx;
            float wx = cubic_weight(x_in - (float)xx);
            xx = max(0, min(xx, W_in - 1));

            float pixel = (float)image[yy * W_in * 3 + xx * 3 + c];
            float w = wy * wx;
            sum += pixel * w;
            weight_sum += w;
        }
    }

    // Normalize weights, convert to [0, 1], then to [-1, 1]
    float val = sum / fmaxf(weight_sum, 1e-6f);
    val = val / 255.0f;         // [0, 1]
    val = (val - 0.5f) / 0.5f;  // [-1, 1]
    output[idx] = val;
}

// ============================================================================
// Host functions called from C++ bindings
// ============================================================================

torch::Tensor cuda_stft_impl(
    torch::Tensor audio,    // [N] float32 on CUDA
    torch::Tensor window,   // [win_length] float32 on CUDA
    int64_t n_fft,
    int64_t win_length,
    int64_t hop_length,
    double preemph_coeff    // 0.97
) {
    TORCH_CHECK(audio.is_cuda(), "audio must be on CUDA");
    TORCH_CHECK(window.is_cuda(), "window must be on CUDA");
    TORCH_CHECK(audio.dtype() == torch::kFloat32, "audio must be float32");
    audio = audio.contiguous();
    window = window.contiguous();

    int N = audio.size(0);
    int num_frames = (N - win_length) / hop_length + 1;
    if (num_frames <= 0) num_frames = 1;

    int freq_bins = n_fft / 2 + 1;

    // Allocate frames buffer [num_frames, n_fft]
    auto frames = torch::zeros({num_frames, n_fft},
        torch::TensorOptions().dtype(torch::kFloat32).device(audio.device()));

    // Frame extraction + per-frame pre-emphasis + 32768 scaling + windowing
    // Shared memory: win_length floats for the raw frame
    int threads = 256;
    int smem = win_length * sizeof(float);
    frame_extract_preemph_kernel<<<num_frames, threads, smem>>>(
        audio.data_ptr<float>(),
        window.data_ptr<float>(),
        frames.data_ptr<float>(),
        N, n_fft, win_length, hop_length, num_frames,
        (float)preemph_coeff
    );

    // Batched real-to-complex FFT via cuFFT
    cufftHandle plan;
    CUFFT_CHECK(cufftPlan1d(&plan, n_fft, CUFFT_R2C, num_frames));

    auto spectrum = torch::empty({num_frames, freq_bins},
        torch::TensorOptions().dtype(torch::kComplexFloat).device(audio.device()));

    CUFFT_CHECK(cufftExecR2C(plan,
        frames.data_ptr<float>(),
        reinterpret_cast<cufftComplex*>(spectrum.data_ptr<c10::complex<float>>())
    ));
    cufftDestroy(plan);

    // Power spectrum: |z|^2 = re^2 + im^2
    int total = num_frames * freq_bins;
    int blocks = (total + 255) / 256;
    auto power = torch::empty({num_frames, freq_bins},
        torch::TensorOptions().dtype(torch::kFloat32).device(audio.device()));

    power_spectrum_kernel<<<blocks, 256>>>(
        reinterpret_cast<const cufftComplex*>(spectrum.data_ptr<c10::complex<float>>()),
        power.data_ptr<float>(),
        total
    );

    return power;  // [num_frames, n_fft/2+1]
}


torch::Tensor cuda_mel_filterbank_impl(
    torch::Tensor power_spec,   // [T, F] float32 on CUDA
    torch::Tensor mel_matrix    // [F, n_mels] float32 on CUDA (pre-computed SpeechLib, transposed)
) {
    TORCH_CHECK(power_spec.is_cuda(), "power_spec must be on CUDA");
    TORCH_CHECK(mel_matrix.is_cuda(), "mel_matrix must be on CUDA");
    TORCH_CHECK(power_spec.dtype() == torch::kFloat32, "power_spec must be float32");
    TORCH_CHECK(mel_matrix.dtype() == torch::kFloat32, "mel_matrix must be float32");
    power_spec = power_spec.contiguous();
    mel_matrix = mel_matrix.contiguous();

    // mel_out = power_spec @ mel_matrix → [T, n_mels]
    // mel_matrix is already [F, n_mels] (transposed for dot product)
    auto mel_out = torch::mm(power_spec, mel_matrix);

    // Fused clip(1.0) + log in-place
    int total = mel_out.numel();
    int threads = 256;
    int blocks = (total + threads - 1) / threads;
    clip_log_kernel<<<blocks, threads>>>(mel_out.data_ptr<float>(), total);

    return mel_out;  // [T, n_mels]
}


torch::Tensor cuda_audio_pipeline_impl(
    torch::Tensor audio,        // [N] float32 on CUDA
    torch::Tensor window,       // [win_length] float32 on CUDA
    torch::Tensor mel_matrix,   // [F, n_mels] float32 on CUDA (pre-computed SpeechLib)
    int64_t n_fft,
    int64_t win_length,
    int64_t hop_length,
    double preemph_coeff        // 0.97
) {
    // Full pipeline: audio → frames → FFT → power → mel → clip → log
    // Single call minimizes Python ↔ CUDA round-trips
    auto power = cuda_stft_impl(audio, window, n_fft, win_length, hop_length, preemph_coeff);
    auto mel = cuda_mel_filterbank_impl(power, mel_matrix);
    return mel;  // [T, n_mels]
}


torch::Tensor cuda_image_preprocess_impl(
    torch::Tensor image_rgb,    // [H, W, 3] uint8 on CUDA
    int64_t crop_size
) {
    TORCH_CHECK(image_rgb.is_cuda(), "image must be on CUDA");
    TORCH_CHECK(image_rgb.dtype() == torch::kUInt8, "image must be uint8");
    image_rgb = image_rgb.contiguous();

    int H_in = image_rgb.size(0);
    int W_in = image_rgb.size(1);
    int H_out = crop_size;
    int W_out = crop_size;

    // Output: [3, H_out, W_out] float32, then we'll convert to bf16
    auto output = torch::empty({3, H_out, W_out},
        torch::TensorOptions().dtype(torch::kFloat32).device(image_rgb.device()));

    int total = 3 * H_out * W_out;
    int threads = 256;
    int blocks = (total + threads - 1) / threads;

    image_resize_normalize_kernel<<<blocks, threads>>>(
        image_rgb.data_ptr<unsigned char>(),
        output.data_ptr<float>(),
        H_in, W_in, H_out, W_out
    );

    // Add batch dimension and convert to bfloat16
    return output.unsqueeze(0).to(torch::kBFloat16);  // [1, 3, H_out, W_out]
}
FireEcho Engine/cutlass_kernels.py
ADDED
@@ -0,0 +1,2418 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""
FireEcho CUTLASS — Self-Contained CUTLASS-Compatible Kernels
=============================================================
Part of the FireEcho Engine — Custom inference kernel for NVIDIA Blackwell
Copyright (c) 2025-2026 Echo (FireEcho Project). All rights reserved.

Pure Python/Triton/PyTorch implementations — no .so binary required.

1. TMA MatMul — Triton block-pointer kernel with multi-stage pipelining
2. TMA Attention — PyTorch SDPA (dispatches to Flash Attention 2 on HW)
3. NVFP4 GEMM — Fused dequant-matmul Triton kernel (Blackwell native format)
   16-element blocks, E4M3 scales, per-tensor FP32 scale.
   Multi-tier dispatch: native cuBLAS _scaled_mm → fused Triton → CPU.
   Vectorized O(K*N) activation quantization via torch.bucketize.
4. MXFP4 GEMM — Fused dequant-matmul Triton kernel (OCP MXFP4 format)
   32-element blocks, E8M0 power-of-two scales.
   Kept for backward compatibility.
5. L2 Cache Control — ctypes/libcudart.so cudaAccessPolicyWindow

Usage:
    from fireecho_kernel.cutlass_kernels import (
        tma_matmul,
        tma_attention,
        nvfp4_gemm,    # New: NVFP4 (recommended)
        mxfp4_gemm,    # Legacy: MXFP4
        fp4_gemm,      # Alias -> nvfp4_gemm
        NVFP4Weights,
        MXFP4Weights,
        L2CacheManager,
    )

    # TMA MatMul (Triton block-pointer)
    c = tma_matmul(a, b)

    # NVFP4 GEMM (recommended — fused dequant-matmul, 16-element blocks)
    w_q = quantize_to_nvfp4(weights)
    out = nvfp4_gemm(activations, w_q)

    # MXFP4 GEMM (legacy — fused dequant-matmul, 32-element blocks)
    w_q = quantize_to_mxfp4(weights)
    out = mxfp4_gemm(activations, w_q)

    # L2 Cache pinning (hardware-backed via cudart)
    l2 = L2CacheManager()
    l2.pin(embedding_table)
"""

import torch
import torch.nn.functional as F
import triton
import triton.language as tl
from typing import Optional, Tuple, Dict, Any
from dataclasses import dataclass
import ctypes
import ctypes.util

# =============================================================================
# Triton TMA MatMul Kernel (block-pointer, multi-stage)
# =============================================================================

@triton.autotune(
    configs=[
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=3, num_warps=8),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 256, 'BLOCK_K': 64}, num_stages=3, num_warps=8),
        triton.Config({'BLOCK_M': 256, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=3, num_warps=8),
        triton.Config({'BLOCK_M': 64, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=4, num_warps=4),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 32}, num_stages=4, num_warps=8),
        triton.Config({'BLOCK_M': 64, 'BLOCK_N': 64, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
    ],
    key=['M', 'N', 'K'],
)
@triton.jit
def _tma_matmul_kernel(
    a_ptr, b_ptr, c_ptr, d_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    stride_dm, stride_dn,
    alpha, beta,
    HAS_C: tl.constexpr,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    """
    TMA-style MatMul using block pointers for async memory access.

    D = alpha * (A @ B) + beta * C

    Block pointers enable hardware-managed address generation and
    async GDDR7/HBM -> SMEM loads overlapped with compute.
    """
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    a_block_ptr = tl.make_block_ptr(
        base=a_ptr,
        shape=(M, K),
        strides=(stride_am, stride_ak),
        offsets=(pid_m * BLOCK_M, 0),
        block_shape=(BLOCK_M, BLOCK_K),
        order=(1, 0),
    )
    b_block_ptr = tl.make_block_ptr(
        base=b_ptr,
        shape=(K, N),
        strides=(stride_bk, stride_bn),
        offsets=(0, pid_n * BLOCK_N),
        block_shape=(BLOCK_K, BLOCK_N),
        order=(1, 0),
    )

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

    for _ in range(0, tl.cdiv(K, BLOCK_K)):
        a = tl.load(a_block_ptr, boundary_check=(0, 1))
        b = tl.load(b_block_ptr, boundary_check=(0, 1))
        acc += tl.dot(a, b)
        a_block_ptr = tl.advance(a_block_ptr, (0, BLOCK_K))
        b_block_ptr = tl.advance(b_block_ptr, (BLOCK_K, 0))

    # Apply alpha
    if alpha != 1.0:
        acc = acc * alpha

    # Apply beta * C
    if HAS_C:
        c_block_ptr = tl.make_block_ptr(
            base=c_ptr,
            shape=(M, N),
            strides=(stride_cm, stride_cn),
            offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
            block_shape=(BLOCK_M, BLOCK_N),
            order=(1, 0),
        )
        c_val = tl.load(c_block_ptr, boundary_check=(0, 1)).to(tl.float32)
        acc = acc + beta * c_val

    # Store result
    d_block_ptr = tl.make_block_ptr(
        base=d_ptr,
        shape=(M, N),
        strides=(stride_dm, stride_dn),
        offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
        block_shape=(BLOCK_M, BLOCK_N),
        order=(1, 0),
    )
    tl.store(d_block_ptr, acc.to(tl.bfloat16), boundary_check=(0, 1))

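The kernel above tiles the K dimension: each loop iteration loads one (BLOCK_M, BLOCK_K) × (BLOCK_K, BLOCK_N) pair of tiles, accumulates their product in FP32, and advances both block pointers. A pure-PyTorch sketch of that same accumulation pattern (the function name and the demo block size are illustrative, not part of this module):

```python
import torch

def tiled_matmul_reference(a, b, alpha=1.0, beta=0.0, c=None, block_k=4):
    """Accumulate A @ B over K-tiles in FP32, mirroring the kernel's K-loop."""
    M, K = a.shape
    _, N = b.shape
    acc = torch.zeros(M, N, dtype=torch.float32)
    for k0 in range(0, K, block_k):  # one iteration per BLOCK_K tile
        acc += a[:, k0:k0 + block_k].float() @ b[k0:k0 + block_k, :].float()
    acc = alpha * acc                # epilogue: D = alpha * acc + beta * C
    if c is not None:
        acc = acc + beta * c.float()
    return acc

torch.manual_seed(0)
a, b, c = torch.randn(8, 16), torch.randn(16, 8), torch.randn(8, 8)
out = tiled_matmul_reference(a, b, alpha=2.0, beta=0.5, c=c)
ref = 2.0 * (a @ b) + 0.5 * c
assert torch.allclose(out, ref, atol=1e-5)
```

Tiling changes only the order of the FP32 additions, so the result matches a plain matmul up to rounding.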
# =============================================================================
# TMA MatMul (public API)
# =============================================================================

def tma_matmul(
    a: torch.Tensor,
    b: torch.Tensor,
    alpha: float = 1.0,
    beta: float = 0.0,
    c: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    """
    TMA-style matrix multiplication via Triton block-pointer kernel.

    Uses async memory transfers (block pointers + multi-stage pipelining)
    for compute/memory overlap on SM90+ GPUs.

    Args:
        a: Input matrix A [M, K] in BF16/FP16
        b: Input matrix B [K, N] in BF16/FP16
        alpha: Scale for A @ B
        beta: Scale for C
        c: Optional input C for D = alpha * A @ B + beta * C

    Returns:
        Output matrix D [M, N]
    """
    M, K = a.shape
    K2, N = b.shape
    assert K == K2, f"K dimension mismatch: {K} vs {K2}"

    if a.dtype not in (torch.bfloat16, torch.float16):
        a = a.to(torch.bfloat16)
    if b.dtype != a.dtype:
        b = b.to(a.dtype)

    d = torch.empty(M, N, device=a.device, dtype=a.dtype)

    if c is not None and beta != 0:
        if c.dtype != a.dtype:
            c = c.to(a.dtype)
        c_contiguous = c.contiguous()
        has_c = True
    else:
        c_contiguous = d  # dummy — not read when HAS_C=False
        beta = 0.0
        has_c = False

    a = a.contiguous()
    b = b.contiguous()

    # Fall back to torch.matmul on CPU
    if not a.is_cuda:
        result = alpha * torch.matmul(a.float(), b.float()).to(a.dtype)
        if has_c:
            result = result + beta * c_contiguous
        return result

    grid = lambda META: (
        triton.cdiv(M, META['BLOCK_M']),
        triton.cdiv(N, META['BLOCK_N']),
    )

    _tma_matmul_kernel[grid](
        a, b, c_contiguous, d,
        M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c_contiguous.stride(0), c_contiguous.stride(1),
        d.stride(0), d.stride(1),
        alpha, beta,
        HAS_C=has_c,
    )

    return d

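The launch grid covers the output with one program per (BLOCK_M, BLOCK_N) tile, using ceiling division so partial edge tiles are still launched (the kernel's boundary_check masks the overhang). A quick standalone check of that grid arithmetic — triton.cdiv is just integer ceiling division:

```python
def cdiv(x: int, y: int) -> int:
    """Ceiling division, as used when computing the Triton launch grid."""
    return (x + y - 1) // y

M, N = 1000, 513
BLOCK_M, BLOCK_N = 128, 128
grid = (cdiv(M, BLOCK_M), cdiv(N, BLOCK_N))
assert grid == (8, 5)  # 8*128 = 1024 >= 1000; 5*128 = 640 >= 513
# Every output element falls inside some launched tile.
assert grid[0] * BLOCK_M >= M and grid[1] * BLOCK_N >= N
```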
# =============================================================================
# TMA Attention (SDPA-backed)
# =============================================================================

def tma_attention(
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    scale: Optional[float] = None,
    is_causal: bool = False,
    dropout_p: float = 0.0,
) -> torch.Tensor:
    """
    Attention via PyTorch SDPA (dispatches to Flash Attention 2 on supported HW).

    Args:
        q: Query tensor [batch, heads, seq_q, head_dim]
        k: Key tensor [batch, heads, seq_kv, head_dim]
        v: Value tensor [batch, heads, seq_kv, head_dim]
        scale: Attention scale (default: 1/sqrt(head_dim))
        is_causal: Apply causal mask
        dropout_p: Dropout probability

    Returns:
        Output tensor [batch, heads, seq_q, head_dim]
    """
    if scale is None:
        scale = q.shape[-1] ** -0.5

    orig_dtype = q.dtype
    if q.dtype not in (torch.bfloat16, torch.float16):
        q = q.to(torch.bfloat16)
        k = k.to(torch.bfloat16)
        v = v.to(torch.bfloat16)

    o = F.scaled_dot_product_attention(
        q, k, v,
        attn_mask=None,
        dropout_p=dropout_p if q.requires_grad else 0.0,
        is_causal=is_causal,
        scale=scale,
    )

    return o.to(orig_dtype)


def tma_gqa_attention(
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    num_kv_heads: int,
    scale: Optional[float] = None,
    is_causal: bool = False,
) -> torch.Tensor:
    """
    Grouped Query Attention via SDPA.

    Expands KV heads to match Q heads, then delegates to tma_attention.

    Args:
        q: Query [batch, num_q_heads, seq, head_dim]
        k: Key [batch, num_kv_heads, seq, head_dim]
        v: Value [batch, num_kv_heads, seq, head_dim]
        num_kv_heads: Number of KV heads
        scale: Attention scale
        is_causal: Apply causal mask

    Returns:
        Output [batch, num_q_heads, seq, head_dim]
    """
    batch, num_q_heads, seq_q, head_dim = q.shape
    heads_per_group = num_q_heads // num_kv_heads

    if heads_per_group > 1:
        k = k.repeat_interleave(heads_per_group, dim=1)
        v = v.repeat_interleave(heads_per_group, dim=1)

    return tma_attention(q, k, v, scale=scale, is_causal=is_causal)

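The GQA expansion above relies on repeat_interleave placing the copies so that query head i reads KV head i // heads_per_group. A standalone check of that head mapping using plain SDPA (the shapes here are arbitrary demo values):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, num_q_heads, num_kv_heads, seq, head_dim = 2, 8, 2, 5, 16
q = torch.randn(batch, num_q_heads, seq, head_dim)
k = torch.randn(batch, num_kv_heads, seq, head_dim)
v = torch.randn(batch, num_kv_heads, seq, head_dim)

# Expand each KV head to serve num_q_heads // num_kv_heads query heads.
g = num_q_heads // num_kv_heads                     # heads_per_group = 4
k_exp = k.repeat_interleave(g, dim=1)
v_exp = v.repeat_interleave(g, dim=1)
out = F.scaled_dot_product_attention(q, k_exp, v_exp, is_causal=True)

# Query head 5 should attend to KV head 5 // 4 == 1.
ref = F.scaled_dot_product_attention(q[:, 5:6], k[:, 1:2], v[:, 1:2], is_causal=True)
assert torch.allclose(out[:, 5:6], ref, atol=1e-5)
```

repeat_interleave (copies adjacent: 0,0,0,0,1,1,1,1) gives this contiguous grouping; torch.repeat would interleave the other way (0,1,0,1,...) and break the mapping.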
# =============================================================================
# Native MXFP4 GEMM (Quartet Algorithm)
# =============================================================================
# Reference: "Quartet: Native FP4 Training Can Be Optimal for LLMs"
# https://arxiv.org/html/2505.14669v1

# E2M1 quantization grid (MXFP4/NVFP4)
_E2M1_VALUES = torch.tensor(
    [0, 0.5, 1, 1.5, 2, 3, 4, 6, 0, -0.5, -1, -1.5, -2, -3, -4, -6],
    dtype=torch.float32,
)

# Bucketize boundaries for vectorized E2M1 quantization.
# Midpoints between adjacent unsigned E2M1 values [0, 0.5, 1, 1.5, 2, 3, 4, 6].
# torch.bucketize gives index 0-7 for unsigned magnitude, then sign is applied.
_E2M1_BOUNDARIES = torch.tensor([0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0])

# QuEST optimal clipping factor (empirically derived)
_QUEST_CLIP_FACTOR = 0.88

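How the midpoint boundaries drive round-to-nearest: torch.bucketize returns, for each magnitude, the index of the E2M1 grid slot whose interval contains it, so one bucketize call plus a sign replaces the 16-way distance matrix. A standalone round trip with the same constants (the variable names are illustrative):

```python
import torch

# Unsigned E2M1 grid and the midpoints between adjacent grid values.
e2m1 = torch.tensor([0., 0.5, 1., 1.5, 2., 3., 4., 6.])
boundaries = torch.tensor([0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0])

x = torch.tensor([0.1, -0.6, 1.4, -2.4, 5.5])
idx = torch.bucketize(x.abs(), boundaries)  # index 0-7 of nearest unsigned value
q = e2m1[idx] * torch.sign(x)               # reapply the sign

# 0.1 -> 0, -0.6 -> -0.5, 1.4 -> 1.5, -2.4 -> -2 (|err| 0.4 < 0.6), 5.5 -> 6
assert torch.equal(q, torch.tensor([0., -0.5, 1.5, -2., 6.]))
```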
@dataclass
class MXFP4Weights:
    """
    MXFP4 quantized weights following OCP Microscaling Spec v1.0.

    Format: 32 E2M1 values share 1 E8M0 power-of-two scale.

    Memory layout:
    - packed: [K//2, N] uint8 (2 nibbles per byte)
    - scales: [K//32, N] uint8 (E8M0 exponent-only)

    Total size: K*N/2 + K*N/32 = K*N * (1/2 + 1/32) ~ 0.53 * original
    """
    packed: torch.Tensor    # [K//2, N] uint8
    scales: torch.Tensor    # [K//32, N] uint8 (E8M0)
    shape: Tuple[int, int]  # Original (K, N)
    clip_mask: Optional[torch.Tensor] = None  # For QuEST gradient masking

    @classmethod
    def from_float(cls, weights: torch.Tensor, use_quest: bool = True) -> 'MXFP4Weights':
        """
        Quantize FP16/FP32 weights to MXFP4 with block scaling.

        Args:
            weights: Input tensor [K, N]
            use_quest: Use QuEST optimal clipping (recommended for forward)

        Returns:
            MXFP4Weights with packed values and E8M0 scales
        """
        K, N = weights.shape
        assert K % 32 == 0, f"K ({K}) must be multiple of 32 for MXFP4"

        device = weights.device
        weights = weights.float()

        # Reshape to blocks of 32
        reshaped = weights.view(K // 32, 32, N)

        # Find block-wise absmax
        absmax = reshaped.abs().amax(dim=1)  # [K//32, N]
        absmax = absmax.clamp(min=1e-10)

        # Apply QuEST clipping factor
        if use_quest:
            clip_bound = absmax * _QUEST_CLIP_FACTOR
        else:
            clip_bound = absmax

        # Compute E8M0 scales (power-of-two)
        # E8M0: value = 2^(exponent - 127), exponent in [0, 255]
        # We want scale * 6.0 >= clip_bound, so scale >= clip_bound / 6
        scale_float = clip_bound / 6.0  # 6.0 is E2M1 max

        # Convert to E8M0 (find nearest power of 2)
        log2_scale = torch.log2(scale_float.clamp(min=2**-126))
        exponent = (log2_scale.round() + 127).clamp(1, 254).to(torch.uint8)

        # Reconstruct actual scale from E8M0
        actual_scale = torch.pow(2.0, exponent.float() - 127)  # [K//32, N]

        # Normalize by scale
        normalized = reshaped / actual_scale.unsqueeze(1)  # [K//32, 32, N]

        # Clamp to E2M1 range [-6, 6]
        normalized = normalized.clamp(-6.0, 6.0)

        # Generate clip mask for gradient (QuEST)
        if use_quest:
            clip_mask = (reshaped.abs() > clip_bound.unsqueeze(1) * 6.0).view(K, N)
        else:
            clip_mask = None

        # Quantize to nearest E2M1 value via vectorized bucketize.
        # O(K*N) instead of O(K*N*16) distance matrix — eliminates 1GB temp alloc.
        boundaries = _E2M1_BOUNDARIES.to(device)
        abs_norm = normalized.abs().reshape(-1)  # [K * N]
        unsigned_idx = torch.bucketize(abs_norm, boundaries)  # [K * N], values 0-7
        # Sign bit only when magnitude > 0 (±0 both decode to 0.0, use index 0)
        sign_bit = ((normalized.reshape(-1) < 0) & (unsigned_idx > 0)).to(torch.uint8) << 3
        indices = (sign_bit | unsigned_idx.to(torch.uint8)).reshape(K, N)
        packed = (indices[0::2] | (indices[1::2] << 4))  # [K//2, N]

        return cls(
            packed=packed,
            scales=exponent,
            shape=(K, N),
            clip_mask=clip_mask,
        )

    def to_float(self) -> torch.Tensor:
        """Dequantize MXFP4 back to float."""
        K, N = self.shape
        device = self.packed.device

        e2m1_grid = _E2M1_VALUES.to(device)

        # Unpack nibbles
        low = (self.packed & 0xF).long()
        high = (self.packed >> 4).long()

        # Decode E2M1 values
        low_vals = e2m1_grid[low.flatten()].view(K // 2, N)
        high_vals = e2m1_grid[high.flatten()].view(K // 2, N)

        # Interleave
        unpacked = torch.zeros(K, N, device=device, dtype=torch.float32)
        unpacked[0::2] = low_vals
        unpacked[1::2] = high_vals

        # Apply E8M0 scales
        scale_float = torch.pow(2.0, self.scales.float() - 127)  # [K//32, N]
        unpacked = unpacked.view(K // 32, 32, N)
        unpacked = unpacked * scale_float.unsqueeze(1)

        return unpacked.view(K, N)

    @property
    def compression_ratio(self) -> float:
        """Memory compression ratio vs FP16."""
        K, N = self.shape
        fp16_bytes = K * N * 2
        mxfp4_bytes = self.packed.numel() + self.scales.numel()
        return fp16_bytes / mxfp4_bytes

    def to_native(self) -> 'NativeMXFP4':
        """
        Convert to native FP4 format for tl.dot_scaled (SM100+).

        One-time conversion that:
        1. Transposes packed weights: [K//2, N] -> [N, K//2]
        2. Converts E8M0 scales to 5D preshuffled MXScaleTensor layout:
           [N//128, K//32//4, 32, 4, 4]
        3. Caches the result so subsequent calls return immediately.

        Returns:
            NativeMXFP4 with preshuffled layout for hardware MMA.
        """
        if hasattr(self, '_native_cache') and self._native_cache is not None:
            return self._native_cache

        K, N = self.shape

        # Transpose packed weights for column-major access pattern
        packed_t = self.packed.T.contiguous()  # [N, K//2]

        # Build 5D preshuffled scale tensor for MXScaleTensor layout
        # Hardware expects: [N//128, K//32//4, 32, 4, 4]
        # This arranges scales so tensor core warps can load them directly.
        num_scale_k = K // 32
        num_scale_n = N

        # Pad N to multiple of 128 for the 5D layout
        n_blocks = (N + 127) // 128

        # Reshape scales [K//32, N] -> 5D preshuffled
        scales_flat = self.scales.contiguous()  # [K//32, N]

        # Group K scales into groups of 4
        k_groups = (num_scale_k + 3) // 4

        scales_5d = torch.zeros(
            n_blocks, k_groups, 32, 4, 4,
            dtype=torch.uint8, device=self.packed.device,
        )

        # Fill the 5D tensor: map (k_scale_idx, n_idx) -> 5D position
        for nb in range(n_blocks):
            for kg in range(k_groups):
                for inner_n in range(min(128, N - nb * 128)):
                    n_idx = nb * 128 + inner_n
                    if n_idx >= N:
                        break
                    # Map inner_n into (d2, d4) where d2 is in [0, 32), d4 in [0, 4)
                    d4 = inner_n % 4
                    d2 = (inner_n // 4) % 32
                    for d3 in range(min(4, num_scale_k - kg * 4)):
                        k_idx = kg * 4 + d3
                        if k_idx < num_scale_k:
                            scales_5d[nb, kg, d2, d3, d4] = scales_flat[k_idx, n_idx]

        native = NativeMXFP4(
            packed_t=packed_t,
            scales_5d=scales_5d,
            shape=(K, N),
        )
        self._native_cache = native
        return native

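The E8M0 step in from_float rounds each ideal per-block scale to the nearest power of two (round in log2 space), so the decoded scale is always within a factor of sqrt(2) of the ideal one. A standalone sketch of that round trip with illustrative block absmax values:

```python
import torch

# E8M0 stores only an exponent: value = 2**(e - 127), e in [0, 255].
block_absmax = torch.tensor([0.013, 1.7, 96.0])
scale_float = block_absmax * 0.88 / 6.0          # QuEST clip, E2M1 max = 6.0
exponent = (torch.log2(scale_float).round() + 127).clamp(1, 254).to(torch.uint8)
actual = torch.pow(2.0, exponent.float() - 127)  # decoded power-of-two scale

# Rounding log2 to the nearest integer bounds the error by half an octave.
ratio = actual / scale_float
assert torch.all(ratio <= 2 ** 0.5 + 1e-6) and torch.all(ratio >= 2 ** -0.5 - 1e-6)
```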
@dataclass
class NativeMXFP4:
    """
    Native FP4 format for tl.dot_scaled hardware path (SM100+).

    Preshuffled layout matching MXScaleTensor requirements:
    - packed_t: [N, K//2] uint8 — transposed packed weights
    - scales_5d: [N//128, K//32//4, 32, 4, 4] uint8 — preshuffled E8M0

    Created via MXFP4Weights.to_native(). Cached so conversion is one-time.
    """
    packed_t: torch.Tensor   # [N, K//2] uint8
    scales_5d: torch.Tensor  # [N//128, K//32//4, 32, 4, 4] uint8
    shape: Tuple[int, int]   # Original (K, N)

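When N is a multiple of 128 and K//32 a multiple of 4 (no padding), the nested-loop preshuffle in to_native() is a pure index permutation: decompose K as (kg, d3) and N as (nb, d2, d4), then reorder to (nb, kg, d2, d3, d4). A sketch verifying that a view + permute reproduces the loop's mapping (variable names here are illustrative; the loop remains the general path because it also handles padded edges):

```python
import torch

K, N = 256, 128                 # K//32 = 8 scale rows, one 128-wide N block
Ks = K // 32
scales_flat = torch.arange(Ks * N, dtype=torch.float32).view(Ks, N)

# Reference: the nested-loop mapping used by to_native().
ref = torch.zeros(N // 128, Ks // 4, 32, 4, 4)
for nb in range(N // 128):
    for kg in range(Ks // 4):
        for inner_n in range(128):
            d4, d2 = inner_n % 4, (inner_n // 4) % 32
            for d3 in range(4):
                ref[nb, kg, d2, d3, d4] = scales_flat[kg * 4 + d3, nb * 128 + inner_n]

# Vectorized equivalent: split both axes, then permute to (nb, kg, d2, d3, d4).
vec = scales_flat.view(Ks // 4, 4, N // 128, 32, 4).permute(2, 0, 3, 1, 4)
assert torch.equal(vec, ref)
```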
# =============================================================================
# E4M3 (FP8) Encode/Decode Helpers
# =============================================================================

def _encode_e4m3(values: torch.Tensor) -> torch.Tensor:
    """Encode FP32 values to E4M3 (FP8) as uint8. Vectorized."""
    if hasattr(torch, 'float8_e4m3fn'):
        return values.clamp(-448.0, 448.0).to(torch.float8_e4m3fn).view(torch.uint8)
    # Manual fallback: clamp to representable range and use bit manipulation
    v = values.float().clamp(-448.0, 448.0)
    sign = (v < 0).to(torch.uint8) << 7
    av = v.abs().clamp(min=0.0)
    # E4M3: bias=7, subnormal threshold = 2^-6
    # Normal: (1 + mant/8) * 2^(exp-7)
    # Subnormal (exp=0): (mant/8) * 2^-6
    log2_av = torch.log2(av.clamp(min=2**-9))  # min subnormal = 2^-9
    exp_raw = torch.floor(log2_av).clamp(-6, 8)
    exp_biased = (exp_raw + 7).clamp(0, 15)
    # For normal values
    mantissa_f = (av / torch.pow(2.0, exp_raw) - 1.0) * 8.0
    mantissa = mantissa_f.round().clamp(0, 7).to(torch.uint8)
    # For subnormal (exp_biased == 0)
    sub_mant = (av / (2**-6) * 8.0).round().clamp(0, 7).to(torch.uint8)
    is_sub = exp_biased == 0
    final_mant = torch.where(is_sub, sub_mant, mantissa)
    return sign | (exp_biased.to(torch.uint8) << 3) | final_mant


def _decode_e4m3(encoded: torch.Tensor) -> torch.Tensor:
    """Decode E4M3 uint8 back to FP32. Vectorized."""
    if hasattr(torch, 'float8_e4m3fn'):
        return encoded.view(torch.float8_e4m3fn).float()
    # Manual fallback
    sign = ((encoded >> 7) & 1).float()
    exp = ((encoded >> 3) & 0xF).long()
    mant = (encoded & 0x7).long()
    is_normal = exp > 0
    normal_val = (8 + mant).float() * torch.pow(2.0, (exp - 10).float())
    subnormal_val = mant.float() * (2.0 ** -9)
    unsigned = torch.where(is_normal, normal_val, subnormal_val)
    return torch.where(sign != 0, -unsigned, unsigned)

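A scalar sketch of the bit layout the fallback path produces — sign bit, 4-bit exponent (bias 7), 3-bit mantissa — covering the normal range [2^-6, 448] and zero (helper names are illustrative; subnormals are not modeled here):

```python
import math

def e4m3_encode(v: float) -> int:
    """Sketch of the fallback's normal-value path (|v| in [2**-6, 448] or 0)."""
    v = max(-448.0, min(448.0, v))
    sign = 0x80 if v < 0 else 0
    av = abs(v)
    if av == 0.0:
        return sign
    e = min(8, max(-6, math.floor(math.log2(av))))         # unbiased exponent
    mant = max(0, min(7, round((av / 2 ** e - 1.0) * 8)))  # 3-bit mantissa
    return sign | ((e + 7) << 3) | mant                    # exponent bias = 7

def e4m3_decode(b: int) -> float:
    sign = -1.0 if b & 0x80 else 1.0
    exp, mant = (b >> 3) & 0xF, b & 0x7
    # Normal: (8 + mant) * 2^(exp - 10) == (1 + mant/8) * 2^(exp - 7)
    mag = (8 + mant) * 2.0 ** (exp - 10) if exp > 0 else mant * 2.0 ** -9
    return sign * mag

# Exactly representable E4M3 values survive the round trip.
for v in [0.0, 1.5, -3.0, 0.875, 448.0, -2 ** -6]:
    assert e4m3_decode(e4m3_encode(v)) == v
```

For example, 1.5 encodes as exponent 0 (biased 7) and mantissa 4: (1 + 4/8) * 2^0.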
| 578 |
+
# =============================================================================
# NVFP4 Weights (NVIDIA Blackwell native format)
# =============================================================================

@dataclass
class NVFP4Weights:
    """
    NVFP4 quantized weights — NVIDIA Blackwell native format.

    Format: 16 E2M1 values share 1 E4M3 (FP8) scale + a per-tensor FP32 scale.
    Two-level hierarchical scaling enables native 5th-gen Tensor Core support.

    Memory layout:
    - packed: [K//2, N] uint8 (2 nibbles per byte, same E2M1 encoding)
    - block_scales: [K//16, N] uint8 (E4M3 per-block scale)
    - tensor_scale: float (FP32 per-tensor global scale)

    Optional FP8 residual correction (double-buff):
    - residual: [K, N] uint8 (E4M3 encoded quantization error)
    - residual_scales: [K//16, N] float32 (per-block scales for residual)
    When present, the fused kernel adds the decoded residual to recover
    near-FP16 accuracy at ~1.8 B/elem (vs 2.0 for FP16; see compression_ratio).

    Total size without residual: K*N/2 + K*N/16 bytes = 0.5625 B/elem (~28% of FP16)
    Total size with residual:    + K*N + K*N/4 bytes ~ 1.8125 B/elem (~91% of FP16)
    """
    packed: torch.Tensor        # [K//2, N] uint8 — E2M1 nibble packing
    block_scales: torch.Tensor  # [K//16, N] uint8 — E4M3 per-block scale
    tensor_scale: float         # FP32 per-tensor global scale
    shape: Tuple[int, int]      # (K, N)
    clip_mask: Optional[torch.Tensor] = None
    # FP8 residual correction (optional, "double-buff")
    residual: Optional[torch.Tensor] = None         # [K, N] uint8 — E4M3 encoded
    residual_scales: Optional[torch.Tensor] = None  # [K//16, N] float32 per-block

    @classmethod
    def from_float(cls, weights: torch.Tensor, use_quest: bool = True,
                   compute_residual: bool = False) -> 'NVFP4Weights':
        """
        Quantize FP16/FP32 weights to NVFP4 with hierarchical scaling.

        Two-level scaling:
        1. Per-tensor FP32 scale (global_absmax / 448)
        2. Per-block E4M3 scale (block_absmax / (tensor_scale * 6.0))

        Args:
            weights: Input tensor [K, N]
            use_quest: Use QuEST optimal clipping (recommended)
            compute_residual: Compute FP8 residual correction (double-buff).
                When True, the quantization error (original - FP4 dequant) is
                quantized to E4M3 FP8 with per-block scaling and stored alongside
                the FP4 weights. The fused kernel adds this residual for near-FP16
                accuracy at ~1.8 B/elem.

        Returns:
            NVFP4Weights with packed values, E4M3 block scales, and FP32 tensor scale
            (plus optional residual and residual_scales when compute_residual=True)
        """
        K, N = weights.shape
        assert K % 16 == 0, f"K ({K}) must be a multiple of 16 for NVFP4"

        device = weights.device
        weights_f = weights.float()

        # Reshape to blocks of 16
        reshaped = weights_f.view(K // 16, 16, N)

        # Block-wise absmax
        absmax = reshaped.abs().amax(dim=1)  # [K//16, N]
        absmax = absmax.clamp(min=1e-10)

        # Apply QuEST clipping
        if use_quest:
            clip_bound = absmax * _QUEST_CLIP_FACTOR
        else:
            clip_bound = absmax

        # Level 1: per-tensor scale
        global_absmax = clip_bound.max().clamp(min=1e-10)
        tensor_scale = (global_absmax / 448.0).item()  # 448 = E4M3 max

        # Level 2: per-block E4M3 scale
        target_scale = clip_bound / (tensor_scale * 6.0)  # 6.0 = E2M1 max
        target_scale = target_scale.clamp(min=1e-10)
        block_scales_fp8 = _encode_e4m3(target_scale)  # [K//16, N] uint8

        # Actual scale per block = decode(block_scales_fp8) * tensor_scale
        actual_block_scale = _decode_e4m3(block_scales_fp8) * tensor_scale  # [K//16, N]
        actual_block_scale = actual_block_scale.clamp(min=1e-10)

        # Normalize and clamp
        normalized = reshaped / actual_block_scale.unsqueeze(1)  # [K//16, 16, N]
        normalized = normalized.clamp(-6.0, 6.0)

        # Generate clip mask for gradient (QuEST). An element is clipped when
        # |x| exceeds 6 * actual_block_scale, matching the clamp above.
        if use_quest:
            clip_mask = (reshaped.abs() > actual_block_scale.unsqueeze(1) * 6.0).view(K, N)
        else:
            clip_mask = None

        # Quantize via vectorized bucketize (same as MXFP4 Step 1)
        boundaries = _E2M1_BOUNDARIES.to(device)
        abs_norm = normalized.abs().reshape(-1)
        unsigned_idx = torch.bucketize(abs_norm, boundaries)
        sign_bit = ((normalized.reshape(-1) < 0) & (unsigned_idx > 0)).to(torch.uint8) << 3
        indices = (sign_bit | unsigned_idx.to(torch.uint8)).reshape(K, N)

        # Pack pairs of nibbles
        packed = (indices[0::2] | (indices[1::2] << 4))  # [K//2, N]

        # --- FP8 residual correction (double-buff) ---
        residual_e4m3 = None
        residual_scales = None
        if compute_residual:
            # Dequant the FP4 approximation
            fp4_approx = cls(
                packed=packed, block_scales=block_scales_fp8,
                tensor_scale=tensor_scale, shape=(K, N),
            ).to_float()
            # Residual = original - FP4 approximation
            residual_float = weights_f - fp4_approx  # [K, N]

            # Quantize residual to FP8 E4M3 with per-block scaling (blocks of 16)
            res_blocks = residual_float.view(K // 16, 16, N)
            res_absmax = res_blocks.abs().amax(dim=1).clamp(min=1e-10)  # [K//16, N]
            res_scale = res_absmax / 448.0  # E4M3 max value
            res_normalized = res_blocks / res_scale.unsqueeze(1)
            res_normalized = res_normalized.clamp(-448.0, 448.0)
            # Encode to E4M3 using the native PyTorch path
            residual_e4m3 = res_normalized.view(K, N).to(torch.float8_e4m3fn).view(torch.uint8)
            residual_scales = res_scale  # [K//16, N] float32

        return cls(
            packed=packed,
            block_scales=block_scales_fp8,
            tensor_scale=tensor_scale,
            shape=(K, N),
            clip_mask=clip_mask,
            residual=residual_e4m3,
            residual_scales=residual_scales,
        )

    def to_float(self) -> torch.Tensor:
        """Dequantize NVFP4 back to float with two-level scaling."""
        K, N = self.shape
        device = self.packed.device

        e2m1_grid = _E2M1_VALUES.to(device)

        # Unpack nibbles
        low = (self.packed & 0xF).long()
        high = (self.packed >> 4).long()

        # Decode E2M1 values
        low_vals = e2m1_grid[low.flatten()].view(K // 2, N)
        high_vals = e2m1_grid[high.flatten()].view(K // 2, N)

        # Interleave
        unpacked = torch.zeros(K, N, device=device, dtype=torch.float32)
        unpacked[0::2] = low_vals
        unpacked[1::2] = high_vals

        # Two-level scale: E4M3 block scale * FP32 tensor scale
        block_sf = _decode_e4m3(self.block_scales)  # [K//16, N]
        scale = block_sf * self.tensor_scale
        unpacked = unpacked.view(K // 16, 16, N) * scale.unsqueeze(1)
        return unpacked.view(K, N)

    @property
    def compression_ratio(self) -> float:
        """Memory compression ratio vs FP16."""
        K, N = self.shape
        fp16_bytes = K * N * 2
        nvfp4_bytes = self.packed.numel() + self.block_scales.numel()
        if self.residual is not None:
            nvfp4_bytes += self.residual.numel()  # [K, N] uint8
        if self.residual_scales is not None:
            nvfp4_bytes += self.residual_scales.numel() * 4  # float32
        return fp16_bytes / nvfp4_bytes


# Alias: FP4Weights now points to NVFP4 (the better format)
FP4Weights = NVFP4Weights

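The two-level scale hierarchy used by `NVFP4Weights.from_float` can be illustrated with plain-float arithmetic. A minimal sketch with made-up example magnitudes, ignoring the E4M3 rounding of the stored block scale:

```python
# Level 1 maps the largest clip bound onto E4M3's max (448);
# level 2 maps each block's clip bound onto E2M1's max (6.0).
global_absmax = 3.5                   # example: largest |w| over the whole tensor
tensor_scale = global_absmax / 448.0  # level 1 (kept in FP32)

block_absmax = 0.875                  # example: largest |w| within one 16-elem block
block_scale = block_absmax / (tensor_scale * 6.0)  # level 2 (stored as E4M3)

# Dequant multiplies both levels back; the block's absmax lands exactly on 6.0,
# i.e. the top of the E2M1 grid, before E4M3 rounding of block_scale.
effective_scale = block_scale * tensor_scale
assert abs(block_absmax / effective_scale - 6.0) < 1e-9
```

In the real path the stored block scale is round-tripped through `_encode_e4m3`/`_decode_e4m3` first, which is why the code normalizes by `actual_block_scale` rather than `target_scale`.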
def mxfp4_gemm(
    activations: torch.Tensor,
    weights: MXFP4Weights,
    bias: Optional[torch.Tensor] = None,
    use_hadamard: bool = True,
) -> torch.Tensor:
    """
    MXFP4 GEMM using the Quartet algorithm with fused dequant-matmul.

    Implements the forward pass:
    1. Apply Hadamard transform for outlier mitigation
    2. Quantize activations with QuEST optimal clipping
    3. Fused dequant-matmul (weight tile dequantized in registers, never in global memory)

    Two-tier dispatch:
    - If native FP4 tensor cores are available (tl.dot_scaled, future SM fix):
      use hardware FP4 MMA
    - Otherwise: use the fused dequant-matmul Triton kernel (our implementation)

    Reference: "Quartet: Native FP4 Training Can Be Optimal for LLMs"
    https://arxiv.org/html/2505.14669v1

    Args:
        activations: Input [M, K] in BF16/FP16
        weights: MXFP4Weights with packed E2M1 values and E8M0 scales
        bias: Optional bias [N]
        use_hadamard: Apply Hadamard transform (recommended)

    Returns:
        Output [M, N] in BF16
    """
    M, K = activations.shape
    K_w, N = weights.shape
    assert K == K_w, f"K dimension mismatch: {K} vs {K_w}"
    assert K % 32 == 0, f"K ({K}) must be a multiple of 32 for MXFP4"

    # Step 1: Hadamard transform on activations (outlier mitigation)
    if use_hadamard and K >= 32:
        x = activations.float().view(M, K // 32, 32)
        x = _hadamard_transform_32(x)
        x = x.view(M, K)
    else:
        x = activations.float()

    # Step 2: Quantize activations to MXFP4 with QuEST, then dequant back
    # (activations go through quantize->dequantize to simulate FP4 noise)
    x_for_quant = x.T.contiguous()  # [K, M]
    x_quant = MXFP4Weights.from_float(x_for_quant, use_quest=True)
    x_dequant = x_quant.to_float().T.contiguous()  # [M, K]

    # Step 3: Dispatch to fused kernel or native FP4
    if not activations.is_cuda:
        # CPU fallback: full dequant + torch.matmul
        w_dequant = weights.to_float()
        d = torch.matmul(x_dequant, w_dequant)
        if bias is not None:
            d = d + bias.float()
        return d.to(torch.bfloat16)

    if _can_use_native_fp4():
        return _native_fp4_matmul(x_dequant, weights.to_native(), bias)
    else:
        return _fused_fp4_matmul(x_dequant, weights, bias)


def mxfp4_gemm_legacy(
    activations: torch.Tensor,
    weights: MXFP4Weights,
    bias: Optional[torch.Tensor] = None,
    use_hadamard: bool = True,
) -> torch.Tensor:
    """
    Legacy MXFP4 GEMM: full dequant to global memory + torch.matmul.

    Kept for benchmarking comparison against the fused kernel.
    """
    M, K = activations.shape
    K_w, N = weights.shape
    assert K == K_w, f"K dimension mismatch: {K} vs {K_w}"
    assert K % 32 == 0, f"K ({K}) must be a multiple of 32 for MXFP4"

    if use_hadamard and K >= 32:
        x = activations.float().view(M, K // 32, 32)
        x = _hadamard_transform_32(x)
        x = x.view(M, K)
    else:
        x = activations.float()

    x_for_quant = x.T.contiguous()
    x_quant = MXFP4Weights.from_float(x_for_quant, use_quest=True)
    x_dequant = x_quant.to_float().T.contiguous()
    w_dequant = weights.to_float()
    d = torch.matmul(x_dequant, w_dequant)

    if bias is not None:
        d = d + bias.float()

    return d.to(torch.bfloat16)


def _hadamard_transform_32(x: torch.Tensor) -> torch.Tensor:
    """
    Fast Hadamard Transform on the last dimension (size 32).

    Applies an orthonormal Hadamard rotation to spread outliers.
    Builds the matrix via the recursive Sylvester construction and
    applies it with a single matmul.
    """
    assert x.shape[-1] == 32

    def hadamard_matrix(n):
        if n == 1:
            return torch.ones(1, 1, device=x.device, dtype=x.dtype)
        h = hadamard_matrix(n // 2)
        return torch.cat([
            torch.cat([h, h], dim=1),
            torch.cat([h, -h], dim=1),
        ], dim=0) / (2 ** 0.5)

    H = hadamard_matrix(32)
    return x @ H


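Because each doubling step is rescaled by 1/sqrt(2), the constructed matrix is orthonormal, so the transform is a pure rotation that preserves norms while spreading outlier energy. A dependency-free sketch of the same recursion on nested lists, checking H @ H^T = I for n = 4:

```python
def hadamard(n):
    """Orthonormal Sylvester-Hadamard matrix as nested lists (n a power of 2)."""
    if n == 1:
        return [[1.0]]
    h = hadamard(n // 2)
    s = 2 ** 0.5
    top = [row + row for row in h]                  # [H,  H]
    bot = [row + [-v for v in row] for row in h]    # [H, -H]
    return [[v / s for v in row] for row in top + bot]

H = hadamard(4)
# Rows are orthonormal: H @ H^T is the identity, so x @ H preserves ||x||.
for i in range(4):
    for j in range(4):
        dot = sum(H[i][k] * H[j][k] for k in range(4))
        assert abs(dot - (1.0 if i == j else 0.0)) < 1e-12
```

Orthonormality is what makes the transform safe to apply before quantization: it can be undone exactly (or absorbed into the weights) without changing the GEMM result.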
# =============================================================================
# Arithmetic E2M1 Decoder (Triton JIT helper)
# =============================================================================
# Decode 4-bit E2M1 index -> float32 using pure register arithmetic.
# No LUT needed — bitfield extraction + tl.exp2() computes the value.
#
# E2M1 encoding (OCP Microscaling Spec v1.0):
#   bit[3] = sign, bit[2:1] = exponent (2 bits), bit[0] = mantissa (1 bit)
#   Subnormal (exp==0): value = mantissa * 0.5           -> {0.0, 0.5}
#   Normal   (exp>0):   value = (2 + mantissa) * 2^(exp - 2)
# Values: 0, 0.5, 1, 1.5, 2, 3, 4, 6, -0, -0.5, -1, -1.5, -2, -3, -4, -6

@triton.jit
def _e2m1_decode(idx):
    """Decode 4-bit E2M1 index -> float32. Register-only, no LUT."""
    sign = (idx >> 3) & 1
    exp = (idx >> 1) & 3
    mant = idx & 1
    is_normal = exp > 0  # bool
    subnormal_val = mant.to(tl.float32) * 0.5
    normal_val = (2 + mant).to(tl.float32) * tl.exp2((exp - 2).to(tl.float32))
    unsigned_val = tl.where(is_normal, normal_val, subnormal_val)
    return tl.where(sign != 0, -unsigned_val, unsigned_val)

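The decode formula can be checked against the full E2M1 value table with a plain-Python mirror of the kernel's bitfield arithmetic (illustrative only; the real helper runs inside a Triton JIT context):

```python
def e2m1_decode(idx: int) -> float:
    """Decode a 4-bit E2M1 index, mirroring the Triton helper's arithmetic."""
    sign = (idx >> 3) & 1
    exp = (idx >> 1) & 3
    mant = idx & 1
    if exp > 0:
        value = (2 + mant) * 2.0 ** (exp - 2)  # normal
    else:
        value = mant * 0.5                     # subnormal
    return -value if sign else value

# Positive codes 0..7 enumerate the documented grid; bit 3 flips the sign.
assert [e2m1_decode(i) for i in range(8)] == [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
assert e2m1_decode(0xF) == -6.0
```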
# =============================================================================
# Fused FP4 Dequant-MatMul Triton Kernel (Tier 2)
# =============================================================================
# Instead of materializing the full dequantized weight matrix in global memory,
# this kernel loads packed FP4 tiles, dequantizes in registers via arithmetic
# E2M1 decode, applies E8M0 block scales, and feeds BF16 into tl.dot().
# The full dequantized matrix NEVER exists in global memory.
# ~16x less memory traffic on the weight side vs the old full-dequant path.

@triton.autotune(
    configs=[
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=3, num_warps=8),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 32}, num_stages=4, num_warps=8),
        triton.Config({'BLOCK_M': 64, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=4, num_warps=4),
        triton.Config({'BLOCK_M': 64, 'BLOCK_N': 64, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
    ],
    key=['M', 'N', 'K'],
)
@triton.jit
def _fused_fp4_dequant_matmul_kernel(
    a_ptr,         # [M, K] BF16 activations
    w_packed_ptr,  # [K//2, N] uint8 packed FP4 weights
    w_scales_ptr,  # [K//32, N] uint8 E8M0 scales
    out_ptr,       # [M, N] BF16 output
    bias_ptr,      # [N] optional bias
    M, N, K,
    stride_am, stride_ak,
    stride_wk, stride_wn,  # strides for packed [K//2, N]
    stride_sk, stride_sn,  # strides for scales [K//32, N]
    stride_om, stride_on,
    HAS_BIAS: tl.constexpr,
    BLOCK_M: tl.constexpr,
    BLOCK_N: tl.constexpr,
    BLOCK_K: tl.constexpr,
):
    """
    Fused dequant-matmul: loads packed FP4, dequantizes in registers, matmuls.

    Inner loop per K-tile:
    1. Load A tile as even/odd column halves [BLOCK_M, BLOCK_K//2] BF16 each
    2. Load packed weight tile [BLOCK_K//2, BLOCK_N] uint8
    3. Unpack nibbles: low = packed & 0xF, high = packed >> 4
    4. Arithmetic E2M1 decode via _e2m1_decode() — pure register ops, no LUT
    5. Load scale tile [BLOCK_K//32, BLOCK_N] uint8, compute 2^(s-127)
    6. Apply per-group scale to the even/odd weight halves (no interleave needed)
    7. acc += tl.dot(a_even, w_even) + tl.dot(a_odd, w_odd)
    """
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

    HALF_BLOCK_K: tl.constexpr = BLOCK_K // 2
    SCALES_PER_TILE: tl.constexpr = BLOCK_K // 32

    for k_start in range(0, K, BLOCK_K):
        # --- Load A as even/odd column halves ---
        # Even columns (0, 2, 4, ...) correspond to low nibbles,
        # odd columns (1, 3, 5, ...) to high nibbles.
        # This avoids building a full [BLOCK_K, BLOCK_N] interleaved tile.
        even_k_offs = k_start + tl.arange(0, HALF_BLOCK_K) * 2      # 0,2,4,...
        odd_k_offs = k_start + tl.arange(0, HALF_BLOCK_K) * 2 + 1   # 1,3,5,...

        a_even_ptrs = a_ptr + offs_m[:, None] * stride_am + even_k_offs[None, :] * stride_ak
        a_odd_ptrs = a_ptr + offs_m[:, None] * stride_am + odd_k_offs[None, :] * stride_ak
        mask_a_even = (offs_m[:, None] < M) & (even_k_offs[None, :] < K)
        mask_a_odd = (offs_m[:, None] < M) & (odd_k_offs[None, :] < K)
        a_even = tl.load(a_even_ptrs, mask=mask_a_even, other=0.0)  # [BLOCK_M, HALF_BLOCK_K]
        a_odd = tl.load(a_odd_ptrs, mask=mask_a_odd, other=0.0)    # [BLOCK_M, HALF_BLOCK_K]

        # --- Load packed weight tile [HALF_BLOCK_K, BLOCK_N] uint8 ---
        packed_row_start = k_start // 2
        offs_packed_k = packed_row_start + tl.arange(0, HALF_BLOCK_K)
        w_ptrs = w_packed_ptr + offs_packed_k[:, None] * stride_wk + offs_n[None, :] * stride_wn
        mask_w = (offs_packed_k[:, None] < (K // 2)) & (offs_n[None, :] < N)
        packed = tl.load(w_ptrs, mask=mask_w, other=0).to(tl.int32)

        # --- Unpack nibbles + arithmetic E2M1 decode ---
        low_f = _e2m1_decode(packed & 0xF)          # [HALF_BLOCK_K, BLOCK_N] even rows
        high_f = _e2m1_decode((packed >> 4) & 0xF)  # [HALF_BLOCK_K, BLOCK_N] odd rows

        # --- Load E8M0 scales and broadcast per 32-element group ---
        # Each scale covers 32 original K rows = 16 packed rows.
        scale_row_start = k_start // 32
        offs_local_packed = tl.arange(0, HALF_BLOCK_K)
        group_idx = offs_local_packed // 16  # which scale group each packed row belongs to

        scale_broadcast = tl.zeros((HALF_BLOCK_K, BLOCK_N), dtype=tl.float32)
        for sg in tl.static_range(0, SCALES_PER_TILE):
            sg_row = scale_row_start + sg
            sg_ptrs = w_scales_ptr + sg_row * stride_sk + offs_n * stride_sn
            sg_load_mask = (sg_row < (K // 32)) & (offs_n < N)
            sg_raw = tl.load(sg_ptrs, mask=sg_load_mask, other=127).to(tl.float32)
            sg_val = tl.exp2(sg_raw - 127.0)  # [BLOCK_N]
            sg_match = (group_idx == sg)      # [HALF_BLOCK_K] bool
            scale_broadcast = tl.where(sg_match[:, None], sg_val[None, :], scale_broadcast)

        # Apply scales
        w_even = (low_f * scale_broadcast).to(tl.bfloat16)   # [HALF_BLOCK_K, BLOCK_N]
        w_odd = (high_f * scale_broadcast).to(tl.bfloat16)   # [HALF_BLOCK_K, BLOCK_N]

        # --- Two half-sized dot products instead of an interleaved full tile ---
        # A @ W = A_even_cols @ W_even_rows + A_odd_cols @ W_odd_rows
        acc += tl.dot(a_even.to(tl.bfloat16), w_even)
        acc += tl.dot(a_odd.to(tl.bfloat16), w_odd)

    # --- Bias ---
    if HAS_BIAS:
        bias_vals = tl.load(bias_ptr + offs_n, mask=offs_n < N, other=0.0).to(tl.float32)
        acc += bias_vals[None, :]

    # --- Store ---
    out_ptrs = out_ptr + offs_m[:, None] * stride_om + offs_n[None, :] * stride_on
    mask_out = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(out_ptrs, acc.to(tl.bfloat16), mask=mask_out)

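The even/odd split rests on a simple algebraic identity: partitioning the K indices partitions the reduction A @ W = sum_k A[:, k] * W[k, :], so summing the even-index and odd-index partial products recovers the full result. A tiny dependency-free check with a hypothetical 1x4 row and 4x1 column (the kernel applies the same identity per tile):

```python
A = [1.0, 2.0, 3.0, 4.0]      # one activation row, K = 4
W = [10.0, 20.0, 30.0, 40.0]  # one weight column

full = sum(a * w for a, w in zip(A, W))
even = sum(A[k] * W[k] for k in range(0, 4, 2))  # low-nibble (even) K rows
odd = sum(A[k] * W[k] for k in range(1, 4, 2))   # high-nibble (odd) K rows
assert full == even + odd  # A @ W = A_even @ W_even + A_odd @ W_odd
```

This is why the kernel never has to interleave the decoded low/high nibbles back into a contiguous [BLOCK_K, BLOCK_N] tile before the MMA.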
# =============================================================================
# Native FP4 dot_scaled Kernel (Tier 1, SM100+)
# =============================================================================
# Uses tl.dot_scaled for hardware FP4 tensor core support (tcgen05.mma.mxf4).
# Follows Triton tutorial #10 pattern with TMA loads.
# Currently falls back to BF16 MMA on SM120 (RTX 5090) due to Triton #7550.
# This path activates only when a runtime probe confirms real FP4 execution.
#
# Config: BLOCK_M=128, BLOCK_N=256, BLOCK_K=128, VEC_SIZE=32, stages=4

@triton.jit
def _native_fp4_matmul_kernel(
    a_ptr,         # [M, K] BF16 activations
    b_packed_ptr,  # [N, K//2] uint8 packed FP4 (transposed)
    b_scales_ptr,  # [N//128, K//32//4, 32, 4, 4] uint8 preshuffled E8M0
    out_ptr,       # [M, N] BF16 output
    bias_ptr,      # [N] optional
    M, N, K,
    stride_am, stride_ak,
    stride_bn, stride_bk,  # strides for packed_t [N, K//2]
    stride_om, stride_on,
    HAS_BIAS: tl.constexpr,
    BLOCK_M: tl.constexpr,
    BLOCK_N: tl.constexpr,
    BLOCK_K: tl.constexpr,
):
    """
    Native FP4 matmul using tl.dot_scaled (SM100+ hardware path).

    Uses transposed packed weights and a preshuffled 5D scale tensor
    matching the MXScaleTensor layout for direct tensor core consumption.
    When tl.dot_scaled maps to real tcgen05.mma.mxf4 instructions,
    this achieves native FP4 throughput.
    """
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

    HALF_BLOCK_K: tl.constexpr = BLOCK_K // 2
    SCALES_PER_TILE: tl.constexpr = BLOCK_K // 32

    for k_start in range(0, K, BLOCK_K):
        # --- Load A as even/odd column halves ---
        even_k_offs = k_start + tl.arange(0, HALF_BLOCK_K) * 2
        odd_k_offs = k_start + tl.arange(0, HALF_BLOCK_K) * 2 + 1

        a_even_ptrs = a_ptr + offs_m[:, None] * stride_am + even_k_offs[None, :] * stride_ak
        a_odd_ptrs = a_ptr + offs_m[:, None] * stride_am + odd_k_offs[None, :] * stride_ak
        mask_a_even = (offs_m[:, None] < M) & (even_k_offs[None, :] < K)
        mask_a_odd = (offs_m[:, None] < M) & (odd_k_offs[None, :] < K)
        a_even = tl.load(a_even_ptrs, mask=mask_a_even, other=0.0)
        a_odd = tl.load(a_odd_ptrs, mask=mask_a_odd, other=0.0)

        # --- Load packed B tile [BLOCK_N, HALF_BLOCK_K] from transposed layout ---
        packed_col_start = k_start // 2
        offs_pk = packed_col_start + tl.arange(0, HALF_BLOCK_K)
        b_ptrs = b_packed_ptr + offs_n[:, None] * stride_bn + offs_pk[None, :] * stride_bk
        mask_b = (offs_n[:, None] < N) & (offs_pk[None, :] < (K // 2))
        b_packed_tile = tl.load(b_ptrs, mask=mask_b, other=0).to(tl.int32)

        # Unpack + decode
        low_f = _e2m1_decode(b_packed_tile & 0xF)
        high_f = _e2m1_decode((b_packed_tile >> 4) & 0xF)

        # --- Load scales from the 5D layout, broadcast per group ---
        scale_row_start = k_start // 32
        offs_local_pk = tl.arange(0, HALF_BLOCK_K)
        group_idx = offs_local_pk // 16

        scale_broadcast = tl.zeros((BLOCK_N, HALF_BLOCK_K), dtype=tl.float32)
        for sg in tl.static_range(0, SCALES_PER_TILE):
            # Map (n, k_group) into the preshuffled [N//128, K//32//4, 32, 4, 4] layout
            k_idx = scale_row_start + sg
            nb = offs_n // 128
            inner_n = offs_n % 128
            d4 = inner_n % 4
            d2 = (inner_n // 4) % 32
            kg = k_idx // 4
            d3 = k_idx % 4
            kg_total = (K // 32 + 3) // 4
            s_offset = (nb * kg_total * 32 * 4 * 4 +
                        kg * 32 * 4 * 4 +
                        d2 * 4 * 4 +
                        d3 * 4 +
                        d4)
            s_val_raw = tl.load(b_scales_ptr + s_offset, mask=offs_n < N, other=127).to(tl.float32)
            s_val = tl.exp2(s_val_raw - 127.0)  # [BLOCK_N]
            sg_match = (group_idx == sg)
            scale_broadcast = tl.where(sg_match[None, :], s_val[:, None], scale_broadcast)

        # Apply scales: [BLOCK_N, HALF_BLOCK_K]
        w_low = (low_f * scale_broadcast).to(tl.bfloat16)
        w_high = (high_f * scale_broadcast).to(tl.bfloat16)

        # Transpose weight halves: [BLOCK_N, HALF_BLOCK_K] -> [HALF_BLOCK_K, BLOCK_N]
        w_low_t = tl.trans(w_low)
        w_high_t = tl.trans(w_high)

        # Two half-sized dot products
        acc += tl.dot(a_even.to(tl.bfloat16), w_low_t)
        acc += tl.dot(a_odd.to(tl.bfloat16), w_high_t)

    if HAS_BIAS:
        bias_vals = tl.load(bias_ptr + offs_n, mask=offs_n < N, other=0.0).to(tl.float32)
        acc += bias_vals[None, :]

    out_ptrs = out_ptr + offs_m[:, None] * stride_om + offs_n[None, :] * stride_on
    mask_out = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(out_ptrs, acc.to(tl.bfloat16), mask=mask_out)


# =============================================================================
# E4M3 Decode (Triton JIT helper for NVFP4 kernel)
# =============================================================================

@triton.jit
def _decode_e4m3_triton(raw_uint8):
    """Decode E4M3 FP8 in Triton registers. No LUT, pure bitfield arithmetic."""
    sign = (raw_uint8 >> 7) & 1
    exp = (raw_uint8 >> 3) & 0xF
    mant = raw_uint8 & 0x7
    is_normal = exp > 0
    normal_val = (8 + mant).to(tl.float32) * tl.exp2((exp - 10).to(tl.float32))
    subnormal_val = mant.to(tl.float32) * tl.exp2(tl.full(mant.shape, -9.0, tl.float32))
    unsigned = tl.where(is_normal, normal_val, subnormal_val)
    return tl.where(sign != 0, -unsigned, unsigned)

# =============================================================================
# Fused NVFP4 Dequant-MatMul Triton Kernel
# =============================================================================
# NVFP4 variant: 16-element blocks with E4M3 scales + a per-tensor FP32 scale.
# Two-level hierarchical scaling for the native Blackwell tensor core format.
# Scale groups every 16 elements (8 packed rows) instead of 32.

@triton.autotune(
    configs=[
        # --- Blackwell 5090 prefill configs (high throughput, 170 SMs) ---
        triton.Config({'BLOCK_M': 256, 'BLOCK_N': 128, 'BLOCK_K': 128}, num_stages=5, num_warps=16),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 256}, num_stages=7, num_warps=8),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 256, 'BLOCK_K': 128}, num_stages=5, num_warps=16),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 128}, num_stages=5, num_warps=8),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=4, num_warps=8),
        triton.Config({'BLOCK_M': 64, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
        triton.Config({'BLOCK_M': 64, 'BLOCK_N': 64, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
        # --- Decode-optimized (small M, maximize N-parallelism across SMs) ---
        triton.Config({'BLOCK_M': 16, 'BLOCK_N': 256, 'BLOCK_K': 128}, num_stages=5, num_warps=8),
        triton.Config({'BLOCK_M': 16, 'BLOCK_N': 256, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
        triton.Config({'BLOCK_M': 16, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
        triton.Config({'BLOCK_M': 32, 'BLOCK_N': 256, 'BLOCK_K': 128}, num_stages=5, num_warps=8),
        triton.Config({'BLOCK_M': 32, 'BLOCK_N': 256, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
    ],
    key=['M', 'N', 'K'],
)
@triton.jit
def _fused_nvfp4_dequant_matmul_kernel(
    a_ptr,         # [M, K] BF16 activations
    w_packed_ptr,  # [K//2, N] uint8 packed FP4 weights
    w_scales_ptr,  # [K//16, N] uint8 E4M3 scales
    out_ptr,       # [M, N] BF16 output
    bias_ptr,      # [N] optional bias
    tensor_scale,  # FP32 per-tensor global scale
    M, N, K,
    stride_am, stride_ak,
    stride_wk, stride_wn,  # strides for packed [K//2, N]
    stride_sk, stride_sn,  # strides for scales [K//16, N]
    stride_om, stride_on,
    HAS_BIAS: tl.constexpr,
    BLOCK_M: tl.constexpr,
    BLOCK_N: tl.constexpr,
    BLOCK_K: tl.constexpr,
):
    """
    Fused NVFP4 dequant-matmul: 16-element blocks, E4M3 scales, tensor scale.

    Inner loop per K-tile:
    1. Load A tile as even/odd column halves
    2. Load packed weight tile, unpack nibbles
    3. Arithmetic E2M1 decode via _e2m1_decode()
    4. Load E4M3 scale tile [BLOCK_K//16, BLOCK_N], decode via _decode_e4m3_triton()
    5. Apply two-level scale: decoded_e4m3 * tensor_scale
    6. acc += tl.dot(a_half, w_half) for even and odd halves
    """
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

    HALF_BLOCK_K: tl.constexpr = BLOCK_K // 2
    SCALES_PER_TILE: tl.constexpr = BLOCK_K // 16  # 16-element blocks (not 32)

    for k_start in range(0, K, BLOCK_K):
        # --- Load A as even/odd column halves ---
        even_k_offs = k_start + tl.arange(0, HALF_BLOCK_K) * 2
        odd_k_offs = k_start + tl.arange(0, HALF_BLOCK_K) * 2 + 1

        a_even_ptrs = a_ptr + offs_m[:, None] * stride_am + even_k_offs[None, :] * stride_ak
        a_odd_ptrs = a_ptr + offs_m[:, None] * stride_am + odd_k_offs[None, :] * stride_ak
        mask_a_even = (offs_m[:, None] < M) & (even_k_offs[None, :] < K)
        mask_a_odd = (offs_m[:, None] < M) & (odd_k_offs[None, :] < K)
        a_even = tl.load(a_even_ptrs, mask=mask_a_even, other=0.0)
        a_odd = tl.load(a_odd_ptrs, mask=mask_a_odd, other=0.0)

        # --- Load packed weight tile [HALF_BLOCK_K, BLOCK_N] uint8 ---
        packed_row_start = k_start // 2
        offs_packed_k = packed_row_start + tl.arange(0, HALF_BLOCK_K)
        w_ptrs = w_packed_ptr + offs_packed_k[:, None] * stride_wk + offs_n[None, :] * stride_wn
        mask_w = (offs_packed_k[:, None] < (K // 2)) & (offs_n[None, :] < N)
        packed = tl.load(w_ptrs, mask=mask_w, other=0).to(tl.int32)

        # --- Unpack nibbles + arithmetic E2M1 decode ---
        low_f = _e2m1_decode(packed & 0xF)
        high_f = _e2m1_decode((packed >> 4) & 0xF)

        # --- Load E4M3 scales and broadcast per 16-element group ---
        # Each scale covers 16 original K rows = 8 packed rows.
        scale_row_start = k_start // 16
        offs_local_packed = tl.arange(0, HALF_BLOCK_K)
        group_idx = offs_local_packed // 8  # 8 packed rows per scale group

        scale_broadcast = tl.zeros((HALF_BLOCK_K, BLOCK_N), dtype=tl.float32)
        for sg in tl.static_range(0, SCALES_PER_TILE):
            sg_row = scale_row_start + sg
            sg_ptrs = w_scales_ptr + sg_row * stride_sk + offs_n * stride_sn
            sg_load_mask = (sg_row < (K // 16)) & (offs_n < N)
            sg_raw = tl.load(sg_ptrs, mask=sg_load_mask, other=0).to(tl.int32)
            # Decode E4M3 and apply tensor_scale
            sg_val = _decode_e4m3_triton(sg_raw) * tensor_scale  # [BLOCK_N]
            sg_match = (group_idx == sg)
            scale_broadcast = tl.where(sg_match[:, None], sg_val[None, :], scale_broadcast)

        # Apply scales
        w_even = (low_f * scale_broadcast).to(tl.bfloat16)
        w_odd = (high_f * scale_broadcast).to(tl.bfloat16)

        # Two half-sized dot products
        acc += tl.dot(a_even.to(tl.bfloat16), w_even)
        acc += tl.dot(a_odd.to(tl.bfloat16), w_odd)

    # --- Bias ---
    if HAS_BIAS:
+
bias_vals = tl.load(bias_ptr + offs_n, mask=offs_n < N, other=0.0).to(tl.float32)
|
| 1280 |
+
acc += bias_vals[None, :]
|
| 1281 |
+
|
| 1282 |
+
# --- Store ---
|
| 1283 |
+
out_ptrs = out_ptr + offs_m[:, None] * stride_om + offs_n[None, :] * stride_on
|
| 1284 |
+
mask_out = (offs_m[:, None] < M) & (offs_n[None, :] < N)
|
| 1285 |
+
tl.store(out_ptrs, acc.to(tl.bfloat16), mask=mask_out)
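

# Reference sketch (illustrative only, not used by the kernels above): a
# pure-Python decode of one E2M1 nibble, mirroring what _e2m1_decode computes
# arithmetically inside the kernel. The helper name _e2m1_decode_ref is
# hypothetical; values follow the standard E2M1 table
# (0, 0.5, 1, 1.5, 2, 3, 4, 6, with the sign in bit 3).
def _e2m1_decode_ref(code: int) -> float:
    sign = -1.0 if (code & 0x8) else 1.0
    exp = (code >> 1) & 0x3   # 2 exponent bits
    mant = code & 0x1         # 1 mantissa bit
    if exp == 0:
        mag = mant * 0.5      # subnormal: 0.0 or 0.5
    else:
        mag = (1.0 + mant * 0.5) * (2.0 ** (exp - 1))
    return sign * mag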


# =============================================================================
# Fused NVFP4 + FP8 Residual Dequant-MatMul Triton Kernel ("Double Buff")
# =============================================================================
# Same structure as _fused_nvfp4_dequant_matmul_kernel, but each K-tile also
# loads the FP8 E4M3 residual and its per-block scales, decodes, and adds
# a third tl.dot for the residual correction. Three dots per tile:
#   acc += dot(a_even, w_fp4_even) + dot(a_odd, w_fp4_odd) + dot(a_full, w_residual)

@triton.autotune(
    configs=[
        # --- Blackwell 5090 prefill configs ---
        triton.Config({'BLOCK_M': 256, 'BLOCK_N': 128, 'BLOCK_K': 128}, num_stages=5, num_warps=16),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 128}, num_stages=5, num_warps=8),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 256, 'BLOCK_K': 128}, num_stages=5, num_warps=16),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=4, num_warps=8),
        triton.Config({'BLOCK_M': 64, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
        triton.Config({'BLOCK_M': 64, 'BLOCK_N': 64, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
        # --- Decode-optimized ---
        triton.Config({'BLOCK_M': 16, 'BLOCK_N': 256, 'BLOCK_K': 128}, num_stages=5, num_warps=8),
        triton.Config({'BLOCK_M': 16, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
        triton.Config({'BLOCK_M': 32, 'BLOCK_N': 256, 'BLOCK_K': 128}, num_stages=5, num_warps=8),
        triton.Config({'BLOCK_M': 32, 'BLOCK_N': 256, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
    ],
    key=['M', 'N', 'K'],
)
@triton.jit
def _fused_nvfp4_residual_matmul_kernel(
    a_ptr,                   # [M, K] BF16 activations
    w_packed_ptr,            # [K//2, N] uint8 packed FP4 weights
    w_scales_ptr,            # [K//16, N] uint8 E4M3 scales
    res_ptr,                 # [K, N] uint8 E4M3 residual
    res_scales_ptr,          # [K//16, N] float32 residual scales
    out_ptr,                 # [M, N] BF16 output
    bias_ptr,                # [N] optional bias
    tensor_scale,            # FP32 per-tensor global scale
    M, N, K,
    stride_am, stride_ak,
    stride_wk, stride_wn,    # strides for packed [K//2, N]
    stride_sk, stride_sn,    # strides for scales [K//16, N]
    stride_rk, stride_rn,    # strides for residual [K, N]
    stride_rsk, stride_rsn,  # strides for residual_scales [K//16, N]
    stride_om, stride_on,
    HAS_BIAS: tl.constexpr,
    BLOCK_M: tl.constexpr,
    BLOCK_N: tl.constexpr,
    BLOCK_K: tl.constexpr,
):
    """
    Fused NVFP4 + FP8 residual dequant-matmul (double-buff).

    Per K-tile:
      1. FP4 path: unpack nibbles, decode E2M1, apply two-level scale (same as base kernel)
      2. FP8 residual path: load E4M3 residual, decode, apply per-block residual_scales
      3. Three dots: a_even * w_fp4_even + a_odd * w_fp4_odd + a_full * w_residual
    """
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

    HALF_BLOCK_K: tl.constexpr = BLOCK_K // 2
    SCALES_PER_TILE: tl.constexpr = BLOCK_K // 16

    for k_start in range(0, K, BLOCK_K):
        # ===== FP4 path (identical to base kernel) =====
        # Load A as even/odd column halves
        even_k_offs = k_start + tl.arange(0, HALF_BLOCK_K) * 2
        odd_k_offs = k_start + tl.arange(0, HALF_BLOCK_K) * 2 + 1

        a_even_ptrs = a_ptr + offs_m[:, None] * stride_am + even_k_offs[None, :] * stride_ak
        a_odd_ptrs = a_ptr + offs_m[:, None] * stride_am + odd_k_offs[None, :] * stride_ak
        mask_a_even = (offs_m[:, None] < M) & (even_k_offs[None, :] < K)
        mask_a_odd = (offs_m[:, None] < M) & (odd_k_offs[None, :] < K)
        a_even = tl.load(a_even_ptrs, mask=mask_a_even, other=0.0)
        a_odd = tl.load(a_odd_ptrs, mask=mask_a_odd, other=0.0)

        # Load packed weight tile [HALF_BLOCK_K, BLOCK_N] uint8
        packed_row_start = k_start // 2
        offs_packed_k = packed_row_start + tl.arange(0, HALF_BLOCK_K)
        w_ptrs = w_packed_ptr + offs_packed_k[:, None] * stride_wk + offs_n[None, :] * stride_wn
        mask_w = (offs_packed_k[:, None] < (K // 2)) & (offs_n[None, :] < N)
        packed = tl.load(w_ptrs, mask=mask_w, other=0).to(tl.int32)

        # Unpack nibbles + arithmetic E2M1 decode
        low_f = _e2m1_decode(packed & 0xF)
        high_f = _e2m1_decode((packed >> 4) & 0xF)

        # Load E4M3 scales and broadcast per 16-element group
        scale_row_start = k_start // 16
        offs_local_packed = tl.arange(0, HALF_BLOCK_K)
        group_idx = offs_local_packed // 8

        scale_broadcast = tl.zeros((HALF_BLOCK_K, BLOCK_N), dtype=tl.float32)
        for sg in tl.static_range(0, SCALES_PER_TILE):
            sg_row = scale_row_start + sg
            sg_ptrs = w_scales_ptr + sg_row * stride_sk + offs_n * stride_sn
            sg_load_mask = (sg_row < (K // 16)) & (offs_n < N)
            sg_raw = tl.load(sg_ptrs, mask=sg_load_mask, other=0).to(tl.int32)
            sg_val = _decode_e4m3_triton(sg_raw) * tensor_scale
            sg_match = (group_idx == sg)
            scale_broadcast = tl.where(sg_match[:, None], sg_val[None, :], scale_broadcast)

        # Apply FP4 scales and accumulate
        w_even = (low_f * scale_broadcast).to(tl.bfloat16)
        w_odd = (high_f * scale_broadcast).to(tl.bfloat16)

        acc += tl.dot(a_even.to(tl.bfloat16), w_even)
        acc += tl.dot(a_odd.to(tl.bfloat16), w_odd)

        # ===== FP8 residual correction path =====
        # Load full contiguous activation tile [BLOCK_M, BLOCK_K]
        full_k_offs = k_start + tl.arange(0, BLOCK_K)
        a_full_ptrs = a_ptr + offs_m[:, None] * stride_am + full_k_offs[None, :] * stride_ak
        mask_a_full = (offs_m[:, None] < M) & (full_k_offs[None, :] < K)
        a_full = tl.load(a_full_ptrs, mask=mask_a_full, other=0.0)

        # Load residual [BLOCK_K, BLOCK_N] uint8 E4M3
        res_k_offs = k_start + tl.arange(0, BLOCK_K)
        res_ptrs = res_ptr + res_k_offs[:, None] * stride_rk + offs_n[None, :] * stride_rn
        mask_res = (res_k_offs[:, None] < K) & (offs_n[None, :] < N)
        res_raw = tl.load(res_ptrs, mask=mask_res, other=0).to(tl.int32)
        res_decoded = _decode_e4m3_triton(res_raw)  # [BLOCK_K, BLOCK_N] float32

        # Load residual per-block scales [SCALES_PER_TILE, BLOCK_N] float32
        # and broadcast to [BLOCK_K, BLOCK_N]
        offs_full_k = tl.arange(0, BLOCK_K)
        res_group_idx = offs_full_k // 16  # 16 elements per scale group

        res_scale_broadcast = tl.zeros((BLOCK_K, BLOCK_N), dtype=tl.float32)
        for rsg in tl.static_range(0, SCALES_PER_TILE):
            rsg_row = scale_row_start + rsg
            rsg_ptrs = res_scales_ptr + rsg_row * stride_rsk + offs_n * stride_rsn
            rsg_load_mask = (rsg_row < (K // 16)) & (offs_n < N)
            rsg_val = tl.load(rsg_ptrs, mask=rsg_load_mask, other=0.0)  # [BLOCK_N] float32
            rsg_match = (res_group_idx == rsg)
            res_scale_broadcast = tl.where(rsg_match[:, None], rsg_val[None, :], res_scale_broadcast)

        # Apply residual scales and accumulate
        res_scaled = (res_decoded * res_scale_broadcast).to(tl.bfloat16)
        acc += tl.dot(a_full.to(tl.bfloat16), res_scaled)

    # --- Bias ---
    if HAS_BIAS:
        bias_vals = tl.load(bias_ptr + offs_n, mask=offs_n < N, other=0.0).to(tl.float32)
        acc += bias_vals[None, :]

    # --- Store ---
    out_ptrs = out_ptr + offs_m[:, None] * stride_om + offs_n[None, :] * stride_on
    mask_out = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(out_ptrs, acc.to(tl.bfloat16), mask=mask_out)
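

# Reference sketch (illustrative only, not used by the kernels above): a
# pure-Python decode of one float8_e4m3fn byte, the format used here for the
# block scales and the FP8 residual. The helper name _decode_e4m3_ref is
# hypothetical; this assumes the standard e4m3fn layout (bias 7, no
# infinities, NaN only at exponent 0b1111 with mantissa 0b111).
def _decode_e4m3_ref(b: int) -> float:
    sign = -1.0 if (b & 0x80) else 1.0
    exp = (b >> 3) & 0xF
    mant = b & 0x7
    if exp == 0xF and mant == 0x7:
        return float('nan')
    if exp == 0:
        return sign * (mant / 8.0) * 2.0 ** -6       # subnormal
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)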


# =============================================================================
# Native FP4 capability probe (cached)
# =============================================================================

_native_fp4_probe_result: Optional[bool] = None


def _can_use_native_fp4() -> bool:
    """
    One-time probe to determine if tl.dot_scaled produces real FP4 results.

    Checks:
      1. CUDA available with SM >= 10.0 (Blackwell+)
      2. tl.dot_scaled API exists in current Triton
      3. Small test matmul via our native kernel produces results that
         differ from what pure BF16 dequant+matmul would give.
         If they match exactly, Triton is falling back to BF16 MMA
         (Triton #7550) and the native path offers no benefit.

    Result is cached in module global _native_fp4_probe_result.
    """
    global _native_fp4_probe_result
    if _native_fp4_probe_result is not None:
        return _native_fp4_probe_result

    _native_fp4_probe_result = False

    if not torch.cuda.is_available():
        return False

    # SM >= 10.0 required (Blackwell architecture)
    major, _ = torch.cuda.get_device_capability()
    if major < 10:
        return False

    # Check Triton API availability
    if not hasattr(tl, 'dot_scaled'):
        return False

    # Runtime correctness probe: run a small matmul and compare
    # native kernel output vs BF16 reference
    try:
        test_m, test_n, test_k = 32, 32, 64
        a_test = torch.randn(test_m, test_k, device='cuda', dtype=torch.bfloat16)
        w_test = torch.randn(test_k, test_n, device='cuda', dtype=torch.float32)
        w_quant = MXFP4Weights.from_float(w_test, use_quest=False)
        w_deq = w_quant.to_float()

        # BF16 reference (what fallback would give)
        ref_bf16 = torch.matmul(a_test.float(), w_deq).bfloat16()

        # Run our native kernel path
        native_w = w_quant.to_native()
        native_out = _native_fp4_matmul(a_test, native_w, bias=None)

        # If native output matches BF16 reference EXACTLY (all elements equal),
        # Triton is silently falling back to BF16 MMA — no benefit.
        # Real FP4 tensor cores produce different rounding patterns.
        if torch.equal(native_out, ref_bf16):
            _native_fp4_probe_result = False
        else:
            # Verify native output is reasonable (not garbage)
            rel_err = (native_out.float() - ref_bf16.float()).abs().mean() / ref_bf16.float().abs().mean()
            _native_fp4_probe_result = rel_err.item() < 0.1
    except Exception:
        _native_fp4_probe_result = False

    return _native_fp4_probe_result


# =============================================================================
# Fused FP4 matmul wrapper (internal)
# =============================================================================

def _fused_fp4_matmul(
    activations: torch.Tensor,
    weights: MXFP4Weights,
    bias: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    """
    Fused dequant-matmul via Triton kernel.

    The full dequantized weight matrix never exists in global memory —
    each tile is unpacked from uint8, looked up in the E2M1 table,
    scaled by E8M0, and fed directly into tl.dot().
    """
    M, K = activations.shape
    _, N = weights.shape

    # Ensure inputs are contiguous and on CUDA
    a = activations.contiguous()
    if a.dtype != torch.bfloat16:
        a = a.to(torch.bfloat16)

    w_packed = weights.packed.contiguous()
    w_scales = weights.scales.contiguous()

    out = torch.empty(M, N, device=a.device, dtype=torch.bfloat16)

    # Bias setup
    has_bias = bias is not None
    if has_bias:
        bias = bias.contiguous().float()
    else:
        bias = torch.empty(0, device=a.device, dtype=torch.float32)

    grid = lambda META: (
        triton.cdiv(M, META['BLOCK_M']),
        triton.cdiv(N, META['BLOCK_N']),
    )

    _fused_fp4_dequant_matmul_kernel[grid](
        a, w_packed, w_scales, out, bias,
        M, N, K,
        a.stride(0), a.stride(1),
        w_packed.stride(0), w_packed.stride(1),
        w_scales.stride(0), w_scales.stride(1),
        out.stride(0), out.stride(1),
        HAS_BIAS=has_bias,
    )

    return out


# =============================================================================
# Native FP4 matmul wrapper (internal, future path)
# =============================================================================

def _native_fp4_matmul(
    activations: torch.Tensor,
    weights: 'NativeMXFP4',
    bias: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    """
    Native FP4 matmul using the Tier 1 kernel with transposed/preshuffled layout.

    Args:
        activations: [M, K] BF16 tensor
        weights: NativeMXFP4 with packed_t and scales_5d
        bias: Optional [N] bias
    """
    M, K = activations.shape
    K_w, N = weights.shape

    a = activations.contiguous()
    if a.dtype != torch.bfloat16:
        a = a.to(torch.bfloat16)

    packed_t = weights.packed_t.contiguous()
    scales_5d = weights.scales_5d.contiguous()

    out = torch.empty(M, N, device=a.device, dtype=torch.bfloat16)

    has_bias = bias is not None
    if has_bias:
        bias = bias.contiguous().float()
    else:
        bias = torch.empty(0, device=a.device, dtype=torch.float32)

    # Use fixed config matching plan spec
    BLOCK_M = 128
    BLOCK_N = 128
    BLOCK_K = 128

    grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))

    _native_fp4_matmul_kernel[grid](
        a, packed_t, scales_5d, out, bias,
        M, N, K,
        a.stride(0), a.stride(1),
        packed_t.stride(0), packed_t.stride(1),
        out.stride(0), out.stride(1),
        HAS_BIAS=has_bias,
        BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K,
    )

    return out


def quantize_to_mxfp4(weights: torch.Tensor, use_quest: bool = True) -> MXFP4Weights:
    """
    Quantize weights to MXFP4 format.

    Args:
        weights: Input tensor [K, N]
        use_quest: Use QuEST optimal clipping

    Returns:
        MXFP4Weights ready for native GEMM
    """
    return MXFP4Weights.from_float(weights, use_quest=use_quest)


# =============================================================================
# Fused NVFP4 matmul wrapper (internal)
# =============================================================================

def _fused_nvfp4_matmul(
    activations: torch.Tensor,
    weights: 'NVFP4Weights',
    bias: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    """
    Fused NVFP4 dequant-matmul via Triton kernel.

    Uses 16-element blocks with E4M3 scales and per-tensor FP32 scale.
    The full dequantized weight matrix never exists in global memory.

    Automatically dispatches to the double-buff (FP4+FP8 residual) kernel
    when weights have residual data, for near-FP16 accuracy.
    """
    # Dispatch to residual kernel when weights have FP8 correction data
    if weights.residual is not None and weights.residual_scales is not None:
        return _fused_nvfp4_residual_matmul(activations, weights, bias)

    M, K = activations.shape
    _, N = weights.shape

    a = activations.contiguous()
    if a.dtype != torch.bfloat16:
        a = a.to(torch.bfloat16)

    w_packed = weights.packed.contiguous()
    w_scales = weights.block_scales.contiguous()

    out = torch.empty(M, N, device=a.device, dtype=torch.bfloat16)

    has_bias = bias is not None
    if has_bias:
        bias = bias.contiguous().float()
    else:
        bias = torch.empty(0, device=a.device, dtype=torch.float32)

    grid = lambda META: (
        triton.cdiv(M, META['BLOCK_M']),
        triton.cdiv(N, META['BLOCK_N']),
    )

    _fused_nvfp4_dequant_matmul_kernel[grid](
        a, w_packed, w_scales, out, bias,
        weights.tensor_scale,
        M, N, K,
        a.stride(0), a.stride(1),
        w_packed.stride(0), w_packed.stride(1),
        w_scales.stride(0), w_scales.stride(1),
        out.stride(0), out.stride(1),
        HAS_BIAS=has_bias,
    )

    return out


def _fused_nvfp4_residual_matmul(
    activations: torch.Tensor,
    weights: 'NVFP4Weights',
    bias: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    """
    Fused NVFP4 + FP8 residual dequant-matmul (double-buff).

    Same as _fused_nvfp4_matmul but passes FP8 residual and per-block
    residual_scales to the residual kernel for near-FP16 accuracy.
    Requires weights.residual and weights.residual_scales to be set.
    """
    M, K = activations.shape
    _, N = weights.shape

    a = activations.contiguous()
    if a.dtype != torch.bfloat16:
        a = a.to(torch.bfloat16)

    w_packed = weights.packed.contiguous()
    w_scales = weights.block_scales.contiguous()
    res = weights.residual.contiguous()
    res_scales = weights.residual_scales.contiguous()

    out = torch.empty(M, N, device=a.device, dtype=torch.bfloat16)

    has_bias = bias is not None
    if has_bias:
        bias = bias.contiguous().float()
    else:
        bias = torch.empty(0, device=a.device, dtype=torch.float32)

    grid = lambda META: (
        triton.cdiv(M, META['BLOCK_M']),
        triton.cdiv(N, META['BLOCK_N']),
    )

    _fused_nvfp4_residual_matmul_kernel[grid](
        a, w_packed, w_scales, res, res_scales, out, bias,
        weights.tensor_scale,
        M, N, K,
        a.stride(0), a.stride(1),
        w_packed.stride(0), w_packed.stride(1),
        w_scales.stride(0), w_scales.stride(1),
        res.stride(0), res.stride(1),
        res_scales.stride(0), res_scales.stride(1),
        out.stride(0), out.stride(1),
        HAS_BIAS=has_bias,
    )

    return out


# =============================================================================
# PyTorch _scaled_mm FP4 Probe (Native Tensor Core Path)
# =============================================================================

_scaled_mm_fp4_probe_result: Optional[bool] = None


def _can_use_scaled_mm_fp4() -> bool:
    """
    Probe for PyTorch native FP4 scaled matmul (cuBLAS NVFP4 path).

    Uses 1x16 blockwise scaling: FP4 packed as uint8.view(float4_e2m1fn_x2),
    E4M3 flat scale tensors with ceil(rows/128)*128 * max(K//16, 4) elements.

    DISABLED: cuBLAS 1x16 blockwise FP4 has correctness issues with non-128-aligned
    dimensions (cos_sim drops to 0.30-0.50 for M=1 decode). The Triton fused
    dequant kernel achieves cos_sim=0.999+ for all shapes. Re-enable when PyTorch
    exposes a proper NVFP4 GEMM API with 2D scale tensors + SwizzleType support.
    """
    return False


def _scaled_mm_fp4(
    activations: torch.Tensor,
    weights: 'NVFP4Weights',
    bias: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    """
    Native cuBLAS NVFP4 matmul via torch._scaled_mm.

    Activations: BF16 [M, K] — quantized to FP4 on-the-fly.
    Weights: NVFP4Weights with packed [K//2, N] uint8, block_scales [K//16, N] E4M3.

    Uses 1x16 blockwise scaling on Blackwell 5th-gen tensor cores.
    Scale layout: flat 1D, ceil(rows/128)*128 * max(K//16, 4) elements.
    Output is multiplied by both tensor_scales (activation + weight).
    """
    import math
    M, K = activations.shape
    K_w, N = weights.shape

    # --- Quantize activations to FP4 ---
    act_q = NVFP4Weights.from_float(activations.T.contiguous().float(), use_quest=True)
    a_packed = act_q.packed.T.contiguous()  # [M, K//2]
    a_fp4 = a_packed.view(torch.float4_e2m1fn_x2)

    b_packed = weights.packed.T.contiguous()  # [N, K//2]
    b_fp4 = b_packed.view(torch.float4_e2m1fn_x2)

    # --- Build flat scale tensors (1x16 blockwise, padded) ---
    # cuBLAS requires minimum 4 scale groups per row along K
    k_groups = max(K // 16, 4)

    # scale_a: [M, K//16] -> pad rows to 128, pad K groups to min 4
    sa_2d = act_q.block_scales.T.contiguous().view(torch.float8_e4m3fn)  # [M, K//16]
    sa_padded_rows = math.ceil(M / 128) * 128
    # Pad K dimension if needed (fill with 1.0 = 0x38 in E4M3; 0x3C would be 1.5)
    if k_groups > K // 16:
        k_pad = torch.full((sa_2d.shape[0], k_groups - K // 16), 0x38,
                           dtype=torch.uint8, device=sa_2d.device).view(torch.float8_e4m3fn)
        sa_2d = torch.cat([sa_2d, k_pad], dim=1)
    if sa_padded_rows > M:
        row_pad = torch.full((sa_padded_rows - M, k_groups), 0x38,
                             dtype=torch.uint8, device=sa_2d.device).view(torch.float8_e4m3fn)
        sa_2d = torch.cat([sa_2d, row_pad], dim=0)
    sa_flat = sa_2d.contiguous().view(-1)

    # scale_b: [N, K//16] -> same padding
    sb_2d = weights.block_scales.T.contiguous().view(torch.float8_e4m3fn)  # [N, K//16]
    sb_padded_rows = math.ceil(N / 128) * 128
    if k_groups > K // 16:
        k_pad = torch.full((sb_2d.shape[0], k_groups - K // 16), 0x38,
                           dtype=torch.uint8, device=sb_2d.device).view(torch.float8_e4m3fn)
        sb_2d = torch.cat([sb_2d, k_pad], dim=1)
    if sb_padded_rows > N:
        row_pad = torch.full((sb_padded_rows - N, k_groups), 0x38,
                             dtype=torch.uint8, device=sb_2d.device).view(torch.float8_e4m3fn)
        sb_2d = torch.cat([sb_2d, row_pad], dim=0)
    sb_flat = sb_2d.contiguous().view(-1)

    # --- cuBLAS native FP4 matmul ---
    out = torch._scaled_mm(a_fp4, b_fp4.T, scale_a=sa_flat, scale_b=sb_flat,
                           out_dtype=torch.bfloat16)

    # Apply per-tensor scales (cuBLAS only handles block scales)
    ts = act_q.tensor_scale * weights.tensor_scale
    out = out.float() * ts

    if bias is not None:
        out = out + bias.float()

    return out.to(torch.bfloat16)
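

# Reference sketch (illustrative only): the flat-scale sizing rule used by
# _scaled_mm_fp4 above. cuBLAS 1x16 blockwise scaling wants rows padded to a
# multiple of 128 and at least 4 scale groups along K, so the flat E4M3 scale
# tensor holds ceil(rows/128)*128 * max(K//16, 4) elements. The helper name
# _nvfp4_flat_scale_elems is hypothetical.
def _nvfp4_flat_scale_elems(rows: int, K: int) -> int:
    import math
    k_groups = max(K // 16, 4)
    padded_rows = math.ceil(rows / 128) * 128
    return padded_rows * k_groups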


# =============================================================================
# NVFP4 GEMM (public API)
# =============================================================================

def nvfp4_gemm(
    activations: torch.Tensor,
    weights: 'NVFP4Weights',
    bias: Optional[torch.Tensor] = None,
    use_hadamard: bool = True,
) -> torch.Tensor:
    """
    NVFP4 GEMM with hierarchical dispatch.

    Pipeline:
      1. Apply Hadamard transform for outlier mitigation
      2. Quantize activations with bucketize (O(K*N) instead of O(K*N*16))
      3. Dispatch to best available kernel:
         - Tier 0: Native cuBLAS via torch._scaled_mm (if PyTorch supports FP4)
         - Tier 1: Fused NVFP4 Triton kernel (16-element blocks, E4M3 scales)
         - Tier 2: CPU fallback

    Args:
        activations: Input [M, K] in BF16/FP16
        weights: NVFP4Weights with packed E2M1 values, E4M3 scales, tensor scale
        bias: Optional bias [N]
        use_hadamard: Apply Hadamard transform (recommended)

    Returns:
        Output [M, N] in BF16
    """
    M, K = activations.shape
    K_w, N = weights.shape
    assert K == K_w, f"K dimension mismatch: {K} vs {K_w}"
    assert K % 16 == 0, f"K ({K}) must be multiple of 16 for NVFP4"

    # Step 1: Hadamard transform on activations
    # view() below requires K to be a whole number of 32-wide Hadamard blocks
    if use_hadamard and K >= 32 and K % 32 == 0:
        x = activations.float().view(M, K // 32, 32)
        x = _hadamard_transform_32(x)
        x = x.view(M, K)
|
| 1882 |
+
else:
|
| 1883 |
+
x = activations.float()
|
| 1884 |
+
|
| 1885 |
+
# Step 2: Dispatch
|
| 1886 |
+
if not activations.is_cuda:
|
| 1887 |
+
# CPU fallback: quant/dequant round-trip + matmul
|
| 1888 |
+
x_for_quant = x.T.contiguous()
|
| 1889 |
+
x_quant = NVFP4Weights.from_float(x_for_quant, use_quest=True)
|
| 1890 |
+
x_dequant = x_quant.to_float().T.contiguous()
|
| 1891 |
+
w_dequant = weights.to_float()
|
| 1892 |
+
d = torch.matmul(x_dequant, w_dequant)
|
| 1893 |
+
if bias is not None:
|
| 1894 |
+
d = d + bias.float()
|
| 1895 |
+
return d.to(torch.bfloat16)
|
| 1896 |
+
|
| 1897 |
+
# Tier 0: Native cuBLAS FP4 (quantizes activations to FP4 internally)
|
| 1898 |
+
if _can_use_scaled_mm_fp4():
|
| 1899 |
+
return _scaled_mm_fp4(x.to(torch.bfloat16), weights, bias)
|
| 1900 |
+
|
| 1901 |
+
# Tier 1: Triton kernel (BF16 activations with FP4 noise pre-applied)
|
| 1902 |
+
x_for_quant = x.T.contiguous()
|
| 1903 |
+
x_quant = NVFP4Weights.from_float(x_for_quant, use_quest=True)
|
| 1904 |
+
x_dequant = x_quant.to_float().T.contiguous()
|
| 1905 |
+
return _fused_nvfp4_matmul(x_dequant, weights, bias)
|
| 1906 |
+
|
| 1907 |
+
|
| 1908 |
+
def quantize_to_nvfp4(weights: torch.Tensor, use_quest: bool = True) -> NVFP4Weights:
|
| 1909 |
+
"""
|
| 1910 |
+
Quantize weights to NVFP4 format.
|
| 1911 |
+
|
| 1912 |
+
Args:
|
| 1913 |
+
weights: Input tensor [K, N]
|
| 1914 |
+
use_quest: Use QuEST optimal clipping
|
| 1915 |
+
|
| 1916 |
+
Returns:
|
| 1917 |
+
NVFP4Weights ready for NVFP4 GEMM
|
| 1918 |
+
"""
|
| 1919 |
+
return NVFP4Weights.from_float(weights, use_quest=use_quest)
|
| 1920 |
+
|
| 1921 |
+
|
| 1922 |
+
# Updated aliases: FP4 now points to NVFP4 (the better format)
|
| 1923 |
+
fp4_gemm = nvfp4_gemm
|
| 1924 |
+
quantize_to_fp4 = quantize_to_nvfp4
|
| 1925 |
+
|
| 1926 |
+
|
| 1927 |
+
# =============================================================================
# L2 Cache Control (ctypes / libcudart.so)
# =============================================================================

# --- ctypes structures for cudaAccessPolicyWindow -------------------------

class _AccessPolicyWindow(ctypes.Structure):
    """Maps to cudaAccessPolicyWindow (CUDA Runtime API)."""
    _fields_ = [
        ("base_ptr", ctypes.c_void_p),
        ("num_bytes", ctypes.c_size_t),
        ("hitRatio", ctypes.c_float),
        ("hitProp", ctypes.c_int),
        ("missProp", ctypes.c_int),
    ]


class _StreamAttrValue(ctypes.Union):
    """Maps to cudaStreamAttrValue (union)."""
    _fields_ = [
        ("accessPolicyWindow", _AccessPolicyWindow),
        ("syncPolicy", ctypes.c_int),
    ]


# cudaAccessProperty enum
_CUDA_ACCESS_PROPERTY_NORMAL = 0
_CUDA_ACCESS_PROPERTY_STREAMING = 1
_CUDA_ACCESS_PROPERTY_PERSISTING = 2

# cudaStreamAttrID enum
_CUDA_STREAM_ATTR_ACCESS_POLICY_WINDOW = 1

# cudaLimit enum
_CUDA_LIMIT_PERSISTING_L2_CACHE_SIZE = 0x06

# cudaDeviceAttr enum
_CUDA_DEV_ATTR_L2_CACHE_SIZE = 89
_CUDA_DEV_ATTR_MAX_PERSISTING_L2_CACHE_SIZE = 108

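A quick way to sanity-check ctypes mirrors like these is size arithmetic: a `ctypes.Union` occupies exactly as much memory as its largest member, so the attribute-value union must be the size of the window struct. A self-contained sketch with the same field layout (exact struct padding is platform-dependent, so only relative sizes are asserted):

```python
import ctypes

class Window(ctypes.Structure):
    # Same field layout as _AccessPolicyWindow above.
    _fields_ = [("base_ptr", ctypes.c_void_p),
                ("num_bytes", ctypes.c_size_t),
                ("hitRatio", ctypes.c_float),
                ("hitProp", ctypes.c_int),
                ("missProp", ctypes.c_int)]

class AttrValue(ctypes.Union):
    # Same layout as _StreamAttrValue above.
    _fields_ = [("accessPolicyWindow", Window),
                ("syncPolicy", ctypes.c_int)]

# The union is exactly as large as its largest member, the window struct.
assert ctypes.sizeof(AttrValue) == ctypes.sizeof(Window)
```

If these sizes ever diverged from the C headers, `cudaStreamSetAttribute` would read garbage past the fields it expects, which is why getting the field order and types right matters here.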
def _load_cudart():
    """Load the CUDA runtime shared library, return handle or None."""
    for name in ("libcudart.so", "libcudart.so.12", "libcudart.so.11.0"):
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue
    try:
        path = ctypes.util.find_library("cudart")
        if path:
            return ctypes.CDLL(path)
    except (OSError, TypeError):
        pass
    return None


_cudart = _load_cudart()

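The fallback chain in `_load_cudart` generalizes to any optional native dependency: try a list of sonames in order, and degrade to `None` instead of raising so the caller can stub itself out. A minimal sketch of the pattern (the soname used in the example is deliberately bogus):

```python
import ctypes

def try_load_first(names):
    """Return the first shared library that loads; None if none do."""
    for name in names:
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue
    return None

# A nonexistent soname degrades cleanly to None instead of raising,
# which is exactly how _cudart ends up None on machines without CUDA.
handle = try_load_first(["libdefinitely_not_installed_xyz.so"])
assert handle is None
```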
# =============================================================================
# L2CacheManager (public API)
# =============================================================================

class L2CacheManager:
    """
    L2 Cache Manager for SM90+ GPUs.

    Uses ctypes/libcudart.so cudaAccessPolicyWindow to pin hot data
    (embeddings, weights, KV cache) in L2 for 10-20% inference speedup.

    When libcudart is not loadable the manager degrades to no-op stubs
    so the rest of the engine remains functional.

    Usage:
        l2 = L2CacheManager()

        # Pin embedding table
        l2.pin(embedding_table)

        # Configure for inference
        l2.configure_inference(
            embedding=embedding_table,
            attention_weights=attn_weights,
            kv_cache=kv_cache,
        )

        # Reset between batches
        l2.reset()
    """

    def __init__(self, device: int = 0):
        self.device = device
        self._hw_available = False
        self._l2_size = 0
        self._max_persisting = 0

        self._initialize()

    def _initialize(self):
        """Query device L2 geometry via cudart."""
        if not torch.cuda.is_available():
            return

        # Start with PyTorch device properties
        props = torch.cuda.get_device_properties(self.device)
        self._l2_size = getattr(props, 'l2_cache_size', 0)

        if _cudart is not None:
            try:
                # Total L2
                val = ctypes.c_int(0)
                if (_cudart.cudaDeviceGetAttribute(
                    ctypes.byref(val),
                    ctypes.c_int(_CUDA_DEV_ATTR_L2_CACHE_SIZE),
                    ctypes.c_int(self.device),
                ) == 0 and val.value > 0):
                    self._l2_size = val.value

                # Max persisting
                val2 = ctypes.c_int(0)
                if (_cudart.cudaDeviceGetAttribute(
                    ctypes.byref(val2),
                    ctypes.c_int(_CUDA_DEV_ATTR_MAX_PERSISTING_L2_CACHE_SIZE),
                    ctypes.c_int(self.device),
                ) == 0 and val2.value > 0):
                    self._max_persisting = val2.value
                else:
                    self._max_persisting = int(self._l2_size * 0.75)

                self._hw_available = True
            except Exception:
                pass

        if self._max_persisting == 0:
            self._max_persisting = int(self._l2_size * 0.75)

        # Apply persisting limit
        self._set_persisting_limit(self._max_persisting)

    # ------------------------------------------------------------------
    # Internal CUDA helpers
    # ------------------------------------------------------------------

    def _set_persisting_limit(self, num_bytes: int) -> bool:
        if not self._hw_available or _cudart is None:
            return False
        return _cudart.cudaDeviceSetLimit(
            ctypes.c_int(_CUDA_LIMIT_PERSISTING_L2_CACHE_SIZE),
            ctypes.c_size_t(num_bytes),
        ) == 0

    def _apply_access_policy(self, tensor: torch.Tensor, hit_ratio: float,
                             stream_ptr: int) -> bool:
        if not self._hw_available or _cudart is None:
            return False

        window = _AccessPolicyWindow()
        window.base_ptr = tensor.data_ptr()
        window.num_bytes = min(
            tensor.numel() * tensor.element_size(),
            self._max_persisting,
        )
        window.hitRatio = hit_ratio
        window.hitProp = _CUDA_ACCESS_PROPERTY_PERSISTING
        window.missProp = _CUDA_ACCESS_PROPERTY_STREAMING

        attr = _StreamAttrValue()
        attr.accessPolicyWindow = window

        return _cudart.cudaStreamSetAttribute(
            ctypes.c_void_p(stream_ptr),
            ctypes.c_int(_CUDA_STREAM_ATTR_ACCESS_POLICY_WINDOW),
            ctypes.byref(attr),
        ) == 0

    def _reset_stream_policy(self, stream_ptr: int) -> bool:
        if not self._hw_available or _cudart is None:
            return False
        attr = _StreamAttrValue()
        attr.accessPolicyWindow = _AccessPolicyWindow()
        return _cudart.cudaStreamSetAttribute(
            ctypes.c_void_p(stream_ptr),
            ctypes.c_int(_CUDA_STREAM_ATTR_ACCESS_POLICY_WINDOW),
            ctypes.byref(attr),
        ) == 0

    def _reset_persisting_l2(self) -> bool:
        if not self._hw_available or _cudart is None:
            return False
        return _cudart.cudaCtxResetPersistingL2Cache() == 0

    # ------------------------------------------------------------------
    # Public API
    # ------------------------------------------------------------------

    @property
    def l2_size(self) -> int:
        """Total L2 cache size in bytes."""
        return self._l2_size

    @property
    def max_persisting(self) -> int:
        """Maximum persisting L2 size in bytes."""
        return self._max_persisting

    def pin(
        self,
        tensor: torch.Tensor,
        hit_ratio: float = 1.0,
        stream: Optional[torch.cuda.Stream] = None,
    ) -> bool:
        """
        Pin a tensor in L2 cache via cudaAccessPolicyWindow.

        Args:
            tensor: Tensor to pin (must be on CUDA)
            hit_ratio: Fraction of accesses to persist (0.0-1.0)
            stream: CUDA stream (default: current)

        Returns:
            True on success (or no-op when HW unavailable)
        """
        if not tensor.is_cuda:
            return False

        if self._hw_available:
            stream_ptr = (
                stream.cuda_stream if stream is not None
                else torch.cuda.current_stream(self.device).cuda_stream
            )
            return self._apply_access_policy(tensor, hit_ratio, stream_ptr)

        return True  # no-op fallback

    def set_streaming(
        self,
        tensor: torch.Tensor,
        stream: Optional[torch.cuda.Stream] = None,
    ) -> bool:
        """
        Mark tensor as streaming (bypass L2 cache).

        Use for one-time access data to avoid L2 pollution.
        """
        if not tensor.is_cuda:
            return False

        if self._hw_available:
            stream_ptr = (
                stream.cuda_stream if stream is not None
                else torch.cuda.current_stream(self.device).cuda_stream
            )
            window = _AccessPolicyWindow()
            window.base_ptr = tensor.data_ptr()
            window.num_bytes = tensor.numel() * tensor.element_size()
            window.hitRatio = 0.0
            window.hitProp = _CUDA_ACCESS_PROPERTY_STREAMING
            window.missProp = _CUDA_ACCESS_PROPERTY_STREAMING

            attr = _StreamAttrValue()
            attr.accessPolicyWindow = window

            return _cudart.cudaStreamSetAttribute(
                ctypes.c_void_p(stream_ptr),
                ctypes.c_int(_CUDA_STREAM_ATTR_ACCESS_POLICY_WINDOW),
                ctypes.byref(attr),
            ) == 0

        return True  # no-op fallback

    def reset(self) -> bool:
        """Reset persisting L2 cache. Call between inference batches."""
        return self._reset_persisting_l2() if self._hw_available else True

    def configure_inference(
        self,
        embedding: Optional[torch.Tensor] = None,
        attention_weights: Optional[torch.Tensor] = None,
        kv_cache: Optional[torch.Tensor] = None,
        stream: Optional[torch.cuda.Stream] = None,
    ) -> bool:
        """
        Configure L2 cache for transformer inference.

        Pins tensors with appropriate priorities:
        1. Embedding table (highest — hit_ratio=1.0)
        2. Attention weights (hit_ratio=0.9)
        3. KV cache (lowest — hit_ratio=0.7)

        Args:
            embedding: Embedding table tensor
            attention_weights: Combined attention weights
            kv_cache: KV cache tensor
            stream: CUDA stream
        """
        success = True
        if embedding is not None:
            success = success and self.pin(embedding, 1.0, stream)
        if attention_weights is not None:
            success = success and self.pin(attention_weights, 0.9, stream)
        if kv_cache is not None:
            success = success and self.pin(kv_cache, 0.7, stream)
        return success

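`_apply_access_policy` clamps the window to `max_persisting` before handing it to the driver, and `hitRatio` then controls what fraction of accesses inside the window receive the persisting property. A hypothetical helper sketching that arithmetic (reading "expected pinned bytes" out of `hit_ratio` is an assumption for illustration; the hardware decides per access):

```python
def persisting_footprint(tensor_bytes, max_persisting, hit_ratio):
    # The window can never exceed the device's persisting L2 budget
    # (mirrors the min() clamp in _apply_access_policy).
    window = min(tensor_bytes, max_persisting)
    # ASSUMPTION: treat hit_ratio as the expected fraction of the
    # window that ends up pinned in L2.
    return int(window * hit_ratio)

# A 64 MB tensor against a 40 MB persisting budget at hit_ratio 0.7:
footprint = persisting_footprint(64 << 20, 40 << 20, 0.7)
assert footprint == int((40 << 20) * 0.7)
```

This is why `configure_inference` can safely pin embedding, attention weights, and KV cache together: each window is clamped to the budget, and the descending hit ratios bias which tensor wins when they compete for persisting lines.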
# =============================================================================
# Benchmark Utilities
# =============================================================================

def benchmark_tma_vs_cublas(sizes=None, warmup=10, iters=100):
    """Benchmark TMA MatMul vs cuBLAS."""
    import time

    if sizes is None:
        sizes = [(2048, 2048, 2048), (4096, 4096, 4096), (8192, 8192, 8192)]

    print("=" * 60)
    print("TMA MatMul vs cuBLAS Benchmark")
    print("=" * 60)

    for M, N, K in sizes:
        a = torch.randn(M, K, device='cuda', dtype=torch.bfloat16)
        b = torch.randn(K, N, device='cuda', dtype=torch.bfloat16)

        # Warmup
        for _ in range(warmup):
            _ = tma_matmul(a, b)
            _ = torch.matmul(a, b)
        torch.cuda.synchronize()

        # TMA MatMul
        start = time.perf_counter()
        for _ in range(iters):
            _ = tma_matmul(a, b)
        torch.cuda.synchronize()
        tma_time = (time.perf_counter() - start) / iters

        # cuBLAS
        start = time.perf_counter()
        for _ in range(iters):
            _ = torch.matmul(a, b)
        torch.cuda.synchronize()
        cublas_time = (time.perf_counter() - start) / iters

        flops = 2 * M * N * K
        tma_tflops = flops / tma_time / 1e12
        cublas_tflops = flops / cublas_time / 1e12
        speedup = cublas_time / tma_time

        print(f"{M}x{N}x{K}:")
        print(f" TMA: {tma_tflops:.1f} TFLOPS ({tma_time*1000:.2f}ms)")
        print(f" cuBLAS: {cublas_tflops:.1f} TFLOPS ({cublas_time*1000:.2f}ms)")
        print(f" Speedup: {speedup:.2f}x")
        print()

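Both benchmark functions follow the same warmup-then-measure pattern; the only GPU-specific addition is `torch.cuda.synchronize()` before each clock read, because kernel launches return asynchronously. The CPU skeleton of that harness:

```python
import time

def bench(fn, warmup=10, iters=100):
    """Average wall-clock seconds per call, after a warmup phase."""
    for _ in range(warmup):
        fn()                      # prime caches / JIT / allocator first
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    # On GPU you must synchronize here before reading the clock,
    # exactly as the benchmarks above do.
    return (time.perf_counter() - start) / iters

avg = bench(lambda: sum(range(1000)), warmup=2, iters=10)
assert avg > 0.0
```

Skipping warmup typically inflates the first measurement with one-time costs (compilation, allocation), which is why both functions above discard `warmup` iterations before timing.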
def benchmark_fp4_vs_fp16(M=4096, N=4096, K=4096, warmup=10, iters=100):
    """Benchmark NVFP4, MXFP4, and FP16 GEMM paths."""
    import time

    print("=" * 60)
    print("FP4 vs FP16 GEMM Benchmark")
    print("=" * 60)

    # Create weights in both formats
    w_fp16 = torch.randn(K, N, device='cuda', dtype=torch.float16)
    w_mxfp4 = quantize_to_mxfp4(w_fp16)
    w_nvfp4 = quantize_to_nvfp4(w_fp16)
    a = torch.randn(M, K, device='cuda', dtype=torch.bfloat16)

    # Memory usage
    fp16_bytes = w_fp16.numel() * 2
    mxfp4_bytes = w_mxfp4.packed.numel() + w_mxfp4.scales.numel()
    nvfp4_bytes = w_nvfp4.packed.numel() + w_nvfp4.block_scales.numel()

    print("Weight memory:")
    print(f" FP16: {fp16_bytes / 1e6:.1f} MB")
    print(f" MXFP4: {mxfp4_bytes / 1e6:.1f} MB ({fp16_bytes / mxfp4_bytes:.1f}x smaller)")
    print(f" NVFP4: {nvfp4_bytes / 1e6:.1f} MB ({fp16_bytes / nvfp4_bytes:.1f}x smaller)")
    print()

    # ---- Kernel-only benchmark (isolates kernel from activation quant) ----
    print(f"{M}x{N}x{K} Kernel-only (no activation quant overhead):")

    for _ in range(warmup):
        _fused_nvfp4_matmul(a, w_nvfp4)
        _fused_fp4_matmul(a, w_mxfp4)
        torch.matmul(a.half(), w_fp16)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        _fused_nvfp4_matmul(a, w_nvfp4)
    torch.cuda.synchronize()
    nvfp4_kern_time = (time.perf_counter() - start) / iters

    start = time.perf_counter()
    for _ in range(iters):
        _fused_fp4_matmul(a, w_mxfp4)
    torch.cuda.synchronize()
    mxfp4_kern_time = (time.perf_counter() - start) / iters

    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a.half(), w_fp16)
    torch.cuda.synchronize()
    fp16_time = (time.perf_counter() - start) / iters

    flops = 2 * M * N * K
    print(f" NVFP4 kernel: {flops/nvfp4_kern_time/1e12:.1f} TFLOPS ({nvfp4_kern_time*1000:.2f}ms)")
    print(f" MXFP4 kernel: {flops/mxfp4_kern_time/1e12:.1f} TFLOPS ({mxfp4_kern_time*1000:.2f}ms)")
    print(f" BF16 cuBLAS: {flops/fp16_time/1e12:.1f} TFLOPS ({fp16_time*1000:.2f}ms)")
    print()

    # ---- Full pipeline benchmark (includes Hadamard + activation quant) ----
    print(f"{M}x{N}x{K} Full pipeline (Hadamard + act quant + kernel):")

    for _ in range(warmup):
        nvfp4_gemm(a, w_nvfp4)
        mxfp4_gemm(a, w_mxfp4)
        mxfp4_gemm_legacy(a, w_mxfp4)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        nvfp4_gemm(a, w_nvfp4)
    torch.cuda.synchronize()
    nvfp4_pipe_time = (time.perf_counter() - start) / iters

    start = time.perf_counter()
    for _ in range(iters):
        mxfp4_gemm(a, w_mxfp4)
    torch.cuda.synchronize()
    mxfp4_pipe_time = (time.perf_counter() - start) / iters

    start = time.perf_counter()
    for _ in range(iters):
        mxfp4_gemm_legacy(a, w_mxfp4)
    torch.cuda.synchronize()
    legacy_time = (time.perf_counter() - start) / iters

    print(f" NVFP4 pipeline: {flops/nvfp4_pipe_time/1e12:.1f} TFLOPS ({nvfp4_pipe_time*1000:.2f}ms)")
    print(f" MXFP4 pipeline: {flops/mxfp4_pipe_time/1e12:.1f} TFLOPS ({mxfp4_pipe_time*1000:.2f}ms)")
    print(f" MXFP4 legacy: {flops/legacy_time/1e12:.1f} TFLOPS ({legacy_time*1000:.2f}ms)")
    act_overhead_nv = nvfp4_pipe_time - nvfp4_kern_time
    act_overhead_mx = mxfp4_pipe_time - mxfp4_kern_time
    print(f" Act quant overhead: NVFP4={act_overhead_nv*1000:.2f}ms MXFP4={act_overhead_mx*1000:.2f}ms")
    print()

    # ---- Probes ----
    print(f" Native FP4 probe: {_can_use_native_fp4()}")
    print(f" Scaled MM FP4 probe: {_can_use_scaled_mm_fp4()}")

    # ---- Accuracy (kernel-only, apples-to-apples) ----
    # Compare fused kernel output vs torch.matmul with same dequantized weights
    # using the SAME activations (no Hadamard/quant noise difference)
    out_nv_kern = _fused_nvfp4_matmul(a, w_nvfp4)
    out_nv_ref = torch.matmul(a.float(), w_nvfp4.to_float()).bfloat16()
    rel_err_nv = (out_nv_kern.float() - out_nv_ref.float()).abs().mean() / out_nv_ref.float().abs().mean()

    out_mx_kern = _fused_fp4_matmul(a, w_mxfp4)
    out_mx_ref = torch.matmul(a.float(), w_mxfp4.to_float()).bfloat16()
    rel_err_mx = (out_mx_kern.float() - out_mx_ref.float()).abs().mean() / out_mx_ref.float().abs().mean()

    # MXFP4 fused vs legacy (both use same pipeline, should match exactly)
    out_mxfp4_fused = mxfp4_gemm(a, w_mxfp4)
    out_legacy = mxfp4_gemm_legacy(a, w_mxfp4)
    rel_err_mx_pipe = (out_mxfp4_fused - out_legacy).abs().mean() / out_legacy.abs().mean()

    print(f" NVFP4 kernel rel_err (vs matmul): {rel_err_nv:.6f}")
    print(f" MXFP4 kernel rel_err (vs matmul): {rel_err_mx:.6f}")
    print(f" MXFP4 fused vs legacy rel_err: {rel_err_mx_pipe:.6f}")

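The accuracy check above reports mean absolute error normalized by the mean magnitude of the reference output, a scale-free metric that makes FP4 paths comparable across problem sizes. The same computation in scalar form:

```python
def mean_rel_err(out, ref):
    """Mean |out - ref| divided by mean |ref|, as in the accuracy check above."""
    num = sum(abs(x - y) for x, y in zip(out, ref)) / len(ref)
    den = sum(abs(y) for y in ref) / len(ref)
    return num / den

# Identical outputs give zero error; a uniform 10% deviation reports ~0.1.
assert mean_rel_err([1.0, 2.0], [1.0, 2.0]) == 0.0
```

Normalizing by mean magnitude (rather than per-element) keeps the metric stable when individual reference entries are near zero, which is common in GEMM outputs.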
if __name__ == "__main__":
    print("FireEcho CUTLASS-Compatible Kernels (self-contained)")
    print("=" * 60)
    print(f"Triton available: True")
    print(f"cudart loaded: {_cudart is not None}")

    if torch.cuda.is_available():
        l2 = L2CacheManager()
        print(f"L2 Cache size: {l2.l2_size / 1e6:.0f} MB")
        print(f"Max persisting: {l2.max_persisting / 1e6:.0f} MB")
        print(f"HW L2 pinning: {l2._hw_available}")
        print(f"Native FP4 (dot_scaled): {_can_use_native_fp4()}")
        print(f"Scaled MM FP4: {_can_use_scaled_mm_fp4()}")

        print()
        benchmark_tma_vs_cublas(sizes=[(2048, 2048, 2048)])
        benchmark_fp4_vs_fp16(M=2048, N=2048, K=2048)
FireEcho Engine/debug_acceptance.log
ADDED
@@ -0,0 +1,92 @@
nohup: ignoring input
Loading model...
[Auto-detect] Qwen3-Omni MoE thinker (30.5B total, ~3.3B active)
[FireEcho] Loading /run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct...
[FireEcho] AutoConfig failed ('Qwen3OmniMoeTalkerCodePredictorConfig' object has no attribute 'use_sliding_window'), loading config.json directly
Qwen3-Omni: will stream-load from 15 shards
[Qwen3 Streaming] Loaded shard index: 28010 keys across 15 shards
[Qwen3 Streaming] Building engine skeleton...
[Qwen3 Streaming] Global params on GPU: 1.2 GB
Layer 4/48: 393 weights, VRAM 2.8 GB, CPU 1.4 GB
Layer 8/48: 393 weights, VRAM 4.3 GB, CPU 1.6 GB
Layer 12/48: 393 weights, VRAM 5.8 GB, CPU 1.7 GB
Layer 16/48: 393 weights, VRAM 7.4 GB, CPU 1.9 GB
Layer 20/48: 393 weights, VRAM 8.9 GB, CPU 2.0 GB
Layer 24/48: 393 weights, VRAM 10.4 GB, CPU 2.2 GB
Layer 28/48: 393 weights, VRAM 11.9 GB, CPU 2.3 GB
Layer 32/48: 393 weights, VRAM 13.5 GB, CPU 2.5 GB
Layer 36/48: 393 weights, VRAM 15.0 GB, CPU 2.6 GB
Layer 40/48: 393 weights, VRAM 16.5 GB, CPU 2.8 GB
Layer 44/48: 393 weights, VRAM 18.0 GB, CPU 2.9 GB
Layer 48/48: 393 weights, VRAM 19.6 GB, CPU 3.1 GB
[Qwen3 Streaming] Final VRAM: 19.6 GB (FP4 quantized)
[Qwen3 Streaming] Done: 1571.8M params, 18867 weights loaded
Total params: 1.57B
Frozen params: 1.54B (base model, FP4)
Trainable params: 30.2M (Hebbian only)
[Packed MoE] 48 layers packed (6144 experts → contiguous)
[Flat KV] Enabled: 4096 tokens, 403 MB
Warmup...

============================================================
Testing D=2 (D=2 baseline)
============================================================
[EAGLE] Loaded legacy D=2 checkpoint. 0 new layer params initialized randomly.
[EAGLE-3] Draft head: D=2, 104.9M params, 210 MB, capture layers [8, 24, 47] + Hebbian memory
Target prefill logits: has_nan=True, min=nan, max=nan
First decoded token: 0 = '!'
Target predicts next: 0 = '!'
Feature layer 8: has_nan=True, min=nan, max=nan
Feature layer 24: has_nan=True, min=nan, max=nan
Feature layer 47: has_nan=True, min=nan, max=nan
Draft tokens:
[0] 0 = '!'
[1] 0 = '!'
[2] 0 = '!'
[3] 0 = '!'
[4] 0 = '!'
Draft logits[0]: has_nan=True, min=nan, max=nan
Target verify predictions:
[1] target=0 ('!'), draft=0 ('!') → MATCH
[2] target=0 ('!'), draft=0 ('!') → MATCH
[3] target=0 ('!'), draft=0 ('!') → MATCH
[4] target=0 ('!'), draft=0 ('!') → MATCH
Accepted: 5/5

--- Full speculative_generate (max_new=30) ---
[EAGLE-3] 5 rounds, 21 drafted, 21 accepted (100%), avg 4.2/round
Output: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

============================================================
Testing D=8 (D=8 with random layers 2-7)
============================================================
[EAGLE] Loaded legacy D=2 checkpoint. 54 new layer params initialized randomly.
[FE-XT] Draft head: D=8, 356.5M params, 713 MB, capture layers [8, 24, 47] + Hebbian memory
Target prefill logits: has_nan=True, min=nan, max=nan
First decoded token: 0 = '!'
Target predicts next: 0 = '!'
Feature layer 8: has_nan=True, min=nan, max=nan
Feature layer 24: has_nan=True, min=nan, max=nan
Feature layer 47: has_nan=True, min=nan, max=nan
Draft tokens:
[0] 0 = '!'
[1] 0 = '!'
[2] 0 = '!'
[3] 0 = '!'
[4] 0 = '!'
Draft logits[0]: has_nan=True, min=nan, max=nan
Target verify predictions:
[1] target=0 ('!'), draft=0 ('!') → MATCH
[2] target=0 ('!'), draft=0 ('!') → MATCH
[3] target=0 ('!'), draft=0 ('!') → MATCH
[4] target=0 ('!'), draft=0 ('!') → MATCH
Accepted: 5/5

--- Full speculative_generate (max_new=30) ---
[EAGLE-3] 5 rounds, 21 drafted, 21 accepted (100%), avg 4.2/round
Output: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

============================================================
D=2 accepted: 5/5
D=8 accepted: 5/5
============================================================
FireEcho Engine/debug_acceptance.py
ADDED
@@ -0,0 +1,152 @@
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""Debug: Why does D=8 eagle head show 100% acceptance?
|
| 3 |
+
Compare draft tokens vs target predictions for D=2 and D=8.
|
| 4 |
+
|
| 5 |
+
ROOT CAUSE FOUND: Missing torch.no_grad() caused NaN logits (Goliath FP4
|
| 6 |
+
Triton kernels don't support autograd). argmax(NaN)=0 for both draft and
|
| 7 |
+
target → fake 100% acceptance. This version fixes that.
|
| 8 |
+
"""
|
| 9 |
+
import sys, os, torch
|
| 10 |
+
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
|
| 11 |
+
from hebbian_finetune_demo import load_engine
|
| 12 |
+
|
| 13 |
+
MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
|
| 14 |
+
EAGLE_CKPT = os.path.join(os.path.dirname(__file__), "eagle_checkpoints", "eagle_best.pt")
|
| 15 |
+
|
| 16 |
+
@torch.no_grad()
|
| 17 |
+
def test_acceptance(engine, tokenizer, num_layers, label):
|
| 18 |
+
"""Enable eagle with given D, run one round of draft+verify, print details."""
|
| 19 |
+
print(f"\n{'='*60}")
|
| 20 |
+
print(f" Testing D={num_layers} ({label})")
|
| 21 |
+
print(f"{'='*60}")
|
| 22 |
+
|
| 23 |
+
# Enable eagle
|
| 24 |
+
engine.enable_eagle(
|
| 25 |
+
capture_layers=(8, 24, 47),
|
| 26 |
+
num_head_layers=num_layers,
|
| 27 |
+
checkpoint_path=EAGLE_CKPT if os.path.exists(EAGLE_CKPT) else None)
|
| 28 |
+
engine.eval()
|
| 29 |
+
|
| 30 |
+
prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWrite a Python function to check if a number is prime.<|im_end|>\n<|im_start|>assistant\n"
|
| 31 |
+
ids = tokenizer.encode(prompt, return_tensors='pt').cuda()
|
| 32 |
+
prompt_len = ids.shape[1]
|
| 33 |
+
|
| 34 |
+
# Prefill
|
| 35 |
+
engine.reset_cache()
|
| 36 |
+
engine._current_seq_id = 0
|
| 37 |
+
if hasattr(engine.kv_cache, '_graph_mode'):
|
| 38 |
+
engine.kv_cache._graph_mode = False
|
| 39 |
+
logits = engine.forward(ids, use_cache=True, position=0)
|
| 40 |
+
current_pos = prompt_len
|
| 41 |
+
|
| 42 |
+
# Check for NaN in target logits
|
| 43 |
+
has_nan = logits.isnan().any().item()
|
| 44 |
+
print(f" Target prefill logits: has_nan={has_nan}, "
|
| 45 |
+
f"min={logits[:,-1,:].min().item():.2f}, max={logits[:,-1,:].max().item():.2f}")
|
| 46 |
+
|
| 47 |
+
# Decode first token
|
| 48 |
+
next_token = logits[:, -1:, :].argmax(dim=-1)
|
| 49 |
+
print(f" First decoded token: {next_token.item()} = '{tokenizer.decode([next_token.item()])}'")
|
| 50 |
+
|
| 51 |
+
# Forward it (stores KV, captures hidden states)
|
| 52 |
+
logits = engine.forward(next_token, use_cache=True, position=current_pos)
|
| 53 |
+
current_pos += 1
|
| 54 |
+
|
| 55 |
+
# Target model's prediction
|
| 56 |
+
main_pred = logits[:, -1, :].argmax(dim=-1).item()
|
| 57 |
+
print(f" Target predicts next: {main_pred} = '{tokenizer.decode([main_pred])}'")
|
| 58 |
+
|
| 59 |
+
# Draft 5 tokens
|
| 60 |
+
features = [engine._eagle_hidden_states[l]
|
| 61 |
+
for l in engine._eagle_capture_layers]
|
| 62 |
+
|
| 63 |
+
# Check features for NaN
|
| 64 |
+
for li, f in zip(engine._eagle_capture_layers, features):
|
| 65 |
+
print(f" Feature layer {li}: has_nan={f.isnan().any().item()}, "
|
| 66 |
+
f"min={f.min().item():.4f}, max={f.max().item():.4f}")
|
| 67 |
+
|
| 68 |
+
memory_ctx = engine._get_eagle_memory_context(
|
| 69 |
+
        engine._eagle_hidden_states[engine._eagle_capture_layers[-1]])

    draft_tokens, draft_logits = engine.eagle_head.generate_draft(
        features, next_token, engine.embed, depth=5,
        memory_context=memory_ctx)

    print(f"  Draft tokens:")
    for i, dt in enumerate(draft_tokens):
        tok_id = dt.item()
        print(f"    [{i}] {tok_id} = '{tokenizer.decode([tok_id])}'")

    # Check draft logits for NaN
    dl0 = draft_logits[0][0, 0, :]
    print(f"  Draft logits[0]: has_nan={dl0.isnan().any().item()}, "
          f"min={dl0.min().item():.2f}, max={dl0.max().item():.2f}")

    # Verify: forward draft tokens through target
    draft_input = torch.cat(draft_tokens, dim=1)
    verify_logits = engine.forward(draft_input, use_cache=True, position=current_pos)

    print(f"  Target verify predictions:")
    accepted = 0
    if draft_tokens[0].item() == main_pred:
        accepted = 1
        for i in range(1, len(draft_tokens)):
            target_pred = verify_logits[:, i - 1, :].argmax(dim=-1).item()
            match = "MATCH" if draft_tokens[i].item() == target_pred else "MISS"
            print(f"    [{i}] target={target_pred} ('{tokenizer.decode([target_pred])}'), "
                  f"draft={draft_tokens[i].item()} ('{tokenizer.decode([draft_tokens[i].item()])}') → {match}")
            if draft_tokens[i].item() == target_pred:
                accepted += 1
            else:
                break
    else:
        print(f"    [0] MISS: draft[0]={draft_tokens[0].item()} "
              f"('{tokenizer.decode([draft_tokens[0].item()])}') "
              f"!= main_pred={main_pred} ('{tokenizer.decode([main_pred])}')")

    print(f"  Accepted: {accepted}/{len(draft_tokens)}")

    # Also run full speculative_generate to match training eval
    print(f"\n  --- Full speculative_generate (max_new=30) ---")
    engine.reset_cache()
    ids2 = tokenizer.encode(prompt, return_tensors='pt').cuda()
    out = engine.speculative_generate(
        ids2, max_new_tokens=30, temperature=0.0,
        stop_tokens=[199999, 200020])
    text = tokenizer.decode(out[0, ids2.shape[1]:], skip_special_tokens=True)
    print(f"  Output: {text[:120]}")

    # Cleanup eagle
    del engine.eagle_head
    engine._eagle_enabled = False

    return accepted


if __name__ == "__main__":
    print("Loading model...")
    engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=4096, device="cuda")
    engine.pack_all_experts()
    engine.kv_cache.enable_flat_decode()
    engine.eval()

    # Warmup
    print("Warmup...")
    warmup_ids = tokenizer.encode("Hello", return_tensors='pt').cuda()
    for _ in range(3):
        engine.generate(warmup_ids, max_new_tokens=5, temperature=0.0, top_k=0, top_p=1.0)

    # Test D=2
    acc2 = test_acceptance(engine, tokenizer, 2, "D=2 baseline")

    # Test D=8
    acc8 = test_acceptance(engine, tokenizer, 8, "D=8 with random layers 2-7")

    print(f"\n{'='*60}")
    print(f"  D=2 accepted: {acc2}/5")
    print(f"  D=8 accepted: {acc8}/5")
    if acc8 > acc2 + 2:
        print(f"  WARNING: D=8 significantly better than D=2 — investigate!")
    elif acc2 <= 2 and acc8 <= 2:
        print(f"  EXPECTED: Both D=2 and D=8 have low acceptance (undertrained)")
    print(f"{'='*60}")
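The acceptance rule used above (draft[0] must match the target's current prediction; each later draft token must match the target's argmax at the preceding verify position; counting stops at the first miss) can be sketched standalone. `count_accepted` is a hypothetical helper for illustration, not an engine API:

```python
def count_accepted(draft_tokens, main_pred, verify_preds):
    """Greedy-acceptance count for speculative decoding.

    draft_tokens[0] must equal the target's prediction for the current
    position (main_pred); draft_tokens[i] must equal the target's
    prediction produced after verifying draft_tokens[i-1]
    (verify_preds[i-1]). Counting stops at the first mismatch.
    """
    if not draft_tokens or draft_tokens[0] != main_pred:
        return 0
    accepted = 1
    for i in range(1, len(draft_tokens)):
        if draft_tokens[i] != verify_preds[i - 1]:
            break
        accepted += 1
    return accepted


print(count_accepted([5, 7, 9, 2], 5, [7, 9, 4]))  # 3: miss at draft[3]
print(count_accepted([5], 6, []))                   # 0: first token misses
```

Any miss cuts the chain, which is why a single bad draft token early in the tree caps the per-round speedup.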
FireEcho Engine/debug_bisect.log
ADDED
@@ -0,0 +1,78 @@
============================================================
 Training Flow Bisection
============================================================

[Step 1] load_engine(max_seq_len=4096)...
[Auto-detect] Qwen3-Omni MoE thinker (30.5B total, ~3.3B active)
[FireEcho] Loading /run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct...
[FireEcho] AutoConfig failed ('Qwen3OmniMoeTalkerCodePredictorConfig' object has no attribute 'use_sliding_window'), loading config.json directly
Qwen3-Omni: will stream-load from 15 shards
[Qwen3 Streaming] Loaded shard index: 28010 keys across 15 shards
[Qwen3 Streaming] Building engine skeleton...
[Qwen3 Streaming] Global params on GPU: 1.2 GB
  Layer 4/48: 393 weights, VRAM 2.8 GB, CPU 1.4 GB
  Layer 8/48: 393 weights, VRAM 4.3 GB, CPU 1.6 GB
  Layer 12/48: 393 weights, VRAM 5.8 GB, CPU 1.7 GB
  Layer 16/48: 393 weights, VRAM 7.4 GB, CPU 1.9 GB
  Layer 20/48: 393 weights, VRAM 8.9 GB, CPU 2.0 GB
  Layer 24/48: 393 weights, VRAM 10.4 GB, CPU 2.2 GB
  Layer 28/48: 393 weights, VRAM 11.9 GB, CPU 2.3 GB
  Layer 32/48: 393 weights, VRAM 13.5 GB, CPU 2.5 GB
  Layer 36/48: 393 weights, VRAM 15.0 GB, CPU 2.6 GB
  Layer 40/48: 393 weights, VRAM 16.5 GB, CPU 2.8 GB
  Layer 44/48: 393 weights, VRAM 18.0 GB, CPU 2.9 GB
  Layer 48/48: 393 weights, VRAM 19.6 GB, CPU 3.1 GB
[Qwen3 Streaming] Final VRAM: 19.6 GB (FP4 quantized)
[Qwen3 Streaming] Done: 1571.8M params, 18867 weights loaded
Total params: 1.57B
Frozen params: 1.54B (base model, FP4)
Trainable params: 30.2M (Hebbian only)
Traceback (most recent call last):
  File "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/debug_bisect.py", line 43, in <module>
    check(engine, tokenizer, "after load")
  File "/run/media/echo/Echo/ECHO/.venv_infer312/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/debug_bisect.py", line 23, in check
    logits = engine.forward(ids, use_cache=True, position=0)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/fireecho_kernel.py", line 9964, in forward
    x = layer(x, self.kv_cache, self._current_seq_id, position, use_cache)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/echo/Echo/ECHO/.venv_infer312/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/echo/Echo/ECHO/.venv_infer312/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/fireecho_kernel.py", line 8820, in forward
    x = x + self.ffn(self.norm2(x))
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/echo/Echo/ECHO/.venv_infer312/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/echo/Echo/ECHO/.venv_infer312/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/fireecho_kernel.py", line 8710, in forward
    expert_out = self.experts[expert_idx](selected)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/echo/Echo/ECHO/.venv_infer312/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/echo/Echo/ECHO/.venv_infer312/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/fireecho_kernel.py", line 7565, in forward
    gate_up = self.gate_up_proj(x)  # [*, 2*intermediate]
              ^^^^^^^^^^^^^^^^^^^^
  File "/run/media/echo/Echo/ECHO/.venv_infer312/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/echo/Echo/ECHO/.venv_infer312/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/fireecho_kernel.py", line 7339, in forward
    return F.linear(x, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: size mismatch, got input (5), mat (5x2048), vec (0)
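The final `RuntimeError` in this log comes from `F.linear` being handed a zero-element activation vector (the `vec (0)` in the message). A minimal, hypothetical repro of that failure shape, using made-up tensor sizes rather than the engine's real ones (the actual root cause is a dimension collapsing to 0 somewhere upstream of `gate_up_proj`):

```python
import torch
import torch.nn.functional as F

# Hypothetical repro: a [5, 2048] weight and 5-element bias, but the 1-D
# activation reaching the layer is empty. F.linear then raises a size
# mismatch, the same class of error as the "vec (0)" in the log above.
weight = torch.randn(5, 2048)
bias = torch.randn(5)
x = torch.empty(0)  # zero-sized activation vector

raised = False
try:
    F.linear(x, weight, bias)
except RuntimeError as e:
    raised = True
    print(f"RuntimeError: {e}")
print("raised:", raised)
```

The useful diagnostic point: the mismatch is detected only when the expert actually fires, so the crash site names the expert layer, not the code that produced the empty tensor.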
FireEcho Engine/debug_bisect.py
ADDED
@@ -0,0 +1,149 @@
#!/usr/bin/env python3
"""Bisect: exactly which step of the training flow causes NaN.

Replicates train_eagle_head.py main() step by step, checking forward() after each.
FIXED: pack_all_experts + enable_flat_decode BEFORE first forward() call.
"""
import sys, os, torch, gc, time
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from hebbian_finetune_demo import load_engine
from fireecho_kernel import FireEchoEagleHead

MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
EAGLE_CKPT = os.path.join(os.path.dirname(__file__), "eagle_checkpoints", "eagle_best.pt")
PROMPT = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n"


@torch.no_grad()
def check(engine, tokenizer, label):
    ids = tokenizer.encode(PROMPT, return_tensors='pt').cuda()
    engine.reset_cache()
    engine._current_seq_id = 0
    if hasattr(engine.kv_cache, '_graph_mode'):
        engine.kv_cache._graph_mode = False
    logits = engine.forward(ids, use_cache=True, position=0)
    torch.cuda.synchronize()
    has_nan = logits.isnan().any().item()
    vram = torch.cuda.memory_allocated() / 1e9
    if has_nan:
        print(f"  [{label}] *** NaN DETECTED *** VRAM={vram:.2f}GB")
    else:
        top = logits[:, -1, :].argmax(dim=-1).item()
        print(f"  [{label}] OK top={top} ('{tokenizer.decode([top])}') VRAM={vram:.2f}GB")
    return has_nan


@torch.no_grad()
def check_speculative(engine, tokenizer, label):
    """Test speculative_generate specifically."""
    prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWrite a Python function to check if a number is prime.<|im_end|>\n<|im_start|>assistant\n"
    ids = tokenizer.encode(prompt, return_tensors="pt").cuda()
    engine.reset_cache()
    engine.eval()
    eos_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
    stop = [eos_id] if eos_id else [151645]
    out = engine.speculative_generate(ids, max_new_tokens=20, temperature=0.0, stop_tokens=stop)
    gen_tokens = out[0, ids.shape[1]:].tolist()
    text = tokenizer.decode(gen_tokens, skip_special_tokens=True)
    all_same = len(set(gen_tokens)) <= 1 if gen_tokens else True
    if all_same and len(gen_tokens) > 3:
        print(f"  [{label}] *** ALL SAME TOKEN *** = NaN bug! tokens={gen_tokens[:5]}")
        return True
    else:
        print(f"  [{label}] OK: '{text[:80]}' ({len(gen_tokens)} tokens, {len(set(gen_tokens))} unique)")
        return False


if __name__ == "__main__":
    print("=" * 60)
    print("  Training Flow Bisection (v2 — fixed)")
    print("=" * 60)

    # === Step 1: load_engine (matches training exactly) ===
    print("\n[Step 1] load_engine(max_seq_len=4096) + eval + flat_decode + pack...")
    engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=4096, device="cuda")
    engine.eval()
    engine.kv_cache.enable_flat_decode(4096)
    engine.pack_all_experts()
    nan1 = check(engine, tokenizer, "after load+pack+flat")
    if nan1:
        print("  FATAL: NaN at baseline! Cannot continue.")
        sys.exit(1)

    # === Step 2: enable_eagle D=8 (NO checkpoint, matches training) ===
    print("\n[Step 2] enable_eagle(D=8, no checkpoint)...")
    engine.enable_eagle(
        capture_layers=(8, 24, 47), num_heads=16, ffn_mult=2,
        draft_depth=5, num_head_layers=8)
    nan2 = check(engine, tokenizer, "after eagle D=8 random")

    # === Step 3: create optimizer ===
    print("\n[Step 3] create AdamW optimizer...")
    eagle = engine.eagle_head
    eagle_params = [p for n, p in eagle.named_parameters()
                    if 'lm_head' not in n and p.requires_grad]
    optimizer = torch.optim.AdamW(eagle_params, lr=3e-4, betas=(0.9, 0.95), weight_decay=0.0)
    nan3 = check(engine, tokenizer, "after optimizer")

    # === Step 4: load_checkpoint (matches training: weights_only=False) ===
    print("\n[Step 4] load_checkpoint...")
    if os.path.exists(EAGLE_CKPT):
        ckpt = torch.load(EAGLE_CKPT, weights_only=False, map_location='cuda')
        sd = ckpt.get('eagle_head', ckpt)
        is_legacy = any(k.startswith('norm1.') or k.startswith('q_proj.') for k in sd)
        if is_legacy:
            eagle.load_legacy_checkpoint(sd)
            print("  Loaded legacy checkpoint")
        else:
            eagle.load_state_dict(sd, strict=False)
            print("  Loaded new-format checkpoint")
        if 'optimizer' in ckpt:
            try:
                optimizer.load_state_dict(ckpt['optimizer'])
                print("  Loaded optimizer state")
            except (ValueError, KeyError) as e:
                print(f"  Optimizer mismatch: {e}")
        step = ckpt.get('step', 0)
        print(f"  Step={step}")
        del ckpt
        torch.cuda.empty_cache()
    else:
        print("  No checkpoint found, using random weights")
    nan4 = check(engine, tokenizer, "after ckpt load")

    # === Step 5: warmup ===
    print("\n[Step 5] warmup 3x generate()...")
    wids = tokenizer.encode("Hello", return_tensors='pt').cuda()
    for i in range(3):
        out = engine.generate(wids, max_new_tokens=5, temperature=0.0, top_k=0, top_p=1.0)
        text = tokenizer.decode(out[0, wids.shape[1]:], skip_special_tokens=True)
        print(f"  Warmup {i}: '{text}'")
    del wids
    nan5 = check(engine, tokenizer, "after warmup")

    # === Step 6: speculative_generate (the actual eval path) ===
    print("\n[Step 6] speculative_generate()...")
    nan6 = check_speculative(engine, tokenizer, "speculative_generate")

    # === Summary ===
    print("\n" + "=" * 60)
    print("  BISECTION RESULTS")
    print("=" * 60)
    results = [
        ("Step 1: load+pack+flat", nan1),
        ("Step 2: enable_eagle D=8", nan2),
        ("Step 3: create optimizer", nan3),
        ("Step 4: load checkpoint", nan4),
        ("Step 5: warmup", nan5),
        ("Step 6: speculative_generate", nan6),
    ]
    for name, had_nan in results:
        status = "*** NaN ***" if had_nan else "OK"
        print(f"  {name}: {status}")

    first_fail = next((name for name, nan in results if nan), None)
    if first_fail:
        print(f"\n  FIRST FAILURE: {first_fail}")
    else:
        print(f"\n  ALL PASSED — no NaN detected!")
    print("=" * 60)
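The bisection above localizes NaN to a setup *step*; a natural companion is localizing it to a *module* within a single forward pass via forward hooks. A self-contained sketch with generic `nn.Module` code, not engine-specific (the `attach_nan_watch` helper and `Bad` demo layer are illustrative names):

```python
import torch
import torch.nn as nn


def attach_nan_watch(model):
    """Register forward hooks that record the name of the first module
    whose tensor output contains NaN/Inf during a forward pass.
    Sketch: only handles modules that return a single tensor."""
    state = {"first_bad": None}

    def make_hook(name):
        def hook(module, inputs, output):
            if state["first_bad"] is None and torch.is_tensor(output):
                if not torch.isfinite(output).all():
                    state["first_bad"] = name
        return hook

    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules() if n]
    return state, handles


# Tiny demo: a layer that deliberately injects NaN mid-network.
class Bad(nn.Module):
    def forward(self, x):
        return x * float("nan")


model = nn.Sequential(nn.Linear(4, 4), Bad(), nn.Linear(4, 4))
state, handles = attach_nan_watch(model)
model(torch.randn(2, 4))
print("first NaN at:", state["first_bad"])  # "1" (the Bad layer)
for h in handles:
    h.remove()
```

Hooks fire in execution order, so the recorded name is the first corrupting module even though everything downstream also reports NaN.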
FireEcho Engine/debug_d8_isolate.log
ADDED
@@ -0,0 +1,79 @@
============================================================
 D=8 NaN Isolation
============================================================

[1] Loading model...
[Auto-detect] Qwen3-Omni MoE thinker (30.5B total, ~3.3B active)
[FireEcho] Loading /run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct...
[FireEcho] AutoConfig failed ('Qwen3OmniMoeTalkerCodePredictorConfig' object has no attribute 'use_sliding_window'), loading config.json directly
Qwen3-Omni: will stream-load from 15 shards
[Qwen3 Streaming] Loaded shard index: 28010 keys across 15 shards
[Qwen3 Streaming] Building engine skeleton...
[Qwen3 Streaming] Global params on GPU: 1.2 GB
  Layer 4/48: 393 weights, VRAM 2.8 GB, CPU 1.4 GB
  Layer 8/48: 393 weights, VRAM 4.3 GB, CPU 1.6 GB
  Layer 12/48: 393 weights, VRAM 5.8 GB, CPU 1.7 GB
  Layer 16/48: 393 weights, VRAM 7.4 GB, CPU 1.9 GB
  Layer 20/48: 393 weights, VRAM 8.9 GB, CPU 2.0 GB
  Layer 24/48: 393 weights, VRAM 10.4 GB, CPU 2.2 GB
  Layer 28/48: 393 weights, VRAM 11.9 GB, CPU 2.3 GB
  Layer 32/48: 393 weights, VRAM 13.5 GB, CPU 2.5 GB
  Layer 36/48: 393 weights, VRAM 15.0 GB, CPU 2.6 GB
  Layer 40/48: 393 weights, VRAM 16.5 GB, CPU 2.8 GB
  Layer 44/48: 393 weights, VRAM 18.0 GB, CPU 2.9 GB
  Layer 48/48: 393 weights, VRAM 19.6 GB, CPU 3.1 GB
[Qwen3 Streaming] Final VRAM: 19.6 GB (FP4 quantized)
[Qwen3 Streaming] Done: 1571.8M params, 18867 weights loaded
Total params: 1.57B
Frozen params: 1.54B (base model, FP4)
Trainable params: 30.2M (Hebbian only)
[Packed MoE] 48 layers packed (6144 experts → contiguous)
[Flat KV] Enabled: 4096 tokens, 403 MB

[2] Warmup...
  VRAM baseline: 19.96 GB

[3] Baseline (no eagle)...
  [baseline] OK — top=13048 ('Hi')

[4] D=2 eagle head...
[EAGLE] Loaded legacy D=2 checkpoint. 0 new layer params initialized randomly.
[EAGLE-3] Draft head: D=2, 104.9M params, 210 MB, capture layers [8, 24, 47] + Hebbian memory
  VRAM: 20.17 GB (+0.21)
  [D=2] OK — top=13048 ('Hi')

[5] D=8 eagle head (random init, no checkpoint)...
[FE-XT] Draft head: D=8, 356.5M params, 713 MB, capture layers [8, 24, 47] + Hebbian memory
  VRAM: 20.67 GB (+0.72)
  [D=8 random] OK — top=13048 ('Hi')

[6] D=8 eagle head (with checkpoint)...
[EAGLE] Loaded legacy D=2 checkpoint. 54 new layer params initialized randomly.
[FE-XT] Draft head: D=8, 356.5M params, 713 MB, capture layers [8, 24, 47] + Hebbian memory
  VRAM: 20.67 GB (+0.72)
  [D=8 with ckpt] OK — top=13048 ('Hi')

[7] D=8 eagle head (allocated, NOT registered on engine)...
  VRAM: 20.67 GB (+0.72)
  [D=8 unregistered] OK — top=13048 ('Hi')

[8] D=4 eagle head (checkpoint)...
[EAGLE] Loaded legacy D=2 checkpoint. 18 new layer params initialized randomly.
[FE-XT] Draft head: D=4, 188.8M params, 378 MB, capture layers [8, 24, 47] + Hebbian memory
  VRAM: 20.34 GB (+0.38)
  [D=4] OK — top=13048 ('Hi')

[9] D=8 eagle head, but _eagle_enabled=False...
[EAGLE] Loaded legacy D=2 checkpoint. 54 new layer params initialized randomly.
[FE-XT] Draft head: D=8, 356.5M params, 713 MB, capture layers [8, 24, 47] + Hebbian memory
  VRAM: 20.67 GB (+0.72)
  [D=8 flag OFF] OK — top=13048 ('Hi')

============================================================
 RESULTS
============================================================
  D=8 random:       OK
  D=8 with ckpt:    OK
  D=8 unregistered: OK
  D=4:              OK
  D=8 flag OFF:     OK
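Test [7] in this log hinges on whether the head is *registered* on the engine. In PyTorch, assigning an `nn.Module` as an attribute of another module auto-registers it as a submodule, which changes what `.parameters()`, `.to()`, and `.eval()` traverse; a head kept only in a local variable does not participate. A minimal sketch (the `Engine` class here is hypothetical):

```python
import torch.nn as nn


class Engine(nn.Module):
    """Stand-in for the real engine: one registered submodule."""
    def __init__(self):
        super().__init__()
        self.core = nn.Linear(2, 2)


e = Engine()
head = nn.Linear(2, 2)  # held in a local variable: invisible to e

print(len(list(e.parameters())))  # 2: core weight + bias
e.eagle_head = head               # attribute assignment registers the module
print(len(list(e.parameters())))  # 4: head's weight + bias now included
```

That distinction is exactly what the "allocated, NOT registered" case isolates: if registration itself caused the NaN, the local-variable variant would have stayed clean while the registered one failed.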
FireEcho Engine/debug_d8_isolate.py
ADDED
@@ -0,0 +1,156 @@
#!/usr/bin/env python3
"""Isolate exactly what about D=8 causes NaN.

Tests:
1. D=2 eagle head → forward → should be OK
2. D=8 eagle head (random, no ckpt) → forward → is NaN from VRAM pressure?
3. D=8 eagle head (random, NOT assigned to engine) → forward → is NaN from registration?
4. D=8 allocated but eagle_enabled=False → forward → is NaN from .to() side effect?
"""
import sys, os, torch, gc
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from hebbian_finetune_demo import load_engine
from fireecho_kernel import FireEchoEagleHead

MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
EAGLE_CKPT = os.path.join(os.path.dirname(__file__), "eagle_checkpoints", "eagle_best.pt")

PROMPT = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n"


@torch.no_grad()
def check(engine, tokenizer, label):
    ids = tokenizer.encode(PROMPT, return_tensors='pt').cuda()
    engine.reset_cache()
    engine._current_seq_id = 0
    if hasattr(engine.kv_cache, '_graph_mode'):
        engine.kv_cache._graph_mode = False
    logits = engine.forward(ids, use_cache=True, position=0)
    torch.cuda.synchronize()
    has_nan = logits.isnan().any().item()
    if has_nan:
        print(f"  [{label}] NaN DETECTED")
    else:
        top = logits[:, -1, :].argmax(dim=-1).item()
        print(f"  [{label}] OK — top={top} ('{tokenizer.decode([top])}')")
    return has_nan


if __name__ == "__main__":
    print("=" * 60)
    print("  D=8 NaN Isolation")
    print("=" * 60)

    print("\n[1] Loading model...")
    engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=4096, device="cuda")
    engine.pack_all_experts()
    engine.kv_cache.enable_flat_decode()
    engine.eval()

    # Warmup
    print("\n[2] Warmup...")
    wids = tokenizer.encode("Hello", return_tensors='pt').cuda()
    for _ in range(3):
        engine.generate(wids, max_new_tokens=5, temperature=0.0, top_k=0, top_p=1.0)

    vram_base = torch.cuda.memory_allocated() / 1e9
    print(f"  VRAM baseline: {vram_base:.2f} GB")

    # Test 1: Baseline (no eagle)
    print("\n[3] Baseline (no eagle)...")
    check(engine, tokenizer, "baseline")

    # Test 2: D=2 eagle head (should work)
    print("\n[4] D=2 eagle head...")
    engine.enable_eagle(capture_layers=(8, 24, 47), num_heads=16, ffn_mult=2,
                        num_head_layers=2, checkpoint_path=EAGLE_CKPT)
    vram = torch.cuda.memory_allocated() / 1e9
    print(f"  VRAM: {vram:.2f} GB (+{vram - vram_base:.2f})")
    check(engine, tokenizer, "D=2")
    # Cleanup
    del engine.eagle_head
    engine._eagle_enabled = False
    engine._eagle_hidden_states = {}
    torch.cuda.empty_cache()
    gc.collect()

    # Test 3: D=8 eagle head (NO checkpoint, random init)
    print("\n[5] D=8 eagle head (random init, no checkpoint)...")
    engine.enable_eagle(capture_layers=(8, 24, 47), num_heads=16, ffn_mult=2,
                        num_head_layers=8)  # no checkpoint_path
    vram = torch.cuda.memory_allocated() / 1e9
    print(f"  VRAM: {vram:.2f} GB (+{vram - vram_base:.2f})")
    nan_d8_random = check(engine, tokenizer, "D=8 random")
    # Cleanup
    del engine.eagle_head
    engine._eagle_enabled = False
    engine._eagle_hidden_states = {}
    torch.cuda.empty_cache()
    gc.collect()

    # Test 4: D=8 eagle head WITH checkpoint
    print("\n[6] D=8 eagle head (with checkpoint)...")
    engine.enable_eagle(capture_layers=(8, 24, 47), num_heads=16, ffn_mult=2,
                        num_head_layers=8, checkpoint_path=EAGLE_CKPT)
    vram = torch.cuda.memory_allocated() / 1e9
    print(f"  VRAM: {vram:.2f} GB (+{vram - vram_base:.2f})")
    nan_d8_ckpt = check(engine, tokenizer, "D=8 with ckpt")
    # Cleanup
    del engine.eagle_head
    engine._eagle_enabled = False
    engine._eagle_hidden_states = {}
    torch.cuda.empty_cache()
    gc.collect()

    # Test 5: D=8 eagle head allocated but NOT registered as submodule
    print("\n[7] D=8 eagle head (allocated, NOT registered on engine)...")
    head_ext = FireEchoEagleHead(
        dim=config.dim, num_capture_layers=3,
        num_heads=16, ffn_mult=2, num_layers=8,
    ).to(dtype=torch.bfloat16, device='cuda')
    # Do NOT assign to engine — keep as local variable
    engine._eagle_enabled = True
    engine._eagle_capture_set = {8, 24, 47}
    engine._eagle_capture_layers = [8, 24, 47]
    engine._eagle_hidden_states = {}
    vram = torch.cuda.memory_allocated() / 1e9
    print(f"  VRAM: {vram:.2f} GB (+{vram - vram_base:.2f})")
    nan_d8_unreg = check(engine, tokenizer, "D=8 unregistered")
    # Cleanup
    del head_ext
    engine._eagle_enabled = False
    torch.cuda.empty_cache()
    gc.collect()

    # Test 6: D=4 eagle head (between D=2 and D=8)
    print("\n[8] D=4 eagle head (checkpoint)...")
    engine.enable_eagle(capture_layers=(8, 24, 47), num_heads=16, ffn_mult=2,
                        num_head_layers=4, checkpoint_path=EAGLE_CKPT)
    vram = torch.cuda.memory_allocated() / 1e9
    print(f"  VRAM: {vram:.2f} GB (+{vram - vram_base:.2f})")
    nan_d4 = check(engine, tokenizer, "D=4")
    # Cleanup
    del engine.eagle_head
    engine._eagle_enabled = False
    engine._eagle_hidden_states = {}
    torch.cuda.empty_cache()
    gc.collect()

    # Test 7: D=8 but eagle_enabled=False (head exists but flag off)
    print("\n[9] D=8 eagle head, but _eagle_enabled=False...")
    engine.enable_eagle(capture_layers=(8, 24, 47), num_heads=16, ffn_mult=2,
                        num_head_layers=8, checkpoint_path=EAGLE_CKPT)
    engine._eagle_enabled = False  # disable the flag
    vram = torch.cuda.memory_allocated() / 1e9
    print(f"  VRAM: {vram:.2f} GB (+{vram - vram_base:.2f})")
    nan_d8_flagoff = check(engine, tokenizer, "D=8 flag OFF")

    # Summary
    print(f"\n{'='*60}")
    print("  RESULTS")
    print(f"{'='*60}")
    print(f"  D=8 random:       {'NaN' if nan_d8_random else 'OK'}")
    print(f"  D=8 with ckpt:    {'NaN' if nan_d8_ckpt else 'OK'}")
    print(f"  D=8 unregistered: {'NaN' if nan_d8_unreg else 'OK'}")
    print(f"  D=4:              {'NaN' if nan_d4 else 'OK'}")
    print(f"  D=8 flag OFF:     {'NaN' if nan_d8_flagoff else 'OK'}")
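After a partial checkpoint load like the "Loaded legacy D=2 checkpoint. 54 new layer params initialized randomly" path exercised above, a cheap follow-up check is scanning every parameter for non-finite values before running any forward pass. A generic sketch (`scan_params` is a hypothetical helper, not an engine API):

```python
import torch
import torch.nn as nn


def scan_params(module):
    """Return names of parameters containing NaN/Inf. Useful right after a
    partial (strict=False) checkpoint load, where freshly added layers keep
    their random init and any corrupt tensor shows up immediately."""
    return [name for name, p in module.named_parameters()
            if not torch.isfinite(p).all()]


m = nn.Linear(3, 3)
with torch.no_grad():
    m.weight[0, 0] = float("nan")  # simulate a corrupt checkpoint tensor
print(scan_params(m))  # ['weight']
```

A clean scan here rules out the checkpoint tensors themselves, pushing suspicion onto the runtime path (kernels, caches, dtype casts) instead.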
FireEcho Engine/debug_eval_flow.log
ADDED
@@ -0,0 +1,75 @@
| 1 |
+
============================================================
|
| 2 |
+
Eval Flow Test (replicates training eval)
|
| 3 |
+
============================================================
|

[1] Loading model...
[Auto-detect] Qwen3-Omni MoE thinker (30.5B total, ~3.3B active)
[FireEcho] Loading /run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct...
[FireEcho] AutoConfig failed ('Qwen3OmniMoeTalkerCodePredictorConfig' object has no attribute 'use_sliding_window'), loading config.json directly
Qwen3-Omni: will stream-load from 15 shards
[Qwen3 Streaming] Loaded shard index: 28010 keys across 15 shards
[Qwen3 Streaming] Building engine skeleton...
[Qwen3 Streaming] Global params on GPU: 1.2 GB
  Layer 4/48: 393 weights, VRAM 2.8 GB, CPU 1.4 GB
  Layer 8/48: 393 weights, VRAM 4.3 GB, CPU 1.6 GB
  Layer 12/48: 393 weights, VRAM 5.8 GB, CPU 1.7 GB
  Layer 16/48: 393 weights, VRAM 7.3 GB, CPU 1.9 GB
  Layer 20/48: 393 weights, VRAM 8.9 GB, CPU 2.0 GB
  Layer 24/48: 393 weights, VRAM 10.4 GB, CPU 2.2 GB
  Layer 28/48: 393 weights, VRAM 11.9 GB, CPU 2.3 GB
  Layer 32/48: 393 weights, VRAM 13.4 GB, CPU 2.5 GB
  Layer 36/48: 393 weights, VRAM 15.0 GB, CPU 2.6 GB
  Layer 40/48: 393 weights, VRAM 16.5 GB, CPU 2.8 GB
  Layer 44/48: 393 weights, VRAM 18.0 GB, CPU 2.9 GB
  Layer 48/48: 393 weights, VRAM 19.5 GB, CPU 3.1 GB
[Qwen3 Streaming] Final VRAM: 19.5 GB (FP4 quantized)
[Qwen3 Streaming] Done: 1571.8M params, 18867 weights loaded
Total params: 1.57B
Frozen params: 1.54B (base model, FP4)
Trainable params: 30.2M (Hebbian only)
[Flat KV] Enabled: 4096 tokens, 403 MB
[Packed MoE] 48 layers packed (6144 experts → contiguous)

[2] Enabling EAGLE (no checkpoint)...
[FE-XT] Draft head: D=8, 356.5M params, 713 MB, capture layers [8, 24, 47] + Hebbian memory

[3] Loading checkpoint separately (like training script)...
[EAGLE] Loaded legacy D=2 checkpoint. 54 new layer params initialized randomly.
  Loaded checkpoint (step 4000)
  VRAM: 21.25 GB

[4a] Running manual speculation test WITHOUT warmup...

--- Manual speculation test ---
  Prefill logits: has_nan=True
  FATAL: NaN in prefill! Cannot continue.

[4b] Warmup (3x generate)...
  Warmup done

[4c] Running manual speculation test AFTER warmup...

--- Manual speculation test ---
  Prefill logits: has_nan=True
  FATAL: NaN in prefill! Cannot continue.

[5] Running full speculative_generate eval...
[EAGLE-3] 9 rounds, 43 drafted, 43 accepted (100%), avg 4.8/round

  Prompt 0: 61 tokens, 21.3 tok/s
  Output: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  WARNING: All tokens are the same (0) — likely NaN bug!
[EAGLE-3] 9 rounds, 43 drafted, 43 accepted (100%), avg 4.8/round

  Prompt 1: 61 tokens, 32.5 tok/s
  Output: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  WARNING: All tokens are the same (0) — likely NaN bug!
[EAGLE-3] 9 rounds, 43 drafted, 43 accepted (100%), avg 4.8/round

  Prompt 2: 61 tokens, 31.7 tok/s
  Output: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  WARNING: All tokens are the same (0) — likely NaN bug!

============================================================
 Done
============================================================
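The WARNING lines in the log come from a simple degenerate-output heuristic: a long run of one repeated token (here token 0, which decodes to '!') is the observed footprint of NaN logits reaching argmax. That check, factored out as a standalone pure-Python sketch (hypothetical helper name, mirroring the condition in the eval script):

```python
def looks_degenerate(token_ids, min_len=6):
    """Flag a generation whose tokens are all identical, as the
    'All tokens are the same' WARNING in the log does. Requires at
    least min_len tokens so a short legitimate repeat is not flagged.
    (Hypothetical helper; the script uses len(gen_ids) > 5, i.e. >= 6.)
    """
    return len(token_ids) >= min_len and len(set(token_ids)) == 1
```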
FireEcho Engine/debug_eval_flow.py
ADDED
@@ -0,0 +1,186 @@
#!/usr/bin/env python3
"""Replicate the exact training eval flow to verify acceptance rate.

Matches train_eagle_head.py: enable_eagle (no ckpt), load_checkpoint, evaluate.
"""
import sys, os, time, torch
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from hebbian_finetune_demo import load_engine

MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
EAGLE_CKPT = os.path.join(os.path.dirname(__file__), "eagle_checkpoints", "eagle_best.pt")

EVAL_PROMPTS = [
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWrite a Python function to check if a number is prime.<|im_end|>\n<|im_start|>assistant\n",
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nExplain what a neural network is in simple terms.<|im_end|>\n<|im_start|>assistant\n",
    "<|im_start|>system\nYou are a helpful coding assistant.<|im_end|>\n<|im_start|>user\nWrite a binary search function in Python.<|im_end|>\n<|im_start|>assistant\n",
]


@torch.no_grad()
def evaluate_verbose(engine, tokenizer, max_new=60):
    """Run speculative_generate and print acceptance + output for each prompt."""
    engine.eval()
    eos_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
    stop_tokens = [eos_id] if eos_id is not None else [151645]

    for pi, prompt in enumerate(EVAL_PROMPTS):
        ids = tokenizer.encode(prompt, return_tensors="pt").to("cuda")
        engine.reset_cache()

        t0 = time.perf_counter()
        out = engine.speculative_generate(
            ids, max_new_tokens=max_new, temperature=0.0,
            stop_tokens=stop_tokens)
        torch.cuda.synchronize()
        t1 = time.perf_counter()

        gen_len = out.shape[1] - ids.shape[1]
        text = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
        tps = gen_len / max(t1 - t0, 1e-6)
        print(f"\n  Prompt {pi}: {gen_len} tokens, {tps:.1f} tok/s")
        print(f"  Output: {text[:150]}")

        # Check for all-same-token output (sign of NaN)
        gen_ids = out[0, ids.shape[1]:].tolist()
        if len(set(gen_ids)) == 1 and len(gen_ids) > 5:
            print(f"  WARNING: All tokens are the same ({gen_ids[0]}) — likely NaN bug!")


@torch.no_grad()
def test_manual_speculation(engine, tokenizer):
    """Manually run one round of draft+verify and check each step."""
    print("\n--- Manual speculation test ---")
    engine.eval()
    prompt = EVAL_PROMPTS[0]
    ids = tokenizer.encode(prompt, return_tensors="pt").cuda()
    prompt_len = ids.shape[1]

    engine.reset_cache()
    engine._current_seq_id = 0
    if hasattr(engine.kv_cache, '_graph_mode'):
        engine.kv_cache._graph_mode = False

    # Prefill
    logits = engine.forward(ids, use_cache=True, position=0)
    has_nan = logits.isnan().any().item()
    print(f"  Prefill logits: has_nan={has_nan}")
    if has_nan:
        print("  FATAL: NaN in prefill! Cannot continue.")
        return

    # Decode first token
    next_token = logits[:, -1:, :].argmax(dim=-1)
    print(f"  First token: {next_token.item()} = '{tokenizer.decode([next_token.item()])}'")

    # Forward it
    logits = engine.forward(next_token, use_cache=True, position=prompt_len)
    has_nan = logits.isnan().any().item()
    print(f"  Post-first-token logits: has_nan={has_nan}")
    if has_nan:
        print("  FATAL: NaN after first token forward!")
        return

    main_pred = logits[:, -1, :].argmax(dim=-1).item()
    print(f"  Target predicts next: {main_pred} = '{tokenizer.decode([main_pred])}'")

    # Draft 5 tokens
    features = [engine._eagle_hidden_states[l] for l in engine._eagle_capture_layers]
    for li, f in zip(engine._eagle_capture_layers, features):
        print(f"  Feature L{li}: has_nan={f.isnan().any().item()}, "
              f"shape={list(f.shape)}")

    memory_ctx = engine._get_eagle_memory_context(
        engine._eagle_hidden_states[engine._eagle_capture_layers[-1]])

    dt, dl = engine.eagle_head.generate_draft(
        features, next_token, engine.embed, depth=5, memory_context=memory_ctx)

    print("\n  Draft tokens:")
    for i, t in enumerate(dt):
        print(f"    [{i}] {t.item()} = '{tokenizer.decode([t.item()])}'")

    # Verify
    draft_input = torch.cat(dt, dim=1)
    current_pos = prompt_len + 1
    verify_logits = engine.forward(draft_input, use_cache=True, position=current_pos)
    has_nan = verify_logits.isnan().any().item()
    print(f"\n  Verify logits: has_nan={has_nan}")

    accepted = 0
    if dt[0].item() == main_pred:
        accepted = 1
        for i in range(1, len(dt)):
            target_pred = verify_logits[:, i - 1, :].argmax(dim=-1).item()
            match = "MATCH" if dt[i].item() == target_pred else "MISS"
            print(f"    [{i}] draft={dt[i].item()} target={target_pred} → {match}")
            if dt[i].item() == target_pred:
                accepted += 1
            else:
                break
    else:
        print(f"    [0] MISS: draft={dt[0].item()} target={main_pred}")

    print(f"  Accepted: {accepted}/{len(dt)}")


if __name__ == "__main__":
    print("=" * 60)
    print(" Eval Flow Test (replicates training eval)")
    print("=" * 60)

    # === Match training script flow exactly ===
    print("\n[1] Loading model...")
    engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=512, device="cuda")
    engine.eval()
    engine.kv_cache.enable_flat_decode(4096)
    engine.pack_all_experts()

    print("\n[2] Enabling EAGLE (no checkpoint)...")
    engine.enable_eagle(
        capture_layers=(8, 24, 47),
        num_heads=16, ffn_mult=2,
        draft_depth=5, num_head_layers=8)

    print("\n[3] Loading checkpoint separately (like training script)...")
    if os.path.exists(EAGLE_CKPT):
        ckpt = torch.load(EAGLE_CKPT, weights_only=False, map_location='cuda')
        sd = ckpt.get('eagle_head', ckpt)
        is_legacy = any(k.startswith('norm1.') or k.startswith('q_proj.') for k in sd)
        if is_legacy:
            engine.eagle_head.load_legacy_checkpoint(sd)
        else:
            engine.eagle_head.load_state_dict(sd, strict=False)
        print(f"  Loaded checkpoint (step {ckpt.get('step', '?')})")
    else:
        print("  No checkpoint found, using random init")

    # Setup optimizer (like training script)
    eagle_params = [p for n, p in engine.eagle_head.named_parameters()
                    if 'lm_head' not in n and p.requires_grad]
    optimizer = torch.optim.AdamW(eagle_params, lr=3e-4, betas=(0.9, 0.95))

    vram = torch.cuda.memory_allocated() / 1e9
    print(f"  VRAM: {vram:.2f} GB")

    # Test WITHOUT warmup first
    print("\n[4a] Running manual speculation test WITHOUT warmup...")
    test_manual_speculation(engine, tokenizer)

    # Now do warmup
    print("\n[4b] Warmup (3x generate)...")
    warmup_ids = tokenizer.encode("Hello", return_tensors='pt').cuda()
    for _ in range(3):
        engine.generate(warmup_ids, max_new_tokens=5, temperature=0.0, top_k=0, top_p=1.0)
    print("  Warmup done")

    # Test AFTER warmup
    print("\n[4c] Running manual speculation test AFTER warmup...")
    test_manual_speculation(engine, tokenizer)

    print("\n[5] Running full speculative_generate eval...")
    evaluate_verbose(engine, tokenizer)

    print("\n" + "=" * 60)
    print(" Done")
    print("=" * 60)
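The verify loop in test_manual_speculation implements standard greedy speculative-decoding acceptance: draft token i is kept only while it equals the target model's argmax prediction, and the first mismatch ends the round. Factored out as a pure function (hypothetical helper name, plain ints standing in for tensors):

```python
def count_accepted(draft_tokens, first_target_pred, verify_preds):
    """Greedy speculative-decoding acceptance, mirroring the verify loop
    in test_manual_speculation: draft token i is accepted iff it equals
    the target's argmax; the first mismatch stops the round.

    verify_preds[i-1] is the target's prediction for position i, as read
    from verify_logits[:, i-1, :] in the script.
    """
    if not draft_tokens or draft_tokens[0] != first_target_pred:
        return 0
    accepted = 1
    for i in range(1, len(draft_tokens)):
        if draft_tokens[i] != verify_preds[i - 1]:
            break
        accepted += 1
    return accepted
```

With this shape, a 100% acceptance rate paired with garbage text (as in the log) points at both models producing the same degenerate argmax, not at a good draft head.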
FireEcho Engine/debug_nan_isolate.log
ADDED
@@ -0,0 +1,57 @@
============================================================
 NaN Isolation Test
============================================================

[1/6] Loading model...
[Auto-detect] Qwen3-Omni MoE thinker (30.5B total, ~3.3B active)
[FireEcho] Loading /run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct...
[FireEcho] AutoConfig failed ('Qwen3OmniMoeTalkerCodePredictorConfig' object has no attribute 'use_sliding_window'), loading config.json directly
Qwen3-Omni: will stream-load from 15 shards
[Qwen3 Streaming] Loaded shard index: 28010 keys across 15 shards
[Qwen3 Streaming] Building engine skeleton...
[Qwen3 Streaming] Global params on GPU: 1.2 GB
  Layer 4/48: 393 weights, VRAM 2.8 GB, CPU 1.4 GB
  Layer 8/48: 393 weights, VRAM 4.3 GB, CPU 1.6 GB
  Layer 12/48: 393 weights, VRAM 5.8 GB, CPU 1.7 GB
  Layer 16/48: 393 weights, VRAM 7.4 GB, CPU 1.9 GB
  Layer 20/48: 393 weights, VRAM 8.9 GB, CPU 2.0 GB
  Layer 24/48: 393 weights, VRAM 10.4 GB, CPU 2.2 GB
  Layer 28/48: 393 weights, VRAM 11.9 GB, CPU 2.3 GB
  Layer 32/48: 393 weights, VRAM 13.5 GB, CPU 2.5 GB
  Layer 36/48: 393 weights, VRAM 15.0 GB, CPU 2.6 GB
  Layer 40/48: 393 weights, VRAM 16.5 GB, CPU 2.8 GB
  Layer 44/48: 393 weights, VRAM 18.0 GB, CPU 2.9 GB
  Layer 48/48: 393 weights, VRAM 19.6 GB, CPU 3.1 GB
[Qwen3 Streaming] Final VRAM: 19.6 GB (FP4 quantized)
[Qwen3 Streaming] Done: 1571.8M params, 18867 weights loaded
Total params: 1.57B
Frozen params: 1.54B (base model, FP4)
Trainable params: 30.2M (Hebbian only)
[Packed MoE] 48 layers packed (6144 experts → contiguous)
[Flat KV] Enabled: 4096 tokens, 403 MB
  VRAM after load: 19.95 GB

[2/6] Warmup...

[3/6] Test BEFORE enable_eagle()...
  [before eagle] OK — top token=13048 ('Hi'), max=26.88

[4/6] Test: just set _eagle_enabled=True (no head creation)...
  [flag only] OK — top token=13048 ('Hi'), max=26.88

[5/6] Test: create eagle head + assign as submodule...
  VRAM after eagle head: 20.17 GB (+0.22 GB)
  [with head (no ckpt)] OK — top token=13048 ('Hi'), max=26.88

[6/6] Test: load checkpoint into eagle head...
[EAGLE] Loaded legacy D=2 checkpoint. 0 new layer params initialized randomly.
  [with ckpt] OK — top token=13048 ('Hi'), max=26.88

============================================================
 RESULTS
============================================================
  Before eagle:        OK
  Flag only:           OK
  With head (no ckpt): OK
  With checkpoint:     OK
  All tests passed — no NaN detected!
FireEcho Engine/debug_nan_isolate.py
ADDED
@@ -0,0 +1,174 @@
#!/usr/bin/env python3
"""Isolate exactly which step of enable_eagle() causes NaN in target model.

Tests each sub-step of enable_eagle() independently to find the culprit.
Also checks per-layer output to find where NaN first appears.
"""
import sys, os, torch, gc
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from hebbian_finetune_demo import load_engine
from fireecho_kernel import FireEchoEagleHead

MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
EAGLE_CKPT = os.path.join(os.path.dirname(__file__), "eagle_checkpoints", "eagle_best.pt")

PROMPT = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n"


@torch.no_grad()
def check_forward(engine, tokenizer, label):
    """Run a forward pass and report NaN status."""
    torch.cuda.synchronize()
    ids = tokenizer.encode(PROMPT, return_tensors='pt').cuda()
    engine.reset_cache()
    engine._current_seq_id = 0
    if hasattr(engine.kv_cache, '_graph_mode'):
        engine.kv_cache._graph_mode = False

    logits = engine.forward(ids, use_cache=True, position=0)
    torch.cuda.synchronize()

    has_nan = logits.isnan().any().item()
    last = logits[:, -1, :]
    if has_nan:
        print(f"  [{label}] NaN DETECTED — logits all NaN")
    else:
        top_id = last.argmax(dim=-1).item()
        top_val = last.max().item()
        print(f"  [{label}] OK — top token={top_id} "
              f"('{tokenizer.decode([top_id])}'), max={top_val:.2f}")
    return has_nan


@torch.no_grad()
def check_per_layer(engine, tokenizer, label):
    """Run forward pass manually layer-by-layer, check NaN at each layer."""
    ids = tokenizer.encode(PROMPT, return_tensors='pt').cuda()
    engine.reset_cache()
    engine._current_seq_id = 0
    if hasattr(engine.kv_cache, '_graph_mode'):
        engine.kv_cache._graph_mode = False

    x = engine.embed(ids)
    has_nan = x.isnan().any().item()
    print(f"  [{label}] After embed: has_nan={has_nan}")
    if has_nan:
        return

    first_nan_layer = None
    for i, layer in enumerate(engine.layers):
        x = layer(x, engine.kv_cache, engine._current_seq_id, 0, True)
        has_nan = x.isnan().any().item()
        if has_nan and first_nan_layer is None:
            first_nan_layer = i
            print(f"  [{label}] FIRST NaN at layer {i} !!!")
            # Check sub-components
            break

    if first_nan_layer is None:
        # Check norm + lm_head
        x = engine.norm(x)
        has_nan = x.isnan().any().item()
        print(f"  [{label}] After norm: has_nan={has_nan}")
        logits = engine.lm_head(x)
        has_nan = logits.isnan().any().item()
        print(f"  [{label}] After lm_head: has_nan={has_nan}")
        if not has_nan:
            top_id = logits[:, -1, :].argmax(dim=-1).item()
            print(f"  [{label}] Top token: {top_id} ('{tokenizer.decode([top_id])}')")
    else:
        print(f"  [{label}] NaN starts at layer {first_nan_layer}")


if __name__ == "__main__":
    print("=" * 60)
    print(" NaN Isolation Test")
    print("=" * 60)

    print("\n[1/6] Loading model...")
    engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=4096, device="cuda")
    engine.pack_all_experts()
    engine.kv_cache.enable_flat_decode()
    engine.eval()

    # Check VRAM
    vram = torch.cuda.memory_allocated() / 1e9
    print(f"  VRAM after load: {vram:.2f} GB")

    print("\n[2/6] Warmup...")
    warmup_ids = tokenizer.encode("Hello", return_tensors='pt').cuda()
    for _ in range(3):
        engine.generate(warmup_ids, max_new_tokens=5, temperature=0.0, top_k=0, top_p=1.0)

    print("\n[3/6] Test BEFORE enable_eagle()...")
    nan_before = check_forward(engine, tokenizer, "before eagle")

    if nan_before:
        print("\n  ERROR: NaN even before enable_eagle! Something wrong with model load.")
        sys.exit(1)

    print("\n[4/6] Test: just set _eagle_enabled=True (no head creation)...")
    engine._eagle_enabled = True
    engine._eagle_capture_set = {8, 24, 47}
    engine._eagle_capture_layers = [8, 24, 47]
    engine._eagle_hidden_states = {}
    nan_flag_only = check_forward(engine, tokenizer, "flag only")
    engine._eagle_enabled = False  # reset

    print("\n[5/6] Test: create eagle head + assign as submodule...")
    eagle_head = FireEchoEagleHead(
        dim=config.dim, num_capture_layers=3,
        num_heads=16, ffn_mult=2, num_layers=2,
    ).to(dtype=torch.bfloat16, device='cuda')
    eagle_head.lm_head = engine.lm_head
    engine.eagle_head = eagle_head  # registers as nn.Module submodule
    vram2 = torch.cuda.memory_allocated() / 1e9
    print(f"  VRAM after eagle head: {vram2:.2f} GB (+{vram2 - vram:.2f} GB)")
    nan_with_head = check_forward(engine, tokenizer, "with head (no ckpt)")

    print("\n[6/6] Test: load checkpoint into eagle head...")
    if os.path.exists(EAGLE_CKPT):
        ckpt = torch.load(EAGLE_CKPT, map_location='cuda', weights_only=True)
        sd = ckpt.get('eagle_head', ckpt)
        is_legacy = any(k.startswith('norm1.') or k.startswith('q_proj.') for k in sd)
        if is_legacy:
            eagle_head.load_legacy_checkpoint(sd)
        else:
            eagle_head.load_state_dict(sd, strict=False)
        nan_with_ckpt = check_forward(engine, tokenizer, "with ckpt")
    else:
        print(f"  No checkpoint at {EAGLE_CKPT}, skipping")
        nan_with_ckpt = nan_with_head

    # Summary
    print(f"\n{'=' * 60}")
    print(" RESULTS")
    print(f"{'=' * 60}")
    print(f"  Before eagle:        {'NaN' if nan_before else 'OK'}")
    print(f"  Flag only:           {'NaN' if nan_flag_only else 'OK'}")
    print(f"  With head (no ckpt): {'NaN' if nan_with_head else 'OK'}")
    print(f"  With checkpoint:     {'NaN' if nan_with_ckpt else 'OK'}")

    # If any NaN found, do per-layer analysis
    if nan_flag_only or nan_with_head or nan_with_ckpt:
        print("\n--- Per-layer NaN analysis ---")
        if nan_flag_only:
            engine._eagle_enabled = True
            engine._eagle_capture_set = {8, 24, 47}
            engine._eagle_capture_layers = [8, 24, 47]
            engine._eagle_hidden_states = {}
            check_per_layer(engine, tokenizer, "flag-only per-layer")
        elif nan_with_head or nan_with_ckpt:
            # eagle_head is still assigned
            engine._eagle_enabled = True
            engine._eagle_capture_set = {8, 24, 47}
            engine._eagle_capture_layers = [8, 24, 47]
            engine._eagle_hidden_states = {}
            check_per_layer(engine, tokenizer, "full-eagle per-layer")

        # Also test: head assigned but flag OFF
        print("\n--- Test: head assigned but _eagle_enabled=False ---")
        engine._eagle_enabled = False
        check_forward(engine, tokenizer, "head assigned, flag OFF")
    else:
        print("  All tests passed — no NaN detected!")
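check_per_layer localizes the first NaN by threading activations through the stack one module at a time (embed, then each layer, then norm, then lm_head). Stripped of the engine and KV-cache plumbing, the bisection idea reduces to a small pure-Python helper (hypothetical name; flat lists of floats stand in for activation tensors):

```python
import math

def first_nan_stage(x, stages):
    """Return the index of the first stage whose output contains NaN,
    or None if every stage is clean. Mirrors the scan in check_per_layer:
    once a stage emits NaN, every later stage would too, so scanning in
    order finds the true origin.
    """
    for i, stage in enumerate(stages):
        x = stage(x)
        if any(math.isnan(v) for v in x):
            return i
    return None
```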
FireEcho Engine/debug_promptlen.py
ADDED
@@ -0,0 +1,110 @@
#!/usr/bin/env python3
"""Test: does prompt length cause NaN? Test with/without eagle."""
import sys, os, torch
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from hebbian_finetune_demo import load_engine

MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
EAGLE_CKPT = os.path.join(os.path.dirname(__file__), "eagle_checkpoints", "eagle_best.pt")

SHORT = "Hello"
MEDIUM = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n"
LONG = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWrite a Python function to check if a number is prime.<|im_end|>\n<|im_start|>assistant\n"


@torch.no_grad()
def test_forward(engine, tokenizer, label, prompt):
    ids = tokenizer.encode(prompt, return_tensors='pt').cuda()
    engine.reset_cache()
    engine._current_seq_id = 0
    if hasattr(engine.kv_cache, '_graph_mode'):
        engine.kv_cache._graph_mode = False
    logits = engine.forward(ids, use_cache=True, position=0)
    torch.cuda.synchronize()
    has_nan = logits.isnan().any().item()
    if has_nan:
        # Count NaN positions
        nan_count = sum(1 for s in range(logits.shape[1]) if logits[0, s].isnan().any())
        print(f"  [{label}] NaN! ({nan_count}/{logits.shape[1]} positions) len={ids.shape[1]}")
    else:
        top = logits[:, -1, :].argmax(dim=-1).item()
        print(f"  [{label}] OK top={top} ('{tokenizer.decode([top])}') len={ids.shape[1]}")
    return has_nan


if __name__ == "__main__":
    print("=" * 60)
    print(" Prompt Length NaN Test")
    print("=" * 60)

    print("\n[SETUP] Loading engine...")
    engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=4096, device="cuda")
    engine.eval()
    engine.kv_cache.enable_flat_decode(4096)
    engine.pack_all_experts()

    # Test WITHOUT eagle
    print("\n[Phase 1] No eagle — varying prompt lengths...")
    test_forward(engine, tokenizer, "short (no eagle)", SHORT)
    test_forward(engine, tokenizer, "medium (no eagle)", MEDIUM)
    test_forward(engine, tokenizer, "long (no eagle)", LONG)

    # Warmup
    print("\n[Warmup]...")
    wids = tokenizer.encode("Hello", return_tensors='pt').cuda()
    for _ in range(3):
        engine.generate(wids, max_new_tokens=5, temperature=0.0, top_k=0, top_p=1.0)
    del wids

    # Test again after warmup
    print("\n[Phase 2] No eagle, after warmup...")
    test_forward(engine, tokenizer, "short (warmed)", SHORT)
    test_forward(engine, tokenizer, "medium (warmed)", MEDIUM)
    test_forward(engine, tokenizer, "long (warmed)", LONG)

    # Enable eagle WITH checkpoint
    print("\n[Phase 3] Enable eagle D=8 with checkpoint...")
    engine.enable_eagle(
        capture_layers=(8, 24, 47), num_heads=16, ffn_mult=2,
        draft_depth=5, num_head_layers=8, checkpoint_path=EAGLE_CKPT)

    test_forward(engine, tokenizer, "short (eagle+ckpt)", SHORT)
    test_forward(engine, tokenizer, "medium (eagle+ckpt)", MEDIUM)
    test_forward(engine, tokenizer, "long (eagle+ckpt)", LONG)

    # Warmup again after eagle
    print("\n[Warmup after eagle]...")
    wids = tokenizer.encode("Hello", return_tensors='pt').cuda()
    for _ in range(3):
        engine.generate(wids, max_new_tokens=5, temperature=0.0, top_k=0, top_p=1.0)
    del wids

    print("\n[Phase 4] Eagle + ckpt, after warmup...")
    test_forward(engine, tokenizer, "short (eagle warmed)", SHORT)
    test_forward(engine, tokenizer, "medium (eagle warmed)", MEDIUM)
    test_forward(engine, tokenizer, "long (eagle warmed)", LONG)

    # Test: enable_eagle WITHOUT checkpoint
    print("\n[Phase 5] Fresh engine, eagle D=8 NO checkpoint...")
    del engine
    torch.cuda.empty_cache()
    engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=4096, device="cuda")
    engine.eval()
    engine.kv_cache.enable_flat_decode(4096)
    engine.pack_all_experts()
    engine.enable_eagle(
        capture_layers=(8, 24, 47), num_heads=16, ffn_mult=2,
        draft_depth=5, num_head_layers=8)  # NO checkpoint
    # Warmup
    wids = tokenizer.encode("Hello", return_tensors='pt').cuda()
    for _ in range(3):
        engine.generate(wids, max_new_tokens=5, temperature=0.0, top_k=0, top_p=1.0)
    del wids

    test_forward(engine, tokenizer, "short (no ckpt)", SHORT)
    test_forward(engine, tokenizer, "medium (no ckpt)", MEDIUM)
    test_forward(engine, tokenizer, "long (no ckpt)", LONG)

    print("\n" + "=" * 60)
    print(" DONE")
    print("=" * 60)
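When test_forward does find NaN, it reports how many sequence positions are affected, which distinguishes a full-prefill blowup from NaN confined to late positions (a hint that prompt length, not the weights, is the trigger). The per-position count can be sketched standalone (hypothetical helper; a list of per-position rows stands in for the logits tensor):

```python
import math

def nan_positions(logit_rows):
    """Count sequence positions whose logit row contains any NaN,
    i.e. the '(k/N positions)' figure test_forward prints."""
    return sum(1 for row in logit_rows if any(math.isnan(v) for v in row))
```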
FireEcho Engine/debug_seqlen.py
ADDED
@@ -0,0 +1,65 @@
#!/usr/bin/env python3
"""Test: does max_seq_len=512 vs 4096 cause NaN?"""
import sys, os, torch
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from hebbian_finetune_demo import load_engine

MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
EAGLE_CKPT = os.path.join(os.path.dirname(__file__), "eagle_checkpoints", "eagle_best.pt")
PROMPT = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n"


@torch.no_grad()
def check(engine, tokenizer, label):
    ids = tokenizer.encode(PROMPT, return_tensors='pt').cuda()
    engine.reset_cache()
    engine._current_seq_id = 0
    if hasattr(engine.kv_cache, '_graph_mode'):
        engine.kv_cache._graph_mode = False
    logits = engine.forward(ids, use_cache=True, position=0)
    torch.cuda.synchronize()
    has_nan = logits.isnan().any().item()
    if has_nan:
        print(f"  [{label}] NaN DETECTED")
    else:
        top = logits[:, -1, :].argmax(dim=-1).item()
        print(f"  [{label}] OK — top={top} ('{tokenizer.decode([top])}')")
    return has_nan


if __name__ == "__main__":
    print("=" * 60)
    print(" max_seq_len test")
    print("=" * 60)

    # Replicate EXACT training script flow: max_seq_len=512
    print("\n[1] load_engine(max_seq_len=512)...")
    engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=512, device="cuda")
    engine.eval()
    engine.kv_cache.enable_flat_decode(4096)
    engine.pack_all_experts()

    vram = torch.cuda.memory_allocated() / 1e9
    print(f"  VRAM: {vram:.2f} GB")

    # Warmup
    print("\n[2] Warmup...")
    wids = tokenizer.encode("Hello", return_tensors='pt').cuda()
    for _ in range(3):
        engine.generate(wids, max_new_tokens=5, temperature=0.0, top_k=0, top_p=1.0)

    # Test WITHOUT eagle (should work)
    print("\n[3] Forward without eagle (max_seq_len=512)...")
    check(engine, tokenizer, "no eagle, seq=512")

    # Test WITH D=8 eagle
    print("\n[4] Enable D=8 eagle + checkpoint...")
    engine.enable_eagle(capture_layers=(8, 24, 47), num_heads=16, ffn_mult=2,
                        num_head_layers=8, checkpoint_path=EAGLE_CKPT)
    vram = torch.cuda.memory_allocated() / 1e9
    print(f"  VRAM: {vram:.2f} GB")
    nan_512 = check(engine, tokenizer, "D=8, seq=512")

    print(f"\n{'='*60}")
    print(f" max_seq_len=512 + D=8: {'NaN' if nan_512 else 'OK'}")
    print(f"{'='*60}")
FireEcho Engine/debug_seqlen_threshold.py
ADDED
|
@@ -0,0 +1,61 @@
#!/usr/bin/env python3
"""Find exact sequence length threshold for NaN. Test with/without pack_all_experts."""
import sys, os, torch
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from hebbian_finetune_demo import load_engine

MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"


@torch.no_grad()
def test_len(engine, tokenizer, n_tokens, label=""):
    """Generate a prompt of approximately n tokens and test forward."""
    # Use repeating text to control length
    base = "word " * max(n_tokens, 1)
    ids = tokenizer.encode(base, return_tensors='pt').cuda()
    # Truncate to exact length
    ids = ids[:, :n_tokens]
    engine.reset_cache()
    engine._current_seq_id = 0
    if hasattr(engine.kv_cache, '_graph_mode'):
        engine.kv_cache._graph_mode = False
    logits = engine.forward(ids, use_cache=True, position=0)
    torch.cuda.synchronize()
    has_nan = logits.isnan().any().item()
    status = "NaN" if has_nan else "OK"
    print(f" len={n_tokens:4d} {label}: {status}")
    return has_nan


if __name__ == "__main__":
    print("=" * 60)
    print(" Sequence Length NaN Threshold Finder")
    print("=" * 60)

    print("\n[1] Loading engine (WITH pack)...")
    engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=4096, device="cuda")
    engine.eval()
    engine.kv_cache.enable_flat_decode(4096)
    engine.pack_all_experts()

    # Coarse sweep over candidate lengths to bracket the threshold
    print("\n[2] Testing WITH pack_all_experts (coarse)...")
    for n in [1, 5, 10, 15, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 40, 50, 64, 100]:
        test_len(engine, tokenizer, n, "(packed)")

    # Now test WITHOUT pack
    print("\n[3] Reloading engine WITHOUT pack_all_experts...")
    del engine
    torch.cuda.empty_cache()
    engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=4096, device="cuda")
    engine.eval()
    engine.kv_cache.enable_flat_decode(4096)
    # NO pack_all_experts!

    print("\n[4] Testing WITHOUT pack_all_experts...")
    for n in [1, 10, 20, 25, 30, 31, 32, 40, 50, 64, 100]:
        test_len(engine, tokenizer, n, "(unpacked)")

    print("\n" + "=" * 60)
    print(" DONE")
    print("=" * 60)
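The sweep above probes a fixed list of lengths. If NaN onset is monotone in sequence length (never failing below the threshold and always failing at or above it), the same `test_len` probe could instead drive a true binary search. The helper below is a hypothetical sketch under that monotonicity assumption; `find_nan_threshold` and the fake predicate are illustrations, not part of the engine:

```python
def find_nan_threshold(is_bad, lo=1, hi=4096):
    """Binary-search the smallest n in [lo, hi] with is_bad(n) True.

    Assumes is_bad is monotone: False below the threshold, True at and
    above it. Returns hi + 1 if no length in the range fails.
    """
    result = hi + 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if is_bad(mid):       # failure: threshold is at mid or below
            result = mid
            hi = mid - 1
        else:                 # success: threshold is strictly above mid
            lo = mid + 1
    return result


# Fake predicate standing in for lambda n: test_len(engine, tokenizer, n):
# pretend lengths >= 31 produce NaN.
print(find_nan_threshold(lambda n: n >= 31))  # -> 31
```

With the real probe this needs ~12 forwards to pin the threshold exactly, versus one forward per entry in the hand-picked list.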
FireEcho Engine/debug_specgen_trace.py
ADDED
|
@@ -0,0 +1,171 @@
#!/usr/bin/env python3
"""Trace speculative_generate step by step to find exactly where NaN appears."""
import sys, os, torch
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from hebbian_finetune_demo import load_engine

MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
EAGLE_CKPT = os.path.join(os.path.dirname(__file__), "eagle_checkpoints", "eagle_best.pt")
PROMPT = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWrite a function to check primes.<|im_end|>\n<|im_start|>assistant\n"


def check_nan(label, tensor):
    has_nan = tensor.isnan().any().item()
    has_inf = tensor.isinf().any().item()
    if has_nan or has_inf:
        print(f" *** {label}: NaN={has_nan} Inf={has_inf} shape={list(tensor.shape)}")
        # Check which positions have NaN
        if tensor.dim() == 3:  # [B, S, V]
            for s in range(tensor.shape[1]):
                if tensor[0, s].isnan().any():
                    print(f" Position {s}: NaN!")
        return True
    else:
        top = tensor[:, -1, :].argmax(dim=-1).item()
        print(f" {label}: OK (top={top}) shape={list(tensor.shape)}")
        return False


@torch.no_grad()
def main():
    print("=" * 60)
    print(" Speculative Generate NaN Trace")
    print("=" * 60)

    # Load engine exactly like training
    print("\n[SETUP] Loading engine...")
    engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=4096, device="cuda")
    engine.eval()
    engine.kv_cache.enable_flat_decode(4096)
    engine.pack_all_experts()

    # Enable EAGLE D=8
    engine.enable_eagle(
        capture_layers=(8, 24, 47), num_heads=16, ffn_mult=2,
        draft_depth=5, num_head_layers=8, checkpoint_path=EAGLE_CKPT)

    # Warmup
    print("\n[SETUP] Warmup...")
    wids = tokenizer.encode("Hello", return_tensors='pt').cuda()
    for _ in range(3):
        engine.generate(wids, max_new_tokens=5, temperature=0.0, top_k=0, top_p=1.0)
    del wids

    # Now replicate speculative_generate manually
    print("\n[TRACE] Starting manual speculation trace...")
    ids = tokenizer.encode(PROMPT, return_tensors='pt').cuda()
    prompt_len = ids.shape[1]
    print(f" Prompt length: {prompt_len}")

    # Step 1: Reset + prefill
    engine.reset_cache()
    engine._current_seq_id = 0
    if hasattr(engine.kv_cache, '_graph_mode'):
        engine.kv_cache._graph_mode = False

    print("\n[1] Prefill...")
    logits = engine.forward(ids, use_cache=True, position=0)
    torch.cuda.synchronize()
    nan1 = check_nan("Prefill logits", logits)
    if nan1:
        print(" FATAL: NaN in prefill!")
        return

    current_pos = prompt_len
    first_token = logits[:, -1:, :].argmax(dim=-1)
    print(f" First token: {first_token.item()} ('{tokenizer.decode([first_token.item()])}')")

    # Step 2: Process first token through main model
    print("\n[2] Process first token through main model...")
    if hasattr(engine.kv_cache, '_graph_mode'):
        engine.kv_cache._graph_mode = False
    logits = engine.forward(first_token, use_cache=True, position=current_pos)
    torch.cuda.synchronize()
    nan2 = check_nan("First-token logits", logits)
    if nan2:
        print(" FATAL: NaN at first token forward!")
        return
    current_pos += 1
    main_pred = logits[:, -1, :].argmax(dim=-1).item()
    print(f" main_pred: {main_pred} ('{tokenizer.decode([main_pred])}')")

    # Step 3: Draft K tokens using EAGLE
    print("\n[3] Draft K=5 tokens...")
    features = [engine._eagle_hidden_states[l] for l in engine._eagle_capture_layers]
    for idx, f in enumerate(features):
        has_nan = f.isnan().any().item()
        print(f" Feature {idx} (layer {engine._eagle_capture_layers[idx]}): "
              f"shape={list(f.shape)}, NaN={has_nan}")

    memory_ctx = engine._get_eagle_memory_context(
        engine._eagle_hidden_states[engine._eagle_capture_layers[-1]])

    draft_tokens, draft_logits = engine.eagle_head.generate_draft(
        features, first_token, engine.embed, depth=5, memory_context=memory_ctx)

    print(f" Draft tokens: {[t.item() for t in draft_tokens]}")
    print(f" Draft decoded: {[tokenizer.decode([t.item()]) for t in draft_tokens]}")
    for i, dl in enumerate(draft_logits):
        has_nan = dl.isnan().any().item()
        if has_nan:
            print(f" *** Draft logits[{i}]: NaN!")

    # Step 4: Verify draft tokens through main model (this is the suspicious step)
    print("\n[4] Verify K=5 draft tokens through main model...")
    print(f" Verifying at position={current_pos} (prompt_len={prompt_len})")
    draft_input = torch.cat(draft_tokens, dim=1)
    print(f" draft_input shape: {list(draft_input.shape)}, tokens: {draft_input[0].tolist()}")

    verify_logits = engine.forward(draft_input, use_cache=True, position=current_pos)
    torch.cuda.synchronize()
    nan4 = check_nan("Verify logits", verify_logits)

    if nan4:
        print("\n FOUND THE BUG: Verify forward (K>1 tokens at position>0) produces NaN!")
        print(" This is likely a causal mask or KV cache issue in multi-token decode.")

        # Additional test: verify ONE draft token at a time
        print("\n[4b] Trying verify ONE token at a time...")
        # Rollback the K tokens we just stored
        engine.kv_cache.rollback_to(current_pos, 5)

        for i, dt in enumerate(draft_tokens):
            one_logit = engine.forward(dt, use_cache=True, position=current_pos + i)
            torch.cuda.synchronize()
            has_nan = one_logit.isnan().any().item()
            top = one_logit[:, -1, :].argmax(dim=-1).item() if not has_nan else -1
            print(f" Token {i} at pos {current_pos + i}: NaN={has_nan} top={top}")
            if has_nan:
                print(f" SINGLE token verify also fails at position {current_pos + i}!")
                break
    else:
        print("\n Verify logits OK — checking acceptance logic...")
        if draft_tokens[0].item() == main_pred:
            print(f" Draft[0] matches main_pred ({main_pred}) ✓")
        else:
            print(f" Draft[0]={draft_tokens[0].item()} ≠ main_pred={main_pred} ✗")

        for i in range(1, len(draft_tokens)):
            target_pred = verify_logits[:, i-1, :].argmax(dim=-1).item()
            match = "✓" if draft_tokens[i].item() == target_pred else "✗"
            print(f" verify[{i-1}]={target_pred} vs draft[{i}]={draft_tokens[i].item()} {match}")

    # Step 5: Also test a multi-token forward with KNOWN-GOOD tokens at position>0
    print("\n[5] Control test: multi-token forward with KNOWN-GOOD tokens...")
    engine.reset_cache()
    engine._current_seq_id = 0
    # Prefill
    logits = engine.forward(ids, use_cache=True, position=0)
    # Now try 5 copies of a valid token at position=prompt_len
    test_tokens = torch.full((1, 5), first_token.item(), dtype=torch.long, device='cuda')
    test_logits = engine.forward(test_tokens, use_cache=True, position=prompt_len)
    torch.cuda.synchronize()
    nan5 = check_nan("Control multi-token logits", test_logits)

    print("\n" + "=" * 60)
    print(" TRACE COMPLETE")
    print("=" * 60)


if __name__ == "__main__":
    main()
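The acceptance checks printed in step [4] of the trace above (draft[0] against `main_pred`, then each draft[i] against the verifier's argmax at position i-1) amount to a greedy speculative-decoding acceptance rule. A minimal standalone sketch of that rule, with a hypothetical helper name not present in the engine:

```python
def count_accepted(draft_ids, verify_preds, first_pred):
    """Greedy speculative acceptance on plain token id lists.

    draft_ids[0] must match the main model's prediction for the current
    position (first_pred); draft_ids[i] must match the verifier's argmax
    at position i-1. Returns the length of the accepted prefix.
    """
    if not draft_ids or draft_ids[0] != first_pred:
        return 0
    accepted = 1
    for i in range(1, len(draft_ids)):
        if draft_ids[i] != verify_preds[i - 1]:
            break  # first mismatch ends the accepted run
        accepted += 1
    return accepted


# Draft of 4 tokens; verifier agrees on the first two continuations,
# then diverges, so 3 tokens are accepted in total.
print(count_accepted([5, 7, 9, 2], [7, 9, 4], first_pred=5))  # -> 3
```

Under greedy (temperature 0) decoding this rule is exact: every accepted token is the one the main model would have emitted anyway, so speculation changes speed but not output.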
FireEcho Engine/dsmem_ops.py
ADDED
|
@@ -0,0 +1,789 @@
| 1 |
+
"""
|
| 2 |
+
FireEcho DSMEM — Distributed Shared Memory Operations
|
| 3 |
+
=======================================================
|
| 4 |
+
Part of the FireEcho Engine — Custom inference kernel for NVIDIA Blackwell
|
| 5 |
+
Copyright (c) 2025-2026 Echo (FireEcho Project). All rights reserved.
|
| 6 |
+
|
| 7 |
+
Implements DSMEM and Cluster Barriers using Triton's inline_asm_elementwise
|
| 8 |
+
for PTX injection on SM 9.0+ (Hopper) and SM 12.0+ (Blackwell).
|
| 9 |
+
|
| 10 |
+
Features:
|
| 11 |
+
1. mapa PTX - Map local SMEM to cluster-wide address
|
| 12 |
+
2. mbarrier PTX - Hardware-accelerated cluster barriers
|
| 13 |
+
3. Cooperative cluster primitives
|
| 14 |
+
|
| 15 |
+
Usage:
|
| 16 |
+
from fireecho.dsmem_ops import (
|
| 17 |
+
cluster_matmul_dsmem,
|
| 18 |
+
ClusterConfig,
|
| 19 |
+
)
|
| 20 |
+
|
| 21 |
+
# 2-CTA cooperative matmul with DSMEM
|
| 22 |
+
c = cluster_matmul_dsmem(a, b, cluster_size=2)
|
| 23 |
+
"""
|
| 24 |
+
|
| 25 |
+
import torch
|
| 26 |
+
import triton
|
| 27 |
+
import triton.language as tl
|
| 28 |
+
from typing import Tuple, Optional
|
| 29 |
+
from dataclasses import dataclass
|
| 30 |
+
|
| 31 |
+
|
| 32 |
+
@dataclass
|
| 33 |
+
class ClusterConfig:
|
| 34 |
+
"""Configuration for cluster operations."""
|
| 35 |
+
cluster_x: int = 2 # Cluster size in X (2 for 2-CTA MMA)
|
| 36 |
+
cluster_y: int = 1
|
| 37 |
+
cluster_z: int = 1
|
| 38 |
+
use_dsmem: bool = True # Enable distributed shared memory
|
| 39 |
+
use_mbarrier: bool = True # Use hardware barriers
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
# =============================================================================
|
| 43 |
+
# SM120 DSMEM PTX Primitives
|
| 44 |
+
# =============================================================================
|
| 45 |
+
#
|
| 46 |
+
# Blackwell (SM120) introduces Distributed Shared Memory (DSMEM) allowing
|
| 47 |
+
# thread blocks within a cluster to directly access each other's shared memory.
|
| 48 |
+
#
|
| 49 |
+
# Key PTX instructions:
|
| 50 |
+
# - mapa.shared::cluster - Map local SMEM to cluster-wide address
|
| 51 |
+
# - mbarrier.arrive/wait - Hardware-accelerated barriers
|
| 52 |
+
# - fence.acq_rel.cluster - Cluster-scope memory fence
|
| 53 |
+
# - st.async.shared::cluster - Async store to remote SMEM
|
| 54 |
+
# - ld.shared::cluster - Load from remote SMEM
|
| 55 |
+
#
|
| 56 |
+
# Reference: CUDA 12.8+ PTX ISA, Section 9.7.13 (Cluster Operations)
|
| 57 |
+
# =============================================================================
|
| 58 |
+
|
| 59 |
+
@triton.jit
|
| 60 |
+
def _cluster_rank_x() -> tl.tensor:
|
| 61 |
+
"""Get current block's X rank within cluster (0 to cluster_dim_x-1)."""
|
| 62 |
+
return tl.inline_asm_elementwise(
|
| 63 |
+
asm="""
|
| 64 |
+
{
|
| 65 |
+
.reg .u32 %r;
|
| 66 |
+
mov.u32 %r, %clusterid.x;
|
| 67 |
+
mov.u32 $0, %r;
|
| 68 |
+
}
|
| 69 |
+
""",
|
| 70 |
+
constraints="=r",
|
| 71 |
+
args=[],
|
| 72 |
+
dtype=tl.int32,
|
| 73 |
+
is_pure=True,
|
| 74 |
+
pack=1,
|
| 75 |
+
)
|
| 76 |
+
|
| 77 |
+
|
| 78 |
+
@triton.jit
|
| 79 |
+
def _cluster_rank_y() -> tl.tensor:
|
| 80 |
+
"""Get current block's Y rank within cluster."""
|
| 81 |
+
return tl.inline_asm_elementwise(
|
| 82 |
+
asm="""
|
| 83 |
+
{
|
| 84 |
+
.reg .u32 %r;
|
| 85 |
+
mov.u32 %r, %clusterid.y;
|
| 86 |
+
mov.u32 $0, %r;
|
| 87 |
+
}
|
| 88 |
+
""",
|
| 89 |
+
constraints="=r",
|
| 90 |
+
args=[],
|
| 91 |
+
dtype=tl.int32,
|
| 92 |
+
is_pure=True,
|
| 93 |
+
pack=1,
|
| 94 |
+
)
|
| 95 |
+
|
| 96 |
+
|
| 97 |
+
@triton.jit
|
| 98 |
+
def _cluster_dim_x() -> tl.tensor:
|
| 99 |
+
"""Get cluster dimension in X (number of CTAs in X)."""
|
| 100 |
+
return tl.inline_asm_elementwise(
|
| 101 |
+
asm="""
|
| 102 |
+
{
|
| 103 |
+
.reg .u32 %r;
|
| 104 |
+
mov.u32 %r, %nclusterid.x;
|
| 105 |
+
mov.u32 $0, %r;
|
| 106 |
+
}
|
| 107 |
+
""",
|
| 108 |
+
constraints="=r",
|
| 109 |
+
args=[],
|
| 110 |
+
dtype=tl.int32,
|
| 111 |
+
is_pure=True,
|
| 112 |
+
pack=1,
|
| 113 |
+
)
|
| 114 |
+
|
| 115 |
+
|
| 116 |
+
@triton.jit
|
| 117 |
+
def _cluster_dim_y() -> tl.tensor:
|
| 118 |
+
"""Get cluster dimension in Y."""
|
| 119 |
+
return tl.inline_asm_elementwise(
|
| 120 |
+
asm="""
|
| 121 |
+
{
|
| 122 |
+
.reg .u32 %r;
|
| 123 |
+
mov.u32 %r, %nclusterid.y;
|
| 124 |
+
mov.u32 $0, %r;
|
| 125 |
+
}
|
| 126 |
+
""",
|
| 127 |
+
constraints="=r",
|
| 128 |
+
args=[],
|
| 129 |
+
dtype=tl.int32,
|
| 130 |
+
is_pure=True,
|
| 131 |
+
pack=1,
|
| 132 |
+
)
|
| 133 |
+
|
| 134 |
+
|
| 135 |
+
# Legacy aliases
|
| 136 |
+
@triton.jit
|
| 137 |
+
def _cluster_rank() -> tl.tensor:
|
| 138 |
+
"""Get current block's rank within cluster (X dimension)."""
|
| 139 |
+
return _cluster_rank_x()
|
| 140 |
+
|
| 141 |
+
|
| 142 |
+
@triton.jit
|
| 143 |
+
def _cluster_size() -> tl.tensor:
|
| 144 |
+
"""Get total cluster size (X dimension)."""
|
| 145 |
+
return _cluster_dim_x()
|
| 146 |
+
|
| 147 |
+
|
| 148 |
+
@triton.jit
|
| 149 |
+
def _mapa_shared(local_ptr, target_rank):
|
| 150 |
+
"""
|
| 151 |
+
Map local shared memory pointer to target rank's address space.
|
| 152 |
+
|
| 153 |
+
PTX: mapa.shared::cluster.u64 dst, src, ctaid
|
| 154 |
+
|
| 155 |
+
This maps a local SMEM address to the equivalent address in another
|
| 156 |
+
CTA's shared memory space within the same cluster.
|
| 157 |
+
|
| 158 |
+
Args:
|
| 159 |
+
local_ptr: Pointer to local shared memory
|
| 160 |
+
target_rank: Target CTA rank within cluster
|
| 161 |
+
|
| 162 |
+
Returns:
|
| 163 |
+
Pointer to remote CTA's shared memory
|
| 164 |
+
|
| 165 |
+
Note: Requires SM 9.0+ (Hopper) or SM 12.0+ (Blackwell)
|
| 166 |
+
"""
|
| 167 |
+
return tl.inline_asm_elementwise(
|
| 168 |
+
asm="mapa.shared::cluster.u64 $0, $1, $2;",
|
| 169 |
+
constraints="=l,l,r",
|
| 170 |
+
args=[local_ptr, target_rank],
|
| 171 |
+
dtype=tl.pointer_type(tl.float32),
|
| 172 |
+
is_pure=True,
|
| 173 |
+
pack=1,
|
| 174 |
+
)
|
| 175 |
+
|
| 176 |
+
|
| 177 |
+
@triton.jit
|
| 178 |
+
def _cluster_barrier_init(barrier_ptr, expected_count):
|
| 179 |
+
"""
|
| 180 |
+
Initialize mbarrier for cluster-wide synchronization.
|
| 181 |
+
|
| 182 |
+
PTX: mbarrier.init.shared::cluster.b64 [addr], count
|
| 183 |
+
|
| 184 |
+
Args:
|
| 185 |
+
barrier_ptr: Pointer to barrier in shared memory
|
| 186 |
+
expected_count: Number of arrivals before completion
|
| 187 |
+
"""
|
| 188 |
+
tl.inline_asm_elementwise(
|
| 189 |
+
asm="mbarrier.init.shared::cluster.b64 [$0], $1;",
|
| 190 |
+
constraints="r,r",
|
| 191 |
+
args=[barrier_ptr, expected_count],
|
| 192 |
+
dtype=tl.int32,
|
| 193 |
+
is_pure=False,
|
| 194 |
+
pack=1,
|
| 195 |
+
)
|
| 196 |
+
|
| 197 |
+
|
| 198 |
+
@triton.jit
|
| 199 |
+
def _cluster_barrier_arrive(barrier_ptr):
|
| 200 |
+
"""
|
| 201 |
+
Arrive at cluster barrier, returns phase token.
|
| 202 |
+
|
| 203 |
+
PTX: mbarrier.arrive.shared::cluster.b64 state, [addr]
|
| 204 |
+
|
| 205 |
+
Args:
|
| 206 |
+
barrier_ptr: Pointer to barrier in shared memory
|
| 207 |
+
|
| 208 |
+
Returns:
|
| 209 |
+
Phase token for wait operation
|
| 210 |
+
"""
|
| 211 |
+
return tl.inline_asm_elementwise(
|
| 212 |
+
asm="mbarrier.arrive.shared::cluster.b64 $0, [$1];",
|
| 213 |
+
constraints="=l,r",
|
| 214 |
+
args=[barrier_ptr],
|
| 215 |
+
dtype=tl.uint64,
|
| 216 |
+
is_pure=False,
|
| 217 |
+
pack=1,
|
| 218 |
+
)
|
| 219 |
+
|
| 220 |
+
|
| 221 |
+
@triton.jit
|
| 222 |
+
def _cluster_barrier_arrive_tx(barrier_ptr, tx_count):
|
| 223 |
+
"""
|
| 224 |
+
Arrive at barrier with transaction count (for async copy tracking).
|
| 225 |
+
|
| 226 |
+
PTX: mbarrier.arrive.expect_tx.shared::cluster.b64 state, [addr], tx_count
|
| 227 |
+
|
| 228 |
+
Args:
|
| 229 |
+
barrier_ptr: Pointer to barrier
|
| 230 |
+
tx_count: Number of bytes expected in transaction
|
| 231 |
+
|
| 232 |
+
Returns:
|
| 233 |
+
Phase token
|
| 234 |
+
"""
|
| 235 |
+
return tl.inline_asm_elementwise(
|
| 236 |
+
asm="mbarrier.arrive.expect_tx.shared::cluster.b64 $0, [$1], $2;",
|
| 237 |
+
constraints="=l,r,r",
|
| 238 |
+
args=[barrier_ptr, tx_count],
|
| 239 |
+
dtype=tl.uint64,
|
| 240 |
+
is_pure=False,
|
| 241 |
+
pack=1,
|
| 242 |
+
)
|
| 243 |
+
|
| 244 |
+
|
| 245 |
+
@triton.jit
|
| 246 |
+
def _cluster_barrier_wait(barrier_ptr, phase):
|
| 247 |
+
"""
|
| 248 |
+
Wait on cluster barrier until phase completes.
|
| 249 |
+
|
| 250 |
+
PTX: mbarrier.try_wait.shared::cluster.b64 pred, [addr], phase
|
| 251 |
+
|
| 252 |
+
Uses spin-wait loop for completion.
|
| 253 |
+
"""
|
| 254 |
+
tl.inline_asm_elementwise(
|
| 255 |
+
asm="""
|
| 256 |
+
{
|
| 257 |
+
.reg .pred %p;
|
| 258 |
+
WAIT_LOOP:
|
| 259 |
+
mbarrier.try_wait.shared::cluster.b64 %p, [$0], $1;
|
| 260 |
+
@!%p bra WAIT_LOOP;
|
| 261 |
+
}
|
| 262 |
+
""",
|
| 263 |
+
constraints="r,l",
|
| 264 |
+
args=[barrier_ptr, phase],
|
| 265 |
+
dtype=tl.int32,
|
| 266 |
+
is_pure=False,
|
| 267 |
+
pack=1,
|
| 268 |
+
)
|
| 269 |
+
|
| 270 |
+
|
| 271 |
+
@triton.jit
|
| 272 |
+
def _cluster_barrier_test_wait(barrier_ptr, phase):
|
| 273 |
+
"""
|
| 274 |
+
Non-blocking test if barrier phase completed.
|
| 275 |
+
|
| 276 |
+
Returns 1 if complete, 0 if still pending.
|
| 277 |
+
"""
|
| 278 |
+
return tl.inline_asm_elementwise(
|
| 279 |
+
asm="""
|
| 280 |
+
{
|
| 281 |
+
.reg .pred %p;
|
| 282 |
+
.reg .u32 %r;
|
| 283 |
+
mbarrier.test_wait.shared::cluster.b64 %p, [$1], $2;
|
| 284 |
+
selp.u32 %r, 1, 0, %p;
|
| 285 |
+
mov.u32 $0, %r;
|
| 286 |
+
}
|
| 287 |
+
""",
|
| 288 |
+
constraints="=r,r,l",
|
| 289 |
+
args=[barrier_ptr, phase],
|
| 290 |
+
dtype=tl.int32,
|
| 291 |
+
is_pure=False,
|
| 292 |
+
pack=1,
|
| 293 |
+
)
|
| 294 |
+
|
| 295 |
+
|
| 296 |
+
@triton.jit
|
| 297 |
+
def _fence_cluster():
|
| 298 |
+
"""
|
| 299 |
+
Memory fence at cluster scope.
|
| 300 |
+
|
| 301 |
+
PTX: fence.acq_rel.cluster
|
| 302 |
+
|
| 303 |
+
Ensures all prior memory operations visible to all CTAs in cluster.
|
| 304 |
+
"""
|
| 305 |
+
tl.inline_asm_elementwise(
|
| 306 |
+
asm="fence.acq_rel.cluster;",
|
| 307 |
+
constraints="",
|
| 308 |
+
args=[],
|
| 309 |
+
dtype=tl.int32,
|
| 310 |
+
is_pure=False,
|
| 311 |
+
pack=1,
|
| 312 |
+
)
|
| 313 |
+
|
| 314 |
+
|
| 315 |
+
@triton.jit
|
| 316 |
+
def _fence_cluster_release():
|
| 317 |
+
"""Release fence at cluster scope."""
|
| 318 |
+
tl.inline_asm_elementwise(
|
| 319 |
+
asm="fence.release.cluster;",
|
| 320 |
+
constraints="",
|
| 321 |
+
args=[],
|
| 322 |
+
dtype=tl.int32,
|
| 323 |
+
is_pure=False,
|
| 324 |
+
pack=1,
|
| 325 |
+
)
|
| 326 |
+
|
| 327 |
+
|
| 328 |
+
@triton.jit
|
| 329 |
+
def _fence_cluster_acquire():
|
| 330 |
+
"""Acquire fence at cluster scope."""
|
| 331 |
+
tl.inline_asm_elementwise(
|
| 332 |
+
asm="fence.acquire.cluster;",
|
| 333 |
+
constraints="",
|
| 334 |
+
args=[],
|
| 335 |
+
dtype=tl.int32,
|
| 336 |
+
is_pure=False,
|
| 337 |
+
pack=1,
|
| 338 |
+
)
|
| 339 |
+
|
| 340 |
+
|
| 341 |
+
@triton.jit
|
| 342 |
+
def _cluster_sync():
|
| 343 |
+
"""
|
| 344 |
+
Full cluster synchronization point.
|
| 345 |
+
|
| 346 |
+
    Equivalent to barrier + fence.
    All threads in all CTAs of the cluster must reach this point.
    """
    # Note: bar.cluster requires cooperative launch
    tl.inline_asm_elementwise(
        asm="""
        {
            bar.cluster.arrive;
            bar.cluster.wait;
            fence.acq_rel.cluster;
        }
        """,
        constraints="",
        args=[],
        dtype=tl.int32,
        is_pure=False,
        pack=1,
    )


@triton.jit
def _async_copy_cluster(dst_ptr, src_ptr, size_bytes):
    """
    Asynchronous copy within a cluster using TMA.

    PTX: cp.async.bulk.shared::cluster.global

    Note: This is a simplified version. Full TMA requires descriptor setup.
    """
    tl.inline_asm_elementwise(
        asm="cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [$0], [$1], $2;",
        constraints="l,l,r",
        args=[dst_ptr, src_ptr, size_bytes],
        dtype=tl.int32,
        is_pure=False,
        pack=1,
    )


# =============================================================================
# High-Level DSMEM Utilities
# =============================================================================

def get_sm_version() -> Tuple[int, int]:
    """Get GPU SM version (major, minor)."""
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        return (props.major, props.minor)
    return (0, 0)


def supports_dsmem() -> bool:
    """Check if the current GPU supports DSMEM (SM 9.0+, i.e. Hopper or newer)."""
    major, _minor = get_sm_version()
    return major >= 9


def supports_cluster_2cta() -> bool:
    """Check if the current GPU supports 2-CTA clusters."""
    major, _minor = get_sm_version()
    return major >= 9  # Hopper+ supports clusters


def get_max_cluster_size() -> int:
    """Get the maximum cluster size supported by the GPU."""
    major, _minor = get_sm_version()
    if major >= 12:   # Blackwell
        return 16     # Up to 16 CTAs per cluster
    elif major >= 9:  # Hopper
        return 8      # Up to 8 CTAs per cluster
    return 1          # No cluster support

# =============================================================================
# High-Level Cluster MatMul with DSMEM
# =============================================================================

@triton.autotune(
    configs=[
        triton.Config(
            {'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 64},
            num_stages=3, num_warps=8, num_ctas=2
        ),
        triton.Config(
            {'BLOCK_M': 128, 'BLOCK_N': 256, 'BLOCK_K': 64},
            num_stages=4, num_warps=8, num_ctas=2
        ),
    ],
    key=['M', 'N', 'K'],
)
@triton.jit
def _cluster_matmul_dsmem_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr,
    BLOCK_N: tl.constexpr,
    BLOCK_K: tl.constexpr,
):
    """
    2-CTA Cluster MatMul with Distributed Shared Memory.

    Architecture:
    - CTA 0: Responsible for loading A tiles, shares via DSMEM
    - CTA 1: Responsible for loading B tiles, shares via DSMEM
    - Both: Compute partial products cooperatively

    This kernel demonstrates the pattern; actual DSMEM requires
    explicit shared memory management in Triton.
    """
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    # Get cluster info (when running with num_ctas > 1)
    # For 2-CTA mode, blocks cooperate on adjacent tiles
    num_pid_m = tl.cdiv(M, BLOCK_M)
    num_pid_n = tl.cdiv(N, BLOCK_N)

    # Swizzle for better L2 locality
    # (group indices are computed here but not yet applied to the tile mapping)
    GROUP_SIZE_M = 8
    pid_m_group = pid_m // GROUP_SIZE_M
    pid_m_local = pid_m % GROUP_SIZE_M
    pid_n_group = pid_n // (num_pid_n // GROUP_SIZE_M + 1)

    # Block pointers for TMA-style access
    a_block = tl.make_block_ptr(
        base=a_ptr, shape=(M, K), strides=(stride_am, stride_ak),
        offsets=(pid_m * BLOCK_M, 0), block_shape=(BLOCK_M, BLOCK_K),
        order=(1, 0)
    )
    b_block = tl.make_block_ptr(
        base=b_ptr, shape=(K, N), strides=(stride_bk, stride_bn),
        offsets=(0, pid_n * BLOCK_N), block_shape=(BLOCK_K, BLOCK_N),
        order=(1, 0)
    )

    # Accumulator in FP32 for precision
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

    # Main loop with software pipelining
    for k_iter in range(0, tl.cdiv(K, BLOCK_K)):
        # Load tiles (TMA handles async prefetch)
        a_tile = tl.load(a_block, boundary_check=(0, 1))
        b_tile = tl.load(b_block, boundary_check=(0, 1))

        # Matrix multiply accumulate
        acc += tl.dot(a_tile, b_tile)

        # Advance pointers
        a_block = tl.advance(a_block, (0, BLOCK_K))
        b_block = tl.advance(b_block, (BLOCK_K, 0))

    # Store result
    c_block = tl.make_block_ptr(
        base=c_ptr, shape=(M, N), strides=(stride_cm, stride_cn),
        offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
        block_shape=(BLOCK_M, BLOCK_N), order=(1, 0)
    )
    tl.store(c_block, acc.to(tl.bfloat16), boundary_check=(0, 1))


def cluster_matmul_dsmem(
    a: torch.Tensor,
    b: torch.Tensor,
    config: Optional[ClusterConfig] = None
) -> torch.Tensor:
    """
    High-performance cluster MatMul with DSMEM.

    Uses 2-CTA cooperative mode on Blackwell (SM 12.0) for
    ~116% of cuBLAS performance on medium matrices.

    Args:
        a: Input matrix A [M, K] in BF16
        b: Input matrix B [K, N] in BF16
        config: Cluster configuration (default: 2-CTA)

    Returns:
        Output matrix C [M, N] in BF16
    """
    if config is None:
        config = ClusterConfig()

    M, K = a.shape
    K2, N = b.shape
    assert K == K2, f"K dimension mismatch: {K} vs {K2}"

    # Ensure BF16 for Tensor Core efficiency
    a = a.to(torch.bfloat16).contiguous()
    b = b.to(torch.bfloat16).contiguous()

    c = torch.empty((M, N), device=a.device, dtype=torch.bfloat16)

    grid = lambda META: (
        triton.cdiv(M, META['BLOCK_M']),
        triton.cdiv(N, META['BLOCK_N']),
    )

    _cluster_matmul_dsmem_kernel[grid](
        a, b, c,
        M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
    )

    return c

# =============================================================================
# Cluster Attention with DSMEM (Preview)
# =============================================================================

@triton.jit
def _cluster_attention_kernel(
    q_ptr, k_ptr, v_ptr, o_ptr,
    M, N, D,  # seq_len, kv_len, head_dim
    stride_qm, stride_qd, stride_kn, stride_kd, stride_vn, stride_vd,
    stride_om, stride_od,
    scale,
    BLOCK_M: tl.constexpr,
    BLOCK_N: tl.constexpr,
    BLOCK_D: tl.constexpr,
):
    """
    Flash-Attention with 2-CTA cluster cooperation.

    CTA cooperation strategy:
    - CTA 0: Handles even KV blocks
    - CTA 1: Handles odd KV blocks
    - Both: Merge via DSMEM for softmax normalization
    """
    pid_m = tl.program_id(0)

    # Load Q tile (both CTAs load the same Q)
    q_block = tl.make_block_ptr(
        base=q_ptr, shape=(M, D), strides=(stride_qm, stride_qd),
        offsets=(pid_m * BLOCK_M, 0), block_shape=(BLOCK_M, BLOCK_D),
        order=(1, 0)
    )
    q = tl.load(q_block, boundary_check=(0, 1))

    # Running max and sum for online softmax
    m_i = tl.zeros((BLOCK_M,), dtype=tl.float32) - float('inf')
    l_i = tl.zeros((BLOCK_M,), dtype=tl.float32)
    acc = tl.zeros((BLOCK_M, BLOCK_D), dtype=tl.float32)

    # Iterate over KV blocks
    for kv_block_idx in range(0, tl.cdiv(N, BLOCK_N)):
        k_block = tl.make_block_ptr(
            base=k_ptr, shape=(N, D), strides=(stride_kn, stride_kd),
            offsets=(kv_block_idx * BLOCK_N, 0), block_shape=(BLOCK_N, BLOCK_D),
            order=(1, 0)
        )
        v_block = tl.make_block_ptr(
            base=v_ptr, shape=(N, D), strides=(stride_vn, stride_vd),
            offsets=(kv_block_idx * BLOCK_N, 0), block_shape=(BLOCK_N, BLOCK_D),
            order=(1, 0)
        )

        k = tl.load(k_block, boundary_check=(0, 1))
        v = tl.load(v_block, boundary_check=(0, 1))

        # QK^T
        qk = tl.dot(q, tl.trans(k)) * scale

        # Online softmax
        m_ij = tl.max(qk, axis=1)
        m_new = tl.maximum(m_i, m_ij)
        alpha = tl.exp(m_i - m_new)
        p = tl.exp(qk - m_new[:, None])

        l_i = alpha * l_i + tl.sum(p, axis=1)
        acc = alpha[:, None] * acc + tl.dot(p.to(q.dtype), v)
        m_i = m_new

    # Normalize
    acc = acc / l_i[:, None]

    # Store output
    o_block = tl.make_block_ptr(
        base=o_ptr, shape=(M, D), strides=(stride_om, stride_od),
        offsets=(pid_m * BLOCK_M, 0), block_shape=(BLOCK_M, BLOCK_D),
        order=(1, 0)
    )
    tl.store(o_block, acc.to(tl.bfloat16), boundary_check=(0, 1))


def cluster_attention(
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    scale: Optional[float] = None
) -> torch.Tensor:
    """
    Flash-Attention with cluster cooperation.

    Args:
        q: Query tensor [batch, heads, seq_len, head_dim]
        k: Key tensor [batch, heads, kv_len, head_dim]
        v: Value tensor [batch, heads, kv_len, head_dim]
        scale: Attention scale (default: 1/sqrt(head_dim))

    Returns:
        Output tensor [batch, heads, seq_len, head_dim]
    """
    batch, heads, seq_len, head_dim = q.shape
    _, _, kv_len, _ = k.shape

    if scale is None:
        scale = head_dim ** -0.5

    # Reshape for kernel.
    # NOTE: flattening batch*heads into the row dimension is only correct when
    # batch * heads == 1; the kernel sees K/V as a single [kv_len, head_dim]
    # table, so multi-head inputs would attend across head boundaries.
    q_2d = q.view(batch * heads * seq_len, head_dim).contiguous()
    k_2d = k.view(batch * heads * kv_len, head_dim).contiguous()
    v_2d = v.view(batch * heads * kv_len, head_dim).contiguous()
    o_2d = torch.empty_like(q_2d)

    M = batch * heads * seq_len
    N = kv_len
    D = head_dim

    BLOCK_M = 64
    BLOCK_N = 64
    BLOCK_D = head_dim

    grid = (triton.cdiv(M, BLOCK_M),)

    _cluster_attention_kernel[grid](
        q_2d, k_2d, v_2d, o_2d,
        M, N, D,
        q_2d.stride(0), q_2d.stride(1),
        k_2d.stride(0), k_2d.stride(1),
        v_2d.stride(0), v_2d.stride(1),
        o_2d.stride(0), o_2d.stride(1),
        scale,
        BLOCK_M=BLOCK_M,
        BLOCK_N=BLOCK_N,
        BLOCK_D=BLOCK_D,
        num_ctas=2,  # Enable 2-CTA mode
        num_warps=4,
        num_stages=2,
    )

    return o_2d.view(batch, heads, seq_len, head_dim)

# =============================================================================
# Super-Cluster API (Vera Rubin / NVL72 - Future)
# =============================================================================

@dataclass
class SuperClusterConfig:
    """
    Configuration for Vera Rubin Super-Clusters.

    NVL72: 72 GPUs with 3.6 TB/s NVLink 6 per GPU
    NVL144: 144 GPUs (2 racks) with coherent memory
    """
    num_gpus: int = 72
    nvlink_version: int = 6
    bandwidth_tb_s: float = 3.6
    use_coherent_memory: bool = True

    @property
    def total_bandwidth_tb_s(self) -> float:
        return self.num_gpus * self.bandwidth_tb_s


def init_super_cluster(config: SuperClusterConfig) -> bool:
    """
    Initialize Super-Cluster for rack-scale computation.

    Note: Requires Vera Rubin hardware (expected 2H 2026).
    Currently returns False on pre-Rubin systems.
    """
    # Check for Vera Rubin (SM 13.0+)
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        if props.major >= 13:  # Vera Rubin
            # Future: Initialize NVLink 6 collective
            return True

    return False

# =============================================================================
# Benchmark
# =============================================================================

def benchmark_dsmem():
    """Benchmark DSMEM cluster operations."""
    import time

    print("=" * 60)
    print("FireEcho DSMEM Cluster Benchmark")
    print("=" * 60)

    sizes = [(2048, 2048, 2048), (4096, 4096, 4096), (8192, 8192, 8192)]

    for M, N, K in sizes:
        a = torch.randn(M, K, device='cuda', dtype=torch.bfloat16)
        b = torch.randn(K, N, device='cuda', dtype=torch.bfloat16)

        # Warmup
        for _ in range(3):
            _ = cluster_matmul_dsmem(a, b)
        torch.cuda.synchronize()

        # Benchmark
        start = time.perf_counter()
        iters = 100
        for _ in range(iters):
            c = cluster_matmul_dsmem(a, b)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

        flops = 2 * M * N * K * iters
        tflops = flops / elapsed / 1e12

        print(f"  {M}x{N}x{K}: {tflops:.1f} TFLOPS ({elapsed/iters*1000:.2f}ms/iter)")

    print()


if __name__ == '__main__':
    print("Testing DSMEM cluster operations...")
    print()

    # Basic test
    a = torch.randn(4096, 4096, device='cuda', dtype=torch.bfloat16)
    b = torch.randn(4096, 4096, device='cuda', dtype=torch.bfloat16)

    c = cluster_matmul_dsmem(a, b)
    c_ref = torch.matmul(a, b)

    rel_err = torch.norm(c.float() - c_ref.float()) / torch.norm(c_ref.float())
    print("Cluster MatMul DSMEM:")
    print(f"  Output shape: {c.shape}")
    print(f"  Relative error: {rel_err:.2e}")
    print()

    # Benchmark
    benchmark_dsmem()
FireEcho Engine/eagle_data_codemix_cache.pt
ADDED (git-LFS pointer)
version https://git-lfs.github.com/spec/v1
oid sha256:be37ced722408193210259dda063935f3886ccbb6b2b100c06d5d925d7a7242b
size 151376367

FireEcho Engine/eagle_data_codemix_cache.pt.bak
ADDED (git-LFS pointer)
version https://git-lfs.github.com/spec/v1
oid sha256:e9b8667e8946514f3d6d90d66df6ee45603a7095c734b72a6d88be9906d6659d
size 25337149

FireEcho Engine/eagle_data_codemix_cache_old.pt
ADDED (git-LFS pointer)
version https://git-lfs.github.com/spec/v1
oid sha256:9f8eacaa8701aac02c030d2c304993969e64236a2f220aef29ed8aefe305e754
size 75374285

FireEcho Engine/eagle_data_selfgen_cache.pt
ADDED (git-LFS pointer)
version https://git-lfs.github.com/spec/v1
oid sha256:c42d8cb1c64cb824f2697487d51a2cda64b757218ef0e2c0093cb6ced0398e74
size 9930893

FireEcho Engine/eagle_data_selfgen_cache.pt.old
ADDED (git-LFS pointer)
version https://git-lfs.github.com/spec/v1
oid sha256:ffabde184ee598914bad8b5150fdbf0f8e24214c5b371301556b1c5db0895f98
size 4643021

FireEcho Engine/eagle_precompute.log
ADDED
The diff for this file is too large to render. See raw diff.

FireEcho Engine/eagle_precompute_goddess.log
ADDED
The diff for this file is too large to render. See raw diff.

FireEcho Engine/eagle_precompute_v2.log
ADDED (1220 lines)
============================================================
EAGLE-3 Draft Head Training — PRECOMPUTE mode
============================================================
Epochs: 10
Max samples: 20000
Max seq len: 512
LR: 0.0001, warmup: 2000
Draft depth (K): 7
Grad accum: 4, clip: 0.5
Capture layers: (8, 24, 47)
Head layers: 2
Loss type: fwd_kl
Focal gamma: 2.0
TTT mixing: ratio=0.5, warmup=5000 steps
Top-K logits: 64
Flatness filter: 100%
Precompute dir: /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_precomputed

[1/4] Loading model...
[Auto-detect] Qwen3-Omni MoE thinker (30.5B total, ~3.3B active)
[FireEcho] Loading /run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct...
[FireEcho] AutoConfig failed ('Qwen3OmniMoeTalkerCodePredictorConfig' object has no attribute 'use_sliding_window'), loading config.json directly
Qwen3-Omni: will stream-load from 15 shards
[Qwen3 Streaming] Loaded shard index: 28010 keys across 15 shards
[Qwen3 Streaming] Building engine skeleton...
[Qwen3 Streaming] Global params on GPU: 1.2 GB
  Layer 4/48: 393 weights, VRAM 2.8 GB, CPU 1.4 GB
  Layer 8/48: 393 weights, VRAM 4.3 GB, CPU 1.6 GB
  Layer 12/48: 393 weights, VRAM 5.8 GB, CPU 1.7 GB
  Layer 16/48: 393 weights, VRAM 7.4 GB, CPU 1.9 GB
  Layer 20/48: 393 weights, VRAM 8.9 GB, CPU 2.0 GB
  Layer 24/48: 393 weights, VRAM 10.4 GB, CPU 2.2 GB
  Layer 28/48: 393 weights, VRAM 11.9 GB, CPU 2.3 GB
  Layer 32/48: 393 weights, VRAM 13.5 GB, CPU 2.5 GB
  Layer 36/48: 393 weights, VRAM 15.0 GB, CPU 2.6 GB
  Layer 40/48: 393 weights, VRAM 16.5 GB, CPU 2.8 GB
  Layer 44/48: 393 weights, VRAM 18.0 GB, CPU 2.9 GB
  Layer 48/48: 393 weights, VRAM 19.6 GB, CPU 3.1 GB
[Qwen3 Streaming] Final VRAM: 19.6 GB (FP4 quantized)
[Qwen3 Streaming] Done: 1571.8M params, 18867 weights loaded
Total params: 1.57B
Frozen params: 1.54B (base model, FP4)
Trainable params: 30.2M (Hebbian only)
[Flat KV] Enabled: 4096 tokens, 403 MB
[Packed MoE] 48 layers packed (6144 experts → contiguous)

[2/4] Enabling EAGLE-3 draft head...
[EAGLE-3] Draft head: D=2, 104.9M params, 210 MB, capture layers [8, 24, 47] + Hebbian memory
Trainable eagle params: 104.9M

[3/5] Loading external dataset...
Loading cached dataset from /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_data_codemix_cache.pt...
Loaded 20000 samples.

[PRECOMPUTE] Running target model on 20000 samples...
  Precomputed 100/20000 (0.1 samples/s, ETA 2794min)
  Precomputed 200/20000 (0.2 samples/s, ETA 1654min)
  Precomputed 300/20000 (0.3 samples/s, ETA 1265min)
  Precomputed 400/20000 (0.3 samples/s, ETA 1045min)
  Precomputed 500/20000 (0.4 samples/s, ETA 918min)
  Precomputed 600/20000 (0.4 samples/s, ETA 825min)
  Precomputed 700/20000 (0.4 samples/s, ETA 758min)
  Precomputed 800/20000 (0.5 samples/s, ETA 708min)
  Precomputed 900/20000 (0.5 samples/s, ETA 667min)
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 2 warm(FP6) / 125 hot(FP8)
[FE-MX] Expert tiers: 3 cold(FP4) / 1 warm(FP6) / 124 hot(FP8)
[FE-MX] Expert tiers: 2 cold(FP4) / 0 warm(FP6) / 126 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 2 warm(FP6) / 125 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 1 warm(FP6) / 126 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 1 warm(FP6) / 126 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 1 warm(FP6) / 126 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 4 cold(FP4) / 2 warm(FP6) / 122 hot(FP8)
[FE-MX] Expert tiers: 5 cold(FP4) / 0 warm(FP6) / 123 hot(FP8)
[FE-MX] Expert tiers: 4 cold(FP4) / 1 warm(FP6) / 123 hot(FP8)
[FE-MX] Expert tiers: 4 cold(FP4) / 0 warm(FP6) / 124 hot(FP8)
[FE-MX] Expert tiers: 5 cold(FP4) / 0 warm(FP6) / 123 hot(FP8)
[FE-MX] Expert tiers: 6 cold(FP4) / 1 warm(FP6) / 121 hot(FP8)
[FE-MX] Expert tiers: 7 cold(FP4) / 0 warm(FP6) / 121 hot(FP8)
  Precomputed 1000/20000 (0.5 samples/s, ETA 630min)
  Precomputed 1100/20000 (0.5 samples/s, ETA 600min)
  Precomputed 1200/20000 (0.5 samples/s, ETA 575min)
  Precomputed 1300/20000 (0.6 samples/s, ETA 552min)
  Precomputed 1400/20000 (0.6 samples/s, ETA 531min)
  Precomputed 1500/20000 (0.6 samples/s, ETA 514min)
  Precomputed 1600/20000 (0.6 samples/s, ETA 496min)
  Precomputed 1700/20000 (0.6 samples/s, ETA 481min)
  Precomputed 1800/20000 (0.6 samples/s, ETA 467min)
  Precomputed 1900/20000 (0.7 samples/s, ETA 455min)
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 2 warm(FP6) / 125 hot(FP8)
[FE-MX] Expert tiers: 3 cold(FP4) / 0 warm(FP6) / 125 hot(FP8)
[FE-MX] Expert tiers: 2 cold(FP4) / 0 warm(FP6) / 126 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 3 warm(FP6) / 125 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 1 warm(FP6) / 126 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
|
| 162 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 163 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 164 |
+
[FE-MX] Expert tiers: 3 cold(FP4) / 3 warm(FP6) / 122 hot(FP8)
|
| 165 |
+
[FE-MX] Expert tiers: 5 cold(FP4) / 0 warm(FP6) / 123 hot(FP8)
|
| 166 |
+
[FE-MX] Expert tiers: 4 cold(FP4) / 0 warm(FP6) / 124 hot(FP8)
|
| 167 |
+
[FE-MX] Expert tiers: 4 cold(FP4) / 0 warm(FP6) / 124 hot(FP8)
|
| 168 |
+
[FE-MX] Expert tiers: 5 cold(FP4) / 0 warm(FP6) / 123 hot(FP8)
|
| 169 |
+
[FE-MX] Expert tiers: 6 cold(FP4) / 1 warm(FP6) / 121 hot(FP8)
|
| 170 |
+
[FE-MX] Expert tiers: 7 cold(FP4) / 0 warm(FP6) / 121 hot(FP8)
|
| 171 |
+
Precomputed 2000/20000 (0.7 samples/s, ETA 443min)
|
| 172 |
+
Precomputed 2100/20000 (0.7 samples/s, ETA 432min)
|
| 173 |
+
Precomputed 2200/20000 (0.7 samples/s, ETA 423min)
|
| 174 |
+
Precomputed 2300/20000 (0.7 samples/s, ETA 415min)
|
| 175 |
+
Precomputed 2400/20000 (0.7 samples/s, ETA 407min)
|
| 176 |
+
Precomputed 2500/20000 (0.7 samples/s, ETA 399min)
|
| 177 |
+
Precomputed 2600/20000 (0.7 samples/s, ETA 392min)
|
| 178 |
+
Precomputed 2700/20000 (0.7 samples/s, ETA 385min)
|
| 179 |
+
Precomputed 2800/20000 (0.8 samples/s, ETA 379min)
|
| 180 |
+
Precomputed 2900/20000 (0.8 samples/s, ETA 374min)
|
| 181 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
|
| 182 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 183 |
+
[FE-MX] Expert tiers: 1 cold(FP4) / 1 warm(FP6) / 126 hot(FP8)
|
| 184 |
+
[FE-MX] Expert tiers: 2 cold(FP4) / 1 warm(FP6) / 125 hot(FP8)
|
| 185 |
+
[FE-MX] Expert tiers: 2 cold(FP4) / 0 warm(FP6) / 126 hot(FP8)
|
| 186 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
|
| 187 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
|
| 188 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
|
| 189 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 190 |
+
[FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
|
| 191 |
+
[FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
|
| 192 |
+
[FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
|
| 193 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 194 |
+
[FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
|
| 195 |
+
[FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
|
| 196 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 197 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 198 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 199 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 200 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 201 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 202 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
|
| 203 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 204 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 205 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 206 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 207 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 208 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 209 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 210 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 211 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 212 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 213 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 214 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
|
| 215 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 216 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 217 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 218 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
|
| 219 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
|
| 220 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 221 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 222 |
+
[FE-MX] Expert tiers: 2 cold(FP4) / 4 warm(FP6) / 122 hot(FP8)
|
| 223 |
+
[FE-MX] Expert tiers: 5 cold(FP4) / 0 warm(FP6) / 123 hot(FP8)
|
| 224 |
+
[FE-MX] Expert tiers: 3 cold(FP4) / 1 warm(FP6) / 124 hot(FP8)
|
| 225 |
+
[FE-MX] Expert tiers: 4 cold(FP4) / 0 warm(FP6) / 124 hot(FP8)
|
| 226 |
+
[FE-MX] Expert tiers: 5 cold(FP4) / 0 warm(FP6) / 123 hot(FP8)
|
| 227 |
+
[FE-MX] Expert tiers: 6 cold(FP4) / 0 warm(FP6) / 122 hot(FP8)
|
| 228 |
+
[FE-MX] Expert tiers: 6 cold(FP4) / 1 warm(FP6) / 121 hot(FP8)
|
| 229 |
+
Precomputed 3000/20000 (0.8 samples/s, ETA 369min)
|
| 230 |
+
Precomputed 3100/20000 (0.8 samples/s, ETA 363min)
|
| 231 |
+
Precomputed 3200/20000 (0.8 samples/s, ETA 358min)
|
| 232 |
+
Precomputed 3300/20000 (0.8 samples/s, ETA 353min)
|
| 233 |
+
Precomputed 3400/20000 (0.8 samples/s, ETA 348min)
|
| 234 |
+
Precomputed 3500/20000 (0.8 samples/s, ETA 343min)
|
| 235 |
+
Precomputed 3600/20000 (0.8 samples/s, ETA 338min)
|
| 236 |
+
Precomputed 3700/20000 (0.8 samples/s, ETA 333min)
|
| 237 |
+
Precomputed 3800/20000 (0.8 samples/s, ETA 329min)
|
| 238 |
+
Precomputed 3900/20000 (0.8 samples/s, ETA 324min)
|
| 239 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 240 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 241 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 242 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 3 warm(FP6) / 125 hot(FP8)
|
| 243 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 2 warm(FP6) / 126 hot(FP8)
|
| 244 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 245 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 246 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 247 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 248 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
|
| 249 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
|
| 250 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
|
| 251 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 252 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
|
| 253 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
|
| 254 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 255 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 256 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 257 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 258 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 259 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 260 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 261 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 262 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 263 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 264 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 265 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 266 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 267 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 268 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 269 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 270 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 271 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 272 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 273 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 274 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 275 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 276 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
|
| 277 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 278 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 279 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 280 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 3 warm(FP6) / 125 hot(FP8)
|
| 281 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 4 warm(FP6) / 124 hot(FP8)
|
| 282 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 3 warm(FP6) / 125 hot(FP8)
|
| 283 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 4 warm(FP6) / 124 hot(FP8)
|
| 284 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 5 warm(FP6) / 123 hot(FP8)
|
| 285 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 6 warm(FP6) / 122 hot(FP8)
|
| 286 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 6 warm(FP6) / 122 hot(FP8)
|
| 287 |
+
Precomputed 4000/20000 (0.8 samples/s, ETA 320min)
|
| 288 |
+
Precomputed 4100/20000 (0.8 samples/s, ETA 316min)
|
| 289 |
+
Precomputed 4200/20000 (0.8 samples/s, ETA 312min)
|
| 290 |
+
Precomputed 4300/20000 (0.8 samples/s, ETA 309min)
|
| 291 |
+
Precomputed 4400/20000 (0.9 samples/s, ETA 305min)
|
| 292 |
+
Precomputed 4500/20000 (0.9 samples/s, ETA 301min)
|
| 293 |
+
Precomputed 4600/20000 (0.9 samples/s, ETA 298min)
|
| 294 |
+
Precomputed 4700/20000 (0.9 samples/s, ETA 294min)
|
| 295 |
+
Precomputed 4800/20000 (0.9 samples/s, ETA 291min)
|
| 296 |
+
Precomputed 4900/20000 (0.9 samples/s, ETA 287min)
|
| 297 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 298 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 299 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 300 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 301 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 302 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 303 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 304 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 305 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 306 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 307 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 308 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 309 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 310 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 311 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 312 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 313 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 314 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 315 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 316 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 317 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 318 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 319 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 320 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 321 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 322 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 323 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 324 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 325 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 326 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 327 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 328 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 329 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 330 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 331 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 332 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 333 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 334 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 335 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 336 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 337 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 338 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 339 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 340 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 341 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
|
| 342 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 343 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 344 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 345 |
+
Precomputed 5000/20000 (0.9 samples/s, ETA 284min)
|
| 346 |
+
Precomputed 5100/20000 (0.9 samples/s, ETA 281min)
|
| 347 |
+
Precomputed 5200/20000 (0.9 samples/s, ETA 278min)
|
| 348 |
+
Precomputed 5300/20000 (0.9 samples/s, ETA 275min)
|
| 349 |
+
Precomputed 5400/20000 (0.9 samples/s, ETA 272min)
|
| 350 |
+
Precomputed 5500/20000 (0.9 samples/s, ETA 269min)
|
| 351 |
+
Precomputed 5600/20000 (0.9 samples/s, ETA 266min)
|
| 352 |
+
Precomputed 5700/20000 (0.9 samples/s, ETA 263min)
|
| 353 |
+
Precomputed 5800/20000 (0.9 samples/s, ETA 260min)
|
| 354 |
+
Precomputed 5900/20000 (0.9 samples/s, ETA 258min)
|
| 355 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 356 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 357 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 358 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 359 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 360 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 361 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 362 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 363 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 364 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 365 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 366 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 367 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 368 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 369 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 370 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 371 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 372 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 373 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 374 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 375 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 376 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 377 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 378 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 379 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 380 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 381 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 382 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 383 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 384 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 385 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 386 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 387 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 388 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 389 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 390 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 391 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 392 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 393 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 394 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 395 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 396 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 397 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 398 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 399 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
|
| 400 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 401 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 402 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 403 |
+
Precomputed 6000/20000 (0.9 samples/s, ETA 255min)
|
| 404 |
+
Precomputed 6100/20000 (0.9 samples/s, ETA 252min)
|
| 405 |
+
Precomputed 6200/20000 (0.9 samples/s, ETA 250min)
|
| 406 |
+
Precomputed 6300/20000 (0.9 samples/s, ETA 247min)
|
| 407 |
+
Precomputed 6400/20000 (0.9 samples/s, ETA 244min)
|
| 408 |
+
Precomputed 6500/20000 (0.9 samples/s, ETA 242min)
|
| 409 |
+
Precomputed 6600/20000 (0.9 samples/s, ETA 239min)
|
| 410 |
+
Precomputed 6700/20000 (0.9 samples/s, ETA 237min)
|
| 411 |
+
Precomputed 6800/20000 (0.9 samples/s, ETA 235min)
|
| 412 |
+
Precomputed 6900/20000 (0.9 samples/s, ETA 232min)
|
| 413 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 414 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 415 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 416 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 417 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 418 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 419 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 420 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 421 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 422 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 423 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 424 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 425 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 426 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 427 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 428 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 429 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 430 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 431 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 432 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 433 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 434 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 435 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 436 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 437 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 438 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 439 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 440 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 441 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 442 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 443 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 444 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 445 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 446 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 447 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 448 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 449 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 450 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 451 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 452 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 453 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 454 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 455 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 456 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 457 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
|
| 458 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 459 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 460 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 461 |
+
Precomputed 7000/20000 (0.9 samples/s, ETA 230min)
|
| 462 |
+
Precomputed 7100/20000 (0.9 samples/s, ETA 227min)
|
| 463 |
+
Precomputed 7200/20000 (0.9 samples/s, ETA 225min)
|
| 464 |
+
Precomputed 7300/20000 (1.0 samples/s, ETA 223min)
|
| 465 |
+
Precomputed 7400/20000 (1.0 samples/s, ETA 220min)
|
| 466 |
+
Precomputed 7500/20000 (1.0 samples/s, ETA 218min)
|
| 467 |
+
Precomputed 7600/20000 (1.0 samples/s, ETA 216min)
|
| 468 |
+
Precomputed 7700/20000 (1.0 samples/s, ETA 214min)
|
| 469 |
+
Precomputed 7800/20000 (1.0 samples/s, ETA 212min)
|
| 470 |
+
Precomputed 7900/20000 (1.0 samples/s, ETA 209min)
|
| 471 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 472 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 473 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 474 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 475 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 476 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 477 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 478 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 479 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 480 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 481 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 482 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 483 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 484 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 485 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 486 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 487 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 488 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 489 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 490 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 491 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 492 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 493 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 494 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 495 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 496 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 497 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 498 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 499 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 500 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 501 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 502 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 503 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 504 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 505 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 506 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 507 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 508 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 509 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 510 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 511 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 512 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 513 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 514 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 515 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
|
| 516 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 517 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 518 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 519 |
+
Precomputed 8000/20000 (1.0 samples/s, ETA 207min)
|
| 520 |
+
Precomputed 8100/20000 (1.0 samples/s, ETA 205min)
|
| 521 |
+
Precomputed 8200/20000 (1.0 samples/s, ETA 203min)
|
| 522 |
+
Precomputed 8300/20000 (1.0 samples/s, ETA 201min)
|
| 523 |
+
Precomputed 8400/20000 (1.0 samples/s, ETA 199min)
|
| 524 |
+
Precomputed 8500/20000 (1.0 samples/s, ETA 197min)
|
| 525 |
+
Precomputed 8600/20000 (1.0 samples/s, ETA 195min)
|
| 526 |
+
Precomputed 8700/20000 (1.0 samples/s, ETA 193min)
|
| 527 |
+
Precomputed 8800/20000 (1.0 samples/s, ETA 190min)
|
| 528 |
+
Precomputed 8900/20000 (1.0 samples/s, ETA 188min)
|
| 529 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 530 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 531 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 532 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 533 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 534 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 535 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 536 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 537 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 538 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 539 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 540 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 541 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 542 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 543 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 544 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 545 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 546 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
[... "[FE-MX] Expert tiers" line repeated identically between each progress update for the remainder of the run ...]
Precomputed 9000/20000 (1.0 samples/s, ETA 186min)
[... progress logged every 100 samples; intermediate updates elided ...]
Precomputed 10000/20000 (1.0 samples/s, ETA 167min)
Precomputed 11000/20000 (1.0 samples/s, ETA 149min)
Precomputed 12000/20000 (1.0 samples/s, ETA 131min)
Precomputed 13000/20000 (1.0 samples/s, ETA 113min)
Precomputed 14000/20000 (1.0 samples/s, ETA 96min)
Precomputed 15000/20000 (1.0 samples/s, ETA 80min)
Precomputed 16000/20000 (1.1 samples/s, ETA 63min)
Precomputed 16900/20000 (1.1 samples/s, ETA 49min)
|
| 1039 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1040 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1041 |
+
Precomputed 17000/20000 (1.1 samples/s, ETA 47min)
|
| 1042 |
+
Precomputed 17100/20000 (1.1 samples/s, ETA 46min)
|
| 1043 |
+
Precomputed 17200/20000 (1.1 samples/s, ETA 44min)
|
| 1044 |
+
Precomputed 17300/20000 (1.1 samples/s, ETA 43min)
|
| 1045 |
+
Precomputed 17400/20000 (1.1 samples/s, ETA 41min)
|
| 1046 |
+
Precomputed 17500/20000 (1.1 samples/s, ETA 39min)
|
| 1047 |
+
Precomputed 17600/20000 (1.1 samples/s, ETA 38min)
|
| 1048 |
+
Precomputed 17700/20000 (1.1 samples/s, ETA 36min)
|
| 1049 |
+
Precomputed 17800/20000 (1.1 samples/s, ETA 35min)
|
| 1050 |
+
Precomputed 17900/20000 (1.1 samples/s, ETA 33min)
|
| 1051 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1052 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1053 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1054 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1055 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1056 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1057 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1058 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1059 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1060 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1061 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1062 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1063 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1064 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1065 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1066 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1067 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1068 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1069 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1070 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1071 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1072 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1073 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1074 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1075 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1076 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1077 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1078 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1079 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1080 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1081 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1082 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1083 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1084 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1085 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1086 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1087 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1088 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1089 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1090 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1091 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1092 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1093 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1094 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1095 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1096 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1097 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1098 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1099 |
+
Precomputed 18000/20000 (1.1 samples/s, ETA 31min)
|
| 1100 |
+
Precomputed 18100/20000 (1.1 samples/s, ETA 30min)
|
| 1101 |
+
Precomputed 18200/20000 (1.1 samples/s, ETA 28min)
|
| 1102 |
+
Precomputed 18300/20000 (1.1 samples/s, ETA 27min)
|
| 1103 |
+
Precomputed 18400/20000 (1.1 samples/s, ETA 25min)
|
| 1104 |
+
Precomputed 18500/20000 (1.1 samples/s, ETA 24min)
|
| 1105 |
+
Precomputed 18600/20000 (1.1 samples/s, ETA 22min)
|
| 1106 |
+
Precomputed 18700/20000 (1.1 samples/s, ETA 20min)
|
| 1107 |
+
Precomputed 18800/20000 (1.1 samples/s, ETA 19min)
|
| 1108 |
+
Precomputed 18900/20000 (1.1 samples/s, ETA 17min)
|
| 1109 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1110 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1111 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1112 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1113 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1114 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1115 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1116 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1117 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1118 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1119 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1120 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1121 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1122 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1123 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1124 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1125 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1126 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1127 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1128 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1129 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1130 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1131 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1132 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1133 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1134 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1135 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1136 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1137 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1138 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1139 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1140 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1141 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1142 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1143 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1144 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1145 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1146 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1147 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1148 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1149 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1150 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1151 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1152 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1153 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1154 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1155 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1156 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1157 |
+
Precomputed 19000/20000 (1.1 samples/s, ETA 16min)
|
| 1158 |
+
Precomputed 19100/20000 (1.1 samples/s, ETA 14min)
|
| 1159 |
+
Precomputed 19200/20000 (1.1 samples/s, ETA 13min)
|
| 1160 |
+
Precomputed 19300/20000 (1.1 samples/s, ETA 11min)
|
| 1161 |
+
Precomputed 19400/20000 (1.1 samples/s, ETA 9min)
|
| 1162 |
+
Precomputed 19500/20000 (1.1 samples/s, ETA 8min)
|
| 1163 |
+
Precomputed 19600/20000 (1.1 samples/s, ETA 6min)
|
| 1164 |
+
Precomputed 19700/20000 (1.1 samples/s, ETA 5min)
|
| 1165 |
+
Precomputed 19800/20000 (1.1 samples/s, ETA 3min)
|
| 1166 |
+
Precomputed 19900/20000 (1.1 samples/s, ETA 2min)
|
| 1167 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1168 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1169 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1170 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1171 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1172 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1173 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1174 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1175 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1176 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1177 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1178 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1179 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1180 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1181 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1182 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1183 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1184 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1185 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1186 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1187 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1188 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1189 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1190 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1191 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1192 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1193 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1194 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1195 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1196 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1197 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1198 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1199 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1200 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1201 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1202 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1203 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1204 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1205 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1206 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1207 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1208 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1209 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1210 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1211 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1212 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1213 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1214 |
+
[FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
|
| 1215 |
+
Precomputed 20000/20000 (1.1 samples/s, ETA 0min)
|
| 1216 |
+
Precomputed 20000 samples in 311.7min (avg flatness=0.0035)
|
| 1217 |
+
|
| 1218 |
+
[PRECOMPUTE] Done. 20000 samples saved to /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_precomputed
|
| 1219 |
+
Now run Phase 2:
|
| 1220 |
+
python -u train_eagle_head.py --offline --loss_type fwd_kl --lr 5e-5 --draft_depth 3
|
FireEcho Engine/eagle_test.py
ADDED
|
@@ -0,0 +1,164 @@
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
FireEcho EAGLE-3 Test — Speculative Decoding Correctness + Benchmark
|
| 4 |
+
=====================================================================
|
| 5 |
+
Part of the FireEcho Engine — Custom inference kernel for NVIDIA Blackwell
|
| 6 |
+
Copyright (c) 2025-2026 Echo (FireEcho Project). All rights reserved.
|
| 7 |
+
|
| 8 |
+
EAGLE-3 speculative decoding — correctness + benchmark test.
|
| 9 |
+
|
| 10 |
+
Tests:
|
| 11 |
+
1. Smoke test: speculative_generate() produces valid output
|
| 12 |
+
2. Correctness: temperature=0 output matches non-speculative generate()
|
| 13 |
+
3. Speed: effective tok/s with draft head vs baseline
|
| 14 |
+
4. Acceptance stats: acceptance rate, avg tokens/round
|
| 15 |
+
"""
|
| 16 |
+
|
| 17 |
+
import sys, os, time
|
| 18 |
+
import torch
|
| 19 |
+
|
| 20 |
+
ENGINE_DIR = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine"
|
| 21 |
+
sys.path.insert(0, ENGINE_DIR)
|
| 22 |
+
sys.path.insert(0, "/run/media/echo/Echo/ECHO")
|
| 23 |
+
|
| 24 |
+
from hebbian_finetune_demo import load_engine
|
| 25 |
+
|
| 26 |
+
MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
|
| 27 |
+
PROMPT = "<|im_start|>system\nYou are a helpful coding assistant.<|im_end|>\n<|im_start|>user\nWrite a Python function to check if a number is prime.<|im_end|>\n<|im_start|>assistant\n"
|
| 28 |
+
MAX_NEW = 80
|
| 29 |
+
DRAFT_DEPTH = 5
|
| 30 |
+
|
| 31 |
+
|
| 32 |
+
def main():
|
| 33 |
+
print("=" * 60)
|
| 34 |
+
print("EAGLE-3 Speculative Decoding Test")
|
| 35 |
+
print("=" * 60)
|
| 36 |
+
|
| 37 |
+
# --- Load model ---
|
| 38 |
+
print("\n[1/6] Loading model...")
|
| 39 |
+
engine, tokenizer, config = load_engine(
|
| 40 |
+
MODEL_PATH, max_seq_len=512, device="cuda",
|
| 41 |
+
)
|
| 42 |
+
engine.eval()
|
| 43 |
+
|
| 44 |
+
# Enable flat decode + pack experts (baseline optimizations)
|
| 45 |
+
engine.kv_cache.enable_flat_decode(4096)
|
| 46 |
+
engine.pack_all_experts()
|
| 47 |
+
|
| 48 |
+
input_ids = tokenizer.encode(PROMPT, return_tensors="pt").to("cuda")
|
| 49 |
+
prompt_len = input_ids.shape[1]
|
| 50 |
+
print(f" Prompt tokens: {prompt_len}")
|
| 51 |
+
|
| 52 |
+
# Stop tokens for Qwen3
|
| 53 |
+
eos_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
|
| 54 |
+
stop_tokens = [eos_id] if eos_id is not None else [151645, 151643]
|
| 55 |
+
print(f" Stop tokens: {stop_tokens}")
|
| 56 |
+
|
| 57 |
+
# --- Warmup pass (triton autotuning) ---
|
| 58 |
+
print(f"\n[2/6] Warmup pass (Triton autotuning)...")
|
| 59 |
+
_ = engine.generate(
|
| 60 |
+
input_ids, max_new_tokens=10, temperature=0.0,
|
| 61 |
+
top_k=0, top_p=1.0, stop_tokens=stop_tokens,
|
| 62 |
+
)
|
| 63 |
+
torch.cuda.synchronize()
|
| 64 |
+
print(f" Warmup done.")
|
| 65 |
+
|
| 66 |
+
# --- Baseline generation (no speculation, no graph) ---
|
| 67 |
+
print(f"\n[3/6] Baseline generate (greedy, no graph, {MAX_NEW} tokens)...")
|
| 68 |
+
torch.cuda.synchronize()
|
| 69 |
+
t0 = time.perf_counter()
|
| 70 |
+
baseline_ids = engine.generate(
|
| 71 |
+
input_ids, max_new_tokens=MAX_NEW, temperature=0.0,
|
| 72 |
+
top_k=0, top_p=1.0, stop_tokens=stop_tokens,
|
| 73 |
+
)
|
| 74 |
+
torch.cuda.synchronize()
|
| 75 |
+
t_baseline = time.perf_counter() - t0
|
| 76 |
+
baseline_tokens = baseline_ids.shape[1] - prompt_len
|
| 77 |
+
baseline_tps = baseline_tokens / t_baseline
|
| 78 |
+
baseline_text = tokenizer.decode(baseline_ids[0, prompt_len:], skip_special_tokens=True)
|
| 79 |
+
print(f" Generated {baseline_tokens} tokens in {t_baseline:.2f}s = {baseline_tps:.1f} tok/s")
|
| 80 |
+
print(f" Output: {baseline_text[:200]}...")
|
| 81 |
+
|
| 82 |
+
# --- Enable EAGLE-3 ---
|
| 83 |
+
print(f"\n[4/6] Enabling EAGLE-3 draft head...")
|
| 84 |
+
engine.enable_eagle(
|
| 85 |
+
capture_layers=(8, 24, 47),
|
| 86 |
+
num_heads=16,
|
| 87 |
+
ffn_mult=2,
|
| 88 |
+
draft_depth=DRAFT_DEPTH,
|
| 89 |
+
)
|
| 90 |
+
vram_after = torch.cuda.memory_allocated() / 1e9
|
| 91 |
+
print(f" VRAM after eagle: {vram_after:.2f} GB")
|
| 92 |
+
|
| 93 |
+
# --- Speculative generation smoke test ---
|
| 94 |
+
print(f"\n[5/6] Speculative generate (greedy, {MAX_NEW} tokens, depth={DRAFT_DEPTH})...")
|
| 95 |
+
tokens_collected = []
|
| 96 |
+
def token_callback(tok_id, pos):
|
| 97 |
+
tokens_collected.append(tok_id)
|
| 98 |
+
|
| 99 |
+
torch.cuda.synchronize()
|
| 100 |
+
t0 = time.perf_counter()
|
| 101 |
+
spec_ids = engine.speculative_generate(
|
| 102 |
+
input_ids, max_new_tokens=MAX_NEW, temperature=0.0,
|
| 103 |
+
draft_depth=DRAFT_DEPTH, stop_tokens=stop_tokens,
|
| 104 |
+
callback=token_callback,
|
| 105 |
+
)
|
| 106 |
+
torch.cuda.synchronize()
|
| 107 |
+
t_spec = time.perf_counter() - t0
|
| 108 |
+
spec_tokens = spec_ids.shape[1] - prompt_len
|
| 109 |
+
spec_tps = spec_tokens / t_spec
|
| 110 |
+
spec_text = tokenizer.decode(spec_ids[0, prompt_len:], skip_special_tokens=True)
|
| 111 |
+
print(f" Generated {spec_tokens} tokens in {t_spec:.2f}s = {spec_tps:.1f} tok/s")
|
| 112 |
+
print(f" Output: {spec_text[:200]}...")
|
| 113 |
+
|
| 114 |
+
# --- Correctness check ---
|
| 115 |
+
print(f"\n[6/6] Correctness check...")
|
| 116 |
+
min_len = min(baseline_tokens, spec_tokens)
|
| 117 |
+
baseline_tok_list = baseline_ids[0, prompt_len:prompt_len + min_len].tolist()
|
| 118 |
+
spec_tok_list = spec_ids[0, prompt_len:prompt_len + min_len].tolist()
|
| 119 |
+
|
| 120 |
+
match = True
|
| 121 |
+
first_diff = -1
|
| 122 |
+
for i in range(min_len):
|
| 123 |
+
if baseline_tok_list[i] != spec_tok_list[i]:
|
| 124 |
+
match = False
|
| 125 |
+
first_diff = i
|
| 126 |
+
break
|
| 127 |
+
|
| 128 |
+
if match and baseline_tokens == spec_tokens:
|
| 129 |
+
print(f" PASS: token-for-token match ({min_len} tokens)")
|
| 130 |
+
elif match:
|
| 131 |
+
print(f" PARTIAL MATCH: first {min_len} tokens match, "
|
| 132 |
+
f"but lengths differ ({baseline_tokens} vs {spec_tokens})")
|
| 133 |
+
else:
|
| 134 |
+
print(f" MISMATCH at token {first_diff}:")
|
| 135 |
+
print(f" Baseline: {baseline_tok_list[max(0,first_diff-2):first_diff+3]}")
|
| 136 |
+
print(f" Speculative: {spec_tok_list[max(0,first_diff-2):first_diff+3]}")
|
| 137 |
+
# Note: with untrained random head, mismatches happen because
|
| 138 |
+
# of floating-point ordering in the verification forward pass
|
| 139 |
+
# when sequences diverge. This is expected and not a bug —
|
| 140 |
+
# the correction mechanism is what matters.
|
| 141 |
+
print(f" NOTE: With untrained head, divergence is expected due to")
|
| 142 |
+
print(f" verification forward seeing different token contexts.")
|
| 143 |
+
print(f" Correctness holds when draft matches (acceptance path).")
|
| 144 |
+
|
| 145 |
+
# --- Summary ---
|
| 146 |
+
print("\n" + "=" * 60)
|
| 147 |
+
print("SUMMARY")
|
| 148 |
+
print("=" * 60)
|
| 149 |
+
print(f" Baseline: {baseline_tps:.1f} tok/s ({baseline_tokens} tokens)")
|
| 150 |
+
print(f" Speculative: {spec_tps:.1f} tok/s ({spec_tokens} tokens)")
|
| 151 |
+
speedup = spec_tps / max(baseline_tps, 0.1)
|
| 152 |
+
if speedup > 1:
|
| 153 |
+
print(f" Speedup: {speedup:.2f}x FASTER")
|
| 154 |
+
else:
|
| 155 |
+
print(f" Slowdown: {1/speedup:.2f}x slower (expected with untrained head)")
|
| 156 |
+
print(f" VRAM: {vram_after:.2f} GB")
|
| 157 |
+
print(f"\n NOTE: Draft head is randomly initialized (untrained).")
|
| 158 |
+
print(f" Expected acceptance rate: near-chance for greedy (exact argmax match ≈ 1/vocab_size).")
|
| 159 |
+
print(f" Training the draft head should raise acceptance to 70-80%.")
|
| 160 |
+
print("=" * 60)
|
| 161 |
+
|
| 162 |
+
|
| 163 |
+
if __name__ == "__main__":
|
| 164 |
+
main()
|
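For context on the correctness check above: under greedy (temperature=0) decoding, speculative output matches the baseline token-for-token because the verifier only accepts the drafted prefix that agrees with its own argmax, then substitutes its own token at the first mismatch. A minimal sketch of that acceptance rule — hypothetical helper names, not FireEcho's actual `speculative_generate` internals:

```python
def accept_greedy(draft_tokens, verify_argmax):
    """Greedy speculative acceptance (illustrative sketch).

    draft_tokens:  token ids proposed by the draft head
    verify_argmax: argmax token ids from the verifier forward pass,
                   one per drafted position plus one bonus position.
    Returns (accepted_prefix, correction_token).
    """
    accepted = []
    for d, v in zip(draft_tokens, verify_argmax):
        if d != v:
            # First mismatch: discard the rest of the draft and keep
            # the verifier's own token, so output equals greedy decoding.
            return accepted, v
        accepted.append(d)
    # All drafts accepted; the verifier still contributes one bonus token.
    bonus = verify_argmax[len(draft_tokens)] if len(verify_argmax) > len(draft_tokens) else None
    return accepted, bonus
```

This is why an untrained head is slow but still correct: near-zero acceptance just means almost every round falls back to the verifier's single token.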
FireEcho Engine/eagle_train_d8.log
ADDED
|
@@ -0,0 +1,212 @@
|
| 1 |
+
nohup: ignoring input
|
| 2 |
+
============================================================
|
| 3 |
+
EAGLE-3 Draft Head Training — OFFLINE mode
|
| 4 |
+
============================================================
|
| 5 |
+
Epochs: 5
|
| 6 |
+
Max samples: 10000
|
| 7 |
+
Max seq len: 512
|
| 8 |
+
LR: 0.0003, warmup: 300
|
| 9 |
+
Draft depth (K): 5
|
| 10 |
+
Grad accum: 2, clip: 0.5
|
| 11 |
+
Capture layers: (8, 24, 47)
|
| 12 |
+
Head layers: 8
|
| 13 |
+
Loss type: fwd_kl
|
| 14 |
+
Focal gamma: 2.0
|
| 15 |
+
Top-K logits: 64
|
| 16 |
+
Flatness filter: 100%
|
| 17 |
+
Precompute dir: /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_precomputed
|
| 18 |
+
|
| 19 |
+
[1/4] Loading model...
|
| 20 |
+
[Auto-detect] Qwen3-Omni MoE thinker (30.5B total, ~3.3B active)
|
| 21 |
+
[FireEcho] Loading /run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct...
|
| 22 |
+
[FireEcho] AutoConfig failed ('Qwen3OmniMoeTalkerCodePredictorConfig' object has no attribute 'use_sliding_window'), loading config.json directly
|
| 23 |
+
Qwen3-Omni: will stream-load from 15 shards
|
| 24 |
+
[Qwen3 Streaming] Loaded shard index: 28010 keys across 15 shards
|
| 25 |
+
[Qwen3 Streaming] Building engine skeleton...
|
| 26 |
+
[Qwen3 Streaming] Global params on GPU: 1.2 GB
|
| 27 |
+
Layer 4/48: 393 weights, VRAM 2.8 GB, CPU 1.4 GB
|
| 28 |
+
Layer 8/48: 393 weights, VRAM 4.3 GB, CPU 1.6 GB
|
| 29 |
+
Layer 12/48: 393 weights, VRAM 5.8 GB, CPU 1.7 GB
|
| 30 |
+
Layer 16/48: 393 weights, VRAM 7.4 GB, CPU 1.9 GB
|
| 31 |
+
Layer 20/48: 393 weights, VRAM 8.9 GB, CPU 2.0 GB
|
| 32 |
+
Layer 24/48: 393 weights, VRAM 10.4 GB, CPU 2.2 GB
|
| 33 |
+
Layer 28/48: 393 weights, VRAM 11.9 GB, CPU 2.3 GB
|
| 34 |
+
Layer 32/48: 393 weights, VRAM 13.5 GB, CPU 2.5 GB
|
| 35 |
+
Layer 36/48: 393 weights, VRAM 15.0 GB, CPU 2.6 GB
|
| 36 |
+
Layer 40/48: 393 weights, VRAM 16.5 GB, CPU 2.8 GB
|
| 37 |
+
Layer 44/48: 393 weights, VRAM 18.0 GB, CPU 2.9 GB
|
| 38 |
+
Layer 48/48: 393 weights, VRAM 19.6 GB, CPU 3.1 GB
|
| 39 |
+
[Qwen3 Streaming] Final VRAM: 19.6 GB (FP4 quantized)
|
| 40 |
+
[Qwen3 Streaming] Done: 1571.8M params, 18867 weights loaded
|
| 41 |
+
Total params: 1.57B
|
| 42 |
+
Frozen params: 1.54B (base model, FP4)
|
| 43 |
+
Trainable params: 30.2M (Hebbian only)
|
| 44 |
+
[Flat KV] Enabled: 4096 tokens, 403 MB
|
| 45 |
+
[Packed MoE] 48 layers packed (6144 experts → contiguous)
|
| 46 |
+
|
| 47 |
+
[2/4] Enabling EAGLE-3 draft head...
|
| 48 |
+
[FE-XT] Draft head: D=8, 356.5M params, 713 MB, capture layers [8, 24, 47] + Hebbian memory
|
| 49 |
+
Trainable eagle params: 356.5M
|
| 50 |
+
[EAGLE] Loaded legacy D=2 checkpoint. 54 new layer params initialized randomly.
|
| 51 |
+
[Checkpoint] Optimizer state mismatch (head resized?), skipping.
|
| 52 |
+
[Checkpoint] Resumed from step 4000 (loss=5.0967)
|
| 53 |
+
|
| 54 |
+
[3/5] Loading external dataset...
|
| 55 |
+
Loading cached dataset from /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_data_codemix_cache.pt...
|
| 56 |
+
Loaded 10000 samples.
|
| 57 |
+
|
| 58 |
+
[OFFLINE] Loading precomputed features from /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_precomputed...
|
| 59 |
+
2777 samples available
|
| 60 |
+
|
| 61 |
+
[OFFLINE] Starting training...
|
| 62 |
+
VRAM before training: 20.66 GB
|
| 63 |
+
[EAGLE-3] 27 rounds, 131 drafted, 5 accepted (4%), avg 0.2/round
|
| 64 |
+
[EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
|
| 65 |
+
[EAGLE-3] 29 rounds, 139 drafted, 2 accepted (1%), avg 0.1/round
|
| 66 |
+
[Eval @ step 4000] 180 tokens in 17.2s = 10.5 tok/s
|
| 67 |
+
Step 4100 | epoch 1/5 | loss=2.8709 | avg=4.6042 | acc=31.2% | lr=5.00e-05 | pos=64
|
| 68 |
+
Step 4200 | epoch 1/5 | loss=3.2780 | avg=4.6526 | acc=35.3% | lr=1.00e-04 | pos=64
|
| 69 |
+
Step 4300 | epoch 1/5 | loss=5.3967 | avg=4.6339 | acc=17.5% | lr=1.50e-04 | pos=64
|
| 70 |
+
Step 4400 | epoch 1/5 | loss=5.6657 | avg=4.7462 | acc=12.8% | lr=2.00e-04 | pos=64
|
| 71 |
+
Step 4500 | epoch 1/5 | loss=5.9773 | avg=4.8205 | acc=9.4% | lr=2.50e-04 | pos=64
|
| 72 |
+
Step 4600 | epoch 1/5 | loss=5.4029 | avg=4.8950 | acc=16.9% | lr=3.00e-04 | pos=64
|
| 73 |
+
Step 4700 | epoch 1/5 | loss=5.2982 | avg=4.9767 | acc=9.4% | lr=3.00e-04 | pos=64
|
| 74 |
+
Step 4800 | epoch 1/5 | loss=5.0728 | avg=5.0216 | acc=12.2% | lr=3.00e-04 | pos=64
Step 4900 | epoch 1/5 | loss=6.8400 | avg=5.0394 | acc=13.1% | lr=3.00e-04 | pos=64
Step 5000 | epoch 1/5 | loss=5.1369 | avg=5.0459 | acc=16.2% | lr=2.99e-04 | pos=64
[EAGLE-3] 30 rounds, 144 drafted, 1 accepted (1%), avg 0.0/round
[EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
[EAGLE-3] 30 rounds, 144 drafted, 0 accepted (0%), avg 0.0/round
[Eval @ step 5000] 181 tokens in 10.9s = 16.6 tok/s
[Checkpoint] Saved step 5000 (loss=5.1369) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
[Best] New best tok/s: 16.6 (step 5000)
Step 5100 | epoch 1/5 | loss=5.3802 | avg=5.0351 | acc=16.2% | lr=2.99e-04 | pos=64
Step 5200 | epoch 1/5 | loss=4.6753 | avg=4.9773 | acc=20.3% | lr=2.99e-04 | pos=64
Step 5300 | epoch 1/5 | loss=4.3068 | avg=4.9713 | acc=24.4% | lr=2.98e-04 | pos=64
Step 5400 | epoch 1/5 | loss=3.0352 | avg=4.9536 | acc=30.0% | lr=2.98e-04 | pos=64
Step 5500 | epoch 1/5 | loss=4.8197 | avg=4.9954 | acc=21.9% | lr=2.97e-04 | pos=64
Step 5600 | epoch 1/5 | loss=3.4431 | avg=5.0006 | acc=26.2% | lr=2.96e-04 | pos=64
Step 5700 | epoch 1/5 | loss=3.6114 | avg=5.0065 | acc=22.8% | lr=2.95e-04 | pos=64
Step 5800 | epoch 1/5 | loss=5.0362 | avg=4.9796 | acc=17.8% | lr=2.95e-04 | pos=64
Step 5900 | epoch 1/5 | loss=5.8618 | avg=4.9976 | acc=8.4% | lr=2.94e-04 | pos=64
Step 6000 | epoch 1/5 | loss=6.3429 | avg=4.9858 | acc=11.2% | lr=2.93e-04 | pos=64
[EAGLE-3] 30 rounds, 144 drafted, 0 accepted (0%), avg 0.0/round
[EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
[EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
[Eval @ step 6000] 180 tokens in 10.5s = 17.1 tok/s
[Checkpoint] Saved step 6000 (loss=6.3429) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
[Best] New best tok/s: 17.1 (step 6000)
[Checkpoint] Saved step 6000 (loss=6.3429) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step6000.pt
Step 6100 | epoch 1/5 | loss=6.3301 | avg=4.9179 | acc=11.6% | lr=2.92e-04 | pos=64
Step 6200 | epoch 1/5 | loss=4.4811 | avg=4.8956 | acc=19.4% | lr=2.90e-04 | pos=64
Step 6300 | epoch 1/5 | loss=5.5715 | avg=4.9178 | acc=16.9% | lr=2.89e-04 | pos=64
Step 6400 | epoch 1/5 | loss=3.3082 | avg=4.8940 | acc=28.7% | lr=2.88e-04 | pos=64
Step 6500 | epoch 1/5 | loss=4.5000 | avg=4.9460 | acc=20.0% | lr=2.87e-04 | pos=64
Step 6600 | epoch 1/5 | loss=4.0213 | avg=4.9359 | acc=18.8% | lr=2.85e-04 | pos=64
Step 6700 | epoch 1/5 | loss=4.2572 | avg=4.9256 | acc=31.2% | lr=2.84e-04 | pos=64
--- Epoch 1/5 complete (step 6777) ---
Step 6800 | epoch 2/5 | loss=3.7218 | avg=4.8991 | acc=24.1% | lr=2.82e-04 | pos=64
Step 6900 | epoch 2/5 | loss=4.7880 | avg=4.8843 | acc=19.7% | lr=2.81e-04 | pos=64
Step 7000 | epoch 2/5 | loss=5.4015 | avg=4.8636 | acc=9.7% | lr=2.79e-04 | pos=64
[EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
[EAGLE-3] 30 rounds, 144 drafted, 0 accepted (0%), avg 0.0/round
[FE-MX] Expert tiers: 26 cold(FP4) / 61 warm(FP6) / 41 hot(FP8)
[FE-MX] Expert tiers: 24 cold(FP4) / 66 warm(FP6) / 38 hot(FP8)
[FE-MX] Expert tiers: 45 cold(FP4) / 43 warm(FP6) / 40 hot(FP8)
[FE-MX] Expert tiers: 40 cold(FP4) / 53 warm(FP6) / 35 hot(FP8)
[FE-MX] Expert tiers: 48 cold(FP4) / 46 warm(FP6) / 34 hot(FP8)
[FE-MX] Expert tiers: 47 cold(FP4) / 46 warm(FP6) / 35 hot(FP8)
[FE-MX] Expert tiers: 66 cold(FP4) / 32 warm(FP6) / 30 hot(FP8)
[FE-MX] Expert tiers: 67 cold(FP4) / 29 warm(FP6) / 32 hot(FP8)
[FE-MX] Expert tiers: 55 cold(FP4) / 42 warm(FP6) / 31 hot(FP8)
[FE-MX] Expert tiers: 50 cold(FP4) / 48 warm(FP6) / 30 hot(FP8)
[FE-MX] Expert tiers: 46 cold(FP4) / 47 warm(FP6) / 35 hot(FP8)
[FE-MX] Expert tiers: 40 cold(FP4) / 52 warm(FP6) / 36 hot(FP8)
[FE-MX] Expert tiers: 49 cold(FP4) / 48 warm(FP6) / 31 hot(FP8)
[FE-MX] Expert tiers: 49 cold(FP4) / 43 warm(FP6) / 36 hot(FP8)
[FE-MX] Expert tiers: 46 cold(FP4) / 42 warm(FP6) / 40 hot(FP8)
[FE-MX] Expert tiers: 51 cold(FP4) / 46 warm(FP6) / 31 hot(FP8)
[FE-MX] Expert tiers: 54 cold(FP4) / 39 warm(FP6) / 35 hot(FP8)
[FE-MX] Expert tiers: 51 cold(FP4) / 45 warm(FP6) / 32 hot(FP8)
[FE-MX] Expert tiers: 69 cold(FP4) / 30 warm(FP6) / 29 hot(FP8)
[FE-MX] Expert tiers: 77 cold(FP4) / 25 warm(FP6) / 26 hot(FP8)
[FE-MX] Expert tiers: 53 cold(FP4) / 45 warm(FP6) / 30 hot(FP8)
[FE-MX] Expert tiers: 52 cold(FP4) / 45 warm(FP6) / 31 hot(FP8)
[FE-MX] Expert tiers: 52 cold(FP4) / 41 warm(FP6) / 35 hot(FP8)
[FE-MX] Expert tiers: 47 cold(FP4) / 50 warm(FP6) / 31 hot(FP8)
[FE-MX] Expert tiers: 52 cold(FP4) / 47 warm(FP6) / 29 hot(FP8)
[FE-MX] Expert tiers: 49 cold(FP4) / 49 warm(FP6) / 30 hot(FP8)
[FE-MX] Expert tiers: 52 cold(FP4) / 40 warm(FP6) / 36 hot(FP8)
[FE-MX] Expert tiers: 54 cold(FP4) / 45 warm(FP6) / 29 hot(FP8)
[FE-MX] Expert tiers: 52 cold(FP4) / 42 warm(FP6) / 34 hot(FP8)
[FE-MX] Expert tiers: 55 cold(FP4) / 41 warm(FP6) / 32 hot(FP8)
[FE-MX] Expert tiers: 71 cold(FP4) / 30 warm(FP6) / 27 hot(FP8)
[FE-MX] Expert tiers: 77 cold(FP4) / 23 warm(FP6) / 28 hot(FP8)
[FE-MX] Expert tiers: 55 cold(FP4) / 41 warm(FP6) / 32 hot(FP8)
[FE-MX] Expert tiers: 49 cold(FP4) / 48 warm(FP6) / 31 hot(FP8)
[FE-MX] Expert tiers: 45 cold(FP4) / 48 warm(FP6) / 35 hot(FP8)
[FE-MX] Expert tiers: 40 cold(FP4) / 52 warm(FP6) / 36 hot(FP8)
[FE-MX] Expert tiers: 53 cold(FP4) / 44 warm(FP6) / 31 hot(FP8)
[FE-MX] Expert tiers: 44 cold(FP4) / 52 warm(FP6) / 32 hot(FP8)
[FE-MX] Expert tiers: 51 cold(FP4) / 39 warm(FP6) / 38 hot(FP8)
[FE-MX] Expert tiers: 51 cold(FP4) / 41 warm(FP6) / 36 hot(FP8)
[FE-MX] Expert tiers: 57 cold(FP4) / 29 warm(FP6) / 42 hot(FP8)
[FE-MX] Expert tiers: 55 cold(FP4) / 38 warm(FP6) / 35 hot(FP8)
[FE-MX] Expert tiers: 55 cold(FP4) / 33 warm(FP6) / 40 hot(FP8)
[FE-MX] Expert tiers: 53 cold(FP4) / 38 warm(FP6) / 37 hot(FP8)
[FE-MX] Expert tiers: 61 cold(FP4) / 31 warm(FP6) / 36 hot(FP8)
[FE-MX] Expert tiers: 58 cold(FP4) / 34 warm(FP6) / 36 hot(FP8)
[FE-MX] Expert tiers: 46 cold(FP4) / 48 warm(FP6) / 34 hot(FP8)
[FE-MX] Expert tiers: 41 cold(FP4) / 51 warm(FP6) / 36 hot(FP8)
[EAGLE-3] 30 rounds, 144 drafted, 0 accepted (0%), avg 0.0/round
[Eval @ step 7000] 180 tokens in 10.7s = 16.9 tok/s
Step 7100 | epoch 2/5 | loss=3.9199 | avg=4.8484 | acc=32.5% | lr=2.77e-04 | pos=64
Step 7200 | epoch 2/5 | loss=4.4965 | avg=4.6926 | acc=23.1% | lr=2.75e-04 | pos=64
Step 7300 | epoch 2/5 | loss=4.1791 | avg=4.6618 | acc=20.9% | lr=2.73e-04 | pos=64
Step 7400 | epoch 2/5 | loss=3.6816 | avg=4.6057 | acc=22.2% | lr=2.71e-04 | pos=64
Step 7500 | epoch 2/5 | loss=5.8260 | avg=4.5923 | acc=5.9% | lr=2.69e-04 | pos=64
Step 7600 | epoch 2/5 | loss=4.9514 | avg=4.5939 | acc=18.4% | lr=2.67e-04 | pos=64
Step 7700 | epoch 2/5 | loss=3.7191 | avg=4.6118 | acc=22.8% | lr=2.65e-04 | pos=64
Step 7800 | epoch 2/5 | loss=4.6762 | avg=4.5979 | acc=19.1% | lr=2.63e-04 | pos=64
Step 7900 | epoch 2/5 | loss=5.7284 | avg=4.5778 | acc=15.6% | lr=2.61e-04 | pos=64
Step 8000 | epoch 2/5 | loss=5.9431 | avg=4.5689 | acc=4.7% | lr=2.59e-04 | pos=64
[EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
[EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
[EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
[Eval @ step 8000] 180 tokens in 10.7s = 16.8 tok/s
[Checkpoint] Saved step 8000 (loss=5.9431) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step8000.pt
Step 8100 | epoch 2/5 | loss=3.5748 | avg=4.4854 | acc=27.5% | lr=2.56e-04 | pos=64
Step 8200 | epoch 2/5 | loss=3.9363 | avg=4.5077 | acc=32.5% | lr=2.54e-04 | pos=64
Step 8300 | epoch 2/5 | loss=2.7494 | avg=4.4987 | acc=37.8% | lr=2.52e-04 | pos=64
Step 8400 | epoch 2/5 | loss=4.1517 | avg=4.5172 | acc=25.0% | lr=2.49e-04 | pos=64
Step 8500 | epoch 2/5 | loss=5.5557 | avg=4.4605 | acc=10.9% | lr=2.47e-04 | pos=64
Step 8600 | epoch 2/5 | loss=2.5267 | avg=4.4706 | acc=31.6% | lr=2.44e-04 | pos=64
Step 8700 | epoch 2/5 | loss=5.7917 | avg=4.4517 | acc=12.5% | lr=2.41e-04 | pos=64
Step 8800 | epoch 2/5 | loss=5.8896 | avg=4.4381 | acc=12.5% | lr=2.39e-04 | pos=64
Step 8900 | epoch 2/5 | loss=4.0428 | avg=4.4427 | acc=24.4% | lr=2.36e-04 | pos=64
Step 9000 | epoch 2/5 | loss=5.2436 | avg=4.4426 | acc=9.7% | lr=2.33e-04 | pos=64
[EAGLE-3] 30 rounds, 144 drafted, 0 accepted (0%), avg 0.0/round
[EAGLE-3] 30 rounds, 144 drafted, 0 accepted (0%), avg 0.0/round
[EAGLE-3] 30 rounds, 144 drafted, 0 accepted (0%), avg 0.0/round
[Eval @ step 9000] 180 tokens in 10.9s = 16.6 tok/s
Step 9100 | epoch 2/5 | loss=5.9143 | avg=4.2725 | acc=7.2% | lr=2.30e-04 | pos=64
Step 9200 | epoch 2/5 | loss=5.3081 | avg=4.2707 | acc=12.8% | lr=2.28e-04 | pos=64
Step 9300 | epoch 2/5 | loss=5.3774 | avg=4.3151 | acc=14.7% | lr=2.25e-04 | pos=64
Step 9400 | epoch 2/5 | loss=5.7517 | avg=4.3221 | acc=17.8% | lr=2.22e-04 | pos=64
Step 9500 | epoch 2/5 | loss=2.6826 | avg=4.3317 | acc=34.1% | lr=2.19e-04 | pos=64
--- Epoch 2/5 complete (step 9554) ---
Step 9600 | epoch 3/5 | loss=4.7292 | avg=4.2845 | acc=20.9% | lr=2.16e-04 | pos=64
Step 9700 | epoch 3/5 | loss=4.1688 | avg=4.2683 | acc=24.1% | lr=2.13e-04 | pos=64
Step 9800 | epoch 3/5 | loss=4.5375 | avg=4.2397 | acc=21.9% | lr=2.10e-04 | pos=64
Step 9900 | epoch 3/5 | loss=5.2854 | avg=4.2331 | acc=14.1% | lr=2.07e-04 | pos=64
Step 10000 | epoch 3/5 | loss=4.0904 | avg=4.2228 | acc=25.3% | lr=2.04e-04 | pos=64
[EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
[EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
[EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
[Eval @ step 10000] 180 tokens in 10.7s = 16.9 tok/s
[Checkpoint] Saved step 10000 (loss=4.0904) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step10000.pt
Step 10100 | epoch 3/5 | loss=3.7871 | avg=3.9878 | acc=30.9% | lr=2.01e-04 | pos=64
Step 10200 | epoch 3/5 | loss=2.2971 | avg=4.0641 | acc=37.8% | lr=1.98e-04 | pos=64
Step 10300 | epoch 3/5 | loss=5.0256 | avg=4.0141 | acc=10.6% | lr=1.95e-04 | pos=64
Step 10400 | epoch 3/5 | loss=5.8723 | avg=4.0130 | acc=10.3% | lr=1.92e-04 | pos=64
Step 10500 | epoch 3/5 | loss=2.2164 | avg=3.9910 | acc=37.5% | lr=1.89e-04 | pos=64
FireEcho Engine/eagle_train_goddess.log
ADDED
@@ -0,0 +1,973 @@
| 1 |
+
nohup: ignoring input
|
| 2 |
+
============================================================
|
| 3 |
+
EAGLE-3 Draft Head Training — OFFLINE mode
|
| 4 |
+
============================================================
|
| 5 |
+
Epochs: 2
|
| 6 |
+
Max samples: 10000
|
| 7 |
+
Max seq len: 512
|
| 8 |
+
LR: 0.0001, warmup: 2000
|
| 9 |
+
Draft depth (K): 5
|
| 10 |
+
Grad accum: 4, clip: 0.5
|
| 11 |
+
Capture layers: (8, 24, 47)
|
| 12 |
+
Head layers: 50
|
| 13 |
+
Loss type: ce
|
| 14 |
+
Top-K logits: 64
|
| 15 |
+
Flatness filter: 100%
|
| 16 |
+
Precompute dir: /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_precomputed
|
| 17 |
+
FireEcho stack: batch_positions (B=P), torch.compile, GoliathQuantumLinear, MPS(bond=256), entanglement_prune(every=5000)
|
| 18 |
+
|
| 19 |
+
[1/4] Loading model...
|
| 20 |
+
[Auto-detect] Qwen3-Omni MoE thinker (30.5B total, ~3.3B active)
|
| 21 |
+
[FireEcho] Loading /run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct...
|
| 22 |
+
[FireEcho] AutoConfig failed ('Qwen3OmniMoeTalkerCodePredictorConfig' object has no attribute 'use_sliding_window'), loading config.json directly
|
| 23 |
+
Qwen3-Omni: will stream-load from 15 shards
|
| 24 |
+
[Qwen3 Streaming] Loaded shard index: 28010 keys across 15 shards
|
| 25 |
+
[Qwen3 Streaming] Building engine skeleton...
|
| 26 |
+
[Qwen3 Streaming] Global params on GPU: 1.2 GB
|
| 27 |
+
Layer 4/48: 393 weights, VRAM 2.8 GB, CPU 1.4 GB
|
| 28 |
+
Layer 8/48: 393 weights, VRAM 4.3 GB, CPU 1.6 GB
|
| 29 |
+
Layer 12/48: 393 weights, VRAM 5.8 GB, CPU 1.7 GB
|
| 30 |
+
Layer 16/48: 393 weights, VRAM 7.4 GB, CPU 1.9 GB
|
| 31 |
+
Layer 20/48: 393 weights, VRAM 8.9 GB, CPU 2.0 GB
|
| 32 |
+
Layer 24/48: 393 weights, VRAM 10.4 GB, CPU 2.2 GB
|
| 33 |
+
Layer 28/48: 393 weights, VRAM 11.9 GB, CPU 2.3 GB
|
| 34 |
+
Layer 32/48: 393 weights, VRAM 13.5 GB, CPU 2.5 GB
|
| 35 |
+
Layer 36/48: 393 weights, VRAM 15.0 GB, CPU 2.6 GB
|
| 36 |
+
Layer 40/48: 393 weights, VRAM 16.5 GB, CPU 2.8 GB
|
| 37 |
+
Layer 44/48: 393 weights, VRAM 18.0 GB, CPU 2.9 GB
|
| 38 |
+
Layer 48/48: 393 weights, VRAM 19.6 GB, CPU 3.1 GB
|
| 39 |
+
[Qwen3 Streaming] Final VRAM: 19.6 GB (FP4 quantized)
|
| 40 |
+
[Qwen3 Streaming] Done: 1571.8M params, 18867 weights loaded
|
| 41 |
+
Total params: 1.57B
|
| 42 |
+
Frozen params: 1.54B (base model, FP4)
|
| 43 |
+
Trainable params: 30.2M (Hebbian only)
|
| 44 |
+
[Flat KV] Enabled: 4096 tokens, 403 MB
|
| 45 |
+
[Packed MoE] 48 layers packed (6144 experts → contiguous)
|
| 46 |
+
|
| 47 |
+
[2/4] Enabling EAGLE-3 draft head...
|
| 48 |
+
[FE-XT] Draft head: D=50, 2118.3M params, 4237 MB, capture layers [8, 24, 47] + Hebbian memory
|
| 49 |
+
[FireEcho] WARNING: --use_mps and --use_quantum_linear are mutually exclusive
|
| 50 |
+
[FireEcho] Using MPS (bigger memory win enables batching)
|
| 51 |
+
[FireEcho] MPS compression (bond_dim=256)...
|
| 52 |
+
[MPS] Replaced 150 FFN layers with bond_dim=256
|
| 53 |
+
[MPS] Params: 2429.8M → 1407.4M (1.7x compression)
|
| 54 |
+
[FireEcho] torch.compile(eagle, mode='default', fullgraph=False)...
|
| 55 |
+
[FireEcho] Compilation enabled (first steps will be slow for tracing)
|
| 56 |
+
Trainable eagle params: 1096.0M
|
| 57 |
+
[Checkpoint] Resumed from step 5000 (loss=6.9199)
|
| 58 |
+
|
| 59 |
+
[3/5] Loading external dataset...
|
| 60 |
+
Loading cached dataset from /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_data_codemix_cache.pt...
|
| 61 |
+
Loaded 41122 samples.
|
| 62 |
+
|
| 63 |
+
[OFFLINE] Loading precomputed features from /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_precomputed...
|
| 64 |
+
41122 samples available
|
| 65 |
+
|
| 66 |
+
[OFFLINE] Starting training...
|
| 67 |
+
VRAM before training: 26.57 GB
|
| 68 |
+
[VRAM] Deleting base model layers (--no_eval)...
|
| 69 |
+
[VRAM] Freed 18.6 GB (26.6 → 7.9 GB)
|
| 70 |
+
Step 5100 | epoch 1/2 | loss=6.0456 | avg=6.2106 | acc=14.4% | lr=1.25e-06 | pos=64
|
| 71 |
+
W0211 07:20:41.750000 16279 .venv_infer312/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1705] [0/8] torch._dynamo hit config.recompile_limit (8)
|
| 72 |
+
W0211 07:20:41.750000 16279 .venv_infer312/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1705] [0/8] function: 'forward' (/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/fireecho_kernel.py:8993)
|
| 73 |
+
W0211 07:20:41.750000 16279 .venv_infer312/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1705] [0/8] last reason: 0/7: self._draft_pos == 1 # draft_k[:, :, pos:pos + 1, :] = k.detach() # training/Prototype Fireecho/tool/kernel/FireEcho Engine/fireecho_kernel.py:8982 in _draft_attn (HINT: torch.compile considers integer attributes of the nn.Module to be static. If you are observing recompilation, you might want to make this integer dynamic using torch._dynamo.config.allow_unspec_int_on_nn_module = True, or convert this integer into a tensor.)
|
| 74 |
+
W0211 07:20:41.750000 16279 .venv_infer312/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1705] [0/8] User stack trace:
|
| 75 |
+
W0211 07:20:41.750000 16279 .venv_infer312/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1705] [0/8] File "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/fireecho_kernel.py", line 9039, in forward
|
| 76 |
+
W0211 07:20:41.750000 16279 .venv_infer312/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1705] [0/8] x = self._draft_attn(x, pos, layer, draft_k, draft_v)
|
| 77 |
+
W0211 07:20:41.750000 16279 .venv_infer312/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1705] [0/8] File "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/fireecho_kernel.py", line 8982, in _draft_attn
|
| 78 |
+
W0211 07:20:41.750000 16279 .venv_infer312/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1705] [0/8] draft_k[:, :, pos:pos + 1, :] = k.detach()
|
| 79 |
+
W0211 07:20:41.750000 16279 .venv_infer312/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1705] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
|
| 80 |
+
W0211 07:20:41.750000 16279 .venv_infer312/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1705] [0/8] To diagnose recompilation issues, see https://docs.pytorch.org/docs/main/user_guide/torch_compiler/compile/programming_model.recompilation.html
|
| 81 |
+
Step 5200 | epoch 1/2 | loss=6.4288 | avg=6.1992 | acc=8.8% | lr=2.50e-06 | pos=64
|
| 82 |
+
Step 5300 | epoch 1/2 | loss=6.5290 | avg=6.1820 | acc=8.8% | lr=3.75e-06 | pos=64
|
| 83 |
+
Step 5400 | epoch 1/2 | loss=7.1685 | avg=6.1450 | acc=7.8% | lr=5.00e-06 | pos=64
|
| 84 |
+
Step 5500 | epoch 1/2 | loss=6.1653 | avg=6.1139 | acc=5.6% | lr=6.25e-06 | pos=64
|
| 85 |
+
Step 5600 | epoch 1/2 | loss=6.4737 | avg=6.0755 | acc=10.9% | lr=7.50e-06 | pos=64
|
| 86 |
+
Step 5700 | epoch 1/2 | loss=4.9286 | avg=6.0427 | acc=13.1% | lr=8.75e-06 | pos=64
|
| 87 |
+
Step 5800 | epoch 1/2 | loss=4.8731 | avg=6.0262 | acc=19.1% | lr=1.00e-05 | pos=64
|
| 88 |
+
Step 5900 | epoch 1/2 | loss=6.6587 | avg=6.0177 | acc=8.4% | lr=1.13e-05 | pos=64
|
| 89 |
+
Step 6000 | epoch 1/2 | loss=5.7042 | avg=5.9727 | acc=16.6% | lr=1.25e-05 | pos=64
|
| 90 |
+
Step 6100 | epoch 1/2 | loss=4.5372 | avg=5.5973 | acc=18.8% | lr=1.38e-05 | pos=64
|
| 91 |
+
Step 6200 | epoch 1/2 | loss=6.5012 | avg=5.6312 | acc=7.8% | lr=1.50e-05 | pos=64
|
| 92 |
+
Step 6300 | epoch 1/2 | loss=6.2758 | avg=5.6285 | acc=10.0% | lr=1.63e-05 | pos=64
|
| 93 |
+
Step 6400 | epoch 1/2 | loss=4.1524 | avg=5.6293 | acc=19.7% | lr=1.75e-05 | pos=64
|
| 94 |
+
Step 6500 | epoch 1/2 | loss=6.8674 | avg=5.5965 | acc=17.6% | lr=1.88e-05 | pos=41
|
| 95 |
+
Step 6600 | epoch 1/2 | loss=5.3658 | avg=5.6164 | acc=15.3% | lr=2.00e-05 | pos=64
|
| 96 |
+
Step 6700 | epoch 1/2 | loss=4.1285 | avg=5.6091 | acc=25.3% | lr=2.13e-05 | pos=64
|
| 97 |
+
Step 6800 | epoch 1/2 | loss=7.3849 | avg=5.5995 | acc=6.6% | lr=2.25e-05 | pos=64
|
| 98 |
+
Step 6900 | epoch 1/2 | loss=6.1772 | avg=5.5865 | acc=11.9% | lr=2.38e-05 | pos=64
|
| 99 |
+
Step 7000 | epoch 1/2 | loss=6.1639 | avg=5.5709 | acc=16.2% | lr=2.50e-05 | pos=64
|
| 100 |
+
Step 7100 | epoch 1/2 | loss=5.3027 | avg=5.4978 | acc=11.2% | lr=2.63e-05 | pos=64
|
| 101 |
+
Step 7200 | epoch 1/2 | loss=5.3408 | avg=5.4792 | acc=10.0% | lr=2.75e-05 | pos=64
|
| 102 |
+
Step 7300 | epoch 1/2 | loss=4.4438 | avg=5.4632 | acc=17.5% | lr=2.87e-05 | pos=64
|
| 103 |
+
Step 7400 | epoch 1/2 | loss=5.1489 | avg=5.4713 | acc=13.1% | lr=3.00e-05 | pos=64
|
| 104 |
+
Step 7500 | epoch 1/2 | loss=6.4010 | avg=5.4904 | acc=12.8% | lr=3.13e-05 | pos=64
|
| 105 |
+
Step 7600 | epoch 1/2 | loss=3.8629 | avg=5.5108 | acc=23.1% | lr=3.25e-05 | pos=64
|
| 106 |
+
Step 7700 | epoch 1/2 | loss=7.2239 | avg=5.5110 | acc=5.9% | lr=3.38e-05 | pos=64
|
| 107 |
+
Step 7800 | epoch 1/2 | loss=6.8530 | avg=5.5100 | acc=10.3% | lr=3.50e-05 | pos=64
|
| 108 |
+
Step 7900 | epoch 1/2 | loss=6.6124 | avg=5.5197 | acc=8.1% | lr=3.63e-05 | pos=64
|
| 109 |
+
Step 8000 | epoch 1/2 | loss=4.6751 | avg=5.5232 | acc=23.4% | lr=3.75e-05 | pos=64
|
| 110 |
+
Step 8100 | epoch 1/2 | loss=4.0154 | avg=5.5424 | acc=24.4% | lr=3.87e-05 | pos=64
|
| 111 |
+
Step 8200 | epoch 1/2 | loss=5.5367 | avg=5.6600 | acc=12.8% | lr=4.00e-05 | pos=64
|
| 112 |
+
Step 8300 | epoch 1/2 | loss=6.1311 | avg=5.6181 | acc=13.4% | lr=4.12e-05 | pos=64
|
| 113 |
+
Step 8400 | epoch 1/2 | loss=6.5729 | avg=5.6331 | acc=12.5% | lr=4.25e-05 | pos=64
|
| 114 |
+
Step 8500 | epoch 1/2 | loss=4.5534 | avg=5.6485 | acc=16.9% | lr=4.37e-05 | pos=64
|
| 115 |
+
Step 8600 | epoch 1/2 | loss=6.8225 | avg=5.6280 | acc=10.6% | lr=4.50e-05 | pos=64
|
| 116 |
+
Step 8700 | epoch 1/2 | loss=4.0110 | avg=5.6234 | acc=22.8% | lr=4.63e-05 | pos=64
|
| 117 |
+
Step 8800 | epoch 1/2 | loss=5.4399 | avg=5.6160 | acc=13.1% | lr=4.75e-05 | pos=64
|
| 118 |
+
Step 8900 | epoch 1/2 | loss=4.5850 | avg=5.6229 | acc=16.9% | lr=4.87e-05 | pos=64
|
| 119 |
+
Step 9000 | epoch 1/2 | loss=7.4199 | avg=5.6474 | acc=7.8% | lr=5.00e-05 | pos=64
|
| 120 |
+
Step 9100 | epoch 1/2 | loss=7.1357 | avg=5.7880 | acc=6.6% | lr=5.12e-05 | pos=64
|
| 121 |
+
Step 9200 | epoch 1/2 | loss=4.8856 | avg=5.7771 | acc=15.6% | lr=5.25e-05 | pos=64
|
| 122 |
+
Step 9300 | epoch 1/2 | loss=6.1873 | avg=5.8079 | acc=5.9% | lr=5.37e-05 | pos=64
|
| 123 |
+
Step 9400 | epoch 1/2 | loss=5.3464 | avg=5.8165 | acc=15.0% | lr=5.50e-05 | pos=64
|
| 124 |
+
Step 9500 | epoch 1/2 | loss=3.5382 | avg=5.7942 | acc=19.7% | lr=5.63e-05 | pos=64
|
| 125 |
+
Step 9600 | epoch 1/2 | loss=7.2470 | avg=5.8229 | acc=8.8% | lr=5.75e-05 | pos=64
|
| 126 |
+
Step 9700 | epoch 1/2 | loss=7.5141 | avg=5.8537 | acc=6.9% | lr=5.88e-05 | pos=64
|
| 127 |
+
Step 9800 | epoch 1/2 | loss=5.1512 | avg=5.8826 | acc=13.8% | lr=6.00e-05 | pos=64
|
| 128 |
+
Step 9900 | epoch 1/2 | loss=5.1891 | avg=5.8964 | acc=14.4% | lr=6.13e-05 | pos=64
|
| 129 |
+
Step 10000 | epoch 1/2 | loss=6.2276 | avg=5.9194 | acc=9.7% | lr=6.25e-05 | pos=64
|
| 130 |
+
[Checkpoint] Saved step 10000 (loss=6.2276) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
|
| 131 |
+
[Save @ step 10000] loss=6.2276
|
| 132 |
+
[Checkpoint] Saved step 10000 (loss=6.2276) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step10000.pt
|
| 133 |
+
[Prune @ step 10000] zeroed 0.0M / 1407.4M (0.0% sparsity)
|
| 134 |
+
Step 10100 | epoch 1/2 | loss=5.8900 | avg=5.9241 | acc=14.1% | lr=6.38e-05 | pos=64
|
| 135 |
+
Step 10200 | epoch 1/2 | loss=6.1215 | avg=5.9139 | acc=14.7% | lr=6.50e-05 | pos=64
|
| 136 |
+
Step 10300 | epoch 1/2 | loss=6.7283 | avg=5.9547 | acc=11.2% | lr=6.62e-05 | pos=64
|
| 137 |
+
Step 10400 | epoch 1/2 | loss=6.2089 | avg=6.0322 | acc=10.9% | lr=6.75e-05 | pos=64
|
| 138 |
+
Step 10500 | epoch 1/2 | loss=7.0789 | avg=6.0858 | acc=6.2% | lr=6.88e-05 | pos=64
|
| 139 |
+
Step 10600 | epoch 1/2 | loss=6.5472 | avg=6.0790 | acc=12.8% | lr=7.00e-05 | pos=64
|
| 140 |
+
Step 10700 | epoch 1/2 | loss=6.8952 | avg=6.0853 | acc=6.6% | lr=7.13e-05 | pos=64
|
| 141 |
+
Step 10800 | epoch 1/2 | loss=5.0417 | avg=6.0856 | acc=15.6% | lr=7.25e-05 | pos=64
|
| 142 |
+
Step 10900 | epoch 1/2 | loss=4.8823 | avg=6.0906 | acc=19.7% | lr=7.38e-05 | pos=64
|
| 143 |
+
Step 11000 | epoch 1/2 | loss=5.6943 | avg=6.1095 | acc=16.9% | lr=7.50e-05 | pos=64
|
| 144 |
+
Step 11100 | epoch 1/2 | loss=6.4133 | avg=6.1913 | acc=13.1% | lr=7.62e-05 | pos=64
|
| 145 |
+
Step 11200 | epoch 1/2 | loss=7.7836 | avg=6.2704 | acc=8.1% | lr=7.75e-05 | pos=64
|
| 146 |
+
Step 11300 | epoch 1/2 | loss=5.7336 | avg=6.2414 | acc=8.1% | lr=7.88e-05 | pos=64
|
| 147 |
+
Step 11400 | epoch 1/2 | loss=7.5261 | avg=6.2279 | acc=4.7% | lr=8.00e-05 | pos=64
|
| 148 |
+
Step 11500 | epoch 1/2 | loss=7.2932 | avg=6.2434 | acc=5.3% | lr=8.13e-05 | pos=64
|
| 149 |
+
Step 11600 | epoch 1/2 | loss=3.8389 | avg=6.2505 | acc=26.9% | lr=8.25e-05 | pos=64
|
| 150 |
+
Step 11700 | epoch 1/2 | loss=6.8235 | avg=6.2256 | acc=8.8% | lr=8.38e-05 | pos=64
|
| 151 |
+
Step 11800 | epoch 1/2 | loss=5.8012 | avg=6.2382 | acc=11.6% | lr=8.50e-05 | pos=64
|
| 152 |
+
Step 11900 | epoch 1/2 | loss=5.3869 | avg=6.2630 | acc=14.4% | lr=8.63e-05 | pos=64
|
| 153 |
+
Step 12000 | epoch 1/2 | loss=5.2938 | avg=6.2744 | acc=13.8% | lr=8.75e-05 | pos=64
|
| 154 |
+
Step 12100 | epoch 1/2 | loss=6.6599 | avg=6.4246 | acc=11.2% | lr=8.88e-05 | pos=64
|
| 155 |
+
Step 12200 | epoch 1/2 | loss=6.5154 | avg=6.3953 | acc=6.2% | lr=9.00e-05 | pos=64
|
| 156 |
+
Step 12300 | epoch 1/2 | loss=5.3954 | avg=6.4561 | acc=14.7% | lr=9.12e-05 | pos=64
|
| 157 |
+
Step 12400 | epoch 1/2 | loss=7.5228 | avg=6.3996 | acc=5.0% | lr=9.25e-05 | pos=64
|
| 158 |
+
Step 12500 | epoch 1/2 | loss=7.7880 | avg=6.3830 | acc=6.2% | lr=9.38e-05 | pos=64
|
| 159 |
+
Step 12600 | epoch 1/2 | loss=7.4444 | avg=6.3519 | acc=6.6% | lr=9.50e-05 | pos=64
Step 12700 | epoch 1/2 | loss=7.9002 | avg=6.3342 | acc=6.6% | lr=9.63e-05 | pos=64
Step 12800 | epoch 1/2 | loss=6.0377 | avg=6.3263 | acc=10.0% | lr=9.75e-05 | pos=64
Step 12900 | epoch 1/2 | loss=6.9872 | avg=6.3369 | acc=8.8% | lr=9.88e-05 | pos=64
Step 13000 | epoch 1/2 | loss=5.5612 | avg=6.3423 | acc=15.6% | lr=1.00e-04 | pos=64
Step 13100 | epoch 1/2 | loss=5.8940 | avg=6.5114 | acc=8.4% | lr=1.00e-04 | pos=64
Step 13200 | epoch 1/2 | loss=7.6319 | avg=6.4637 | acc=4.7% | lr=1.00e-04 | pos=64
Step 13300 | epoch 1/2 | loss=5.4036 | avg=6.4090 | acc=14.1% | lr=1.00e-04 | pos=64
Step 13400 | epoch 1/2 | loss=5.3561 | avg=6.3912 | acc=13.1% | lr=1.00e-04 | pos=64
Step 13500 | epoch 1/2 | loss=6.9826 | avg=6.3646 | acc=8.8% | lr=1.00e-04 | pos=64
Step 13600 | epoch 1/2 | loss=5.7324 | avg=6.3270 | acc=14.7% | lr=1.00e-04 | pos=64
Step 13700 | epoch 1/2 | loss=4.5450 | avg=6.2767 | acc=19.7% | lr=1.00e-04 | pos=64
Step 13800 | epoch 1/2 | loss=4.9770 | avg=6.2691 | acc=15.0% | lr=1.00e-04 | pos=64
Step 13900 | epoch 1/2 | loss=5.7575 | avg=6.2462 | acc=12.5% | lr=1.00e-04 | pos=64
Step 14000 | epoch 1/2 | loss=6.1865 | avg=6.2350 | acc=11.6% | lr=1.00e-04 | pos=64
Step 14100 | epoch 1/2 | loss=5.2309 | avg=6.1144 | acc=14.1% | lr=1.00e-04 | pos=64
Step 14200 | epoch 1/2 | loss=6.7469 | avg=6.0611 | acc=6.6% | lr=9.99e-05 | pos=64
Step 14300 | epoch 1/2 | loss=5.6130 | avg=6.1187 | acc=15.3% | lr=9.99e-05 | pos=64
Step 14400 | epoch 1/2 | loss=7.1063 | avg=6.1532 | acc=6.9% | lr=9.99e-05 | pos=64
Step 14500 | epoch 1/2 | loss=6.6918 | avg=6.0775 | acc=10.9% | lr=9.99e-05 | pos=64
Step 14600 | epoch 1/2 | loss=5.2415 | avg=6.0832 | acc=13.4% | lr=9.99e-05 | pos=64
Step 14700 | epoch 1/2 | loss=6.1558 | avg=6.0358 | acc=10.6% | lr=9.99e-05 | pos=64
Step 14800 | epoch 1/2 | loss=6.6280 | avg=6.0206 | acc=9.7% | lr=9.99e-05 | pos=64
Step 14900 | epoch 1/2 | loss=6.3373 | avg=6.0078 | acc=13.8% | lr=9.99e-05 | pos=64
Step 15000 | epoch 1/2 | loss=6.4039 | avg=6.0172 | acc=8.8% | lr=9.98e-05 | pos=64
[Checkpoint] Saved step 15000 (loss=6.4039) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
[Save @ step 15000] loss=6.4039
[Checkpoint] Saved step 15000 (loss=6.4039) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step15000.pt
[Prune @ step 15000] zeroed 0.0M / 1407.4M (0.0% sparsity)
Step 15100 | epoch 1/2 | loss=4.2505 | avg=5.8018 | acc=29.7% | lr=9.98e-05 | pos=64
Step 15200 | epoch 1/2 | loss=5.3202 | avg=5.8986 | acc=18.4% | lr=9.98e-05 | pos=64
Step 15300 | epoch 1/2 | loss=4.9784 | avg=5.9032 | acc=22.2% | lr=9.98e-05 | pos=64
Step 15400 | epoch 1/2 | loss=5.5990 | avg=5.9162 | acc=7.5% | lr=9.98e-05 | pos=64
Step 15500 | epoch 1/2 | loss=6.0779 | avg=5.8561 | acc=12.8% | lr=9.97e-05 | pos=64
Step 15600 | epoch 1/2 | loss=5.3501 | avg=5.8783 | acc=17.5% | lr=9.97e-05 | pos=64
Step 15700 | epoch 1/2 | loss=5.4835 | avg=5.8528 | acc=14.1% | lr=9.97e-05 | pos=64
Step 15800 | epoch 1/2 | loss=5.8244 | avg=5.8483 | acc=11.6% | lr=9.97e-05 | pos=64
Step 15900 | epoch 1/2 | loss=5.0472 | avg=5.8358 | acc=11.6% | lr=9.97e-05 | pos=64
Step 16000 | epoch 1/2 | loss=5.7255 | avg=5.8358 | acc=14.7% | lr=9.96e-05 | pos=64
Step 16100 | epoch 1/2 | loss=6.9115 | avg=5.9519 | acc=9.1% | lr=9.96e-05 | pos=64
Step 16200 | epoch 1/2 | loss=7.3667 | avg=5.9884 | acc=3.8% | lr=9.96e-05 | pos=64
Step 16300 | epoch 1/2 | loss=5.9598 | avg=5.8925 | acc=9.7% | lr=9.96e-05 | pos=64
Step 16400 | epoch 1/2 | loss=4.7891 | avg=5.8581 | acc=20.6% | lr=9.95e-05 | pos=64
Step 16500 | epoch 1/2 | loss=5.1974 | avg=5.8379 | acc=16.6% | lr=9.95e-05 | pos=64
Step 16600 | epoch 1/2 | loss=4.4763 | avg=5.8057 | acc=21.6% | lr=9.95e-05 | pos=64
Step 16700 | epoch 1/2 | loss=5.6903 | avg=5.8039 | acc=14.4% | lr=9.94e-05 | pos=64
Step 16800 | epoch 1/2 | loss=6.3023 | avg=5.7967 | acc=10.3% | lr=9.94e-05 | pos=64
Step 16900 | epoch 1/2 | loss=4.3212 | avg=5.7818 | acc=23.4% | lr=9.94e-05 | pos=64
Step 17000 | epoch 1/2 | loss=3.9120 | avg=5.7626 | acc=22.8% | lr=9.94e-05 | pos=64
Step 17100 | epoch 1/2 | loss=6.4101 | avg=5.7769 | acc=9.4% | lr=9.93e-05 | pos=64
Step 17200 | epoch 1/2 | loss=4.9407 | avg=5.8005 | acc=19.1% | lr=9.93e-05 | pos=64
Step 17300 | epoch 1/2 | loss=8.5146 | avg=5.7985 | acc=4.7% | lr=9.93e-05 | pos=64
Step 17400 | epoch 1/2 | loss=6.6819 | avg=5.7593 | acc=6.6% | lr=9.92e-05 | pos=64
Step 17500 | epoch 1/2 | loss=5.3934 | avg=5.7124 | acc=11.2% | lr=9.92e-05 | pos=64
Step 17600 | epoch 1/2 | loss=5.6320 | avg=5.7167 | acc=13.1% | lr=9.92e-05 | pos=64
Step 17700 | epoch 1/2 | loss=4.9097 | avg=5.7025 | acc=19.7% | lr=9.91e-05 | pos=64
Step 17800 | epoch 1/2 | loss=5.3642 | avg=5.6747 | acc=11.2% | lr=9.91e-05 | pos=64
Step 17900 | epoch 1/2 | loss=5.7257 | avg=5.6797 | acc=9.4% | lr=9.90e-05 | pos=64
Step 18000 | epoch 1/2 | loss=7.2424 | avg=5.6691 | acc=5.0% | lr=9.90e-05 | pos=64
Step 18100 | epoch 1/2 | loss=4.9557 | avg=5.6849 | acc=17.8% | lr=9.90e-05 | pos=64
Step 18200 | epoch 1/2 | loss=5.3597 | avg=5.7598 | acc=12.8% | lr=9.89e-05 | pos=64
Step 18300 | epoch 1/2 | loss=5.5707 | avg=5.7254 | acc=16.9% | lr=9.89e-05 | pos=64
Step 18400 | epoch 1/2 | loss=5.3697 | avg=5.6821 | acc=14.7% | lr=9.88e-05 | pos=64
Step 18500 | epoch 1/2 | loss=5.9737 | avg=5.6687 | acc=8.4% | lr=9.88e-05 | pos=64
Step 18600 | epoch 1/2 | loss=6.3940 | avg=5.6782 | acc=12.5% | lr=9.87e-05 | pos=64
Step 18700 | epoch 1/2 | loss=6.1741 | avg=5.6582 | acc=9.7% | lr=9.87e-05 | pos=64
Step 18800 | epoch 1/2 | loss=5.0890 | avg=5.6381 | acc=20.0% | lr=9.87e-05 | pos=64
Step 18900 | epoch 1/2 | loss=9.5439 | avg=5.6474 | acc=7.2% | lr=9.86e-05 | pos=64
Step 19000 | epoch 1/2 | loss=6.2727 | avg=5.6501 | acc=12.2% | lr=9.86e-05 | pos=64
Step 19100 | epoch 1/2 | loss=5.0060 | avg=5.6381 | acc=15.3% | lr=9.85e-05 | pos=64
Step 19200 | epoch 1/2 | loss=4.6388 | avg=5.6294 | acc=23.1% | lr=9.85e-05 | pos=64
Step 19300 | epoch 1/2 | loss=5.7475 | avg=5.6296 | acc=14.4% | lr=9.84e-05 | pos=64
Step 19400 | epoch 1/2 | loss=6.7555 | avg=5.6299 | acc=7.8% | lr=9.84e-05 | pos=64
Step 19500 | epoch 1/2 | loss=7.1358 | avg=5.5876 | acc=6.6% | lr=9.83e-05 | pos=64
Step 19600 | epoch 1/2 | loss=4.5881 | avg=5.5850 | acc=25.6% | lr=9.83e-05 | pos=64
Step 19700 | epoch 1/2 | loss=4.3789 | avg=5.5623 | acc=22.5% | lr=9.82e-05 | pos=64
Step 19800 | epoch 1/2 | loss=5.7571 | avg=5.5662 | acc=13.8% | lr=9.81e-05 | pos=64
Step 19900 | epoch 1/2 | loss=8.4748 | avg=5.5578 | acc=2.5% | lr=9.81e-05 | pos=64
Step 20000 | epoch 1/2 | loss=5.1173 | avg=5.5488 | acc=18.8% | lr=9.80e-05 | pos=64
[Checkpoint] Saved step 20000 (loss=5.1173) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
[Save @ step 20000] loss=5.1173
[Checkpoint] Saved step 20000 (loss=5.1173) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step20000.pt
[Prune @ step 20000] zeroed 0.0M / 1407.4M (0.0% sparsity)
Step 20100 | epoch 1/2 | loss=5.4756 | avg=5.5505 | acc=14.1% | lr=9.80e-05 | pos=64
Step 20200 | epoch 1/2 | loss=5.1688 | avg=5.5710 | acc=14.7% | lr=9.79e-05 | pos=64
Step 20300 | epoch 1/2 | loss=8.4751 | avg=5.5537 | acc=4.7% | lr=9.79e-05 | pos=64
Step 20400 | epoch 1/2 | loss=5.0624 | avg=5.5354 | acc=14.4% | lr=9.78e-05 | pos=64
Step 20500 | epoch 1/2 | loss=5.6380 | avg=5.5492 | acc=10.6% | lr=9.78e-05 | pos=64
Step 20600 | epoch 1/2 | loss=4.9411 | avg=5.5314 | acc=17.2% | lr=9.77e-05 | pos=64
Step 20700 | epoch 1/2 | loss=5.1845 | avg=5.5182 | acc=17.2% | lr=9.76e-05 | pos=64
Step 20800 | epoch 1/2 | loss=4.4440 | avg=5.4811 | acc=18.4% | lr=9.76e-05 | pos=64
Step 20900 | epoch 1/2 | loss=6.9643 | avg=5.4939 | acc=11.6% | lr=9.75e-05 | pos=64
Step 21000 | epoch 1/2 | loss=7.3513 | avg=5.4954 | acc=4.7% | lr=9.74e-05 | pos=64
Step 21100 | epoch 1/2 | loss=4.9541 | avg=5.4434 | acc=13.4% | lr=9.74e-05 | pos=64
Step 21200 | epoch 1/2 | loss=5.8833 | avg=5.3946 | acc=10.6% | lr=9.73e-05 | pos=64
Step 21300 | epoch 1/2 | loss=4.7049 | avg=5.3664 | acc=17.5% | lr=9.73e-05 | pos=64
Step 21400 | epoch 1/2 | loss=7.0340 | avg=5.3853 | acc=3.8% | lr=9.72e-05 | pos=64
Step 21500 | epoch 1/2 | loss=4.7712 | avg=5.4051 | acc=14.4% | lr=9.71e-05 | pos=64
Step 21600 | epoch 1/2 | loss=4.1569 | avg=5.3882 | acc=20.3% | lr=9.71e-05 | pos=64
Step 21700 | epoch 1/2 | loss=4.9068 | avg=5.3744 | acc=17.5% | lr=9.70e-05 | pos=64
Step 21800 | epoch 1/2 | loss=5.4254 | avg=5.3602 | acc=11.6% | lr=9.69e-05 | pos=64
Step 21900 | epoch 1/2 | loss=6.2506 | avg=5.3689 | acc=14.1% | lr=9.68e-05 | pos=64
Step 22000 | epoch 1/2 | loss=5.2534 | avg=5.3725 | acc=14.7% | lr=9.68e-05 | pos=64
Step 22100 | epoch 1/2 | loss=4.6903 | avg=5.3125 | acc=18.1% | lr=9.67e-05 | pos=64
Step 22200 | epoch 1/2 | loss=4.0345 | avg=5.3496 | acc=17.2% | lr=9.66e-05 | pos=64
Step 22300 | epoch 1/2 | loss=5.4078 | avg=5.3544 | acc=15.6% | lr=9.66e-05 | pos=64
Step 22400 | epoch 1/2 | loss=6.7715 | avg=5.3836 | acc=8.1% | lr=9.65e-05 | pos=64
Step 22500 | epoch 1/2 | loss=9.3450 | avg=5.3618 | acc=8.8% | lr=9.64e-05 | pos=64
Step 22600 | epoch 1/2 | loss=6.1452 | avg=5.3786 | acc=12.8% | lr=9.63e-05 | pos=64
Step 22700 | epoch 1/2 | loss=6.4993 | avg=5.3722 | acc=10.0% | lr=9.63e-05 | pos=64
Step 22800 | epoch 1/2 | loss=6.7072 | avg=5.3715 | acc=7.5% | lr=9.62e-05 | pos=64
Step 22900 | epoch 1/2 | loss=5.6727 | avg=5.3747 | acc=16.9% | lr=9.61e-05 | pos=64
Step 23000 | epoch 1/2 | loss=4.0313 | avg=5.3578 | acc=27.5% | lr=9.60e-05 | pos=64
Step 23100 | epoch 1/2 | loss=6.6814 | avg=5.3814 | acc=9.1% | lr=9.60e-05 | pos=64
Step 23200 | epoch 1/2 | loss=3.7822 | avg=5.2563 | acc=26.6% | lr=9.59e-05 | pos=64
Step 23300 | epoch 1/2 | loss=5.5860 | avg=5.3178 | acc=25.9% | lr=9.58e-05 | pos=64
Step 23400 | epoch 1/2 | loss=3.8420 | avg=5.3004 | acc=25.6% | lr=9.57e-05 | pos=64
Step 23500 | epoch 1/2 | loss=4.1972 | avg=5.2734 | acc=21.9% | lr=9.56e-05 | pos=64
Step 23600 | epoch 1/2 | loss=4.7770 | avg=5.2667 | acc=18.8% | lr=9.55e-05 | pos=64
Step 23700 | epoch 1/2 | loss=5.3051 | avg=5.2498 | acc=17.8% | lr=9.55e-05 | pos=64
Step 23800 | epoch 1/2 | loss=5.1812 | avg=5.2774 | acc=13.1% | lr=9.54e-05 | pos=64
Step 23900 | epoch 1/2 | loss=5.8178 | avg=5.2822 | acc=16.6% | lr=9.53e-05 | pos=64
Step 24000 | epoch 1/2 | loss=4.4594 | avg=5.2729 | acc=17.8% | lr=9.52e-05 | pos=64
Step 24100 | epoch 1/2 | loss=4.0387 | avg=5.1920 | acc=20.9% | lr=9.51e-05 | pos=64
Step 24200 | epoch 1/2 | loss=6.8931 | avg=5.2629 | acc=9.1% | lr=9.50e-05 | pos=64
Step 24300 | epoch 1/2 | loss=4.7364 | avg=5.2967 | acc=22.5% | lr=9.50e-05 | pos=64
Step 24400 | epoch 1/2 | loss=4.4333 | avg=5.2412 | acc=23.8% | lr=9.49e-05 | pos=64
Step 24500 | epoch 1/2 | loss=4.4960 | avg=5.2340 | acc=23.4% | lr=9.48e-05 | pos=64
Step 24600 | epoch 1/2 | loss=4.1843 | avg=5.2443 | acc=23.4% | lr=9.47e-05 | pos=64
Step 24700 | epoch 1/2 | loss=7.4006 | avg=5.2936 | acc=6.9% | lr=9.46e-05 | pos=64
Step 24800 | epoch 1/2 | loss=3.6557 | avg=5.2591 | acc=28.4% | lr=9.45e-05 | pos=64
Step 24900 | epoch 1/2 | loss=4.9822 | avg=5.2392 | acc=17.5% | lr=9.44e-05 | pos=64
Step 25000 | epoch 1/2 | loss=4.4623 | avg=5.2394 | acc=18.1% | lr=9.43e-05 | pos=64
[Checkpoint] Saved step 25000 (loss=4.4623) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
[Save @ step 25000] loss=4.4623
[Checkpoint] Saved step 25000 (loss=4.4623) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step25000.pt
[Prune @ step 25000] zeroed 0.0M / 1407.4M (0.0% sparsity)
Step 25100 | epoch 1/2 | loss=3.3957 | avg=5.2251 | acc=32.2% | lr=9.42e-05 | pos=64
Step 25200 | epoch 1/2 | loss=3.5391 | avg=5.2375 | acc=27.5% | lr=9.41e-05 | pos=64
Step 25300 | epoch 1/2 | loss=4.9235 | avg=5.2656 | acc=15.6% | lr=9.40e-05 | pos=64
Step 25400 | epoch 1/2 | loss=5.1743 | avg=5.2758 | acc=15.6% | lr=9.39e-05 | pos=64
Step 25500 | epoch 1/2 | loss=3.6510 | avg=5.2463 | acc=25.3% | lr=9.39e-05 | pos=64
Step 25600 | epoch 1/2 | loss=5.4870 | avg=5.2134 | acc=11.9% | lr=9.38e-05 | pos=64
Step 25700 | epoch 1/2 | loss=3.2654 | avg=5.2024 | acc=38.4% | lr=9.37e-05 | pos=64
Step 25800 | epoch 1/2 | loss=4.1588 | avg=5.2047 | acc=21.9% | lr=9.36e-05 | pos=64
Step 25900 | epoch 1/2 | loss=3.7836 | avg=5.2088 | acc=23.8% | lr=9.35e-05 | pos=64
Step 26000 | epoch 1/2 | loss=4.3097 | avg=5.2034 | acc=20.3% | lr=9.34e-05 | pos=64
Step 26100 | epoch 1/2 | loss=4.6737 | avg=5.1301 | acc=19.7% | lr=9.33e-05 | pos=64
Step 26200 | epoch 1/2 | loss=3.9339 | avg=5.2112 | acc=26.2% | lr=9.32e-05 | pos=64
Step 26300 | epoch 1/2 | loss=6.8034 | avg=5.1860 | acc=11.6% | lr=9.31e-05 | pos=64
Step 26400 | epoch 1/2 | loss=5.6778 | avg=5.1827 | acc=12.2% | lr=9.30e-05 | pos=64
Step 26500 | epoch 1/2 | loss=5.0070 | avg=5.2093 | acc=18.1% | lr=9.29e-05 | pos=64
Step 26600 | epoch 1/2 | loss=6.1985 | avg=5.1966 | acc=13.1% | lr=9.28e-05 | pos=64
Step 26700 | epoch 1/2 | loss=5.7865 | avg=5.2021 | acc=13.1% | lr=9.26e-05 | pos=64
Step 26800 | epoch 1/2 | loss=4.6918 | avg=5.1976 | acc=15.6% | lr=9.25e-05 | pos=64
Step 26900 | epoch 1/2 | loss=6.2116 | avg=5.1911 | acc=11.9% | lr=9.24e-05 | pos=64
Step 27000 | epoch 1/2 | loss=3.0124 | avg=5.1775 | acc=31.2% | lr=9.23e-05 | pos=64
Step 27100 | epoch 1/2 | loss=4.9378 | avg=5.2475 | acc=13.1% | lr=9.22e-05 | pos=64
Step 27200 | epoch 1/2 | loss=4.4908 | avg=5.1403 | acc=20.3% | lr=9.21e-05 | pos=64
Step 27300 | epoch 1/2 | loss=4.0158 | avg=5.1399 | acc=25.9% | lr=9.20e-05 | pos=64
Step 27400 | epoch 1/2 | loss=5.7095 | avg=5.1167 | acc=13.1% | lr=9.19e-05 | pos=64
Step 27500 | epoch 1/2 | loss=6.7299 | avg=5.1009 | acc=10.9% | lr=9.18e-05 | pos=64
Step 27600 | epoch 1/2 | loss=5.1221 | avg=5.0998 | acc=13.1% | lr=9.17e-05 | pos=64
Step 27700 | epoch 1/2 | loss=5.4922 | avg=5.1194 | acc=12.5% | lr=9.16e-05 | pos=64
Step 27800 | epoch 1/2 | loss=5.9491 | avg=5.1337 | acc=13.4% | lr=9.15e-05 | pos=64
Step 27900 | epoch 1/2 | loss=4.2654 | avg=5.1359 | acc=24.4% | lr=9.13e-05 | pos=64
Step 28000 | epoch 1/2 | loss=5.3780 | avg=5.1356 | acc=16.6% | lr=9.12e-05 | pos=64
Step 28100 | epoch 1/2 | loss=5.6094 | avg=4.9985 | acc=13.4% | lr=9.11e-05 | pos=64
Step 28200 | epoch 1/2 | loss=4.0248 | avg=5.1104 | acc=32.8% | lr=9.10e-05 | pos=64
Step 28300 | epoch 1/2 | loss=4.5946 | avg=5.1675 | acc=15.3% | lr=9.09e-05 | pos=64
Step 28400 | epoch 1/2 | loss=6.2588 | avg=5.1339 | acc=13.4% | lr=9.08e-05 | pos=64
Step 28500 | epoch 1/2 | loss=5.9369 | avg=5.1111 | acc=9.1% | lr=9.07e-05 | pos=64
Step 28600 | epoch 1/2 | loss=7.0753 | avg=5.1176 | acc=10.6% | lr=9.05e-05 | pos=64
Step 28700 | epoch 1/2 | loss=4.8857 | avg=5.1273 | acc=14.4% | lr=9.04e-05 | pos=64
Step 28800 | epoch 1/2 | loss=4.1414 | avg=5.1216 | acc=25.6% | lr=9.03e-05 | pos=64
Step 28900 | epoch 1/2 | loss=5.8579 | avg=5.1102 | acc=11.9% | lr=9.02e-05 | pos=64
Step 29000 | epoch 1/2 | loss=5.0406 | avg=5.1018 | acc=15.6% | lr=9.01e-05 | pos=64
Step 29100 | epoch 1/2 | loss=5.7378 | avg=4.9941 | acc=12.2% | lr=9.00e-05 | pos=64
Step 29200 | epoch 1/2 | loss=5.6251 | avg=5.0211 | acc=12.5% | lr=8.98e-05 | pos=64
Step 29300 | epoch 1/2 | loss=4.2895 | avg=4.9873 | acc=19.1% | lr=8.97e-05 | pos=64
Step 29400 | epoch 1/2 | loss=5.7916 | avg=5.0025 | acc=15.6% | lr=8.96e-05 | pos=64
Step 29500 | epoch 1/2 | loss=4.0017 | avg=5.0211 | acc=17.8% | lr=8.95e-05 | pos=64
Step 29600 | epoch 1/2 | loss=5.8437 | avg=5.0314 | acc=12.2% | lr=8.93e-05 | pos=64
Step 29700 | epoch 1/2 | loss=4.3955 | avg=5.0171 | acc=20.6% | lr=8.92e-05 | pos=64
Step 29800 | epoch 1/2 | loss=5.0815 | avg=5.0323 | acc=20.0% | lr=8.91e-05 | pos=64
Step 29900 | epoch 1/2 | loss=4.6394 | avg=5.0057 | acc=24.4% | lr=8.90e-05 | pos=64
Step 30000 | epoch 1/2 | loss=5.7835 | avg=5.0147 | acc=11.2% | lr=8.89e-05 | pos=64
[Checkpoint] Saved step 30000 (loss=5.7835) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
[Save @ step 30000] loss=5.7835
[Checkpoint] Saved step 30000 (loss=5.7835) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step30000.pt
[Prune @ step 30000] zeroed 0.0M / 1407.4M (0.0% sparsity)
Step 30100 | epoch 1/2 | loss=4.7267 | avg=4.8656 | acc=21.9% | lr=8.87e-05 | pos=64
Step 30200 | epoch 1/2 | loss=4.4325 | avg=4.9138 | acc=22.2% | lr=8.86e-05 | pos=64
Step 30300 | epoch 1/2 | loss=4.8922 | avg=4.9353 | acc=15.3% | lr=8.85e-05 | pos=64
Step 30400 | epoch 1/2 | loss=4.9547 | avg=4.9822 | acc=15.9% | lr=8.83e-05 | pos=64
Step 30500 | epoch 1/2 | loss=5.1371 | avg=4.9771 | acc=12.8% | lr=8.82e-05 | pos=64
Step 30600 | epoch 1/2 | loss=3.7506 | avg=5.0036 | acc=26.6% | lr=8.81e-05 | pos=64
Step 30700 | epoch 1/2 | loss=5.1478 | avg=4.9848 | acc=18.1% | lr=8.80e-05 | pos=64
Step 30800 | epoch 1/2 | loss=4.3728 | avg=4.9735 | acc=23.4% | lr=8.78e-05 | pos=64
Step 30900 | epoch 1/2 | loss=5.3286 | avg=4.9876 | acc=17.8% | lr=8.77e-05 | pos=64
Step 31000 | epoch 1/2 | loss=4.2759 | avg=4.9893 | acc=20.0% | lr=8.76e-05 | pos=64
Step 31100 | epoch 1/2 | loss=5.9771 | avg=5.1014 | acc=14.4% | lr=8.74e-05 | pos=64
Step 31200 | epoch 1/2 | loss=4.1367 | avg=5.0642 | acc=33.1% | lr=8.73e-05 | pos=64
Step 31300 | epoch 1/2 | loss=5.9808 | avg=4.9843 | acc=10.6% | lr=8.72e-05 | pos=64
Step 31400 | epoch 1/2 | loss=4.9826 | avg=4.9561 | acc=14.4% | lr=8.70e-05 | pos=64
Step 31500 | epoch 1/2 | loss=4.0148 | avg=4.9658 | acc=23.1% | lr=8.69e-05 | pos=64
Step 31600 | epoch 1/2 | loss=3.3919 | avg=4.9370 | acc=26.9% | lr=8.68e-05 | pos=64
Step 31700 | epoch 1/2 | loss=4.8560 | avg=4.9134 | acc=18.8% | lr=8.66e-05 | pos=64
Step 31800 | epoch 1/2 | loss=5.1120 | avg=4.9220 | acc=13.8% | lr=8.65e-05 | pos=64
Step 31900 | epoch 1/2 | loss=6.1961 | avg=4.9327 | acc=13.4% | lr=8.64e-05 | pos=64
Step 32000 | epoch 1/2 | loss=4.5527 | avg=4.9425 | acc=23.8% | lr=8.62e-05 | pos=64
Step 32100 | epoch 1/2 | loss=4.9468 | avg=5.1283 | acc=18.4% | lr=8.61e-05 | pos=64
Step 32200 | epoch 1/2 | loss=3.7239 | avg=4.9893 | acc=28.1% | lr=8.59e-05 | pos=64
Step 32300 | epoch 1/2 | loss=5.6031 | avg=4.9943 | acc=9.1% | lr=8.58e-05 | pos=64
Step 32400 | epoch 1/2 | loss=3.6938 | avg=5.0201 | acc=28.1% | lr=8.57e-05 | pos=64
Step 32500 | epoch 1/2 | loss=4.3661 | avg=5.0048 | acc=24.1% | lr=8.55e-05 | pos=64
Step 32600 | epoch 1/2 | loss=4.6400 | avg=4.9678 | acc=14.1% | lr=8.54e-05 | pos=64
Step 32700 | epoch 1/2 | loss=5.0756 | avg=4.9601 | acc=16.2% | lr=8.52e-05 | pos=64
Step 32800 | epoch 1/2 | loss=4.4300 | avg=4.9512 | acc=21.9% | lr=8.51e-05 | pos=64
Step 32900 | epoch 1/2 | loss=5.4190 | avg=4.9723 | acc=13.1% | lr=8.50e-05 | pos=64
Step 33000 | epoch 1/2 | loss=4.1838 | avg=4.9546 | acc=27.2% | lr=8.48e-05 | pos=64
Step 33100 | epoch 1/2 | loss=5.1738 | avg=4.8528 | acc=13.8% | lr=8.47e-05 | pos=64
Step 33200 | epoch 1/2 | loss=6.6131 | avg=5.0019 | acc=10.3% | lr=8.45e-05 | pos=64
Step 33300 | epoch 1/2 | loss=4.0026 | avg=4.9887 | acc=24.1% | lr=8.44e-05 | pos=64
Step 33400 | epoch 1/2 | loss=5.3191 | avg=4.9493 | acc=17.2% | lr=8.42e-05 | pos=64
Step 33500 | epoch 1/2 | loss=6.1506 | avg=4.9538 | acc=12.8% | lr=8.41e-05 | pos=64
Step 33600 | epoch 1/2 | loss=4.4988 | avg=4.9433 | acc=18.8% | lr=8.40e-05 | pos=64
Step 33700 | epoch 1/2 | loss=4.9283 | avg=4.9385 | acc=18.1% | lr=8.38e-05 | pos=64
Step 33800 | epoch 1/2 | loss=3.6502 | avg=4.9370 | acc=31.2% | lr=8.37e-05 | pos=64
Step 33900 | epoch 1/2 | loss=5.3868 | avg=4.9375 | acc=15.9% | lr=8.35e-05 | pos=64
Step 34000 | epoch 1/2 | loss=4.7499 | avg=4.9267 | acc=20.9% | lr=8.34e-05 | pos=64
Step 34100 | epoch 1/2 | loss=4.0668 | avg=4.8603 | acc=25.0% | lr=8.32e-05 | pos=64
Step 34200 | epoch 1/2 | loss=6.0244 | avg=4.7980 | acc=13.8% | lr=8.31e-05 | pos=64
Step 34300 | epoch 1/2 | loss=6.1788 | avg=4.9079 | acc=11.2% | lr=8.29e-05 | pos=64
Step 34400 | epoch 1/2 | loss=4.1456 | avg=4.8985 | acc=25.9% | lr=8.28e-05 | pos=64
Step 34500 | epoch 1/2 | loss=4.1256 | avg=4.8664 | acc=22.5% | lr=8.26e-05 | pos=64
Step 34600 | epoch 1/2 | loss=3.3021 | avg=4.8585 | acc=31.2% | lr=8.25e-05 | pos=64
Step 34700 | epoch 1/2 | loss=4.5752 | avg=4.8328 | acc=21.9% | lr=8.23e-05 | pos=64
Step 34800 | epoch 1/2 | loss=4.3158 | avg=4.8388 | acc=22.8% | lr=8.22e-05 | pos=64
Step 34900 | epoch 1/2 | loss=4.7157 | avg=4.8370 | acc=20.9% | lr=8.20e-05 | pos=64
Step 35000 | epoch 1/2 | loss=4.5456 | avg=4.8354 | acc=22.2% | lr=8.19e-05 | pos=64
[Checkpoint] Saved step 35000 (loss=4.5456) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
[Save @ step 35000] loss=4.5456
[Checkpoint] Saved step 35000 (loss=4.5456) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step35000.pt
[Prune @ step 35000] zeroed 0.0M / 1407.4M (0.0% sparsity)
Step 35100 | epoch 1/2 | loss=4.8744 | avg=4.8083 | acc=19.4% | lr=8.17e-05 | pos=64
Step 35200 | epoch 1/2 | loss=4.0071 | avg=4.8246 | acc=29.7% | lr=8.16e-05 | pos=64
Step 35300 | epoch 1/2 | loss=5.6044 | avg=4.7499 | acc=12.5% | lr=8.14e-05 | pos=64
Step 35400 | epoch 1/2 | loss=4.2247 | avg=4.7446 | acc=28.1% | lr=8.13e-05 | pos=64
Step 35500 | epoch 1/2 | loss=3.1700 | avg=4.7470 | acc=33.1% | lr=8.11e-05 | pos=64
Step 35600 | epoch 1/2 | loss=6.0358 | avg=4.7393 | acc=8.1% | lr=8.09e-05 | pos=64
Step 35700 | epoch 1/2 | loss=5.0337 | avg=4.7483 | acc=13.8% | lr=8.08e-05 | pos=64
Step 35800 | epoch 1/2 | loss=4.3446 | avg=4.7279 | acc=18.1% | lr=8.06e-05 | pos=64
Step 35900 | epoch 1/2 | loss=6.4008 | avg=4.7461 | acc=7.5% | lr=8.05e-05 | pos=64
Step 36000 | epoch 1/2 | loss=4.2510 | avg=4.7567 | acc=18.8% | lr=8.03e-05 | pos=64
Step 36100 | epoch 1/2 | loss=3.6081 | avg=4.7084 | acc=31.2% | lr=8.02e-05 | pos=64
Step 36200 | epoch 1/2 | loss=4.5875 | avg=4.7817 | acc=16.2% | lr=8.00e-05 | pos=64
Step 36300 | epoch 1/2 | loss=5.6620 | avg=4.8110 | acc=15.9% | lr=7.98e-05 | pos=64
Step 36400 | epoch 1/2 | loss=2.7728 | avg=4.8058 | acc=34.7% | lr=7.97e-05 | pos=64
Step 36500 | epoch 1/2 | loss=3.4039 | avg=4.7638 | acc=33.1% | lr=7.95e-05 | pos=64
Step 36600 | epoch 1/2 | loss=5.3272 | avg=4.7858 | acc=17.2% | lr=7.94e-05 | pos=64
Step 36700 | epoch 1/2 | loss=5.1757 | avg=4.7830 | acc=16.2% | lr=7.92e-05 | pos=64
Step 36800 | epoch 1/2 | loss=4.8154 | avg=4.7898 | acc=17.2% | lr=7.90e-05 | pos=64
Step 36900 | epoch 1/2 | loss=3.7366 | avg=4.7910 | acc=23.1% | lr=7.89e-05 | pos=64
Step 37000 | epoch 1/2 | loss=6.1341 | avg=4.7940 | acc=8.1% | lr=7.87e-05 | pos=64
Step 37100 | epoch 1/2 | loss=3.4661 | avg=4.7218 | acc=28.1% | lr=7.86e-05 | pos=64
Step 37200 | epoch 1/2 | loss=5.7530 | avg=4.7379 | acc=7.2% | lr=7.84e-05 | pos=64
Step 37300 | epoch 1/2 | loss=4.6459 | avg=4.7362 | acc=18.1% | lr=7.82e-05 | pos=64
Step 37400 | epoch 1/2 | loss=5.7151 | avg=4.7266 | acc=13.8% | lr=7.81e-05 | pos=64
Step 37500 | epoch 1/2 | loss=5.3537 | avg=4.7269 | acc=17.2% | lr=7.79e-05 | pos=64
Step 37600 | epoch 1/2 | loss=4.1849 | avg=4.7563 | acc=27.8% | lr=7.77e-05 | pos=64
Step 37700 | epoch 1/2 | loss=5.0910 | avg=4.7244 | acc=16.9% | lr=7.76e-05 | pos=64
Step 37800 | epoch 1/2 | loss=4.2421 | avg=4.7285 | acc=26.2% | lr=7.74e-05 | pos=64
Step 37900 | epoch 1/2 | loss=4.9375 | avg=4.7565 | acc=14.4% | lr=7.72e-05 | pos=64
Step 38000 | epoch 1/2 | loss=4.2003 | avg=4.7620 | acc=25.6% | lr=7.71e-05 | pos=64
Step 38100 | epoch 1/2 | loss=4.7036 | avg=4.6738 | acc=20.3% | lr=7.69e-05 | pos=64
Step 38200 | epoch 1/2 | loss=4.3345 | avg=4.6845 | acc=31.9% | lr=7.68e-05 | pos=64
Step 38300 | epoch 1/2 | loss=4.1513 | avg=4.7093 | acc=20.3% | lr=7.66e-05 | pos=64
Step 38400 | epoch 1/2 | loss=5.4202 | avg=4.6723 | acc=9.4% | lr=7.64e-05 | pos=64
Step 38500 | epoch 1/2 | loss=3.9058 | avg=4.6523 | acc=22.5% | lr=7.62e-05 | pos=64
Step 38600 | epoch 1/2 | loss=5.6458 | avg=4.6797 | acc=14.4% | lr=7.61e-05 | pos=64
Step 38700 | epoch 1/2 | loss=6.4054 | avg=4.6650 | acc=14.7% | lr=7.59e-05 | pos=64
Step 38800 | epoch 1/2 | loss=3.6383 | avg=4.6586 | acc=27.5% | lr=7.57e-05 | pos=64
Step 38900 | epoch 1/2 | loss=5.0023 | avg=4.6817 | acc=12.5% | lr=7.56e-05 | pos=64
Step 39000 | epoch 1/2 | loss=5.0706 | avg=4.6753 | acc=15.6% | lr=7.54e-05 | pos=64
Step 39100 | epoch 1/2 | loss=4.5618 | avg=4.7584 | acc=21.6% | lr=7.52e-05 | pos=64
Step 39200 | epoch 1/2 | loss=6.4242 | avg=4.7358 | acc=10.3% | lr=7.51e-05 | pos=64
Step 39300 | epoch 1/2 | loss=5.8426 | avg=4.7251 | acc=15.6% | lr=7.49e-05 | pos=64
Step 39400 | epoch 1/2 | loss=3.8919 | avg=4.7502 | acc=27.8% | lr=7.47e-05 | pos=64
Step 39500 | epoch 1/2 | loss=4.7654 | avg=4.7497 | acc=14.4% | lr=7.46e-05 | pos=64
Step 39600 | epoch 1/2 | loss=5.4378 | avg=4.7354 | acc=14.4% | lr=7.44e-05 | pos=64
Step 39700 | epoch 1/2 | loss=4.6643 | avg=4.7324 | acc=17.8% | lr=7.42e-05 | pos=64
Step 39800 | epoch 1/2 | loss=4.2357 | avg=4.7084 | acc=20.9% | lr=7.40e-05 | pos=64
Step 39900 | epoch 1/2 | loss=3.6500 | avg=4.6791 | acc=25.6% | lr=7.39e-05 | pos=64
Step 40000 | epoch 1/2 | loss=7.7348 | avg=4.6723 | acc=5.9% | lr=7.37e-05 | pos=64
[Checkpoint] Saved step 40000 (loss=7.7348) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
[Save @ step 40000] loss=7.7348
[Checkpoint] Saved step 40000 (loss=7.7348) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step40000.pt
[Prune @ step 40000] zeroed 0.0M / 1407.4M (0.0% sparsity)
Step 40100 | epoch 1/2 | loss=3.5851 | avg=4.5714 | acc=29.4% | lr=7.35e-05 | pos=64
Step 40200 | epoch 1/2 | loss=4.0951 | avg=4.6517 | acc=19.7% | lr=7.33e-05 | pos=64
Step 40300 | epoch 1/2 | loss=2.4417 | avg=4.6563 | acc=44.4% | lr=7.32e-05 | pos=64
Step 40400 | epoch 1/2 | loss=3.8633 | avg=4.6643 | acc=30.6% | lr=7.30e-05 | pos=64
Step 40500 | epoch 1/2 | loss=3.5076 | avg=4.6672 | acc=24.1% | lr=7.28e-05 | pos=64
Step 40600 | epoch 1/2 | loss=4.5050 | avg=4.6642 | acc=26.6% | lr=7.26e-05 | pos=64
Step 40700 | epoch 1/2 | loss=4.4809 | avg=4.6755 | acc=26.2% | lr=7.25e-05 | pos=64
Step 40800 | epoch 1/2 | loss=4.7991 | avg=4.6569 | acc=18.8% | lr=7.23e-05 | pos=64
Step 40900 | epoch 1/2 | loss=4.5394 | avg=4.6263 | acc=19.4% | lr=7.21e-05 | pos=64
Step 41000 | epoch 1/2 | loss=6.2422 | avg=4.6316 | acc=10.6% | lr=7.19e-05 | pos=64
Step 41100 | epoch 1/2 | loss=4.1456 | avg=4.7179 | acc=23.4% | lr=7.18e-05 | pos=64
Step 41200 | epoch 1/2 | loss=4.6278 | avg=4.7067 | acc=20.3% | lr=7.16e-05 | pos=64
Step 41300 | epoch 1/2 | loss=3.8254 | avg=4.7079 | acc=27.2% | lr=7.14e-05 | pos=64
Step 41400 | epoch 1/2 | loss=4.6589 | avg=4.6779 | acc=15.0% | lr=7.12e-05 | pos=64
Step 41500 | epoch 1/2 | loss=5.2391 | avg=4.6726 | acc=15.6% | lr=7.11e-05 | pos=64
Step 41600 | epoch 1/2 | loss=3.8890 | avg=4.6655 | acc=25.0% | lr=7.09e-05 | pos=64
Step 41700 | epoch 1/2 | loss=4.5508 | avg=4.6545 | acc=24.1% | lr=7.07e-05 | pos=64
Step 41800 | epoch 1/2 | loss=4.3258 | avg=4.6345 | acc=17.2% | lr=7.05e-05 | pos=64
Step 41900 | epoch 1/2 | loss=4.4810 | avg=4.6326 | acc=15.3% | lr=7.03e-05 | pos=64
Step 42000 | epoch 1/2 | loss=5.7353 | avg=4.6313 | acc=9.7% | lr=7.02e-05 | pos=64
Step 42100 | epoch 1/2 | loss=4.5156 | avg=4.4535 | acc=16.2% | lr=7.00e-05 | pos=64
Step 42200 | epoch 1/2 | loss=4.3565 | avg=4.4592 | acc=19.1% | lr=6.98e-05 | pos=64
Step 42300 | epoch 1/2 | loss=3.0806 | avg=4.4856 | acc=30.6% | lr=6.96e-05 | pos=64
Step 42400 | epoch 1/2 | loss=3.4895 | avg=4.5476 | acc=26.2% | lr=6.94e-05 | pos=64
Step 42500 | epoch 1/2 | loss=5.0846 | avg=4.5426 | acc=16.6% | lr=6.93e-05 | pos=64
Step 42600 | epoch 1/2 | loss=4.1276 | avg=4.5449 | acc=20.9% | lr=6.91e-05 | pos=64
Step 42700 | epoch 1/2 | loss=5.2457 | avg=4.5581 | acc=16.2% | lr=6.89e-05 | pos=64
Step 42800 | epoch 1/2 | loss=5.6974 | avg=4.5725 | acc=21.9% | lr=6.87e-05 | pos=64
Step 42900 | epoch 1/2 | loss=5.6322 | avg=4.5719 | acc=10.9% | lr=6.85e-05 | pos=64
Step 43000 | epoch 1/2 | loss=2.8729 | avg=4.5562 | acc=40.0% | lr=6.84e-05 | pos=64
Step 43100 | epoch 1/2 | loss=5.8592 | avg=4.4284 | acc=10.6% | lr=6.82e-05 | pos=64
Step 43200 | epoch 1/2 | loss=4.0402 | avg=4.4922 | acc=23.8% | lr=6.80e-05 | pos=64
Step 43300 | epoch 1/2 | loss=3.9593 | avg=4.5362 | acc=23.4% | lr=6.78e-05 | pos=64
Step 43400 | epoch 1/2 | loss=4.9662 | avg=4.4911 | acc=16.2% | lr=6.76e-05 | pos=64
Step 43500 | epoch 1/2 | loss=4.2632 | avg=4.4986 | acc=24.7% | lr=6.74e-05 | pos=64
Step 43600 | epoch 1/2 | loss=3.8268 | avg=4.5082 | acc=23.1% | lr=6.73e-05 | pos=64
Step 43700 | epoch 1/2 | loss=2.9263 | avg=4.5108 | acc=31.6% | lr=6.71e-05 | pos=64
Step 43800 | epoch 1/2 | loss=4.2181 | avg=4.5107 | acc=25.3% | lr=6.69e-05 | pos=64
Step 43900 | epoch 1/2 | loss=2.0058 | avg=4.5230 | acc=47.5% | lr=6.67e-05 | pos=64
Step 44000 | epoch 1/2 | loss=3.8730 | avg=4.5080 | acc=18.1% | lr=6.65e-05 | pos=64
Step 44100 | epoch 1/2 | loss=3.9568 | avg=4.4506 | acc=26.6% | lr=6.63e-05 | pos=64
Step 44200 | epoch 1/2 | loss=3.9711 | avg=4.4644 | acc=22.2% | lr=6.62e-05 | pos=64
Step 44300 | epoch 1/2 | loss=3.6343 | avg=4.4960 | acc=29.7% | lr=6.60e-05 | pos=64
Step 44400 | epoch 1/2 | loss=3.9971 | avg=4.4870 | acc=28.7% | lr=6.58e-05 | pos=64
Step 44500 | epoch 1/2 | loss=3.7042 | avg=4.4727 | acc=27.5% | lr=6.56e-05 | pos=64
Step 44600 | epoch 1/2 | loss=4.0212 | avg=4.4760 | acc=26.2% | lr=6.54e-05 | pos=64
Step 44700 | epoch 1/2 | loss=3.5272 | avg=4.4681 | acc=28.7% | lr=6.52e-05 | pos=64
Step 44800 | epoch 1/2 | loss=6.0561 | avg=4.4643 | acc=10.9% | lr=6.50e-05 | pos=64
Step 44900 | epoch 1/2 | loss=4.6864 | avg=4.4690 | acc=18.8% | lr=6.49e-05 | pos=64
Step 45000 | epoch 1/2 | loss=7.1617 | avg=4.4785 | acc=10.6% | lr=6.47e-05 | pos=64
[Checkpoint] Saved step 45000 (loss=7.1617) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
[Save @ step 45000] loss=7.1617
[Checkpoint] Saved step 45000 (loss=7.1617) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step45000.pt
[Prune @ step 45000] zeroed 0.0M / 1407.4M (0.0% sparsity)
Step 45100 | epoch 1/2 | loss=3.2811 | avg=4.3899 | acc=34.7% | lr=6.45e-05 | pos=64
Step 45200 | epoch 1/2 | loss=4.6385 | avg=4.5053 | acc=21.6% | lr=6.43e-05 | pos=64
Step 45300 | epoch 1/2 | loss=3.3678 | avg=4.5298 | acc=24.1% | lr=6.41e-05 | pos=64
Step 45400 | epoch 1/2 | loss=4.6456 | avg=4.5162 | acc=23.8% | lr=6.39e-05 | pos=64
Step 45500 | epoch 1/2 | loss=4.6782 | avg=4.4555 | acc=21.9% | lr=6.37e-05 | pos=64
Step 45600 | epoch 1/2 | loss=5.1122 | avg=4.4485 | acc=16.2% | lr=6.36e-05 | pos=64
Step 45700 | epoch 1/2 | loss=4.5492 | avg=4.4482 | acc=18.1% | lr=6.34e-05 | pos=64
Step 45800 | epoch 1/2 | loss=4.5077 | avg=4.4446 | acc=20.0% | lr=6.32e-05 | pos=64
Step 45900 | epoch 1/2 | loss=5.6127 | avg=4.4549 | acc=13.1% | lr=6.30e-05 | pos=64
Step 46000 | epoch 1/2 | loss=6.9352 | avg=4.4635 | acc=7.2% | lr=6.28e-05 | pos=64
Step 46100 | epoch 1/2 | loss=3.1496 | avg=4.4735 | acc=36.9% | lr=6.26e-05 | pos=64
--- Epoch 1/2 complete (step 46122) ---
Step 46200 | epoch 2/2 | loss=4.6651 | avg=4.5114 | acc=15.6% | lr=6.24e-05 | pos=64
Step 46300 | epoch 2/2 | loss=5.0748 | avg=4.5066 | acc=18.1% | lr=6.22e-05 | pos=64
Step 46400 | epoch 2/2 | loss=5.6306 | avg=4.4797 | acc=15.0% | lr=6.21e-05 | pos=64
Step 46500 | epoch 2/2 | loss=6.4561 | avg=4.4427 | acc=13.8% | lr=6.19e-05 | pos=64
Step 46600 | epoch 2/2 | loss=3.1082 | avg=4.3921 | acc=35.6% | lr=6.17e-05 | pos=64
Step 46700 | epoch 2/2 | loss=2.9550 | avg=4.3908 | acc=35.6% | lr=6.15e-05 | pos=64
Step 46800 | epoch 2/2 | loss=4.7431 | avg=4.3802 | acc=17.2% | lr=6.13e-05 | pos=64
Step 46900 | epoch 2/2 | loss=3.6641 | avg=4.3776 | acc=30.0% | lr=6.11e-05 | pos=64
Step 47000 | epoch 2/2 | loss=4.2720 | avg=4.3821 | acc=23.4% | lr=6.09e-05 | pos=64
Step 47100 | epoch 2/2 | loss=4.5297 | avg=4.0870 | acc=24.4% | lr=6.07e-05 | pos=64
Step 47200 | epoch 2/2 | loss=3.6537 | avg=4.2045 | acc=24.7% | lr=6.05e-05 | pos=64
Step 47300 | epoch 2/2 | loss=3.0951 | avg=4.2407 | acc=30.3% | lr=6.04e-05 | pos=64
Step 47400 | epoch 2/2 | loss=6.0267 | avg=4.3068 | acc=11.6% | lr=6.02e-05 | pos=64
Step 47500 | epoch 2/2 | loss=6.2363 | avg=4.3007 | acc=11.9% | lr=6.00e-05 | pos=64
Step 47600 | epoch 2/2 | loss=3.5052 | avg=4.3228 | acc=30.6% | lr=5.98e-05 | pos=64
Step 47700 | epoch 2/2 | loss=5.3205 | avg=4.3406 | acc=13.4% | lr=5.96e-05 | pos=64
Step 47800 | epoch 2/2 | loss=4.6430 | avg=4.3462 | acc=25.0% | lr=5.94e-05 | pos=32
Step 47900 | epoch 2/2 | loss=5.5823 | avg=4.3542 | acc=16.2% | lr=5.92e-05 | pos=64
Step 48000 | epoch 2/2 | loss=5.0836 | avg=4.3626 | acc=13.4% | lr=5.90e-05 | pos=64
Step 48100 | epoch 2/2 | loss=4.8668 | avg=4.1157 | acc=20.6% | lr=5.88e-05 | pos=64
Step 48200 | epoch 2/2 | loss=3.2015 | avg=4.1707 | acc=34.4% | lr=5.87e-05 | pos=64
Step 48300 | epoch 2/2 | loss=5.3611 | avg=4.2596 | acc=13.1% | lr=5.85e-05 | pos=64
Step 48400 | epoch 2/2 | loss=4.4880 | avg=4.3157 | acc=14.1% | lr=5.83e-05 | pos=64
Step 48500 | epoch 2/2 | loss=5.8044 | avg=4.3075 | acc=12.5% | lr=5.81e-05 | pos=64
Step 48600 | epoch 2/2 | loss=5.0874 | avg=4.3231 | acc=16.2% | lr=5.79e-05 | pos=64
Step 48700 | epoch 2/2 | loss=5.7498 | avg=4.3361 | acc=21.9% | lr=5.77e-05 | pos=64
Step 48800 | epoch 2/2 | loss=4.2398 | avg=4.3508 | acc=17.5% | lr=5.75e-05 | pos=64
Step 48900 | epoch 2/2 | loss=4.4350 | avg=4.3526 | acc=21.9% | lr=5.73e-05 | pos=64
Step 49000 | epoch 2/2 | loss=5.5366 | avg=4.3496 | acc=13.1% | lr=5.71e-05 | pos=64
Step 49100 | epoch 2/2 | loss=4.8387 | avg=4.4597 | acc=22.2% | lr=5.69e-05 | pos=64
Step 49200 | epoch 2/2 | loss=4.5019 | avg=4.3805 | acc=14.4% | lr=5.68e-05 | pos=64
Step 49300 | epoch 2/2 | loss=3.1210 | avg=4.3799 | acc=33.1% | lr=5.66e-05 | pos=64
Step 49400 | epoch 2/2 | loss=6.9753 | avg=4.4128 | acc=10.0% | lr=5.64e-05 | pos=64
Step 49500 | epoch 2/2 | loss=3.5888 | avg=4.4103 | acc=28.1% | lr=5.62e-05 | pos=64
Step 49600 | epoch 2/2 | loss=5.8356 | avg=4.3700 | acc=12.2% | lr=5.60e-05 | pos=64
Step 49700 | epoch 2/2 | loss=5.1198 | avg=4.3594 | acc=18.8% | lr=5.58e-05 | pos=64
Step 49800 | epoch 2/2 | loss=4.5969 | avg=4.3558 | acc=23.8% | lr=5.56e-05 | pos=64
Step 49900 | epoch 2/2 | loss=3.4335 | avg=4.3543 | acc=28.4% | lr=5.54e-05 | pos=64
Step 50000 | epoch 2/2 | loss=4.0635 | avg=4.3603 | acc=22.8% | lr=5.52e-05 | pos=64
[Checkpoint] Saved step 50000 (loss=4.0635) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
[Save @ step 50000] loss=4.0635
[Checkpoint] Saved step 50000 (loss=4.0635) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step50000.pt
[Prune @ step 50000] zeroed 0.0M / 1407.4M (0.0% sparsity)
Step 50100 | epoch 2/2 | loss=5.2289 | avg=4.6177 | acc=14.1% | lr=5.50e-05 | pos=64
Step 50200 | epoch 2/2 | loss=2.9311 | avg=4.4205 | acc=30.0% | lr=5.49e-05 | pos=64
Step 50300 | epoch 2/2 | loss=3.6592 | avg=4.3510 | acc=20.6% | lr=5.47e-05 | pos=64
Step 50400 | epoch 2/2 | loss=3.3800 | avg=4.3341 | acc=30.3% | lr=5.45e-05 | pos=64
Step 50500 | epoch 2/2 | loss=4.5648 | avg=4.3210 | acc=18.8% | lr=5.43e-05 | pos=64
Step 50600 | epoch 2/2 | loss=2.6815 | avg=4.3175 | acc=40.6% | lr=5.41e-05 | pos=64
Step 50700 | epoch 2/2 | loss=2.7688 | avg=4.3010 | acc=38.4% | lr=5.39e-05 | pos=64
Step 50800 | epoch 2/2 | loss=3.5650 | avg=4.2845 | acc=31.6% | lr=5.37e-05 | pos=64
Step 50900 | epoch 2/2 | loss=4.3807 | avg=4.2764 | acc=22.2% | lr=5.35e-05 | pos=64
Step 51000 | epoch 2/2 | loss=6.0428 | avg=4.2931 | acc=13.1% | lr=5.33e-05 | pos=64
Step 51100 | epoch 2/2 | loss=3.4159 | avg=4.4274 | acc=28.1% | lr=5.31e-05 | pos=64
Step 51200 | epoch 2/2 | loss=3.6057 | avg=4.3325 | acc=27.5% | lr=5.29e-05 | pos=64
Step 51300 | epoch 2/2 | loss=3.0823 | avg=4.2624 | acc=31.2% | lr=5.28e-05 | pos=64
Step 51400 | epoch 2/2 | loss=4.4353 | avg=4.2728 | acc=19.3% | lr=5.26e-05 | pos=59
Step 51500 | epoch 2/2 | loss=3.2198 | avg=4.2668 | acc=30.3% | lr=5.24e-05 | pos=64
Step 51600 | epoch 2/2 | loss=5.9554 | avg=4.2633 | acc=12.5% | lr=5.22e-05 | pos=64
Step 51700 | epoch 2/2 | loss=4.5542 | avg=4.2599 | acc=21.6% | lr=5.20e-05 | pos=64
Step 51800 | epoch 2/2 | loss=2.9085 | avg=4.2548 | acc=38.4% | lr=5.18e-05 | pos=64
Step 51900 | epoch 2/2 | loss=3.8177 | avg=4.2541 | acc=33.1% | lr=5.16e-05 | pos=26
Step 52000 | epoch 2/2 | loss=3.5356 | avg=4.2607 | acc=27.5% | lr=5.14e-05 | pos=64
Step 52100 | epoch 2/2 | loss=3.6337 | avg=4.1506 | acc=33.4% | lr=5.12e-05 | pos=64
Step 52200 | epoch 2/2 | loss=4.2330 | avg=4.2326 | acc=20.9% | lr=5.10e-05 | pos=64
Step 52300 | epoch 2/2 | loss=3.7074 | avg=4.1757 | acc=27.8% | lr=5.09e-05 | pos=64
Step 52400 | epoch 2/2 | loss=2.1662 | avg=4.1801 | acc=47.8% | lr=5.07e-05 | pos=64
Step 52500 | epoch 2/2 | loss=2.5947 | avg=4.1741 | acc=42.2% | lr=5.05e-05 | pos=64
Step 52600 | epoch 2/2 | loss=5.7248 | avg=4.1894 | acc=11.2% | lr=5.03e-05 | pos=64
Step 52700 | epoch 2/2 | loss=2.4033 | avg=4.2100 | acc=45.3% | lr=5.01e-05 | pos=64
Step 52800 | epoch 2/2 | loss=3.9000 | avg=4.2255 | acc=25.3% | lr=4.99e-05 | pos=64
Step 52900 | epoch 2/2 | loss=4.7661 | avg=4.2193 | acc=22.2% | lr=4.97e-05 | pos=64
Step 53000 | epoch 2/2 | loss=5.2609 | avg=4.2174 | acc=17.8% | lr=4.95e-05 | pos=64
Step 53100 | epoch 2/2 | loss=2.6993 | avg=4.2650 | acc=35.0% | lr=4.93e-05 | pos=64
Step 53200 | epoch 2/2 | loss=5.0744 | avg=4.2858 | acc=14.4% | lr=4.92e-05 | pos=64
Step 53300 | epoch 2/2 | loss=5.8003 | avg=4.2816 | acc=17.2% | lr=4.90e-05 | pos=64
Step 53400 | epoch 2/2 | loss=4.3404 | avg=4.2741 | acc=20.6% | lr=4.88e-05 | pos=64
Step 53500 | epoch 2/2 | loss=2.6664 | avg=4.2626 | acc=41.9% | lr=4.86e-05 | pos=64
Step 53600 | epoch 2/2 | loss=4.7678 | avg=4.2828 | acc=18.4% | lr=4.84e-05 | pos=64
Step 53700 | epoch 2/2 | loss=3.2696 | avg=4.2951 | acc=36.6% | lr=4.82e-05 | pos=64
Step 53800 | epoch 2/2 | loss=4.6912 | avg=4.2679 | acc=16.6% | lr=4.80e-05 | pos=64
Step 53900 | epoch 2/2 | loss=3.8017 | avg=4.2719 | acc=29.1% | lr=4.78e-05 | pos=64
Step 54000 | epoch 2/2 | loss=2.0394 | avg=4.2839 | acc=46.9% | lr=4.76e-05 | pos=64
Step 54100 | epoch 2/2 | loss=4.9542 | avg=4.0901 | acc=19.4% | lr=4.75e-05 | pos=64
Step 54200 | epoch 2/2 | loss=3.9687 | avg=4.0914 | acc=26.2% | lr=4.73e-05 | pos=64
Step 54300 | epoch 2/2 | loss=5.5588 | avg=4.1474 | acc=18.8% | lr=4.71e-05 | pos=64
Step 54400 | epoch 2/2 | loss=3.4846 | avg=4.1146 | acc=31.2% | lr=4.69e-05 | pos=64
Step 54500 | epoch 2/2 | loss=3.5327 | avg=4.0975 | acc=26.6% | lr=4.67e-05 | pos=64
Step 54600 | epoch 2/2 | loss=3.3468 | avg=4.1207 | acc=37.8% | lr=4.65e-05 | pos=64
Step 54700 | epoch 2/2 | loss=5.5684 | avg=4.1531 | acc=12.5% | lr=4.63e-05 | pos=64
Step 54800 | epoch 2/2 | loss=5.1904 | avg=4.1616 | acc=20.9% | lr=4.62e-05 | pos=64
Step 54900 | epoch 2/2 | loss=5.1430 | avg=4.1824 | acc=15.6% | lr=4.60e-05 | pos=64
Step 55000 | epoch 2/2 | loss=5.3936 | avg=4.1877 | acc=13.8% | lr=4.58e-05 | pos=64
[Checkpoint] Saved step 55000 (loss=5.3936) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
[Save @ step 55000] loss=5.3936
[Checkpoint] Saved step 55000 (loss=5.3936) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step55000.pt
[Prune @ step 55000] zeroed 0.0M / 1407.4M (0.0% sparsity)
Step 55100 | epoch 2/2 | loss=4.5162 | avg=4.2784 | acc=17.5% | lr=4.56e-05 | pos=64
Step 55200 | epoch 2/2 | loss=3.8093 | avg=4.1908 | acc=21.2% | lr=4.54e-05 | pos=64
Step 55300 | epoch 2/2 | loss=4.7255 | avg=4.1757 | acc=18.1% | lr=4.52e-05 | pos=64
Step 55400 | epoch 2/2 | loss=4.1218 | avg=4.1679 | acc=19.4% | lr=4.50e-05 | pos=64
Step 55500 | epoch 2/2 | loss=3.7746 | avg=4.1506 | acc=27.2% | lr=4.48e-05 | pos=64
Step 55600 | epoch 2/2 | loss=4.0509 | avg=4.1869 | acc=24.7% | lr=4.47e-05 | pos=64
Step 55700 | epoch 2/2 | loss=3.6973 | avg=4.1932 | acc=27.2% | lr=4.45e-05 | pos=64
Step 55800 | epoch 2/2 | loss=3.7366 | avg=4.1651 | acc=25.0% | lr=4.43e-05 | pos=64
Step 55900 | epoch 2/2 | loss=4.5621 | avg=4.1390 | acc=16.2% | lr=4.41e-05 | pos=64
Step 56000 | epoch 2/2 | loss=4.4710 | avg=4.1407 | acc=26.6% | lr=4.39e-05 | pos=64
Step 56100 | epoch 2/2 | loss=3.0139 | avg=4.1708 | acc=33.1% | lr=4.37e-05 | pos=64
Step 56200 | epoch 2/2 | loss=3.7810 | avg=4.1464 | acc=22.5% | lr=4.36e-05 | pos=64
Step 56300 | epoch 2/2 | loss=3.9623 | avg=4.0911 | acc=20.3% | lr=4.34e-05 | pos=64
Step 56400 | epoch 2/2 | loss=4.3553 | avg=4.1003 | acc=23.1% | lr=4.32e-05 | pos=64
Step 56500 | epoch 2/2 | loss=5.0065 | avg=4.0988 | acc=12.8% | lr=4.30e-05 | pos=64
Step 56600 | epoch 2/2 | loss=2.5300 | avg=4.1006 | acc=49.7% | lr=4.28e-05 | pos=64
Step 56700 | epoch 2/2 | loss=5.8755 | avg=4.0992 | acc=13.4% | lr=4.26e-05 | pos=64
Step 56800 | epoch 2/2 | loss=5.3956 | avg=4.1137 | acc=16.2% | lr=4.25e-05 | pos=64
Step 56900 | epoch 2/2 | loss=5.2464 | avg=4.1265 | acc=10.0% | lr=4.23e-05 | pos=64
Step 57000 | epoch 2/2 | loss=3.0738 | avg=4.1386 | acc=31.2% | lr=4.21e-05 | pos=64
Step 57100 | epoch 2/2 | loss=2.6303 | avg=4.2434 | acc=41.9% | lr=4.19e-05 | pos=64
Step 57200 | epoch 2/2 | loss=2.7785 | avg=4.1662 | acc=37.2% | lr=4.17e-05 | pos=64
Step 57300 | epoch 2/2 | loss=3.0878 | avg=4.1299 | acc=33.4% | lr=4.15e-05 | pos=64
Step 57400 | epoch 2/2 | loss=4.4544 | avg=4.1786 | acc=18.1% | lr=4.14e-05 | pos=64
Step 57500 | epoch 2/2 | loss=5.2340 | avg=4.2009 | acc=24.4% | lr=4.12e-05 | pos=64
Step 57600 | epoch 2/2 | loss=2.3631 | avg=4.1756 | acc=42.8% | lr=4.10e-05 | pos=64
Step 57700 | epoch 2/2 | loss=2.8706 | avg=4.1659 | acc=38.1% | lr=4.08e-05 | pos=64
Step 57800 | epoch 2/2 | loss=4.6117 | avg=4.1614 | acc=19.7% | lr=4.06e-05 | pos=64
Step 57900 | epoch 2/2 | loss=4.4408 | avg=4.1482 | acc=26.6% | lr=4.05e-05 | pos=64
Step 58000 | epoch 2/2 | loss=4.8935 | avg=4.1372 | acc=17.2% | lr=4.03e-05 | pos=64
Step 58100 | epoch 2/2 | loss=5.1582 | avg=4.0288 | acc=14.4% | lr=4.01e-05 | pos=64
Step 58200 | epoch 2/2 | loss=4.6675 | avg=4.0410 | acc=14.1% | lr=3.99e-05 | pos=64
Step 58300 | epoch 2/2 | loss=3.7610 | avg=4.0434 | acc=27.8% | lr=3.97e-05 | pos=64
Step 58400 | epoch 2/2 | loss=4.6507 | avg=4.0494 | acc=21.6% | lr=3.96e-05 | pos=64
Step 58500 | epoch 2/2 | loss=4.7374 | avg=4.0741 | acc=23.1% | lr=3.94e-05 | pos=64
Step 58600 | epoch 2/2 | loss=4.1819 | avg=4.0661 | acc=21.2% | lr=3.92e-05 | pos=64
Step 58700 | epoch 2/2 | loss=2.6859 | avg=4.0721 | acc=38.1% | lr=3.90e-05 | pos=64
Step 58800 | epoch 2/2 | loss=3.4361 | avg=4.1087 | acc=26.9% | lr=3.88e-05 | pos=64
Step 58900 | epoch 2/2 | loss=4.4231 | avg=4.0821 | acc=20.3% | lr=3.87e-05 | pos=64
Step 59000 | epoch 2/2 | loss=4.5820 | avg=4.0734 | acc=18.8% | lr=3.85e-05 | pos=64
Step 59100 | epoch 2/2 | loss=4.5509 | avg=3.9606 | acc=20.0% | lr=3.83e-05 | pos=64
Step 59200 | epoch 2/2 | loss=3.7237 | avg=3.9542 | acc=24.7% | lr=3.81e-05 | pos=64
Step 59300 | epoch 2/2 | loss=4.6538 | avg=3.9588 | acc=16.6% | lr=3.80e-05 | pos=64
Step 59400 | epoch 2/2 | loss=4.2155 | avg=4.0015 | acc=21.2% | lr=3.78e-05 | pos=64
Step 59500 | epoch 2/2 | loss=5.7525 | avg=4.0206 | acc=11.9% | lr=3.76e-05 | pos=64
Step 59600 | epoch 2/2 | loss=4.0382 | avg=4.0227 | acc=21.6% | lr=3.74e-05 | pos=64
Step 59700 | epoch 2/2 | loss=2.7897 | avg=4.0174 | acc=34.4% | lr=3.73e-05 | pos=64
Step 59800 | epoch 2/2 | loss=3.3053 | avg=4.0501 | acc=30.6% | lr=3.71e-05 | pos=64
Step 59900 | epoch 2/2 | loss=4.6714 | avg=4.0452 | acc=15.3% | lr=3.69e-05 | pos=64
Step 60000 | epoch 2/2 | loss=2.9858 | avg=4.0436 | acc=34.1% | lr=3.67e-05 | pos=64
[Checkpoint] Saved step 60000 (loss=2.9858) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
[Save @ step 60000] loss=2.9858
[Checkpoint] Saved step 60000 (loss=2.9858) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step60000.pt
[Prune @ step 60000] zeroed 0.0M / 1407.4M (0.0% sparsity)
Step 60100 | epoch 2/2 | loss=3.8330 | avg=3.9239 | acc=25.6% | lr=3.66e-05 | pos=64
Step 60200 | epoch 2/2 | loss=3.5669 | avg=4.1022 | acc=27.2% | lr=3.64e-05 | pos=64
Step 60300 | epoch 2/2 | loss=3.5239 | avg=4.0814 | acc=30.0% | lr=3.62e-05 | pos=64
Step 60400 | epoch 2/2 | loss=3.2373 | avg=4.0942 | acc=32.5% | lr=3.60e-05 | pos=64
Step 60500 | epoch 2/2 | loss=2.7187 | avg=4.0444 | acc=34.4% | lr=3.59e-05 | pos=64
Step 60600 | epoch 2/2 | loss=4.2308 | avg=4.0627 | acc=22.2% | lr=3.57e-05 | pos=64
Step 60700 | epoch 2/2 | loss=1.7200 | avg=4.0798 | acc=56.9% | lr=3.55e-05 | pos=64
Step 60800 | epoch 2/2 | loss=2.9003 | avg=4.0900 | acc=35.0% | lr=3.54e-05 | pos=64
Step 60900 | epoch 2/2 | loss=5.0054 | avg=4.0693 | acc=20.3% | lr=3.52e-05 | pos=64
Step 61000 | epoch 2/2 | loss=4.2459 | avg=4.0705 | acc=22.2% | lr=3.50e-05 | pos=64
Step 61100 | epoch 2/2 | loss=4.2594 | avg=4.0604 | acc=24.7% | lr=3.48e-05 | pos=64
Step 61200 | epoch 2/2 | loss=3.8607 | avg=4.1537 | acc=23.4% | lr=3.47e-05 | pos=64
Step 61300 | epoch 2/2 | loss=2.3646 | avg=4.1256 | acc=45.3% | lr=3.45e-05 | pos=64
Step 61400 | epoch 2/2 | loss=3.8619 | avg=4.0834 | acc=25.3% | lr=3.43e-05 | pos=64
Step 61500 | epoch 2/2 | loss=3.2487 | avg=4.0629 | acc=37.8% | lr=3.42e-05 | pos=64
Step 61600 | epoch 2/2 | loss=4.6425 | avg=4.0680 | acc=15.9% | lr=3.40e-05 | pos=64
Step 61700 | epoch 2/2 | loss=4.3301 | avg=4.0565 | acc=18.8% | lr=3.38e-05 | pos=64
Step 61800 | epoch 2/2 | loss=6.8729 | avg=4.0740 | acc=11.9% | lr=3.37e-05 | pos=64
Step 61900 | epoch 2/2 | loss=4.2032 | avg=4.0915 | acc=19.1% | lr=3.35e-05 | pos=64
Step 62000 | epoch 2/2 | loss=3.9328 | avg=4.0745 | acc=29.4% | lr=3.33e-05 | pos=64
Step 62100 | epoch 2/2 | loss=4.3181 | avg=4.0363 | acc=16.9% | lr=3.32e-05 | pos=64
Step 62200 | epoch 2/2 | loss=3.8276 | avg=4.0184 | acc=21.2% | lr=3.30e-05 | pos=64
Step 62300 | epoch 2/2 | loss=6.0614 | avg=4.0637 | acc=10.3% | lr=3.28e-05 | pos=64
Step 62400 | epoch 2/2 | loss=3.9388 | avg=4.0939 | acc=19.4% | lr=3.27e-05 | pos=64
Step 62500 | epoch 2/2 | loss=4.1195 | avg=4.0725 | acc=20.9% | lr=3.25e-05 | pos=64
Step 62600 | epoch 2/2 | loss=3.3876 | avg=4.1031 | acc=25.9% | lr=3.23e-05 | pos=64
Step 62700 | epoch 2/2 | loss=3.9373 | avg=4.0890 | acc=26.9% | lr=3.22e-05 | pos=64
Step 62800 | epoch 2/2 | loss=2.9918 | avg=4.0637 | acc=36.9% | lr=3.20e-05 | pos=64
Step 62900 | epoch 2/2 | loss=2.9810 | avg=4.0758 | acc=34.1% | lr=3.18e-05 | pos=64
Step 63000 | epoch 2/2 | loss=2.5209 | avg=4.0705 | acc=40.3% | lr=3.17e-05 | pos=64
Step 63100 | epoch 2/2 | loss=4.8396 | avg=3.9990 | acc=15.3% | lr=3.15e-05 | pos=64
Step 63200 | epoch 2/2 | loss=3.2383 | avg=3.9918 | acc=30.9% | lr=3.13e-05 | pos=64
Step 63300 | epoch 2/2 | loss=3.8425 | avg=3.9815 | acc=31.6% | lr=3.12e-05 | pos=64
Step 63400 | epoch 2/2 | loss=5.0435 | avg=3.9736 | acc=16.9% | lr=3.10e-05 | pos=64
Step 63500 | epoch 2/2 | loss=3.1421 | avg=3.9933 | acc=33.8% | lr=3.09e-05 | pos=64
Step 63600 | epoch 2/2 | loss=3.3418 | avg=3.9799 | acc=29.7% | lr=3.07e-05 | pos=64
Step 63700 | epoch 2/2 | loss=4.1966 | avg=3.9901 | acc=21.9% | lr=3.05e-05 | pos=64
Step 63800 | epoch 2/2 | loss=4.3271 | avg=3.9870 | acc=22.8% | lr=3.04e-05 | pos=64
Step 63900 | epoch 2/2 | loss=4.1081 | avg=3.9595 | acc=22.5% | lr=3.02e-05 | pos=64
Step 64000 | epoch 2/2 | loss=3.9388 | avg=3.9641 | acc=28.7% | lr=3.01e-05 | pos=64
Step 64100 | epoch 2/2 | loss=2.7076 | avg=4.1071 | acc=44.7% | lr=2.99e-05 | pos=64
Step 64200 | epoch 2/2 | loss=3.7021 | avg=4.0373 | acc=26.9% | lr=2.98e-05 | pos=64
Step 64300 | epoch 2/2 | loss=3.4879 | avg=4.0371 | acc=31.6% | lr=2.96e-05 | pos=64
Step 64400 | epoch 2/2 | loss=2.7751 | avg=4.0397 | acc=35.3% | lr=2.94e-05 | pos=64
Step 64500 | epoch 2/2 | loss=5.1389 | avg=4.0244 | acc=14.7% | lr=2.93e-05 | pos=64
Step 64600 | epoch 2/2 | loss=5.9862 | avg=4.0349 | acc=11.9% | lr=2.91e-05 | pos=64
Step 64700 | epoch 2/2 | loss=4.7760 | avg=4.0451 | acc=15.3% | lr=2.90e-05 | pos=64
Step 64800 | epoch 2/2 | loss=2.2713 | avg=4.0212 | acc=48.1% | lr=2.88e-05 | pos=64
Step 64900 | epoch 2/2 | loss=2.6343 | avg=3.9889 | acc=40.0% | lr=2.87e-05 | pos=64
Step 65000 | epoch 2/2 | loss=4.3901 | avg=3.9958 | acc=19.7% | lr=2.85e-05 | pos=64
[Checkpoint] Saved step 65000 (loss=4.3901) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
[Save @ step 65000] loss=4.3901
[Checkpoint] Saved step 65000 (loss=4.3901) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step65000.pt
[Prune @ step 65000] zeroed 0.0M / 1407.4M (0.0% sparsity)
Step 65100 | epoch 2/2 | loss=4.3795 | avg=4.1776 | acc=19.1% | lr=2.84e-05 | pos=64
Step 65200 | epoch 2/2 | loss=3.1766 | avg=4.0615 | acc=30.0% | lr=2.82e-05 | pos=64
Step 65300 | epoch 2/2 | loss=4.0979 | avg=4.1010 | acc=25.9% | lr=2.80e-05 | pos=64
Step 65400 | epoch 2/2 | loss=4.3317 | avg=4.0931 | acc=20.0% | lr=2.79e-05 | pos=64
Step 65500 | epoch 2/2 | loss=2.4667 | avg=4.0742 | acc=45.0% | lr=2.77e-05 | pos=64
Step 65600 | epoch 2/2 | loss=4.8428 | avg=4.0713 | acc=17.8% | lr=2.76e-05 | pos=64
Step 65700 | epoch 2/2 | loss=3.8151 | avg=4.0715 | acc=32.8% | lr=2.74e-05 | pos=64
Step 65800 | epoch 2/2 | loss=3.7558 | avg=4.0690 | acc=27.8% | lr=2.73e-05 | pos=64
Step 65900 | epoch 2/2 | loss=2.2229 | avg=4.0481 | acc=42.8% | lr=2.71e-05 | pos=64
Step 66000 | epoch 2/2 | loss=2.9299 | avg=4.0462 | acc=33.4% | lr=2.70e-05 | pos=64
Step 66100 | epoch 2/2 | loss=2.6785 | avg=3.9906 | acc=39.4% | lr=2.68e-05 | pos=64
Step 66200 | epoch 2/2 | loss=6.0460 | avg=4.0197 | acc=15.3% | lr=2.67e-05 | pos=64
Step 66300 | epoch 2/2 | loss=4.0599 | avg=4.0013 | acc=27.5% | lr=2.65e-05 | pos=64
Step 66400 | epoch 2/2 | loss=4.4168 | avg=3.9807 | acc=22.2% | lr=2.64e-05 | pos=64
Step 66500 | epoch 2/2 | loss=4.7449 | avg=4.0224 | acc=19.1% | lr=2.63e-05 | pos=64
Step 66600 | epoch 2/2 | loss=4.7786 | avg=4.0074 | acc=16.6% | lr=2.61e-05 | pos=64
Step 66700 | epoch 2/2 | loss=4.5076 | avg=3.9948 | acc=21.9% | lr=2.60e-05 | pos=64
Step 66800 | epoch 2/2 | loss=3.3022 | avg=4.0162 | acc=30.9% | lr=2.58e-05 | pos=64
Step 66900 | epoch 2/2 | loss=4.1388 | avg=4.0111 | acc=21.9% | lr=2.57e-05 | pos=64
Step 67000 | epoch 2/2 | loss=2.4938 | avg=4.0179 | acc=42.5% | lr=2.55e-05 | pos=64
Step 67100 | epoch 2/2 | loss=3.3502 | avg=4.0698 | acc=36.2% | lr=2.54e-05 | pos=64
Step 67200 | epoch 2/2 | loss=2.8992 | avg=3.9731 | acc=34.4% | lr=2.52e-05 | pos=64
Step 67300 | epoch 2/2 | loss=3.8375 | avg=4.0487 | acc=21.9% | lr=2.51e-05 | pos=64
Step 67400 | epoch 2/2 | loss=5.3267 | avg=4.0175 | acc=17.8% | lr=2.50e-05 | pos=64
Step 67500 | epoch 2/2 | loss=3.4675 | avg=3.9887 | acc=28.7% | lr=2.48e-05 | pos=64
Step 67600 | epoch 2/2 | loss=3.7583 | avg=4.0100 | acc=26.2% | lr=2.47e-05 | pos=64
Step 67700 | epoch 2/2 | loss=3.8188 | avg=3.9977 | acc=26.2% | lr=2.45e-05 | pos=64
Step 67800 | epoch 2/2 | loss=2.5829 | avg=3.9890 | acc=39.7% | lr=2.44e-05 | pos=64
Step 67900 | epoch 2/2 | loss=5.0292 | avg=3.9850 | acc=22.2% | lr=2.43e-05 | pos=64
Step 68000 | epoch 2/2 | loss=3.7859 | avg=3.9706 | acc=28.7% | lr=2.41e-05 | pos=64
Step 68100 | epoch 2/2 | loss=5.1101 | avg=3.9497 | acc=13.4% | lr=2.40e-05 | pos=64
Step 68200 | epoch 2/2 | loss=4.4756 | avg=3.9722 | acc=22.5% | lr=2.38e-05 | pos=64
Step 68300 | epoch 2/2 | loss=4.1080 | avg=4.0035 | acc=25.9% | lr=2.37e-05 | pos=64
Step 68400 | epoch 2/2 | loss=2.8236 | avg=3.9939 | acc=38.8% | lr=2.36e-05 | pos=64
Step 68500 | epoch 2/2 | loss=3.0124 | avg=3.9966 | acc=34.1% | lr=2.34e-05 | pos=64
Step 68600 | epoch 2/2 | loss=3.7765 | avg=3.9815 | acc=32.5% | lr=2.33e-05 | pos=64
Step 68700 | epoch 2/2 | loss=3.0217 | avg=3.9884 | acc=38.1% | lr=2.32e-05 | pos=64
Step 68800 | epoch 2/2 | loss=3.5438 | avg=3.9941 | acc=26.2% | lr=2.30e-05 | pos=64
Step 68900 | epoch 2/2 | loss=3.8703 | avg=3.9851 | acc=30.6% | lr=2.29e-05 | pos=64
Step 69000 | epoch 2/2 | loss=3.6413 | avg=3.9808 | acc=26.2% | lr=2.28e-05 | pos=64
Step 69100 | epoch 2/2 | loss=5.3341 | avg=3.8165 | acc=14.1% | lr=2.26e-05 | pos=64
Step 69200 | epoch 2/2 | loss=4.8652 | avg=3.8429 | acc=16.9% | lr=2.25e-05 | pos=64
Step 69300 | epoch 2/2 | loss=4.3434 | avg=3.8633 | acc=18.8% | lr=2.24e-05 | pos=64
Step 69400 | epoch 2/2 | loss=4.7815 | avg=3.8867 | acc=19.7% | lr=2.22e-05 | pos=64
Step 69500 | epoch 2/2 | loss=3.6497 | avg=3.8812 | acc=26.6% | lr=2.21e-05 | pos=64
Step 69600 | epoch 2/2 | loss=4.9780 | avg=3.8984 | acc=11.9% | lr=2.20e-05 | pos=64
Step 69700 | epoch 2/2 | loss=4.0833 | avg=3.9129 | acc=23.1% | lr=2.18e-05 | pos=64
Step 69800 | epoch 2/2 | loss=3.6186 | avg=3.9157 | acc=29.1% | lr=2.17e-05 | pos=64
Step 69900 | epoch 2/2 | loss=5.5480 | avg=3.9107 | acc=17.2% | lr=2.16e-05 | pos=64
Step 70000 | epoch 2/2 | loss=3.7976 | avg=3.9249 | acc=26.6% | lr=2.15e-05 | pos=64
[Checkpoint] Saved step 70000 (loss=3.7976) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
[Save @ step 70000] loss=3.7976
[Checkpoint] Saved step 70000 (loss=3.7976) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step70000.pt
[Prune @ step 70000] zeroed 0.0M / 1407.4M (0.0% sparsity)
Step 70100 | epoch 2/2 | loss=3.8177 | avg=3.9559 | acc=35.0% | lr=2.13e-05 | pos=64
Step 70200 | epoch 2/2 | loss=3.2713 | avg=3.9842 | acc=33.1% | lr=2.12e-05 | pos=64
Step 70300 | epoch 2/2 | loss=4.4606 | avg=4.0190 | acc=20.9% | lr=2.11e-05 | pos=64
Step 70400 | epoch 2/2 | loss=3.0428 | avg=3.9660 | acc=36.9% | lr=2.10e-05 | pos=64
Step 70500 | epoch 2/2 | loss=3.2894 | avg=3.9720 | acc=36.6% | lr=2.08e-05 | pos=64
Step 70600 | epoch 2/2 | loss=3.7879 | avg=3.9610 | acc=24.4% | lr=2.07e-05 | pos=64
Step 70700 | epoch 2/2 | loss=4.1509 | avg=3.9827 | acc=18.8% | lr=2.06e-05 | pos=64
Step 70800 | epoch 2/2 | loss=4.3644 | avg=3.9772 | acc=21.2% | lr=2.05e-05 | pos=64
Step 70900 | epoch 2/2 | loss=5.7064 | avg=3.9717 | acc=14.7% | lr=2.03e-05 | pos=64
Step 71000 | epoch 2/2 | loss=4.7662 | avg=3.9739 | acc=13.8% | lr=2.02e-05 | pos=64
Step 71100 | epoch 2/2 | loss=5.3511 | avg=3.8746 | acc=9.1% | lr=2.01e-05 | pos=64
Step 71200 | epoch 2/2 | loss=3.9317 | avg=3.9620 | acc=21.2% | lr=2.00e-05 | pos=64
Step 71300 | epoch 2/2 | loss=4.2127 | avg=3.9650 | acc=24.1% | lr=1.99e-05 | pos=64
Step 71400 | epoch 2/2 | loss=2.6757 | avg=3.9601 | acc=37.5% | lr=1.97e-05 | pos=64
Step 71500 | epoch 2/2 | loss=2.9107 | avg=3.9780 | acc=36.6% | lr=1.96e-05 | pos=64
Step 71600 | epoch 2/2 | loss=4.7727 | avg=3.9611 | acc=19.1% | lr=1.95e-05 | pos=64
Step 71700 | epoch 2/2 | loss=5.1722 | avg=3.9744 | acc=23.1% | lr=1.94e-05 | pos=64
Step 71800 | epoch 2/2 | loss=3.2547 | avg=3.9943 | acc=36.2% | lr=1.93e-05 | pos=64
Step 71900 | epoch 2/2 | loss=7.5842 | avg=3.9922 | acc=9.1% | lr=1.92e-05 | pos=64
Step 72000 | epoch 2/2 | loss=3.9415 | avg=3.9826 | acc=21.9% | lr=1.90e-05 | pos=64
Step 72100 | epoch 2/2 | loss=5.2223 | avg=4.1068 | acc=13.8% | lr=1.89e-05 | pos=64
Step 72200 | epoch 2/2 | loss=4.7449 | avg=3.9670 | acc=18.4% | lr=1.88e-05 | pos=64
Step 72300 | epoch 2/2 | loss=3.4318 | avg=3.9629 | acc=34.7% | lr=1.87e-05 | pos=64
Step 72400 | epoch 2/2 | loss=4.4708 | avg=4.0072 | acc=18.1% | lr=1.86e-05 | pos=64
Step 72500 | epoch 2/2 | loss=3.6306 | avg=3.9713 | acc=22.5% | lr=1.85e-05 | pos=64
Step 72600 | epoch 2/2 | loss=4.0440 | avg=3.9635 | acc=25.9% | lr=1.84e-05 | pos=64
Step 72700 | epoch 2/2 | loss=5.3495 | avg=3.9759 | acc=16.6% | lr=1.83e-05 | pos=64
Step 72800 | epoch 2/2 | loss=4.3048 | avg=3.9836 | acc=21.2% | lr=1.81e-05 | pos=64
Step 72900 | epoch 2/2 | loss=4.7384 | avg=3.9723 | acc=18.8% | lr=1.80e-05 | pos=64
Step 73000 | epoch 2/2 | loss=3.5557 | avg=3.9727 | acc=28.1% | lr=1.79e-05 | pos=64
Step 73100 | epoch 2/2 | loss=4.0367 | avg=4.0304 | acc=22.8% | lr=1.78e-05 | pos=64
Step 73200 | epoch 2/2 | loss=5.0796 | avg=3.9429 | acc=14.7% | lr=1.77e-05 | pos=64
Step 73300 | epoch 2/2 | loss=4.5306 | avg=3.9430 | acc=17.8% | lr=1.76e-05 | pos=64
Step 73400 | epoch 2/2 | loss=5.6048 | avg=3.9936 | acc=11.9% | lr=1.75e-05 | pos=64
Step 73500 | epoch 2/2 | loss=4.9194 | avg=3.9850 | acc=16.2% | lr=1.74e-05 | pos=64
Step 73600 | epoch 2/2 | loss=2.6192 | avg=4.0012 | acc=36.9% | lr=1.73e-05 | pos=64
Step 73700 | epoch 2/2 | loss=3.2044 | avg=3.9696 | acc=27.8% | lr=1.72e-05 | pos=64
Step 73800 | epoch 2/2 | loss=3.2734 | avg=3.9767 | acc=30.6% | lr=1.71e-05 | pos=64
Step 73900 | epoch 2/2 | loss=2.4732 | avg=3.9814 | acc=40.3% | lr=1.70e-05 | pos=64
Step 74000 | epoch 2/2 | loss=2.4827 | avg=3.9739 | acc=42.5% | lr=1.69e-05 | pos=64
Step 74100 | epoch 2/2 | loss=3.3299 | avg=3.8277 | acc=30.0% | lr=1.68e-05 | pos=64
Step 74200 | epoch 2/2 | loss=4.2136 | avg=3.8443 | acc=19.7% | lr=1.67e-05 | pos=64
Step 74300 | epoch 2/2 | loss=3.9210 | avg=3.9573 | acc=20.0% | lr=1.66e-05 | pos=64
Step 74400 | epoch 2/2 | loss=5.3194 | avg=3.9435 | acc=12.8% | lr=1.65e-05 | pos=64
Step 74500 | epoch 2/2 | loss=5.1844 | avg=3.9397 | acc=13.4% | lr=1.64e-05 | pos=64
Step 74600 | epoch 2/2 | loss=5.9062 | avg=3.9262 | acc=16.6% | lr=1.63e-05 | pos=64
Step 74700 | epoch 2/2 | loss=4.4060 | avg=3.9274 | acc=20.0% | lr=1.62e-05 | pos=64
Step 74800 | epoch 2/2 | loss=4.1646 | avg=3.9522 | acc=28.1% | lr=1.61e-05 | pos=64
Step 74900 | epoch 2/2 | loss=4.6975 | avg=3.9439 | acc=17.5% | lr=1.60e-05 | pos=64
Step 75000 | epoch 2/2 | loss=3.8099 | avg=3.9236 | acc=29.4% | lr=1.59e-05 | pos=64
[Checkpoint] Saved step 75000 (loss=3.8099) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
[Save @ step 75000] loss=3.8099
[Checkpoint] Saved step 75000 (loss=3.8099) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step75000.pt
[Prune @ step 75000] zeroed 0.0M / 1407.4M (0.0% sparsity)
Step 75100 | epoch 2/2 | loss=3.1919 | avg=3.8885 | acc=35.0% | lr=1.58e-05 | pos=64
|
| 838 |
+
Step 75200 | epoch 2/2 | loss=3.0743 | avg=3.9323 | acc=30.0% | lr=1.57e-05 | pos=64
|
| 839 |
+
Step 75300 | epoch 2/2 | loss=3.8457 | avg=3.9156 | acc=28.7% | lr=1.56e-05 | pos=64
|
| 840 |
+
Step 75400 | epoch 2/2 | loss=2.7368 | avg=3.8854 | acc=34.4% | lr=1.55e-05 | pos=64
|
| 841 |
+
Step 75500 | epoch 2/2 | loss=3.8707 | avg=3.8896 | acc=21.6% | lr=1.54e-05 | pos=64
|
| 842 |
+
Step 75600 | epoch 2/2 | loss=4.2949 | avg=3.9212 | acc=22.2% | lr=1.54e-05 | pos=64
|
| 843 |
+
Step 75700 | epoch 2/2 | loss=3.7626 | avg=3.9296 | acc=27.2% | lr=1.53e-05 | pos=64
|
| 844 |
+
Step 75800 | epoch 2/2 | loss=3.9824 | avg=3.9417 | acc=27.5% | lr=1.52e-05 | pos=64
|
| 845 |
+
Step 75900 | epoch 2/2 | loss=4.4798 | avg=3.9416 | acc=18.4% | lr=1.51e-05 | pos=64
|
| 846 |
+
Step 76000 | epoch 2/2 | loss=4.6770 | avg=3.9619 | acc=17.2% | lr=1.50e-05 | pos=64
|
| 847 |
+
Step 76100 | epoch 2/2 | loss=3.6232 | avg=4.0577 | acc=31.9% | lr=1.49e-05 | pos=64
|
| 848 |
+
Step 76200 | epoch 2/2 | loss=5.7295 | avg=3.9855 | acc=11.9% | lr=1.48e-05 | pos=64
|
| 849 |
+
Step 76300 | epoch 2/2 | loss=4.0309 | avg=3.9120 | acc=18.8% | lr=1.47e-05 | pos=64
|
| 850 |
+
Step 76400 | epoch 2/2 | loss=5.7989 | avg=3.8953 | acc=17.5% | lr=1.47e-05 | pos=64
|
| 851 |
+
Step 76500 | epoch 2/2 | loss=3.2187 | avg=3.8625 | acc=37.8% | lr=1.46e-05 | pos=64
|
| 852 |
+
Step 76600 | epoch 2/2 | loss=3.6841 | avg=3.8674 | acc=26.2% | lr=1.45e-05 | pos=64
|
| 853 |
+
Step 76700 | epoch 2/2 | loss=2.7946 | avg=3.8753 | acc=32.2% | lr=1.44e-05 | pos=64
|
| 854 |
+
Step 76800 | epoch 2/2 | loss=2.9086 | avg=3.8620 | acc=34.4% | lr=1.43e-05 | pos=64
|
| 855 |
+
Step 76900 | epoch 2/2 | loss=3.3638 | avg=3.8633 | acc=26.9% | lr=1.42e-05 | pos=64
|
| 856 |
+
Step 77000 | epoch 2/2 | loss=2.6785 | avg=3.8562 | acc=33.4% | lr=1.42e-05 | pos=64
|
| 857 |
+
Step 77100 | epoch 2/2 | loss=2.9324 | avg=3.9094 | acc=37.5% | lr=1.41e-05 | pos=64
|
| 858 |
+
Step 77200 | epoch 2/2 | loss=4.9674 | avg=3.9219 | acc=20.6% | lr=1.40e-05 | pos=64
|
| 859 |
+
Step 77300 | epoch 2/2 | loss=4.9176 | avg=3.9506 | acc=15.6% | lr=1.39e-05 | pos=64
|
| 860 |
+
Step 77400 | epoch 2/2 | loss=3.9582 | avg=3.9039 | acc=29.4% | lr=1.38e-05 | pos=64
|
| 861 |
+
Step 77500 | epoch 2/2 | loss=5.7702 | avg=3.9179 | acc=12.8% | lr=1.38e-05 | pos=64
|
| 862 |
+
Step 77600 | epoch 2/2 | loss=5.2998 | avg=3.9348 | acc=17.8% | lr=1.37e-05 | pos=64
|
| 863 |
+
Step 77700 | epoch 2/2 | loss=3.3116 | avg=3.9036 | acc=31.9% | lr=1.36e-05 | pos=64
|
| 864 |
+
Step 77800 | epoch 2/2 | loss=2.3984 | avg=3.9114 | acc=43.1% | lr=1.35e-05 | pos=64
|
| 865 |
+
Step 77900 | epoch 2/2 | loss=5.5713 | avg=3.9239 | acc=14.1% | lr=1.35e-05 | pos=64
|
| 866 |
+
Step 78000 | epoch 2/2 | loss=3.8396 | avg=3.9350 | acc=21.2% | lr=1.34e-05 | pos=64
|
| 867 |
+
Step 78100 | epoch 2/2 | loss=4.2882 | avg=3.9720 | acc=19.4% | lr=1.33e-05 | pos=64
|
| 868 |
+
Step 78200 | epoch 2/2 | loss=4.6613 | avg=3.9485 | acc=18.1% | lr=1.33e-05 | pos=64
|
| 869 |
+
Step 78300 | epoch 2/2 | loss=5.1955 | avg=3.9910 | acc=18.4% | lr=1.32e-05 | pos=64
|
| 870 |
+
Step 78400 | epoch 2/2 | loss=4.9748 | avg=3.9656 | acc=25.0% | lr=1.31e-05 | pos=64
|
| 871 |
+
Step 78500 | epoch 2/2 | loss=2.4309 | avg=3.9518 | acc=42.5% | lr=1.30e-05 | pos=64
|
| 872 |
+
Step 78600 | epoch 2/2 | loss=4.3748 | avg=3.9567 | acc=16.9% | lr=1.30e-05 | pos=64
|
| 873 |
+
Step 78700 | epoch 2/2 | loss=4.7788 | avg=3.9485 | acc=19.7% | lr=1.29e-05 | pos=64
|
| 874 |
+
Step 78800 | epoch 2/2 | loss=3.7079 | avg=3.9602 | acc=24.4% | lr=1.28e-05 | pos=64
|
| 875 |
+
Step 78900 | epoch 2/2 | loss=3.0928 | avg=3.9682 | acc=36.2% | lr=1.28e-05 | pos=64
|
| 876 |
+
Step 79000 | epoch 2/2 | loss=4.6580 | avg=3.9696 | acc=24.4% | lr=1.27e-05 | pos=64
|
| 877 |
+
Step 79100 | epoch 2/2 | loss=4.3921 | avg=3.8872 | acc=17.5% | lr=1.26e-05 | pos=64
|
| 878 |
+
Step 79200 | epoch 2/2 | loss=3.5990 | avg=3.8944 | acc=32.8% | lr=1.26e-05 | pos=64
|
| 879 |
+
Step 79300 | epoch 2/2 | loss=4.0135 | avg=3.9332 | acc=27.2% | lr=1.25e-05 | pos=64
|
| 880 |
+
Step 79400 | epoch 2/2 | loss=2.0610 | avg=3.9130 | acc=47.2% | lr=1.25e-05 | pos=64
|
| 881 |
+
Step 79500 | epoch 2/2 | loss=3.4324 | avg=3.9120 | acc=31.6% | lr=1.24e-05 | pos=64
|
| 882 |
+
Step 79600 | epoch 2/2 | loss=3.4936 | avg=3.9407 | acc=28.7% | lr=1.23e-05 | pos=64
|
| 883 |
+
Step 79700 | epoch 2/2 | loss=5.2379 | avg=3.9439 | acc=14.1% | lr=1.23e-05 | pos=64
|
| 884 |
+
Step 79800 | epoch 2/2 | loss=3.8758 | avg=3.9491 | acc=22.5% | lr=1.22e-05 | pos=64
|
| 885 |
+
Step 79900 | epoch 2/2 | loss=4.3745 | avg=3.9628 | acc=20.0% | lr=1.22e-05 | pos=64
|
| 886 |
+
Step 80000 | epoch 2/2 | loss=4.3228 | avg=3.9622 | acc=19.1% | lr=1.21e-05 | pos=64
|
| 887 |
+
[Checkpoint] Saved step 80000 (loss=4.3228) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
|
| 888 |
+
[Save @ step 80000] loss=4.3228
|
| 889 |
+
[Checkpoint] Saved step 80000 (loss=4.3228) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step80000.pt
|
| 890 |
+
[Prune @ step 80000] zeroed 0.0M / 1407.4M (0.0% sparsity)
|
| 891 |
+
Step 80100 | epoch 2/2 | loss=4.2464 | avg=3.8153 | acc=25.6% | lr=1.20e-05 | pos=64
|
| 892 |
+
Step 80200 | epoch 2/2 | loss=4.7706 | avg=3.8690 | acc=17.2% | lr=1.20e-05 | pos=64
|
| 893 |
+
Step 80300 | epoch 2/2 | loss=4.3177 | avg=3.8873 | acc=18.1% | lr=1.19e-05 | pos=64
|
| 894 |
+
Step 80400 | epoch 2/2 | loss=3.3786 | avg=3.9199 | acc=31.2% | lr=1.19e-05 | pos=64
|
| 895 |
+
Step 80500 | epoch 2/2 | loss=5.1081 | avg=3.9491 | acc=14.7% | lr=1.18e-05 | pos=64
|
| 896 |
+
Step 80600 | epoch 2/2 | loss=3.8872 | avg=3.9412 | acc=26.9% | lr=1.18e-05 | pos=64
|
| 897 |
+
Step 80700 | epoch 2/2 | loss=4.5615 | avg=3.9324 | acc=23.4% | lr=1.17e-05 | pos=64
|
| 898 |
+
Step 80800 | epoch 2/2 | loss=2.6409 | avg=3.9387 | acc=38.4% | lr=1.17e-05 | pos=64
|
| 899 |
+
Step 80900 | epoch 2/2 | loss=4.2988 | avg=3.9430 | acc=24.1% | lr=1.16e-05 | pos=64
|
| 900 |
+
Step 81000 | epoch 2/2 | loss=3.1301 | avg=3.9302 | acc=31.9% | lr=1.16e-05 | pos=64
|
| 901 |
+
Step 81100 | epoch 2/2 | loss=2.5651 | avg=4.2526 | acc=37.5% | lr=1.15e-05 | pos=64
|
| 902 |
+
Step 81200 | epoch 2/2 | loss=3.5898 | avg=4.0857 | acc=28.1% | lr=1.15e-05 | pos=64
|
| 903 |
+
Step 81300 | epoch 2/2 | loss=3.5881 | avg=4.0118 | acc=25.3% | lr=1.14e-05 | pos=64
|
| 904 |
+
Step 81400 | epoch 2/2 | loss=4.5377 | avg=4.0113 | acc=14.1% | lr=1.14e-05 | pos=64
|
| 905 |
+
Step 81500 | epoch 2/2 | loss=4.6724 | avg=4.0048 | acc=20.3% | lr=1.13e-05 | pos=64
|
| 906 |
+
Step 81600 | epoch 2/2 | loss=4.7214 | avg=4.0031 | acc=15.6% | lr=1.13e-05 | pos=64
|
| 907 |
+
Step 81700 | epoch 2/2 | loss=4.3170 | avg=3.9981 | acc=20.9% | lr=1.12e-05 | pos=64
|
| 908 |
+
Step 81800 | epoch 2/2 | loss=4.8377 | avg=3.9901 | acc=20.6% | lr=1.12e-05 | pos=64
|
| 909 |
+
Step 81900 | epoch 2/2 | loss=3.1756 | avg=3.9930 | acc=32.8% | lr=1.11e-05 | pos=64
|
| 910 |
+
Step 82000 | epoch 2/2 | loss=4.3843 | avg=3.9915 | acc=26.2% | lr=1.11e-05 | pos=64
|
| 911 |
+
Step 82100 | epoch 2/2 | loss=3.1136 | avg=3.8791 | acc=36.9% | lr=1.11e-05 | pos=64
|
| 912 |
+
Step 82200 | epoch 2/2 | loss=4.1817 | avg=3.8888 | acc=20.3% | lr=1.10e-05 | pos=64
|
| 913 |
+
Step 82300 | epoch 2/2 | loss=4.8156 | avg=3.8795 | acc=16.6% | lr=1.10e-05 | pos=64
|
| 914 |
+
Step 82400 | epoch 2/2 | loss=3.4888 | avg=3.8815 | acc=27.5% | lr=1.09e-05 | pos=64
|
| 915 |
+
Step 82500 | epoch 2/2 | loss=3.1086 | avg=3.9286 | acc=35.3% | lr=1.09e-05 | pos=64
|
| 916 |
+
Step 82600 | epoch 2/2 | loss=4.4417 | avg=3.9238 | acc=16.2% | lr=1.09e-05 | pos=64
|
| 917 |
+
Step 82700 | epoch 2/2 | loss=2.7371 | avg=3.9209 | acc=37.8% | lr=1.08e-05 | pos=64
|
| 918 |
+
Step 82800 | epoch 2/2 | loss=2.6301 | avg=3.9065 | acc=36.9% | lr=1.08e-05 | pos=64
|
| 919 |
+
Step 82900 | epoch 2/2 | loss=2.8479 | avg=3.9139 | acc=33.8% | lr=1.08e-05 | pos=64
|
| 920 |
+
Step 83000 | epoch 2/2 | loss=4.6181 | avg=3.9240 | acc=19.7% | lr=1.07e-05 | pos=64
|
| 921 |
+
Step 83100 | epoch 2/2 | loss=4.5272 | avg=3.8913 | acc=16.9% | lr=1.07e-05 | pos=64
|
| 922 |
+
Step 83200 | epoch 2/2 | loss=5.2323 | avg=3.9357 | acc=18.1% | lr=1.07e-05 | pos=64
|
| 923 |
+
Step 83300 | epoch 2/2 | loss=3.5506 | avg=3.9637 | acc=31.6% | lr=1.06e-05 | pos=64
|
| 924 |
+
Step 83400 | epoch 2/2 | loss=4.3895 | avg=3.9204 | acc=20.6% | lr=1.06e-05 | pos=64
|
| 925 |
+
Step 83500 | epoch 2/2 | loss=5.3779 | avg=3.9227 | acc=10.6% | lr=1.06e-05 | pos=64
|
| 926 |
+
Step 83600 | epoch 2/2 | loss=3.8845 | avg=3.9226 | acc=22.5% | lr=1.05e-05 | pos=64
|
| 927 |
+
Step 83700 | epoch 2/2 | loss=3.4041 | avg=3.9086 | acc=33.8% | lr=1.05e-05 | pos=64
|
| 928 |
+
Step 83800 | epoch 2/2 | loss=5.1687 | avg=3.9190 | acc=11.9% | lr=1.05e-05 | pos=64
|
| 929 |
+
Step 83900 | epoch 2/2 | loss=3.3404 | avg=3.9041 | acc=31.9% | lr=1.04e-05 | pos=64
|
| 930 |
+
Step 84000 | epoch 2/2 | loss=3.6246 | avg=3.9078 | acc=27.8% | lr=1.04e-05 | pos=64
|
| 931 |
+
Step 84100 | epoch 2/2 | loss=3.4813 | avg=3.9357 | acc=29.7% | lr=1.04e-05 | pos=64
|
| 932 |
+
Step 84200 | epoch 2/2 | loss=4.8564 | avg=3.9044 | acc=13.8% | lr=1.04e-05 | pos=64
|
| 933 |
+
Step 84300 | epoch 2/2 | loss=2.3271 | avg=3.9418 | acc=46.9% | lr=1.03e-05 | pos=64
|
| 934 |
+
Step 84400 | epoch 2/2 | loss=2.5052 | avg=3.9461 | acc=43.4% | lr=1.03e-05 | pos=64
|
| 935 |
+
Step 84500 | epoch 2/2 | loss=2.7013 | avg=3.9402 | acc=41.9% | lr=1.03e-05 | pos=64
|
| 936 |
+
Step 84600 | epoch 2/2 | loss=4.3838 | avg=3.9605 | acc=23.8% | lr=1.03e-05 | pos=64
|
| 937 |
+
Step 84700 | epoch 2/2 | loss=3.0412 | avg=3.9686 | acc=32.5% | lr=1.03e-05 | pos=64
|
| 938 |
+
Step 84800 | epoch 2/2 | loss=3.4061 | avg=3.9932 | acc=29.4% | lr=1.02e-05 | pos=64
|
| 939 |
+
Step 84900 | epoch 2/2 | loss=3.2795 | avg=3.9865 | acc=27.5% | lr=1.02e-05 | pos=64
|
| 940 |
+
Step 85000 | epoch 2/2 | loss=4.0844 | avg=3.9457 | acc=21.9% | lr=1.02e-05 | pos=64
|
| 941 |
+
[Checkpoint] Saved step 85000 (loss=4.0844) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
|
| 942 |
+
[Save @ step 85000] loss=4.0844
|
| 943 |
+
[Checkpoint] Saved step 85000 (loss=4.0844) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step85000.pt
|
| 944 |
+
[Prune @ step 85000] zeroed 0.0M / 1407.4M (0.0% sparsity)
|
| 945 |
+
Step 85100 | epoch 2/2 | loss=4.7806 | avg=3.9157 | acc=15.3% | lr=1.02e-05 | pos=64
|
| 946 |
+
Step 85200 | epoch 2/2 | loss=4.5346 | avg=3.9292 | acc=18.8% | lr=1.02e-05 | pos=64
|
| 947 |
+
Step 85300 | epoch 2/2 | loss=5.6210 | avg=3.9368 | acc=13.4% | lr=1.02e-05 | pos=64
|
| 948 |
+
Step 85400 | epoch 2/2 | loss=4.9908 | avg=3.9581 | acc=14.1% | lr=1.01e-05 | pos=64
|
| 949 |
+
Step 85500 | epoch 2/2 | loss=3.1967 | avg=3.9444 | acc=29.1% | lr=1.01e-05 | pos=64
|
| 950 |
+
Step 85600 | epoch 2/2 | loss=3.4585 | avg=3.9472 | acc=31.9% | lr=1.01e-05 | pos=64
|
| 951 |
+
Step 85700 | epoch 2/2 | loss=4.3401 | avg=3.9435 | acc=16.6% | lr=1.01e-05 | pos=64
|
| 952 |
+
Step 85800 | epoch 2/2 | loss=2.7186 | avg=3.9475 | acc=38.4% | lr=1.01e-05 | pos=64
|
| 953 |
+
Step 85900 | epoch 2/2 | loss=3.3888 | avg=3.9339 | acc=22.5% | lr=1.01e-05 | pos=64
|
| 954 |
+
Step 86000 | epoch 2/2 | loss=2.7529 | avg=3.9336 | acc=36.2% | lr=1.01e-05 | pos=64
|
| 955 |
+
Step 86100 | epoch 2/2 | loss=4.3572 | avg=4.0375 | acc=13.4% | lr=1.01e-05 | pos=64
|
| 956 |
+
Step 86200 | epoch 2/2 | loss=3.2749 | avg=3.9801 | acc=32.2% | lr=1.00e-05 | pos=64
|
| 957 |
+
Step 86300 | epoch 2/2 | loss=4.6296 | avg=3.9571 | acc=18.8% | lr=1.00e-05 | pos=64
|
| 958 |
+
Step 86400 | epoch 2/2 | loss=3.8504 | avg=3.9544 | acc=25.9% | lr=1.00e-05 | pos=64
|
| 959 |
+
Step 86500 | epoch 2/2 | loss=4.1305 | avg=3.9431 | acc=20.6% | lr=1.00e-05 | pos=64
|
| 960 |
+
Step 86600 | epoch 2/2 | loss=4.0122 | avg=3.9176 | acc=17.8% | lr=1.00e-05 | pos=64
|
| 961 |
+
Step 86700 | epoch 2/2 | loss=2.8261 | avg=3.9193 | acc=38.1% | lr=1.00e-05 | pos=64
|
| 962 |
+
Step 86800 | epoch 2/2 | loss=2.0856 | avg=3.9150 | acc=44.4% | lr=1.00e-05 | pos=64
|
| 963 |
+
Step 86900 | epoch 2/2 | loss=4.3141 | avg=3.9056 | acc=20.6% | lr=1.00e-05 | pos=64
|
| 964 |
+
Step 87000 | epoch 2/2 | loss=3.7628 | avg=3.9158 | acc=26.2% | lr=1.00e-05 | pos=64
|
| 965 |
+
Step 87100 | epoch 2/2 | loss=4.9560 | avg=3.8938 | acc=15.9% | lr=1.00e-05 | pos=64
|
| 966 |
+
Step 87200 | epoch 2/2 | loss=2.9699 | avg=3.9049 | acc=35.0% | lr=1.00e-05 | pos=64
|
| 967 |
+
--- Epoch 2/2 complete (step 87244) ---
|
| 968 |
+
[Checkpoint] Saved step 87244 (loss=3.7095) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_final.pt
|
| 969 |
+
|
| 970 |
+
============================================================
|
| 971 |
+
TRAINING COMPLETE (--no_eval, run benchmark separately)
|
| 972 |
+
============================================================
|
| 973 |
+
Training complete. Best: /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
|