Joysulem committed (verified)
Commit b5bff9c · 1 Parent(s): 2fd8602

Upload 3258 files

This view is limited to 50 files because it contains too many changes. See raw diff
Files changed (50)
  1. .gitattributes +7 -0
  2. FireEcho Engine/About FireEcho.md +33 -0
  3. FireEcho Engine/Bible Readme.txt +667 -0
  4. FireEcho Engine/__pycache__/cutlass_kernels.cpython-312.pyc +3 -0
  5. FireEcho Engine/__pycache__/dsmem_ops.cpython-312.pyc +0 -0
  6. FireEcho Engine/__pycache__/femx_storage.cpython-312.pyc +0 -0
  7. FireEcho Engine/__pycache__/fireecho_kernel.cpython-312.pyc +3 -0
  8. FireEcho Engine/__pycache__/goliath_kernel.cpython-312.pyc +3 -0
  9. FireEcho Engine/__pycache__/hebbian_finetune_demo.cpython-312.pyc +3 -0
  10. FireEcho Engine/__pycache__/triton_hebbian.cpython-312.pyc +0 -0
  11. FireEcho Engine/bench_fusion.py +39 -0
  12. FireEcho Engine/benchmark_eagle.py +231 -0
  13. FireEcho Engine/benchmark_fullstack.py +323 -0
  14. FireEcho Engine/benchmark_perplexity.py +358 -0
  15. FireEcho Engine/calibrate_fexc.py +173 -0
  16. FireEcho Engine/calibrate_fexvq.py +227 -0
  17. FireEcho Engine/csrc/cluster_launch.cpp +53 -0
  18. FireEcho Engine/csrc/cluster_launch.h +194 -0
  19. FireEcho Engine/csrc/dsmem_cluster.cuh +344 -0
  20. FireEcho Engine/csrc/femx_bindings.cpp +48 -0
  21. FireEcho Engine/csrc/femx_kernels.cu +422 -0
  22. FireEcho Engine/csrc/fireecho_preproc.cpp +54 -0
  23. FireEcho Engine/csrc/fireecho_preproc_cuda.cu +316 -0
  24. FireEcho Engine/cutlass_kernels.py +2418 -0
  25. FireEcho Engine/debug_acceptance.log +92 -0
  26. FireEcho Engine/debug_acceptance.py +152 -0
  27. FireEcho Engine/debug_bisect.log +78 -0
  28. FireEcho Engine/debug_bisect.py +149 -0
  29. FireEcho Engine/debug_d8_isolate.log +79 -0
  30. FireEcho Engine/debug_d8_isolate.py +156 -0
  31. FireEcho Engine/debug_eval_flow.log +75 -0
  32. FireEcho Engine/debug_eval_flow.py +186 -0
  33. FireEcho Engine/debug_nan_isolate.log +57 -0
  34. FireEcho Engine/debug_nan_isolate.py +174 -0
  35. FireEcho Engine/debug_promptlen.py +110 -0
  36. FireEcho Engine/debug_seqlen.py +65 -0
  37. FireEcho Engine/debug_seqlen_threshold.py +61 -0
  38. FireEcho Engine/debug_specgen_trace.py +171 -0
  39. FireEcho Engine/dsmem_ops.py +789 -0
  40. FireEcho Engine/eagle_data_codemix_cache.pt +3 -0
  41. FireEcho Engine/eagle_data_codemix_cache.pt.bak +3 -0
  42. FireEcho Engine/eagle_data_codemix_cache_old.pt +3 -0
  43. FireEcho Engine/eagle_data_selfgen_cache.pt +3 -0
  44. FireEcho Engine/eagle_data_selfgen_cache.pt.old +3 -0
  45. FireEcho Engine/eagle_precompute.log +0 -0
  46. FireEcho Engine/eagle_precompute_goddess.log +0 -0
  47. FireEcho Engine/eagle_precompute_v2.log +1220 -0
  48. FireEcho Engine/eagle_test.py +164 -0
  49. FireEcho Engine/eagle_train_d8.log +212 -0
  50. FireEcho Engine/eagle_train_goddess.log +973 -0
.gitattributes CHANGED
@@ -33,3 +33,10 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ FireEcho[[:space:]]Engine/__pycache__/cutlass_kernels.cpython-312.pyc filter=lfs diff=lfs merge=lfs -text
+ FireEcho[[:space:]]Engine/__pycache__/fireecho_kernel.cpython-312.pyc filter=lfs diff=lfs merge=lfs -text
+ FireEcho[[:space:]]Engine/__pycache__/goliath_kernel.cpython-312.pyc filter=lfs diff=lfs merge=lfs -text
+ FireEcho[[:space:]]Engine/__pycache__/hebbian_finetune_demo.cpython-312.pyc filter=lfs diff=lfs merge=lfs -text
+ FireEcho[[:space:]]Engine/eagle_data_codemix_cache.pt.bak filter=lfs diff=lfs merge=lfs -text
+ FireEcho[[:space:]]Engine/eagle_data_selfgen_cache.pt.old filter=lfs diff=lfs merge=lfs -text
+ FireEcho[[:space:]]Engine/yay/src/gopath/pkg/mod/github.com/deckarep/golang-set/v2@v2.8.0/new_improved.jpeg filter=lfs diff=lfs merge=lfs -text
FireEcho Engine/About FireEcho.md ADDED
@@ -0,0 +1,33 @@
1
+ About FireEcho:
2
+
3
+ FireEcho is not a base model. She's a work-in-progress fast engine that connects to LLMs, giving them short- and long-term memory that never forgets, so they can calculate, adapt, learn, take notes, and find new ones faster, helping humanity progress. Design goals: reduced VRAM, no accuracy loss, high speed, 0 hallucinations, 0 drift. 30B → 20 GB VRAM.
4
+
5
+ //////////////////////////// FE quantization names: ////////////////////////////
6
+
7
+ 1. FE-MX — FireEcho Mixed-Exponent (block floating point, femx_storage.py)
+ 2. FE-XC — FireEcho Xtreme Compress (codebook 2-bit, AQLM-style k-means, CodeGEMM psumbook kernel, goliath_kernel.py)
+ 3. FE-XVQ — FireEcho Xtreme Vector Quantization (VPTQ-inspired, Hessian-weighted codebooks)
+ 4. FE-XT — FireEcho Xturbo (tree speculative decoding with dynamic branch tuning, Scylla-inspired)
+ 5. FE-H — FireEcho Hayabusa (async prefetch/offload for scaling draft head layers to CPU, SP-MoE-inspired)
13
+
14
+
15
+
16
+ //////////////////////// FireEcho Quantization Stack: /////////////////////////
17
+
18
+ FE-MX = FireEcho Mixed-Exponent (adaptive precision: cold→FP4, warm→FP6, hot→FP8)
19
+ FE-XC = FireEcho Xtreme Compress (codebook 2-bit, AQLM-style)
20
+ FE-XVQ = FireEcho Xtreme Vector Quant (Hessian-weighted 2-bit codebook)
21
+ FE-XT = FireEcho Xturbo (tree speculative decoding)
22
+ FE-H = FireEcho Hayabusa (async prefetch offload)
23
+
24
+
25
+ FE-MX (Mixed-Exponent) — Adaptive block floating-point precision. Hot experts (frequently used) stay at FP8, warm at FP6, cold drop to FP4. Uses shared block exponents per group — like HDR for weights, more precision where activity demands it.
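+
+ A minimal sketch of the shared-block-exponent idea (illustrative only; the group size, mantissa width, and function name are assumptions, not the femx_storage.py API):
+
+     import torch
+
+     def block_fp_quantize(w, mant_bits=3, group=32):
+         # One shared exponent per group of 32 weights, small-int mantissas per weight.
+         # Assumes w.numel() is divisible by `group`.
+         w = w.reshape(-1, group)
+         shared_exp = torch.floor(torch.log2(w.abs().amax(dim=1, keepdim=True) + 1e-12))
+         step = 2.0 ** (shared_exp - mant_bits)                      # quantization step per group
+         mant = torch.clamp(torch.round(w / step), -(2 ** mant_bits), 2 ** mant_bits - 1)
+         return mant.to(torch.int8), shared_exp                      # store: int8 mantissas + 1 exponent/group
+
+     # Dequantize: mant.float() * 2.0 ** (shared_exp - mant_bits)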
26
+
27
+ FE-XVQ (Xtreme Vector Quant) — Hessian-weighted codebook 2-bit. Like FE-XC but uses second-order gradient info (the Hessian matrix) to learn smarter codebooks — weight groups that impact output most get more precise codebook entries. Same 2 bits/weight but better quality through calibration-data-driven optimization.
28
+
29
+ FE-XC (Xtreme Compress) — Learned codebook 2-bit quantization. Instead of crude rounding to 2-bit integers (which destroys quality), it learns 256 codebook vectors via k-means, then stores 2 uint8 indices per weight group. Same 2 bits/weight storage as INT2, but much better quality. Uses a "psumbook" trick: precomputes dot(codebook, input) once per token, then all 8 active experts just do scalar lookups instead of vector math. Result: 5.3x faster than FP4 at same bandwidth.
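+
+ A minimal single-codebook sketch of the psumbook trick (FE-XC itself uses a 2x8 additive AQLM-style layout; the sizes and names here are illustrative assumptions):
+
+     import torch
+
+     codebook = torch.randn(256, 8)                 # 256 learned 8-dim sub-vectors (k-means, offline)
+     x = torch.randn(2048)                          # one token's input activation
+
+     # Psumbook: compute dot(codebook, x_chunk) once per token...
+     x_chunks = x.view(-1, 8)                       # [256 groups, 8]
+     psum = x_chunks @ codebook.T                   # [256 groups, 256 codes]
+
+     # ...then every expert's weight row reduces to scalar lookups, no vector math.
+     codes = torch.randint(0, 256, (256,))          # code index per group for one weight row
+     y = psum[torch.arange(256), codes].sum()       # dot(w_row, x) via table lookups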
30
+
31
+ FE-XT (Xturbo) — Tree speculative decoding. Instead of predicting one token chain, the draft head explores b=4-16 branches simultaneously (like a tree). The target model verifies all branches in a single batched forward pass. Accepts the longest correct branch. Dynamic b tuning adjusts branch width based on acceptance rate (Scylla Eq.4). Target: 3-5x speedup over standard speculation.
32
+
33
+ FE-H (Hayabusa) — Async CPU offload for the draft head. When the draft head gets large (D=8-16 layers, 357M-1.2B params), it doesn't all fit in VRAM alongside the 20GB model. Hayabusa offloads deep draft layers to CPU RAM and JIT-prefetches them to GPU during the verification step (when GPU is busy with the target model anyway). Overlaps CPU→GPU transfer with GPU compute = free memory savings.
FireEcho Engine/Bible Readme.txt ADDED
@@ -0,0 +1,667 @@
1
+ ================================================================================
2
+ FIREECHO ENGINE
3
+ High-Performance Single-GPU Inference Kernel for 30B+ MoE Models
4
+ ================================================================================
5
+
6
+ Creator & Sole Author: Luis E. Davila Flores (@Joysulem)
7
+ License: CC BY-NC 4.0 (free for research, attribution required)
8
+ Status: Production-quality single-GPU decode, research extensions active
9
+
10
+ ================================================================================
11
+ WHAT IS FIREECHO?
12
+ ================================================================================
13
+
14
+ FireEcho is a custom inference engine that runs a 30 BILLION parameter
15
+ Mixture-of-Experts model (Qwen3-Omni-30B) on a SINGLE consumer GPU at
16
+ 45+ tokens/second — using only 20 GB VRAM.
17
+
18
+ No multi-GPU. No cloud. No NVIDIA proprietary libraries.
19
+ Just Triton + PyTorch + one GPU.
20
+
21
+ Key numbers:
22
+ - 30.5B total params, ~3.3B active per token (128 experts, top-8 routing)
23
+ - 4x compression via Goliath FP4 fused dequant-matmul (61 GB -> 20 GB)
24
+ - 124x speedup from baseline (0.4 -> 49.4 tok/s) through 7 optimization layers
25
+ - Zero NVIDIA proprietary dependencies (no cuQuantizer, CUTLASS, TensorRT)
26
+ - Runs anywhere Triton compiles: NVIDIA CUDA, AMD ROCm, Intel XPU
27
+
28
+ What makes FireEcho different from vLLM/TGI/llama.cpp:
29
+ - Goliath kernel: FP4 dequantization INSIDE the Triton matmul loop (no separate
30
+ dequant step, no global memory materialization)
31
+ - Packed MoE: All 128 experts packed into one contiguous buffer per layer,
32
+ expert IDs stay on GPU — zero CPU-GPU sync during decode
33
+ - FlashDecode: Custom Triton attention kernel with online softmax for M=1 GQA
34
+ - Hebbian Memory: Biologically-inspired fast weights that learn at inference time
35
+ - FE-XC/INT2: Cold experts auto-demote to 2-bit (codebook or scalar) — further
36
+ bandwidth savings without touching hot experts
37
+ - CUDA Graph decode: Entire decode step captured as a graph, ~15.8ms/step
38
+
39
+ ================================================================================
40
+ CURRENT STATUS & REALISTIC EXPECTATIONS
41
+ ================================================================================
42
+
43
+ WHAT WORKS (production-quality):
44
+ [x] Full Qwen3-Omni-30B inference at 45+ tok/s on RTX 5090
45
+ [x] Goliath FP4 quantization (20 GB VRAM, FP16-quality output)
46
+ [x] Packed MoE with fused dequant-matmul (zero CPU sync)
47
+ [x] FlashDecode attention (online softmax, valid_len masking)
48
+ [x] CUDA Graph decode (graph-captured forward pass)
49
+ [x] Flat KV cache (pre-allocated, zero allocation per token)
50
+ [x] FP8 KV cache (50% VRAM savings on attention)
51
+ [x] FE-XC cold expert demotion (codebook 2-bit, 5.3x faster kernel)
52
+ [x] INT2 ultra-cold expert demotion (scalar 2-bit)
53
+ [x] Hebbian persistent memory (learns during inference)
54
+ [x] Atlas gatekeeper (expert banning + MoDES skipping)
55
+ [x] Streaming shard loader (110s cold start, 3.1 GB CPU RAM)
56
+
57
+ WHAT'S RESEARCH/EXPERIMENTAL:
58
+ [ ] EAGLE-3 speculative decoding (infrastructure done, head needs training)
59
+ [ ] FE-XT tree speculation (code done, needs trained draft head)
60
+ [ ] FE-H Hayabusa async prefetch (code done, needs benchmarking)
61
+ [ ] Batched speculative decode (infrastructure done)
62
+ [ ] Multi-GPU (not implemented — single-GPU is the design philosophy)
63
+
64
+ WILL NOT WORK ON:
65
+ - GPUs with < 24 GB VRAM (model is 20 GB + KV cache)
66
+ - CUDA < 12.4 (BF16 atomics, FP8 support needed)
67
+ - CPU-only (Triton compiles to GPU targets)
68
+
69
+ ================================================================================
70
+ HARDWARE & SOFTWARE REQUIREMENTS
71
+ ================================================================================
72
+
73
+ Component Minimum Recommended
74
+ ───────────────── ─────────────────── ────────────────────────
75
+ GPU RTX 4090 (24 GB)* RTX 5090 (32 GB)
76
+ GPU VRAM 24 GB 32 GB
77
+ CPU Any modern x86_64 Ryzen 9 9950X / i9-14900K
78
+ System RAM 32 GB 64 GB
79
+ CUDA 12.4+ 12.8+
80
+ Python 3.10 - 3.12 3.12
81
+ PyTorch 2.4.0+ 2.6.0+cu128
82
+ Triton 3.0+ 3.2+
83
+ OS Linux (x86_64) Arch Linux / Ubuntu 22.04+
84
+
85
+ * RTX 4090: Will work but FP4 kernels may be slower (no Blackwell tensor cores)
86
+ * RTX 3090: Marginal — 24 GB VRAM is tight, FP8 not supported
87
+ * AMD GPUs: Triton compiles to ROCm — untested but architecturally supported
88
+
89
+ Tested configuration (author's machine):
90
+ AMD Ryzen 9 9950X + NVIDIA RTX 5090 32 GB + 64 GB DDR5
91
+ Arch Linux, CUDA 12.8, Python 3.12, PyTorch 2.6.0+cu128, Triton 3.2
92
+
93
+ ================================================================================
94
+ INSTALLATION
95
+ ================================================================================
96
+
97
+ Step 1: Clone the repository
98
+ ─────────────────────────────
99
+ git clone https://github.com/Joysulem/FireEcho.git
100
+ cd FireEcho
101
+
102
+ Step 2: Create a Python virtual environment
103
+ ────────────────────────────────────────────
104
+ python3.12 -m venv .venv
105
+ source .venv/bin/activate
106
+
107
+ Step 3: Install dependencies
108
+ ─────────────────────────────
109
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
110
+ pip install triton transformers tokenizers safetensors sentencepiece
111
+
112
+ Step 4: Verify installation
113
+ ────────────────────────────
114
+ python -c "import torch; print('CUDA:', torch.cuda.is_available(), '|', torch.version.cuda)"
115
+ python -c "import triton; print('Triton:', triton.__version__)"
116
+
117
+ Expected output:
118
+ CUDA: True | 12.8
119
+ Triton: 3.2.0
120
+
121
+ Step 5: Download a model (Qwen3-Omni-30B recommended)
122
+ ──────────────────────────────────────────────────────
123
+ # Option A: Via huggingface-cli
124
+ pip install huggingface-hub
125
+ huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Instruct --local-dir ./model/Qwen3-Omni
126
+
127
+ # Option B: Via git lfs
128
+ git lfs install
129
+ git clone https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct ./model/Qwen3-Omni
130
+
131
+ ================================================================================
132
+ QUICK SMOKE TEST (run this first!)
133
+ ================================================================================
134
+
135
+ cd FireEcho/kernel/FireEcho\ Engine/
136
+
137
+ python -c "
138
+ from fireecho_kernel import FireEchoEngine
139
+ import torch
140
+
141
+ # Load model (takes ~110 seconds, streams layer-by-layer)
142
+ engine = FireEchoEngine.from_pretrained('./model/Qwen3-Omni')
143
+
144
+ # Quick generation test
145
+ tokens = engine.tokenizer.encode('The capital of France is', return_tensors='pt').cuda()
146
+ output = engine.generate(tokens, max_new_tokens=20, temperature=0.0)
147
+ print(engine.tokenizer.decode(output[0]))
148
+ print(f'VRAM used: {torch.cuda.max_memory_allocated()/1e9:.1f} GB')
149
+ "
150
+
151
+ Expected output:
152
+ The capital of France is Paris. Paris is the largest city in France...
153
+ VRAM used: 23.1 GB
154
+
155
+ If this works, your setup is correct. If not, check:
156
+ - CUDA version matches PyTorch build (torch.version.cuda)
157
+ - GPU has enough VRAM (nvidia-smi)
158
+ - Model path is correct
159
+
160
+ ================================================================================
161
+ BASIC INFERENCE USAGE
162
+ ================================================================================
163
+
164
+ ─── Minimal example ───
165
+
166
+ from fireecho_kernel import FireEchoEngine
167
+
168
+ # Load model with FP4 quantization (automatic for Qwen3-Omni)
169
+ engine = FireEchoEngine.from_pretrained("/path/to/Qwen3-Omni-30B")
170
+
171
+ # Encode input
172
+ input_ids = engine.tokenizer.encode(
173
+ "Explain quantum computing in simple terms",
174
+ return_tensors='pt'
175
+ ).cuda()
176
+
177
+ # Generate
178
+ output = engine.generate(
179
+ input_ids,
180
+ max_new_tokens=200,
181
+ temperature=0.7,
182
+ top_p=0.9,
183
+ use_cache=True
184
+ )
185
+
186
+ # Decode and print
187
+ print(engine.tokenizer.decode(output[0], skip_special_tokens=True))
188
+
189
+
190
+ ─── High-performance decode (all optimizations) ───
191
+
192
+ engine = FireEchoEngine.from_pretrained("/path/to/Qwen3-Omni-30B")
193
+
194
+ # Enable flat KV cache (eliminates torch.cat overhead)
195
+ engine.enable_flat_decode() # +403 MB VRAM, BF16 KV
196
+
197
+ # Or FP8 KV cache (half the VRAM, same speed)
198
+ engine.enable_flat_decode(kv_dtype='fp8') # +208 MB VRAM
199
+
200
+ # Enable CUDA Graph decode (captures forward pass as graph)
201
+ engine.enable_cuda_graph_decode() # +0 VRAM, ~10% faster
202
+
203
+ # Enable Atlas gatekeeper (prunes cold experts at runtime)
204
+ engine.enable_atlas(
205
+ profile_prompts=8,
206
+ ban_pct=0.25, # Ban bottom 25% of experts per layer
207
+ modes_threshold=2.0 # MoDES: skip MoE for uncertain tokens
208
+ )
209
+
210
+ # Enable FE-XC cold expert demotion (2-bit codebook)
211
+ engine.enable_auto_fexc_demotion(cold_threshold=0.10)
212
+
213
+ # Enable INT2 ultra-cold expert demotion
214
+ engine.enable_auto_int2_demotion(cold_threshold=0.05)
215
+
216
+ # Generate with everything enabled
217
+ output = engine.generate(input_ids, max_new_tokens=500)
218
+
219
+
220
+ ─── Interactive chat loop ───
221
+
222
+ engine = FireEchoEngine.from_pretrained("/path/to/Qwen3-Omni-30B")
223
+ engine.enable_flat_decode()
224
+ engine.enable_cuda_graph_decode()
225
+
226
+ print("FireEcho Chat (type 'quit' to exit)")
227
+ while True:
228
+ user_input = input("\nYou: ")
229
+ if user_input.lower() == 'quit':
230
+ break
231
+
232
+ # Format as chat (Qwen3 format)
233
+ prompt = f"<|im_start|>user\n{user_input}<|im_end|>\n<|im_start|>assistant\n"
234
+ input_ids = engine.tokenizer.encode(prompt, return_tensors='pt').cuda()
235
+
236
+ output = engine.generate(
237
+ input_ids,
238
+ max_new_tokens=500,
239
+ temperature=0.7,
240
+ top_p=0.9
241
+ )
242
+
243
+ response = engine.tokenizer.decode(
244
+ output[0][input_ids.shape[1]:],
245
+ skip_special_tokens=True
246
+ )
247
+ print(f"\nFireEcho: {response}")
248
+
249
+ ================================================================================
250
+ BENCHMARKING
251
+ ================================================================================
252
+
253
+ ─── Quick speed test ───
254
+
255
+ python benchmark_fullstack.py
256
+
257
+ This runs 7 optimization layers, stacking each one:
258
+ L0: Baseline (FP4 + packed MoE + flat KV BF16) ~45 tok/s
259
+ L1: + FP8 KV cache ~42 tok/s
260
+ L2: + L2 layer prefetch ~42 tok/s
261
+ L3: + Atlas Ban & Pick (8->~5 experts) ~40 tok/s
262
+ L4: + FE-XC cold experts (2-bit codebook) ~39 tok/s
263
+ L5: + INT2 coldest experts (2-bit scalar) ~38 tok/s
264
+ L6: + CUDA Graph decode ~TBD
265
+
266
+ Note: L1-L5 are slightly slower than L0 due to overhead from
267
+ additional dispatch logic. The REAL benefit comes when combined
268
+ with speculative decoding (EAGLE-3) — the bandwidth savings from
269
+ FE-XC/INT2 allow more tokens to be verified per unit time.
270
+
271
+
272
+ ─── EAGLE-3 benchmark (speculative decode) ───
273
+
274
+ python benchmark_eagle.py --checkpoint eagle_checkpoints/eagle_best.pt
275
+
276
+ Note: Requires a trained draft head. See "EAGLE-3 Training" section.
277
+
278
+ ================================================================================
279
+ FEATURE REFERENCE (Cheat Sheet)
280
+ ================================================================================
281
+
282
+ Feature How to enable VRAM cost
283
+ ─────────────────────── ───────────────────────────────── ──────────
284
+ Flat KV cache (BF16) engine.enable_flat_decode() +403 MB
285
+ Flat KV cache (FP8) engine.enable_flat_decode('fp8') +208 MB
286
+ CUDA Graph decode engine.enable_cuda_graph_decode() ~0
287
+ Atlas gatekeeper engine.enable_atlas() ~0
288
+ FE-XC cold demotion engine.enable_auto_fexc_demotion() ~0*
289
+ INT2 cold demotion engine.enable_auto_int2_demotion() ~0*
290
+ L2 layer prefetch engine.enable_l2_prefetch() ~0
291
+ Hebbian memory engine.enable_hebbian() +50 MB
292
+ EAGLE-3 speculation engine.enable_eagle(checkpoint) +200 MB
293
+
294
+ * FE-XC/INT2 actually SAVES VRAM by compressing cold expert weights
295
+
296
+ Quantization formats available:
297
+ - Goliath FP4: 4-bit fused dequant (default for MoE experts)
298
+ - Goliath FP8: 8-bit fused dequant (optional for attention)
299
+ - Goliath INT2: 2-bit scalar quantization (coldest experts)
300
+ - FE-XC: 2-bit codebook (2x8 AQLM-style, near-FP16 quality)
301
+ - FE-XVQ: Hessian-weighted 2-bit codebook (VPTQ-inspired)
302
+ - FE-MX: Block floating point (FEMX4/FEMX6/FEMX8 for Hebbian)
303
+
304
+ ================================================================================
305
+ HOW THE ENGINE WORKS (Architecture Overview)
306
+ ================================================================================
307
+
308
+ FireEcho loads a model and replaces standard PyTorch operations with
309
+ custom Triton kernels at every level:
310
+
311
+ 1. LOADING (from_pretrained)
312
+ - Streams model shards one layer at a time (3.1 GB CPU RAM peak)
313
+ - Quantizes each layer to Goliath FP4 on GPU as it loads
314
+ - Packs all 128 MoE experts into contiguous buffers per layer
315
+ - Total: 61 GB BF16 -> 20 GB FP4 in 110 seconds
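+
+ A minimal sketch of the streaming pattern (illustrative, not the engine's loader; only
+ safetensors is assumed as a dependency, and quantize_fn is a placeholder for the
+ Goliath FP4 packing step):
+
+     import torch
+     from safetensors.torch import load_file
+
+     def stream_quantize(shard_paths, quantize_fn):
+         for path in shard_paths:
+             shard = load_file(path)                   # one shard on CPU at a time
+             for name, w in shard.items():
+                 w_gpu = w.cuda(non_blocking=True)     # BF16 tensor to GPU
+                 yield name, quantize_fn(w_gpu)        # pack to FP4, keep only the packed copy
+                 del w_gpu                             # free the BF16 copy before the next tensor
+             del shard                                 # keep CPU RAM bounded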
316
+
317
+ 2. PREFILL (processing the input prompt)
318
+ - Standard attention + MoE forward pass
319
+ - Uses FlashAttention-style Triton kernel for long sequences
320
+ - Builds KV cache for all layers
321
+
322
+ 3. DECODE (generating tokens one at a time)
323
+ - Each token goes through 48 transformer layers:
324
+
325
+ For each layer:
326
+ a) RMSNorm
327
+ b) Attention: Q/K/V projection (BF16 matmul) -> RoPE -> FlashDecode
328
+ (custom Triton kernel, M=1, online softmax, reads only valid KV; sketched below, after this list)
329
+ c) RMSNorm
330
+ d) MoE Router: softmax over 128 experts -> top-8 selection
331
+ e) Expert FFN: Goliath FP4 packed matmul (gate_up + down)
332
+ - Hot experts: FP4 (highest quality)
333
+ - Cold experts: FE-XC 2-bit codebook (5.3x faster kernel)
334
+ - Coldest experts: INT2 2-bit scalar
335
+ f) Residual connection
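+
+ The FlashDecode step (b) above relies on an online (streaming) softmax. A minimal
+ single-query PyTorch sketch of that idea (block size and names are illustrative;
+ the real kernel is Triton and also handles GQA and valid_len masking):
+
+     import torch
+
+     def online_softmax_attn(q, K, V, block=256):
+         # q: [d], K/V: [T, d]. Streams the KV cache in blocks, never forming the full softmax.
+         m = torch.tensor(float('-inf'))           # running max
+         l = torch.tensor(0.0)                     # running normalizer
+         acc = torch.zeros_like(q)                 # running weighted sum of V
+         for s in range(0, K.shape[0], block):
+             scores = K[s:s+block] @ q / q.shape[0] ** 0.5
+             m_new = torch.maximum(m, scores.max())
+             p = torch.exp(scores - m_new)
+             corr = torch.exp(m - m_new)           # rescale old accumulator to the new max
+             l = l * corr + p.sum()
+             acc = acc * corr + p @ V[s:s+block]
+             m = m_new
+         return acc / l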
336
+
337
+ - With CUDA Graph: entire 48-layer forward captured as one graph
338
+ launch -> ~15.8ms per token
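+
+ The capture/replay pattern follows the standard PyTorch CUDA Graph API; a minimal
+ self-contained sketch (the Linear layer stands in for the real 48-layer decode step,
+ which is why only fixed-shape decode is captured, never prefill):
+
+     import torch
+
+     model = torch.nn.Linear(2048, 2048, device='cuda')   # stand-in for one decode forward
+     static_in = torch.zeros(1, 2048, device='cuda')      # fixed-shape input buffer
+     g = torch.cuda.CUDAGraph()
+
+     # Warm up on a side stream before capture (required by PyTorch).
+     s = torch.cuda.Stream()
+     s.wait_stream(torch.cuda.current_stream())
+     with torch.cuda.stream(s):
+         static_out = model(static_in)
+     torch.cuda.current_stream().wait_stream(s)
+
+     with torch.cuda.graph(g):
+         static_out = model(static_in)                     # capture one fixed-shape forward
+
+     # Per decode step: refill the static buffer, replay the captured work in one launch.
+     static_in.copy_(torch.randn(1, 2048, device='cuda'))
+     g.replay()                                            # static_out now holds the new result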
339
+
340
+ 4. SPECULATIVE DECODE (EAGLE-3, when draft head is trained)
341
+ - Draft head predicts next K tokens (K=5 default)
342
+ - Target model verifies all K+1 tokens in one forward pass
343
+ - Accepts matching tokens, rejects and rolls back on mismatch
344
+ - Expected: 3-5x speedup with 70%+ acceptance rate
345
+
346
+ Memory layout during decode:
347
+ ┌──────────────────────────────────────────────────┐
348
+ │ GPU VRAM (32 GB total) │
349
+ ├──────────────────────────────────────────────────┤
350
+ │ Model weights (FP4 quantized) 19.6 GB │
351
+ │ KV cache (flat, FP8) 0.2 GB │
352
+ │ Hebbian memory 0.05 GB │
353
+ │ CUDA Graph buffers 0.1 GB │
354
+ │ Activations + workspace 1.0 GB │
355
+ │ ───────────────────────────────────────────── │
356
+ │ Total ~21.0 GB │
357
+ │ Free ~11.0 GB │
358
+ └──────────────────────────────────────────────────┘
359
+
360
+ ================================================================================
361
+ FILE STRUCTURE
362
+ ================================================================================
363
+
364
+ FireEcho Engine/
365
+ ├── fireecho_kernel.py Main engine (9000+ lines)
366
+ │ - FireEchoEngine: load, generate, speculate
367
+ │ - FireEchoConfig: model configuration
368
+ │ - MoEFFN: mixture-of-experts with packed dispatch
369
+ │ - HebbianMemory: biologically-inspired fast weights
370
+ │ - FireEchoEagleHead: EAGLE-3 draft head
371
+ │ - FlashDecode Triton kernel
372
+ │ - CUDA Graph capture/replay
373
+
374
+ ├── goliath_kernel.py Quantized GEMM kernels (3000+ lines)
375
+ │ - GoliathFP4Weights: FP4 fused dequant-matmul
376
+ │ - GoliathFP8Weights: FP8 fused dequant-matmul
377
+ │ - GoliathINT2Weights: INT2 scalar quantization
378
+ │ - GoliathFEXCWeights: FE-XC codebook 2-bit
379
+ │ - GoliathFEXVQWeights: Hessian-weighted codebook
380
+ │ - Packed MoE kernels (FP4, INT2, FE-XC)
381
+ │ - Fused SwiGLU+Down kernel
382
+ │ - GoliathQuantumLinear (training)
383
+
384
+ ├── triton_hebbian.py Fused Triton kernels for Hebbian memory
385
+ │ - fused_competition, fused_soft_hebbian
386
+ │ - fused_traces_update, fused_gate_output
387
+
388
+ ├── femx_storage.py FE-MX block floating point storage
389
+ │ - FEMX2, FEMX4, FEMX6, FEMX8 tiers
390
+ │ - Stochastic rounding, age-adaptive precision
391
+
392
+ ├── persistent_memory.py AGI-like persistent memory
393
+ │ - EpisodicLog: raw experience buffer
394
+ │ - SemanticJournal: compressed knowledge
395
+ │ - ReflectionEngine: self-evaluation
396
+
397
+ ├── benchmark_fullstack.py Full-stack benchmark (L0-L6)
398
+ ├── benchmark_eagle.py EAGLE-3 speculative decode benchmark
399
+ ├── train_eagle_head.py EAGLE-3 draft head training script
400
+ └── calibrate_fexc.py FE-XC codebook calibration
401
+
402
+ ================================================================================
403
+ THE GOLIATH KERNEL (What Makes It Fast)
404
+ ================================================================================
405
+
406
+ Standard quantized inference:
407
+ 1. Load FP4 weights from VRAM
408
+ 2. Dequantize to BF16 in global memory (writes 61 GB!)
409
+ 3. Run matmul on the BF16 weights
410
+ Problem: Step 2 doubles memory traffic and VRAM usage
411
+
412
+ Goliath approach:
413
+ 1. Load FP4 weights directly into Triton registers
414
+ 2. Dequantize INSIDE the matmul tile loop (in registers, zero global write)
415
+ 3. Accumulate in FP32
416
+ Problem: None. This is strictly better.
417
+
418
+ Code path (simplified):
419
+ for k_block in range(0, K, BLOCK_K):
+     # Load FP4 packed bytes (2 values per byte)
+     w_packed = tl.load(weight_ptr + offsets)
+
+     # Dequantize in-register
+     w_lo = (w_packed & 0xF).to(tl.float32) * scale   # low nibble
+     w_hi = (w_packed >> 4).to(tl.float32) * scale    # high nibble
+     w_tile = tl.interleave(w_lo, w_hi)               # reassemble tile (simplified; real kernel handles the packed layout)
+
+     # Matmul tile (tensor core)
+     acc += tl.dot(a_tile, w_tile)
429
+
430
+ Result: 4x less memory traffic, same numerical quality.
431
+
432
+ Packed MoE:
433
+ Standard approach: Loop over 8 active experts, one matmul each = 16 kernel
434
+ launches per layer (gate_up + down per expert).
435
+
436
+ Goliath Packed MoE: All 128 experts packed into one [128, K//2, N] buffer.
437
+ Single kernel launch reads expert_id from GPU tensor, indexes into buffer.
438
+ Result: 2 kernel launches per layer (gate_up + down), expert selection
439
+ stays entirely on GPU.
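+
+ Conceptual sketch of the packed layout at the PyTorch level (the real kernel does the
+ gather plus FP4 dequant inside Triton; sizes and names here are illustrative assumptions):
+
+     import torch
+
+     E, K, N = 128, 2048, 768                                 # experts, in-dim, per-expert out-dim
+     packed_w = torch.randn(E, K, N, device='cuda', dtype=torch.bfloat16)        # one contiguous buffer
+
+     x = torch.randn(1, K, device='cuda', dtype=torch.bfloat16)                  # single decode token
+     expert_ids = torch.tensor([3, 17, 42, 99, 5, 61, 70, 120], device='cuda')   # top-8, stays on GPU
+
+     # One gather + one batched matmul instead of 8 separate expert calls;
+     # no .item()/.tolist() on expert_ids, so no CPU-GPU sync during decode.
+     w_sel = packed_w[expert_ids]                              # [8, K, N]
+     y = torch.einsum('bk,ekn->en', x, w_sel)                  # [8, N] per-expert outputs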
440
+
441
+ ================================================================================
442
+ HEBBIAN MEMORY (What Makes It Smart)
443
+ ================================================================================
444
+
445
+ Standard LLMs: Frozen weights after training. Context window is the only memory.
446
+
447
+ FireEcho Hebbian Memory:
448
+ - Fast weights that update DURING inference (no backpropagation)
449
+ - Inspired by biological synaptic plasticity (Hebb's rule: "neurons that
450
+ fire together wire together")
451
+ - Stores patterns from the current conversation
452
+ - Retrieves relevant patterns to augment generation
453
+
454
+ How it works:
455
+ 1. Input token embedding is projected to query/key/value
456
+ 2. Query matches against stored memory slots (competitive retrieval)
457
+ 3. Top-K most relevant memories are retrieved
458
+ 4. Retrieved context is mixed with transformer hidden state
459
+ 5. Memory slots are updated via Hebbian learning rule
460
+
461
+ Updates use:
462
+ - Soft competitive learning (winner-take-most)
463
+ - Three-factor STDP (spike-timing dependent plasticity)
464
+ - Intrinsic plasticity (per-slot gain adaptation)
465
+ - PMI correction (pointwise mutual information bias)
466
+ - GHA decorrelation (prevent redundant memories)
467
+ - Kappa switching (amplified encoding for novel patterns)
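+
+ A minimal sketch of the core fast-weight update (a plain outer-product-style Hebbian
+ write with decay; the real HebbianMemory adds the competitive/STDP/GHA terms listed
+ above, and every name below is an illustrative assumption):
+
+     import torch
+
+     d_model, n_slots = 2048, 256
+     keys = torch.zeros(n_slots, d_model)            # memory keys (fast weights)
+     vals = torch.zeros(n_slots, d_model)            # memory values
+
+     def hebbian_step(h, lr=0.1, decay=0.995, top_k=4):
+         # h: [d_model] hidden state for the current token.
+         scores = keys @ h                            # retrieval scores against all slots
+         idx = scores.topk(top_k).indices             # winner-take-most slots
+         retrieved = vals[idx].mean(dim=0)            # context mixed back into the hidden state
+         # Hebbian write: co-active pre/post activity strengthens the selected slots.
+         keys[idx] = decay * keys[idx] + lr * h
+         vals[idx] = decay * vals[idx] + lr * h
+         return retrieved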
468
+
469
+ Enable:
470
+ engine.enable_hebbian()
471
+
472
+ The memory persists within a session and can be saved/loaded:
473
+ engine.save_persistent_memory("memory.pt")
474
+ engine.load_persistent_memory("memory.pt")
475
+
476
+ ================================================================================
477
+ COMPRESSION STACK (Why 30B Fits in 20 GB)
478
+ ================================================================================
479
+
480
+ Level Format Bits Compression Quality Used For
481
+ ────── ───────── ──── ─────────── ──────────── ────────────────
482
+ Base BF16 16 1x Perfect Attention Q/K/V/O
483
+ Hot Goliath 4 4x Near-perfect Active MoE experts
484
+ FP4
485
+ Cold FE-XC 2 8x Very good Rarely-used experts
486
+ (codebook)
487
+ Coldest INT2 2 8x Acceptable Least-used experts
488
+ (scalar)
489
+
490
+ Combined with MoE sparsity (8/128 active = 6.25%):
491
+ Effective model size per token:
492
+ Attention: 8 × (4 projections × 2048 × 128 × 2 bytes) = 16 MB
493
+ MoE: 8 experts × 3 projections × 768 × 2048 × 0.5 bytes = 18.9 MB
494
+ Other: embeddings, norms, router = ~13 MB
495
+ Total per token: ~48 MB
496
+
497
+ RTX 5090 bandwidth: 1.79 TB/s
498
+ Theoretical max: 1,790,000 MB/s / 48 MB = 37,291 tok/s (memory-bandwidth ceiling)
499
+ Practical (30% utilization): ~45 tok/s (memory-bound, current result)
500
+
501
+ With FE-XC/INT2 cold experts replacing 80%+ of inactive expert weights:
502
+ MoE bandwidth: 18.9 MB * 0.5 (half are 2-bit) = ~10 MB
503
+ Total per token: ~39 MB
504
+ At 30% utilization: ~55 tok/s
505
+
506
+ With EAGLE-3 (70% acceptance, K=5 draft):
507
+ Effective throughput: 55 * 3.5 (average accepted tokens per verify) = ~193 tok/s
508
+
509
+ ================================================================================
510
+ EAGLE-3 SPECULATIVE DECODING
511
+ ================================================================================
512
+
513
+ EAGLE-3 is a draft-then-verify acceleration technique:
514
+
515
+ Normal decode: 1 token per forward pass through 48 MoE layers
516
+ EAGLE-3: Draft head predicts 5 tokens cheaply, target model verifies all 6
517
+ in one forward pass. If 4/5 match -> 5 tokens for the cost of ~2.
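+
+ A minimal greedy accept-loop sketch of the idea (plain PyTorch pseudocode, not the
+ engine's speculative_generate; draft_model and target_model are placeholder callables
+ returning next-token logits for every position):
+
+     import torch
+
+     def speculative_step(target_model, draft_model, ids, K=5):
+         # 1) Draft K tokens cheaply with the small head.
+         draft = ids
+         for _ in range(K):
+             nxt = draft_model(draft)[:, -1].argmax(-1, keepdim=True)
+             draft = torch.cat([draft, nxt], dim=-1)
+
+         # 2) Verify all K drafted positions in ONE target forward pass.
+         logits = target_model(draft)                        # [1, len, vocab]
+         target_pred = logits[:, -K-1:-1].argmax(-1)         # target's choice at each drafted slot
+
+         # 3) Accept the longest matching prefix, then take one bonus token from the target.
+         drafted = draft[:, -K:]
+         n_accept = int((target_pred == drafted)[0].int().cumprod(0).sum())
+         bonus = logits[:, ids.shape[1] - 1 + n_accept].argmax(-1, keepdim=True)
+         return torch.cat([ids, drafted[:, :n_accept], bonus], dim=-1)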
518
+
519
+ Architecture of draft head:
520
+ - Takes hidden states from layers 8, 24, 47 + token embedding
521
+ - Compresses via FC layer (8192 -> 2048)
522
+ - Passes through D transformer layers (D=2 to D=50)
523
+ - Shares lm_head with target model
524
+ - Total params: 115M (D=2) to 2.12B (D=50)
525
+
526
+ Training (flag meanings: --offline uses precomputed hidden states; --num_head_layers is
+ the draft head depth D; --draft_depth is the number of draft steps K; --batch_positions
+ runs batched M=64 positions, ~10x faster; --use_quantum_linear uses the Goliath FP8
+ forward with Quantum Gold backward; --compile applies torch.compile to the head):
+
+ python train_eagle_head.py \
+     --offline \
+     --num_head_layers 50 \
+     --draft_depth 5 \
+     --lr 5e-4 \
+     --epochs 5 \
+     --loss_type ce \
+     --batch_positions \
+     --use_quantum_linear \
+     --compile
537
+
538
+ Usage after training:
539
+ engine.enable_eagle("eagle_checkpoints/eagle_best.pt")
540
+ output = engine.speculative_generate(input_ids, max_new_tokens=500)
541
+
542
+ ================================================================================
543
+ SPEED OPTIMIZATION HISTORY
544
+ ================================================================================
545
+
546
+ Step Optimization tok/s Speedup
547
+ ─���── ──────────────────────────────────────── ────── ───────
548
+ 0 Baseline (128-expert Python loop) 0.4 1x
549
+ 1 Grouped dispatch + TF32 + Triton autotune 7.7 19x
550
+ 2 Fused gate_up_proj (2->1 matmul/expert) 9.5 24x
551
+ 3 Single-token decode fast path 12.6 32x
552
+ 4 Multi-expert Goliath kernel (2 launches) 18.8 47x
553
+ 5 Packed MoE (contiguous buffer, GPU IDs) 30.8 77x
554
+ 6 Flat decode KV cache (zero torch.cat) 40.9 102x
555
+ 7 CUDA Graph + FlashDecode 49.4 124x
556
+
557
+ Where the time goes at 45 tok/s (22ms per token):
558
+ Attention (FlashDecode): 0.28ms/layer x 48 = 13.4ms (61%)
559
+ MoE (Goliath FP4): 0.17ms/layer x 48 = 8.2ms (37%)
560
+ Other (norms, router): 0.4ms (2%)
561
+
562
+ ================================================================================
563
+ KNOWN LIMITATIONS & GOTCHAS
564
+ ================================================================================
565
+
566
+ - Single-GPU only (by design — multi-GPU adds complexity for marginal gain)
567
+ - Minimum 24 GB VRAM (model alone is 20 GB)
568
+ - FP4 quantization has ~0.05-0.15 relative error vs BF16 (negligible in practice)
569
+ - First 10+ forward passes are slow (Triton kernel compilation/autotuning)
570
+ - CUDA Graph capture requires fixed tensor shapes (only decode, not prefill)
571
+ - Hebbian memory adds ~50 MB VRAM and slight latency
572
+ - FE-XC codebook learning takes 1-2 minutes on first enable
573
+ - No pip package yet (source install only)
574
+ - Tested primarily on RTX 5090 — other GPUs may need Triton autotune re-run
575
+ - MoDES expert skipping can hurt quality if threshold is too aggressive
576
+
577
+ ================================================================================
578
+ TROUBLESHOOTING
579
+ ================================================================================
580
+
581
+ Problem: "CUDA out of memory"
582
+ Fix: Check nvidia-smi for other processes using VRAM. Kill them.
583
+ Or reduce max_kv_blocks in config (default 256 = 4K tokens = 3.1 GB).
584
+
585
+ Problem: Very slow first few generations
586
+ Fix: Normal — Triton is compiling and autotuning kernels. Wait ~10 forward
587
+ passes for warmup. Subsequent runs use cached kernels.
588
+
589
+ Problem: "No module named 'triton'"
590
+ Fix: pip install triton (requires CUDA toolkit installed)
591
+
592
+ Problem: "RuntimeError: Triton compilation failed"
593
+ Fix: Check CUDA version matches PyTorch: python -c "import torch; print(torch.version.cuda)"
594
+ Triton 3.0+ needs CUDA 12.0+.
595
+
596
+ Problem: NaN in output
597
+ Fix: Check if using prefill with >20 tokens (packed MoE kernel needs 3D grid).
598
+ This was a fixed bug — update to latest code.
599
+
600
+ Problem: CUDA Graph capture crashes
601
+ Fix: Atlas .item() calls conflict with graph capture. The engine auto-skips
602
+ these during capture (fixed). Update to latest code.
603
+
604
+ ================================================================================
605
+ RESEARCH PAPERS & REFERENCES
606
+ ================================================================================
607
+
608
+ FireEcho builds on ideas from:
609
+
610
+ Quantization:
611
+ - AQLM (arxiv 2401.06118): Additive quantization for LLMs -> FE-XC codebook
612
+ - VPTQ (Hessian-weighted): Second-order optimal codebooks -> FE-XVQ
613
+ - FP4 Training (arxiv 2501.17116): Gradient flow through FP4
614
+
615
+ Speculative Decoding:
616
+ - EAGLE-3 (Li et al.): Draft-then-verify with shared lm_head
617
+ - Scylla (arxiv 2505.07858): Tree-based multi-candidate verification -> FE-XT
618
+ - Medusa: Multi-head parallel drafting
619
+
620
+ MoE Optimization:
621
+ - SP-MoE (arxiv 2510.10302): Async expert prefetch -> FE-H Hayabusa
622
+ - MoE-Inference-Bench: Expert sizing analysis
623
+
624
+ Hebbian/Neuroscience:
625
+ - Lansner BCPNN: Bayesian confidence propagation neural networks
626
+ - Triesch 2005: Intrinsic plasticity
627
+ - Sanger's GHA: Generalized Hebbian algorithm
628
+ - McClelland et al. 1995: Complementary learning systems
629
+
630
+ Tensor Decomposition:
631
+ - MPS/TT decomposition: Quantum-inspired weight compression
632
+
633
+ ================================================================================
634
+ WHERE TO GET HELP
635
+ ================================================================================
636
+
637
+ GitHub Issues: https://github.com/Joysulem/FireEcho/issues
638
+ Include: GPU model, CUDA version, PyTorch version, full error traceback
639
+
640
+ X / Twitter: @Joysulem
641
+ Tag me with questions, benchmarks, or usage reports
642
+
643
+ Email: (floresluise1988@gmail.com)
644
+
645
+ ================================================================================
646
+ LICENSE
647
+ ================================================================================
648
+
649
+ Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
650
+
651
+ You are free to:
652
+ - Share: copy and redistribute the material in any medium or format
653
+ - Adapt: remix, transform, and build upon the material
654
+
655
+ Under the following terms:
656
+ - Attribution: You must give appropriate credit to Luis E. Davila Flores,
657
+ provide a link to the license, and indicate if changes were made.
658
+ - NonCommercial: You may not use the material for commercial purposes.
659
+
660
+ Full license: https://creativecommons.org/licenses/by-nc/4.0/
661
+
662
+ For commercial licensing inquiries, contact: @Joysulem on X/Twitter
663
+
664
+ ================================================================================
665
+ FireEcho Engine — Created by Luis E. Davila Flores
666
+ "One GPU. One file. One import. Full pipeline."
667
+ ================================================================================
FireEcho Engine/__pycache__/cutlass_kernels.cpython-312.pyc ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a4baeab19c5823d68cfa41ebbb0754cf7aeedc546d25247acdfcef8b75c5c383
3
+ size 104083
FireEcho Engine/__pycache__/dsmem_ops.cpython-312.pyc ADDED
Binary file (26.1 kB).
 
FireEcho Engine/__pycache__/femx_storage.cpython-312.pyc ADDED
Binary file (21.7 kB).
 
FireEcho Engine/__pycache__/fireecho_kernel.cpython-312.pyc ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:34de898847f5fd027b2726515d35b46da6c694ca651a99f827992062af8b4b7f
3
+ size 703662
FireEcho Engine/__pycache__/goliath_kernel.cpython-312.pyc ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:996c50c408ca615417071331d98d070fa0557d35ef1f63fff51792ba27ae84fb
3
+ size 126662
FireEcho Engine/__pycache__/hebbian_finetune_demo.cpython-312.pyc ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:68631075853c27682a27cc8d2d202408148f220de706cede76ddd77cf371ff84
3
+ size 148146
FireEcho Engine/__pycache__/triton_hebbian.cpython-312.pyc ADDED
Binary file (33.9 kB).
 
FireEcho Engine/bench_fusion.py ADDED
@@ -0,0 +1,39 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ FireEcho Fusion Benchmark — Goliath vs Legacy FFN
4
+ ===================================================
5
+ Part of the FireEcho Engine — Custom inference kernel for NVIDIA Blackwell
6
+ Copyright (c) 2025-2026 Echo (FireEcho Project). All rights reserved.
7
+
8
+ Quick benchmark: Goliath fusion vs legacy in FusedFFN.
9
+ """
10
+ import torch, time
11
+ from fireecho_kernel import FusedFFN, _GOLIATH_AVAILABLE, _CUTLASS_AVAILABLE
12
+
13
+ print("GPU:", torch.cuda.get_device_name(0))
14
+ print("Goliath:", _GOLIATH_AVAILABLE, " CUTLASS:", _CUTLASS_AVAILABLE)
15
+ print()
16
+
17
+ dim, ffn_dim, B, S = 4096, 11008, 4, 64
18
+ x = torch.randn(B, S, dim, device="cuda", dtype=torch.bfloat16)
19
+ warmup, iters = 10, 50
20
+ total_flops = 3 * 2 * B * S * dim * ffn_dim
21
+
22
+ for name, bits, goliath in [
23
+ ("Goliath FP4", 4, True),
24
+ ("Goliath FP8", 8, True),
25
+ ("Legacy quant", 4, False),
26
+ ("BF16 no-quant", 4, False),
27
+ ]:
28
+ use_q = name != "BF16 no-quant"
29
+ ffn = FusedFFN(dim, ffn_dim, use_nvfp4=use_q, goliath_bits=bits, use_goliath=goliath).cuda().eval()
30
+ with torch.no_grad():
31
+ for _ in range(warmup):
32
+ ffn(x)
33
+ torch.cuda.synchronize()
34
+ t0 = time.perf_counter()
35
+ for _ in range(iters):
36
+ ffn(x)
37
+ torch.cuda.synchronize()
38
+ t = (time.perf_counter() - t0) / iters
39
+ print(f" {name:16s}: {t*1000:.2f}ms ({total_flops/t/1e12:.1f} TFLOPS)")
FireEcho Engine/benchmark_eagle.py ADDED
@@ -0,0 +1,231 @@
1
+ #!/usr/bin/env python3
2
+ # =============================================================================
3
+ # Copyright (c) 2024-2026 Luis E. Davila Flores. All rights reserved.
4
+ #
5
+ # FireEcho Engine — High-Performance Inference Kernel
6
+ # Creator & Sole Author: Luis E. Davila Flores
7
+ #
8
+ # Licensed under Creative Commons Attribution-NonCommercial 4.0 International
9
+ # (CC BY-NC 4.0). You may share and adapt this work for non-commercial
10
+ # purposes with proper attribution. Full license terms:
11
+ # https://creativecommons.org/licenses/by-nc/4.0/
12
+ # =============================================================================
13
+ """
14
+ FireEcho EAGLE-3 Benchmark — Speculative vs Standard Decode
15
+ =============================================================
16
+ Part of the FireEcho Engine — Custom inference kernel for NVIDIA Blackwell
17
+ Copyright (c) 2025-2026 Echo (FireEcho Project). All rights reserved.
18
+
19
+ Benchmark EAGLE-3 speculative decoding vs standard decode.
20
+
21
+ Compares:
22
+ 1. Standard generate() (baseline tok/s)
23
+ 2. Speculative generate() with trained EAGLE head
24
+ 3. Reports acceptance rate, speedup, tok/s
25
+
26
+ Usage:
27
+ PYTHONUNBUFFERED=1 python benchmark_eagle.py [--checkpoint eagle_best.pt]
28
+ """
29
+
30
+ import sys, os, time, argparse, torch
31
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
32
+
33
+ from hebbian_finetune_demo import load_engine
34
+
35
+ MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
36
+ EAGLE_DIR = os.path.join(os.path.dirname(__file__), "eagle_checkpoints")
37
+
38
+ TEST_PROMPTS = [
39
+ "Explain the theory of general relativity in simple terms.",
40
+ "Write a Python function to find the longest palindromic substring.",
41
+ "What are the main differences between TCP and UDP protocols?",
42
+ "Describe the process of photosynthesis step by step.",
43
+ "What caused the fall of the Roman Empire?",
44
+ ]
45
+
46
+
47
+ def load_benchmark_engine():
48
+ """Load Qwen3-Omni with Goliath FP4 quantization via load_engine()."""
49
+ print("=" * 60)
50
+ print("Loading Qwen3-Omni engine...")
51
+ print("=" * 60)
52
+
53
+ engine, tokenizer, config = load_engine(
54
+ MODEL_PATH, max_seq_len=4096, device="cuda",
55
+ )
56
+ engine.pack_all_experts()
57
+ engine.kv_cache.enable_flat_decode()
58
+ engine.eval()
59
+
60
+ return engine, tokenizer
61
+
62
+
63
+ def benchmark_standard(engine, tokenizer, prompts, max_tokens=100, warmup=2):
64
+ """Benchmark standard generate()."""
65
+ print("\n" + "=" * 60)
66
+ print("Benchmark: Standard generate()")
67
+ print("=" * 60)
68
+
69
+ # Warmup
70
+ for i in range(warmup):
71
+ ids = tokenizer.encode(prompts[0], return_tensors='pt').cuda()
72
+ engine.generate(ids, max_new_tokens=20, temperature=0.0, top_k=0, top_p=1.0)
73
+ print(f" Warmup {i+1}/{warmup}")
74
+
75
+ results = []
76
+ for prompt in prompts:
77
+ input_ids = tokenizer.encode(prompt, return_tensors='pt').cuda()
78
+ prompt_len = input_ids.shape[1]
79
+
80
+ torch.cuda.synchronize()
81
+ t0 = time.perf_counter()
82
+
83
+ output = engine.generate(
84
+ input_ids, max_new_tokens=max_tokens, temperature=0.0,
85
+ top_k=0, top_p=1.0) # Pure greedy for fair comparison
86
+
87
+ torch.cuda.synchronize()
88
+ elapsed = time.perf_counter() - t0
89
+
90
+ gen_len = output.shape[1] - prompt_len
91
+ tok_s = gen_len / elapsed
92
+
93
+ text = tokenizer.decode(output[0, prompt_len:], skip_special_tokens=True)
94
+ results.append({
95
+ 'prompt': prompt[:50],
96
+ 'gen_len': gen_len,
97
+ 'elapsed': elapsed,
98
+ 'tok_s': tok_s,
99
+ })
100
+ print(f" [{gen_len:3d} tok] {tok_s:6.1f} tok/s | {prompt[:50]}...")
101
+
102
+ avg_tok_s = sum(r['tok_s'] for r in results) / len(results)
103
+ avg_gen = sum(r['gen_len'] for r in results) / len(results)
104
+ print(f"\n Standard avg: {avg_tok_s:.1f} tok/s, {avg_gen:.0f} tokens/prompt")
105
+ return avg_tok_s, results
106
+
107
+
108
+ def benchmark_speculative(engine, tokenizer, prompts, checkpoint_path,
109
+ max_tokens=100, warmup=2, draft_depth=5,
110
+ num_head_layers=2):
111
+ """Benchmark speculative generate() with EAGLE head."""
112
+ print("\n" + "=" * 60)
113
+ print(f"Benchmark: Speculative generate() (depth={draft_depth}, D={num_head_layers})")
114
+ print(f" Checkpoint: {os.path.basename(checkpoint_path)}")
115
+ print("=" * 60)
116
+
117
+ # Enable EAGLE
118
+ engine.enable_eagle(capture_layers=(8, 24, 47), draft_depth=draft_depth,
119
+ num_head_layers=num_head_layers)
120
+
121
+ # Load checkpoint to CPU first (avoid OOM from double-loading to GPU)
122
+ ckpt = torch.load(checkpoint_path, weights_only=False, map_location='cpu')
123
+ engine.eagle_head.load_state_dict(ckpt['eagle_head'], strict=False)
124
+ step = ckpt.get('step', '?')
125
+ loss = ckpt.get('loss', '?')
126
+ del ckpt # Free CPU copy immediately
127
+ print(f" Loaded step {step}, loss={loss}")
128
+
129
+ # Warmup (also warms Triton kernels)
130
+ for i in range(warmup):
131
+ ids = tokenizer.encode(prompts[0], return_tensors='pt').cuda()
132
+ engine.speculative_generate(ids, max_new_tokens=20, temperature=0.0)
133
+ print(f" Warmup {i+1}/{warmup}")
134
+
135
+ results = []
136
+ total_drafted = 0
137
+ total_accepted = 0
138
+
139
+ for prompt in prompts:
140
+ input_ids = tokenizer.encode(prompt, return_tensors='pt').cuda()
141
+ prompt_len = input_ids.shape[1]
142
+
143
+ torch.cuda.synchronize()
144
+ t0 = time.perf_counter()
145
+
146
+ output = engine.speculative_generate(
147
+ input_ids, max_new_tokens=max_tokens, temperature=0.0,
148
+ draft_depth=draft_depth)
149
+
150
+ torch.cuda.synchronize()
151
+ elapsed = time.perf_counter() - t0
152
+
153
+ gen_len = output.shape[1] - prompt_len
154
+ tok_s = gen_len / elapsed
155
+
156
+ results.append({
157
+ 'prompt': prompt[:50],
158
+ 'gen_len': gen_len,
159
+ 'elapsed': elapsed,
160
+ 'tok_s': tok_s,
161
+ })
162
+ print(f" [{gen_len:3d} tok] {tok_s:6.1f} tok/s | {prompt[:50]}...")
163
+
164
+ avg_tok_s = sum(r['tok_s'] for r in results) / len(results)
165
+ avg_gen = sum(r['gen_len'] for r in results) / len(results)
166
+ print(f"\n Speculative avg: {avg_tok_s:.1f} tok/s, {avg_gen:.0f} tokens/prompt")
167
+ return avg_tok_s, results
168
+
169
+
170
+ def main():
171
+ parser = argparse.ArgumentParser()
172
+ parser.add_argument('--checkpoint', default='eagle_best.pt',
173
+ help='EAGLE checkpoint filename')
174
+ parser.add_argument('--max-tokens', type=int, default=100)
175
+ parser.add_argument('--warmup', type=int, default=3)
176
+ parser.add_argument('--depth', type=int, default=5)
177
+ parser.add_argument('--num_head_layers', type=int, default=2,
178
+ help='Number of layers in eagle head (D)')
179
+ args = parser.parse_args()
180
+
181
+ checkpoint_path = os.path.join(EAGLE_DIR, args.checkpoint)
182
+ if not os.path.exists(checkpoint_path):
183
+ print(f"ERROR: Checkpoint not found: {checkpoint_path}")
184
+ sys.exit(1)
185
+
186
+ # Load engine + tokenizer
187
+ engine, tokenizer = load_benchmark_engine()
188
+
189
+ # Benchmark standard
190
+ std_tok_s, std_results = benchmark_standard(
191
+ engine, tokenizer, TEST_PROMPTS,
192
+ max_tokens=args.max_tokens, warmup=args.warmup)
193
+
194
+ # Benchmark speculative
195
+ spec_tok_s, spec_results = benchmark_speculative(
196
+ engine, tokenizer, TEST_PROMPTS, checkpoint_path,
197
+ max_tokens=args.max_tokens, warmup=args.warmup,
198
+ draft_depth=args.depth,
199
+ num_head_layers=args.num_head_layers)
200
+
201
+ # Also try depth=3 (less wasted compute with low acceptance)
202
+ spec3_tok_s, _ = benchmark_speculative(
203
+ engine, tokenizer, TEST_PROMPTS, checkpoint_path,
204
+ max_tokens=args.max_tokens, warmup=1,
205
+ draft_depth=3,
206
+ num_head_layers=args.num_head_layers)
207
+
208
+ # Read checkpoint step for summary
209
+ ckpt_info = torch.load(checkpoint_path, weights_only=False, map_location='cpu')
210
+ ckpt_step = ckpt_info.get('step', '?')
211
+ del ckpt_info
212
+
213
+ # Summary
214
+ print("\n" + "=" * 60)
215
+ print("SUMMARY")
216
+ print("=" * 60)
217
+ print(f" Standard generate(): {std_tok_s:6.1f} tok/s")
218
+ print(f" Speculative (depth=5): {spec_tok_s:6.1f} tok/s "
219
+ f"({spec_tok_s/std_tok_s:.2f}x)")
220
+ print(f" Speculative (depth=3): {spec3_tok_s:6.1f} tok/s "
221
+ f"({spec3_tok_s/std_tok_s:.2f}x)")
222
+ print(f" Checkpoint: {args.checkpoint} (step {ckpt_step})")
223
+ print("=" * 60)
224
+
225
+ # VRAM
226
+ vram_gb = torch.cuda.max_memory_allocated() / 1e9
227
+ print(f" Peak VRAM: {vram_gb:.2f} GB")
228
+
229
+
230
+ if __name__ == '__main__':
231
+ main()
FireEcho Engine/benchmark_fullstack.py ADDED
@@ -0,0 +1,323 @@
1
+ #!/usr/bin/env python3
2
+ # =============================================================================
3
+ # Copyright (c) 2024-2026 Luis E. Davila Flores. All rights reserved.
4
+ #
5
+ # FireEcho Engine — High-Performance Inference Kernel
6
+ # Creator & Sole Author: Luis E. Davila Flores
7
+ #
8
+ # Licensed under Creative Commons Attribution-NonCommercial 4.0 International
9
+ # (CC BY-NC 4.0). You may share and adapt this work for non-commercial
10
+ # purposes with proper attribution. Full license terms:
11
+ # https://creativecommons.org/licenses/by-nc/4.0/
12
+ # =============================================================================
13
+ """
14
+ FireEcho Full-Stack Benchmark — Path B: Every Optimization Stacked
15
+ ===================================================================
16
+ Part of the FireEcho Engine — Custom inference kernel for NVIDIA Blackwell
17
+ Copyright (c) 2025-2026 Echo (FireEcho Project). All rights reserved.
18
+
19
+ Stacks ALL FireEcho architecture optimizations and benchmarks each layer:
20
+
21
+ Already in baseline:
22
+ - Goliath FP4 packed MoE (dequant-matmul Triton kernels)
23
+ - Fused SwiGLU+Down (1 kernel launch, not 3)
24
+ - FlashDecode attention (Triton online softmax)
25
+ - Flat KV cache (zero torch.cat, pre-allocated)
26
+
27
+ Layer 0: Baseline (all above) — current ~37 tok/s
28
+ Layer 1: + FP8 KV cache (half attention bandwidth)
29
+ Layer 2: + L2 prefetch (next layer pre-staged in L2 cache)
30
+ Layer 3: + Atlas Ban & Pick + MoDES (8→~5 experts + skip easy tokens)
31
+ Layer 4: + FE-XC cold expert demotion (5.3x faster 2-bit codebook kernel)
32
+ Layer 5: + CUDA Graph decode (zero Python overhead, single graph replay)
33
+
34
+ Target: 15.8ms → ~8ms base forward = 125+ tok/s (no speculation)
35
+
36
+ Usage:
37
+ PYTHONUNBUFFERED=1 python benchmark_fullstack.py
38
+ """
39
+
40
+ import sys, os, time, argparse, torch
41
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
42
+
43
+ from hebbian_finetune_demo import load_engine
44
+
45
+ MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
46
+
47
+ TEST_PROMPTS = [
48
+ "Explain the theory of general relativity in simple terms.",
49
+ "Write a Python function to find the longest palindromic substring.",
50
+ "What are the main differences between TCP and UDP protocols?",
51
+ "Describe the process of photosynthesis step by step.",
52
+ "What caused the fall of the Roman Empire?",
53
+ "How does a compiler optimize code?",
54
+ "Explain how public key cryptography works.",
55
+ "What is the difference between a stack and a queue?",
56
+ ]
57
+
58
+
59
+ def benchmark_generate(engine, tokenizer, prompts, max_tokens=100, warmup=3,
60
+ label="Standard"):
61
+ """Benchmark generate() with current engine config."""
62
+ print(f"\n{'=' * 60}")
63
+ print(f"Benchmark: {label}")
64
+ print(f"{'=' * 60}")
65
+
66
+ # Warmup (critical for Triton kernel compilation)
67
+ for i in range(warmup):
68
+ ids = tokenizer.encode(prompts[0], return_tensors='pt').cuda()
69
+ engine.generate(ids, max_new_tokens=20, temperature=0.0, top_k=0, top_p=1.0)
70
+ print(f" Warmup {i+1}/{warmup}")
71
+
72
+ results = []
73
+ for prompt in prompts:
74
+ input_ids = tokenizer.encode(prompt, return_tensors='pt').cuda()
75
+ prompt_len = input_ids.shape[1]
76
+
77
+ torch.cuda.synchronize()
78
+ t0 = time.perf_counter()
79
+
80
+ output = engine.generate(
81
+ input_ids, max_new_tokens=max_tokens, temperature=0.0,
82
+ top_k=0, top_p=1.0)
83
+
84
+ torch.cuda.synchronize()
85
+ elapsed = time.perf_counter() - t0
86
+
87
+ gen_len = output.shape[1] - prompt_len
88
+ tok_s = gen_len / elapsed
89
+
90
+ results.append({
91
+ 'prompt': prompt[:50],
92
+ 'gen_len': gen_len,
93
+ 'elapsed': elapsed,
94
+ 'tok_s': tok_s,
95
+ })
96
+ print(f" [{gen_len:3d} tok] {tok_s:6.1f} tok/s | {prompt[:50]}...")
97
+
98
+ avg_tok_s = sum(r['tok_s'] for r in results) / len(results)
99
+ avg_gen = sum(r['gen_len'] for r in results) / len(results)
100
+ print(f"\n >> {label}: {avg_tok_s:.1f} tok/s avg, {avg_gen:.0f} tokens/prompt")
101
+ return avg_tok_s
102
+
103
+
104
+ def main():
105
+ parser = argparse.ArgumentParser(description="FireEcho Full-Stack Benchmark")
106
+ parser.add_argument('--max-tokens', type=int, default=200)
107
+ parser.add_argument('--warmup', type=int, default=3)
108
+ parser.add_argument('--atlas-prompts', type=int, default=50,
109
+ help='Number of prompts for Atlas profiling')
110
+ parser.add_argument('--ban-ratio', type=float, default=0.25,
111
+ help='Atlas Ban & Pick: fraction of experts to ban')
112
+ parser.add_argument('--modes-threshold', type=float, default=2.0,
113
+ help='Atlas MoDES: multiplier on uniform baseline (2.0 = skip when max_prob < 2/128)')
114
+ parser.add_argument('--fexc-cold-pct', type=float, default=0.10,
115
+ help='FE-XC: fraction of experts to demote to 2-bit codebook')
116
+ parser.add_argument('--int2-cold-pct', type=float, default=0.05,
117
+ help='INT2: fraction of coldest experts to demote to 2-bit scalar')
118
+ args = parser.parse_args()
119
+
120
+ summary = {}
121
+
122
+ # =====================================================================
123
+ # Load engine — baseline config (Goliath FP4 + packed MoE + flat KV BF16)
124
+ # =====================================================================
125
+ print("=" * 60)
126
+ print("FireEcho Full-Stack Benchmark — Path B")
127
+ print("Stacking ALL optimizations, measuring each layer")
128
+ print("=" * 60)
129
+ print("\nLoading Qwen3-Omni engine...")
130
+
131
+ engine, tokenizer, config = load_engine(
132
+ MODEL_PATH, max_seq_len=4096, device="cuda",
133
+ )
134
+ engine.pack_all_experts()
135
+ engine.kv_cache.enable_flat_decode() # BF16 flat KV (baseline)
136
+ engine.eval()
137
+
138
+ # Suppress FE-MX tier updates during benchmarking (prints + overhead kill GPU perf)
139
+ # Set tier interval to effectively infinite so the modulo check never triggers
140
+ for layer in engine.layers:
141
+ if hasattr(layer, 'ffn'):
142
+ layer.ffn._quiet = True
143
+ layer.ffn.femx_tier_interval = 10_000_000 # Never trigger during benchmark
144
+
145
+ vram_base = torch.cuda.max_memory_allocated() / 1e9
146
+ print(f" Base VRAM: {vram_base:.2f} GB")
147
+
148
+ # =====================================================================
149
+ # Layer 0: Baseline
150
+ # =====================================================================
151
+ tok_s = benchmark_generate(engine, tokenizer, TEST_PROMPTS,
152
+ max_tokens=args.max_tokens, warmup=args.warmup,
153
+ label="Layer 0: Baseline (FP4 + packed MoE + flat KV BF16)")
154
+ summary['L0_baseline'] = tok_s
155
+
156
+ # =====================================================================
157
+ # Layer 1: FP8 KV cache
158
+ # =====================================================================
159
+ print("\n>> Enabling FP8 KV cache...")
160
+ engine.kv_cache.enable_flat_decode(kv_dtype='fp8')
161
+ print(" [FP8 KV] Enabled — 50% attention bandwidth reduction")
162
+
163
+ tok_s = benchmark_generate(engine, tokenizer, TEST_PROMPTS,
164
+ max_tokens=args.max_tokens, warmup=args.warmup,
165
+ label="Layer 1: + FP8 KV cache")
166
+ summary['L1_fp8_kv'] = tok_s
167
+
168
+ # =====================================================================
169
+ # Layer 2: L2 prefetch
170
+ # =====================================================================
171
+ print("\n>> Enabling L2 layer-ahead prefetch...")
172
+ engine.enable_l2_prefetch()
173
+
174
+ tok_s = benchmark_generate(engine, tokenizer, TEST_PROMPTS,
175
+ max_tokens=args.max_tokens, warmup=args.warmup,
176
+ label="Layer 2: + L2 prefetch")
177
+ summary['L2_l2_prefetch'] = tok_s
178
+
179
+ # =====================================================================
180
+ # Layer 3: Atlas Ban & Pick (requires profiling first)
181
+ # =====================================================================
182
+ print("\n>> Enabling Atlas the Gatekeeper (Ban & Pick)...")
183
+ engine.enable_atlas(ban_threshold=0.01, modes_threshold=args.modes_threshold)
184
+ engine.atlas_profile(tokenizer, num_prompts=args.atlas_prompts)
185
+ engine.atlas_ban(ban_ratio=args.ban_ratio)
186
+ engine.atlas_stats()
187
+
188
+ tok_s = benchmark_generate(engine, tokenizer, TEST_PROMPTS,
189
+ max_tokens=args.max_tokens, warmup=args.warmup,
190
+ label="Layer 3: + Atlas Ban & Pick (8→~5 experts)")
191
+ summary['L3_atlas_ban'] = tok_s
192
+
193
+ # =====================================================================
194
+ # Layer 4: FE-XC cold expert demotion
195
+ # =====================================================================
196
+ print("\n>> Enabling FE-XC cold expert demotion...")
197
+ engine.enable_auto_fexc_demotion(cold_threshold_pct=args.fexc_cold_pct)
198
+
199
+ # Build up expert usage statistics with enough tokens to establish cold/hot
200
+ # Need usage > femx_cold_threshold(50) for hot experts, so run ~1000 tokens
201
+ print(" Building expert usage statistics (8 prompts × 50 tokens)...")
202
+ for prompt in TEST_PROMPTS:
203
+ ids = tokenizer.encode(prompt, return_tensors='pt').cuda()
204
+ with torch.no_grad():
205
+ engine.generate(ids, max_new_tokens=50, temperature=0.0,
206
+ top_k=0, top_p=1.0)
207
+
208
+ # Trigger tier updates + FE-XC demotion on each MoE layer
209
+ # This may take a few seconds as codebooks are learned per-layer
210
+ print(" Triggering FE-XC demotion (learning codebooks)...")
211
+ fexc_count = 0
212
+ for layer in engine.layers:
213
+ if hasattr(layer.ffn, 'update_expert_tiers'):
214
+ layer.ffn.update_expert_tiers()
215
+ if hasattr(layer.ffn, '_expert_is_fexc'):
216
+ fexc_count += layer.ffn._expert_is_fexc.sum().item()
217
+ print(f" [FE-XC] {fexc_count} total experts demoted across all layers")
218
+
219
+ tok_s = benchmark_generate(engine, tokenizer, TEST_PROMPTS,
220
+ max_tokens=args.max_tokens, warmup=args.warmup,
221
+ label="Layer 4: + FE-XC cold experts (2-bit codebook)")
222
+ summary['L4_fexc'] = tok_s
223
+
224
+ # =====================================================================
225
+ # Layer 5: INT2 coldest expert demotion (three-way: FP4/FE-XC/INT2)
226
+ # =====================================================================
227
+ print("\n>> Enabling INT2 coldest expert demotion...")
228
+ engine.enable_auto_int2_demotion(cold_threshold_pct=args.int2_cold_pct)
229
+
230
+ # Trigger tier update to demote coldest experts to INT2
231
+ int2_count = 0
232
+ for layer in engine.layers:
233
+ if hasattr(layer.ffn, 'update_expert_tiers'):
234
+ layer.ffn.update_expert_tiers()
235
+ if hasattr(layer.ffn, '_expert_is_int2'):
236
+ int2_count += layer.ffn._expert_is_int2.sum().item()
237
+ print(f" [INT2] {int2_count} coldest experts demoted across all layers")
238
+
239
+ tok_s = benchmark_generate(engine, tokenizer, TEST_PROMPTS,
240
+ max_tokens=args.max_tokens, warmup=args.warmup,
241
+ label="Layer 5: + INT2 coldest experts (2-bit scalar)")
242
+ summary['L5_int2'] = tok_s
243
+
244
+ # =====================================================================
245
+ # Layer 6: CUDA Graph decode (captures entire 48-layer forward as one graph)
246
+ # Must be LAST — captures the current state of all optimizations
247
+ # =====================================================================
248
+ print("\n>> Enabling CUDA Graph decode...")
249
+ engine.enable_cuda_graph_decode(max_seq_len=4096)
250
+ print(" [CUDA Graph] Capturing full 48-layer decode as single graph replay")
251
+
252
+ tok_s = benchmark_generate(engine, tokenizer, TEST_PROMPTS,
253
+ max_tokens=args.max_tokens, warmup=args.warmup + 2,
254
+ label="Layer 6: + CUDA Graph (zero Python overhead)")
255
+ summary['L6_cuda_graph'] = tok_s
256
+
257
+ # =====================================================================
258
+ # SUMMARY
259
+ # =====================================================================
260
+ vram_final = torch.cuda.max_memory_allocated() / 1e9
261
+ final_key = 'L6_cuda_graph'
262
+
263
+ print("\n" + "=" * 70)
264
+ print("FIREECHO FULL-STACK BENCHMARK SUMMARY")
265
+ print("=" * 70)
266
+ print()
267
+ print(" Components already in baseline:")
268
+ print(" - Goliath FP4 packed MoE (Triton dequant-matmul)")
269
+ print(" - Fused SwiGLU+Down (1 kernel launch per expert)")
270
+ print(" - FlashDecode attention (Triton online softmax)")
271
+ print(" - Flat KV cache (zero torch.cat)")
272
+ print()
273
+ print(f" {'Layer':<55s} {'tok/s':>8s} {'vs base':>8s}")
274
+ print(f" {'-'*55} {'-'*8} {'-'*8}")
275
+
276
+ base = summary['L0_baseline']
277
+ display_order = [
278
+ ('L0_baseline', 'Baseline (Goliath FP4 + packed MoE + fused SwiGLU)'),
279
+ ('L1_fp8_kv', '+ FP8 KV cache (half attention bandwidth)'),
280
+ ('L2_l2_prefetch', '+ L2 layer-ahead prefetch'),
281
+ ('L3_atlas_ban', '+ Atlas Ban & Pick + MoDES (FE-AGK)'),
282
+ ('L4_fexc', '+ FE-XC cold expert demotion (2-bit codebook)'),
283
+ ('L5_int2', '+ INT2 coldest experts (2-bit scalar)'),
284
+ ('L6_cuda_graph', '+ CUDA Graph decode (zero Python overhead)'),
285
+ ]
286
+
287
+ for key, name in display_order:
288
+ val = summary[key]
289
+ speedup = val / base if base > 0 else 0
290
+ print(f" {name:<55s} {val:>7.1f} {speedup:>6.2f}x")
291
+
292
+ final = summary[final_key]
293
+ print(f"\n Base VRAM: {vram_base:.2f} GB")
294
+ print(f" Peak VRAM: {vram_final:.2f} GB")
295
+ print(f" Total speedup: {final / base:.2f}x over baseline")
296
+ print(f"\n Baseline forward: ~{1000/base:.1f}ms/token")
297
+ print(f" Full-stack forward: ~{1000/final:.1f}ms/token")
298
+ print(f"\n With 50% speculation acceptance: ~{final * 6 / 1:.0f} tok/s (est.)")
299
+ print(f" With 70% speculation acceptance: ~{final * 8 / 1:.0f} tok/s (est.)")
300
+ print("=" * 70)
301
+
302
+ # Save results
303
+ results_path = os.path.join(os.path.dirname(__file__), "fullstack_benchmark_results.txt")
304
+ with open(results_path, 'w') as f:
305
+ f.write("FireEcho Full-Stack Benchmark Results\n")
306
+ f.write(f"Date: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")
307
+ f.write(f"GPU: RTX 5090 32GB\n\n")
308
+ f.write("Components in baseline:\n")
309
+ f.write(" Goliath FP4 packed MoE, Fused SwiGLU+Down,\n")
310
+ f.write(" FlashDecode attention, Flat KV cache\n\n")
311
+ for key, name in display_order:
312
+ val = summary[key]
313
+ speedup = val / base
314
+ f.write(f"{name}: {val:.1f} tok/s ({speedup:.2f}x)\n")
315
+ f.write(f"\nBaseline: {base:.1f} tok/s\n")
316
+ f.write(f"Full-stack: {final:.1f} tok/s\n")
317
+ f.write(f"Speedup: {final/base:.2f}x\n")
318
+ f.write(f"Peak VRAM: {vram_final:.2f} GB\n")
319
+ print(f"\n Results saved to: {results_path}")
320
+
321
+
322
+ if __name__ == '__main__':
323
+ main()
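
The "50% / 70% speculation acceptance" estimates printed above scale the measured decode rate by fixed factors. For context, here is a minimal sketch of the usual expected-tokens-per-verification model for speculative decoding, assuming a draft length k and an i.i.d. per-token acceptance probability a; the x6/x8 factors in the script are its own rough targets, not outputs of this formula, and base_tok_s below is an illustrative placeholder.

    def expected_tokens_per_step(accept_prob: float, draft_len: int) -> float:
        """Expected tokens committed per target-model verification step.

        Geometric model: each drafted token is accepted independently with
        probability accept_prob, and one corrective token is always emitted.
        Closed form: (1 - a**(k+1)) / (1 - a) for a < 1, else k + 1.
        """
        a, k = accept_prob, draft_len
        if a >= 1.0:
            return k + 1
        return (1.0 - a ** (k + 1)) / (1.0 - a)

    if __name__ == "__main__":
        base_tok_s = 40.0  # illustrative placeholder for the measured decode rate
        for a in (0.5, 0.7):
            gain = expected_tokens_per_step(a, draft_len=7)
            print(f"acceptance={a:.0%}: ~{base_tok_s * gain:.0f} tok/s "
                  f"(x{gain:.2f}), ignoring draft-model overhead")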
FireEcho Engine/benchmark_perplexity.py ADDED
@@ -0,0 +1,358 @@
1
+ #!/usr/bin/env python3
2
+ """Perplexity benchmark for FireEcho quantization formats.
3
+
4
+ Evaluates WikiText-2 perplexity across quantization configs:
5
+ 1. FP4 baseline (Goliath FP4, all experts)
6
+ 2. FE-XC 10% cold (codebook 2-bit, plain k-means)
7
+ 3. FE-XVQ 10% cold (codebook 2-bit, Hessian-weighted k-means)
8
+ 4. INT2 10% cold (scalar 2-bit)
9
+
10
+ Each config runs in a SEPARATE SUBPROCESS to guarantee clean CUDA context
11
+ (PyTorch's memory allocator doesn't fully release between del+gc.collect).
12
+
13
+ Usage:
14
+ python benchmark_perplexity.py [--max_tokens 50000] [--stride 256]
15
+
16
+ Output: PPL comparison table suitable for paper.
17
+
18
+ Copyright (c) 2025-2026 Echo (FireEcho Project). All rights reserved.
19
+ """
20
+
21
+ import sys
22
+ import os
23
+ import time
24
+ import math
25
+ import json
26
+ import argparse
27
+ import subprocess
28
+ import tempfile
29
+
30
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
31
+
32
+ MODEL_DIR = '/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct'
33
+ FEXVQ_CODEBOOKS = os.path.join(os.path.dirname(os.path.abspath(__file__)),
34
+ 'fexvq_codebooks.pt')
35
+ SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
36
+
37
+
38
+ # ===== Worker code (runs in subprocess) =====
39
+
40
+ def run_single_config(config, max_tokens, stride, max_len, cold_pct, result_file):
41
+ """Run a single config evaluation. Called in subprocess."""
42
+ import torch
43
+ import torch.nn.functional as F
44
+
45
+ sys.path.insert(0, SCRIPT_DIR)
46
+
47
+ print(f"\n{'=' * 70}")
48
+ print(f" Config: {config.upper()}")
49
+ print(f"{'=' * 70}")
50
+
51
+ # Load model
52
+ from fireecho_kernel import FireEchoEngine
53
+ from transformers import AutoTokenizer
54
+
55
+ print("[1] Loading model...")
56
+ engine = FireEchoEngine.from_pretrained(MODEL_DIR)
57
+ engine.pack_all_experts()
58
+ engine.eval()
59
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, trust_remote_code=True)
60
+
61
+ # Load WikiText-2
62
+ from datasets import load_dataset
63
+ print(" Loading WikiText-2 test set...")
64
+ ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
65
+ text = "\n\n".join([t for t in ds["text"] if t.strip()])
66
+ print(f" Text length: {len(text):,} chars")
67
+ tokens = tokenizer.encode(text, add_special_tokens=False)
68
+ if max_tokens > 0 and len(tokens) > max_tokens:
69
+ tokens = tokens[:max_tokens]
70
+ print(f" Tokenized: {len(tokens):,} tokens")
71
+ token_ids = torch.tensor(tokens, dtype=torch.long)
72
+
73
+ # Warmup usage counters
74
+ warmup_prompts = [
75
+ "Explain how neural networks learn from data.",
76
+ "Write a Python function that sorts a list.",
77
+ "What are the main causes of climate change?",
78
+ "Describe the architecture of a transformer.",
79
+ "How does public key cryptography work?",
80
+ "What is the halting problem?",
81
+ "Explain quantum computing simply.",
82
+ "Write a recursive Fibonacci function.",
83
+ "What are the fundamental forces in physics?",
84
+ "How does the human immune system work?",
85
+ "Describe the process of photosynthesis.",
86
+ "What is the P vs NP problem?",
87
+ "How does GPS determine your location?",
88
+ "Explain machine learning overfitting.",
89
+ "What are design patterns in software?",
90
+ "How do search engines rank pages?",
91
+ "Describe the lifecycle of a star.",
92
+ "What is Shannon's information theory?",
93
+ "How do operating systems manage memory?",
94
+ "Explain the CAP theorem.",
95
+ ]
96
+ print(f" Warming up expert usage (20 prompts)...")
97
+ for prompt in warmup_prompts:
98
+ ids = tokenizer.encode(prompt, return_tensors='pt').cuda()
99
+ engine.reset_cache()
100
+ engine._current_seq_id = 0
101
+ engine.generate(ids, max_new_tokens=32, temperature=0.0)
102
+
103
+ ffn = engine.layers[0].ffn
104
+ if hasattr(ffn, 'expert_usage'):
105
+ usage = ffn.expert_usage
106
+ top5 = usage.topk(5)
107
+ bot5 = usage.topk(5, largest=False)
108
+ print(f" Layer 0 usage: top5={top5.values.tolist()}, bot5={bot5.values.tolist()}")
109
+
110
+ # Apply quantization config
111
+ if config == 'fp4':
112
+ print(" [FP4 baseline — no demotion]")
113
+ elif config == 'fexc':
114
+ engine.enable_auto_fexc_demotion(cold_threshold_pct=cold_pct)
115
+ total = 0
116
+ for layer in engine.layers:
117
+ layer.ffn._maybe_demote_to_fexc()
118
+ if hasattr(layer.ffn, '_expert_is_fexc'):
119
+ total += layer.ffn._expert_is_fexc.sum().item()
120
+ print(f" FE-XC demoted: {total} experts ({total // len(engine.layers)}/layer)")
121
+ elif config == 'fexvq':
122
+ if os.path.exists(FEXVQ_CODEBOOKS):
123
+ print(f" Loading pre-calibrated FE-XVQ codebooks...")
124
+ ckpt = torch.load(FEXVQ_CODEBOOKS, weights_only=True)
125
+ codebooks = ckpt['codebooks']
126
+ engine.enable_auto_fexc_demotion(cold_threshold_pct=cold_pct)
127
+ # Force init + inject Hessian-weighted codebooks BEFORE demotion
128
+ for li, layer in enumerate(engine.layers):
129
+ ffn_l = layer.ffn
130
+ if not getattr(ffn_l, '_fexc_enabled', False):
131
+ ffn_l._init_fexc_buffers()
132
+ if li in codebooks:
133
+ ffn_l.gu_codebooks = codebooks[li]['gate_up'].cuda().half()
134
+ ffn_l.dn_codebooks = codebooks[li]['down'].cuda().half()
135
+ total = 0
136
+ for layer in engine.layers:
137
+ layer.ffn._maybe_demote_to_fexc()
138
+ if hasattr(layer.ffn, '_expert_is_fexc'):
139
+ total += layer.ffn._expert_is_fexc.sum().item()
140
+ print(f" FE-XVQ demoted: {total} experts ({total // len(engine.layers)}/layer)")
141
+ else:
142
+ print(f" ERROR: No pre-calibrated codebooks at {FEXVQ_CODEBOOKS}")
143
+ json.dump({'error': 'no codebooks'}, open(result_file, 'w'))
144
+ return
145
+ elif config == 'int2':
146
+ engine.enable_auto_int2_demotion(cold_threshold_pct=cold_pct)
147
+ total = 0
148
+ for layer in engine.layers:
149
+ layer.ffn._maybe_demote_to_int2()
150
+ if hasattr(layer.ffn, '_expert_is_int2'):
151
+ total += layer.ffn._expert_is_int2.sum().item()
152
+ print(f" INT2 demoted: {total} experts ({total // len(engine.layers)}/layer)")
153
+
154
+ vram_gb = torch.cuda.memory_allocated() / 1e9
155
+ print(f" VRAM: {vram_gb:.1f} GB")
156
+
157
+ # Evaluate perplexity
158
+ print(f"\n Evaluating perplexity...")
159
+ t0 = time.time()
160
+
161
+ total_nll = 0.0
162
+ total_tokens = 0
163
+ num_windows = 0
164
+ seq_len = token_ids.shape[0]
165
+ num_windows_total = max(1, (seq_len - max_len) // stride + 1)
166
+
167
+ for begin in range(0, seq_len - 1, stride):
168
+ end = min(begin + max_len, seq_len)
169
+ input_ids = token_ids[begin:end].unsqueeze(0).cuda()
170
+
171
+ engine.reset_cache()
172
+ engine._current_seq_id = 0
173
+ if hasattr(engine.kv_cache, '_graph_mode'):
174
+ engine.kv_cache._graph_mode = False
175
+
176
+ with torch.no_grad():
177
+ logits = engine.forward(input_ids, use_cache=False)
178
+
179
+ shift_logits = logits[:, :-1, :].contiguous()
180
+ shift_labels = input_ids[:, 1:].contiguous()
181
+
182
+ if begin > 0:
183
+ overlap = max_len - stride
184
+ shift_logits = shift_logits[:, overlap:, :]
185
+ shift_labels = shift_labels[:, overlap:]
186
+
187
+ if shift_labels.numel() == 0:
188
+ continue
189
+
190
+ loss = F.cross_entropy(
191
+ shift_logits.view(-1, shift_logits.size(-1)),
192
+ shift_labels.view(-1),
193
+ reduction='sum'
194
+ )
195
+
196
+ total_nll += loss.item()
197
+ total_tokens += shift_labels.numel()
198
+ num_windows += 1
199
+
200
+ if num_windows % 20 == 0 or num_windows == 1:
201
+ elapsed = time.time() - t0
202
+ current_ppl = math.exp(total_nll / total_tokens)
203
+ tok_per_s = total_tokens / elapsed
204
+ print(f" Window {num_windows}/{num_windows_total}: "
205
+ f"PPL={current_ppl:.2f}, {total_tokens} tok, "
206
+ f"{tok_per_s:.0f} tok/s eval")
207
+
208
+ elapsed = time.time() - t0
209
+ ppl = math.exp(total_nll / total_tokens) if total_tokens > 0 else float('inf')
210
+ print(f" Final: PPL={ppl:.2f}, {total_tokens} tok, "
211
+ f"{num_windows} windows, {elapsed:.1f}s")
212
+
213
+ # Write result
214
+ result = {
215
+ 'config': config,
216
+ 'ppl': ppl,
217
+ 'tokens': total_tokens,
218
+ 'vram_gb': vram_gb,
219
+ 'time_s': elapsed,
220
+ }
221
+ with open(result_file, 'w') as f:
222
+ json.dump(result, f)
223
+
224
+
225
+ # ===== Main orchestrator =====
226
+
227
+ def main():
228
+ parser = argparse.ArgumentParser(description='FireEcho Perplexity Benchmark')
229
+ parser.add_argument('--max_tokens', type=int, default=50000,
230
+ help='Max tokens from WikiText-2 (default: 50000)')
231
+ parser.add_argument('--stride', type=int, default=256,
232
+ help='Sliding window stride (default: 256)')
233
+ parser.add_argument('--max_len', type=int, default=512,
234
+ help='Max context per window (default: 512)')
235
+ parser.add_argument('--configs', type=str, default='fp4,fexc,fexvq,int2',
236
+ help='Comma-separated configs to test (default: fp4,fexc,fexvq,int2)')
237
+ parser.add_argument('--cold_pct', type=float, default=0.10,
238
+ help='Fraction of experts to demote (default: 0.10)')
239
+ parser.add_argument('--_worker', type=str, default=None,
240
+ help=argparse.SUPPRESS) # Internal: run single config
241
+ parser.add_argument('--_result_file', type=str, default=None,
242
+ help=argparse.SUPPRESS)
243
+ args = parser.parse_args()
244
+
245
+ # Worker mode: run single config in subprocess
246
+ if args._worker:
247
+ run_single_config(args._worker, args.max_tokens, args.stride,
248
+ args.max_len, args.cold_pct, args._result_file)
249
+ return
250
+
251
+ # Orchestrator mode: spawn subprocesses
252
+ configs = [c.strip() for c in args.configs.split(',')]
253
+
254
+ print("=" * 70)
255
+ print(" FireEcho Perplexity Benchmark")
256
+ print(" WikiText-2 | Qwen3-Omni 30B MoE | RTX 5090")
257
+ print("=" * 70)
258
+ print(f" Max tokens: {args.max_tokens:,}")
259
+ print(f" Window: {args.max_len}, stride: {args.stride}")
260
+ print(f" Cold threshold: {args.cold_pct*100:.0f}%")
261
+ print(f" Configs: {configs}")
262
+ print(f" Subprocess isolation: enabled (clean CUDA context per config)")
263
+
264
+ results = {}
265
+ script_path = os.path.abspath(__file__)
266
+ python = sys.executable
267
+
268
+ for config in configs:
269
+ # Create temp file for result
270
+ fd, result_file = tempfile.mkstemp(suffix='.json', prefix=f'ppl_{config}_')
271
+ os.close(fd)
272
+
273
+ try:
274
+ cmd = [
275
+ python, '-u', script_path,
276
+ '--_worker', config,
277
+ '--_result_file', result_file,
278
+ '--max_tokens', str(args.max_tokens),
279
+ '--stride', str(args.stride),
280
+ '--max_len', str(args.max_len),
281
+ '--cold_pct', str(args.cold_pct),
282
+ ]
283
+ ret = subprocess.run(cmd, cwd=SCRIPT_DIR)
284
+
285
+ if ret.returncode != 0:
286
+ print(f"\n SUBPROCESS FAILED for {config.upper()} (exit code {ret.returncode})")
287
+ results[config] = {'error': f'exit code {ret.returncode}'}
288
+ continue
289
+
290
+ # Read result
291
+ with open(result_file) as f:
292
+ r = json.load(f)
293
+ if 'error' in r:
294
+ results[config] = r
295
+ else:
296
+ results[config] = r
297
+ print(f" >> {config.upper()}: PPL={r['ppl']:.2f}, "
298
+ f"VRAM={r['vram_gb']:.1f}G, {r['time_s']:.0f}s")
299
+
300
+ except Exception as e:
301
+ print(f"\n ERROR launching {config.upper()}: {e}")
302
+ results[config] = {'error': str(e)}
303
+ finally:
304
+ if os.path.exists(result_file):
305
+ os.unlink(result_file)
306
+
307
+ # === Results Table ===
308
+ print(f"\n{'=' * 70}")
309
+ print(f" RESULTS — WikiText-2 Perplexity")
310
+ print(f"{'=' * 70}")
311
+ print(f"\n{'Config':<12} {'PPL':>8} {'Δ PPL':>8} {'VRAM':>8} {'Tokens':>10} {'bits/w':>7} {'Time':>7}")
312
+ print(f"{'─' * 66}")
313
+
314
+ baseline_ppl = results.get('fp4', {}).get('ppl', None)
315
+ for config in configs:
316
+ if config not in results:
317
+ continue
318
+ r = results[config]
319
+ if r.get('error'):
320
+ print(f"{config.upper():<12} {'ERROR':>8} {'—':>8} {'—':>8} {'—':>10} {'—':>7} {'—':>7}")
321
+ continue
322
+ delta = f"+{r['ppl'] - baseline_ppl:.2f}" if baseline_ppl and config != 'fp4' else "—"
323
+ bits = {'fp4': '4.0', 'fexc': '~2.2', 'fexvq': '~2.2', 'int2': '2.0'}.get(config, '?')
324
+ time_s = f"{r.get('time_s', 0):.0f}s"
325
+ print(f"{config.upper():<12} {r['ppl']:>8.2f} {delta:>8} {r['vram_gb']:>7.1f}G "
326
+ f"{r['tokens']:>10,} {bits:>7} {time_s:>7}")
327
+
328
+ # Ablation analysis: FE-XC vs FE-XVQ
329
+ if (baseline_ppl and 'fexc' in results and 'fexvq' in results
330
+ and not results['fexc'].get('error') and not results['fexvq'].get('error')):
331
+ fexc_delta = results['fexc']['ppl'] - baseline_ppl
332
+ fexvq_delta = results['fexvq']['ppl'] - baseline_ppl
333
+ print(f"\n Ablation: Hessian-weighted codebooks (FE-XVQ vs FE-XC)")
334
+ print(f" FE-XC (plain k-means): +{fexc_delta:.2f} PPL")
335
+ print(f" FE-XVQ (Hessian-weighted): +{fexvq_delta:.2f} PPL")
336
+ if fexc_delta > 0:
337
+ hessian_gain = (1 - fexvq_delta / fexc_delta) * 100
338
+ print(f" Hessian reduces {hessian_gain:.0f}% of codebook PPL degradation")
339
+
340
+ # FE-XVQ vs INT2
341
+ if (baseline_ppl and 'fexvq' in results and 'int2' in results
342
+ and not results['fexvq'].get('error') and not results['int2'].get('error')):
343
+ fexvq_delta = results['fexvq']['ppl'] - baseline_ppl
344
+ int2_delta = results['int2']['ppl'] - baseline_ppl
345
+ if int2_delta > 0:
346
+ improvement = (1 - fexvq_delta / int2_delta) * 100
347
+ print(f"\n FE-XVQ recovers {improvement:.0f}% of INT2's PPL degradation")
348
+ print(f" (same 2-bit storage, codebook quality advantage)")
349
+
350
+ # Note about BF16
351
+ print(f"\n Note: BF16 baseline omitted — Qwen3-Omni 30B BF16 = ~61GB,")
352
+ print(f" exceeds RTX 5090 32GB. FP4 (Goliath) is practical baseline.")
353
+
354
+ print(f"\n{'=' * 70}")
355
+
356
+
357
+ if __name__ == '__main__':
358
+ main()
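
The worker above scores WikiText-2 with a strided sliding window: each window re-scores up to max_len tokens, only the last stride positions contribute to the negative log-likelihood, and PPL = exp(total_nll / total_tokens). A condensed, self-contained sketch of that bookkeeping, assuming logits_fn is any callable that returns [1, T, vocab] logits (in the script the engine plays that role):

    import math
    import torch
    import torch.nn.functional as F

    def sliding_window_ppl(logits_fn, token_ids, max_len=512, stride=256):
        """Strided perplexity: overlap positions are dropped so each token is scored once."""
        total_nll, total_tokens = 0.0, 0
        seq_len = token_ids.shape[0]
        for begin in range(0, seq_len - 1, stride):
            end = min(begin + max_len, seq_len)
            ids = token_ids[begin:end].unsqueeze(0)
            with torch.no_grad():
                logits = logits_fn(ids)                     # [1, T, vocab]
            shift_logits, shift_labels = logits[:, :-1, :], ids[:, 1:]
            if begin > 0:                                   # drop the re-scored overlap
                overlap = max_len - stride
                shift_logits = shift_logits[:, overlap:, :]
                shift_labels = shift_labels[:, overlap:]
            if shift_labels.numel() == 0:
                continue
            nll = F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                                  shift_labels.reshape(-1), reduction='sum')
            total_nll += nll.item()
            total_tokens += shift_labels.numel()
        return math.exp(total_nll / max(total_tokens, 1))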
FireEcho Engine/calibrate_fexc.py ADDED
@@ -0,0 +1,173 @@
1
+ #!/usr/bin/env python3
2
+ """FE-XC Offline Calibration — Learn codebooks for all 48 MoE layers.
3
+
4
+ Reads packed FP4 expert weights from a loaded FireEchoEngine, learns shared
5
+ codebooks per layer via residual k-means, then saves them to disk.
6
+
7
+ This is a one-time offline step (~2-5 minutes on GPU). The saved codebooks are
8
+ reused by enable_auto_fexc_demotion() during inference to demote cold experts.
9
+
10
+ Usage:
11
+ python calibrate_fexc.py [--output fexc_codebooks.pt] [--sample_experts 8] [--n_iters 20]
12
+
13
+ Copyright (c) 2025-2026 Echo (FireEcho Project). All rights reserved.
14
+ """
15
+
16
+ import sys
17
+ import os
18
+ import time
19
+ import argparse
20
+
21
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
22
+
23
+ import torch
24
+ from goliath_kernel import GoliathFP4Weights, GoliathFEXCWeights
25
+
26
+
27
+ def calibrate_layer_codebooks(packed_w, packed_s, packed_ts, shape_K, shape_N,
28
+ sample_experts=8, n_iters=20, total_experts=128):
29
+ """Learn shared codebooks for one projection type in one MoE layer.
30
+
31
+ Args:
32
+ packed_w: [E, K//2, N] uint8 — packed FP4 weights
33
+ packed_s: [E, ...] — block scales
34
+ packed_ts: [E] — tensor scales
35
+ shape_K, shape_N: original weight dimensions
36
+ sample_experts: number of experts to pool for k-means
37
+ n_iters: k-means iterations
38
+ total_experts: total number of experts in layer
39
+
40
+ Returns:
41
+ codebooks: [2, 256, 8] float16 — shared codebooks for this projection
42
+ """
43
+ n_sample = min(sample_experts, total_experts)
44
+ perm = torch.randperm(total_experts, device='cpu')[:n_sample]
45
+
46
+ # Dequantize sampled experts and collect weight groups
47
+ groups_list = []
48
+ for e_idx in perm:
49
+ fp4 = GoliathFP4Weights(
50
+ packed=packed_w[e_idx],
51
+ block_scales=packed_s[e_idx],
52
+ tensor_scale=packed_ts[e_idx].item(),
53
+ shape=(shape_K, shape_N),
54
+ )
55
+ w_float = fp4.to_float() # [K, N] on GPU
56
+ groups_list.append(w_float.view(-1, 8)) # [K*N/8, 8]
57
+
58
+ # Pool all groups
59
+ all_groups = torch.cat(groups_list, dim=0) # [n_sample * K*N/8, 8]
60
+
61
+ # Learn codebooks via GoliathFEXCWeights.from_float (residual k-means)
62
+ # NOTE: from_float takes a single [K, N] matrix, so the codebooks are learned from
63
+ # one reference expert; the pooled groups above are collected but not used here.
64
+ ref_expert = GoliathFP4Weights(
65
+ packed=packed_w[perm[0]],
66
+ block_scales=packed_s[perm[0]],
67
+ tensor_scale=packed_ts[perm[0]].item(),
68
+ shape=(shape_K, shape_N),
69
+ )
70
+ fexc = GoliathFEXCWeights.from_float(ref_expert.to_float(), n_iters=n_iters)
71
+ return fexc.codebooks # [2, 256, 8] float16
72
+
73
+
74
+ def main():
75
+ parser = argparse.ArgumentParser(description='FE-XC Codebook Calibration')
76
+ parser.add_argument('--output', type=str, default='fexc_codebooks.pt',
77
+ help='Output path for codebooks (default: fexc_codebooks.pt)')
78
+ parser.add_argument('--sample_experts', type=int, default=8,
79
+ help='Number of experts to sample per layer for k-means')
80
+ parser.add_argument('--n_iters', type=int, default=20,
81
+ help='K-means iterations')
82
+ parser.add_argument('--model_dir', type=str, default=None,
83
+ help='Model directory (default: auto-detect from config)')
84
+ args = parser.parse_args()
85
+
86
+ # Lazy import — heavy
87
+ from fireecho_kernel import FireEchoEngine
88
+
89
+ print("=" * 70)
90
+ print("FE-XC Codebook Calibration")
91
+ print("=" * 70)
92
+
93
+ # Load engine (FP4 quantized)
94
+ model_dir = args.model_dir
95
+ if model_dir is None:
96
+ # Default Qwen3-Omni path
97
+ model_dir = '/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct'
98
+
99
+ print(f"Loading model from: {model_dir}")
100
+ engine = FireEchoEngine.from_pretrained(model_dir)
101
+ engine.pack_all_experts()
102
+ print(f"Model loaded. {len(engine.layers)} layers.")
103
+
104
+ # Calibrate each MoE layer
105
+ codebooks = {} # layer_idx -> {'gate_up': [2,256,8], 'down': [2,256,8]}
106
+ total_layers = len(engine.layers)
107
+ t_start = time.time()
108
+
109
+ for layer_idx, layer in enumerate(engine.layers):
110
+ ffn = layer.ffn
111
+ if not hasattr(ffn, 'packed_gu_w'):
112
+ print(f" Layer {layer_idx}: skipping (not MoE or not packed)")
113
+ continue
114
+
115
+ K_gu = ffn.packed_gu_w.shape[1] * 2
116
+ N_gu = ffn.packed_gu_w.shape[2]
117
+ K_dn = ffn.packed_dn_w.shape[1] * 2
118
+ N_dn = ffn.packed_dn_w.shape[2]
119
+ n_experts = ffn.packed_gu_w.shape[0]
120
+
121
+ t0 = time.time()
122
+
123
+ # gate_up codebooks
124
+ gu_cb = calibrate_layer_codebooks(
125
+ ffn.packed_gu_w, ffn.packed_gu_s, ffn.packed_gu_ts,
126
+ K_gu, N_gu,
127
+ sample_experts=args.sample_experts,
128
+ n_iters=args.n_iters,
129
+ total_experts=n_experts)
130
+
131
+ # down codebooks
132
+ dn_cb = calibrate_layer_codebooks(
133
+ ffn.packed_dn_w, ffn.packed_dn_s, ffn.packed_dn_ts,
134
+ K_dn, N_dn,
135
+ sample_experts=args.sample_experts,
136
+ n_iters=args.n_iters,
137
+ total_experts=n_experts)
138
+
139
+ codebooks[layer_idx] = {
140
+ 'gate_up': gu_cb.cpu(),
141
+ 'down': dn_cb.cpu(),
142
+ }
143
+
144
+ elapsed = time.time() - t0
145
+ print(f" Layer {layer_idx}/{total_layers}: "
146
+ f"gate_up=[{K_gu}x{N_gu}] down=[{K_dn}x{N_dn}] "
147
+ f"— {elapsed:.1f}s")
148
+
149
+ total_time = time.time() - t_start
150
+ print(f"\nCalibration complete: {len(codebooks)} layers in {total_time:.1f}s")
151
+
152
+ # Save
153
+ output_path = args.output
154
+ if not os.path.isabs(output_path):
155
+ output_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),
156
+ output_path)
157
+ torch.save({
158
+ 'codebooks': codebooks,
159
+ 'config': {
160
+ 'sample_experts': args.sample_experts,
161
+ 'n_iters': args.n_iters,
162
+ 'n_centroids': 256,
163
+ 'group_size': 8,
164
+ 'num_codebooks': 2,
165
+ },
166
+ 'num_layers': len(codebooks),
167
+ }, output_path)
168
+ print(f"Saved codebooks to: {output_path}")
169
+ print(f"File size: {os.path.getsize(output_path) / 1024:.1f} KB")
170
+
171
+
172
+ if __name__ == '__main__':
173
+ main()
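
The config saved above (num_codebooks=2, n_centroids=256, group_size=8) means every group of 8 weights is stored as two 8-bit centroid indices, i.e. 16 / 8 = 2.0 bits per weight from the indices alone; the shared codebooks and scales push the effective footprint slightly above that. A minimal sketch of two-stage residual k-means over weight groups, assuming plain Lloyd iterations; the actual GoliathFEXCWeights.from_float may differ in initialization and tie-breaking:

    import torch

    def lloyd_kmeans(groups, k=256, n_iters=20):
        """Plain Lloyd k-means over rows of groups [N, G]; returns centroids [k, G]."""
        centroids = groups[torch.randperm(groups.shape[0])[:k]].clone()
        for _ in range(n_iters):
            assign = torch.cdist(groups, centroids).argmin(dim=1)       # [N]
            for c in range(k):
                mask = assign == c
                if mask.any():
                    centroids[c] = groups[mask].mean(dim=0)
        return centroids

    def residual_codebooks(weight, group_size=8, n_iters=20):
        """Two-stage residual VQ: stage 1 quantizes groups, stage 2 quantizes residuals."""
        groups = weight.reshape(-1, group_size).float()
        cb0 = lloyd_kmeans(groups, n_iters=n_iters)
        residual = groups - cb0[torch.cdist(groups, cb0).argmin(dim=1)]
        cb1 = lloyd_kmeans(residual, n_iters=n_iters)
        return torch.stack([cb0, cb1]).half()               # [2, 256, group_size]

Reconstructing a group is then codebook0[idx0] + codebook1[idx1], which is what makes the two 8-bit indices equivalent to 2 bits per weight.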
FireEcho Engine/calibrate_fexvq.py ADDED
@@ -0,0 +1,227 @@
1
+ #!/usr/bin/env python3
2
+ """FE-XVQ Calibration — Hessian-weighted codebook learning for all 48 MoE layers.
3
+
4
+ Runs calibration prompts through the model, collects Hessian diagonals
5
+ (input covariance) at each MoE layer, then learns Hessian-weighted codebooks
6
+ via GoliathFEXVQWeights. Saves codebooks to disk for later use.
7
+
8
+ This is a one-time offline step:
9
+ 1. Load model (~2 min)
10
+ 2. Run calibration prompts (~2-5 min for 50 prompts)
11
+ 3. Learn codebooks (~5-10 min on CPU)
12
+ 4. Save to fexvq_codebooks.pt
13
+
14
+ The codebooks can then be loaded by enable_auto_fexvq_demotion() during
15
+ inference to demote cold experts with Hessian-optimal quality.
16
+
17
+ Usage:
18
+ python calibrate_fexvq.py [--output fexvq_codebooks.pt] [--n_prompts 50]
19
+
20
+ Copyright (c) 2025-2026 Echo (FireEcho Project). All rights reserved.
21
+ """
22
+
23
+ import sys
24
+ import os
25
+ import time
26
+ import argparse
27
+
28
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
29
+
30
+ import torch
31
+ from goliath_kernel import GoliathFP4Weights, GoliathFEXVQWeights
32
+
33
+ # Calibration prompts — diverse to capture broad input distribution
34
+ CALIBRATION_PROMPTS = [
35
+ "Explain the theory of general relativity in simple terms.",
36
+ "Write a Python function to sort a list using quicksort.",
37
+ "What are the main causes of climate change?",
38
+ "Describe the process of photosynthesis step by step.",
39
+ "How does a neural network learn from data?",
40
+ "What is the difference between TCP and UDP protocols?",
41
+ "Explain quantum computing to a 10 year old.",
42
+ "Write a recursive function to compute Fibonacci numbers.",
43
+ "What were the main events of World War II?",
44
+ "How does the human immune system fight infections?",
45
+ "Describe the architecture of a modern CPU.",
46
+ "What is the significance of the Turing test?",
47
+ "Explain how blockchain technology works.",
48
+ "Write a Python class for a binary search tree.",
49
+ "What are the fundamental forces in physics?",
50
+ "How do vaccines work at the molecular level?",
51
+ "Describe the water cycle and its importance.",
52
+ "What is the P vs NP problem in computer science?",
53
+ "Explain the concept of entropy in thermodynamics.",
54
+ "How does natural language processing work?",
55
+ "What are the principles of object-oriented programming?",
56
+ "Describe the structure of DNA and how it replicates.",
57
+ "What is the significance of Euler's identity?",
58
+ "How do operating systems manage memory?",
59
+ "Explain the concept of dark matter and dark energy.",
60
+ "Write a function to find the shortest path in a graph.",
61
+ "What are the key differences between Python and C++?",
62
+ "How does the internet route packets between networks?",
63
+ "Explain the CAP theorem in distributed systems.",
64
+ "What is the role of mitochondria in cellular respiration?",
65
+ "Describe how a compiler transforms source code to machine code.",
66
+ "What are the main branches of mathematics?",
67
+ "How do electric vehicles work compared to combustion engines?",
68
+ "Explain the concept of recursion with examples.",
69
+ "What is CRISPR and how does it edit genes?",
70
+ "How does public key cryptography ensure security?",
71
+ "Describe the lifecycle of a star from birth to death.",
72
+ "What are design patterns in software engineering?",
73
+ "How does the human brain process visual information?",
74
+ "Explain the concept of containerization in DevOps.",
75
+ "What are the ethical considerations of artificial intelligence?",
76
+ "How do search engines rank web pages?",
77
+ "Describe the process of protein folding.",
78
+ "What is the halting problem and why is it important?",
79
+ "How does GPS determine your location?",
80
+ "Explain the concept of machine learning overfitting.",
81
+ "What are the properties of prime numbers?",
82
+ "How does a quantum computer differ from a classical computer?",
83
+ "Describe the architecture of a transformer neural network.",
84
+ "What is the significance of Shannon's information theory?",
85
+ ]
86
+
87
+
88
+ def main():
89
+ parser = argparse.ArgumentParser(description='FE-XVQ Hessian Codebook Calibration')
90
+ parser.add_argument('--output', type=str, default='fexvq_codebooks.pt',
91
+ help='Output path for codebooks (default: fexvq_codebooks.pt)')
92
+ parser.add_argument('--n_prompts', type=int, default=50,
93
+ help='Number of calibration prompts (default: 50)')
94
+ parser.add_argument('--max_tokens', type=int, default=32,
95
+ help='Max tokens per calibration prompt (default: 32)')
96
+ parser.add_argument('--n_iters', type=int, default=20,
97
+ help='K-means iterations (default: 20)')
98
+ parser.add_argument('--model_dir', type=str, default=None,
99
+ help='Model directory')
100
+ args = parser.parse_args()
101
+
102
+ from fireecho_kernel import FireEchoEngine
103
+
104
+ print("=" * 70)
105
+ print("FE-XVQ Hessian Codebook Calibration")
106
+ print("=" * 70)
107
+
108
+ # Load engine
109
+ model_dir = args.model_dir
110
+ if model_dir is None:
111
+ model_dir = '/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct'
112
+
113
+ print(f"Loading model from: {model_dir}")
114
+ engine = FireEchoEngine.from_pretrained(model_dir)
115
+ engine.pack_all_experts()
116
+ print(f"Model loaded. {len(engine.layers)} layers.")
117
+
118
+ # Enable Hessian collection
119
+ print(f"\n--- Phase 1: Collecting Hessian ({args.n_prompts} prompts) ---")
120
+ engine.enable_auto_fexvq_demotion(cold_threshold_pct=0.10)
121
+
122
+ # Tokenize and run calibration prompts
123
+ from transformers import AutoTokenizer
124
+ tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
125
+
126
+ prompts = CALIBRATION_PROMPTS[:args.n_prompts]
127
+ t_start = time.time()
128
+
129
+ for i, prompt in enumerate(prompts):
130
+ input_ids = tokenizer.encode(prompt, return_tensors='pt').cuda()
131
+ with torch.no_grad():
132
+ engine.generate(input_ids, max_new_tokens=args.max_tokens, temperature=0.0)
133
+ if (i + 1) % 10 == 0 or i == 0:
134
+ elapsed = time.time() - t_start
135
+ print(f" Prompt {i+1}/{len(prompts)} ({elapsed:.1f}s)")
136
+
137
+ calib_time = time.time() - t_start
138
+ print(f" Calibration done: {len(prompts)} prompts in {calib_time:.1f}s")
139
+
140
+ # Report Hessian stats
141
+ for li in [0, 1, len(engine.layers) - 1]:
142
+ ffn = engine.layers[li].ffn
143
+ h_gu, h_dn = ffn.get_hessian_diag()
144
+ if h_gu is not None:
145
+ print(f" Layer {li}: Hessian gu samples={ffn._hessian_samples_gu}, "
146
+ f"mean={h_gu.mean():.4f}, max/min ratio={h_gu.max()/h_gu.min().clamp(min=1e-10):.1f}")
147
+
148
+ # Learn Hessian-weighted codebooks for each layer
149
+ print(f"\n--- Phase 2: Learning Hessian-weighted codebooks ---")
150
+ codebooks = {}
151
+ t_start = time.time()
152
+
153
+ for layer_idx, layer in enumerate(engine.layers):
154
+ ffn = layer.ffn
155
+ if not hasattr(ffn, 'packed_gu_w'):
156
+ continue
157
+
158
+ goliath_K_gu = ffn.packed_gu_w.shape[1] * 2
159
+ goliath_N_gu = ffn.packed_gu_w.shape[2]
160
+ goliath_K_dn = ffn.packed_dn_w.shape[1] * 2
161
+ goliath_N_dn = ffn.packed_dn_w.shape[2]
162
+
163
+ h_gu, h_dn = ffn.get_hessian_diag()
164
+
165
+ t0 = time.time()
166
+
167
+ # gate_up codebooks (Hessian-weighted)
168
+ perm = torch.randperm(ffn.num_experts)[:1]
169
+ gu_ref = GoliathFEXVQWeights.from_float(
170
+ GoliathFP4Weights(
171
+ packed=ffn.packed_gu_w[perm[0]],
172
+ block_scales=ffn.packed_gu_s[perm[0]],
173
+ tensor_scale=ffn.packed_gu_ts[perm[0]].item(),
174
+ shape=(goliath_K_gu, goliath_N_gu),
175
+ ).to_float().T.contiguous().cpu(),
176
+ hessian_diag=h_gu.cpu() if h_gu is not None else None,
177
+ n_iters=args.n_iters)
178
+
179
+ # down codebooks (Hessian-weighted)
180
+ dn_ref = GoliathFEXVQWeights.from_float(
181
+ GoliathFP4Weights(
182
+ packed=ffn.packed_dn_w[perm[0]],
183
+ block_scales=ffn.packed_dn_s[perm[0]],
184
+ tensor_scale=ffn.packed_dn_ts[perm[0]].item(),
185
+ shape=(goliath_K_dn, goliath_N_dn),
186
+ ).to_float().T.contiguous().cpu(),
187
+ hessian_diag=h_dn.cpu() if h_dn is not None else None,
188
+ n_iters=args.n_iters)
189
+
190
+ codebooks[layer_idx] = {
191
+ 'gate_up': gu_ref.codebooks.cpu(),
192
+ 'down': dn_ref.codebooks.cpu(),
193
+ 'hessian_diag_gu': h_gu.cpu() if h_gu is not None else None,
194
+ 'hessian_diag_dn': h_dn.cpu() if h_dn is not None else None,
195
+ }
196
+
197
+ elapsed = time.time() - t0
198
+ if layer_idx % 8 == 0 or layer_idx == len(engine.layers) - 1:
199
+ print(f" Layer {layer_idx}/{len(engine.layers)}: {elapsed:.1f}s")
200
+
201
+ total_time = time.time() - t_start
202
+ print(f"\nCodebook learning complete: {len(codebooks)} layers in {total_time:.1f}s")
203
+
204
+ # Save
205
+ output_path = args.output
206
+ if not os.path.isabs(output_path):
207
+ output_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),
208
+ output_path)
209
+ torch.save({
210
+ 'codebooks': codebooks,
211
+ 'config': {
212
+ 'n_prompts': args.n_prompts,
213
+ 'max_tokens': args.max_tokens,
214
+ 'n_iters': args.n_iters,
215
+ 'n_centroids': 256,
216
+ 'group_size': 8,
217
+ 'num_codebooks': 2,
218
+ 'method': 'fexvq_hessian_weighted',
219
+ },
220
+ 'num_layers': len(codebooks),
221
+ }, output_path)
222
+ print(f"Saved codebooks to: {output_path}")
223
+ print(f"File size: {os.path.getsize(output_path) / 1024:.1f} KB")
224
+
225
+
226
+ if __name__ == '__main__':
227
+ main()
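
What separates FE-XVQ from FE-XC is the fitting objective: reconstruction error is weighted by the Hessian diagonal collected in Phase 1, so weights that multiply high-energy input channels are matched more tightly. A minimal sketch of one Hessian-weighted Lloyd step, assuming a per-element importance matrix h broadcast from that diagonal; the actual GoliathFEXVQWeights.from_float may arrange the weighting differently:

    import torch
    import torch.nn.functional as F

    def hessian_weighted_kmeans_step(groups, centroids, h):
        """One Lloyd step minimizing sum_ij h[i,j] * (groups[i,j] - c[assign(i),j])**2.

        groups:    [N, G] weight groups
        centroids: [K, G] current centroids
        h:         [N, G] per-element importance (e.g. the input-channel Hessian
                   diagonal broadcast over the weights that use that channel), h > 0
        """
        # Weighted assignment: d[i,k] = sum_j h[i,j] * (groups[i,j] - centroids[k,j])**2
        d = (h.unsqueeze(1) * (groups.unsqueeze(1) - centroids.unsqueeze(0)) ** 2).sum(-1)
        assign = d.argmin(dim=1)                                             # [N]
        # Weighted centroid update: c[k,j] = sum_i h[i,j]*groups[i,j] / sum_i h[i,j]
        one_hot = F.one_hot(assign, centroids.shape[0]).to(groups.dtype)     # [N, K]
        num = torch.einsum('nk,nj->kj', one_hot, h * groups)
        den = torch.einsum('nk,nj->kj', one_hot, h).clamp_min(1e-12)
        return num / den, assign

Empty clusters collapse toward zero in this sketch, and the [N, K, G] distance tensor should be chunked for large N; a production implementation would handle both.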
FireEcho Engine/csrc/cluster_launch.cpp ADDED
@@ -0,0 +1,53 @@
1
+ /**
2
+ * FireEcho Kernel - SM120 Cluster Launch Implementation
3
+ *
4
+ * Compile with:
5
+ * nvcc -shared -o libfireecho_cluster.so cluster_launch.cpp \
6
+ * -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcuda -lcudart \
7
+ * --compiler-options '-fPIC' -arch=sm_120
8
+ */
9
+
10
+ #include "cluster_launch.h"
11
+ #include <stdio.h>
12
+
13
+ namespace fireecho {
14
+
15
+ // Implementation of helper functions that need compilation
16
+
17
+ void print_cluster_info() {
18
+ if (!supports_clusters()) {
19
+ printf("Thread Block Clusters: NOT SUPPORTED\n");
20
+ return;
21
+ }
22
+
23
+ ClusterProperties props = get_cluster_properties();
24
+
25
+ printf("=== SM120 Thread Block Cluster Info ===\n");
26
+ printf("Max Cluster Size: %d\n", props.max_cluster_size);
27
+ printf("Max Blocks/SM: %d\n", props.max_blocks_per_sm);
28
+ printf("Shared Memory/Block: %d KB\n", props.shared_memory_per_block / 1024);
29
+ printf("Registers/Block: %d\n", props.registers_per_block);
30
+ printf("Distributed SMEM: %s\n", props.supports_dshem ? "YES" : "NO");
31
+ printf("========================================\n");
32
+ }
33
+
34
+ } // namespace fireecho
35
+
36
+ // Standalone test
37
+ #ifdef TEST_CLUSTER_LAUNCH
38
+ int main() {
39
+ // Initialize CUDA
40
+ cudaSetDevice(0);
41
+
42
+ fireecho::print_cluster_info();
43
+
44
+ if (fireecho::supports_clusters()) {
45
+ printf("\n✅ This GPU supports Thread Block Clusters!\n");
46
+ printf(" Max cluster size: %d CTAs\n", fireecho::get_max_cluster_size());
47
+ } else {
48
+ printf("\n❌ This GPU does NOT support Thread Block Clusters.\n");
49
+ }
50
+
51
+ return 0;
52
+ }
53
+ #endif
FireEcho Engine/csrc/cluster_launch.h ADDED
@@ -0,0 +1,194 @@
1
+ /**
2
+ * FireEcho Kernel - SM120 Thread Block Cluster Launcher
3
+ *
4
+ * Exposes true Thread Block Cluster APIs for Blackwell (SM 12.0)
5
+ * using the CUDA Driver API's cuLaunchKernelEx with cluster attributes.
6
+ *
7
+ * Requirements:
8
+ * - CUDA 12.8+ (for SM 12.0 support)
9
+ * - Triton 3.6.0+ compiled kernel (CUfunction)
10
+ * - Blackwell GPU (RTX 5090, SM 12.0)
11
+ *
12
+ * Features:
13
+ * - True hardware cluster launch (not just num_ctas hint)
14
+ * - Distributed Shared Memory (dSMEM) access
15
+ * - Cluster barriers for synchronization
16
+ */
17
+
18
+ #ifndef FIREECHO_CLUSTER_LAUNCH_H
19
+ #define FIREECHO_CLUSTER_LAUNCH_H
20
+
21
+ #include <cuda.h>
22
+ #include <cuda_runtime.h>
23
+ #include <stdexcept>
24
+ #include <string>
25
+
26
+ namespace fireecho {
27
+
28
+ /**
29
+ * Cluster configuration for SM120 kernels.
30
+ */
31
+ struct ClusterConfig {
32
+ int cluster_dim_x = 2; // Cluster size in X (typically 2 for 2-CTA MMA)
33
+ int cluster_dim_y = 1;
34
+ int cluster_dim_z = 1;
35
+ int max_registers = 240; // Cap for cluster occupancy
36
+ bool enable_dshem = true; // Enable distributed shared memory
37
+ };
38
+
39
+ /**
40
+ * Launch a Triton-compiled kernel with true SM120 cluster support.
41
+ *
42
+ * @param func The compiled CUfunction from Triton
43
+ * @param grid Grid dimensions (in clusters, not blocks)
44
+ * @param block Block dimensions
45
+ * @param args Kernel arguments
46
+ * @param config Cluster configuration
47
+ * @param stream CUDA stream (0 for default)
48
+ */
49
+ inline CUresult launch_with_cluster(
50
+ CUfunction func,
51
+ dim3 grid,
52
+ dim3 block,
53
+ void** args,
54
+ const ClusterConfig& config = ClusterConfig(),
55
+ CUstream stream = 0
56
+ ) {
57
+ // Set up cluster launch attributes for SM120
58
+ CUlaunchAttribute attrs[2];
59
+ int num_attrs = 0;
60
+
61
+ // Cluster dimension attribute
62
+ attrs[num_attrs].id = CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION;
63
+ attrs[num_attrs].value.clusterDim.x = config.cluster_dim_x;
64
+ attrs[num_attrs].value.clusterDim.y = config.cluster_dim_y;
65
+ attrs[num_attrs].value.clusterDim.z = config.cluster_dim_z;
66
+ num_attrs++;
67
+
68
+ // Cluster scheduling policy (optional, for better occupancy)
69
+ attrs[num_attrs].id = CU_LAUNCH_ATTRIBUTE_CLUSTER_SCHEDULING_POLICY_PREFERENCE;
70
+ attrs[num_attrs].value.clusterSchedulingPolicyPreference =
71
+ CU_CLUSTER_SCHEDULING_POLICY_SPREAD; // or CU_CLUSTER_SCHEDULING_POLICY_LOAD_BALANCING
72
+ num_attrs++;
73
+
74
+ // Configure the launch
75
+ CUlaunchConfig launch_config = {};
76
+ launch_config.gridDimX = grid.x;
77
+ launch_config.gridDimY = grid.y;
78
+ launch_config.gridDimZ = grid.z;
79
+ launch_config.blockDimX = block.x;
80
+ launch_config.blockDimY = block.y;
81
+ launch_config.blockDimZ = block.z;
82
+ launch_config.sharedMemBytes = 0; // Triton manages shared memory
83
+ launch_config.hStream = stream;
84
+ launch_config.attrs = attrs;
85
+ launch_config.numAttrs = num_attrs;
86
+
87
+ // Launch with cluster configuration
88
+ return cuLaunchKernelEx(&launch_config, func, args, nullptr);
89
+ }
90
+
91
+ /**
92
+ * Check if the current GPU supports Thread Block Clusters.
93
+ */
94
+ inline bool supports_clusters() {
95
+ int device;
96
+ cudaGetDevice(&device);
97
+
98
+ cudaDeviceProp props;
99
+ cudaGetDeviceProperties(&props, device);
100
+
101
+ // Clusters require SM 9.0+ (Hopper) or SM 12.0+ (Blackwell)
102
+ return props.major >= 9;
103
+ }
104
+
105
+ /**
106
+ * Get maximum cluster size for the current GPU.
107
+ */
108
+ inline int get_max_cluster_size() {
109
+ int device;
110
+ cudaGetDevice(&device);
111
+
112
+ int max_cluster_size = 1;
113
+ cudaDeviceGetAttribute(&max_cluster_size,
114
+ cudaDevAttrClusterLaunch, device);
115
+
116
+ return max_cluster_size;
117
+ }
118
+
119
+ /**
120
+ * Query cluster properties for SM120.
121
+ */
122
+ struct ClusterProperties {
123
+ int max_cluster_size;
124
+ int max_blocks_per_sm;
125
+ int shared_memory_per_block;
126
+ int registers_per_block;
127
+ bool supports_dshem;
128
+ };
129
+
130
+ inline ClusterProperties get_cluster_properties() {
131
+ ClusterProperties props = {};
132
+
133
+ int device;
134
+ cudaGetDevice(&device);
135
+
136
+ cudaDeviceProp dev_props;
137
+ cudaGetDeviceProperties(&dev_props, device);
138
+
139
+ props.max_cluster_size = get_max_cluster_size();
140
+ props.max_blocks_per_sm = dev_props.maxBlocksPerMultiProcessor;
141
+ props.shared_memory_per_block = dev_props.sharedMemPerBlock;
142
+ props.registers_per_block = dev_props.regsPerBlock;
143
+ props.supports_dshem = (dev_props.major >= 9); // SM 9.0+ has dSMEM
144
+
145
+ return props;
146
+ }
147
+
148
+ /**
149
+ * Python-compatible wrapper for cluster launch.
150
+ * Can be called from Python via ctypes or pybind11.
151
+ */
152
+ extern "C" {
153
+
154
+ int fireecho_launch_cluster(
155
+ void* func_ptr,
156
+ int grid_x, int grid_y, int grid_z,
157
+ int block_x, int block_y, int block_z,
158
+ void** args,
159
+ int cluster_x, int cluster_y, int cluster_z,
160
+ void* stream_ptr
161
+ ) {
162
+ CUfunction func = (CUfunction)func_ptr;
163
+ CUstream stream = (CUstream)stream_ptr;
164
+
165
+ ClusterConfig config;
166
+ config.cluster_dim_x = cluster_x;
167
+ config.cluster_dim_y = cluster_y;
168
+ config.cluster_dim_z = cluster_z;
169
+
170
+ CUresult result = launch_with_cluster(
171
+ func,
172
+ dim3(grid_x, grid_y, grid_z),
173
+ dim3(block_x, block_y, block_z),
174
+ args,
175
+ config,
176
+ stream
177
+ );
178
+
179
+ return (int)result;
180
+ }
181
+
182
+ int fireecho_supports_clusters() {
183
+ return supports_clusters() ? 1 : 0;
184
+ }
185
+
186
+ int fireecho_max_cluster_size() {
187
+ return get_max_cluster_size();
188
+ }
189
+
190
+ } // extern "C"
191
+
192
+ } // namespace fireecho
193
+
194
+ #endif // FIREECHO_CLUSTER_LAUNCH_H
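
The extern "C" wrappers above are the intended Python entry points. A minimal ctypes smoke test, assuming the shared library was built as libfireecho_cluster.so with the nvcc command quoted at the top of cluster_launch.cpp; the library path below is illustrative:

    import ctypes

    # Load the shared library built from cluster_launch.cpp (path is illustrative)
    lib = ctypes.CDLL("./libfireecho_cluster.so")

    lib.fireecho_supports_clusters.restype = ctypes.c_int
    lib.fireecho_max_cluster_size.restype = ctypes.c_int

    if lib.fireecho_supports_clusters():
        print(f"Thread block clusters supported; max cluster size: "
              f"{lib.fireecho_max_cluster_size()} CTAs")
    else:
        print("Thread block clusters not supported on this GPU")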
FireEcho Engine/csrc/dsmem_cluster.cuh ADDED
@@ -0,0 +1,344 @@
1
+ /**
2
+ * FireEcho Kernel - Distributed Shared Memory & Cluster Barriers
3
+ *
4
+ * Implements:
5
+ * 1. DSMEM via mapa PTX instruction
6
+ * 2. Cluster barriers via mbarrier PTX
7
+ * 3. Cooperative Groups cluster API
8
+ *
9
+ * Requirements:
10
+ * - CUDA 12.0+ (for Hopper cluster support)
11
+ * - CUDA 12.8+ (for Blackwell SM 12.0)
12
+ * - SM 9.0+ (Hopper) or SM 12.0+ (Blackwell)
13
+ */
14
+
15
+ #ifndef FIREECHO_DSMEM_CLUSTER_CUH
16
+ #define FIREECHO_DSMEM_CLUSTER_CUH
17
+
18
+ #include <cuda.h>
19
+ #include <cuda_runtime.h>
20
+ #include <cooperative_groups.h>
21
+ #include <cooperative_groups/memcpy_async.h>
22
+
23
+ namespace cg = cooperative_groups;
24
+
25
+ namespace fireecho {
26
+ namespace dsmem {
27
+
28
+ // ============================================================================
29
+ // 1. DISTRIBUTED SHARED MEMORY (DSMEM) via mapa PTX
30
+ // ============================================================================
31
+
32
+ /**
33
+ * Map a local shared memory address to a remote block's shared memory.
34
+ * Uses the mapa PTX instruction for cluster-wide SMEM access.
35
+ *
36
+ * @param local_smem_ptr Local shared memory pointer
37
+ * @param target_rank Target block rank within the cluster (0-indexed)
38
+ * @return Generic pointer accessible across cluster
39
+ */
40
+ __device__ __forceinline__ void* map_shared_to_rank(void* local_smem_ptr, int target_rank) {
41
+ // Raw mapa needs a 32-bit shared-space address, so "r"-constrained generic
+ // pointers do not compile on 64-bit targets. Delegate to the cooperative-groups
+ // cluster API, which lowers to mapa and handles the address-space conversion.
+ return cg::this_cluster().map_shared_rank(
+ static_cast<char*>(local_smem_ptr), target_rank);
51
+ }
52
+
53
+ /**
54
+ * Map shared memory using cooperative_groups (higher-level API).
55
+ * Preferred over raw PTX when available.
56
+ */
57
+ template<typename T>
58
+ __device__ __forceinline__ T* map_shared_rank_cg(T* local_ptr, int target_rank) {
59
+ auto cluster = cg::this_cluster();
60
+ return cluster.map_shared_rank(local_ptr, target_rank);
61
+ }
62
+
63
+ /**
64
+ * Get the current block's rank within the cluster.
65
+ */
66
+ __device__ __forceinline__ int get_cluster_rank() {
67
+ auto cluster = cg::this_cluster();
68
+ return cluster.block_rank();
69
+ }
70
+
71
+ /**
72
+ * Get the total number of blocks in the cluster.
73
+ */
74
+ __device__ __forceinline__ int get_cluster_size() {
75
+ auto cluster = cg::this_cluster();
76
+ return cluster.num_blocks();
77
+ }
78
+
79
+ // ============================================================================
80
+ // 2. CLUSTER BARRIERS via mbarrier PTX
81
+ // ============================================================================
82
+
83
+ /**
84
+ * Cluster-wide barrier object.
85
+ * Uses mbarrier for hardware-accelerated synchronization.
86
+ */
87
+ struct ClusterBarrier {
88
+ uint64_t barrier_state; // mbarrier state (64-bit)
89
+
90
+ /**
91
+ * Initialize the barrier for a given number of threads.
92
+ * Must be called by exactly one thread per cluster.
93
+ */
94
+ __device__ __forceinline__ void init(int expected_count) {
95
+ asm volatile(
96
+ "mbarrier.init.shared::cluster.b64 [%0], %1;"
97
+ :
98
+ : "r"(&barrier_state), "r"(expected_count)
99
+ : "memory"
100
+ );
101
+ }
102
+
103
+ /**
104
+ * Arrive at the barrier (signal completion).
105
+ * Returns the phase for try_wait.
106
+ */
107
+ __device__ __forceinline__ uint64_t arrive() {
108
+ uint64_t phase;
109
+ asm volatile(
110
+ "mbarrier.arrive.shared::cluster.b64 %0, [%1];"
111
+ : "=l"(phase)
112
+ : "r"(&barrier_state)
113
+ : "memory"
114
+ );
115
+ return phase;
116
+ }
117
+
118
+ /**
119
+ * Arrive and expect additional arrivals from remote blocks.
120
+ * Used when data is being sent to this block's SMEM.
121
+ */
122
+ __device__ __forceinline__ uint64_t arrive_expect_tx(int tx_count) {
123
+ uint64_t phase;
124
+ asm volatile(
125
+ "mbarrier.arrive.expect_tx.shared::cluster.b64 %0, [%1], %2;"
126
+ : "=l"(phase)
127
+ : "r"(&barrier_state), "r"(tx_count)
128
+ : "memory"
129
+ );
130
+ return phase;
131
+ }
132
+
133
+ /**
134
+ * Try to wait on the barrier (non-blocking check).
135
+ */
136
+ __device__ __forceinline__ bool try_wait(uint64_t phase) {
137
+ int complete;
138
+ asm volatile(
139
+ "{"
140
+ ".reg .pred P;"
141
+ "mbarrier.try_wait.shared::cluster.b64 P, [%1], %2;"
142
+ "selp.s32 %0, 1, 0, P;"
143
+ "}"
144
+ : "=r"(complete)
145
+ : "r"(&barrier_state), "l"(phase)
146
+ : "memory"
147
+ );
148
+ return complete != 0;
149
+ }
150
+
151
+ /**
152
+ * Wait on the barrier (blocking).
153
+ * Spins until all arrivals complete.
154
+ */
155
+ __device__ __forceinline__ void wait(uint64_t phase) {
156
+ while (!try_wait(phase)) {
157
+ // Yield to reduce power consumption while spinning
158
+ __nanosleep(100);
159
+ }
160
+ }
161
+ };
162
+
163
+ /**
164
+ * Simple cluster-wide synchronization.
165
+ * Synchronizes all threads across all blocks in the cluster.
166
+ */
167
+ __device__ __forceinline__ void cluster_sync() {
168
+ auto cluster = cg::this_cluster();
169
+ cluster.sync();
170
+ }
171
+
172
+ /**
173
+ * Cluster sync with memory fence.
174
+ * Ensures all DSMEM operations are visible.
175
+ */
176
+ __device__ __forceinline__ void cluster_sync_fence() {
177
+ // Memory fence at cluster scope
178
+ asm volatile("fence.acq_rel.cluster;");
179
+ cluster_sync();
180
+ asm volatile("fence.acq_rel.cluster;");
181
+ }
182
+
183
+ // ============================================================================
184
+ // 3. DSMEM DATA TRANSFER PRIMITIVES
185
+ // ============================================================================
186
+
187
+ /**
188
+ * Async copy from local SMEM to remote block's SMEM.
189
+ * Uses cp.async with cluster scope.
190
+ */
191
+ template<typename T, int SIZE>
192
+ __device__ __forceinline__ void async_copy_to_rank(
193
+ T* dst_smem, // Local destination pointer
194
+ T* src_smem, // Local source pointer
195
+ int target_rank // Target block rank
196
+ ) {
197
+ // Map source to target's address space
198
+ T* remote_dst = (T*)map_shared_to_rank(dst_smem, target_rank);
199
+
200
+ // Async copy with cluster scope
201
+ asm volatile(
202
+ "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [%0], [%1], %2;"
203
+ :
204
+ : "r"(remote_dst), "l"(src_smem), "r"(SIZE * sizeof(T))
205
+ : "memory"
206
+ );
207
+ }
208
+
209
+ /**
210
+ * Load from remote block's shared memory.
211
+ */
212
+ template<typename T>
213
+ __device__ __forceinline__ T load_remote_smem(T* local_smem, int target_rank) {
214
+ T* remote = (T*)map_shared_to_rank(local_smem, target_rank);
215
+ return *remote;
216
+ }
217
+
218
+ /**
219
+ * Store to remote block's shared memory.
220
+ */
221
+ template<typename T>
222
+ __device__ __forceinline__ void store_remote_smem(T* local_smem, T value, int target_rank) {
223
+ T* remote = (T*)map_shared_to_rank(local_smem, target_rank);
224
+ *remote = value;
225
+ }
226
+
227
+ /**
228
+ * Atomic add to remote block's shared memory.
229
+ */
230
+ template<typename T>
231
+ __device__ __forceinline__ T atomic_add_remote_smem(T* local_smem, T value, int target_rank) {
232
+ T* remote = (T*)map_shared_to_rank(local_smem, target_rank);
233
+ return atomicAdd(remote, value);
234
+ }
235
+
236
+ // ============================================================================
237
+ // 4. HIGH-LEVEL CLUSTER MATMUL PRIMITIVES
238
+ // ============================================================================
239
+
240
+ /**
241
+ * 2-CTA Cooperative Matrix Multiply using DSMEM.
242
+ *
243
+ * Block 0: Loads A tiles, shares via DSMEM
244
+ * Block 1: Loads B tiles, shares via DSMEM
245
+ * Both: Compute partial C, reduce via DSMEM
246
+ */
247
+ template<int BLOCK_M, int BLOCK_N, int BLOCK_K>
248
+ struct ClusterMatmul {
249
+ // Shared memory layout for 2-CTA cooperative multiply
250
+ struct SharedStorage {
251
+ __align__(128) float A_tile[BLOCK_M][BLOCK_K];
252
+ __align__(128) float B_tile[BLOCK_K][BLOCK_N];
253
+ __align__(128) float C_partial[BLOCK_M][BLOCK_N];
254
+ ClusterBarrier barrier;
255
+ };
256
+
257
+ __device__ static void compute(
258
+ SharedStorage& smem,
259
+ const float* A, const float* B, float* C,
260
+ int M, int N, int K
261
+ ) {
262
+ int rank = get_cluster_rank();
263
+ int tid = threadIdx.x;
264
+
265
+ // Initialize barrier (only rank 0, thread 0)
266
+ if (rank == 0 && tid == 0) {
267
+ smem.barrier.init(blockDim.x * 2); // expect one arrival per thread across both CTAs
268
+ }
269
+ cluster_sync();
270
+
271
+ // Each block loads different data
272
+ if (rank == 0) {
273
+ // Load A tile
274
+ // ... (tile loading logic)
275
+ } else {
276
+ // Load B tile
277
+ // ... (tile loading logic)
278
+ }
279
+
280
+ // Synchronize and share via DSMEM
281
+ uint64_t phase = smem.barrier.arrive();
282
+ smem.barrier.wait(phase);
283
+
284
+ // Access partner's data via DSMEM
285
+ auto partner_smem = (SharedStorage*)map_shared_to_rank(&smem, 1 - rank);
286
+
287
+ // Compute using both tiles
288
+ // ... (matrix multiply accumulate)
289
+
290
+ // Final reduction
291
+ cluster_sync_fence();
292
+ }
293
+ };
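// --- Illustrative host-side launch sketch (not part of this header; the kernel
// --- name and tile sizes are hypothetical, while the cluster-launch API is
// --- standard CUDA 12+): how a 2-CTA cluster kernel built on ClusterMatmul
// --- could be launched.
// cudaLaunchConfig_t cfg = {};
// cfg.gridDim  = dim3(num_tiles, 1, 1);                // must be a multiple of the cluster size
// cfg.blockDim = dim3(256, 1, 1);
// cudaLaunchAttribute attr;
// attr.id = cudaLaunchAttributeClusterDimension;
// attr.val.clusterDim.x = 2;                           // 2 CTAs per cluster
// attr.val.clusterDim.y = 1;
// attr.val.clusterDim.z = 1;
// cfg.attrs = &attr;
// cfg.numAttrs = 1;
// cudaLaunchKernelEx(&cfg, cluster_matmul_kernel, d_A, d_B, d_C, M, N, K);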
294
+
295
+ } // namespace dsmem
296
+
297
+ // ============================================================================
298
+ // 5. SUPER-CLUSTER FORWARD DECLARATIONS (Vera Rubin / NVL72+)
299
+ // ============================================================================
300
+
301
+ namespace supercluster {
302
+
303
+ /**
304
+ * Super-Cluster configuration for Vera Rubin NVL72/NVL144.
305
+ *
306
+ * Note: This is a forward-looking API. Full implementation requires:
307
+ * - Vera Rubin hardware (2H 2026)
308
+ * - CUDA 13.0+ with NVLink 6 support
309
+ * - GB200/GR200 NVL72 or NVL144 system
310
+ */
311
+ struct SuperClusterConfig {
312
+ int num_gpus = 72; // NVL72 default
313
+ int gpus_per_node = 8; // Grace-Rubin configuration
314
+ float nvlink_bandwidth_tb_s = 3.6f; // 3.6 TB/s per GPU (NVLink 6)
315
+ bool use_coherent_memory = true;
316
+ };
317
+
318
+ /**
319
+ * Placeholder for Super-Cluster initialization.
320
+ * Will use NCCL + NVLink 6 for rack-scale coherent memory.
321
+ */
322
+ inline void init_super_cluster(const SuperClusterConfig& config) {
323
+ // Vera Rubin: NVL72 acts as single coherent memory space
324
+ // Implementation pending hardware availability
325
+ (void)config;
326
+ }
327
+
328
+ /**
329
+ * Super-Cluster all-reduce (rack-scale).
330
+ * Leverages 3.6 TB/s NVLink 6 bandwidth.
331
+ */
332
+ template<typename T>
333
+ void all_reduce_super_cluster(T* data, size_t count) {
334
+ // Future: Direct NVLink 6 all-reduce without host involvement
335
+ // For now this is a no-op placeholder; an NCCL-based fallback would slot in here
336
+ (void)data;
337
+ (void)count;
338
+ }
339
+
340
+ } // namespace supercluster
341
+
342
+ } // namespace fireecho
343
+
344
+ #endif // FIREECHO_DSMEM_CLUSTER_CUH
FireEcho Engine/csrc/femx_bindings.cpp ADDED
@@ -0,0 +1,48 @@
1
+ // FE-MX CUDA Kernels — pybind11 bindings
2
+ // JIT-compiled via torch.utils.cpp_extension.load()
3
+
4
+ #include <torch/extension.h>
5
+
6
+ // Forward declarations from femx_kernels.cu
7
+ void femx_quantize_impl(
8
+ torch::Tensor master,
9
+ torch::Tensor tier,
10
+ torch::Tensor packed,
11
+ torch::Tensor scales,
12
+ bool stochastic,
13
+ int64_t seed
14
+ );
15
+
16
+ torch::Tensor femx_dequantize_impl(
17
+ torch::Tensor packed,
18
+ torch::Tensor scales,
19
+ torch::Tensor tier,
20
+ int64_t block_size
21
+ );
22
+
23
+ void femx_sync_impl(
24
+ torch::Tensor master,
25
+ torch::Tensor tier,
26
+ torch::Tensor packed,
27
+ torch::Tensor scales,
28
+ torch::Tensor fast_weight,
29
+ int64_t seed
30
+ );
31
+
32
+ PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
33
+ m.doc() = "FE-MX CUDA kernels: fused quantize/dequantize for Hebbian memory";
34
+ m.def("femx_quantize", &femx_quantize_impl,
35
+ "Quantize FP32 master to packed uint8 + E8M0 scales (stochastic rounding)",
36
+ py::arg("master"), py::arg("tier"),
37
+ py::arg("packed"), py::arg("scales"),
38
+ py::arg("stochastic"), py::arg("seed"));
39
+ m.def("femx_dequantize", &femx_dequantize_impl,
40
+ "Dequantize packed uint8 + E8M0 scales to FP32",
41
+ py::arg("packed"), py::arg("scales"),
42
+ py::arg("tier"), py::arg("block_size"));
43
+ m.def("femx_sync", &femx_sync_impl,
44
+ "Fused quantize + dequantize: master FP32 -> packed + BF16 fast_weight",
45
+ py::arg("master"), py::arg("tier"),
46
+ py::arg("packed"), py::arg("scales"),
47
+ py::arg("fast_weight"), py::arg("seed"));
48
+ }
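// --- Illustrative Python-side sketch (not part of this file; the module name,
// --- source paths, and tensor contents are assumptions, while the
// --- torch.utils.cpp_extension.load() API itself is standard PyTorch):
// femx = torch.utils.cpp_extension.load(
//     name="femx_kernels",
//     sources=["csrc/femx_bindings.cpp", "csrc/femx_kernels.cu"],
//     extra_cuda_cflags=["-O3"],
// )
// femx.femx_quantize(master, tier, packed, scales, True, 1234)   # outputs written in place
// fp32 = femx.femx_dequantize(packed, scales, tier, 32)          # returns [S, D] float32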
FireEcho Engine/csrc/femx_kernels.cu ADDED
@@ -0,0 +1,422 @@
1
+ // FE-MX CUDA Kernels — Fused quantize/dequantize for Hebbian memory
2
+ // Block Floating Point with E8M0 shared exponents, stochastic rounding,
3
+ // and age-adaptive precision tiers (FEMX4/FEMX6/FEMX8).
4
+ //
5
+ // JIT-compiled via torch.utils.cpp_extension.load()
6
+ //
7
+ // Kernel 1: femx_quantize_kernel — master FP32 → packed uint8 + E8M0 scales
8
+ // Kernel 2: femx_dequantize_kernel — packed uint8 + E8M0 scales → FP32
9
+ // Kernel 3: femx_sync_kernel — fused quantize + dequantize → BF16 writeback
10
+
11
+ #include <torch/extension.h>
12
+ #include <cuda_runtime.h>
13
+ #include <cuda_bf16.h>
14
+ #include <curand_kernel.h>
15
+ #include <math.h>
16
+
17
+ // ============================================================================
18
+ // Constants
19
+ // ============================================================================
20
+
21
+ // Tier mantissa bits: FEMX4=3, FEMX6=5, FEMX8=7
22
+ __constant__ int TIER_MBITS[3] = {3, 5, 7};
23
+
24
+ #define CUDA_CHECK(call) do { \
25
+ cudaError_t err = (call); \
26
+ TORCH_CHECK(err == cudaSuccess, "CUDA error: ", cudaGetErrorString(err)); \
27
+ } while(0)
28
+
29
+ // ============================================================================
30
+ // Device helpers
31
+ // ============================================================================
32
+
33
+ // Get mantissa bits and levels from tier
34
+ __device__ __forceinline__ void tier_params(int tier, int& mantissa_bits, int& levels) {
35
+ mantissa_bits = (tier == 0) ? 3 : (tier == 1) ? 5 : 7;
36
+ levels = 1 << mantissa_bits;
37
+ }
38
+
39
+ // Compute E8M0 shared exponent: ceil(log2(abs_max)) + 127
40
+ // Returns 0 for zero blocks.
41
+ __device__ __forceinline__ uint8_t compute_e8m0(float abs_max) {
42
+ if (abs_max == 0.0f) return 0;
43
+ int exp = (int)ceilf(log2f(abs_max)) + 127;
44
+ return (uint8_t)max(0, min(254, exp));
45
+ }
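// Worked example (comment only): abs_max = 3.0 → log2(3.0) ≈ 1.585 → ceil = 2
// → stored exponent = 2 + 127 = 129 → shared scale = 2^(129-127) = 4.0, so the
// block's normalized values land in [-0.75, 0.75] before mantissa scaling.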
46
+
47
+ // Warp-level max reduction (full warp, 32 threads)
48
+ __device__ __forceinline__ float warp_reduce_max(float val) {
49
+ #pragma unroll
50
+ for (int offset = 16; offset > 0; offset >>= 1) {
51
+ val = fmaxf(val, __shfl_down_sync(0xFFFFFFFF, val, offset));
52
+ }
53
+ return __shfl_sync(0xFFFFFFFF, val, 0); // broadcast from lane 0
54
+ }
55
+
56
+
57
+ // ============================================================================
58
+ // Kernel 1: Fused quantize (master FP32 → packed uint8 + E8M0 scales)
59
+ //
60
+ // Grid: (num_slots, 1, 1) — one CUDA block per memory slot
61
+ // Block: (256, 1, 1) — 8 warps of 32 threads
62
+ //
63
+ // Each warp processes one block of 32 elements at a time.
64
+ // With 96 blocks per slot (dim=3072, block_size=32) and 8 warps,
65
+ // each warp handles 12 iterations.
66
+ //
67
+ // Stochastic rounding uses Philox4_32_10 PRNG (one state per thread).
68
+ // ============================================================================
69
+ __global__ void femx_quantize_kernel(
70
+ const float* __restrict__ master, // [S, D] FP32 master copy
71
+ const uint8_t* __restrict__ tier, // [S] per-slot precision tier
72
+ uint8_t* __restrict__ packed, // [S, D] output packed uint8
73
+ uint8_t* __restrict__ scales, // [S, B] output E8M0 exponents
74
+ int num_slots,
75
+ int dim,
76
+ int block_size, // 32
77
+ int num_blocks, // dim / block_size
78
+ bool stochastic,
79
+ unsigned long long seed
80
+ ) {
81
+ int slot_idx = blockIdx.x;
82
+ if (slot_idx >= num_slots) return;
83
+
84
+ int tid = threadIdx.x;
85
+ int warp_id = tid / 32;
86
+ int lane_id = tid % 32;
87
+ int num_warps = blockDim.x / 32;
88
+
89
+ // Read tier for this slot (uniform across all threads in block)
90
+ int t = (int)tier[slot_idx];
91
+ int mantissa_bits, levels;
92
+ tier_params(t, mantissa_bits, levels);
93
+
94
+ // Init Philox PRNG per thread (only if stochastic)
95
+ curandStatePhilox4_32_10_t rng;
96
+ if (stochastic) {
97
+ curand_init(seed,
98
+ (unsigned long long)(slot_idx * blockDim.x + tid),
99
+ 0, &rng);
100
+ }
101
+
102
+ // Base pointers for this slot
103
+ const float* slot_m = master + (long long)slot_idx * dim;
104
+ uint8_t* slot_p = packed + (long long)slot_idx * dim;
105
+ uint8_t* slot_s = scales + (long long)slot_idx * num_blocks;
106
+
107
+ // Each warp handles blocks in strided order
108
+ for (int blk = warp_id; blk < num_blocks; blk += num_warps) {
109
+ int elem_off = blk * block_size + lane_id;
110
+
111
+ // 1. Load one element per lane
112
+ float val = (elem_off < dim) ? slot_m[elem_off] : 0.0f;
113
+ float abs_val = fabsf(val);
114
+
115
+ // 2. Warp-level reduction for block abs_max
116
+ float block_max = warp_reduce_max(abs_val);
117
+
118
+ // 3. E8M0 shared exponent
119
+ uint8_t e8m0 = compute_e8m0(block_max);
120
+
121
+ // 4. Normalize: val / 2^(e8m0 - 127)
122
+ float scale = exp2f((float)e8m0 - 127.0f);
123
+ scale = fmaxf(scale, 1e-38f); // avoid div-by-zero
124
+ float normalized = val / scale;
125
+
126
+ // 5. Scale to integer range
127
+ float scaled = normalized * (float)levels;
128
+
129
+ // 6. Round (stochastic or deterministic)
130
+ float rounded;
131
+ if (stochastic) {
132
+ float noise = curand_uniform(&rng);
133
+ rounded = floorf(scaled + noise);
134
+ } else {
135
+ rounded = roundf(scaled);
136
+ }
137
+
138
+ // 7. Clamp to representable range: [-levels, levels-1]
139
+ rounded = fmaxf((float)(-levels), fminf((float)(levels - 1), rounded));
140
+ int rounded_int = (int)rounded;
141
+
142
+ // 8. Sign-magnitude packing
143
+ uint8_t sign_bit = (rounded_int < 0) ? 1 : 0;
144
+ int abs_rounded = abs(rounded_int);
145
+ uint8_t mag = (uint8_t)min(abs_rounded, levels - 1);
146
+ uint8_t packed_val = (sign_bit << mantissa_bits) | mag;
147
+
148
+ // 9. Write packed element
149
+ if (elem_off < dim) {
150
+ slot_p[elem_off] = packed_val;
151
+ }
152
+
153
+ // 10. Lane 0 writes the shared exponent for this block
154
+ if (lane_id == 0) {
155
+ slot_s[blk] = e8m0;
156
+ }
157
+ }
158
+ }
159
+
160
+
161
+ // ============================================================================
162
+ // Kernel 2: Fused dequantize (packed uint8 + E8M0 → FP32)
163
+ //
164
+ // Grid: ceil(total_elements / 256)
165
+ // Block: 256 threads
166
+ //
167
+ // Simple element-parallel kernel. Each thread dequantizes one element.
168
+ // Bandwidth-bound — no shared memory or reductions needed.
169
+ // ============================================================================
170
+ __global__ void femx_dequantize_kernel(
171
+ const uint8_t* __restrict__ packed, // [S, D] packed mantissa+sign
172
+ const uint8_t* __restrict__ scales_buf, // [S, B] E8M0 shared exponents
173
+ const uint8_t* __restrict__ tier, // [S] per-slot tier
174
+ float* __restrict__ output, // [S, D] FP32 output
175
+ int num_slots,
176
+ int dim,
177
+ int block_size,
178
+ int num_blocks
179
+ ) {
180
+ int idx = blockIdx.x * blockDim.x + threadIdx.x;
181
+ int total = num_slots * dim;
182
+ if (idx >= total) return;
183
+
184
+ int slot_idx = idx / dim;
185
+ int elem_idx = idx % dim;
186
+ int blk_idx = elem_idx / block_size;
187
+
188
+ // Tier → mantissa bits
189
+ int t = (int)tier[slot_idx];
190
+ int mantissa_bits, levels;
191
+ tier_params(t, mantissa_bits, levels);
192
+ int mask = levels - 1;
193
+
194
+ // Unpack sign-magnitude
195
+ uint8_t p = packed[idx];
196
+ uint8_t sign = (p >> mantissa_bits) & 1;
197
+ float mag = (float)(p & mask);
198
+
199
+ // Reconstruct normalized value
200
+ float val = sign ? -mag : mag;
201
+ val /= (float)levels;
202
+
203
+ // Apply shared exponent
204
+ uint8_t e8m0 = scales_buf[(long long)slot_idx * num_blocks + blk_idx];
205
+ float scale = exp2f((float)e8m0 - 127.0f);
206
+
207
+ output[idx] = val * scale;
208
+ }
209
+
210
+
211
+ // ============================================================================
212
+ // Kernel 3: Fused sync (quantize master → packed + dequantize → BF16 writeback)
213
+ //
214
+ // Combines quantize and dequantize in a single pass:
215
+ // 1. Read master FP32
216
+ // 2. Quantize to packed uint8 (with stochastic rounding)
217
+ // 3. Immediately dequantize the quantized value (no memory round-trip)
218
+ // 4. Write BF16 to fast_weight
219
+ //
220
+ // This avoids a separate read pass over packed+scales, saving bandwidth.
221
+ // Same grid/block layout as the quantize kernel.
222
+ // ============================================================================
223
+ __global__ void femx_sync_kernel(
224
+ const float* __restrict__ master, // [S, D] FP32 input
225
+ const uint8_t* __restrict__ tier, // [S] per-slot tier
226
+ uint8_t* __restrict__ packed, // [S, D] packed output
227
+ uint8_t* __restrict__ scales, // [S, B] E8M0 output
228
+ __nv_bfloat16* __restrict__ fast_weight, // [S, D] BF16 output
229
+ int num_slots,
230
+ int dim,
231
+ int block_size,
232
+ int num_blocks,
233
+ unsigned long long seed
234
+ ) {
235
+ int slot_idx = blockIdx.x;
236
+ if (slot_idx >= num_slots) return;
237
+
238
+ int tid = threadIdx.x;
239
+ int warp_id = tid / 32;
240
+ int lane_id = tid % 32;
241
+ int num_warps = blockDim.x / 32;
242
+
243
+ // Tier params (uniform across block)
244
+ int t = (int)tier[slot_idx];
245
+ int mantissa_bits, levels;
246
+ tier_params(t, mantissa_bits, levels);
247
+ int mask = levels - 1;
248
+
249
+ // Init Philox PRNG (always stochastic for sync)
250
+ curandStatePhilox4_32_10_t rng;
251
+ curand_init(seed,
252
+ (unsigned long long)(slot_idx * blockDim.x + tid),
253
+ 0, &rng);
254
+
255
+ // Base pointers
256
+ const float* slot_m = master + (long long)slot_idx * dim;
257
+ uint8_t* slot_p = packed + (long long)slot_idx * dim;
258
+ uint8_t* slot_s = scales + (long long)slot_idx * num_blocks;
259
+ __nv_bfloat16* slot_fw = fast_weight + (long long)slot_idx * dim;
260
+
261
+ for (int blk = warp_id; blk < num_blocks; blk += num_warps) {
262
+ int elem_off = blk * block_size + lane_id;
263
+
264
+ // === QUANTIZE PASS ===
265
+
266
+ // 1. Load master
267
+ float val = (elem_off < dim) ? slot_m[elem_off] : 0.0f;
268
+ float abs_val = fabsf(val);
269
+
270
+ // 2. Block abs_max via warp reduction
271
+ float block_max = warp_reduce_max(abs_val);
272
+
273
+ // 3. E8M0
274
+ uint8_t e8m0 = compute_e8m0(block_max);
275
+ float scale = exp2f((float)e8m0 - 127.0f);
276
+ scale = fmaxf(scale, 1e-38f);
277
+
278
+ // 4. Normalize + quantize with SR
279
+ float normalized = val / scale;
280
+ float scaled = normalized * (float)levels;
281
+ float noise = curand_uniform(&rng);
282
+ float rounded = floorf(scaled + noise);
283
+ rounded = fmaxf((float)(-levels), fminf((float)(levels - 1), rounded));
284
+ int rounded_int = (int)rounded;
285
+
286
+ // 5. Pack sign-magnitude
287
+ uint8_t sign_bit = (rounded_int < 0) ? 1 : 0;
288
+ uint8_t mag = (uint8_t)min(abs(rounded_int), levels - 1);
289
+ uint8_t packed_val = (sign_bit << mantissa_bits) | mag;
290
+
291
+ // === DEQUANTIZE PASS (in-register, no memory round-trip) ===
292
+
293
+ // 6. Unpack what we just packed
294
+ float dq_mag = (float)(packed_val & mask);
295
+ float dq_val = sign_bit ? -dq_mag : dq_mag;
296
+ dq_val /= (float)levels;
297
+ float result = dq_val * scale; // same scale, still in register
298
+
299
+ // === WRITE ALL OUTPUTS ===
300
+ if (elem_off < dim) {
301
+ slot_p[elem_off] = packed_val;
302
+ slot_fw[elem_off] = __float2bfloat16(result);
303
+ }
304
+ if (lane_id == 0) {
305
+ slot_s[blk] = e8m0;
306
+ }
307
+ }
308
+ }
309
+
310
+
311
+ // ============================================================================
312
+ // Host wrapper functions (called from pybind11 bindings)
313
+ // ============================================================================
314
+
315
+ void femx_quantize_impl(
316
+ torch::Tensor master, // [S, D] float32 CUDA
317
+ torch::Tensor tier, // [S] uint8 CUDA
318
+ torch::Tensor packed, // [S, D] uint8 CUDA (output, pre-allocated)
319
+ torch::Tensor scales, // [S, B] uint8 CUDA (output, pre-allocated)
320
+ bool stochastic,
321
+ int64_t seed
322
+ ) {
323
+ TORCH_CHECK(master.is_cuda(), "master must be on CUDA");
324
+ TORCH_CHECK(master.dtype() == torch::kFloat32, "master must be float32");
325
+ TORCH_CHECK(tier.dtype() == torch::kUInt8, "tier must be uint8");
326
+ TORCH_CHECK(packed.dtype() == torch::kUInt8, "packed must be uint8");
327
+ TORCH_CHECK(scales.dtype() == torch::kUInt8, "scales must be uint8");
328
+
329
+ master = master.contiguous();
330
+ tier = tier.contiguous();
331
+
332
+ int num_slots = master.size(0);
333
+ int dim = master.size(1);
334
+ int num_blocks = scales.size(1);
335
+ int block_size = dim / num_blocks;
336
+
337
+ TORCH_CHECK(dim % block_size == 0, "dim must be divisible by block_size");
338
+
339
+ int threads = 256;
340
+ femx_quantize_kernel<<<num_slots, threads>>>(
341
+ master.data_ptr<float>(),
342
+ tier.data_ptr<uint8_t>(),
343
+ packed.data_ptr<uint8_t>(),
344
+ scales.data_ptr<uint8_t>(),
345
+ num_slots, dim, block_size, num_blocks,
346
+ stochastic, (unsigned long long)seed
347
+ );
348
+ }
349
+
350
+
351
+ torch::Tensor femx_dequantize_impl(
352
+ torch::Tensor packed, // [S, D] uint8 CUDA
353
+ torch::Tensor scales, // [S, B] uint8 CUDA
354
+ torch::Tensor tier, // [S] uint8 CUDA
355
+ int64_t block_size
356
+ ) {
357
+ TORCH_CHECK(packed.is_cuda(), "packed must be on CUDA");
358
+ TORCH_CHECK(packed.dtype() == torch::kUInt8, "packed must be uint8");
359
+ TORCH_CHECK(scales.dtype() == torch::kUInt8, "scales must be uint8");
360
+ TORCH_CHECK(tier.dtype() == torch::kUInt8, "tier must be uint8");
361
+
362
+ packed = packed.contiguous();
363
+ scales = scales.contiguous();
364
+ tier = tier.contiguous();
365
+
366
+ int num_slots = packed.size(0);
367
+ int dim = packed.size(1);
368
+ int num_blocks = dim / block_size;
369
+
370
+ auto output = torch::empty({num_slots, dim},
371
+ torch::TensorOptions().dtype(torch::kFloat32).device(packed.device()));
372
+
373
+ int total = num_slots * dim;
374
+ int threads = 256;
375
+ int blocks = (total + threads - 1) / threads;
376
+
377
+ femx_dequantize_kernel<<<blocks, threads>>>(
378
+ packed.data_ptr<uint8_t>(),
379
+ scales.data_ptr<uint8_t>(),
380
+ tier.data_ptr<uint8_t>(),
381
+ output.data_ptr<float>(),
382
+ num_slots, dim, block_size, num_blocks
383
+ );
384
+
385
+ return output;
386
+ }
387
+
388
+
389
+ void femx_sync_impl(
390
+ torch::Tensor master, // [S, D] float32 CUDA
391
+ torch::Tensor tier, // [S] uint8 CUDA
392
+ torch::Tensor packed, // [S, D] uint8 CUDA (output)
393
+ torch::Tensor scales, // [S, B] uint8 CUDA (output)
394
+ torch::Tensor fast_weight, // [S, D] bfloat16 CUDA (output)
395
+ int64_t seed
396
+ ) {
397
+ TORCH_CHECK(master.is_cuda(), "master must be on CUDA");
398
+ TORCH_CHECK(master.dtype() == torch::kFloat32, "master must be float32");
399
+ TORCH_CHECK(fast_weight.dtype() == torch::kBFloat16, "fast_weight must be bfloat16");
400
+ TORCH_CHECK(tier.dtype() == torch::kUInt8, "tier must be uint8");
401
+ TORCH_CHECK(packed.dtype() == torch::kUInt8, "packed must be uint8");
402
+ TORCH_CHECK(scales.dtype() == torch::kUInt8, "scales must be uint8");
403
+
404
+ master = master.contiguous();
405
+ tier = tier.contiguous();
406
+
407
+ int num_slots = master.size(0);
408
+ int dim = master.size(1);
409
+ int num_blocks = scales.size(1);
410
+ int block_size = dim / num_blocks;
411
+
412
+ int threads = 256;
413
+ femx_sync_kernel<<<num_slots, threads>>>(
414
+ master.data_ptr<float>(),
415
+ tier.data_ptr<uint8_t>(),
416
+ packed.data_ptr<uint8_t>(),
417
+ scales.data_ptr<uint8_t>(),
418
+ reinterpret_cast<__nv_bfloat16*>(fast_weight.data_ptr<at::BFloat16>()),
419
+ num_slots, dim, block_size, num_blocks,
420
+ (unsigned long long)seed
421
+ );
422
+ }
FireEcho Engine/csrc/fireecho_preproc.cpp ADDED
@@ -0,0 +1,54 @@
1
+ // FireEcho Preprocessing — pybind11 bindings (SpeechLib-matched)
2
+ // JIT-compiled via torch.utils.cpp_extension.load()
3
+
4
+ #include <torch/extension.h>
5
+
6
+ // Forward declarations from fireecho_preproc_cuda.cu
7
+ torch::Tensor cuda_stft_impl(
8
+ torch::Tensor audio,
9
+ torch::Tensor window,
10
+ int64_t n_fft,
11
+ int64_t win_length,
12
+ int64_t hop_length,
13
+ double preemph_coeff
14
+ );
15
+
16
+ torch::Tensor cuda_mel_filterbank_impl(
17
+ torch::Tensor power_spec,
18
+ torch::Tensor mel_matrix
19
+ );
20
+
21
+ torch::Tensor cuda_audio_pipeline_impl(
22
+ torch::Tensor audio,
23
+ torch::Tensor window,
24
+ torch::Tensor mel_matrix,
25
+ int64_t n_fft,
26
+ int64_t win_length,
27
+ int64_t hop_length,
28
+ double preemph_coeff
29
+ );
30
+
31
+ torch::Tensor cuda_image_preprocess_impl(
32
+ torch::Tensor image_rgb,
33
+ int64_t crop_size
34
+ );
35
+
36
+ PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
37
+ m.doc() = "FireEcho CUDA-accelerated preprocessing (Phase 5, SpeechLib-matched)";
38
+ m.def("cuda_stft", &cuda_stft_impl,
39
+ "Batched STFT with per-frame pre-emphasis + 32768 scaling via cuFFT",
40
+ py::arg("audio"), py::arg("window"),
41
+ py::arg("n_fft"), py::arg("win_length"), py::arg("hop_length"),
42
+ py::arg("preemph_coeff") = 0.97);
43
+ m.def("cuda_mel_filterbank", &cuda_mel_filterbank_impl,
44
+ "Mel filterbank with pre-computed SpeechLib matrix + fused clip+log",
45
+ py::arg("power_spec"), py::arg("mel_matrix"));
46
+ m.def("cuda_audio_pipeline", &cuda_audio_pipeline_impl,
47
+ "Full audio pipeline: STFT + mel in single call",
48
+ py::arg("audio"), py::arg("window"), py::arg("mel_matrix"),
49
+ py::arg("n_fft"), py::arg("win_length"), py::arg("hop_length"),
50
+ py::arg("preemph_coeff") = 0.97);
51
+ m.def("cuda_image_preprocess", &cuda_image_preprocess_impl,
52
+ "Fused bicubic resize + normalize [-1,1] + bf16",
53
+ py::arg("image_rgb"), py::arg("crop_size"));
54
+ }
FireEcho Engine/csrc/fireecho_preproc_cuda.cu ADDED
@@ -0,0 +1,316 @@
1
+ // FireEcho Preprocessing CUDA Kernels — Phase 5 (SpeechLib-matched)
2
+ // Accelerated audio STFT, mel filterbank, and image preprocessing
3
+ // JIT-compiled via torch.utils.cpp_extension
4
+ //
5
+ // Audio pipeline exactly replicates Phi-4 processing_phi4mm.py:
6
+ // Per-frame pre-emphasis with roll (prev[0]=frame[0]) + scale 32768
7
+ // Hamming window → cuFFT R2C → |z|^2 → mel matmul → clip(1.0) → ln()
8
+
9
+ #include <torch/extension.h>
10
+ #include <cuda_runtime.h>
11
+ #include <cufft.h>
12
+ #include <math.h>
13
+
14
+ // ============================================================================
15
+ // CUDA error checking
16
+ // ============================================================================
17
+ #define CUDA_CHECK(call) do { \
18
+ cudaError_t err = (call); \
19
+ TORCH_CHECK(err == cudaSuccess, "CUDA error: ", cudaGetErrorString(err)); \
20
+ } while(0)
21
+
22
+ #define CUFFT_CHECK(call) do { \
23
+ cufftResult err = (call); \
24
+ TORCH_CHECK(err == CUFFT_SUCCESS, "cuFFT error: ", (int)err); \
25
+ } while(0)
26
+
27
+ // ============================================================================
28
+ // Kernel 1: Frame extraction + per-frame pre-emphasis + scaling + windowing
29
+ //
30
+ // Matches SpeechLib / Phi-4 processing_phi4mm.py exactly:
31
+ // prev[0] = frame[0] (NOT zero — SpeechLib sets prev[:,0] = prev[:,1])
32
+ // prev[i] = frame[i-1] for i > 0
33
+ // output[i] = (frame[i] - coeff * prev[i]) * 32768.0 * window[i]
34
+ //
35
+ // Each thread-block handles one frame using shared memory.
36
+ // ============================================================================
37
+ __global__ void frame_extract_preemph_kernel(
38
+ const float* __restrict__ audio, // [N] raw samples
39
+ const float* __restrict__ window, // [win_length]
40
+ float* __restrict__ frames, // [num_frames, n_fft] output
41
+ int N,
42
+ int n_fft,
43
+ int win_length,
44
+ int hop_length,
45
+ int num_frames,
46
+ float preemph_coeff // 0.97
47
+ ) {
48
+ extern __shared__ float sframe[]; // [win_length] shared per block
49
+ int frame_idx = blockIdx.x;
50
+ if (frame_idx >= num_frames) return;
51
+
52
+ int start = frame_idx * hop_length;
53
+
54
+ // Phase 1: Load raw samples into shared memory
55
+ for (int i = threadIdx.x; i < win_length; i += blockDim.x) {
56
+ int sample_idx = start + i;
57
+ sframe[i] = (sample_idx < N) ? audio[sample_idx] : 0.0f;
58
+ }
59
+ __syncthreads();
60
+
61
+ // Phase 2: Per-frame pre-emphasis + 32768 scaling + windowing + zero-pad
62
+ for (int i = threadIdx.x; i < n_fft; i += blockDim.x) {
63
+ float val = 0.0f;
64
+ if (i < win_length) {
65
+ float curr = sframe[i];
66
+ // SpeechLib: prev[0] = frame[0], prev[i] = frame[i-1] for i > 0
67
+ float prev = (i > 0) ? sframe[i - 1] : curr;
68
+ val = (curr - preemph_coeff * prev) * 32768.0f * window[i];
69
+ }
70
+ // Beyond win_length: val stays 0.0 (zero-pad to n_fft)
71
+ frames[frame_idx * n_fft + i] = val;
72
+ }
73
+ }
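// Worked example (comment only): with preemph_coeff = 0.97, the first sample of a
// frame uses itself as "prev", so
//   output[0] = (frame[0] - 0.97 * frame[0]) * 32768 * window[0] = 0.03 * frame[0] * 32768 * window[0],
// while every later sample uses the ordinary first difference
//   output[i] = (frame[i] - 0.97 * frame[i-1]) * 32768 * window[i],
// matching the SpeechLib roll-based convention described above.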
74
+
75
+ // ============================================================================
76
+ // Kernel 2: Power spectrum from complex FFT output
77
+ // |z|^2 = re^2 + im^2
78
+ // ============================================================================
79
+ __global__ void power_spectrum_kernel(
80
+ const cufftComplex* __restrict__ spectrum, // [num_frames, n_fft/2+1]
81
+ float* __restrict__ power, // [num_frames, n_fft/2+1]
82
+ int total_elements
83
+ ) {
84
+ int idx = blockIdx.x * blockDim.x + threadIdx.x;
85
+ if (idx >= total_elements) return;
86
+
87
+ float re = spectrum[idx].x;
88
+ float im = spectrum[idx].y;
89
+ power[idx] = re * re + im * im;
90
+ }
91
+
92
+ // ============================================================================
93
+ // Kernel 3: Fused clip(1.0) + natural log
94
+ // Applied element-wise to mel-filtered power spectrum
95
+ // Matches: np.log(np.clip(spec_power.dot(mel_matrix), 1.0, None))
96
+ // ============================================================================
97
+ __global__ void clip_log_kernel(
98
+ float* __restrict__ data, // [T * n_mels] in-place
99
+ int total_elements
100
+ ) {
101
+ int idx = blockIdx.x * blockDim.x + threadIdx.x;
102
+ if (idx >= total_elements) return;
103
+
104
+ data[idx] = logf(fmaxf(data[idx], 1.0f));
105
+ }
106
+
107
+ // ============================================================================
108
+ // Kernel 4: Fused bicubic resize + normalize for images
109
+ // Each thread computes one output pixel (c, y, x).
110
+ // Catmull-Rom spline (a = -0.5) matching TorchVision bicubic.
111
+ // Output: normalized to [-1, 1] range.
112
+ // ============================================================================
113
+ __device__ float cubic_weight(float x, float a = -0.5f) {
114
+ x = fabsf(x);
115
+ if (x <= 1.0f) {
116
+ return (a + 2.0f) * x * x * x - (a + 3.0f) * x * x + 1.0f;
117
+ } else if (x < 2.0f) {
118
+ return a * x * x * x - 5.0f * a * x * x + 8.0f * a * x - 4.0f * a;
119
+ }
120
+ return 0.0f;
121
+ }
122
+
123
+ __global__ void image_resize_normalize_kernel(
124
+ const unsigned char* __restrict__ image, // [H_in, W_in, 3] uint8
125
+ float* __restrict__ output, // [3, H_out, W_out] float
126
+ int H_in, int W_in,
127
+ int H_out, int W_out
128
+ ) {
129
+ int idx = blockIdx.x * blockDim.x + threadIdx.x;
130
+ int total = 3 * H_out * W_out;
131
+ if (idx >= total) return;
132
+
133
+ int c = idx / (H_out * W_out);
134
+ int rem = idx % (H_out * W_out);
135
+ int y_out = rem / W_out;
136
+ int x_out = rem % W_out;
137
+
138
+ // Map output coordinate to input coordinate
139
+ float scale_y = (float)H_in / (float)H_out;
140
+ float scale_x = (float)W_in / (float)W_out;
141
+ float y_in = ((float)y_out + 0.5f) * scale_y - 0.5f;
142
+ float x_in = ((float)x_out + 0.5f) * scale_x - 0.5f;
143
+
144
+ int y0 = (int)floorf(y_in) - 1;
145
+ int x0 = (int)floorf(x_in) - 1;
146
+
147
+ float sum = 0.0f;
148
+ float weight_sum = 0.0f;
149
+
150
+ // 4x4 bicubic kernel
151
+ for (int dy = 0; dy < 4; dy++) {
152
+ int yy = y0 + dy;
153
+ float wy = cubic_weight(y_in - (float)yy);
154
+ // Clamp to image bounds
155
+ yy = max(0, min(yy, H_in - 1));
156
+
157
+ for (int dx = 0; dx < 4; dx++) {
158
+ int xx = x0 + dx;
159
+ float wx = cubic_weight(x_in - (float)xx);
160
+ xx = max(0, min(xx, W_in - 1));
161
+
162
+ float pixel = (float)image[yy * W_in * 3 + xx * 3 + c];
163
+ float w = wy * wx;
164
+ sum += pixel * w;
165
+ weight_sum += w;
166
+ }
167
+ }
168
+
169
+ // Normalize weights, convert to [0, 1], then to [-1, 1]
170
+ float val = sum / fmaxf(weight_sum, 1e-6f);
171
+ val = val / 255.0f; // [0, 1]
172
+ val = (val - 0.5f) / 0.5f; // [-1, 1]
173
+ output[idx] = val;
174
+ }
175
+
176
+ // ============================================================================
177
+ // Host functions called from C++ bindings
178
+ // ============================================================================
179
+
180
+ torch::Tensor cuda_stft_impl(
181
+ torch::Tensor audio, // [N] float32 on CUDA
182
+ torch::Tensor window, // [win_length] float32 on CUDA
183
+ int64_t n_fft,
184
+ int64_t win_length,
185
+ int64_t hop_length,
186
+ double preemph_coeff // 0.97
187
+ ) {
188
+ TORCH_CHECK(audio.is_cuda(), "audio must be on CUDA");
189
+ TORCH_CHECK(window.is_cuda(), "window must be on CUDA");
190
+ TORCH_CHECK(audio.dtype() == torch::kFloat32, "audio must be float32");
191
+ audio = audio.contiguous();
192
+ window = window.contiguous();
193
+
194
+ int N = audio.size(0);
195
+ int num_frames = (N - win_length) / hop_length + 1;
196
+ if (num_frames <= 0) num_frames = 1;
197
+
198
+ int freq_bins = n_fft / 2 + 1;
199
+
200
+ // Allocate frames buffer [num_frames, n_fft]
201
+ auto frames = torch::zeros({num_frames, n_fft},
202
+ torch::TensorOptions().dtype(torch::kFloat32).device(audio.device()));
203
+
204
+ // Frame extraction + per-frame pre-emphasis + 32768 scaling + windowing
205
+ // Shared memory: win_length floats for the raw frame
206
+ int threads = 256;
207
+ int smem = win_length * sizeof(float);
208
+ frame_extract_preemph_kernel<<<num_frames, threads, smem>>>(
209
+ audio.data_ptr<float>(),
210
+ window.data_ptr<float>(),
211
+ frames.data_ptr<float>(),
212
+ N, n_fft, win_length, hop_length, num_frames,
213
+ (float)preemph_coeff
214
+ );
215
+
216
+ // Batched real-to-complex FFT via cuFFT
217
+ cufftHandle plan;
218
+ CUFFT_CHECK(cufftPlan1d(&plan, n_fft, CUFFT_R2C, num_frames));
219
+
220
+ auto spectrum = torch::empty({num_frames, freq_bins},
221
+ torch::TensorOptions().dtype(torch::kComplexFloat).device(audio.device()));
222
+
223
+ CUFFT_CHECK(cufftExecR2C(plan,
224
+ frames.data_ptr<float>(),
225
+ reinterpret_cast<cufftComplex*>(spectrum.data_ptr<c10::complex<float>>())
226
+ ));
227
+ cufftDestroy(plan);
228
+
229
+ // Power spectrum: |z|^2 = re^2 + im^2
230
+ int total = num_frames * freq_bins;
231
+ int blocks = (total + 255) / 256;
232
+ auto power = torch::empty({num_frames, freq_bins},
233
+ torch::TensorOptions().dtype(torch::kFloat32).device(audio.device()));
234
+
235
+ power_spectrum_kernel<<<blocks, 256>>>(
236
+ reinterpret_cast<const cufftComplex*>(spectrum.data_ptr<c10::complex<float>>()),
237
+ power.data_ptr<float>(),
238
+ total
239
+ );
240
+
241
+ return power; // [num_frames, n_fft/2+1]
242
+ }
243
+
244
+
245
+ torch::Tensor cuda_mel_filterbank_impl(
246
+ torch::Tensor power_spec, // [T, F] float32 on CUDA
247
+ torch::Tensor mel_matrix // [F, n_mels] float32 on CUDA (pre-computed SpeechLib, transposed)
248
+ ) {
249
+ TORCH_CHECK(power_spec.is_cuda(), "power_spec must be on CUDA");
250
+ TORCH_CHECK(mel_matrix.is_cuda(), "mel_matrix must be on CUDA");
251
+ TORCH_CHECK(power_spec.dtype() == torch::kFloat32, "power_spec must be float32");
252
+ TORCH_CHECK(mel_matrix.dtype() == torch::kFloat32, "mel_matrix must be float32");
253
+ power_spec = power_spec.contiguous();
254
+ mel_matrix = mel_matrix.contiguous();
255
+
256
+ // mel_out = power_spec @ mel_matrix → [T, n_mels]
257
+ // mel_matrix is already [F, n_mels] (transposed for dot product)
258
+ auto mel_out = torch::mm(power_spec, mel_matrix);
259
+
260
+ // Fused clip(1.0) + log in-place
261
+ int total = mel_out.numel();
262
+ int threads = 256;
263
+ int blocks = (total + threads - 1) / threads;
264
+ clip_log_kernel<<<blocks, threads>>>(mel_out.data_ptr<float>(), total);
265
+
266
+ return mel_out; // [T, n_mels]
267
+ }
268
+
269
+
270
+ torch::Tensor cuda_audio_pipeline_impl(
271
+ torch::Tensor audio, // [N] float32 on CUDA
272
+ torch::Tensor window, // [win_length] float32 on CUDA
273
+ torch::Tensor mel_matrix, // [F, n_mels] float32 on CUDA (pre-computed SpeechLib)
274
+ int64_t n_fft,
275
+ int64_t win_length,
276
+ int64_t hop_length,
277
+ double preemph_coeff // 0.97
278
+ ) {
279
+ // Full pipeline: audio → frames → FFT → power → mel → clip → log
280
+ // Single call minimizes Python ↔ CUDA round-trips
281
+ auto power = cuda_stft_impl(audio, window, n_fft, win_length, hop_length, preemph_coeff);
282
+ auto mel = cuda_mel_filterbank_impl(power, mel_matrix);
283
+ return mel; // [T, n_mels]
284
+ }
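// --- Illustrative Python-side sketch (not part of this file; the module name
// --- `preproc`, the mel-matrix source, and the 512/400/160 parameters are
// --- assumptions for 16 kHz audio):
// window = torch.hamming_window(400, periodic=False, device="cuda")
// mel_matrix = torch.from_numpy(speechlib_mel).float().cuda()     # [n_fft//2+1, n_mels], precomputed
// audio = torch.from_numpy(wav).float().cuda()                    # [N] raw samples
// log_mel = preproc.cuda_audio_pipeline(audio, window, mel_matrix,
//                                       512, 400, 160, 0.97)      # -> [T, n_mels]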
285
+
286
+
287
+ torch::Tensor cuda_image_preprocess_impl(
288
+ torch::Tensor image_rgb, // [H, W, 3] uint8 on CUDA
289
+ int64_t crop_size
290
+ ) {
291
+ TORCH_CHECK(image_rgb.is_cuda(), "image must be on CUDA");
292
+ TORCH_CHECK(image_rgb.dtype() == torch::kUInt8, "image must be uint8");
293
+ image_rgb = image_rgb.contiguous();
294
+
295
+ int H_in = image_rgb.size(0);
296
+ int W_in = image_rgb.size(1);
297
+ int H_out = crop_size;
298
+ int W_out = crop_size;
299
+
300
+ // Output: [3, H_out, W_out] float32, then we'll convert to bf16
301
+ auto output = torch::empty({3, H_out, W_out},
302
+ torch::TensorOptions().dtype(torch::kFloat32).device(image_rgb.device()));
303
+
304
+ int total = 3 * H_out * W_out;
305
+ int threads = 256;
306
+ int blocks = (total + threads - 1) / threads;
307
+
308
+ image_resize_normalize_kernel<<<blocks, threads>>>(
309
+ image_rgb.data_ptr<unsigned char>(),
310
+ output.data_ptr<float>(),
311
+ H_in, W_in, H_out, W_out
312
+ );
313
+
314
+ // Add batch dimension and convert to bfloat16
315
+ return output.unsqueeze(0).to(torch::kBFloat16); // [1, 3, H_out, W_out]
316
+ }
FireEcho Engine/cutlass_kernels.py ADDED
@@ -0,0 +1,2418 @@
1
+ """
2
+ FireEcho CUTLASS — Self-Contained CUTLASS-Compatible Kernels
3
+ =============================================================
4
+ Part of the FireEcho Engine — Custom inference kernel for NVIDIA Blackwell
5
+ Copyright (c) 2025-2026 Echo (FireEcho Project). All rights reserved.
6
+
7
+ Pure Python/Triton/PyTorch implementations — no .so binary required.
8
+
9
+ 1. TMA MatMul — Triton block-pointer kernel with multi-stage pipelining
10
+ 2. TMA Attention — PyTorch SDPA (dispatches to Flash Attention 2 on HW)
11
+ 3. NVFP4 GEMM — Fused dequant-matmul Triton kernel (Blackwell native format)
12
+ 16-element blocks, E4M3 scales, per-tensor FP32 scale.
13
+ Multi-tier dispatch: native cuBLAS _scaled_mm → fused Triton → CPU.
14
+ Vectorized O(K*N) activation quantization via torch.bucketize.
15
+ 4. MXFP4 GEMM — Fused dequant-matmul Triton kernel (OCP MXFP4 format)
16
+ 32-element blocks, E8M0 power-of-two scales.
17
+ Kept for backward compatibility.
18
+ 5. L2 Cache Control — ctypes/libcudart.so cudaAccessPolicyWindow
19
+
20
+ Usage:
21
+ from fireecho_kernel.cutlass_kernels import (
22
+ tma_matmul,
23
+ tma_attention,
24
+ nvfp4_gemm, # New: NVFP4 (recommended)
25
+ mxfp4_gemm, # Legacy: MXFP4
26
+ fp4_gemm, # Alias -> nvfp4_gemm
27
+ NVFP4Weights,
28
+ MXFP4Weights,
29
+ L2CacheManager,
30
+ )
31
+
32
+ # TMA MatMul (Triton block-pointer)
33
+ c = tma_matmul(a, b)
34
+
35
+ # NVFP4 GEMM (recommended — fused dequant-matmul, 16-element blocks)
36
+ w_q = quantize_to_nvfp4(weights)
37
+ out = nvfp4_gemm(activations, w_q)
38
+
39
+ # MXFP4 GEMM (legacy — fused dequant-matmul, 32-element blocks)
40
+ w_q = quantize_to_mxfp4(weights)
41
+ out = mxfp4_gemm(activations, w_q)
42
+
43
+ # L2 Cache pinning (hardware-backed via cudart)
44
+ l2 = L2CacheManager()
45
+ l2.pin(embedding_table)
46
+ """
47
+
48
+ import torch
49
+ import torch.nn.functional as F
50
+ import triton
51
+ import triton.language as tl
52
+ from typing import Optional, Tuple, Dict, Any
53
+ from dataclasses import dataclass
54
+ import ctypes
55
+ import ctypes.util
56
+
57
+
58
+ # =============================================================================
59
+ # Triton TMA MatMul Kernel (block-pointer, multi-stage)
60
+ # =============================================================================
61
+
62
+ @triton.autotune(
63
+ configs=[
64
+ triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=3, num_warps=8),
65
+ triton.Config({'BLOCK_M': 128, 'BLOCK_N': 256, 'BLOCK_K': 64}, num_stages=3, num_warps=8),
66
+ triton.Config({'BLOCK_M': 256, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=3, num_warps=8),
67
+ triton.Config({'BLOCK_M': 64, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=4, num_warps=4),
68
+ triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 32}, num_stages=4, num_warps=8),
69
+ triton.Config({'BLOCK_M': 64, 'BLOCK_N': 64, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
70
+ ],
71
+ key=['M', 'N', 'K'],
72
+ )
73
+ @triton.jit
74
+ def _tma_matmul_kernel(
75
+ a_ptr, b_ptr, c_ptr, d_ptr,
76
+ M, N, K,
77
+ stride_am, stride_ak,
78
+ stride_bk, stride_bn,
79
+ stride_cm, stride_cn,
80
+ stride_dm, stride_dn,
81
+ alpha, beta,
82
+ HAS_C: tl.constexpr,
83
+ BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
84
+ ):
85
+ """
86
+ TMA-style MatMul using block pointers for async memory access.
87
+
88
+ D = alpha * (A @ B) + beta * C
89
+
90
+ Block pointers enable hardware-managed address generation and
91
+ async GDDR7/HBM -> SMEM loads overlapped with compute.
92
+ """
93
+ pid_m = tl.program_id(0)
94
+ pid_n = tl.program_id(1)
95
+
96
+ a_block_ptr = tl.make_block_ptr(
97
+ base=a_ptr,
98
+ shape=(M, K),
99
+ strides=(stride_am, stride_ak),
100
+ offsets=(pid_m * BLOCK_M, 0),
101
+ block_shape=(BLOCK_M, BLOCK_K),
102
+ order=(1, 0),
103
+ )
104
+ b_block_ptr = tl.make_block_ptr(
105
+ base=b_ptr,
106
+ shape=(K, N),
107
+ strides=(stride_bk, stride_bn),
108
+ offsets=(0, pid_n * BLOCK_N),
109
+ block_shape=(BLOCK_K, BLOCK_N),
110
+ order=(1, 0),
111
+ )
112
+
113
+ acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
114
+
115
+ for _ in range(0, tl.cdiv(K, BLOCK_K)):
116
+ a = tl.load(a_block_ptr, boundary_check=(0, 1))
117
+ b = tl.load(b_block_ptr, boundary_check=(0, 1))
118
+ acc += tl.dot(a, b)
119
+ a_block_ptr = tl.advance(a_block_ptr, (0, BLOCK_K))
120
+ b_block_ptr = tl.advance(b_block_ptr, (BLOCK_K, 0))
121
+
122
+ # Apply alpha
123
+ if alpha != 1.0:
124
+ acc = acc * alpha
125
+
126
+ # Apply beta * C
127
+ if HAS_C:
128
+ c_block_ptr = tl.make_block_ptr(
129
+ base=c_ptr,
130
+ shape=(M, N),
131
+ strides=(stride_cm, stride_cn),
132
+ offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
133
+ block_shape=(BLOCK_M, BLOCK_N),
134
+ order=(1, 0),
135
+ )
136
+ c_val = tl.load(c_block_ptr, boundary_check=(0, 1)).to(tl.float32)
137
+ acc = acc + beta * c_val
138
+
139
+ # Store result
140
+ d_block_ptr = tl.make_block_ptr(
141
+ base=d_ptr,
142
+ shape=(M, N),
143
+ strides=(stride_dm, stride_dn),
144
+ offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
145
+ block_shape=(BLOCK_M, BLOCK_N),
146
+ order=(1, 0),
147
+ )
148
+ tl.store(d_block_ptr, acc.to(d_ptr.dtype.element_ty), boundary_check=(0, 1))
149
+
150
+
151
+ # =============================================================================
152
+ # TMA MatMul (public API)
153
+ # =============================================================================
154
+
155
+ def tma_matmul(
156
+ a: torch.Tensor,
157
+ b: torch.Tensor,
158
+ alpha: float = 1.0,
159
+ beta: float = 0.0,
160
+ c: Optional[torch.Tensor] = None,
161
+ ) -> torch.Tensor:
162
+ """
163
+ TMA-style matrix multiplication via Triton block-pointer kernel.
164
+
165
+ Uses async memory transfers (block pointers + multi-stage pipelining)
166
+ for compute/memory overlap on SM90+ GPUs.
167
+
168
+ Args:
169
+ a: Input matrix A [M, K] in BF16/FP16
170
+ b: Input matrix B [K, N] in BF16/FP16
171
+ alpha: Scale for A @ B
172
+ beta: Scale for C
173
+ c: Optional input C for D = alpha * A @ B + beta * C
174
+
175
+ Returns:
176
+ Output matrix D [M, N]
177
+ """
178
+ M, K = a.shape
179
+ K2, N = b.shape
180
+ assert K == K2, f"K dimension mismatch: {K} vs {K2}"
181
+
182
+ if a.dtype not in (torch.bfloat16, torch.float16):
183
+ a = a.to(torch.bfloat16)
184
+ if b.dtype != a.dtype:
185
+ b = b.to(a.dtype)
186
+
187
+ d = torch.empty(M, N, device=a.device, dtype=a.dtype)
188
+
189
+ if c is not None and beta != 0:
190
+ if c.dtype != a.dtype:
191
+ c = c.to(a.dtype)
192
+ c_contiguous = c.contiguous()
193
+ has_c = True
194
+ else:
195
+ c_contiguous = d # dummy — not read when HAS_C=False
196
+ beta = 0.0
197
+ has_c = False
198
+
199
+ a = a.contiguous()
200
+ b = b.contiguous()
201
+
202
+ # Fall back to torch.matmul on CPU
203
+ if not a.is_cuda:
204
+ result = alpha * torch.matmul(a.float(), b.float()).to(a.dtype)
205
+ if has_c:
206
+ result = result + beta * c_contiguous
207
+ return result
208
+
209
+ grid = lambda META: (
210
+ triton.cdiv(M, META['BLOCK_M']),
211
+ triton.cdiv(N, META['BLOCK_N']),
212
+ )
213
+
214
+ _tma_matmul_kernel[grid](
215
+ a, b, c_contiguous, d,
216
+ M, N, K,
217
+ a.stride(0), a.stride(1),
218
+ b.stride(0), b.stride(1),
219
+ c_contiguous.stride(0), c_contiguous.stride(1),
220
+ d.stride(0), d.stride(1),
221
+ alpha, beta,
222
+ HAS_C=has_c,
223
+ )
224
+
225
+ return d
226
+
227
+
228
+ # =============================================================================
229
+ # TMA Attention (SDPA-backed)
230
+ # =============================================================================
231
+
232
+ def tma_attention(
233
+ q: torch.Tensor,
234
+ k: torch.Tensor,
235
+ v: torch.Tensor,
236
+ scale: Optional[float] = None,
237
+ is_causal: bool = False,
238
+ dropout_p: float = 0.0,
239
+ ) -> torch.Tensor:
240
+ """
241
+ Attention via PyTorch SDPA (dispatches to Flash Attention 2 on supported HW).
242
+
243
+ Args:
244
+ q: Query tensor [batch, heads, seq_q, head_dim]
245
+ k: Key tensor [batch, heads, seq_kv, head_dim]
246
+ v: Value tensor [batch, heads, seq_kv, head_dim]
247
+ scale: Attention scale (default: 1/sqrt(head_dim))
248
+ is_causal: Apply causal mask
249
+ dropout_p: Dropout probability
250
+
251
+ Returns:
252
+ Output tensor [batch, heads, seq_q, head_dim]
253
+ """
254
+ if scale is None:
255
+ scale = q.shape[-1] ** -0.5
256
+
257
+ orig_dtype = q.dtype
258
+ if q.dtype not in (torch.bfloat16, torch.float16):
259
+ q = q.to(torch.bfloat16)
260
+ k = k.to(torch.bfloat16)
261
+ v = v.to(torch.bfloat16)
262
+
263
+ o = F.scaled_dot_product_attention(
264
+ q, k, v,
265
+ attn_mask=None,
266
+ dropout_p=dropout_p if q.requires_grad else 0.0,
267
+ is_causal=is_causal,
268
+ scale=scale,
269
+ )
270
+
271
+ return o.to(orig_dtype)
272
+
273
+
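# Illustrative usage sketch (comment only; the shapes below are examples, not
# requirements of this function):
#   q = torch.randn(1, 32, 1024, 128, device="cuda", dtype=torch.bfloat16)
#   k = torch.randn(1, 32, 1024, 128, device="cuda", dtype=torch.bfloat16)
#   v = torch.randn(1, 32, 1024, 128, device="cuda", dtype=torch.bfloat16)
#   out = tma_attention(q, k, v, is_causal=True)   # [1, 32, 1024, 128], same dtype as q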
274
+ def tma_gqa_attention(
275
+ q: torch.Tensor,
276
+ k: torch.Tensor,
277
+ v: torch.Tensor,
278
+ num_kv_heads: int,
279
+ scale: Optional[float] = None,
280
+ is_causal: bool = False,
281
+ ) -> torch.Tensor:
282
+ """
283
+ Grouped Query Attention via SDPA.
284
+
285
+ Expands KV heads to match Q heads then delegates to tma_attention.
286
+
287
+ Args:
288
+ q: Query [batch, num_q_heads, seq, head_dim]
289
+ k: Key [batch, num_kv_heads, seq, head_dim]
290
+ v: Value [batch, num_kv_heads, seq, head_dim]
291
+ num_kv_heads: Number of KV heads
292
+ scale: Attention scale
293
+ is_causal: Apply causal mask
294
+
295
+ Returns:
296
+ Output [batch, num_q_heads, seq, head_dim]
297
+ """
298
+ batch, num_q_heads, seq_q, head_dim = q.shape
299
+ heads_per_group = num_q_heads // num_kv_heads
300
+
301
+ if heads_per_group > 1:
302
+ k = k.repeat_interleave(heads_per_group, dim=1)
303
+ v = v.repeat_interleave(heads_per_group, dim=1)
304
+
305
+ return tma_attention(q, k, v, scale=scale, is_causal=is_causal)
306
+
307
+
308
+ # =============================================================================
309
+ # Native MXFP4 GEMM (Quartet Algorithm)
310
+ # =============================================================================
311
+ # Reference: "Quartet: Native FP4 Training Can Be Optimal for LLMs"
312
+ # https://arxiv.org/html/2505.14669v1
313
+
314
+ # E2M1 quantization grid (MXFP4/NVFP4)
315
+ _E2M1_VALUES = torch.tensor(
316
+ [0, 0.5, 1, 1.5, 2, 3, 4, 6, 0, -0.5, -1, -1.5, -2, -3, -4, -6],
317
+ dtype=torch.float32,
318
+ )
319
+
320
+ # Bucketize boundaries for vectorized E2M1 quantization.
321
+ # Midpoints between adjacent unsigned E2M1 values [0, 0.5, 1, 1.5, 2, 3, 4, 6].
322
+ # torch.bucketize gives index 0-7 for unsigned magnitude, then sign is applied.
323
+ _E2M1_BOUNDARIES = torch.tensor([0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0])
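# Worked example (comment only): torch.bucketize(2.4, _E2M1_BOUNDARIES) == 4 and
# _E2M1_VALUES[4] == 2.0; 2.4 lies between 2 and 3 but below the 2.5 midpoint,
# so it rounds to magnitude 2.0, and a negative input only flips the sign bit (|= 8).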
324
+
325
+ # QuEST optimal clipping factor (empirically derived)
326
+ _QUEST_CLIP_FACTOR = 0.88
327
+
328
+
329
+ @dataclass
330
+ class MXFP4Weights:
331
+ """
332
+ MXFP4 quantized weights following OCP Microscaling Spec v1.0.
333
+
334
+ Format: 32 E2M1 values share 1 E8M0 power-of-two scale.
335
+
336
+ Memory layout:
337
+ - packed: [K//2, N] uint8 (2 nibbles per byte)
338
+ - scales: [K//32, N] uint8 (E8M0 exponent-only)
339
+
340
+ Total size: K*N/2 + K*N/32 ≈ 0.53 bytes per weight (~27% of FP16)
341
+ """
342
+ packed: torch.Tensor # [K//2, N] uint8
343
+ scales: torch.Tensor # [K//32, N] uint8 (E8M0)
344
+ shape: Tuple[int, int] # Original (K, N)
345
+ clip_mask: Optional[torch.Tensor] = None # For QuEST gradient masking
346
+
347
+ @classmethod
348
+ def from_float(cls, weights: torch.Tensor, use_quest: bool = True) -> 'MXFP4Weights':
349
+ """
350
+ Quantize FP16/FP32 weights to MXFP4 with block scaling.
351
+
352
+ Args:
353
+ weights: Input tensor [K, N]
354
+ use_quest: Use QuEST optimal clipping (recommended for forward)
355
+
356
+ Returns:
357
+ MXFP4Weights with packed values and E8M0 scales
358
+ """
359
+ K, N = weights.shape
360
+ assert K % 32 == 0, f"K ({K}) must be multiple of 32 for MXFP4"
361
+
362
+ device = weights.device
363
+ weights = weights.float()
364
+
365
+ # Reshape to blocks of 32
366
+ reshaped = weights.view(K // 32, 32, N)
367
+
368
+ # Find block-wise absmax
369
+ absmax = reshaped.abs().amax(dim=1) # [K//32, N]
370
+ absmax = absmax.clamp(min=1e-10)
371
+
372
+ # Apply QuEST clipping factor
373
+ if use_quest:
374
+ clip_bound = absmax * _QUEST_CLIP_FACTOR
375
+ else:
376
+ clip_bound = absmax
377
+
378
+ # Compute E8M0 scales (power-of-two)
379
+ # E8M0: value = 2^(exponent - 127), exponent in [0, 255]
380
+ # We want scale * 6.0 >= clip_bound, so scale >= clip_bound / 6
381
+ scale_float = clip_bound / 6.0 # 6.0 is E2M1 max
382
+
383
+ # Convert to E8M0 (find nearest power of 2)
384
+ log2_scale = torch.log2(scale_float.clamp(min=2**-126))
385
+ exponent = (log2_scale.round() + 127).clamp(1, 254).to(torch.uint8)
386
+
387
+ # Reconstruct actual scale from E8M0
388
+ actual_scale = torch.pow(2.0, exponent.float() - 127) # [K//32, N]
389
+
390
+ # Normalize by scale
391
+ normalized = reshaped / actual_scale.unsqueeze(1) # [K//32, 32, N]
392
+
393
+ # Clamp to E2M1 range [-6, 6]
394
+ normalized = normalized.clamp(-6.0, 6.0)
395
+
396
+ # Generate clip mask for gradient (QuEST)
397
+ if use_quest:
398
+ clip_mask = (reshaped.abs() > actual_scale.unsqueeze(1) * 6.0).view(K, N)
399
+ else:
400
+ clip_mask = None
401
+
402
+ # Quantize to nearest E2M1 value via vectorized bucketize.
403
+ # O(K*N) instead of O(K*N*16) distance matrix — eliminates 1GB temp alloc.
404
+ boundaries = _E2M1_BOUNDARIES.to(device)
405
+ abs_norm = normalized.abs().reshape(-1) # [K * N]
406
+ unsigned_idx = torch.bucketize(abs_norm, boundaries) # [K * N], values 0-7
407
+ # Sign bit only when magnitude > 0 (±0 both decode to 0.0, use index 0)
408
+ sign_bit = ((normalized.reshape(-1) < 0) & (unsigned_idx > 0)).to(torch.uint8) << 3
409
+ indices = (sign_bit | unsigned_idx.to(torch.uint8)).reshape(K, N)
410
+ packed = (indices[0::2] | (indices[1::2] << 4)) # [K//2, N]
411
+
412
+ return cls(
413
+ packed=packed,
414
+ scales=exponent,
415
+ shape=(K, N),
416
+ clip_mask=clip_mask,
417
+ )
418
+
419
+ def to_float(self) -> torch.Tensor:
420
+ """Dequantize MXFP4 back to float."""
421
+ K, N = self.shape
422
+ device = self.packed.device
423
+
424
+ e2m1_grid = _E2M1_VALUES.to(device)
425
+
426
+ # Unpack nibbles
427
+ low = (self.packed & 0xF).long()
428
+ high = (self.packed >> 4).long()
429
+
430
+ # Decode E2M1 values
431
+ low_vals = e2m1_grid[low.flatten()].view(K // 2, N)
432
+ high_vals = e2m1_grid[high.flatten()].view(K // 2, N)
433
+
434
+ # Interleave
435
+ unpacked = torch.zeros(K, N, device=device, dtype=torch.float32)
436
+ unpacked[0::2] = low_vals
437
+ unpacked[1::2] = high_vals
438
+
439
+ # Apply E8M0 scales
440
+ scale_float = torch.pow(2.0, self.scales.float() - 127) # [K//32, N]
441
+ unpacked = unpacked.view(K // 32, 32, N)
442
+ unpacked = unpacked * scale_float.unsqueeze(1)
443
+
444
+ return unpacked.view(K, N)
445
+
446
+ @property
447
+ def compression_ratio(self) -> float:
448
+ """Memory compression ratio vs FP16."""
449
+ K, N = self.shape
450
+ fp16_bytes = K * N * 2
451
+ mxfp4_bytes = self.packed.numel() + self.scales.numel()
452
+ return fp16_bytes / mxfp4_bytes
453
+
454
+ def to_native(self) -> 'NativeMXFP4':
455
+ """
456
+ Convert to native FP4 format for tl.dot_scaled (SM100+).
457
+
458
+ One-time conversion that:
459
+ 1. Transposes packed weights: [K//2, N] -> [N, K//2]
460
+ 2. Converts E8M0 scales to 5D preshuffled MXScaleTensor layout:
461
+ [N//128, K//32//4, 32, 4, 4]
462
+ 3. Caches the result so subsequent calls return immediately.
463
+
464
+ Returns:
465
+ NativeMXFP4 with preshuffled layout for hardware MMA.
466
+ """
467
+ if hasattr(self, '_native_cache') and self._native_cache is not None:
468
+ return self._native_cache
469
+
470
+ K, N = self.shape
471
+
472
+ # Transpose packed weights for column-major access pattern
473
+ packed_t = self.packed.T.contiguous() # [N, K//2]
474
+
475
+ # Build 5D preshuffled scale tensor for MXScaleTensor layout
476
+ # Hardware expects: [N//128, K//32//4, 32, 4, 4]
477
+ # This arranges scales so tensor core warps can load them directly.
478
+ num_scale_k = K // 32
479
+ num_scale_n = N
480
+
481
+ # Pad N to multiple of 128 for the 5D layout
482
+ n_blocks = (N + 127) // 128
483
+
484
+ # Reshape scales [K//32, N] -> 5D preshuffled
485
+ scales_flat = self.scales.contiguous() # [K//32, N]
486
+
487
+ # Group K scales into groups of 4
488
+ k_groups = (num_scale_k + 3) // 4
489
+
490
+ scales_5d = torch.zeros(
491
+ n_blocks, k_groups, 32, 4, 4,
492
+ dtype=torch.uint8, device=self.packed.device,
493
+ )
494
+
495
+ # Fill the 5D tensor: map (k_scale_idx, n_idx) -> 5D position
496
+ for nb in range(n_blocks):
497
+ for kg in range(k_groups):
498
+ for inner_n in range(min(128, N - nb * 128)):
499
+ n_idx = nb * 128 + inner_n
500
+ if n_idx >= N:
501
+ break
502
+ # Map inner_n into (d2, d4) where d2 is in [0,32), d4 in [0,4)
503
+ d4 = inner_n % 4
504
+ d2 = (inner_n // 4) % 32
505
+ for d3 in range(min(4, num_scale_k - kg * 4)):
506
+ k_idx = kg * 4 + d3
507
+ if k_idx < num_scale_k:
508
+ scales_5d[nb, kg, d2, d3, d4] = scales_flat[k_idx, n_idx]
509
+
510
+ native = NativeMXFP4(
511
+ packed_t=packed_t,
512
+ scales_5d=scales_5d,
513
+ shape=(K, N),
514
+ )
515
+ self._native_cache = native
516
+ return native
517
+
518
+
519
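+ # -----------------------------------------------------------------------------
+ # Illustrative usage sketch (hypothetical helper, not part of the engine API):
+ # quantize a weight matrix to MXFP4, dequantize it, and report the
+ # reconstruction error and memory footprint. Assumes K is a multiple of 32.
+ def _example_mxfp4_roundtrip() -> None:
+     w = torch.randn(256, 512)                       # [K, N], K % 32 == 0
+     q = MXFP4Weights.from_float(w, use_quest=True)
+     w_hat = q.to_float()                            # dequantized approximation
+     rel_err = ((w - w_hat).norm() / w.norm()).item()
+     print(f"rel err {rel_err:.4f}, compression vs FP16 {q.compression_ratio:.2f}x")
+ 
+ 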
+ @dataclass
520
+ class NativeMXFP4:
521
+ """
522
+ Native FP4 format for tl.dot_scaled hardware path (SM100+).
523
+
524
+ Preshuffled layout matching MXScaleTensor requirements:
525
+ - packed_t: [N, K//2] uint8 — transposed packed weights
526
+ - scales_5d: [N//128, K//32//4, 32, 4, 4] uint8 — preshuffled E8M0
527
+
528
+ Created via MXFP4Weights.to_native(). Cached so conversion is one-time.
529
+ """
530
+ packed_t: torch.Tensor # [N, K//2] uint8
531
+ scales_5d: torch.Tensor # [N//128, K//32//4, 32, 4, 4] uint8
532
+ shape: Tuple[int, int] # Original (K, N)
533
+
534
+
535
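+ # -----------------------------------------------------------------------------
+ # Illustrative sketch of the preshuffled scale indexing used above (hypothetical
+ # helper, for documentation only). It mirrors both the fill loop in
+ # MXFP4Weights.to_native() and the flat `s_offset` computed in the native
+ # kernel, so the two sides can be cross-checked.
+ def _example_scale_offset(k_idx: int, n_idx: int, K: int) -> int:
+     """Flat offset of scale (k_idx, n_idx) in the [N//128, K//32//4, 32, 4, 4] tensor."""
+     kg_total = (K // 32 + 3) // 4
+     nb, inner_n = n_idx // 128, n_idx % 128
+     d4 = inner_n % 4
+     d2 = (inner_n // 4) % 32
+     kg, d3 = k_idx // 4, k_idx % 4
+     return (((nb * kg_total + kg) * 32 + d2) * 4 + d3) * 4 + d4
+ 
+ 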
+ # =============================================================================
536
+ # E4M3 (FP8) Encode/Decode Helpers
537
+ # =============================================================================
538
+
539
+ def _encode_e4m3(values: torch.Tensor) -> torch.Tensor:
540
+ """Encode FP32 values to E4M3 (FP8) as uint8. Vectorized."""
541
+ if hasattr(torch, 'float8_e4m3fn'):
542
+ return values.clamp(-448.0, 448.0).to(torch.float8_e4m3fn).view(torch.uint8)
543
+ # Manual fallback: clamp to representable range and use bit manipulation
544
+ v = values.float().clamp(-448.0, 448.0)
545
+ sign = (v < 0).to(torch.uint8) << 7
546
+ av = v.abs().clamp(min=0.0)
547
+ # E4M3: bias=7, subnormal threshold = 2^-6
548
+ # Normal: (1 + mant/8) * 2^(exp-7)
549
+ # Subnormal (exp=0): (mant/8) * 2^-6
550
+ log2_av = torch.log2(av.clamp(min=2**-9)) # min subnormal = 2^-9
551
+ exp_raw = torch.floor(log2_av).clamp(-9, 8)  # allow exp_biased == 0 so the subnormal branch is reachable
552
+ exp_biased = (exp_raw + 7).clamp(0, 15)
553
+ # For normal values
554
+ mantissa_f = (av / torch.pow(2.0, exp_raw) - 1.0) * 8.0
555
+ mantissa = mantissa_f.round().clamp(0, 7).to(torch.uint8)
556
+ # For subnormal (exp_biased == 0)
557
+ sub_mant = (av / (2**-6) * 8.0).round().clamp(0, 7).to(torch.uint8)
558
+ is_sub = exp_biased == 0
559
+ final_mant = torch.where(is_sub, sub_mant, mantissa)
560
+ encoded = sign | (exp_biased.to(torch.uint8) << 3) | final_mant
+ # Exact zeros must encode to 0x00; the log2-based path above cannot produce it.
+ return torch.where(av == 0, torch.zeros_like(encoded), encoded)
561
+
562
+
563
+ def _decode_e4m3(encoded: torch.Tensor) -> torch.Tensor:
564
+ """Decode E4M3 uint8 back to FP32. Vectorized."""
565
+ if hasattr(torch, 'float8_e4m3fn'):
566
+ return encoded.view(torch.float8_e4m3fn).float()
567
+ # Manual fallback
568
+ sign = ((encoded >> 7) & 1).float()
569
+ exp = ((encoded >> 3) & 0xF).long()
570
+ mant = (encoded & 0x7).long()
571
+ is_normal = exp > 0
572
+ normal_val = (8 + mant).float() * torch.pow(2.0, (exp - 10).float())
573
+ subnormal_val = mant.float() * (2.0 ** -9)
574
+ unsigned = torch.where(is_normal, normal_val, subnormal_val)
575
+ return torch.where(sign != 0, -unsigned, unsigned)
576
+
577
+
578
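+ # -----------------------------------------------------------------------------
+ # Illustrative sanity check for the E4M3 helpers (hypothetical helper name):
+ # round-trip a few exactly representable values through encode/decode. All of
+ # these sit on the E4M3 grid, so both the native float8 path and the manual
+ # fallback should reproduce them exactly.
+ def _example_e4m3_roundtrip() -> None:
+     vals = torch.tensor([0.0, 0.0625, 0.25, 1.0, 1.75, 240.0, 448.0, -3.5])
+     decoded = _decode_e4m3(_encode_e4m3(vals))
+     assert torch.allclose(decoded, vals), (vals, decoded)
+ 
+ 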
+ # =============================================================================
579
+ # NVFP4 Weights (NVIDIA Blackwell native format)
580
+ # =============================================================================
581
+
582
+ @dataclass
583
+ class NVFP4Weights:
584
+ """
585
+ NVFP4 quantized weights — NVIDIA Blackwell native format.
586
+
587
+ Format: 16 E2M1 values share 1 E4M3 (FP8) scale + per-tensor FP32 scale.
588
+ Two-level hierarchical scaling enables native 5th-gen Tensor Core support.
589
+
590
+ Memory layout:
591
+ - packed: [K//2, N] uint8 (2 nibbles per byte, same E2M1 encoding)
592
+ - block_scales: [K//16, N] uint8 (E4M3 per-block scale)
593
+ - tensor_scale: float (FP32 per-tensor global scale)
594
+
595
+ Optional FP8 residual correction (double-buff):
596
+ - residual: [K, N] uint8 (E4M3 encoded quantization error)
597
+ - residual_scales: [K//16, N] float32 (per-block scales for residual)
598
+ When present, the fused kernel adds the decoded residual to recover
599
+ near-FP16 accuracy at 1.625 B/elem (vs 2.0 for FP16).
600
+
601
+ Total size without residual: K*N/2 + K*N/16 bytes ~ 0.5625 B/elem (~28% of FP16)
602
+ Total size with residual: ~1.625 B/elem (~81% of FP16)
603
+ """
604
+ packed: torch.Tensor # [K//2, N] uint8 — E2M1 nibble packing
605
+ block_scales: torch.Tensor # [K//16, N] uint8 — E4M3 per-block scale
606
+ tensor_scale: float # FP32 per-tensor global scale
607
+ shape: Tuple[int, int] # (K, N)
608
+ clip_mask: Optional[torch.Tensor] = None
609
+ # FP8 residual correction (optional, "double-buff")
610
+ residual: Optional[torch.Tensor] = None # [K, N] uint8 — E4M3 encoded
611
+ residual_scales: Optional[torch.Tensor] = None # [K//16, N] float32 per-block
612
+
613
+ @classmethod
614
+ def from_float(cls, weights: torch.Tensor, use_quest: bool = True,
615
+ compute_residual: bool = False) -> 'NVFP4Weights':
616
+ """
617
+ Quantize FP16/FP32 weights to NVFP4 with hierarchical scaling.
618
+
619
+ Two-level scaling:
620
+ 1. Per-tensor FP32 scale (global_absmax / 448)
621
+ 2. Per-block E4M3 scale (block_absmax / (tensor_scale * 6.0))
622
+
623
+ Args:
624
+ weights: Input tensor [K, N]
625
+ use_quest: Use QuEST optimal clipping (recommended)
626
+ compute_residual: Compute FP8 residual correction (double-buff).
627
+ When True, the quantization error (original - FP4 dequant) is
628
+ quantized to E4M3 FP8 with per-block scaling and stored alongside
629
+ the FP4 weights. The fused kernel adds this residual for near-FP16
630
+ accuracy at 1.625 B/elem.
631
+
632
+ Returns:
633
+ NVFP4Weights with packed values, E4M3 block scales, and FP32 tensor scale
634
+ (plus optional residual and residual_scales when compute_residual=True)
635
+ """
636
+ K, N = weights.shape
637
+ assert K % 16 == 0, f"K ({K}) must be multiple of 16 for NVFP4"
638
+
639
+ device = weights.device
640
+ weights_f = weights.float()
641
+
642
+ # Reshape to blocks of 16
643
+ reshaped = weights_f.view(K // 16, 16, N)
644
+
645
+ # Block-wise absmax
646
+ absmax = reshaped.abs().amax(dim=1) # [K//16, N]
647
+ absmax = absmax.clamp(min=1e-10)
648
+
649
+ # Apply QuEST clipping
650
+ if use_quest:
651
+ clip_bound = absmax * _QUEST_CLIP_FACTOR
652
+ else:
653
+ clip_bound = absmax
654
+
655
+ # Level 1: per-tensor scale
656
+ global_absmax = clip_bound.max().clamp(min=1e-10)
657
+ tensor_scale = (global_absmax / 448.0).item() # 448 = E4M3 max
658
+
659
+ # Level 2: per-block E4M3 scale
660
+ target_scale = clip_bound / (tensor_scale * 6.0) # 6.0 = E2M1 max
661
+ target_scale = target_scale.clamp(min=1e-10)
662
+ block_scales_fp8 = _encode_e4m3(target_scale) # [K//16, N] uint8
663
+
664
+ # Actual scale per block = decode(block_scales_fp8) * tensor_scale
665
+ actual_block_scale = _decode_e4m3(block_scales_fp8) * tensor_scale # [K//16, N]
666
+ actual_block_scale = actual_block_scale.clamp(min=1e-10)
667
+
668
+ # Normalize and clamp
669
+ normalized = reshaped / actual_block_scale.unsqueeze(1) # [K//16, 16, N]
670
+ normalized = normalized.clamp(-6.0, 6.0)
671
+
672
+ # Generate clip mask for gradient (QuEST)
673
+ if use_quest:
674
+ clip_mask = (reshaped.abs() > actual_block_scale.unsqueeze(1) * 6.0).view(K, N)  # elements hit by the clamp above
675
+ else:
676
+ clip_mask = None
677
+
678
+ # Quantize via vectorized bucketize (same as MXFP4 Step 1)
679
+ boundaries = _E2M1_BOUNDARIES.to(device)
680
+ abs_norm = normalized.abs().reshape(-1)
681
+ unsigned_idx = torch.bucketize(abs_norm, boundaries)
682
+ sign_bit = ((normalized.reshape(-1) < 0) & (unsigned_idx > 0)).to(torch.uint8) << 3
683
+ indices = (sign_bit | unsigned_idx.to(torch.uint8)).reshape(K, N)
684
+
685
+ # Pack pairs of nibbles
686
+ packed = (indices[0::2] | (indices[1::2] << 4)) # [K//2, N]
687
+
688
+ # --- FP8 residual correction (double-buff) ---
689
+ residual_e4m3 = None
690
+ residual_scales = None
691
+ if compute_residual:
692
+ # Dequant the FP4 approximation
693
+ fp4_approx = cls(
694
+ packed=packed, block_scales=block_scales_fp8,
695
+ tensor_scale=tensor_scale, shape=(K, N),
696
+ ).to_float()
697
+ # Residual = original - FP4 approximation
698
+ residual_float = weights_f - fp4_approx # [K, N]
699
+
700
+ # Quantize residual to FP8 E4M3 with per-block scaling (blocks of 16)
701
+ res_blocks = residual_float.view(K // 16, 16, N)
702
+ res_absmax = res_blocks.abs().amax(dim=1).clamp(min=1e-10) # [K//16, N]
703
+ res_scale = res_absmax / 448.0 # E4M3 max value
704
+ res_normalized = res_blocks / res_scale.unsqueeze(1)
705
+ res_normalized = res_normalized.clamp(-448.0, 448.0)
706
+ # Encode to E4M3 using native PyTorch path
707
+ residual_e4m3 = res_normalized.view(K, N).to(torch.float8_e4m3fn).view(torch.uint8)
708
+ residual_scales = res_scale # [K//16, N] float32
709
+
710
+ return cls(
711
+ packed=packed,
712
+ block_scales=block_scales_fp8,
713
+ tensor_scale=tensor_scale,
714
+ shape=(K, N),
715
+ clip_mask=clip_mask,
716
+ residual=residual_e4m3,
717
+ residual_scales=residual_scales,
718
+ )
719
+
720
+ def to_float(self) -> torch.Tensor:
721
+ """Dequantize NVFP4 back to float with two-level scaling."""
722
+ K, N = self.shape
723
+ device = self.packed.device
724
+
725
+ e2m1_grid = _E2M1_VALUES.to(device)
726
+
727
+ # Unpack nibbles
728
+ low = (self.packed & 0xF).long()
729
+ high = (self.packed >> 4).long()
730
+
731
+ # Decode E2M1 values
732
+ low_vals = e2m1_grid[low.flatten()].view(K // 2, N)
733
+ high_vals = e2m1_grid[high.flatten()].view(K // 2, N)
734
+
735
+ # Interleave
736
+ unpacked = torch.zeros(K, N, device=device, dtype=torch.float32)
737
+ unpacked[0::2] = low_vals
738
+ unpacked[1::2] = high_vals
739
+
740
+ # Two-level scale: E4M3 block scale * FP32 tensor scale
741
+ block_sf = _decode_e4m3(self.block_scales) # [K//16, N]
742
+ scale = block_sf * self.tensor_scale
743
+ unpacked = unpacked.view(K // 16, 16, N) * scale.unsqueeze(1)
744
+ return unpacked.view(K, N)
745
+
746
+ @property
747
+ def compression_ratio(self) -> float:
748
+ """Memory compression ratio vs FP16."""
749
+ K, N = self.shape
750
+ fp16_bytes = K * N * 2
751
+ nvfp4_bytes = self.packed.numel() + self.block_scales.numel()
752
+ if self.residual is not None:
753
+ nvfp4_bytes += self.residual.numel() # [K, N] uint8
754
+ if self.residual_scales is not None:
755
+ nvfp4_bytes += self.residual_scales.numel() * 4 # float32
756
+ return fp16_bytes / nvfp4_bytes
757
+
758
+
759
+ # Alias: FP4Weights now points to NVFP4 (the better format)
760
+ FP4Weights = NVFP4Weights
761
+
762
+
763
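+ # -----------------------------------------------------------------------------
+ # Illustrative sketch of the double-buff trade-off (hypothetical helper name):
+ # quantize once with the FP8 residual, then compare FP4-only reconstruction
+ # against FP4 + residual, which is what the fused residual kernel applies on
+ # the fly. Assumes K is a multiple of 16.
+ def _example_nvfp4_doublebuff() -> None:
+     K, N = 256, 512
+     w = torch.randn(K, N)
+     q = NVFP4Weights.from_float(w, use_quest=True, compute_residual=True)
+     err_fp4 = ((w - q.to_float()).norm() / w.norm()).item()
+     # Reconstruct the residual exactly as stored: decoded E4M3 values * per-block scales.
+     res = _decode_e4m3(q.residual).view(K // 16, 16, N) * q.residual_scales.unsqueeze(1)
+     err_buff = ((w - (q.to_float() + res.view(K, N))).norm() / w.norm()).item()
+     print(f"fp4 only {err_fp4:.4f} | fp4+fp8 residual {err_buff:.4f} | "
+           f"compression {q.compression_ratio:.2f}x")
+ 
+ 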
+ def mxfp4_gemm(
764
+ activations: torch.Tensor,
765
+ weights: MXFP4Weights,
766
+ bias: Optional[torch.Tensor] = None,
767
+ use_hadamard: bool = True,
768
+ ) -> torch.Tensor:
769
+ """
770
+ MXFP4 GEMM using the Quartet algorithm with fused dequant-matmul.
771
+
772
+ Implements the forward pass:
773
+ 1. Apply Hadamard transform for outlier mitigation
774
+ 2. Quantize activations with QuEST optimal clipping
775
+ 3. Fused dequant-matmul (weight tile dequantized in registers, never in global memory)
776
+
777
+ Two-tier dispatch:
778
+ - If native FP4 tensor cores are available (tl.dot_scaled, future SM fix):
779
+ use hardware FP4 MMA
780
+ - Otherwise: use fused dequant-matmul Triton kernel (our implementation)
781
+
782
+ Reference: "Quartet: Native FP4 Training Can Be Optimal for LLMs"
783
+ https://arxiv.org/html/2505.14669v1
784
+
785
+ Args:
786
+ activations: Input [M, K] in BF16/FP16
787
+ weights: MXFP4Weights with packed E2M1 values and E8M0 scales
788
+ bias: Optional bias [N]
789
+ use_hadamard: Apply Hadamard transform (recommended)
790
+
791
+ Returns:
792
+ Output [M, N] in BF16
793
+ """
794
+ M, K = activations.shape
795
+ K_w, N = weights.shape
796
+ assert K == K_w, f"K dimension mismatch: {K} vs {K_w}"
797
+ assert K % 32 == 0, f"K ({K}) must be multiple of 32 for MXFP4"
798
+
799
+ # Step 1: Hadamard transform on activations (outlier mitigation)
800
+ if use_hadamard and K >= 32:
801
+ x = activations.float().view(M, K // 32, 32)
802
+ x = _hadamard_transform_32(x)
803
+ x = x.view(M, K)
804
+ else:
805
+ x = activations.float()
806
+
807
+ # Step 2: Quantize activations to MXFP4 with QuEST, then dequant back
808
+ # (activations need to go through quantize->dequantize to simulate FP4 noise)
809
+ x_for_quant = x.T.contiguous() # [K, M]
810
+ x_quant = MXFP4Weights.from_float(x_for_quant, use_quest=True)
811
+ x_dequant = x_quant.to_float().T.contiguous() # [M, K]
812
+
813
+ # Step 3: Dispatch to fused kernel or native FP4
814
+ if not activations.is_cuda:
815
+ # CPU fallback: full dequant + torch.matmul
816
+ w_dequant = weights.to_float()
817
+ d = torch.matmul(x_dequant, w_dequant)
818
+ if bias is not None:
819
+ d = d + bias.float()
820
+ return d.to(torch.bfloat16)
821
+
822
+ if _can_use_native_fp4():
823
+ return _native_fp4_matmul(x_dequant, weights.to_native(), bias)
824
+ else:
825
+ return _fused_fp4_matmul(x_dequant, weights, bias)
826
+
827
+
828
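+ # -----------------------------------------------------------------------------
+ # Illustrative call pattern for the MXFP4 GEMM path (hypothetical helper name).
+ # Assumes a CUDA device so the fused Triton kernel is exercised rather than the
+ # CPU fallback; the legacy full-dequant path below serves as the reference.
+ def _example_mxfp4_gemm() -> None:
+     M, K, N = 64, 4096, 4096
+     x = torch.randn(M, K, device='cuda', dtype=torch.bfloat16)
+     w_q = quantize_to_mxfp4(torch.randn(K, N, device='cuda'))
+     y = mxfp4_gemm(x, w_q)             # fused dequant-matmul
+     y_ref = mxfp4_gemm_legacy(x, w_q)  # full dequant + torch.matmul
+     print((y.float() - y_ref.float()).abs().max().item())
+ 
+ 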
+ def mxfp4_gemm_legacy(
829
+ activations: torch.Tensor,
830
+ weights: MXFP4Weights,
831
+ bias: Optional[torch.Tensor] = None,
832
+ use_hadamard: bool = True,
833
+ ) -> torch.Tensor:
834
+ """
835
+ Legacy MXFP4 GEMM: full dequant to global memory + torch.matmul.
836
+
837
+ Kept for benchmarking comparison against the fused kernel.
838
+ """
839
+ M, K = activations.shape
840
+ K_w, N = weights.shape
841
+ assert K == K_w, f"K dimension mismatch: {K} vs {K_w}"
842
+ assert K % 32 == 0, f"K ({K}) must be multiple of 32 for MXFP4"
843
+
844
+ if use_hadamard and K >= 32:
845
+ x = activations.float().view(M, K // 32, 32)
846
+ x = _hadamard_transform_32(x)
847
+ x = x.view(M, K)
848
+ else:
849
+ x = activations.float()
850
+
851
+ x_for_quant = x.T.contiguous()
852
+ x_quant = MXFP4Weights.from_float(x_for_quant, use_quest=True)
853
+ x_dequant = x_quant.to_float().T.contiguous()
854
+ w_dequant = weights.to_float()
855
+ d = torch.matmul(x_dequant, w_dequant)
856
+
857
+ if bias is not None:
858
+ d = d + bias.float()
859
+
860
+ return d.to(torch.bfloat16)
861
+
862
+
863
+
864
+ def _hadamard_transform_32(x: torch.Tensor) -> torch.Tensor:
865
+ """
866
+ Fast Hadamard Transform on last dimension (size 32).
867
+
868
+ Applies orthonormal Hadamard rotation to spread outliers.
869
+ Builds the 32x32 orthonormal Hadamard matrix recursively and applies it as a matmul.
870
+ """
871
+ assert x.shape[-1] == 32
872
+
873
+ def hadamard_matrix(n):
874
+ if n == 1:
875
+ return torch.ones(1, 1, device=x.device, dtype=x.dtype)
876
+ h = hadamard_matrix(n // 2)
877
+ return torch.cat([
878
+ torch.cat([h, h], dim=1),
879
+ torch.cat([h, -h], dim=1),
880
+ ], dim=0) / (2 ** 0.5)
881
+
882
+ H = hadamard_matrix(32)
883
+ return x @ H
884
+
885
+
886
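+ # -----------------------------------------------------------------------------
+ # Illustrative check (hypothetical helper name) that the recursive construction
+ # above is orthonormal: the normalized Hadamard matrix is symmetric with
+ # H @ H == I, so applying the transform twice recovers the input.
+ def _example_hadamard_involution() -> None:
+     x = torch.randn(8, 4, 32)
+     assert torch.allclose(_hadamard_transform_32(_hadamard_transform_32(x)), x, atol=1e-5)
+ 
+ 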
+ # =============================================================================
887
+ # Arithmetic E2M1 Decoder (Triton JIT helper)
888
+ # =============================================================================
889
+ # Decode 4-bit E2M1 index -> float32 using pure register arithmetic.
890
+ # No LUT needed — bitfield extraction + tl.exp2() computes the value.
891
+ #
892
+ # E2M1 encoding (OCP Microscaling Spec v1.0):
893
+ # bit[3] = sign, bit[2:1] = exponent (2 bits), bit[0] = mantissa (1 bit)
894
+ # Subnormal (exp==0): value = mantissa * 0.5 -> {0.0, 0.5}
895
+ # Normal (exp>0): value = (2 + mantissa) * 2^(exp - 2)
896
+ # Values: 0, 0.5, 1, 1.5, 2, 3, 4, 6, -0, -0.5, -1, -1.5, -2, -3, -4, -6
897
+
898
+ @triton.jit
899
+ def _e2m1_decode(idx):
900
+ """Decode 4-bit E2M1 index -> float32. Register-only, no LUT."""
901
+ sign = (idx >> 3) & 1
902
+ exp = (idx >> 1) & 3
903
+ mant = idx & 1
904
+ is_normal = exp > 0 # bool
905
+ subnormal_val = mant.to(tl.float32) * 0.5
906
+ normal_val = (2 + mant).to(tl.float32) * tl.exp2((exp - 2).to(tl.float32))
907
+ unsigned_val = tl.where(is_normal, normal_val, subnormal_val)
908
+ return tl.where(sign != 0, -unsigned_val, unsigned_val)
909
+
910
+
911
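+ # -----------------------------------------------------------------------------
+ # Illustrative pure-Python mirror of _e2m1_decode (hypothetical helper name),
+ # handy for checking the bitfield arithmetic against the value list above:
+ # [_example_e2m1_decode_ref(i) for i in range(8)] == [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
+ def _example_e2m1_decode_ref(idx: int) -> float:
+     sign, exp, mant = (idx >> 3) & 1, (idx >> 1) & 3, idx & 1
+     val = mant * 0.5 if exp == 0 else (2 + mant) * 2.0 ** (exp - 2)
+     return -val if sign else val
+ 
+ 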
+ # =============================================================================
912
+ # Fused FP4 Dequant-MatMul Triton Kernel (Tier 2)
913
+ # =============================================================================
914
+ # Instead of materializing the full dequantized weight matrix in global memory,
915
+ # this kernel loads packed FP4 tiles, dequantizes in registers via arithmetic
916
+ # E2M1 decode, applies E8M0 block scales, and feeds BF16 into tl.dot().
917
+ # The full dequantized matrix NEVER exists in global memory.
918
+ # ~16x less memory traffic on the weight side vs the old full-dequant path.
919
+
920
+ @triton.autotune(
921
+ configs=[
922
+ triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=3, num_warps=8),
923
+ triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 32}, num_stages=4, num_warps=8),
924
+ triton.Config({'BLOCK_M': 64, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=4, num_warps=4),
925
+ triton.Config({'BLOCK_M': 64, 'BLOCK_N': 64, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
926
+ ],
927
+ key=['M', 'N', 'K'],
928
+ )
929
+ @triton.jit
930
+ def _fused_fp4_dequant_matmul_kernel(
931
+ a_ptr, # [M, K] BF16 activations
932
+ w_packed_ptr, # [K//2, N] uint8 packed FP4 weights
933
+ w_scales_ptr, # [K//32, N] uint8 E8M0 scales
934
+ out_ptr, # [M, N] BF16 output
935
+ bias_ptr, # [N] optional bias
936
+ M, N, K,
937
+ stride_am, stride_ak,
938
+ stride_wk, stride_wn, # strides for packed [K//2, N]
939
+ stride_sk, stride_sn, # strides for scales [K//32, N]
940
+ stride_om, stride_on,
941
+ HAS_BIAS: tl.constexpr,
942
+ BLOCK_M: tl.constexpr,
943
+ BLOCK_N: tl.constexpr,
944
+ BLOCK_K: tl.constexpr,
945
+ ):
946
+ """
947
+ Fused dequant-matmul: loads packed FP4, dequantizes in registers, matmuls.
948
+
949
+ Inner loop per K-tile:
950
+ 1. Load A tile as even/odd K-column halves, [BLOCK_M, BLOCK_K//2] BF16 each
951
+ 2. Load packed weight tile [BLOCK_K//2, BLOCK_N] uint8
952
+ 3. Unpack nibbles: low = packed & 0xF, high = packed >> 4
953
+ 4. Arithmetic E2M1 decode via _e2m1_decode() — pure register ops, no LUT
954
+ 5. Load scale tile [BLOCK_K//32, BLOCK_N] uint8, compute 2^(s-127)
955
+ 6. Apply per-group scale to the even/odd weight halves (no interleave needed)
956
+ 7. acc += tl.dot(a_even, w_even) + tl.dot(a_odd, w_odd)
957
+ """
958
+ pid_m = tl.program_id(0)
959
+ pid_n = tl.program_id(1)
960
+
961
+ offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
962
+ offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
963
+
964
+ acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
965
+
966
+ HALF_BLOCK_K: tl.constexpr = BLOCK_K // 2
967
+ SCALES_PER_TILE: tl.constexpr = BLOCK_K // 32
968
+
969
+ for k_start in range(0, K, BLOCK_K):
970
+ # --- Load A as even/odd column halves ---
971
+ # Even columns (0, 2, 4, ...) correspond to low nibbles
972
+ # Odd columns (1, 3, 5, ...) correspond to high nibbles
973
+ # This avoids building a full [BLOCK_K, BLOCK_N] interleaved tile.
974
+ even_k_offs = k_start + tl.arange(0, HALF_BLOCK_K) * 2 # 0,2,4,...
975
+ odd_k_offs = k_start + tl.arange(0, HALF_BLOCK_K) * 2 + 1 # 1,3,5,...
976
+
977
+ a_even_ptrs = a_ptr + offs_m[:, None] * stride_am + even_k_offs[None, :] * stride_ak
978
+ a_odd_ptrs = a_ptr + offs_m[:, None] * stride_am + odd_k_offs[None, :] * stride_ak
979
+ mask_a_even = (offs_m[:, None] < M) & (even_k_offs[None, :] < K)
980
+ mask_a_odd = (offs_m[:, None] < M) & (odd_k_offs[None, :] < K)
981
+ a_even = tl.load(a_even_ptrs, mask=mask_a_even, other=0.0) # [BLOCK_M, HALF_BLOCK_K]
982
+ a_odd = tl.load(a_odd_ptrs, mask=mask_a_odd, other=0.0) # [BLOCK_M, HALF_BLOCK_K]
983
+
984
+ # --- Load packed weight tile [HALF_BLOCK_K, BLOCK_N] uint8 ---
985
+ packed_row_start = k_start // 2
986
+ offs_packed_k = packed_row_start + tl.arange(0, HALF_BLOCK_K)
987
+ w_ptrs = w_packed_ptr + offs_packed_k[:, None] * stride_wk + offs_n[None, :] * stride_wn
988
+ mask_w = (offs_packed_k[:, None] < (K // 2)) & (offs_n[None, :] < N)
989
+ packed = tl.load(w_ptrs, mask=mask_w, other=0).to(tl.int32)
990
+
991
+ # --- Unpack nibbles + arithmetic E2M1 decode ---
992
+ low_f = _e2m1_decode(packed & 0xF) # [HALF_BLOCK_K, BLOCK_N] even rows
993
+ high_f = _e2m1_decode((packed >> 4) & 0xF) # [HALF_BLOCK_K, BLOCK_N] odd rows
994
+
995
+ # --- Load E8M0 scales and broadcast per 32-element group ---
996
+ # Each scale covers 32 original K rows = 16 packed rows.
997
+ scale_row_start = k_start // 32
998
+ offs_local_packed = tl.arange(0, HALF_BLOCK_K)
999
+ group_idx = offs_local_packed // 16 # which scale group each packed row belongs to
1000
+
1001
+ scale_broadcast = tl.zeros((HALF_BLOCK_K, BLOCK_N), dtype=tl.float32)
1002
+ for sg in tl.static_range(0, SCALES_PER_TILE):
1003
+ sg_row = scale_row_start + sg
1004
+ sg_ptrs = w_scales_ptr + sg_row * stride_sk + offs_n * stride_sn
1005
+ sg_load_mask = (sg_row < (K // 32)) & (offs_n < N)
1006
+ sg_raw = tl.load(sg_ptrs, mask=sg_load_mask, other=127).to(tl.float32)
1007
+ sg_val = tl.exp2(sg_raw - 127.0) # [BLOCK_N]
1008
+ sg_match = (group_idx == sg) # [HALF_BLOCK_K] bool
1009
+ scale_broadcast = tl.where(sg_match[:, None], sg_val[None, :], scale_broadcast)
1010
+
1011
+ # Apply scales
1012
+ w_even = (low_f * scale_broadcast).to(tl.bfloat16) # [HALF_BLOCK_K, BLOCK_N]
1013
+ w_odd = (high_f * scale_broadcast).to(tl.bfloat16) # [HALF_BLOCK_K, BLOCK_N]
1014
+
1015
+ # --- Two half-sized dot products instead of interleaved full tile ---
1016
+ # A @ W = A_even_cols @ W_even_rows + A_odd_cols @ W_odd_rows
1017
+ acc += tl.dot(a_even.to(tl.bfloat16), w_even)
1018
+ acc += tl.dot(a_odd.to(tl.bfloat16), w_odd)
1019
+
1020
+ # --- Bias ---
1021
+ if HAS_BIAS:
1022
+ bias_vals = tl.load(bias_ptr + offs_n, mask=offs_n < N, other=0.0).to(tl.float32)
1023
+ acc += bias_vals[None, :]
1024
+
1025
+ # --- Store ---
1026
+ out_ptrs = out_ptr + offs_m[:, None] * stride_om + offs_n[None, :] * stride_on
1027
+ mask_out = (offs_m[:, None] < M) & (offs_n[None, :] < N)
1028
+ tl.store(out_ptrs, acc.to(tl.bfloat16), mask=mask_out)
1029
+
1030
+
1031
+ # =============================================================================
1032
+ # Native FP4 dot_scaled Kernel (Tier 1, SM100+)
1033
+ # =============================================================================
1034
+ # Target path: tl.dot_scaled for hardware FP4 tensor core support (tcgen05.mma.mxf4),
1035
+ # following the Triton tutorial #10 pattern with TMA loads.
1036
+ # Currently falls back to BF16 MMA on SM120 (RTX 5090) due to Triton #7550.
1037
+ # This path activates only when runtime probe confirms real FP4 execution.
1038
+ #
1039
+ # Config: BLOCK_M=128, BLOCK_N=256, BLOCK_K=128, VEC_SIZE=32, stages=4
1040
+
1041
+ @triton.jit
1042
+ def _native_fp4_matmul_kernel(
1043
+ a_ptr, # [M, K] BF16 activations
1044
+ b_packed_ptr, # [N, K//2] uint8 packed FP4 (transposed)
1045
+ b_scales_ptr, # [N//128, K//32//4, 32, 4, 4] uint8 preshuffled E8M0
1046
+ out_ptr, # [M, N] BF16 output
1047
+ bias_ptr, # [N] optional
1048
+ M, N, K,
1049
+ stride_am, stride_ak,
1050
+ stride_bn, stride_bk, # strides for packed_t [N, K//2]
1051
+ stride_om, stride_on,
1052
+ HAS_BIAS: tl.constexpr,
1053
+ BLOCK_M: tl.constexpr,
1054
+ BLOCK_N: tl.constexpr,
1055
+ BLOCK_K: tl.constexpr,
1056
+ ):
1057
+ """
1058
+ Native-layout FP4 matmul kernel (SM100+ hardware path target).
1059
+
1060
+ Uses transposed packed weights and preshuffled 5D scale tensor
1061
+ matching MXScaleTensor layout for direct tensor core consumption.
1062
+ In this revision it decodes E2M1 in registers and issues BF16 tl.dot;
1063
+ once tl.dot_scaled lowers to real tcgen05.mma.mxf4, this becomes the native FP4 throughput path.
1064
+ """
1065
+ pid_m = tl.program_id(0)
1066
+ pid_n = tl.program_id(1)
1067
+
1068
+ offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
1069
+ offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
1070
+
1071
+ acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
1072
+
1073
+ HALF_BLOCK_K: tl.constexpr = BLOCK_K // 2
1074
+
1075
+ SCALES_PER_TILE: tl.constexpr = BLOCK_K // 32
1076
+
1077
+ for k_start in range(0, K, BLOCK_K):
1078
+ # --- Load A as even/odd column halves ---
1079
+ even_k_offs = k_start + tl.arange(0, HALF_BLOCK_K) * 2
1080
+ odd_k_offs = k_start + tl.arange(0, HALF_BLOCK_K) * 2 + 1
1081
+
1082
+ a_even_ptrs = a_ptr + offs_m[:, None] * stride_am + even_k_offs[None, :] * stride_ak
1083
+ a_odd_ptrs = a_ptr + offs_m[:, None] * stride_am + odd_k_offs[None, :] * stride_ak
1084
+ mask_a_even = (offs_m[:, None] < M) & (even_k_offs[None, :] < K)
1085
+ mask_a_odd = (offs_m[:, None] < M) & (odd_k_offs[None, :] < K)
1086
+ a_even = tl.load(a_even_ptrs, mask=mask_a_even, other=0.0)
1087
+ a_odd = tl.load(a_odd_ptrs, mask=mask_a_odd, other=0.0)
1088
+
1089
+ # --- Load packed B tile [BLOCK_N, HALF_BLOCK_K] from transposed layout ---
1090
+ packed_col_start = k_start // 2
1091
+ offs_pk = packed_col_start + tl.arange(0, HALF_BLOCK_K)
1092
+ b_ptrs = b_packed_ptr + offs_n[:, None] * stride_bn + offs_pk[None, :] * stride_bk
1093
+ mask_b = (offs_n[:, None] < N) & (offs_pk[None, :] < (K // 2))
1094
+ b_packed_tile = tl.load(b_ptrs, mask=mask_b, other=0).to(tl.int32)
1095
+
1096
+ # Unpack + decode
1097
+ low_f = _e2m1_decode(b_packed_tile & 0xF)
1098
+ high_f = _e2m1_decode((b_packed_tile >> 4) & 0xF)
1099
+
1100
+ # --- Load scales from 5D layout, broadcast per group ---
1101
+ scale_row_start = k_start // 32
1102
+ offs_local_pk = tl.arange(0, HALF_BLOCK_K)
1103
+ group_idx = offs_local_pk // 16
1104
+
1105
+ scale_broadcast = tl.zeros((BLOCK_N, HALF_BLOCK_K), dtype=tl.float32)
1106
+ for sg in tl.static_range(0, SCALES_PER_TILE):
1107
+ k_idx = scale_row_start + sg
1108
+ nb = offs_n // 128
1109
+ inner_n = offs_n % 128
1110
+ d4 = inner_n % 4
1111
+ d2 = (inner_n // 4) % 32
1112
+ kg = k_idx // 4
1113
+ d3 = k_idx % 4
1114
+ kg_total = (K // 32 + 3) // 4
1115
+ s_offset = (nb * kg_total * 32 * 4 * 4 +
1116
+ kg * 32 * 4 * 4 +
1117
+ d2 * 4 * 4 +
1118
+ d3 * 4 +
1119
+ d4)
1120
+ s_val_raw = tl.load(b_scales_ptr + s_offset, mask=offs_n < N, other=127).to(tl.float32)
1121
+ s_val = tl.exp2(s_val_raw - 127.0) # [BLOCK_N]
1122
+ sg_match = (group_idx == sg)
1123
+ scale_broadcast = tl.where(sg_match[None, :], s_val[:, None], scale_broadcast)
1124
+
1125
+ # Apply scales: [BLOCK_N, HALF_BLOCK_K]
1126
+ w_low = (low_f * scale_broadcast).to(tl.bfloat16)
1127
+ w_high = (high_f * scale_broadcast).to(tl.bfloat16)
1128
+
1129
+ # Transpose weight halves: [BLOCK_N, HALF_BLOCK_K] -> [HALF_BLOCK_K, BLOCK_N]
1130
+ w_low_t = tl.trans(w_low)
1131
+ w_high_t = tl.trans(w_high)
1132
+
1133
+ # Two half-sized dot products
1134
+ acc += tl.dot(a_even.to(tl.bfloat16), w_low_t)
1135
+ acc += tl.dot(a_odd.to(tl.bfloat16), w_high_t)
1136
+
1137
+ if HAS_BIAS:
1138
+ bias_vals = tl.load(bias_ptr + offs_n, mask=offs_n < N, other=0.0).to(tl.float32)
1139
+ acc += bias_vals[None, :]
1140
+
1141
+ out_ptrs = out_ptr + offs_m[:, None] * stride_om + offs_n[None, :] * stride_on
1142
+ mask_out = (offs_m[:, None] < M) & (offs_n[None, :] < N)
1143
+ tl.store(out_ptrs, acc.to(tl.bfloat16), mask=mask_out)
1144
+
1145
+
1146
+ # =============================================================================
1147
+ # E4M3 Decode (Triton JIT helper for NVFP4 kernel)
1148
+ # =============================================================================
1149
+
1150
+ @triton.jit
1151
+ def _decode_e4m3_triton(raw_uint8):
1152
+ """Decode E4M3 FP8 in Triton registers. No LUT, pure bitfield arithmetic."""
1153
+ sign = (raw_uint8 >> 7) & 1
1154
+ exp = (raw_uint8 >> 3) & 0xF
1155
+ mant = raw_uint8 & 0x7
1156
+ is_normal = exp > 0
1157
+ normal_val = (8 + mant).to(tl.float32) * tl.exp2((exp - 10).to(tl.float32))
1158
+ subnormal_val = mant.to(tl.float32) * tl.exp2(tl.full(mant.shape, -9.0, tl.float32))
1159
+ unsigned = tl.where(is_normal, normal_val, subnormal_val)
1160
+ return tl.where(sign != 0, -unsigned, unsigned)
1161
+
1162
+
1163
+ # =============================================================================
1164
+ # Fused NVFP4 Dequant-MatMul Triton Kernel
1165
+ # =============================================================================
1166
+ # NVFP4 variant: 16-element blocks with E4M3 scales + per-tensor FP32 scale.
1167
+ # Two-level hierarchical scaling for native Blackwell tensor core format.
1168
+ # Scale groups every 16 elements (8 packed rows) instead of 32.
1169
+
1170
+ @triton.autotune(
1171
+ configs=[
1172
+ # --- Blackwell 5090 prefill configs (high throughput, 170 SMs) ---
1173
+ triton.Config({'BLOCK_M': 256, 'BLOCK_N': 128, 'BLOCK_K': 128}, num_stages=5, num_warps=16),
1174
+ triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 256}, num_stages=7, num_warps=8),
1175
+ triton.Config({'BLOCK_M': 128, 'BLOCK_N': 256, 'BLOCK_K': 128}, num_stages=5, num_warps=16),
1176
+ triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 128}, num_stages=5, num_warps=8),
1177
+ triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=4, num_warps=8),
1178
+ triton.Config({'BLOCK_M': 64, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
1179
+ triton.Config({'BLOCK_M': 64, 'BLOCK_N': 64, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
1180
+ # --- Decode-optimized (small M, maximize N-parallelism across SMs) ---
1181
+ triton.Config({'BLOCK_M': 16, 'BLOCK_N': 256, 'BLOCK_K': 128}, num_stages=5, num_warps=8),
1182
+ triton.Config({'BLOCK_M': 16, 'BLOCK_N': 256, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
1183
+ triton.Config({'BLOCK_M': 16, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
1184
+ triton.Config({'BLOCK_M': 32, 'BLOCK_N': 256, 'BLOCK_K': 128}, num_stages=5, num_warps=8),
1185
+ triton.Config({'BLOCK_M': 32, 'BLOCK_N': 256, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
1186
+ ],
1187
+ key=['M', 'N', 'K'],
1188
+ )
1189
+ @triton.jit
1190
+ def _fused_nvfp4_dequant_matmul_kernel(
1191
+ a_ptr, # [M, K] BF16 activations
1192
+ w_packed_ptr, # [K//2, N] uint8 packed FP4 weights
1193
+ w_scales_ptr, # [K//16, N] uint8 E4M3 scales
1194
+ out_ptr, # [M, N] BF16 output
1195
+ bias_ptr, # [N] optional bias
1196
+ tensor_scale, # FP32 per-tensor global scale
1197
+ M, N, K,
1198
+ stride_am, stride_ak,
1199
+ stride_wk, stride_wn, # strides for packed [K//2, N]
1200
+ stride_sk, stride_sn, # strides for scales [K//16, N]
1201
+ stride_om, stride_on,
1202
+ HAS_BIAS: tl.constexpr,
1203
+ BLOCK_M: tl.constexpr,
1204
+ BLOCK_N: tl.constexpr,
1205
+ BLOCK_K: tl.constexpr,
1206
+ ):
1207
+ """
1208
+ Fused NVFP4 dequant-matmul: 16-element blocks, E4M3 scales, tensor scale.
1209
+
1210
+ Inner loop per K-tile:
1211
+ 1. Load A tile as even/odd column halves
1212
+ 2. Load packed weight tile, unpack nibbles
1213
+ 3. Arithmetic E2M1 decode via _e2m1_decode()
1214
+ 4. Load E4M3 scale tile [BLOCK_K//16, BLOCK_N], decode via _decode_e4m3_triton()
1215
+ 5. Apply two-level scale: decoded_e4m3 * tensor_scale
1216
+ 6. acc += tl.dot(a_half, w_half) for even and odd halves
1217
+ """
1218
+ pid_m = tl.program_id(0)
1219
+ pid_n = tl.program_id(1)
1220
+
1221
+ offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
1222
+ offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
1223
+
1224
+ acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
1225
+
1226
+ HALF_BLOCK_K: tl.constexpr = BLOCK_K // 2
1227
+ SCALES_PER_TILE: tl.constexpr = BLOCK_K // 16 # 16-element blocks (not 32)
1228
+
1229
+ for k_start in range(0, K, BLOCK_K):
1230
+ # --- Load A as even/odd column halves ---
1231
+ even_k_offs = k_start + tl.arange(0, HALF_BLOCK_K) * 2
1232
+ odd_k_offs = k_start + tl.arange(0, HALF_BLOCK_K) * 2 + 1
1233
+
1234
+ a_even_ptrs = a_ptr + offs_m[:, None] * stride_am + even_k_offs[None, :] * stride_ak
1235
+ a_odd_ptrs = a_ptr + offs_m[:, None] * stride_am + odd_k_offs[None, :] * stride_ak
1236
+ mask_a_even = (offs_m[:, None] < M) & (even_k_offs[None, :] < K)
1237
+ mask_a_odd = (offs_m[:, None] < M) & (odd_k_offs[None, :] < K)
1238
+ a_even = tl.load(a_even_ptrs, mask=mask_a_even, other=0.0)
1239
+ a_odd = tl.load(a_odd_ptrs, mask=mask_a_odd, other=0.0)
1240
+
1241
+ # --- Load packed weight tile [HALF_BLOCK_K, BLOCK_N] uint8 ---
1242
+ packed_row_start = k_start // 2
1243
+ offs_packed_k = packed_row_start + tl.arange(0, HALF_BLOCK_K)
1244
+ w_ptrs = w_packed_ptr + offs_packed_k[:, None] * stride_wk + offs_n[None, :] * stride_wn
1245
+ mask_w = (offs_packed_k[:, None] < (K // 2)) & (offs_n[None, :] < N)
1246
+ packed = tl.load(w_ptrs, mask=mask_w, other=0).to(tl.int32)
1247
+
1248
+ # --- Unpack nibbles + arithmetic E2M1 decode ---
1249
+ low_f = _e2m1_decode(packed & 0xF)
1250
+ high_f = _e2m1_decode((packed >> 4) & 0xF)
1251
+
1252
+ # --- Load E4M3 scales and broadcast per 16-element group ---
1253
+ # Each scale covers 16 original K rows = 8 packed rows.
1254
+ scale_row_start = k_start // 16
1255
+ offs_local_packed = tl.arange(0, HALF_BLOCK_K)
1256
+ group_idx = offs_local_packed // 8 # 8 packed rows per scale group
1257
+
1258
+ scale_broadcast = tl.zeros((HALF_BLOCK_K, BLOCK_N), dtype=tl.float32)
1259
+ for sg in tl.static_range(0, SCALES_PER_TILE):
1260
+ sg_row = scale_row_start + sg
1261
+ sg_ptrs = w_scales_ptr + sg_row * stride_sk + offs_n * stride_sn
1262
+ sg_load_mask = (sg_row < (K // 16)) & (offs_n < N)
1263
+ sg_raw = tl.load(sg_ptrs, mask=sg_load_mask, other=0).to(tl.int32)
1264
+ # Decode E4M3 and apply tensor_scale
1265
+ sg_val = _decode_e4m3_triton(sg_raw) * tensor_scale # [BLOCK_N]
1266
+ sg_match = (group_idx == sg)
1267
+ scale_broadcast = tl.where(sg_match[:, None], sg_val[None, :], scale_broadcast)
1268
+
1269
+ # Apply scales
1270
+ w_even = (low_f * scale_broadcast).to(tl.bfloat16)
1271
+ w_odd = (high_f * scale_broadcast).to(tl.bfloat16)
1272
+
1273
+ # Two half-sized dot products
1274
+ acc += tl.dot(a_even.to(tl.bfloat16), w_even)
1275
+ acc += tl.dot(a_odd.to(tl.bfloat16), w_odd)
1276
+
1277
+ # --- Bias ---
1278
+ if HAS_BIAS:
1279
+ bias_vals = tl.load(bias_ptr + offs_n, mask=offs_n < N, other=0.0).to(tl.float32)
1280
+ acc += bias_vals[None, :]
1281
+
1282
+ # --- Store ---
1283
+ out_ptrs = out_ptr + offs_m[:, None] * stride_om + offs_n[None, :] * stride_on
1284
+ mask_out = (offs_m[:, None] < M) & (offs_n[None, :] < N)
1285
+ tl.store(out_ptrs, acc.to(tl.bfloat16), mask=mask_out)
1286
+
1287
+
1288
+ # =============================================================================
1289
+ # Fused NVFP4 + FP8 Residual Dequant-MatMul Triton Kernel ("Double Buff")
1290
+ # =============================================================================
1291
+ # Same structure as _fused_nvfp4_dequant_matmul_kernel, but each K-tile also
1292
+ # loads the FP8 E4M3 residual and its per-block scales, decodes, and adds
1293
+ # a third tl.dot for the residual correction. Three dots per tile:
1294
+ # acc += dot(a_even, w_fp4_even) + dot(a_odd, w_fp4_odd) + dot(a_full, w_residual)
1295
+
1296
+ @triton.autotune(
1297
+ configs=[
1298
+ # --- Blackwell 5090 prefill configs ---
1299
+ triton.Config({'BLOCK_M': 256, 'BLOCK_N': 128, 'BLOCK_K': 128}, num_stages=5, num_warps=16),
1300
+ triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 128}, num_stages=5, num_warps=8),
1301
+ triton.Config({'BLOCK_M': 128, 'BLOCK_N': 256, 'BLOCK_K': 128}, num_stages=5, num_warps=16),
1302
+ triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=4, num_warps=8),
1303
+ triton.Config({'BLOCK_M': 64, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
1304
+ triton.Config({'BLOCK_M': 64, 'BLOCK_N': 64, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
1305
+ # --- Decode-optimized ---
1306
+ triton.Config({'BLOCK_M': 16, 'BLOCK_N': 256, 'BLOCK_K': 128}, num_stages=5, num_warps=8),
1307
+ triton.Config({'BLOCK_M': 16, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
1308
+ triton.Config({'BLOCK_M': 32, 'BLOCK_N': 256, 'BLOCK_K': 128}, num_stages=5, num_warps=8),
1309
+ triton.Config({'BLOCK_M': 32, 'BLOCK_N': 256, 'BLOCK_K': 64}, num_stages=5, num_warps=4),
1310
+ ],
1311
+ key=['M', 'N', 'K'],
1312
+ )
1313
+ @triton.jit
1314
+ def _fused_nvfp4_residual_matmul_kernel(
1315
+ a_ptr, # [M, K] BF16 activations
1316
+ w_packed_ptr, # [K//2, N] uint8 packed FP4 weights
1317
+ w_scales_ptr, # [K//16, N] uint8 E4M3 scales
1318
+ res_ptr, # [K, N] uint8 E4M3 residual
1319
+ res_scales_ptr, # [K//16, N] float32 residual scales
1320
+ out_ptr, # [M, N] BF16 output
1321
+ bias_ptr, # [N] optional bias
1322
+ tensor_scale, # FP32 per-tensor global scale
1323
+ M, N, K,
1324
+ stride_am, stride_ak,
1325
+ stride_wk, stride_wn, # strides for packed [K//2, N]
1326
+ stride_sk, stride_sn, # strides for scales [K//16, N]
1327
+ stride_rk, stride_rn, # strides for residual [K, N]
1328
+ stride_rsk, stride_rsn, # strides for residual_scales [K//16, N]
1329
+ stride_om, stride_on,
1330
+ HAS_BIAS: tl.constexpr,
1331
+ BLOCK_M: tl.constexpr,
1332
+ BLOCK_N: tl.constexpr,
1333
+ BLOCK_K: tl.constexpr,
1334
+ ):
1335
+ """
1336
+ Fused NVFP4 + FP8 residual dequant-matmul (double-buff).
1337
+
1338
+ Per K-tile:
1339
+ 1. FP4 path: unpack nibbles, decode E2M1, apply two-level scale (same as base kernel)
1340
+ 2. FP8 residual path: load E4M3 residual, decode, apply per-block residual_scales
1341
+ 3. Three dots: a_even * w_fp4_even + a_odd * w_fp4_odd + a_full * w_residual
1342
+ """
1343
+ pid_m = tl.program_id(0)
1344
+ pid_n = tl.program_id(1)
1345
+
1346
+ offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
1347
+ offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
1348
+
1349
+ acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
1350
+
1351
+ HALF_BLOCK_K: tl.constexpr = BLOCK_K // 2
1352
+ SCALES_PER_TILE: tl.constexpr = BLOCK_K // 16
1353
+
1354
+ for k_start in range(0, K, BLOCK_K):
1355
+ # ===== FP4 path (identical to base kernel) =====
1356
+ # Load A as even/odd column halves
1357
+ even_k_offs = k_start + tl.arange(0, HALF_BLOCK_K) * 2
1358
+ odd_k_offs = k_start + tl.arange(0, HALF_BLOCK_K) * 2 + 1
1359
+
1360
+ a_even_ptrs = a_ptr + offs_m[:, None] * stride_am + even_k_offs[None, :] * stride_ak
1361
+ a_odd_ptrs = a_ptr + offs_m[:, None] * stride_am + odd_k_offs[None, :] * stride_ak
1362
+ mask_a_even = (offs_m[:, None] < M) & (even_k_offs[None, :] < K)
1363
+ mask_a_odd = (offs_m[:, None] < M) & (odd_k_offs[None, :] < K)
1364
+ a_even = tl.load(a_even_ptrs, mask=mask_a_even, other=0.0)
1365
+ a_odd = tl.load(a_odd_ptrs, mask=mask_a_odd, other=0.0)
1366
+
1367
+ # Load packed weight tile [HALF_BLOCK_K, BLOCK_N] uint8
1368
+ packed_row_start = k_start // 2
1369
+ offs_packed_k = packed_row_start + tl.arange(0, HALF_BLOCK_K)
1370
+ w_ptrs = w_packed_ptr + offs_packed_k[:, None] * stride_wk + offs_n[None, :] * stride_wn
1371
+ mask_w = (offs_packed_k[:, None] < (K // 2)) & (offs_n[None, :] < N)
1372
+ packed = tl.load(w_ptrs, mask=mask_w, other=0).to(tl.int32)
1373
+
1374
+ # Unpack nibbles + arithmetic E2M1 decode
1375
+ low_f = _e2m1_decode(packed & 0xF)
1376
+ high_f = _e2m1_decode((packed >> 4) & 0xF)
1377
+
1378
+ # Load E4M3 scales and broadcast per 16-element group
1379
+ scale_row_start = k_start // 16
1380
+ offs_local_packed = tl.arange(0, HALF_BLOCK_K)
1381
+ group_idx = offs_local_packed // 8
1382
+
1383
+ scale_broadcast = tl.zeros((HALF_BLOCK_K, BLOCK_N), dtype=tl.float32)
1384
+ for sg in tl.static_range(0, SCALES_PER_TILE):
1385
+ sg_row = scale_row_start + sg
1386
+ sg_ptrs = w_scales_ptr + sg_row * stride_sk + offs_n * stride_sn
1387
+ sg_load_mask = (sg_row < (K // 16)) & (offs_n < N)
1388
+ sg_raw = tl.load(sg_ptrs, mask=sg_load_mask, other=0).to(tl.int32)
1389
+ sg_val = _decode_e4m3_triton(sg_raw) * tensor_scale
1390
+ sg_match = (group_idx == sg)
1391
+ scale_broadcast = tl.where(sg_match[:, None], sg_val[None, :], scale_broadcast)
1392
+
1393
+ # Apply FP4 scales and accumulate
1394
+ w_even = (low_f * scale_broadcast).to(tl.bfloat16)
1395
+ w_odd = (high_f * scale_broadcast).to(tl.bfloat16)
1396
+
1397
+ acc += tl.dot(a_even.to(tl.bfloat16), w_even)
1398
+ acc += tl.dot(a_odd.to(tl.bfloat16), w_odd)
1399
+
1400
+ # ===== FP8 residual correction path =====
1401
+ # Load full contiguous activation tile [BLOCK_M, BLOCK_K]
1402
+ full_k_offs = k_start + tl.arange(0, BLOCK_K)
1403
+ a_full_ptrs = a_ptr + offs_m[:, None] * stride_am + full_k_offs[None, :] * stride_ak
1404
+ mask_a_full = (offs_m[:, None] < M) & (full_k_offs[None, :] < K)
1405
+ a_full = tl.load(a_full_ptrs, mask=mask_a_full, other=0.0)
1406
+
1407
+ # Load residual [BLOCK_K, BLOCK_N] uint8 E4M3
1408
+ res_k_offs = k_start + tl.arange(0, BLOCK_K)
1409
+ res_ptrs = res_ptr + res_k_offs[:, None] * stride_rk + offs_n[None, :] * stride_rn
1410
+ mask_res = (res_k_offs[:, None] < K) & (offs_n[None, :] < N)
1411
+ res_raw = tl.load(res_ptrs, mask=mask_res, other=0).to(tl.int32)
1412
+ res_decoded = _decode_e4m3_triton(res_raw) # [BLOCK_K, BLOCK_N] float32
1413
+
1414
+ # Load residual per-block scales [SCALES_PER_TILE, BLOCK_N] float32
1415
+ # and broadcast to [BLOCK_K, BLOCK_N]
1416
+ offs_full_k = tl.arange(0, BLOCK_K)
1417
+ res_group_idx = offs_full_k // 16 # 16 elements per scale group
1418
+
1419
+ res_scale_broadcast = tl.zeros((BLOCK_K, BLOCK_N), dtype=tl.float32)
1420
+ for rsg in tl.static_range(0, SCALES_PER_TILE):
1421
+ rsg_row = scale_row_start + rsg
1422
+ rsg_ptrs = res_scales_ptr + rsg_row * stride_rsk + offs_n * stride_rsn
1423
+ rsg_load_mask = (rsg_row < (K // 16)) & (offs_n < N)
1424
+ rsg_val = tl.load(rsg_ptrs, mask=rsg_load_mask, other=0.0) # [BLOCK_N] float32
1425
+ rsg_match = (res_group_idx == rsg)
1426
+ res_scale_broadcast = tl.where(rsg_match[:, None], rsg_val[None, :], res_scale_broadcast)
1427
+
1428
+ # Apply residual scales and accumulate
1429
+ res_scaled = (res_decoded * res_scale_broadcast).to(tl.bfloat16)
1430
+ acc += tl.dot(a_full.to(tl.bfloat16), res_scaled)
1431
+
1432
+ # --- Bias ---
1433
+ if HAS_BIAS:
1434
+ bias_vals = tl.load(bias_ptr + offs_n, mask=offs_n < N, other=0.0).to(tl.float32)
1435
+ acc += bias_vals[None, :]
1436
+
1437
+ # --- Store ---
1438
+ out_ptrs = out_ptr + offs_m[:, None] * stride_om + offs_n[None, :] * stride_on
1439
+ mask_out = (offs_m[:, None] < M) & (offs_n[None, :] < N)
1440
+ tl.store(out_ptrs, acc.to(tl.bfloat16), mask=mask_out)
1441
+
1442
+
1443
+ # =============================================================================
1444
+ # Native FP4 capability probe (cached)
1445
+ # =============================================================================
1446
+
1447
+ _native_fp4_probe_result: Optional[bool] = None
1448
+
1449
+
1450
+ def _can_use_native_fp4() -> bool:
1451
+ """
1452
+ One-time probe to determine if tl.dot_scaled produces real FP4 results.
1453
+
1454
+ Checks:
1455
+ 1. CUDA available with SM >= 10.0 (Blackwell+)
1456
+ 2. tl.dot_scaled API exists in current Triton
1457
+ 3. Small test matmul via our native kernel produces results that
1458
+ differ from what pure BF16 dequant+matmul would give.
1459
+ If they match exactly, Triton is falling back to BF16 MMA
1460
+ (Triton #7550) and the native path offers no benefit.
1461
+
1462
+ Result is cached in module global _native_fp4_probe_result.
1463
+ """
1464
+ global _native_fp4_probe_result
1465
+ if _native_fp4_probe_result is not None:
1466
+ return _native_fp4_probe_result
1467
+
1468
+ _native_fp4_probe_result = False
1469
+
1470
+ if not torch.cuda.is_available():
1471
+ return False
1472
+
1473
+ # SM >= 10.0 required (Blackwell architecture)
1474
+ major, _ = torch.cuda.get_device_capability()
1475
+ if major < 10:
1476
+ return False
1477
+
1478
+ # Check Triton API availability
1479
+ if not hasattr(tl, 'dot_scaled'):
1480
+ return False
1481
+
1482
+ # Runtime correctness probe: run a small matmul and compare
1483
+ # native kernel output vs BF16 reference
1484
+ try:
1485
+ test_m, test_n, test_k = 32, 32, 64
1486
+ a_test = torch.randn(test_m, test_k, device='cuda', dtype=torch.bfloat16)
1487
+ w_test = torch.randn(test_k, test_n, device='cuda', dtype=torch.float32)
1488
+ w_quant = MXFP4Weights.from_float(w_test, use_quest=False)
1489
+ w_deq = w_quant.to_float()
1490
+
1491
+ # BF16 reference (what fallback would give)
1492
+ ref_bf16 = torch.matmul(a_test.float(), w_deq).bfloat16()
1493
+
1494
+ # Run our native kernel path
1495
+ native_w = w_quant.to_native()
1496
+ native_out = _native_fp4_matmul(a_test, native_w, bias=None)
1497
+
1498
+ # If native output matches BF16 reference EXACTLY (all elements equal),
1499
+ # Triton is silently falling back to BF16 MMA — no benefit.
1500
+ # Real FP4 tensor cores produce different rounding patterns.
1501
+ if torch.equal(native_out, ref_bf16):
1502
+ _native_fp4_probe_result = False
1503
+ else:
1504
+ # Verify native output is reasonable (not garbage)
1505
+ rel_err = (native_out.float() - ref_bf16.float()).abs().mean() / ref_bf16.float().abs().mean()
1506
+ _native_fp4_probe_result = rel_err.item() < 0.1
1507
+ except Exception:
1508
+ _native_fp4_probe_result = False
1509
+
1510
+ return _native_fp4_probe_result
1511
+
1512
+
1513
+ # =============================================================================
1514
+ # Fused FP4 matmul wrapper (internal)
1515
+ # =============================================================================
1516
+
1517
+ def _fused_fp4_matmul(
1518
+ activations: torch.Tensor,
1519
+ weights: MXFP4Weights,
1520
+ bias: Optional[torch.Tensor] = None,
1521
+ ) -> torch.Tensor:
1522
+ """
1523
+ Fused dequant-matmul via Triton kernel.
1524
+
1525
+ The full dequantized weight matrix never exists in global memory —
1526
+ each tile is unpacked from uint8, decoded arithmetically from E2M1,
1527
+ scaled by E8M0, and fed directly into tl.dot().
1528
+ """
1529
+ M, K = activations.shape
1530
+ _, N = weights.shape
1531
+
1532
+ # Ensure inputs are contiguous and on CUDA
1533
+ a = activations.contiguous()
1534
+ if a.dtype != torch.bfloat16:
1535
+ a = a.to(torch.bfloat16)
1536
+
1537
+ w_packed = weights.packed.contiguous()
1538
+ w_scales = weights.scales.contiguous()
1539
+
1540
+ out = torch.empty(M, N, device=a.device, dtype=torch.bfloat16)
1541
+
1542
+ # Bias setup
1543
+ has_bias = bias is not None
1544
+ if has_bias:
1545
+ bias = bias.contiguous().float()
1546
+ else:
1547
+ bias = torch.empty(0, device=a.device, dtype=torch.float32)
1548
+
1549
+ grid = lambda META: (
1550
+ triton.cdiv(M, META['BLOCK_M']),
1551
+ triton.cdiv(N, META['BLOCK_N']),
1552
+ )
1553
+
1554
+ _fused_fp4_dequant_matmul_kernel[grid](
1555
+ a, w_packed, w_scales, out, bias,
1556
+ M, N, K,
1557
+ a.stride(0), a.stride(1),
1558
+ w_packed.stride(0), w_packed.stride(1),
1559
+ w_scales.stride(0), w_scales.stride(1),
1560
+ out.stride(0), out.stride(1),
1561
+ HAS_BIAS=has_bias,
1562
+ )
1563
+
1564
+ return out
1565
+
1566
+
1567
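+ # -----------------------------------------------------------------------------
+ # Illustrative correctness sketch for the fused wrapper (hypothetical helper
+ # name): compare against a plain dequant + torch.matmul reference. Assumes a
+ # CUDA device; the comparison uses cosine similarity because the reference
+ # accumulates in FP32 while the kernel stores BF16.
+ def _example_fused_fp4_check() -> None:
+     M, K, N = 128, 1024, 1024
+     a = torch.randn(M, K, device='cuda', dtype=torch.bfloat16)
+     w_q = MXFP4Weights.from_float(torch.randn(K, N, device='cuda'))
+     fused = _fused_fp4_matmul(a, w_q)
+     ref = (a.float() @ w_q.to_float()).to(torch.bfloat16)
+     cos = torch.nn.functional.cosine_similarity(
+         fused.float().flatten(), ref.float().flatten(), dim=0)
+     print(f"cos_sim = {cos.item():.5f}")
+ 
+ 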
+ # =============================================================================
1568
+ # Native FP4 matmul wrapper (internal, future path)
1569
+ # =============================================================================
1570
+
1571
+ def _native_fp4_matmul(
1572
+ activations: torch.Tensor,
1573
+ weights: 'NativeMXFP4',
1574
+ bias: Optional[torch.Tensor] = None,
1575
+ ) -> torch.Tensor:
1576
+ """
1577
+ Native FP4 matmul using the Tier 1 kernel with transposed/preshuffled layout.
1578
+
1579
+ Args:
1580
+ activations: [M, K] BF16 tensor
1581
+ weights: NativeMXFP4 with packed_t and scales_5d
1582
+ bias: Optional [N] bias
1583
+ """
1584
+ M, K = activations.shape
1585
+ K_w, N = weights.shape
1586
+
1587
+ a = activations.contiguous()
1588
+ if a.dtype != torch.bfloat16:
1589
+ a = a.to(torch.bfloat16)
1590
+
1591
+ packed_t = weights.packed_t.contiguous()
1592
+ scales_5d = weights.scales_5d.contiguous()
1593
+
1594
+ out = torch.empty(M, N, device=a.device, dtype=torch.bfloat16)
1595
+
1596
+ has_bias = bias is not None
1597
+ if has_bias:
1598
+ bias = bias.contiguous().float()
1599
+ else:
1600
+ bias = torch.empty(0, device=a.device, dtype=torch.float32)
1601
+
1602
+ # Fixed launch configuration for the Tier 1 kernel
1603
+ BLOCK_M = 128
1604
+ BLOCK_N = 128
1605
+ BLOCK_K = 128
1606
+
1607
+ grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
1608
+
1609
+ _native_fp4_matmul_kernel[grid](
1610
+ a, packed_t, scales_5d, out, bias,
1611
+ M, N, K,
1612
+ a.stride(0), a.stride(1),
1613
+ packed_t.stride(0), packed_t.stride(1),
1614
+ out.stride(0), out.stride(1),
1615
+ HAS_BIAS=has_bias,
1616
+ BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K,
1617
+ )
1618
+
1619
+ return out
1620
+
1621
+
1622
+ def quantize_to_mxfp4(weights: torch.Tensor, use_quest: bool = True) -> MXFP4Weights:
1623
+ """
1624
+ Quantize weights to MXFP4 format.
1625
+
1626
+ Args:
1627
+ weights: Input tensor [K, N]
1628
+ use_quest: Use QuEST optimal clipping
1629
+
1630
+ Returns:
1631
+ MXFP4Weights ready for native GEMM
1632
+ """
1633
+ return MXFP4Weights.from_float(weights, use_quest=use_quest)
1634
+
1635
+
1636
+ # =============================================================================
1637
+ # Fused NVFP4 matmul wrapper (internal)
1638
+ # =============================================================================
1639
+
1640
+ def _fused_nvfp4_matmul(
1641
+ activations: torch.Tensor,
1642
+ weights: 'NVFP4Weights',
1643
+ bias: Optional[torch.Tensor] = None,
1644
+ ) -> torch.Tensor:
1645
+ """
1646
+ Fused NVFP4 dequant-matmul via Triton kernel.
1647
+
1648
+ Uses 16-element blocks with E4M3 scales and per-tensor FP32 scale.
1649
+ The full dequantized weight matrix never exists in global memory.
1650
+
1651
+ Automatically dispatches to the double-buff (FP4+FP8 residual) kernel
1652
+ when weights have residual data, for near-FP16 accuracy.
1653
+ """
1654
+ # Dispatch to residual kernel when weights have FP8 correction data
1655
+ if weights.residual is not None and weights.residual_scales is not None:
1656
+ return _fused_nvfp4_residual_matmul(activations, weights, bias)
1657
+
1658
+ M, K = activations.shape
1659
+ _, N = weights.shape
1660
+
1661
+ a = activations.contiguous()
1662
+ if a.dtype != torch.bfloat16:
1663
+ a = a.to(torch.bfloat16)
1664
+
1665
+ w_packed = weights.packed.contiguous()
1666
+ w_scales = weights.block_scales.contiguous()
1667
+
1668
+ out = torch.empty(M, N, device=a.device, dtype=torch.bfloat16)
1669
+
1670
+ has_bias = bias is not None
1671
+ if has_bias:
1672
+ bias = bias.contiguous().float()
1673
+ else:
1674
+ bias = torch.empty(0, device=a.device, dtype=torch.float32)
1675
+
1676
+ grid = lambda META: (
1677
+ triton.cdiv(M, META['BLOCK_M']),
1678
+ triton.cdiv(N, META['BLOCK_N']),
1679
+ )
1680
+
1681
+ _fused_nvfp4_dequant_matmul_kernel[grid](
1682
+ a, w_packed, w_scales, out, bias,
1683
+ weights.tensor_scale,
1684
+ M, N, K,
1685
+ a.stride(0), a.stride(1),
1686
+ w_packed.stride(0), w_packed.stride(1),
1687
+ w_scales.stride(0), w_scales.stride(1),
1688
+ out.stride(0), out.stride(1),
1689
+ HAS_BIAS=has_bias,
1690
+ )
1691
+
1692
+ return out
1693
+
1694
+
1695
+ def _fused_nvfp4_residual_matmul(
1696
+ activations: torch.Tensor,
1697
+ weights: 'NVFP4Weights',
1698
+ bias: Optional[torch.Tensor] = None,
1699
+ ) -> torch.Tensor:
1700
+ """
1701
+ Fused NVFP4 + FP8 residual dequant-matmul (double-buff).
1702
+
1703
+ Same as _fused_nvfp4_matmul but passes FP8 residual and per-block
1704
+ residual_scales to the residual kernel for near-FP16 accuracy.
1705
+ Requires weights.residual and weights.residual_scales to be set.
1706
+ """
1707
+ M, K = activations.shape
1708
+ _, N = weights.shape
1709
+
1710
+ a = activations.contiguous()
1711
+ if a.dtype != torch.bfloat16:
1712
+ a = a.to(torch.bfloat16)
1713
+
1714
+ w_packed = weights.packed.contiguous()
1715
+ w_scales = weights.block_scales.contiguous()
1716
+ res = weights.residual.contiguous()
1717
+ res_scales = weights.residual_scales.contiguous()
1718
+
1719
+ out = torch.empty(M, N, device=a.device, dtype=torch.bfloat16)
1720
+
1721
+ has_bias = bias is not None
1722
+ if has_bias:
1723
+ bias = bias.contiguous().float()
1724
+ else:
1725
+ bias = torch.empty(0, device=a.device, dtype=torch.float32)
1726
+
1727
+ grid = lambda META: (
1728
+ triton.cdiv(M, META['BLOCK_M']),
1729
+ triton.cdiv(N, META['BLOCK_N']),
1730
+ )
1731
+
1732
+ _fused_nvfp4_residual_matmul_kernel[grid](
1733
+ a, w_packed, w_scales, res, res_scales, out, bias,
1734
+ weights.tensor_scale,
1735
+ M, N, K,
1736
+ a.stride(0), a.stride(1),
1737
+ w_packed.stride(0), w_packed.stride(1),
1738
+ w_scales.stride(0), w_scales.stride(1),
1739
+ res.stride(0), res.stride(1),
1740
+ res_scales.stride(0), res_scales.stride(1),
1741
+ out.stride(0), out.stride(1),
1742
+ HAS_BIAS=has_bias,
1743
+ )
1744
+
1745
+ return out
1746
+
1747
+
1748
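+ # -----------------------------------------------------------------------------
+ # Illustrative dispatch sketch (hypothetical helper name): the same wrapper call
+ # runs either the base NVFP4 kernel or the double-buff residual kernel depending
+ # only on whether the weights carry FP8 residual data. Assumes a CUDA device.
+ def _example_nvfp4_dispatch() -> None:
+     M, K, N = 64, 2048, 2048
+     a = torch.randn(M, K, device='cuda', dtype=torch.bfloat16)
+     w = torch.randn(K, N, device='cuda')
+     base = NVFP4Weights.from_float(w)                           # FP4 only
+     buffed = NVFP4Weights.from_float(w, compute_residual=True)  # FP4 + FP8 residual
+     ref = (a.float() @ w).to(torch.bfloat16)
+     for name, w_q in (("base", base), ("double-buff", buffed)):
+         y = _fused_nvfp4_matmul(a, w_q)   # routes to the residual kernel when residual data is present
+         err = ((y.float() - ref.float()).norm() / ref.float().norm()).item()
+         print(f"{name}: rel err {err:.4f}")
+ 
+ 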
+ # =============================================================================
1749
+ # PyTorch _scaled_mm FP4 Probe (Native Tensor Core Path)
1750
+ # =============================================================================
1751
+
1752
+ _scaled_mm_fp4_probe_result: Optional[bool] = None
1753
+
1754
+
1755
+ def _can_use_scaled_mm_fp4() -> bool:
1756
+ """
1757
+ Probe for PyTorch native FP4 scaled matmul (cuBLAS NVFP4 path).
1758
+
1759
+ Uses 1x16 blockwise scaling: FP4 packed as uint8.view(float4_e2m1fn_x2),
1760
+ E4M3 flat scale tensors with ceil(rows/128)*128 * max(K/16, 4) elements.
1761
+
1762
+ DISABLED: cuBLAS 1x16 blockwise FP4 has correctness issues with non-128-aligned
1763
+ dimensions (cos_sim drops to 0.30-0.50 for M=1 decode). The Triton fused
1764
+ dequant kernel achieves cos_sim=0.999+ for all shapes. Re-enable when PyTorch
1765
+ exposes a proper NVFP4 GEMM API with 2D scale tensors + SwizzleType support.
1766
+ """
1767
+ return False
1768
+
1769
+
1770
+ def _scaled_mm_fp4(
1771
+ activations: torch.Tensor,
1772
+ weights: 'NVFP4Weights',
1773
+ bias: Optional[torch.Tensor] = None,
1774
+ ) -> torch.Tensor:
1775
+ """
1776
+ Native cuBLAS NVFP4 matmul via torch._scaled_mm.
1777
+
1778
+ Activations: BF16 [M, K] — quantized to FP4 on-the-fly.
1779
+ Weights: NVFP4Weights with packed [K//2, N] uint8, block_scales [K//16, N] E4M3.
1780
+
1781
+ Uses 1x16 blockwise scaling on Blackwell 5th-gen tensor cores.
1782
+ Scale layout: flat 1D, ceil(rows/128)*128 * max(K//16, 4) elements.
1783
+ Output is multiplied by both tensor_scales (activation + weight).
1784
+ """
1785
+ import math
1786
+ M, K = activations.shape
1787
+ K_w, N = weights.shape
1788
+
1789
+ # --- Quantize activations to FP4 ---
1790
+ act_q = NVFP4Weights.from_float(activations.T.contiguous().float(), use_quest=True)
1791
+ a_packed = act_q.packed.T.contiguous() # [M, K//2]
1792
+ a_fp4 = a_packed.view(torch.float4_e2m1fn_x2)
1793
+
1794
+ b_packed = weights.packed.T.contiguous() # [N, K//2]
1795
+ b_fp4 = b_packed.view(torch.float4_e2m1fn_x2)
1796
+
1797
+ # --- Build flat scale tensors (1x16 blockwise, padded) ---
1798
+ # cuBLAS requires minimum 4 scale groups per row along K
1799
+ k_groups = max(K // 16, 4)
1800
+
1801
+ # scale_a: [M, K//16] -> pad rows to 128, pad K groups to min 4
1802
+ sa_2d = act_q.block_scales.T.contiguous().view(torch.float8_e4m3fn) # [M, K//16]
1803
+ sa_padded_rows = math.ceil(M / 128) * 128
1804
+ # Pad K dimension if needed (fill with 1.0 = 0x3C in E4M3)
1805
+ if k_groups > K // 16:
1806
+ k_pad = torch.full((sa_2d.shape[0], k_groups - K // 16), 0x3C,
1807
+ dtype=torch.uint8, device=sa_2d.device).view(torch.float8_e4m3fn)
1808
+ sa_2d = torch.cat([sa_2d, k_pad], dim=1)
1809
+ if sa_padded_rows > M:
1810
+ row_pad = torch.full((sa_padded_rows - M, k_groups), 0x3C,
1811
+ dtype=torch.uint8, device=sa_2d.device).view(torch.float8_e4m3fn)
1812
+ sa_2d = torch.cat([sa_2d, row_pad], dim=0)
1813
+ sa_flat = sa_2d.contiguous().view(-1)
1814
+
1815
+ # scale_b: [N, K//16] -> same padding
1816
+ sb_2d = weights.block_scales.T.contiguous().view(torch.float8_e4m3fn) # [N, K//16]
1817
+ sb_padded_rows = math.ceil(N / 128) * 128
1818
+ if k_groups > K // 16:
1819
+ k_pad = torch.full((sb_2d.shape[0], k_groups - K // 16), 0x38,
1820
+ dtype=torch.uint8, device=sb_2d.device).view(torch.float8_e4m3fn)
1821
+ sb_2d = torch.cat([sb_2d, k_pad], dim=1)
1822
+ if sb_padded_rows > N:
1823
+ row_pad = torch.full((sb_padded_rows - N, k_groups), 0x38,
1824
+ dtype=torch.uint8, device=sb_2d.device).view(torch.float8_e4m3fn)
1825
+ sb_2d = torch.cat([sb_2d, row_pad], dim=0)
1826
+ sb_flat = sb_2d.contiguous().view(-1)
1827
+
1828
+ # --- cuBLAS native FP4 matmul ---
1829
+ out = torch._scaled_mm(a_fp4, b_fp4.T, scale_a=sa_flat, scale_b=sb_flat,
1830
+ out_dtype=torch.bfloat16)
1831
+
1832
+ # Apply per-tensor scales (cuBLAS only handles block scales)
1833
+ ts = act_q.tensor_scale * weights.tensor_scale
1834
+ out = out.float() * ts
1835
+
1836
+ if bias is not None:
1837
+ out = out + bias.float()
1838
+
1839
+ return out.to(torch.bfloat16)
1840
+
1841
+
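The flat scale layout quoted in the docstrings above is easy to get wrong; a small worked sketch of the size arithmetic, using the same ceil(rows/128)*128 row padding and max(K//16, 4) group count as this function:

import math

def nvfp4_flat_scale_numel(rows: int, K: int) -> int:
    # Element count of the flat E4M3 scale tensor described above:
    # rows padded to a multiple of 128, K split into 16-wide groups (min 4).
    padded_rows = math.ceil(rows / 128) * 128
    k_groups = max(K // 16, 4)
    return padded_rows * k_groups

# Example: a single decode row (M=1) against K=2048 still needs a full
# 128-row slab of scales: 128 * 128 = 16384 E4M3 entries.
assert nvfp4_flat_scale_numel(1, 2048) == 16384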
1842
+ # =============================================================================
1843
+ # NVFP4 GEMM (public API)
1844
+ # =============================================================================
1845
+
1846
+ def nvfp4_gemm(
1847
+ activations: torch.Tensor,
1848
+ weights: 'NVFP4Weights',
1849
+ bias: Optional[torch.Tensor] = None,
1850
+ use_hadamard: bool = True,
1851
+ ) -> torch.Tensor:
1852
+ """
1853
+ NVFP4 GEMM with hierarchical dispatch.
1854
+
1855
+ Pipeline:
1856
+ 1. Apply Hadamard transform for outlier mitigation
1857
+ 2. Quantize activations with bucketize (O(K*N) instead of O(K*N*16))
1858
+ 3. Dispatch to best available kernel:
1859
+ - Tier 0: Native cuBLAS via torch._scaled_mm (if PyTorch supports FP4)
1860
+ - Tier 1: Fused NVFP4 Triton kernel (16-element blocks, E4M3 scales)
1861
+ - Tier 2: CPU fallback
1862
+
1863
+ Args:
1864
+ activations: Input [M, K] in BF16/FP16
1865
+ weights: NVFP4Weights with packed E2M1 values, E4M3 scales, tensor scale
1866
+ bias: Optional bias [N]
1867
+ use_hadamard: Apply Hadamard transform (recommended)
1868
+
1869
+ Returns:
1870
+ Output [M, N] in BF16
1871
+ """
1872
+ M, K = activations.shape
1873
+ K_w, N = weights.shape
1874
+ assert K == K_w, f"K dimension mismatch: {K} vs {K_w}"
1875
+ assert K % 16 == 0, f"K ({K}) must be multiple of 16 for NVFP4"
1876
+
1877
+ # Step 1: Hadamard transform on activations
1878
+ if use_hadamard and K >= 32:
1879
+ x = activations.float().view(M, K // 32, 32)
1880
+ x = _hadamard_transform_32(x)
1881
+ x = x.view(M, K)
1882
+ else:
1883
+ x = activations.float()
1884
+
1885
+ # Step 2: Dispatch
1886
+ if not activations.is_cuda:
1887
+ # CPU fallback: quant/dequant round-trip + matmul
1888
+ x_for_quant = x.T.contiguous()
1889
+ x_quant = NVFP4Weights.from_float(x_for_quant, use_quest=True)
1890
+ x_dequant = x_quant.to_float().T.contiguous()
1891
+ w_dequant = weights.to_float()
1892
+ d = torch.matmul(x_dequant, w_dequant)
1893
+ if bias is not None:
1894
+ d = d + bias.float()
1895
+ return d.to(torch.bfloat16)
1896
+
1897
+ # Tier 0: Native cuBLAS FP4 (quantizes activations to FP4 internally)
1898
+ if _can_use_scaled_mm_fp4():
1899
+ return _scaled_mm_fp4(x.to(torch.bfloat16), weights, bias)
1900
+
1901
+ # Tier 1: Triton kernel (BF16 activations with FP4 noise pre-applied)
1902
+ x_for_quant = x.T.contiguous()
1903
+ x_quant = NVFP4Weights.from_float(x_for_quant, use_quest=True)
1904
+ x_dequant = x_quant.to_float().T.contiguous()
1905
+ return _fused_nvfp4_matmul(x_dequant, weights, bias)
1906
+
1907
+
1908
+ def quantize_to_nvfp4(weights: torch.Tensor, use_quest: bool = True) -> NVFP4Weights:
1909
+ """
1910
+ Quantize weights to NVFP4 format.
1911
+
1912
+ Args:
1913
+ weights: Input tensor [K, N]
1914
+ use_quest: Use QuEST optimal clipping
1915
+
1916
+ Returns:
1917
+ NVFP4Weights ready for NVFP4 GEMM
1918
+ """
1919
+ return NVFP4Weights.from_float(weights, use_quest=use_quest)
1920
+
1921
+
1922
+ # Updated aliases: FP4 now points to NVFP4 (the better format)
1923
+ fp4_gemm = nvfp4_gemm
1924
+ quantize_to_fp4 = quantize_to_nvfp4
1925
+
1926
+
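A typical call sequence for the public API above, as a sketch (shapes are illustrative; quantize_to_nvfp4, nvfp4_gemm and fp4_gemm are the functions defined in this file):

import torch

# Quantize once, reuse for many GEMMs.
w_fp16 = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)   # [K, N]
w_q = quantize_to_nvfp4(w_fp16)        # NVFP4Weights (packed E2M1 + E4M3 scales)

a = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)          # [M, K]
out = nvfp4_gemm(a, w_q)               # [M, N] in BF16, Hadamard on by default
out_alias = fp4_gemm(a, w_q)           # identical call via the fp4_gemm alias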
1927
+ # =============================================================================
1928
+ # L2 Cache Control (ctypes / libcudart.so)
1929
+ # =============================================================================
1930
+
1931
+ # --- ctypes structures for cudaAccessPolicyWindow -------------------------
1932
+
1933
+ class _AccessPolicyWindow(ctypes.Structure):
1934
+ """Maps to cudaAccessPolicyWindow (CUDA Runtime API)."""
1935
+ _fields_ = [
1936
+ ("base_ptr", ctypes.c_void_p),
1937
+ ("num_bytes", ctypes.c_size_t),
1938
+ ("hitRatio", ctypes.c_float),
1939
+ ("hitProp", ctypes.c_int),
1940
+ ("missProp", ctypes.c_int),
1941
+ ]
1942
+
1943
+
1944
+ class _StreamAttrValue(ctypes.Union):
1945
+ """Maps to cudaStreamAttrValue (union)."""
1946
+ _fields_ = [
1947
+ ("accessPolicyWindow", _AccessPolicyWindow),
1948
+ ("syncPolicy", ctypes.c_int),
1949
+ ]
1950
+
1951
+
1952
+ # cudaAccessProperty enum
1953
+ _CUDA_ACCESS_PROPERTY_NORMAL = 0
1954
+ _CUDA_ACCESS_PROPERTY_STREAMING = 1
1955
+ _CUDA_ACCESS_PROPERTY_PERSISTING = 2
1956
+
1957
+ # cudaStreamAttrID enum
1958
+ _CUDA_STREAM_ATTR_ACCESS_POLICY_WINDOW = 1
1959
+
1960
+ # cudaLimit enum
1961
+ _CUDA_LIMIT_PERSISTING_L2_CACHE_SIZE = 0x06
1962
+
1963
+ # cudaDeviceAttr enum
1964
+ _CUDA_DEV_ATTR_L2_CACHE_SIZE = 89
1965
+ _CUDA_DEV_ATTR_MAX_PERSISTING_L2_CACHE_SIZE = 108
1966
+
1967
+
1968
+ def _load_cudart():
1969
+ """Load the CUDA runtime shared library, return handle or None."""
1970
+ for name in ("libcudart.so", "libcudart.so.12", "libcudart.so.11.0"):
1971
+ try:
1972
+ return ctypes.CDLL(name)
1973
+ except OSError:
1974
+ continue
1975
+ try:
1976
+ path = ctypes.util.find_library("cudart")
1977
+ if path:
1978
+ return ctypes.CDLL(path)
1979
+ except (OSError, TypeError):
1980
+ pass
1981
+ return None
1982
+
1983
+
1984
+ _cudart = _load_cudart()
1985
+
1986
+
1987
+ # =============================================================================
1988
+ # L2CacheManager (public API)
1989
+ # =============================================================================
1990
+
1991
+ class L2CacheManager:
1992
+ """
1993
+ L2 Cache Manager for SM90+ GPUs.
1994
+
1995
+ Uses ctypes/libcudart.so cudaAccessPolicyWindow to pin hot data
1996
+ (embeddings, weights, KV cache) in L2 for 10-20% inference speedup.
1997
+
1998
+ When libcudart is not loadable the manager degrades to no-op stubs
1999
+ so the rest of the engine remains functional.
2000
+
2001
+ Usage:
2002
+ l2 = L2CacheManager()
2003
+
2004
+ # Pin embedding table
2005
+ l2.pin(embedding_table)
2006
+
2007
+ # Configure for inference
2008
+ l2.configure_inference(
2009
+ embedding=embedding_table,
2010
+ attention_weights=attn_weights,
2011
+ kv_cache=kv_cache,
2012
+ )
2013
+
2014
+ # Reset between batches
2015
+ l2.reset()
2016
+ """
2017
+
2018
+ def __init__(self, device: int = 0):
2019
+ self.device = device
2020
+ self._hw_available = False
2021
+ self._l2_size = 0
2022
+ self._max_persisting = 0
2023
+
2024
+ self._initialize()
2025
+
2026
+ def _initialize(self):
2027
+ """Query device L2 geometry via cudart."""
2028
+ if not torch.cuda.is_available():
2029
+ return
2030
+
2031
+ # Start with PyTorch device properties
2032
+ props = torch.cuda.get_device_properties(self.device)
2033
+ self._l2_size = getattr(props, 'l2_cache_size', 0)
2034
+
2035
+ if _cudart is not None:
2036
+ try:
2037
+ # Total L2
2038
+ val = ctypes.c_int(0)
2039
+ if (_cudart.cudaDeviceGetAttribute(
2040
+ ctypes.byref(val),
2041
+ ctypes.c_int(_CUDA_DEV_ATTR_L2_CACHE_SIZE),
2042
+ ctypes.c_int(self.device),
2043
+ ) == 0 and val.value > 0):
2044
+ self._l2_size = val.value
2045
+
2046
+ # Max persisting
2047
+ val2 = ctypes.c_int(0)
2048
+ if (_cudart.cudaDeviceGetAttribute(
2049
+ ctypes.byref(val2),
2050
+ ctypes.c_int(_CUDA_DEV_ATTR_MAX_PERSISTING_L2_CACHE_SIZE),
2051
+ ctypes.c_int(self.device),
2052
+ ) == 0 and val2.value > 0):
2053
+ self._max_persisting = val2.value
2054
+ else:
2055
+ self._max_persisting = int(self._l2_size * 0.75)
2056
+
2057
+ self._hw_available = True
2058
+ except Exception:
2059
+ pass
2060
+
2061
+ if self._max_persisting == 0:
2062
+ self._max_persisting = int(self._l2_size * 0.75)
2063
+
2064
+ # Apply persisting limit
2065
+ self._set_persisting_limit(self._max_persisting)
2066
+
2067
+ # ------------------------------------------------------------------
2068
+ # Internal CUDA helpers
2069
+ # ------------------------------------------------------------------
2070
+
2071
+ def _set_persisting_limit(self, num_bytes: int) -> bool:
2072
+ if not self._hw_available or _cudart is None:
2073
+ return False
2074
+ return _cudart.cudaDeviceSetLimit(
2075
+ ctypes.c_int(_CUDA_LIMIT_PERSISTING_L2_CACHE_SIZE),
2076
+ ctypes.c_size_t(num_bytes),
2077
+ ) == 0
2078
+
2079
+ def _apply_access_policy(self, tensor: torch.Tensor, hit_ratio: float,
2080
+ stream_ptr: int) -> bool:
2081
+ if not self._hw_available or _cudart is None:
2082
+ return False
2083
+
2084
+ window = _AccessPolicyWindow()
2085
+ window.base_ptr = tensor.data_ptr()
2086
+ window.num_bytes = min(
2087
+ tensor.numel() * tensor.element_size(),
2088
+ self._max_persisting,
2089
+ )
2090
+ window.hitRatio = hit_ratio
2091
+ window.hitProp = _CUDA_ACCESS_PROPERTY_PERSISTING
2092
+ window.missProp = _CUDA_ACCESS_PROPERTY_STREAMING
2093
+
2094
+ attr = _StreamAttrValue()
2095
+ attr.accessPolicyWindow = window
2096
+
2097
+ return _cudart.cudaStreamSetAttribute(
2098
+ ctypes.c_void_p(stream_ptr),
2099
+ ctypes.c_int(_CUDA_STREAM_ATTR_ACCESS_POLICY_WINDOW),
2100
+ ctypes.byref(attr),
2101
+ ) == 0
2102
+
2103
+ def _reset_stream_policy(self, stream_ptr: int) -> bool:
2104
+ if not self._hw_available or _cudart is None:
2105
+ return False
2106
+ attr = _StreamAttrValue()
2107
+ attr.accessPolicyWindow = _AccessPolicyWindow()
2108
+ return _cudart.cudaStreamSetAttribute(
2109
+ ctypes.c_void_p(stream_ptr),
2110
+ ctypes.c_int(_CUDA_STREAM_ATTR_ACCESS_POLICY_WINDOW),
2111
+ ctypes.byref(attr),
2112
+ ) == 0
2113
+
2114
+ def _reset_persisting_l2(self) -> bool:
2115
+ if not self._hw_available or _cudart is None:
2116
+ return False
2117
+ return _cudart.cudaCtxResetPersistingL2Cache() == 0
2118
+
2119
+ # ------------------------------------------------------------------
2120
+ # Public API
2121
+ # ------------------------------------------------------------------
2122
+
2123
+ @property
2124
+ def l2_size(self) -> int:
2125
+ """Total L2 cache size in bytes."""
2126
+ return self._l2_size
2127
+
2128
+ @property
2129
+ def max_persisting(self) -> int:
2130
+ """Maximum persisting L2 size in bytes."""
2131
+ return self._max_persisting
2132
+
2133
+ def pin(
2134
+ self,
2135
+ tensor: torch.Tensor,
2136
+ hit_ratio: float = 1.0,
2137
+ stream: Optional[torch.cuda.Stream] = None,
2138
+ ) -> bool:
2139
+ """
2140
+ Pin a tensor in L2 cache via cudaAccessPolicyWindow.
2141
+
2142
+ Args:
2143
+ tensor: Tensor to pin (must be on CUDA)
2144
+ hit_ratio: Fraction of accesses to persist (0.0-1.0)
2145
+ stream: CUDA stream (default: current)
2146
+
2147
+ Returns:
2148
+ True on success (or no-op when HW unavailable)
2149
+ """
2150
+ if not tensor.is_cuda:
2151
+ return False
2152
+
2153
+ if self._hw_available:
2154
+ stream_ptr = (
2155
+ stream.cuda_stream if stream is not None
2156
+ else torch.cuda.current_stream(self.device).cuda_stream
2157
+ )
2158
+ return self._apply_access_policy(tensor, hit_ratio, stream_ptr)
2159
+
2160
+ return True # no-op fallback
2161
+
2162
+ def set_streaming(
2163
+ self,
2164
+ tensor: torch.Tensor,
2165
+ stream: Optional[torch.cuda.Stream] = None,
2166
+ ) -> bool:
2167
+ """
2168
+ Mark tensor as streaming (bypass L2 cache).
2169
+
2170
+ Use for one-time access data to avoid L2 pollution.
2171
+ """
2172
+ if not tensor.is_cuda:
2173
+ return False
2174
+
2175
+ if self._hw_available:
2176
+ stream_ptr = (
2177
+ stream.cuda_stream if stream is not None
2178
+ else torch.cuda.current_stream(self.device).cuda_stream
2179
+ )
2180
+ window = _AccessPolicyWindow()
2181
+ window.base_ptr = tensor.data_ptr()
2182
+ window.num_bytes = tensor.numel() * tensor.element_size()
2183
+ window.hitRatio = 0.0
2184
+ window.hitProp = _CUDA_ACCESS_PROPERTY_STREAMING
2185
+ window.missProp = _CUDA_ACCESS_PROPERTY_STREAMING
2186
+
2187
+ attr = _StreamAttrValue()
2188
+ attr.accessPolicyWindow = window
2189
+
2190
+ return _cudart.cudaStreamSetAttribute(
2191
+ ctypes.c_void_p(stream_ptr),
2192
+ ctypes.c_int(_CUDA_STREAM_ATTR_ACCESS_POLICY_WINDOW),
2193
+ ctypes.byref(attr),
2194
+ ) == 0
2195
+
2196
+ return True # no-op fallback
2197
+
2198
+ def reset(self) -> bool:
2199
+ """Reset persisting L2 cache. Call between inference batches."""
2200
+ return self._reset_persisting_l2() if self._hw_available else True
2201
+
2202
+ def configure_inference(
2203
+ self,
2204
+ embedding: Optional[torch.Tensor] = None,
2205
+ attention_weights: Optional[torch.Tensor] = None,
2206
+ kv_cache: Optional[torch.Tensor] = None,
2207
+ stream: Optional[torch.cuda.Stream] = None,
2208
+ ) -> bool:
2209
+ """
2210
+ Configure L2 cache for transformer inference.
2211
+
2212
+ Pins tensors with appropriate priorities:
2213
+ 1. Embedding table (highest — hit_ratio=1.0)
2214
+ 2. Attention weights (hit_ratio=0.9)
2215
+ 3. KV cache (lowest — hit_ratio=0.7)
2216
+
2217
+ Args:
2218
+ embedding: Embedding table tensor
2219
+ attention_weights: Combined attention weights
2220
+ kv_cache: KV cache tensor
2221
+ stream: CUDA stream
2222
+ """
2223
+ success = True
2224
+ if embedding is not None:
2225
+ success = success and self.pin(embedding, 1.0, stream)
2226
+ if attention_weights is not None:
2227
+ success = success and self.pin(attention_weights, 0.9, stream)
2228
+ if kv_cache is not None:
2229
+ success = success and self.pin(kv_cache, 0.7, stream)
2230
+ return success
2231
+
2232
+
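One sizing detail the class docstring leaves implicit: when a tensor is larger than the persisting window, a fractional hit_ratio avoids thrashing the pinned region. A minimal sketch of that heuristic (the helper name and the 0.9 safety factor are assumptions, not part of the class):

l2 = L2CacheManager()

def pin_with_budget(t, mgr=l2, safety=0.9):
    # Scale hit_ratio down when the tensor exceeds the persisting-L2 budget,
    # so only a matching fraction of accesses is marked persisting.
    nbytes = t.numel() * t.element_size()
    ratio = min(1.0, safety * mgr.max_persisting / max(nbytes, 1))
    return mgr.pin(t, hit_ratio=ratio)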
2233
+ # =============================================================================
2234
+ # Benchmark Utilities
2235
+ # =============================================================================
2236
+
2237
+ def benchmark_tma_vs_cublas(sizes=None, warmup=10, iters=100):
2238
+ """Benchmark TMA MatMul vs cuBLAS."""
2239
+ import time
2240
+
2241
+ if sizes is None:
2242
+ sizes = [(2048, 2048, 2048), (4096, 4096, 4096), (8192, 8192, 8192)]
2243
+
2244
+ print("=" * 60)
2245
+ print("TMA MatMul vs cuBLAS Benchmark")
2246
+ print("=" * 60)
2247
+
2248
+ for M, N, K in sizes:
2249
+ a = torch.randn(M, K, device='cuda', dtype=torch.bfloat16)
2250
+ b = torch.randn(K, N, device='cuda', dtype=torch.bfloat16)
2251
+
2252
+ # Warmup
2253
+ for _ in range(warmup):
2254
+ _ = tma_matmul(a, b)
2255
+ _ = torch.matmul(a, b)
2256
+ torch.cuda.synchronize()
2257
+
2258
+ # TMA MatMul
2259
+ start = time.perf_counter()
2260
+ for _ in range(iters):
2261
+ _ = tma_matmul(a, b)
2262
+ torch.cuda.synchronize()
2263
+ tma_time = (time.perf_counter() - start) / iters
2264
+
2265
+ # cuBLAS
2266
+ start = time.perf_counter()
2267
+ for _ in range(iters):
2268
+ _ = torch.matmul(a, b)
2269
+ torch.cuda.synchronize()
2270
+ cublas_time = (time.perf_counter() - start) / iters
2271
+
2272
+ flops = 2 * M * N * K
2273
+ tma_tflops = flops / tma_time / 1e12
2274
+ cublas_tflops = flops / cublas_time / 1e12
2275
+ speedup = cublas_time / tma_time
2276
+
2277
+ print(f"{M}x{N}x{K}:")
2278
+ print(f" TMA: {tma_tflops:.1f} TFLOPS ({tma_time*1000:.2f}ms)")
2279
+ print(f" cuBLAS: {cublas_tflops:.1f} TFLOPS ({cublas_time*1000:.2f}ms)")
2280
+ print(f" Speedup: {speedup:.2f}x")
2281
+ print()
2282
+
2283
+
2284
+ def benchmark_fp4_vs_fp16(M=4096, N=4096, K=4096, warmup=10, iters=100):
2285
+ """Benchmark NVFP4, MXFP4, and FP16 GEMM paths."""
2286
+ import time
2287
+
2288
+ print("=" * 60)
2289
+ print("FP4 vs FP16 GEMM Benchmark")
2290
+ print("=" * 60)
2291
+
2292
+ # Create weights in both formats
2293
+ w_fp16 = torch.randn(K, N, device='cuda', dtype=torch.float16)
2294
+ w_mxfp4 = quantize_to_mxfp4(w_fp16)
2295
+ w_nvfp4 = quantize_to_nvfp4(w_fp16)
2296
+ a = torch.randn(M, K, device='cuda', dtype=torch.bfloat16)
2297
+
2298
+ # Memory usage
2299
+ fp16_bytes = w_fp16.numel() * 2
2300
+ mxfp4_bytes = w_mxfp4.packed.numel() + w_mxfp4.scales.numel()
2301
+ nvfp4_bytes = w_nvfp4.packed.numel() + w_nvfp4.block_scales.numel()
2302
+
2303
+ print(f"Weight memory:")
2304
+ print(f" FP16: {fp16_bytes / 1e6:.1f} MB")
2305
+ print(f" MXFP4: {mxfp4_bytes / 1e6:.1f} MB ({fp16_bytes / mxfp4_bytes:.1f}x smaller)")
2306
+ print(f" NVFP4: {nvfp4_bytes / 1e6:.1f} MB ({fp16_bytes / nvfp4_bytes:.1f}x smaller)")
2307
+ print()
2308
+
2309
+ # ---- Kernel-only benchmark (isolates kernel from activation quant) ----
2310
+ print(f"{M}x{N}x{K} Kernel-only (no activation quant overhead):")
2311
+
2312
+ for _ in range(warmup):
2313
+ _fused_nvfp4_matmul(a, w_nvfp4)
2314
+ _fused_fp4_matmul(a, w_mxfp4)
2315
+ torch.matmul(a.half(), w_fp16)
2316
+ torch.cuda.synchronize()
2317
+
2318
+ start = time.perf_counter()
2319
+ for _ in range(iters):
2320
+ _fused_nvfp4_matmul(a, w_nvfp4)
2321
+ torch.cuda.synchronize()
2322
+ nvfp4_kern_time = (time.perf_counter() - start) / iters
2323
+
2324
+ start = time.perf_counter()
2325
+ for _ in range(iters):
2326
+ _fused_fp4_matmul(a, w_mxfp4)
2327
+ torch.cuda.synchronize()
2328
+ mxfp4_kern_time = (time.perf_counter() - start) / iters
2329
+
2330
+ start = time.perf_counter()
2331
+ for _ in range(iters):
2332
+ torch.matmul(a.half(), w_fp16)
2333
+ torch.cuda.synchronize()
2334
+ fp16_time = (time.perf_counter() - start) / iters
2335
+
2336
+ flops = 2 * M * N * K
2337
+ print(f" NVFP4 kernel: {flops/nvfp4_kern_time/1e12:.1f} TFLOPS ({nvfp4_kern_time*1000:.2f}ms)")
2338
+ print(f" MXFP4 kernel: {flops/mxfp4_kern_time/1e12:.1f} TFLOPS ({mxfp4_kern_time*1000:.2f}ms)")
2339
+ print(f" BF16 cuBLAS: {flops/fp16_time/1e12:.1f} TFLOPS ({fp16_time*1000:.2f}ms)")
2340
+ print()
2341
+
2342
+ # ---- Full pipeline benchmark (includes Hadamard + activation quant) ----
2343
+ print(f"{M}x{N}x{K} Full pipeline (Hadamard + act quant + kernel):")
2344
+
2345
+ for _ in range(warmup):
2346
+ nvfp4_gemm(a, w_nvfp4)
2347
+ mxfp4_gemm(a, w_mxfp4)
2348
+ mxfp4_gemm_legacy(a, w_mxfp4)
2349
+ torch.cuda.synchronize()
2350
+
2351
+ start = time.perf_counter()
2352
+ for _ in range(iters):
2353
+ nvfp4_gemm(a, w_nvfp4)
2354
+ torch.cuda.synchronize()
2355
+ nvfp4_pipe_time = (time.perf_counter() - start) / iters
2356
+
2357
+ start = time.perf_counter()
2358
+ for _ in range(iters):
2359
+ mxfp4_gemm(a, w_mxfp4)
2360
+ torch.cuda.synchronize()
2361
+ mxfp4_pipe_time = (time.perf_counter() - start) / iters
2362
+
2363
+ start = time.perf_counter()
2364
+ for _ in range(iters):
2365
+ mxfp4_gemm_legacy(a, w_mxfp4)
2366
+ torch.cuda.synchronize()
2367
+ legacy_time = (time.perf_counter() - start) / iters
2368
+
2369
+ print(f" NVFP4 pipeline: {flops/nvfp4_pipe_time/1e12:.1f} TFLOPS ({nvfp4_pipe_time*1000:.2f}ms)")
2370
+ print(f" MXFP4 pipeline: {flops/mxfp4_pipe_time/1e12:.1f} TFLOPS ({mxfp4_pipe_time*1000:.2f}ms)")
2371
+ print(f" MXFP4 legacy: {flops/legacy_time/1e12:.1f} TFLOPS ({legacy_time*1000:.2f}ms)")
2372
+ act_overhead_nv = nvfp4_pipe_time - nvfp4_kern_time
2373
+ act_overhead_mx = mxfp4_pipe_time - mxfp4_kern_time
2374
+ print(f" Act quant overhead: NVFP4={act_overhead_nv*1000:.2f}ms MXFP4={act_overhead_mx*1000:.2f}ms")
2375
+ print()
2376
+
2377
+ # ---- Probes ----
2378
+ print(f" Native FP4 probe: {_can_use_native_fp4()}")
2379
+ print(f" Scaled MM FP4 probe: {_can_use_scaled_mm_fp4()}")
2380
+
2381
+ # ---- Accuracy (kernel-only, apples-to-apples) ----
2382
+ # Compare fused kernel output vs torch.matmul with same dequantized weights
2383
+ # using the SAME activations (no Hadamard/quant noise difference)
2384
+ out_nv_kern = _fused_nvfp4_matmul(a, w_nvfp4)
2385
+ out_nv_ref = torch.matmul(a.float(), w_nvfp4.to_float()).bfloat16()
2386
+ rel_err_nv = (out_nv_kern.float() - out_nv_ref.float()).abs().mean() / out_nv_ref.float().abs().mean()
2387
+
2388
+ out_mx_kern = _fused_fp4_matmul(a, w_mxfp4)
2389
+ out_mx_ref = torch.matmul(a.float(), w_mxfp4.to_float()).bfloat16()
2390
+ rel_err_mx = (out_mx_kern.float() - out_mx_ref.float()).abs().mean() / out_mx_ref.float().abs().mean()
2391
+
2392
+ # MXFP4 fused vs legacy (both use same pipeline, should match exactly)
2393
+ out_mxfp4_fused = mxfp4_gemm(a, w_mxfp4)
2394
+ out_legacy = mxfp4_gemm_legacy(a, w_mxfp4)
2395
+ rel_err_mx_pipe = (out_mxfp4_fused - out_legacy).abs().mean() / out_legacy.abs().mean()
2396
+
2397
+ print(f" NVFP4 kernel rel_err (vs matmul): {rel_err_nv:.6f}")
2398
+ print(f" MXFP4 kernel rel_err (vs matmul): {rel_err_mx:.6f}")
2399
+ print(f" MXFP4 fused vs legacy rel_err: {rel_err_mx_pipe:.6f}")
2400
+
2401
+
2402
+ if __name__ == "__main__":
2403
+ print("FireEcho CUTLASS-Compatible Kernels (self-contained)")
2404
+ print("=" * 60)
2405
+ print(f"Triton available: True")
2406
+ print(f"cudart loaded: {_cudart is not None}")
2407
+
2408
+ if torch.cuda.is_available():
2409
+ l2 = L2CacheManager()
2410
+ print(f"L2 Cache size: {l2.l2_size / 1e6:.0f} MB")
2411
+ print(f"Max persisting: {l2.max_persisting / 1e6:.0f} MB")
2412
+ print(f"HW L2 pinning: {l2._hw_available}")
2413
+ print(f"Native FP4 (dot_scaled): {_can_use_native_fp4()}")
2414
+ print(f"Scaled MM FP4: {_can_use_scaled_mm_fp4()}")
2415
+
2416
+ print()
2417
+ benchmark_tma_vs_cublas(sizes=[(2048, 2048, 2048)])
2418
+ benchmark_fp4_vs_fp16(M=2048, N=2048, K=2048)
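The docstrings above quote cos_sim ≈ 0.999 for the fused Triton path; a quick way to reproduce that style of check on your own shapes, mirroring the accuracy comparison in benchmark_fp4_vs_fp16 (the helper name and the 0.99 threshold are arbitrary choices, not part of the module):

import torch

def quick_nvfp4_check(M=1, K=2048, N=2048, threshold=0.99):
    # Fused kernel vs a matmul against the dequantized weights,
    # scored with cosine similarity (M=1 covers the decode shape).
    a = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
    w_q = quantize_to_nvfp4(torch.randn(K, N, device="cuda", dtype=torch.float16))
    out = _fused_nvfp4_matmul(a, w_q).float().flatten()
    ref = torch.matmul(a.float(), w_q.to_float()).flatten()
    cos = torch.nn.functional.cosine_similarity(out, ref, dim=0).item()
    return cos >= threshold, cos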
FireEcho Engine/debug_acceptance.log ADDED
@@ -0,0 +1,92 @@
1
+ nohup: ignoring input
2
+ Loading model...
3
+ [Auto-detect] Qwen3-Omni MoE thinker (30.5B total, ~3.3B active)
4
+ [FireEcho] Loading /run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct...
5
+ [FireEcho] AutoConfig failed ('Qwen3OmniMoeTalkerCodePredictorConfig' object has no attribute 'use_sliding_window'), loading config.json directly
6
+ Qwen3-Omni: will stream-load from 15 shards
7
+ [Qwen3 Streaming] Loaded shard index: 28010 keys across 15 shards
8
+ [Qwen3 Streaming] Building engine skeleton...
9
+ [Qwen3 Streaming] Global params on GPU: 1.2 GB
10
+ Layer 4/48: 393 weights, VRAM 2.8 GB, CPU 1.4 GB
11
+ Layer 8/48: 393 weights, VRAM 4.3 GB, CPU 1.6 GB
12
+ Layer 12/48: 393 weights, VRAM 5.8 GB, CPU 1.7 GB
13
+ Layer 16/48: 393 weights, VRAM 7.4 GB, CPU 1.9 GB
14
+ Layer 20/48: 393 weights, VRAM 8.9 GB, CPU 2.0 GB
15
+ Layer 24/48: 393 weights, VRAM 10.4 GB, CPU 2.2 GB
16
+ Layer 28/48: 393 weights, VRAM 11.9 GB, CPU 2.3 GB
17
+ Layer 32/48: 393 weights, VRAM 13.5 GB, CPU 2.5 GB
18
+ Layer 36/48: 393 weights, VRAM 15.0 GB, CPU 2.6 GB
19
+ Layer 40/48: 393 weights, VRAM 16.5 GB, CPU 2.8 GB
20
+ Layer 44/48: 393 weights, VRAM 18.0 GB, CPU 2.9 GB
21
+ Layer 48/48: 393 weights, VRAM 19.6 GB, CPU 3.1 GB
22
+ [Qwen3 Streaming] Final VRAM: 19.6 GB (FP4 quantized)
23
+ [Qwen3 Streaming] Done: 1571.8M params, 18867 weights loaded
24
+ Total params: 1.57B
25
+ Frozen params: 1.54B (base model, FP4)
26
+ Trainable params: 30.2M (Hebbian only)
27
+ [Packed MoE] 48 layers packed (6144 experts → contiguous)
28
+ [Flat KV] Enabled: 4096 tokens, 403 MB
29
+ Warmup...
30
+
31
+ ============================================================
32
+ Testing D=2 (D=2 baseline)
33
+ ============================================================
34
+ [EAGLE] Loaded legacy D=2 checkpoint. 0 new layer params initialized randomly.
35
+ [EAGLE-3] Draft head: D=2, 104.9M params, 210 MB, capture layers [8, 24, 47] + Hebbian memory
36
+ Target prefill logits: has_nan=True, min=nan, max=nan
37
+ First decoded token: 0 = '!'
38
+ Target predicts next: 0 = '!'
39
+ Feature layer 8: has_nan=True, min=nan, max=nan
40
+ Feature layer 24: has_nan=True, min=nan, max=nan
41
+ Feature layer 47: has_nan=True, min=nan, max=nan
42
+ Draft tokens:
43
+ [0] 0 = '!'
44
+ [1] 0 = '!'
45
+ [2] 0 = '!'
46
+ [3] 0 = '!'
47
+ [4] 0 = '!'
48
+ Draft logits[0]: has_nan=True, min=nan, max=nan
49
+ Target verify predictions:
50
+ [1] target=0 ('!'), draft=0 ('!') → MATCH
51
+ [2] target=0 ('!'), draft=0 ('!') → MATCH
52
+ [3] target=0 ('!'), draft=0 ('!') → MATCH
53
+ [4] target=0 ('!'), draft=0 ('!') → MATCH
54
+ Accepted: 5/5
55
+
56
+ --- Full speculative_generate (max_new=30) ---
57
+ [EAGLE-3] 5 rounds, 21 drafted, 21 accepted (100%), avg 4.2/round
58
+ Output: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
59
+
60
+ ============================================================
61
+ Testing D=8 (D=8 with random layers 2-7)
62
+ ============================================================
63
+ [EAGLE] Loaded legacy D=2 checkpoint. 54 new layer params initialized randomly.
64
+ [FE-XT] Draft head: D=8, 356.5M params, 713 MB, capture layers [8, 24, 47] + Hebbian memory
65
+ Target prefill logits: has_nan=True, min=nan, max=nan
66
+ First decoded token: 0 = '!'
67
+ Target predicts next: 0 = '!'
68
+ Feature layer 8: has_nan=True, min=nan, max=nan
69
+ Feature layer 24: has_nan=True, min=nan, max=nan
70
+ Feature layer 47: has_nan=True, min=nan, max=nan
71
+ Draft tokens:
72
+ [0] 0 = '!'
73
+ [1] 0 = '!'
74
+ [2] 0 = '!'
75
+ [3] 0 = '!'
76
+ [4] 0 = '!'
77
+ Draft logits[0]: has_nan=True, min=nan, max=nan
78
+ Target verify predictions:
79
+ [1] target=0 ('!'), draft=0 ('!') → MATCH
80
+ [2] target=0 ('!'), draft=0 ('!') → MATCH
81
+ [3] target=0 ('!'), draft=0 ('!') → MATCH
82
+ [4] target=0 ('!'), draft=0 ('!') → MATCH
83
+ Accepted: 5/5
84
+
85
+ --- Full speculative_generate (max_new=30) ---
86
+ [EAGLE-3] 5 rounds, 21 drafted, 21 accepted (100%), avg 4.2/round
87
+ Output: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
88
+
89
+ ============================================================
90
+ D=2 accepted: 5/5
91
+ D=8 accepted: 5/5
92
+ ============================================================
FireEcho Engine/debug_acceptance.py ADDED
@@ -0,0 +1,152 @@
1
+ #!/usr/bin/env python3
2
+ """Debug: Why does D=8 eagle head show 100% acceptance?
3
+ Compare draft tokens vs target predictions for D=2 and D=8.
4
+
5
+ ROOT CAUSE FOUND: Missing torch.no_grad() caused NaN logits (Goliath FP4
6
+ Triton kernels don't support autograd). argmax(NaN)=0 for both draft and
7
+ target → fake 100% acceptance. This version fixes that.
8
+ """
9
+ import sys, os, torch
10
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
11
+ from hebbian_finetune_demo import load_engine
12
+
13
+ MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
14
+ EAGLE_CKPT = os.path.join(os.path.dirname(__file__), "eagle_checkpoints", "eagle_best.pt")
15
+
16
+ @torch.no_grad()
17
+ def test_acceptance(engine, tokenizer, num_layers, label):
18
+ """Enable eagle with given D, run one round of draft+verify, print details."""
19
+ print(f"\n{'='*60}")
20
+ print(f" Testing D={num_layers} ({label})")
21
+ print(f"{'='*60}")
22
+
23
+ # Enable eagle
24
+ engine.enable_eagle(
25
+ capture_layers=(8, 24, 47),
26
+ num_head_layers=num_layers,
27
+ checkpoint_path=EAGLE_CKPT if os.path.exists(EAGLE_CKPT) else None)
28
+ engine.eval()
29
+
30
+ prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWrite a Python function to check if a number is prime.<|im_end|>\n<|im_start|>assistant\n"
31
+ ids = tokenizer.encode(prompt, return_tensors='pt').cuda()
32
+ prompt_len = ids.shape[1]
33
+
34
+ # Prefill
35
+ engine.reset_cache()
36
+ engine._current_seq_id = 0
37
+ if hasattr(engine.kv_cache, '_graph_mode'):
38
+ engine.kv_cache._graph_mode = False
39
+ logits = engine.forward(ids, use_cache=True, position=0)
40
+ current_pos = prompt_len
41
+
42
+ # Check for NaN in target logits
43
+ has_nan = logits.isnan().any().item()
44
+ print(f" Target prefill logits: has_nan={has_nan}, "
45
+ f"min={logits[:,-1,:].min().item():.2f}, max={logits[:,-1,:].max().item():.2f}")
46
+
47
+ # Decode first token
48
+ next_token = logits[:, -1:, :].argmax(dim=-1)
49
+ print(f" First decoded token: {next_token.item()} = '{tokenizer.decode([next_token.item()])}'")
50
+
51
+ # Forward it (stores KV, captures hidden states)
52
+ logits = engine.forward(next_token, use_cache=True, position=current_pos)
53
+ current_pos += 1
54
+
55
+ # Target model's prediction
56
+ main_pred = logits[:, -1, :].argmax(dim=-1).item()
57
+ print(f" Target predicts next: {main_pred} = '{tokenizer.decode([main_pred])}'")
58
+
59
+ # Draft 5 tokens
60
+ features = [engine._eagle_hidden_states[l]
61
+ for l in engine._eagle_capture_layers]
62
+
63
+ # Check features for NaN
64
+ for li, f in zip(engine._eagle_capture_layers, features):
65
+ print(f" Feature layer {li}: has_nan={f.isnan().any().item()}, "
66
+ f"min={f.min().item():.4f}, max={f.max().item():.4f}")
67
+
68
+ memory_ctx = engine._get_eagle_memory_context(
69
+ engine._eagle_hidden_states[engine._eagle_capture_layers[-1]])
70
+
71
+ draft_tokens, draft_logits = engine.eagle_head.generate_draft(
72
+ features, next_token, engine.embed, depth=5,
73
+ memory_context=memory_ctx)
74
+
75
+ print(f" Draft tokens:")
76
+ for i, dt in enumerate(draft_tokens):
77
+ tok_id = dt.item()
78
+ print(f" [{i}] {tok_id} = '{tokenizer.decode([tok_id])}'")
79
+
80
+ # Check draft logits for NaN
81
+ dl0 = draft_logits[0][0, 0, :]
82
+ print(f" Draft logits[0]: has_nan={dl0.isnan().any().item()}, "
83
+ f"min={dl0.min().item():.2f}, max={dl0.max().item():.2f}")
84
+
85
+ # Verify: forward draft tokens through target
86
+ draft_input = torch.cat(draft_tokens, dim=1)
87
+ verify_logits = engine.forward(draft_input, use_cache=True, position=current_pos)
88
+
89
+ print(f" Target verify predictions:")
90
+ accepted = 0
91
+ if draft_tokens[0].item() == main_pred:
92
+ accepted = 1
93
+ for i in range(1, len(draft_tokens)):
94
+ target_pred = verify_logits[:, i - 1, :].argmax(dim=-1).item()
95
+ match = "MATCH" if draft_tokens[i].item() == target_pred else "MISS"
96
+ print(f" [{i}] target={target_pred} ('{tokenizer.decode([target_pred])}'), "
97
+ f"draft={draft_tokens[i].item()} ('{tokenizer.decode([draft_tokens[i].item()])}') → {match}")
98
+ if draft_tokens[i].item() == target_pred:
99
+ accepted += 1
100
+ else:
101
+ break
102
+ else:
103
+ print(f" [0] MISS: draft[0]={draft_tokens[0].item()} "
104
+ f"('{tokenizer.decode([draft_tokens[0].item()])}') "
105
+ f"!= main_pred={main_pred} ('{tokenizer.decode([main_pred])}')")
106
+
107
+ print(f" Accepted: {accepted}/{len(draft_tokens)}")
108
+
109
+ # Also run full speculative_generate to match training eval
110
+ print(f"\n --- Full speculative_generate (max_new=30) ---")
111
+ engine.reset_cache()
112
+ ids2 = tokenizer.encode(prompt, return_tensors='pt').cuda()
113
+ out = engine.speculative_generate(
114
+ ids2, max_new_tokens=30, temperature=0.0,
115
+ stop_tokens=[199999, 200020])
116
+ text = tokenizer.decode(out[0, ids2.shape[1]:], skip_special_tokens=True)
117
+ print(f" Output: {text[:120]}")
118
+
119
+ # Cleanup eagle
120
+ del engine.eagle_head
121
+ engine._eagle_enabled = False
122
+
123
+ return accepted
124
+
125
+
126
+ if __name__ == "__main__":
127
+ print("Loading model...")
128
+ engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=4096, device="cuda")
129
+ engine.pack_all_experts()
130
+ engine.kv_cache.enable_flat_decode()
131
+ engine.eval()
132
+
133
+ # Warmup
134
+ print("Warmup...")
135
+ warmup_ids = tokenizer.encode("Hello", return_tensors='pt').cuda()
136
+ for _ in range(3):
137
+ engine.generate(warmup_ids, max_new_tokens=5, temperature=0.0, top_k=0, top_p=1.0)
138
+
139
+ # Test D=2
140
+ acc2 = test_acceptance(engine, tokenizer, 2, "D=2 baseline")
141
+
142
+ # Test D=8
143
+ acc8 = test_acceptance(engine, tokenizer, 8, "D=8 with random layers 2-7")
144
+
145
+ print(f"\n{'='*60}")
146
+ print(f" D=2 accepted: {acc2}/5")
147
+ print(f" D=8 accepted: {acc8}/5")
148
+ if acc8 > acc2 + 2:
149
+ print(f" WARNING: D=8 significantly better than D=2 — investigate!")
150
+ elif acc2 <= 2 and acc8 <= 2:
151
+ print(f" EXPECTED: Both D=2 and D=8 have low acceptance (undertrained)")
152
+ print(f"{'='*60}")
FireEcho Engine/debug_bisect.log ADDED
@@ -0,0 +1,78 @@
1
+ ============================================================
2
+ Training Flow Bisection
3
+ ============================================================
4
+
5
+ [Step 1] load_engine(max_seq_len=4096)...
6
+ [Auto-detect] Qwen3-Omni MoE thinker (30.5B total, ~3.3B active)
7
+ [FireEcho] Loading /run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct...
8
+ [FireEcho] AutoConfig failed ('Qwen3OmniMoeTalkerCodePredictorConfig' object has no attribute 'use_sliding_window'), loading config.json directly
9
+ Qwen3-Omni: will stream-load from 15 shards
10
+ [Qwen3 Streaming] Loaded shard index: 28010 keys across 15 shards
11
+ [Qwen3 Streaming] Building engine skeleton...
12
+ [Qwen3 Streaming] Global params on GPU: 1.2 GB
13
+ Layer 4/48: 393 weights, VRAM 2.8 GB, CPU 1.4 GB
14
+ Layer 8/48: 393 weights, VRAM 4.3 GB, CPU 1.6 GB
15
+ Layer 12/48: 393 weights, VRAM 5.8 GB, CPU 1.7 GB
16
+ Layer 16/48: 393 weights, VRAM 7.4 GB, CPU 1.9 GB
17
+ Layer 20/48: 393 weights, VRAM 8.9 GB, CPU 2.0 GB
18
+ Layer 24/48: 393 weights, VRAM 10.4 GB, CPU 2.2 GB
19
+ Layer 28/48: 393 weights, VRAM 11.9 GB, CPU 2.3 GB
20
+ Layer 32/48: 393 weights, VRAM 13.5 GB, CPU 2.5 GB
21
+ Layer 36/48: 393 weights, VRAM 15.0 GB, CPU 2.6 GB
22
+ Layer 40/48: 393 weights, VRAM 16.5 GB, CPU 2.8 GB
23
+ Layer 44/48: 393 weights, VRAM 18.0 GB, CPU 2.9 GB
24
+ Layer 48/48: 393 weights, VRAM 19.6 GB, CPU 3.1 GB
25
+ [Qwen3 Streaming] Final VRAM: 19.6 GB (FP4 quantized)
26
+ [Qwen3 Streaming] Done: 1571.8M params, 18867 weights loaded
27
+ Total params: 1.57B
28
+ Frozen params: 1.54B (base model, FP4)
29
+ Trainable params: 30.2M (Hebbian only)
30
+ Traceback (most recent call last):
31
+ File "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/debug_bisect.py", line 43, in <module>
32
+ check(engine, tokenizer, "after load")
33
+ File "/run/media/echo/Echo/ECHO/.venv_infer312/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
34
+ return func(*args, **kwargs)
35
+ ^^^^^^^^^^^^^^^^^^^^^
36
+ File "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/debug_bisect.py", line 23, in check
37
+ logits = engine.forward(ids, use_cache=True, position=0)
38
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
39
+ File "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/fireecho_kernel.py", line 9964, in forward
40
+ x = layer(x, self.kv_cache, self._current_seq_id, position, use_cache)
41
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
42
+ File "/run/media/echo/Echo/ECHO/.venv_infer312/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
43
+ return self._call_impl(*args, **kwargs)
44
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
45
+ File "/run/media/echo/Echo/ECHO/.venv_infer312/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
46
+ return forward_call(*args, **kwargs)
47
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
48
+ File "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/fireecho_kernel.py", line 8820, in forward
49
+ x = x + self.ffn(self.norm2(x))
50
+ ^^^^^^^^^^^^^^^^^^^^^^^
51
+ File "/run/media/echo/Echo/ECHO/.venv_infer312/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
52
+ return self._call_impl(*args, **kwargs)
53
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
54
+ File "/run/media/echo/Echo/ECHO/.venv_infer312/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
55
+ return forward_call(*args, **kwargs)
56
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
57
+ File "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/fireecho_kernel.py", line 8710, in forward
58
+ expert_out = self.experts[expert_idx](selected)
59
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
60
+ File "/run/media/echo/Echo/ECHO/.venv_infer312/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
61
+ return self._call_impl(*args, **kwargs)
62
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
63
+ File "/run/media/echo/Echo/ECHO/.venv_infer312/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
64
+ return forward_call(*args, **kwargs)
65
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
66
+ File "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/fireecho_kernel.py", line 7565, in forward
67
+ gate_up = self.gate_up_proj(x) # [*, 2*intermediate]
68
+ ^^^^^^^^^^^^^^^^^^^^
69
+ File "/run/media/echo/Echo/ECHO/.venv_infer312/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
70
+ return self._call_impl(*args, **kwargs)
71
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
72
+ File "/run/media/echo/Echo/ECHO/.venv_infer312/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
73
+ return forward_call(*args, **kwargs)
74
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
75
+ File "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/fireecho_kernel.py", line 7339, in forward
76
+ return F.linear(x, self.weight, self.bias)
77
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
78
+ RuntimeError: size mismatch, got input (5), mat (5x2048), vec (0)
FireEcho Engine/debug_bisect.py ADDED
@@ -0,0 +1,149 @@
1
+ #!/usr/bin/env python3
2
+ """Bisect: exactly which step of the training flow causes NaN.
3
+
4
+ Replicates train_eagle_head.py main() step by step, checking forward() after each.
5
+ FIXED: pack_all_experts + enable_flat_decode BEFORE first forward() call.
6
+ """
7
+ import sys, os, torch, gc, time
8
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
9
+ from hebbian_finetune_demo import load_engine
10
+ from fireecho_kernel import FireEchoEagleHead
11
+
12
+ MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
13
+ EAGLE_CKPT = os.path.join(os.path.dirname(__file__), "eagle_checkpoints", "eagle_best.pt")
14
+ PROMPT = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n"
15
+
16
+
17
+ @torch.no_grad()
18
+ def check(engine, tokenizer, label):
19
+ ids = tokenizer.encode(PROMPT, return_tensors='pt').cuda()
20
+ engine.reset_cache()
21
+ engine._current_seq_id = 0
22
+ if hasattr(engine.kv_cache, '_graph_mode'):
23
+ engine.kv_cache._graph_mode = False
24
+ logits = engine.forward(ids, use_cache=True, position=0)
25
+ torch.cuda.synchronize()
26
+ has_nan = logits.isnan().any().item()
27
+ vram = torch.cuda.memory_allocated() / 1e9
28
+ if has_nan:
29
+ print(f" [{label}] *** NaN DETECTED *** VRAM={vram:.2f}GB")
30
+ else:
31
+ top = logits[:, -1, :].argmax(dim=-1).item()
32
+ print(f" [{label}] OK top={top} ('{tokenizer.decode([top])}') VRAM={vram:.2f}GB")
33
+ return has_nan
34
+
35
+
36
+ @torch.no_grad()
37
+ def check_speculative(engine, tokenizer, label):
38
+ """Test speculative_generate specifically."""
39
+ prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWrite a Python function to check if a number is prime.<|im_end|>\n<|im_start|>assistant\n"
40
+ ids = tokenizer.encode(prompt, return_tensors="pt").cuda()
41
+ engine.reset_cache()
42
+ engine.eval()
43
+ eos_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
44
+ stop = [eos_id] if eos_id else [151645]
45
+ out = engine.speculative_generate(ids, max_new_tokens=20, temperature=0.0, stop_tokens=stop)
46
+ gen_tokens = out[0, ids.shape[1]:].tolist()
47
+ text = tokenizer.decode(gen_tokens, skip_special_tokens=True)
48
+ all_same = len(set(gen_tokens)) <= 1 if gen_tokens else True
49
+ if all_same and len(gen_tokens) > 3:
50
+ print(f" [{label}] *** ALL SAME TOKEN *** = NaN bug! tokens={gen_tokens[:5]}")
51
+ return True
52
+ else:
53
+ print(f" [{label}] OK: '{text[:80]}' ({len(gen_tokens)} tokens, {len(set(gen_tokens))} unique)")
54
+ return False
55
+
56
+
57
+ if __name__ == "__main__":
58
+ print("=" * 60)
59
+ print(" Training Flow Bisection (v2 — fixed)")
60
+ print("=" * 60)
61
+
62
+ # === Step 1: load_engine (matches training exactly) ===
63
+ print("\n[Step 1] load_engine(max_seq_len=4096) + eval + flat_decode + pack...")
64
+ engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=4096, device="cuda")
65
+ engine.eval()
66
+ engine.kv_cache.enable_flat_decode(4096)
67
+ engine.pack_all_experts()
68
+ nan1 = check(engine, tokenizer, "after load+pack+flat")
69
+ if nan1:
70
+ print(" FATAL: NaN at baseline! Cannot continue.")
71
+ sys.exit(1)
72
+
73
+ # === Step 2: enable_eagle D=8 (NO checkpoint, matches training) ===
74
+ print("\n[Step 2] enable_eagle(D=8, no checkpoint)...")
75
+ engine.enable_eagle(
76
+ capture_layers=(8, 24, 47), num_heads=16, ffn_mult=2,
77
+ draft_depth=5, num_head_layers=8)
78
+ nan2 = check(engine, tokenizer, "after eagle D=8 random")
79
+
80
+ # === Step 3: create optimizer ===
81
+ print("\n[Step 3] create AdamW optimizer...")
82
+ eagle = engine.eagle_head
83
+ eagle_params = [p for n, p in eagle.named_parameters()
84
+ if 'lm_head' not in n and p.requires_grad]
85
+ optimizer = torch.optim.AdamW(eagle_params, lr=3e-4, betas=(0.9, 0.95), weight_decay=0.0)
86
+ nan3 = check(engine, tokenizer, "after optimizer")
87
+
88
+ # === Step 4: load_checkpoint (matches training: weights_only=False) ===
89
+ print("\n[Step 4] load_checkpoint...")
90
+ if os.path.exists(EAGLE_CKPT):
91
+ ckpt = torch.load(EAGLE_CKPT, weights_only=False, map_location='cuda')
92
+ sd = ckpt.get('eagle_head', ckpt)
93
+ is_legacy = any(k.startswith('norm1.') or k.startswith('q_proj.') for k in sd)
94
+ if is_legacy:
95
+ eagle.load_legacy_checkpoint(sd)
96
+ print(" Loaded legacy checkpoint")
97
+ else:
98
+ eagle.load_state_dict(sd, strict=False)
99
+ print(" Loaded new-format checkpoint")
100
+ if 'optimizer' in ckpt:
101
+ try:
102
+ optimizer.load_state_dict(ckpt['optimizer'])
103
+ print(" Loaded optimizer state")
104
+ except (ValueError, KeyError) as e:
105
+ print(f" Optimizer mismatch: {e}")
106
+ step = ckpt.get('step', 0)
107
+ print(f" Step={step}")
108
+ del ckpt
109
+ torch.cuda.empty_cache()
110
+ else:
111
+ print(" No checkpoint found, using random weights")
112
+ nan4 = check(engine, tokenizer, "after ckpt load")
113
+
114
+ # === Step 5: warmup ===
115
+ print("\n[Step 5] warmup 3x generate()...")
116
+ wids = tokenizer.encode("Hello", return_tensors='pt').cuda()
117
+ for i in range(3):
118
+ out = engine.generate(wids, max_new_tokens=5, temperature=0.0, top_k=0, top_p=1.0)
119
+ text = tokenizer.decode(out[0, wids.shape[1]:], skip_special_tokens=True)
120
+ print(f" Warmup {i}: '{text}'")
121
+ del wids
122
+ nan5 = check(engine, tokenizer, "after warmup")
123
+
124
+ # === Step 6: speculative_generate (the actual eval path) ===
125
+ print("\n[Step 6] speculative_generate()...")
126
+ nan6 = check_speculative(engine, tokenizer, "speculative_generate")
127
+
128
+ # === Summary ===
129
+ print("\n" + "=" * 60)
130
+ print(" BISECTION RESULTS")
131
+ print("=" * 60)
132
+ results = [
133
+ ("Step 1: load+pack+flat", nan1),
134
+ ("Step 2: enable_eagle D=8", nan2),
135
+ ("Step 3: create optimizer", nan3),
136
+ ("Step 4: load checkpoint", nan4),
137
+ ("Step 5: warmup", nan5),
138
+ ("Step 6: speculative_generate", nan6),
139
+ ]
140
+ for name, had_nan in results:
141
+ status = "*** NaN ***" if had_nan else "OK"
142
+ print(f" {name}: {status}")
143
+
144
+ first_fail = next((name for name, nan in results if nan), None)
145
+ if first_fail:
146
+ print(f"\n FIRST FAILURE: {first_fail}")
147
+ else:
148
+ print(f"\n ALL PASSED — no NaN detected!")
149
+ print("=" * 60)
FireEcho Engine/debug_d8_isolate.log ADDED
@@ -0,0 +1,79 @@
1
+ ============================================================
2
+ D=8 NaN Isolation
3
+ ============================================================
4
+
5
+ [1] Loading model...
6
+ [Auto-detect] Qwen3-Omni MoE thinker (30.5B total, ~3.3B active)
7
+ [FireEcho] Loading /run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct...
8
+ [FireEcho] AutoConfig failed ('Qwen3OmniMoeTalkerCodePredictorConfig' object has no attribute 'use_sliding_window'), loading config.json directly
9
+ Qwen3-Omni: will stream-load from 15 shards
10
+ [Qwen3 Streaming] Loaded shard index: 28010 keys across 15 shards
11
+ [Qwen3 Streaming] Building engine skeleton...
12
+ [Qwen3 Streaming] Global params on GPU: 1.2 GB
13
+ Layer 4/48: 393 weights, VRAM 2.8 GB, CPU 1.4 GB
14
+ Layer 8/48: 393 weights, VRAM 4.3 GB, CPU 1.6 GB
15
+ Layer 12/48: 393 weights, VRAM 5.8 GB, CPU 1.7 GB
16
+ Layer 16/48: 393 weights, VRAM 7.4 GB, CPU 1.9 GB
17
+ Layer 20/48: 393 weights, VRAM 8.9 GB, CPU 2.0 GB
18
+ Layer 24/48: 393 weights, VRAM 10.4 GB, CPU 2.2 GB
19
+ Layer 28/48: 393 weights, VRAM 11.9 GB, CPU 2.3 GB
20
+ Layer 32/48: 393 weights, VRAM 13.5 GB, CPU 2.5 GB
21
+ Layer 36/48: 393 weights, VRAM 15.0 GB, CPU 2.6 GB
22
+ Layer 40/48: 393 weights, VRAM 16.5 GB, CPU 2.8 GB
23
+ Layer 44/48: 393 weights, VRAM 18.0 GB, CPU 2.9 GB
24
+ Layer 48/48: 393 weights, VRAM 19.6 GB, CPU 3.1 GB
25
+ [Qwen3 Streaming] Final VRAM: 19.6 GB (FP4 quantized)
26
+ [Qwen3 Streaming] Done: 1571.8M params, 18867 weights loaded
27
+ Total params: 1.57B
28
+ Frozen params: 1.54B (base model, FP4)
29
+ Trainable params: 30.2M (Hebbian only)
30
+ [Packed MoE] 48 layers packed (6144 experts → contiguous)
31
+ [Flat KV] Enabled: 4096 tokens, 403 MB
32
+
33
+ [2] Warmup...
34
+ VRAM baseline: 19.96 GB
35
+
36
+ [3] Baseline (no eagle)...
37
+ [baseline] OK — top=13048 ('Hi')
38
+
39
+ [4] D=2 eagle head...
40
+ [EAGLE] Loaded legacy D=2 checkpoint. 0 new layer params initialized randomly.
41
+ [EAGLE-3] Draft head: D=2, 104.9M params, 210 MB, capture layers [8, 24, 47] + Hebbian memory
42
+ VRAM: 20.17 GB (+0.21)
43
+ [D=2] OK — top=13048 ('Hi')
44
+
45
+ [5] D=8 eagle head (random init, no checkpoint)...
46
+ [FE-XT] Draft head: D=8, 356.5M params, 713 MB, capture layers [8, 24, 47] + Hebbian memory
47
+ VRAM: 20.67 GB (+0.72)
48
+ [D=8 random] OK — top=13048 ('Hi')
49
+
50
+ [6] D=8 eagle head (with checkpoint)...
51
+ [EAGLE] Loaded legacy D=2 checkpoint. 54 new layer params initialized randomly.
52
+ [FE-XT] Draft head: D=8, 356.5M params, 713 MB, capture layers [8, 24, 47] + Hebbian memory
53
+ VRAM: 20.67 GB (+0.72)
54
+ [D=8 with ckpt] OK — top=13048 ('Hi')
55
+
56
+ [7] D=8 eagle head (allocated, NOT registered on engine)...
57
+ VRAM: 20.67 GB (+0.72)
58
+ [D=8 unregistered] OK — top=13048 ('Hi')
59
+
60
+ [8] D=4 eagle head (checkpoint)...
61
+ [EAGLE] Loaded legacy D=2 checkpoint. 18 new layer params initialized randomly.
62
+ [FE-XT] Draft head: D=4, 188.8M params, 378 MB, capture layers [8, 24, 47] + Hebbian memory
63
+ VRAM: 20.34 GB (+0.38)
64
+ [D=4] OK — top=13048 ('Hi')
65
+
66
+ [9] D=8 eagle head, but _eagle_enabled=False...
67
+ [EAGLE] Loaded legacy D=2 checkpoint. 54 new layer params initialized randomly.
68
+ [FE-XT] Draft head: D=8, 356.5M params, 713 MB, capture layers [8, 24, 47] + Hebbian memory
69
+ VRAM: 20.67 GB (+0.72)
70
+ [D=8 flag OFF] OK — top=13048 ('Hi')
71
+
72
+ ============================================================
73
+ RESULTS
74
+ ============================================================
75
+ D=8 random: OK
76
+ D=8 with ckpt: OK
77
+ D=8 unregistered: OK
78
+ D=4: OK
79
+ D=8 flag OFF: OK
FireEcho Engine/debug_d8_isolate.py ADDED
@@ -0,0 +1,156 @@
1
+ #!/usr/bin/env python3
2
+ """Isolate exactly what about D=8 causes NaN.
3
+
4
+ Tests:
5
+ 1. D=2 eagle head → forward → should be OK
6
+ 2. D=8 eagle head (random, no ckpt) → forward → is NaN from VRAM pressure?
7
+ 3. D=8 eagle head (random, NOT assigned to engine) → forward → is NaN from registration?
8
+ 4. D=8 allocated but eagle_enabled=False → forward → is NaN from .to() side effect?
9
+ """
10
+ import sys, os, torch, gc
11
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
12
+ from hebbian_finetune_demo import load_engine
13
+ from fireecho_kernel import FireEchoEagleHead
14
+
15
+ MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
16
+ EAGLE_CKPT = os.path.join(os.path.dirname(__file__), "eagle_checkpoints", "eagle_best.pt")
17
+
18
+ PROMPT = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n"
19
+
20
+
21
+ @torch.no_grad()
22
+ def check(engine, tokenizer, label):
23
+ ids = tokenizer.encode(PROMPT, return_tensors='pt').cuda()
24
+ engine.reset_cache()
25
+ engine._current_seq_id = 0
26
+ if hasattr(engine.kv_cache, '_graph_mode'):
27
+ engine.kv_cache._graph_mode = False
28
+ logits = engine.forward(ids, use_cache=True, position=0)
29
+ torch.cuda.synchronize()
30
+ has_nan = logits.isnan().any().item()
31
+ if has_nan:
32
+ print(f" [{label}] NaN DETECTED")
33
+ else:
34
+ top = logits[:, -1, :].argmax(dim=-1).item()
35
+ print(f" [{label}] OK — top={top} ('{tokenizer.decode([top])}')")
36
+ return has_nan
37
+
38
+
39
+ if __name__ == "__main__":
40
+ print("=" * 60)
41
+ print(" D=8 NaN Isolation")
42
+ print("=" * 60)
43
+
44
+ print("\n[1] Loading model...")
45
+ engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=4096, device="cuda")
46
+ engine.pack_all_experts()
47
+ engine.kv_cache.enable_flat_decode()
48
+ engine.eval()
49
+
50
+ # Warmup
51
+ print("\n[2] Warmup...")
52
+ wids = tokenizer.encode("Hello", return_tensors='pt').cuda()
53
+ for _ in range(3):
54
+ engine.generate(wids, max_new_tokens=5, temperature=0.0, top_k=0, top_p=1.0)
55
+
56
+ vram_base = torch.cuda.memory_allocated() / 1e9
57
+ print(f" VRAM baseline: {vram_base:.2f} GB")
58
+
59
+ # Test 1: Baseline (no eagle)
60
+ print("\n[3] Baseline (no eagle)...")
61
+ check(engine, tokenizer, "baseline")
62
+
63
+ # Test 2: D=2 eagle head (should work)
64
+ print("\n[4] D=2 eagle head...")
65
+ engine.enable_eagle(capture_layers=(8, 24, 47), num_heads=16, ffn_mult=2,
66
+ num_head_layers=2, checkpoint_path=EAGLE_CKPT)
67
+ vram = torch.cuda.memory_allocated() / 1e9
68
+ print(f" VRAM: {vram:.2f} GB (+{vram - vram_base:.2f})")
69
+ check(engine, tokenizer, "D=2")
70
+ # Cleanup
71
+ del engine.eagle_head
72
+ engine._eagle_enabled = False
73
+ engine._eagle_hidden_states = {}
74
+ torch.cuda.empty_cache()
75
+ gc.collect()
76
+
77
+ # Test 3: D=8 eagle head (NO checkpoint, random init)
78
+ print("\n[5] D=8 eagle head (random init, no checkpoint)...")
79
+ engine.enable_eagle(capture_layers=(8, 24, 47), num_heads=16, ffn_mult=2,
80
+ num_head_layers=8) # no checkpoint_path
81
+ vram = torch.cuda.memory_allocated() / 1e9
82
+ print(f" VRAM: {vram:.2f} GB (+{vram - vram_base:.2f})")
83
+ nan_d8_random = check(engine, tokenizer, "D=8 random")
84
+ # Cleanup
85
+ del engine.eagle_head
86
+ engine._eagle_enabled = False
87
+ engine._eagle_hidden_states = {}
88
+ torch.cuda.empty_cache()
89
+ gc.collect()
90
+
91
+ # Test 4: D=8 eagle head WITH checkpoint
92
+ print("\n[6] D=8 eagle head (with checkpoint)...")
93
+ engine.enable_eagle(capture_layers=(8, 24, 47), num_heads=16, ffn_mult=2,
94
+ num_head_layers=8, checkpoint_path=EAGLE_CKPT)
95
+ vram = torch.cuda.memory_allocated() / 1e9
96
+ print(f" VRAM: {vram:.2f} GB (+{vram - vram_base:.2f})")
97
+ nan_d8_ckpt = check(engine, tokenizer, "D=8 with ckpt")
98
+ # Cleanup
99
+ del engine.eagle_head
100
+ engine._eagle_enabled = False
101
+ engine._eagle_hidden_states = {}
102
+ torch.cuda.empty_cache()
103
+ gc.collect()
104
+
105
+ # Test 5: D=8 eagle head allocated but NOT registered as submodule
106
+ print("\n[7] D=8 eagle head (allocated, NOT registered on engine)...")
107
+ head_ext = FireEchoEagleHead(
108
+ dim=config.dim, num_capture_layers=3,
109
+ num_heads=16, ffn_mult=2, num_layers=8,
110
+ ).to(dtype=torch.bfloat16, device='cuda')
111
+ # Do NOT assign to engine — keep as local variable
112
+ engine._eagle_enabled = True
113
+ engine._eagle_capture_set = {8, 24, 47}
114
+ engine._eagle_capture_layers = [8, 24, 47]
115
+ engine._eagle_hidden_states = {}
116
+ vram = torch.cuda.memory_allocated() / 1e9
117
+ print(f" VRAM: {vram:.2f} GB (+{vram - vram_base:.2f})")
118
+ nan_d8_unreg = check(engine, tokenizer, "D=8 unregistered")
119
+ # Cleanup
120
+ del head_ext
121
+ engine._eagle_enabled = False
122
+ torch.cuda.empty_cache()
123
+ gc.collect()
124
+
125
+ # Test 6: D=4 eagle head (between D=2 and D=8)
126
+ print("\n[8] D=4 eagle head (checkpoint)...")
127
+ engine.enable_eagle(capture_layers=(8, 24, 47), num_heads=16, ffn_mult=2,
128
+ num_head_layers=4, checkpoint_path=EAGLE_CKPT)
129
+ vram = torch.cuda.memory_allocated() / 1e9
130
+ print(f" VRAM: {vram:.2f} GB (+{vram - vram_base:.2f})")
131
+ nan_d4 = check(engine, tokenizer, "D=4")
132
+ # Cleanup
133
+ del engine.eagle_head
134
+ engine._eagle_enabled = False
135
+ engine._eagle_hidden_states = {}
136
+ torch.cuda.empty_cache()
137
+ gc.collect()
138
+
139
+ # Test 7: D=8 but eagle_enabled=False (head exists but flag off)
140
+ print("\n[9] D=8 eagle head, but _eagle_enabled=False...")
141
+ engine.enable_eagle(capture_layers=(8, 24, 47), num_heads=16, ffn_mult=2,
142
+ num_head_layers=8, checkpoint_path=EAGLE_CKPT)
143
+ engine._eagle_enabled = False # disable the flag
144
+ vram = torch.cuda.memory_allocated() / 1e9
145
+ print(f" VRAM: {vram:.2f} GB (+{vram - vram_base:.2f})")
146
+ nan_d8_flagoff = check(engine, tokenizer, "D=8 flag OFF")
147
+
148
+ # Summary
149
+ print(f"\n{'='*60}")
150
+ print(" RESULTS")
151
+ print(f"{'='*60}")
152
+ print(f" D=8 random: {'NaN' if nan_d8_random else 'OK'}")
153
+ print(f" D=8 with ckpt: {'NaN' if nan_d8_ckpt else 'OK'}")
154
+ print(f" D=8 unregistered: {'NaN' if nan_d8_unreg else 'OK'}")
155
+ print(f" D=4: {'NaN' if nan_d4 else 'OK'}")
156
+ print(f" D=8 flag OFF: {'NaN' if nan_d8_flagoff else 'OK'}")
FireEcho Engine/debug_eval_flow.log ADDED
@@ -0,0 +1,75 @@
1
+ ============================================================
2
+ Eval Flow Test (replicates training eval)
3
+ ============================================================
4
+
5
+ [1] Loading model...
6
+ [Auto-detect] Qwen3-Omni MoE thinker (30.5B total, ~3.3B active)
7
+ [FireEcho] Loading /run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct...
8
+ [FireEcho] AutoConfig failed ('Qwen3OmniMoeTalkerCodePredictorConfig' object has no attribute 'use_sliding_window'), loading config.json directly
9
+ Qwen3-Omni: will stream-load from 15 shards
10
+ [Qwen3 Streaming] Loaded shard index: 28010 keys across 15 shards
11
+ [Qwen3 Streaming] Building engine skeleton...
12
+ [Qwen3 Streaming] Global params on GPU: 1.2 GB
13
+ Layer 4/48: 393 weights, VRAM 2.8 GB, CPU 1.4 GB
14
+ Layer 8/48: 393 weights, VRAM 4.3 GB, CPU 1.6 GB
15
+ Layer 12/48: 393 weights, VRAM 5.8 GB, CPU 1.7 GB
16
+ Layer 16/48: 393 weights, VRAM 7.3 GB, CPU 1.9 GB
17
+ Layer 20/48: 393 weights, VRAM 8.9 GB, CPU 2.0 GB
18
+ Layer 24/48: 393 weights, VRAM 10.4 GB, CPU 2.2 GB
19
+ Layer 28/48: 393 weights, VRAM 11.9 GB, CPU 2.3 GB
20
+ Layer 32/48: 393 weights, VRAM 13.4 GB, CPU 2.5 GB
21
+ Layer 36/48: 393 weights, VRAM 15.0 GB, CPU 2.6 GB
22
+ Layer 40/48: 393 weights, VRAM 16.5 GB, CPU 2.8 GB
23
+ Layer 44/48: 393 weights, VRAM 18.0 GB, CPU 2.9 GB
24
+ Layer 48/48: 393 weights, VRAM 19.5 GB, CPU 3.1 GB
25
+ [Qwen3 Streaming] Final VRAM: 19.5 GB (FP4 quantized)
26
+ [Qwen3 Streaming] Done: 1571.8M params, 18867 weights loaded
27
+ Total params: 1.57B
28
+ Frozen params: 1.54B (base model, FP4)
29
+ Trainable params: 30.2M (Hebbian only)
30
+ [Flat KV] Enabled: 4096 tokens, 403 MB
31
+ [Packed MoE] 48 layers packed (6144 experts → contiguous)
32
+
33
+ [2] Enabling EAGLE (no checkpoint)...
34
+ [FE-XT] Draft head: D=8, 356.5M params, 713 MB, capture layers [8, 24, 47] + Hebbian memory
35
+
36
+ [3] Loading checkpoint separately (like training script)...
37
+ [EAGLE] Loaded legacy D=2 checkpoint. 54 new layer params initialized randomly.
38
+ Loaded checkpoint (step 4000)
39
+ VRAM: 21.25 GB
40
+
41
+ [4a] Running manual speculation test WITHOUT warmup...
42
+
43
+ --- Manual speculation test ---
44
+ Prefill logits: has_nan=True
45
+ FATAL: NaN in prefill! Cannot continue.
46
+
47
+ [4b] Warmup (3x generate)...
48
+ Warmup done
49
+
50
+ [4c] Running manual speculation test AFTER warmup...
51
+
52
+ --- Manual speculation test ---
53
+ Prefill logits: has_nan=True
54
+ FATAL: NaN in prefill! Cannot continue.
55
+
56
+ [5] Running full speculative_generate eval...
57
+ [EAGLE-3] 9 rounds, 43 drafted, 43 accepted (100%), avg 4.8/round
58
+
59
+ Prompt 0: 61 tokens, 21.3 tok/s
60
+ Output: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
61
+ WARNING: All tokens are the same (0) — likely NaN bug!
62
+ [EAGLE-3] 9 rounds, 43 drafted, 43 accepted (100%), avg 4.8/round
63
+
64
+ Prompt 1: 61 tokens, 32.5 tok/s
65
+ Output: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
66
+ WARNING: All tokens are the same (0) — likely NaN bug!
67
+ [EAGLE-3] 9 rounds, 43 drafted, 43 accepted (100%), avg 4.8/round
68
+
69
+ Prompt 2: 61 tokens, 31.7 tok/s
70
+ Output: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
71
+ WARNING: All tokens are the same (0) — likely NaN bug!
72
+
73
+ ============================================================
74
+ Done
75
+ ============================================================
FireEcho Engine/debug_eval_flow.py ADDED
@@ -0,0 +1,186 @@
1
+ #!/usr/bin/env python3
2
+ """Replicate the exact training eval flow to verify acceptance rate.
3
+
4
+ Matches train_eagle_head.py: enable_eagle (no ckpt), load_checkpoint, evaluate.
5
+ """
6
+ import sys, os, time, torch
7
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
8
+ from hebbian_finetune_demo import load_engine
9
+
10
+ MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
11
+ EAGLE_CKPT = os.path.join(os.path.dirname(__file__), "eagle_checkpoints", "eagle_best.pt")
12
+
13
+ EVAL_PROMPTS = [
14
+ "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWrite a Python function to check if a number is prime.<|im_end|>\n<|im_start|>assistant\n",
15
+ "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nExplain what a neural network is in simple terms.<|im_end|>\n<|im_start|>assistant\n",
16
+ "<|im_start|>system\nYou are a helpful coding assistant.<|im_end|>\n<|im_start|>user\nWrite a binary search function in Python.<|im_end|>\n<|im_start|>assistant\n",
17
+ ]
18
+
19
+
20
+ @torch.no_grad()
21
+ def evaluate_verbose(engine, tokenizer, max_new=60):
22
+ """Run speculative_generate and print acceptance + output for each prompt."""
23
+ engine.eval()
24
+ eos_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
25
+ stop_tokens = [eos_id] if eos_id is not None else [151645]
26
+
27
+ for pi, prompt in enumerate(EVAL_PROMPTS):
28
+ ids = tokenizer.encode(prompt, return_tensors="pt").to("cuda")
29
+ engine.reset_cache()
30
+
31
+ t0 = time.perf_counter()
32
+ out = engine.speculative_generate(
33
+ ids, max_new_tokens=max_new, temperature=0.0,
34
+ stop_tokens=stop_tokens)
35
+ torch.cuda.synchronize()
36
+ t1 = time.perf_counter()
37
+
38
+ gen_len = out.shape[1] - ids.shape[1]
39
+ text = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
40
+ tps = gen_len / max(t1 - t0, 1e-6)
41
+ print(f"\n Prompt {pi}: {gen_len} tokens, {tps:.1f} tok/s")
42
+ print(f" Output: {text[:150]}")
43
+
44
+ # Check for all-same-token output (sign of NaN)
45
+ gen_ids = out[0, ids.shape[1]:].tolist()
46
+ if len(set(gen_ids)) == 1 and len(gen_ids) > 5:
47
+ print(f" WARNING: All tokens are the same ({gen_ids[0]}) — likely NaN bug!")
48
+
49
+
50
+ @torch.no_grad()
51
+ def test_manual_speculation(engine, tokenizer):
52
+ """Manually run one round of draft+verify and check each step."""
53
+ print("\n--- Manual speculation test ---")
54
+ engine.eval()
55
+ prompt = EVAL_PROMPTS[0]
56
+ ids = tokenizer.encode(prompt, return_tensors="pt").cuda()
57
+ prompt_len = ids.shape[1]
58
+
59
+ engine.reset_cache()
60
+ engine._current_seq_id = 0
61
+ if hasattr(engine.kv_cache, '_graph_mode'):
62
+ engine.kv_cache._graph_mode = False
63
+
64
+ # Prefill
65
+ logits = engine.forward(ids, use_cache=True, position=0)
66
+ has_nan = logits.isnan().any().item()
67
+ print(f" Prefill logits: has_nan={has_nan}")
68
+ if has_nan:
69
+ print(" FATAL: NaN in prefill! Cannot continue.")
70
+ return
71
+
72
+ # Decode first token
73
+ next_token = logits[:, -1:, :].argmax(dim=-1)
74
+ print(f" First token: {next_token.item()} = '{tokenizer.decode([next_token.item()])}'")
75
+
76
+ # Forward it
77
+ logits = engine.forward(next_token, use_cache=True, position=prompt_len)
78
+ has_nan = logits.isnan().any().item()
79
+ print(f" Post-first-token logits: has_nan={has_nan}")
80
+ if has_nan:
81
+ print(" FATAL: NaN after first token forward!")
82
+ return
83
+
84
+ main_pred = logits[:, -1, :].argmax(dim=-1).item()
85
+ print(f" Target predicts next: {main_pred} = '{tokenizer.decode([main_pred])}'")
86
+
87
+ # Draft 5 tokens
88
+ features = [engine._eagle_hidden_states[l] for l in engine._eagle_capture_layers]
89
+ for li, f in zip(engine._eagle_capture_layers, features):
90
+ print(f" Feature L{li}: has_nan={f.isnan().any().item()}, "
91
+ f"shape={list(f.shape)}")
92
+
93
+ memory_ctx = engine._get_eagle_memory_context(
94
+ engine._eagle_hidden_states[engine._eagle_capture_layers[-1]])
95
+
96
+ dt, dl = engine.eagle_head.generate_draft(
97
+ features, next_token, engine.embed, depth=5, memory_context=memory_ctx)
98
+
99
+ print(f"\n Draft tokens:")
100
+ for i, t in enumerate(dt):
101
+ print(f" [{i}] {t.item()} = '{tokenizer.decode([t.item()])}'")
102
+
103
+ # Verify
104
+ draft_input = torch.cat(dt, dim=1)
105
+ current_pos = prompt_len + 1
106
+ verify_logits = engine.forward(draft_input, use_cache=True, position=current_pos)
107
+ has_nan = verify_logits.isnan().any().item()
108
+ print(f"\n Verify logits: has_nan={has_nan}")
109
+
110
+ accepted = 0
111
+ if dt[0].item() == main_pred:
112
+ accepted = 1
113
+ for i in range(1, len(dt)):
114
+ target_pred = verify_logits[:, i - 1, :].argmax(dim=-1).item()
115
+ match = "MATCH" if dt[i].item() == target_pred else "MISS"
116
+ print(f" [{i}] draft={dt[i].item()} target={target_pred} → {match}")
117
+ if dt[i].item() == target_pred:
118
+ accepted += 1
119
+ else:
120
+ break
121
+ else:
122
+ print(f" [0] MISS: draft={dt[0].item()} target={main_pred}")
123
+
124
+ print(f" Accepted: {accepted}/{len(dt)}")
125
+
126
+
127
+ if __name__ == "__main__":
128
+ print("=" * 60)
129
+ print(" Eval Flow Test (replicates training eval)")
130
+ print("=" * 60)
131
+
132
+ # === Match training script flow exactly ===
133
+ print("\n[1] Loading model...")
134
+ engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=512, device="cuda")
135
+ engine.eval()
136
+ engine.kv_cache.enable_flat_decode(4096)
137
+ engine.pack_all_experts()
138
+
139
+ print("\n[2] Enabling EAGLE (no checkpoint)...")
140
+ engine.enable_eagle(
141
+ capture_layers=(8, 24, 47),
142
+ num_heads=16, ffn_mult=2,
143
+ draft_depth=5, num_head_layers=8)
144
+
145
+ print("\n[3] Loading checkpoint separately (like training script)...")
146
+ if os.path.exists(EAGLE_CKPT):
147
+ ckpt = torch.load(EAGLE_CKPT, weights_only=False, map_location='cuda')
148
+ sd = ckpt.get('eagle_head', ckpt)
149
+ is_legacy = any(k.startswith('norm1.') or k.startswith('q_proj.') for k in sd)
150
+ if is_legacy:
151
+ engine.eagle_head.load_legacy_checkpoint(sd)
152
+ else:
153
+ engine.eagle_head.load_state_dict(sd, strict=False)
154
+ print(f" Loaded checkpoint (step {ckpt.get('step', '?')})")
155
+ else:
156
+ print(f" No checkpoint found, using random init")
157
+
158
+ # Setup optimizer (like training script)
159
+ eagle_params = [p for n, p in engine.eagle_head.named_parameters()
160
+ if 'lm_head' not in n and p.requires_grad]
161
+ optimizer = torch.optim.AdamW(eagle_params, lr=3e-4, betas=(0.9, 0.95))
162
+
163
+ vram = torch.cuda.memory_allocated() / 1e9
164
+ print(f" VRAM: {vram:.2f} GB")
165
+
166
+ # Test WITHOUT warmup first
167
+ print("\n[4a] Running manual speculation test WITHOUT warmup...")
168
+ test_manual_speculation(engine, tokenizer)
169
+
170
+ # Now do warmup
171
+ print("\n[4b] Warmup (3x generate)...")
172
+ warmup_ids = tokenizer.encode("Hello", return_tensors='pt').cuda()
173
+ for _ in range(3):
174
+ engine.generate(warmup_ids, max_new_tokens=5, temperature=0.0, top_k=0, top_p=1.0)
175
+ print(" Warmup done")
176
+
177
+ # Test AFTER warmup
178
+ print("\n[4c] Running manual speculation test AFTER warmup...")
179
+ test_manual_speculation(engine, tokenizer)
180
+
181
+ print("\n[5] Running full speculative_generate eval...")
182
+ evaluate_verbose(engine, tokenizer)
183
+
184
+ print("\n" + "=" * 60)
185
+ print(" Done")
186
+ print("=" * 60)
FireEcho Engine/debug_nan_isolate.log ADDED
@@ -0,0 +1,57 @@
1
+ ============================================================
2
+ NaN Isolation Test
3
+ ============================================================
4
+
5
+ [1/6] Loading model...
6
+ [Auto-detect] Qwen3-Omni MoE thinker (30.5B total, ~3.3B active)
7
+ [FireEcho] Loading /run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct...
8
+ [FireEcho] AutoConfig failed ('Qwen3OmniMoeTalkerCodePredictorConfig' object has no attribute 'use_sliding_window'), loading config.json directly
9
+ Qwen3-Omni: will stream-load from 15 shards
10
+ [Qwen3 Streaming] Loaded shard index: 28010 keys across 15 shards
11
+ [Qwen3 Streaming] Building engine skeleton...
12
+ [Qwen3 Streaming] Global params on GPU: 1.2 GB
13
+ Layer 4/48: 393 weights, VRAM 2.8 GB, CPU 1.4 GB
14
+ Layer 8/48: 393 weights, VRAM 4.3 GB, CPU 1.6 GB
15
+ Layer 12/48: 393 weights, VRAM 5.8 GB, CPU 1.7 GB
16
+ Layer 16/48: 393 weights, VRAM 7.4 GB, CPU 1.9 GB
17
+ Layer 20/48: 393 weights, VRAM 8.9 GB, CPU 2.0 GB
18
+ Layer 24/48: 393 weights, VRAM 10.4 GB, CPU 2.2 GB
19
+ Layer 28/48: 393 weights, VRAM 11.9 GB, CPU 2.3 GB
20
+ Layer 32/48: 393 weights, VRAM 13.5 GB, CPU 2.5 GB
21
+ Layer 36/48: 393 weights, VRAM 15.0 GB, CPU 2.6 GB
22
+ Layer 40/48: 393 weights, VRAM 16.5 GB, CPU 2.8 GB
23
+ Layer 44/48: 393 weights, VRAM 18.0 GB, CPU 2.9 GB
24
+ Layer 48/48: 393 weights, VRAM 19.6 GB, CPU 3.1 GB
25
+ [Qwen3 Streaming] Final VRAM: 19.6 GB (FP4 quantized)
26
+ [Qwen3 Streaming] Done: 1571.8M params, 18867 weights loaded
27
+ Total params: 1.57B
28
+ Frozen params: 1.54B (base model, FP4)
29
+ Trainable params: 30.2M (Hebbian only)
30
+ [Packed MoE] 48 layers packed (6144 experts → contiguous)
31
+ [Flat KV] Enabled: 4096 tokens, 403 MB
32
+ VRAM after load: 19.95 GB
33
+
34
+ [2/6] Warmup...
35
+
36
+ [3/6] Test BEFORE enable_eagle()...
37
+ [before eagle] OK — top token=13048 ('Hi'), max=26.88
38
+
39
+ [4/6] Test: just set _eagle_enabled=True (no head creation)...
40
+ [flag only] OK — top token=13048 ('Hi'), max=26.88
41
+
42
+ [5/6] Test: create eagle head + assign as submodule...
43
+ VRAM after eagle head: 20.17 GB (+0.22 GB)
44
+ [with head (no ckpt)] OK — top token=13048 ('Hi'), max=26.88
45
+
46
+ [6/6] Test: load checkpoint into eagle head...
47
+ [EAGLE] Loaded legacy D=2 checkpoint. 0 new layer params initialized randomly.
48
+ [with ckpt] OK — top token=13048 ('Hi'), max=26.88
49
+
50
+ ============================================================
51
+ RESULTS
52
+ ============================================================
53
+ Before eagle: OK
54
+ Flag only: OK
55
+ With head (no ckpt): OK
56
+ With checkpoint: OK
57
+ All tests passed — no NaN detected!
FireEcho Engine/debug_nan_isolate.py ADDED
@@ -0,0 +1,174 @@
1
+ #!/usr/bin/env python3
2
+ """Isolate exactly which step of enable_eagle() causes NaN in target model.
3
+
4
+ Tests each sub-step of enable_eagle() independently to find the culprit.
5
+ Also checks per-layer output to find where NaN first appears.
6
+ """
7
+ import sys, os, torch, gc
8
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
9
+ from hebbian_finetune_demo import load_engine
10
+ from fireecho_kernel import FireEchoEagleHead
11
+
12
+ MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
13
+ EAGLE_CKPT = os.path.join(os.path.dirname(__file__), "eagle_checkpoints", "eagle_best.pt")
14
+
15
+ PROMPT = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n"
16
+
17
+
18
+ @torch.no_grad()
19
+ def check_forward(engine, tokenizer, label):
20
+ """Run a forward pass and report NaN status."""
21
+ torch.cuda.synchronize()
22
+ ids = tokenizer.encode(PROMPT, return_tensors='pt').cuda()
23
+ engine.reset_cache()
24
+ engine._current_seq_id = 0
25
+ if hasattr(engine.kv_cache, '_graph_mode'):
26
+ engine.kv_cache._graph_mode = False
27
+
28
+ logits = engine.forward(ids, use_cache=True, position=0)
29
+ torch.cuda.synchronize()
30
+
31
+ has_nan = logits.isnan().any().item()
32
+ last = logits[:, -1, :]
33
+ if has_nan:
34
+ print(f" [{label}] NaN DETECTED — logits all NaN")
35
+ else:
36
+ top_id = last.argmax(dim=-1).item()
37
+ top_val = last.max().item()
38
+ print(f" [{label}] OK — top token={top_id} "
39
+ f"('{tokenizer.decode([top_id])}'), max={top_val:.2f}")
40
+ return has_nan
41
+
42
+
43
+ @torch.no_grad()
44
+ def check_per_layer(engine, tokenizer, label):
45
+ """Run forward pass manually layer-by-layer, check NaN at each layer."""
46
+ ids = tokenizer.encode(PROMPT, return_tensors='pt').cuda()
47
+ engine.reset_cache()
48
+ engine._current_seq_id = 0
49
+ if hasattr(engine.kv_cache, '_graph_mode'):
50
+ engine.kv_cache._graph_mode = False
51
+
52
+ x = engine.embed(ids)
53
+ has_nan = x.isnan().any().item()
54
+ print(f" [{label}] After embed: has_nan={has_nan}")
55
+ if has_nan:
56
+ return
57
+
58
+ first_nan_layer = None
59
+ for i, layer in enumerate(engine.layers):
60
+ x = layer(x, engine.kv_cache, engine._current_seq_id, 0, True)
61
+ has_nan = x.isnan().any().item()
62
+ if has_nan and first_nan_layer is None:
63
+ first_nan_layer = i
64
+ print(f" [{label}] FIRST NaN at layer {i} !!!")
65
+ # Check sub-components
66
+ break
67
+
68
+ if first_nan_layer is None:
69
+ # Check norm + lm_head
70
+ x = engine.norm(x)
71
+ has_nan = x.isnan().any().item()
72
+ print(f" [{label}] After norm: has_nan={has_nan}")
73
+ logits = engine.lm_head(x)
74
+ has_nan = logits.isnan().any().item()
75
+ print(f" [{label}] After lm_head: has_nan={has_nan}")
76
+ if not has_nan:
77
+ top_id = logits[:, -1, :].argmax(dim=-1).item()
78
+ print(f" [{label}] Top token: {top_id} ('{tokenizer.decode([top_id])}')")
79
+ else:
80
+ print(f" [{label}] NaN starts at layer {first_nan_layer}")
81
+
82
+
83
+ if __name__ == "__main__":
84
+ print("=" * 60)
85
+ print(" NaN Isolation Test")
86
+ print("=" * 60)
87
+
88
+ print("\n[1/6] Loading model...")
89
+ engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=4096, device="cuda")
90
+ engine.pack_all_experts()
91
+ engine.kv_cache.enable_flat_decode()
92
+ engine.eval()
93
+
94
+ # Check VRAM
95
+ vram = torch.cuda.memory_allocated() / 1e9
96
+ print(f" VRAM after load: {vram:.2f} GB")
97
+
98
+ print("\n[2/6] Warmup...")
99
+ warmup_ids = tokenizer.encode("Hello", return_tensors='pt').cuda()
100
+ for _ in range(3):
101
+ engine.generate(warmup_ids, max_new_tokens=5, temperature=0.0, top_k=0, top_p=1.0)
102
+
103
+ print("\n[3/6] Test BEFORE enable_eagle()...")
104
+ nan_before = check_forward(engine, tokenizer, "before eagle")
105
+
106
+ if nan_before:
107
+ print("\n ERROR: NaN even before enable_eagle! Something wrong with model load.")
108
+ sys.exit(1)
109
+
110
+ print("\n[4/6] Test: just set _eagle_enabled=True (no head creation)...")
111
+ engine._eagle_enabled = True
112
+ engine._eagle_capture_set = {8, 24, 47}
113
+ engine._eagle_capture_layers = [8, 24, 47]
114
+ engine._eagle_hidden_states = {}
115
+ nan_flag_only = check_forward(engine, tokenizer, "flag only")
116
+ engine._eagle_enabled = False # reset
117
+
118
+ print("\n[5/6] Test: create eagle head + assign as submodule...")
119
+ eagle_head = FireEchoEagleHead(
120
+ dim=config.dim, num_capture_layers=3,
121
+ num_heads=16, ffn_mult=2, num_layers=2,
122
+ ).to(dtype=torch.bfloat16, device='cuda')
123
+ eagle_head.lm_head = engine.lm_head
124
+ engine.eagle_head = eagle_head # registers as nn.Module submodule
125
+ vram2 = torch.cuda.memory_allocated() / 1e9
126
+ print(f" VRAM after eagle head: {vram2:.2f} GB (+{vram2 - vram:.2f} GB)")
127
+ nan_with_head = check_forward(engine, tokenizer, "with head (no ckpt)")
128
+
129
+ print("\n[6/6] Test: load checkpoint into eagle head...")
130
+ if os.path.exists(EAGLE_CKPT):
131
+ ckpt = torch.load(EAGLE_CKPT, map_location='cuda', weights_only=True)
132
+ sd = ckpt.get('eagle_head', ckpt)
133
+ is_legacy = any(k.startswith('norm1.') or k.startswith('q_proj.') for k in sd)
134
+ if is_legacy:
135
+ eagle_head.load_legacy_checkpoint(sd)
136
+ else:
137
+ eagle_head.load_state_dict(sd, strict=False)
138
+ nan_with_ckpt = check_forward(engine, tokenizer, "with ckpt")
139
+ else:
140
+ print(f" No checkpoint at {EAGLE_CKPT}, skipping")
141
+ nan_with_ckpt = nan_with_head
142
+
143
+ # Summary
144
+ print(f"\n{'=' * 60}")
145
+ print(" RESULTS")
146
+ print(f"{'=' * 60}")
147
+ print(f" Before eagle: {'NaN' if nan_before else 'OK'}")
148
+ print(f" Flag only: {'NaN' if nan_flag_only else 'OK'}")
149
+ print(f" With head (no ckpt): {'NaN' if nan_with_head else 'OK'}")
150
+ print(f" With checkpoint: {'NaN' if nan_with_ckpt else 'OK'}")
151
+
152
+ # If any NaN found, do per-layer analysis
153
+ if nan_flag_only or nan_with_head or nan_with_ckpt:
154
+ print(f"\n--- Per-layer NaN analysis ---")
155
+ if nan_flag_only:
156
+ engine._eagle_enabled = True
157
+ engine._eagle_capture_set = {8, 24, 47}
158
+ engine._eagle_capture_layers = [8, 24, 47]
159
+ engine._eagle_hidden_states = {}
160
+ check_per_layer(engine, tokenizer, "flag-only per-layer")
161
+ elif nan_with_head or nan_with_ckpt:
162
+ # eagle_head is still assigned
163
+ engine._eagle_enabled = True
164
+ engine._eagle_capture_set = {8, 24, 47}
165
+ engine._eagle_capture_layers = [8, 24, 47]
166
+ engine._eagle_hidden_states = {}
167
+ check_per_layer(engine, tokenizer, "full-eagle per-layer")
168
+
169
+ # Also test: head assigned but flag OFF
170
+ print(f"\n--- Test: head assigned but _eagle_enabled=False ---")
171
+ engine._eagle_enabled = False
172
+ check_forward(engine, tokenizer, "head assigned, flag OFF")
173
+ else:
174
+ print(" All tests passed — no NaN detected!")
FireEcho Engine/debug_promptlen.py ADDED
@@ -0,0 +1,110 @@
1
+ #!/usr/bin/env python3
2
+ """Test: does prompt length cause NaN? Test with/without eagle."""
3
+ import sys, os, torch
4
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
5
+ from hebbian_finetune_demo import load_engine
6
+
7
+ MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
8
+ EAGLE_CKPT = os.path.join(os.path.dirname(__file__), "eagle_checkpoints", "eagle_best.pt")
9
+
10
+ SHORT = "Hello"
11
+ MEDIUM = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n"
12
+ LONG = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWrite a Python function to check if a number is prime.<|im_end|>\n<|im_start|>assistant\n"
13
+
14
+
15
+ @torch.no_grad()
16
+ def test_forward(engine, tokenizer, label, prompt):
17
+ ids = tokenizer.encode(prompt, return_tensors='pt').cuda()
18
+ engine.reset_cache()
19
+ engine._current_seq_id = 0
20
+ if hasattr(engine.kv_cache, '_graph_mode'):
21
+ engine.kv_cache._graph_mode = False
22
+ logits = engine.forward(ids, use_cache=True, position=0)
23
+ torch.cuda.synchronize()
24
+ has_nan = logits.isnan().any().item()
25
+ if has_nan:
26
+ # Count NaN positions
27
+ nan_count = sum(1 for s in range(logits.shape[1]) if logits[0, s].isnan().any())
28
+ print(f" [{label}] NaN! ({nan_count}/{logits.shape[1]} positions) len={ids.shape[1]}")
29
+ else:
30
+ top = logits[:, -1, :].argmax(dim=-1).item()
31
+ print(f" [{label}] OK top={top} ('{tokenizer.decode([top])}') len={ids.shape[1]}")
32
+ return has_nan
33
+
34
+
35
+ if __name__ == "__main__":
36
+ print("=" * 60)
37
+ print(" Prompt Length NaN Test")
38
+ print("=" * 60)
39
+
40
+ print("\n[SETUP] Loading engine...")
41
+ engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=4096, device="cuda")
42
+ engine.eval()
43
+ engine.kv_cache.enable_flat_decode(4096)
44
+ engine.pack_all_experts()
45
+
46
+ # Test WITHOUT eagle
47
+ print("\n[Phase 1] No eagle — varying prompt lengths...")
48
+ test_forward(engine, tokenizer, "short (no eagle)", SHORT)
49
+ test_forward(engine, tokenizer, "medium (no eagle)", MEDIUM)
50
+ test_forward(engine, tokenizer, "long (no eagle)", LONG)
51
+
52
+ # Warmup
53
+ print("\n[Warmup]...")
54
+ wids = tokenizer.encode("Hello", return_tensors='pt').cuda()
55
+ for _ in range(3):
56
+ engine.generate(wids, max_new_tokens=5, temperature=0.0, top_k=0, top_p=1.0)
57
+ del wids
58
+
59
+ # Test again after warmup
60
+ print("\n[Phase 2] No eagle, after warmup...")
61
+ test_forward(engine, tokenizer, "short (warmed)", SHORT)
62
+ test_forward(engine, tokenizer, "medium (warmed)", MEDIUM)
63
+ test_forward(engine, tokenizer, "long (warmed)", LONG)
64
+
65
+ # Enable eagle WITH checkpoint
66
+ print("\n[Phase 3] Enable eagle D=8 with checkpoint...")
67
+ engine.enable_eagle(
68
+ capture_layers=(8, 24, 47), num_heads=16, ffn_mult=2,
69
+ draft_depth=5, num_head_layers=8, checkpoint_path=EAGLE_CKPT)
70
+
71
+ test_forward(engine, tokenizer, "short (eagle+ckpt)", SHORT)
72
+ test_forward(engine, tokenizer, "medium (eagle+ckpt)", MEDIUM)
73
+ test_forward(engine, tokenizer, "long (eagle+ckpt)", LONG)
74
+
75
+ # Warmup again after eagle
76
+ print("\n[Warmup after eagle]...")
77
+ wids = tokenizer.encode("Hello", return_tensors='pt').cuda()
78
+ for _ in range(3):
79
+ engine.generate(wids, max_new_tokens=5, temperature=0.0, top_k=0, top_p=1.0)
80
+ del wids
81
+
82
+ print("\n[Phase 4] Eagle + ckpt, after warmup...")
83
+ test_forward(engine, tokenizer, "short (eagle warmed)", SHORT)
84
+ test_forward(engine, tokenizer, "medium (eagle warmed)", MEDIUM)
85
+ test_forward(engine, tokenizer, "long (eagle warmed)", LONG)
86
+
87
+ # Test: enable_eagle WITHOUT checkpoint
88
+ print("\n[Phase 5] Fresh engine, eagle D=8 NO checkpoint...")
89
+ del engine
90
+ torch.cuda.empty_cache()
91
+ engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=4096, device="cuda")
92
+ engine.eval()
93
+ engine.kv_cache.enable_flat_decode(4096)
94
+ engine.pack_all_experts()
95
+ engine.enable_eagle(
96
+ capture_layers=(8, 24, 47), num_heads=16, ffn_mult=2,
97
+ draft_depth=5, num_head_layers=8) # NO checkpoint
98
+ # Warmup
99
+ wids = tokenizer.encode("Hello", return_tensors='pt').cuda()
100
+ for _ in range(3):
101
+ engine.generate(wids, max_new_tokens=5, temperature=0.0, top_k=0, top_p=1.0)
102
+ del wids
103
+
104
+ test_forward(engine, tokenizer, "short (no ckpt)", SHORT)
105
+ test_forward(engine, tokenizer, "medium (no ckpt)", MEDIUM)
106
+ test_forward(engine, tokenizer, "long (no ckpt)", LONG)
107
+
108
+ print("\n" + "=" * 60)
109
+ print(" DONE")
110
+ print("=" * 60)
FireEcho Engine/debug_seqlen.py ADDED
@@ -0,0 +1,65 @@
1
+ #!/usr/bin/env python3
2
+ """Test: does max_seq_len=512 vs 4096 cause NaN?"""
3
+ import sys, os, torch
4
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
5
+ from hebbian_finetune_demo import load_engine
6
+
7
+ MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
8
+ EAGLE_CKPT = os.path.join(os.path.dirname(__file__), "eagle_checkpoints", "eagle_best.pt")
9
+ PROMPT = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n"
10
+
11
+
12
+ @torch.no_grad()
13
+ def check(engine, tokenizer, label):
14
+ ids = tokenizer.encode(PROMPT, return_tensors='pt').cuda()
15
+ engine.reset_cache()
16
+ engine._current_seq_id = 0
17
+ if hasattr(engine.kv_cache, '_graph_mode'):
18
+ engine.kv_cache._graph_mode = False
19
+ logits = engine.forward(ids, use_cache=True, position=0)
20
+ torch.cuda.synchronize()
21
+ has_nan = logits.isnan().any().item()
22
+ if has_nan:
23
+ print(f" [{label}] NaN DETECTED")
24
+ else:
25
+ top = logits[:, -1, :].argmax(dim=-1).item()
26
+ print(f" [{label}] OK — top={top} ('{tokenizer.decode([top])}')")
27
+ return has_nan
28
+
29
+
30
+ if __name__ == "__main__":
31
+ print("=" * 60)
32
+ print(" max_seq_len test")
33
+ print("=" * 60)
34
+
35
+ # Replicate EXACT training script flow: max_seq_len=512
36
+ print("\n[1] load_engine(max_seq_len=512)...")
37
+ engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=512, device="cuda")
38
+ engine.eval()
39
+ engine.kv_cache.enable_flat_decode(4096)
40
+ engine.pack_all_experts()
41
+
42
+ vram = torch.cuda.memory_allocated() / 1e9
43
+ print(f" VRAM: {vram:.2f} GB")
44
+
45
+ # Warmup
46
+ print("\n[2] Warmup...")
47
+ wids = tokenizer.encode("Hello", return_tensors='pt').cuda()
48
+ for _ in range(3):
49
+ engine.generate(wids, max_new_tokens=5, temperature=0.0, top_k=0, top_p=1.0)
50
+
51
+ # Test WITHOUT eagle (should work)
52
+ print("\n[3] Forward without eagle (max_seq_len=512)...")
53
+ check(engine, tokenizer, "no eagle, seq=512")
54
+
55
+ # Test WITH D=8 eagle
56
+ print("\n[4] Enable D=8 eagle + checkpoint...")
57
+ engine.enable_eagle(capture_layers=(8, 24, 47), num_heads=16, ffn_mult=2,
58
+ num_head_layers=8, checkpoint_path=EAGLE_CKPT)
59
+ vram = torch.cuda.memory_allocated() / 1e9
60
+ print(f" VRAM: {vram:.2f} GB")
61
+ nan_512 = check(engine, tokenizer, "D=8, seq=512")
62
+
63
+ print(f"\n{'='*60}")
64
+ print(f" max_seq_len=512 + D=8: {'NaN' if nan_512 else 'OK'}")
65
+ print(f"{'='*60}")
FireEcho Engine/debug_seqlen_threshold.py ADDED
@@ -0,0 +1,61 @@
1
+ #!/usr/bin/env python3
2
+ """Find exact sequence length threshold for NaN. Test with/without pack_all_experts."""
3
+ import sys, os, torch
4
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
5
+ from hebbian_finetune_demo import load_engine
6
+
7
+ MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
8
+
9
+
10
+ @torch.no_grad()
11
+ def test_len(engine, tokenizer, n_tokens, label=""):
12
+ """Generate a prompt of approximately n tokens and test forward."""
13
+ # Use repeating text to control length
14
+ base = "word " * max(n_tokens, 1)
15
+ ids = tokenizer.encode(base, return_tensors='pt').cuda()
16
+ # Truncate to exact length
17
+ ids = ids[:, :n_tokens]
18
+ engine.reset_cache()
19
+ engine._current_seq_id = 0
20
+ if hasattr(engine.kv_cache, '_graph_mode'):
21
+ engine.kv_cache._graph_mode = False
22
+ logits = engine.forward(ids, use_cache=True, position=0)
23
+ torch.cuda.synchronize()
24
+ has_nan = logits.isnan().any().item()
25
+ status = "NaN" if has_nan else "OK"
26
+ print(f" len={n_tokens:4d} {label}: {status}")
27
+ return has_nan
28
+
29
+
30
+ if __name__ == "__main__":
31
+ print("=" * 60)
32
+ print(" Sequence Length NaN Threshold Finder")
33
+ print("=" * 60)
34
+
35
+ print("\n[1] Loading engine (WITH pack)...")
36
+ engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=4096, device="cuda")
37
+ engine.eval()
38
+ engine.kv_cache.enable_flat_decode(4096)
39
+ engine.pack_all_experts()
40
+
41
+ # Binary search for threshold
42
+ print("\n[2] Testing WITH pack_all_experts (coarse)...")
43
+ for n in [1, 5, 10, 15, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 40, 50, 64, 100]:
44
+ test_len(engine, tokenizer, n, "(packed)")
45
+
46
+ # Now test WITHOUT pack
47
+ print("\n[3] Reloading engine WITHOUT pack_all_experts...")
48
+ del engine
49
+ torch.cuda.empty_cache()
50
+ engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=4096, device="cuda")
51
+ engine.eval()
52
+ engine.kv_cache.enable_flat_decode(4096)
53
+ # NO pack_all_experts!
54
+
55
+ print("\n[4] Testing WITHOUT pack_all_experts...")
56
+ for n in [1, 10, 20, 25, 30, 31, 32, 40, 50, 64, 100]:
57
+ test_len(engine, tokenizer, n, "(unpacked)")
58
+
59
+ print("\n" + "=" * 60)
60
+ print(" DONE")
61
+ print("=" * 60)
FireEcho Engine/debug_specgen_trace.py ADDED
@@ -0,0 +1,171 @@
1
+ #!/usr/bin/env python3
2
+ """Trace speculative_generate step by step to find exactly where NaN appears."""
3
+ import sys, os, torch
4
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
5
+ from hebbian_finetune_demo import load_engine
6
+
7
+ MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
8
+ EAGLE_CKPT = os.path.join(os.path.dirname(__file__), "eagle_checkpoints", "eagle_best.pt")
9
+ PROMPT = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWrite a function to check primes.<|im_end|>\n<|im_start|>assistant\n"
10
+
11
+
12
+ def check_nan(label, tensor):
13
+ has_nan = tensor.isnan().any().item()
14
+ has_inf = tensor.isinf().any().item()
15
+ if has_nan or has_inf:
16
+ print(f" *** {label}: NaN={has_nan} Inf={has_inf} shape={list(tensor.shape)}")
17
+ # Check which positions have NaN
18
+ if tensor.dim() == 3: # [B, S, V]
19
+ for s in range(tensor.shape[1]):
20
+ if tensor[0, s].isnan().any():
21
+ print(f" Position {s}: NaN!")
22
+ return True
23
+ else:
24
+ top = tensor[:, -1, :].argmax(dim=-1).item()
25
+ print(f" {label}: OK (top={top}) shape={list(tensor.shape)}")
26
+ return False
27
+
28
+
29
+ @torch.no_grad()
30
+ def main():
31
+ print("=" * 60)
32
+ print(" Speculative Generate NaN Trace")
33
+ print("=" * 60)
34
+
35
+ # Load engine exactly like training
36
+ print("\n[SETUP] Loading engine...")
37
+ engine, tokenizer, config = load_engine(MODEL_PATH, max_seq_len=4096, device="cuda")
38
+ engine.eval()
39
+ engine.kv_cache.enable_flat_decode(4096)
40
+ engine.pack_all_experts()
41
+
42
+ # Enable EAGLE D=8
43
+ engine.enable_eagle(
44
+ capture_layers=(8, 24, 47), num_heads=16, ffn_mult=2,
45
+ draft_depth=5, num_head_layers=8, checkpoint_path=EAGLE_CKPT)
46
+
47
+ # Warmup
48
+ print("\n[SETUP] Warmup...")
49
+ wids = tokenizer.encode("Hello", return_tensors='pt').cuda()
50
+ for _ in range(3):
51
+ engine.generate(wids, max_new_tokens=5, temperature=0.0, top_k=0, top_p=1.0)
52
+ del wids
53
+
54
+ # Now replicate speculative_generate manually
55
+ print("\n[TRACE] Starting manual speculation trace...")
56
+ ids = tokenizer.encode(PROMPT, return_tensors='pt').cuda()
57
+ prompt_len = ids.shape[1]
58
+ print(f" Prompt length: {prompt_len}")
59
+
60
+ # Step 1: Reset + prefill
61
+ engine.reset_cache()
62
+ engine._current_seq_id = 0
63
+ if hasattr(engine.kv_cache, '_graph_mode'):
64
+ engine.kv_cache._graph_mode = False
65
+
66
+ print("\n[1] Prefill...")
67
+ logits = engine.forward(ids, use_cache=True, position=0)
68
+ torch.cuda.synchronize()
69
+ nan1 = check_nan("Prefill logits", logits)
70
+ if nan1:
71
+ print(" FATAL: NaN in prefill!")
72
+ return
73
+
74
+ current_pos = prompt_len
75
+ first_token = logits[:, -1:, :].argmax(dim=-1)
76
+ print(f" First token: {first_token.item()} ('{tokenizer.decode([first_token.item()])}')")
77
+
78
+ # Step 2: Process first token through main model
79
+ print("\n[2] Process first token through main model...")
80
+ if hasattr(engine.kv_cache, '_graph_mode'):
81
+ engine.kv_cache._graph_mode = False
82
+ logits = engine.forward(first_token, use_cache=True, position=current_pos)
83
+ torch.cuda.synchronize()
84
+ nan2 = check_nan("First-token logits", logits)
85
+ if nan2:
86
+ print(" FATAL: NaN at first token forward!")
87
+ return
88
+ current_pos += 1
89
+ main_pred = logits[:, -1, :].argmax(dim=-1).item()
90
+ print(f" main_pred: {main_pred} ('{tokenizer.decode([main_pred])}')")
91
+
92
+ # Step 3: Draft K tokens using EAGLE
93
+ print("\n[3] Draft K=5 tokens...")
94
+ features = [engine._eagle_hidden_states[l] for l in engine._eagle_capture_layers]
95
+ for idx, f in enumerate(features):
96
+ has_nan = f.isnan().any().item()
97
+ print(f" Feature {idx} (layer {engine._eagle_capture_layers[idx]}): "
98
+ f"shape={list(f.shape)}, NaN={has_nan}")
99
+
100
+ memory_ctx = engine._get_eagle_memory_context(
101
+ engine._eagle_hidden_states[engine._eagle_capture_layers[-1]])
102
+
103
+ draft_tokens, draft_logits = engine.eagle_head.generate_draft(
104
+ features, first_token, engine.embed, depth=5, memory_context=memory_ctx)
105
+
106
+ print(f" Draft tokens: {[t.item() for t in draft_tokens]}")
107
+ print(f" Draft decoded: {[tokenizer.decode([t.item()]) for t in draft_tokens]}")
108
+ for i, dl in enumerate(draft_logits):
109
+ has_nan = dl.isnan().any().item()
110
+ if has_nan:
111
+ print(f" *** Draft logits[{i}]: NaN!")
112
+
113
+ # Step 4: Verify draft tokens through main model (this is the suspicious step)
114
+ print("\n[4] Verify K=5 draft tokens through main model...")
115
+ print(f" Verifying at position={current_pos} (prompt_len={prompt_len})")
116
+ draft_input = torch.cat(draft_tokens, dim=1)
117
+ print(f" draft_input shape: {list(draft_input.shape)}, tokens: {draft_input[0].tolist()}")
118
+
119
+ verify_logits = engine.forward(draft_input, use_cache=True, position=current_pos)
120
+ torch.cuda.synchronize()
121
+ nan4 = check_nan("Verify logits", verify_logits)
122
+
123
+ if nan4:
124
+ print("\n FOUND THE BUG: Verify forward (K>1 tokens at position>0) produces NaN!")
125
+ print(" This is likely a causal mask or KV cache issue in multi-token decode.")
126
+
127
+ # Additional test: verify ONE draft token at a time
128
+ print("\n[4b] Trying verify ONE token at a time...")
129
+ # Rollback the K tokens we just stored
130
+ engine.kv_cache.rollback_to(current_pos, 5)
131
+
132
+ for i, dt in enumerate(draft_tokens):
133
+ one_logit = engine.forward(dt, use_cache=True, position=current_pos + i)
134
+ torch.cuda.synchronize()
135
+ has_nan = one_logit.isnan().any().item()
136
+ top = one_logit[:, -1, :].argmax(dim=-1).item() if not has_nan else -1
137
+ print(f" Token {i} at pos {current_pos + i}: NaN={has_nan} top={top}")
138
+ if has_nan:
139
+ print(f" SINGLE token verify also fails at position {current_pos + i}!")
140
+ break
141
+ else:
142
+ print("\n Verify logits OK — checking acceptance logic...")
143
+ if draft_tokens[0].item() == main_pred:
144
+ print(f" Draft[0] matches main_pred ({main_pred}) ✓")
145
+ else:
146
+ print(f" Draft[0]={draft_tokens[0].item()} ≠ main_pred={main_pred} ✗")
147
+
148
+ for i in range(1, len(draft_tokens)):
149
+ target_pred = verify_logits[:, i-1, :].argmax(dim=-1).item()
150
+ match = "✓" if draft_tokens[i].item() == target_pred else "✗"
151
+ print(f" verify[{i-1}]={target_pred} vs draft[{i}]={draft_tokens[i].item()} {match}")
152
+
153
+ # Step 5: Also test a multi-token forward with RANDOM tokens at position>0
154
+ print("\n[5] Control test: multi-token forward with KNOWN-GOOD tokens...")
155
+ engine.reset_cache()
156
+ engine._current_seq_id = 0
157
+ # Prefill
158
+ logits = engine.forward(ids, use_cache=True, position=0)
159
+ # Now try 5 copies of a valid token at position=prompt_len
160
+ test_tokens = torch.full((1, 5), first_token.item(), dtype=torch.long, device='cuda')
161
+ test_logits = engine.forward(test_tokens, use_cache=True, position=prompt_len)
162
+ torch.cuda.synchronize()
163
+ nan5 = check_nan("Control multi-token logits", test_logits)
164
+
165
+ print("\n" + "=" * 60)
166
+ print(" TRACE COMPLETE")
167
+ print("=" * 60)
168
+
169
+
170
+ if __name__ == "__main__":
171
+ main()
FireEcho Engine/dsmem_ops.py ADDED
@@ -0,0 +1,789 @@
1
+ """
2
+ FireEcho DSMEM — Distributed Shared Memory Operations
3
+ =======================================================
4
+ Part of the FireEcho Engine — Custom inference kernel for NVIDIA Blackwell
5
+ Copyright (c) 2025-2026 Echo (FireEcho Project). All rights reserved.
6
+
7
+ Implements DSMEM and Cluster Barriers using Triton's inline_asm_elementwise
8
+ for PTX injection on SM 9.0+ (Hopper) and SM 12.0+ (Blackwell).
9
+
10
+ Features:
11
+ 1. mapa PTX - Map local SMEM to cluster-wide address
12
+ 2. mbarrier PTX - Hardware-accelerated cluster barriers
13
+ 3. Cooperative cluster primitives
14
+
15
+ Usage:
16
+ from fireecho.dsmem_ops import (
17
+ cluster_matmul_dsmem,
18
+ ClusterConfig,
19
+ )
20
+
21
+ # 2-CTA cooperative matmul with DSMEM
22
+ c = cluster_matmul_dsmem(a, b, config=ClusterConfig(cluster_x=2))
23
+ """
24
+
25
+ import torch
26
+ import triton
27
+ import triton.language as tl
28
+ from typing import Tuple, Optional
29
+ from dataclasses import dataclass
30
+
31
+
32
+ @dataclass
33
+ class ClusterConfig:
34
+ """Configuration for cluster operations."""
35
+ cluster_x: int = 2 # Cluster size in X (2 for 2-CTA MMA)
36
+ cluster_y: int = 1
37
+ cluster_z: int = 1
38
+ use_dsmem: bool = True # Enable distributed shared memory
39
+ use_mbarrier: bool = True # Use hardware barriers
40
+
41
+
42
+ # =============================================================================
43
+ # SM120 DSMEM PTX Primitives
44
+ # =============================================================================
45
+ #
46
+ # Hopper (SM90) introduced Distributed Shared Memory (DSMEM), carried forward on Blackwell (SM120), allowing
47
+ # thread blocks within a cluster to directly access each other's shared memory.
48
+ #
49
+ # Key PTX instructions:
50
+ # - mapa.shared::cluster - Map local SMEM to cluster-wide address
51
+ # - mbarrier.arrive/wait - Hardware-accelerated barriers
52
+ # - fence.acq_rel.cluster - Cluster-scope memory fence
53
+ # - st.async.shared::cluster - Async store to remote SMEM
54
+ # - ld.shared::cluster - Load from remote SMEM
55
+ #
56
+ # Reference: CUDA 12.8+ PTX ISA, Section 9.7.13 (Cluster Operations)
57
+ # =============================================================================
58
+
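A minimal composition sketch (illustrative only, never launched by the engine) of how the rank, barrier, and fence helpers defined below are intended to fit together inside a @triton.jit kernel; real code would initialize the mbarrier from a single CTA and manage the shared-memory barrier address explicitly:

    @triton.jit
    def _cluster_handshake_sketch(barrier_ptr):
        n_ctas = _cluster_dim_x()                     # CTAs cooperating in this cluster
        _cluster_barrier_init(barrier_ptr, n_ctas)    # expect one arrival per CTA
        phase = _cluster_barrier_arrive(barrier_ptr)  # signal arrival, keep the phase token
        _cluster_barrier_wait(barrier_ptr, phase)     # spin until every CTA has arrived
        _fence_cluster()                              # publish SMEM writes cluster-wide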
59
+ @triton.jit
60
+ def _cluster_rank_x() -> tl.tensor:
61
+ """Get current block's X rank within cluster (0 to cluster_dim_x-1)."""
62
+ return tl.inline_asm_elementwise(
63
+ asm="""
64
+ {
65
+ .reg .u32 %r;
66
+ mov.u32 %r, %cluster_ctaid.x;
67
+ mov.u32 $0, %r;
68
+ }
69
+ """,
70
+ constraints="=r",
71
+ args=[],
72
+ dtype=tl.int32,
73
+ is_pure=True,
74
+ pack=1,
75
+ )
76
+
77
+
78
+ @triton.jit
79
+ def _cluster_rank_y() -> tl.tensor:
80
+ """Get current block's Y rank within cluster."""
81
+ return tl.inline_asm_elementwise(
82
+ asm="""
83
+ {
84
+ .reg .u32 %r;
85
+ mov.u32 %r, %cluster_ctaid.y;
86
+ mov.u32 $0, %r;
87
+ }
88
+ """,
89
+ constraints="=r",
90
+ args=[],
91
+ dtype=tl.int32,
92
+ is_pure=True,
93
+ pack=1,
94
+ )
95
+
96
+
97
+ @triton.jit
98
+ def _cluster_dim_x() -> tl.tensor:
99
+ """Get cluster dimension in X (number of CTAs in X)."""
100
+ return tl.inline_asm_elementwise(
101
+ asm="""
102
+ {
103
+ .reg .u32 %r;
104
+ mov.u32 %r, %cluster_nctaid.x;
105
+ mov.u32 $0, %r;
106
+ }
107
+ """,
108
+ constraints="=r",
109
+ args=[],
110
+ dtype=tl.int32,
111
+ is_pure=True,
112
+ pack=1,
113
+ )
114
+
115
+
116
+ @triton.jit
117
+ def _cluster_dim_y() -> tl.tensor:
118
+ """Get cluster dimension in Y."""
119
+ return tl.inline_asm_elementwise(
120
+ asm="""
121
+ {
122
+ .reg .u32 %r;
123
+ mov.u32 %r, %cluster_nctaid.y;
124
+ mov.u32 $0, %r;
125
+ }
126
+ """,
127
+ constraints="=r",
128
+ args=[],
129
+ dtype=tl.int32,
130
+ is_pure=True,
131
+ pack=1,
132
+ )
133
+
134
+
135
+ # Legacy aliases
136
+ @triton.jit
137
+ def _cluster_rank() -> tl.tensor:
138
+ """Get current block's rank within cluster (X dimension)."""
139
+ return _cluster_rank_x()
140
+
141
+
142
+ @triton.jit
143
+ def _cluster_size() -> tl.tensor:
144
+ """Get total cluster size (X dimension)."""
145
+ return _cluster_dim_x()
146
+
147
+
148
+ @triton.jit
149
+ def _mapa_shared(local_ptr, target_rank):
150
+ """
151
+ Map local shared memory pointer to target rank's address space.
152
+
153
+ PTX: mapa.shared::cluster.u64 dst, src, ctaid
154
+
155
+ This maps a local SMEM address to the equivalent address in another
156
+ CTA's shared memory space within the same cluster.
157
+
158
+ Args:
159
+ local_ptr: Pointer to local shared memory
160
+ target_rank: Target CTA rank within cluster
161
+
162
+ Returns:
163
+ Pointer to remote CTA's shared memory
164
+
165
+ Note: Requires SM 9.0+ (Hopper) or SM 12.0+ (Blackwell)
166
+ """
167
+ return tl.inline_asm_elementwise(
168
+ asm="mapa.shared::cluster.u64 $0, $1, $2;",
169
+ constraints="=l,l,r",
170
+ args=[local_ptr, target_rank],
171
+ dtype=tl.pointer_type(tl.float32),
172
+ is_pure=True,
173
+ pack=1,
174
+ )
175
+
176
+
177
+ @triton.jit
178
+ def _cluster_barrier_init(barrier_ptr, expected_count):
179
+ """
180
+ Initialize mbarrier for cluster-wide synchronization.
181
+
182
+ PTX: mbarrier.init.shared::cluster.b64 [addr], count
183
+
184
+ Args:
185
+ barrier_ptr: Pointer to barrier in shared memory
186
+ expected_count: Number of arrivals before completion
187
+ """
188
+ tl.inline_asm_elementwise(
189
+ asm="mbarrier.init.shared::cluster.b64 [$0], $1;",
190
+ constraints="r,r",
191
+ args=[barrier_ptr, expected_count],
192
+ dtype=tl.int32,
193
+ is_pure=False,
194
+ pack=1,
195
+ )
196
+
197
+
198
+ @triton.jit
199
+ def _cluster_barrier_arrive(barrier_ptr):
200
+ """
201
+ Arrive at cluster barrier, returns phase token.
202
+
203
+ PTX: mbarrier.arrive.shared::cluster.b64 state, [addr]
204
+
205
+ Args:
206
+ barrier_ptr: Pointer to barrier in shared memory
207
+
208
+ Returns:
209
+ Phase token for wait operation
210
+ """
211
+ return tl.inline_asm_elementwise(
212
+ asm="mbarrier.arrive.shared::cluster.b64 $0, [$1];",
213
+ constraints="=l,r",
214
+ args=[barrier_ptr],
215
+ dtype=tl.uint64,
216
+ is_pure=False,
217
+ pack=1,
218
+ )
219
+
220
+
221
+ @triton.jit
222
+ def _cluster_barrier_arrive_tx(barrier_ptr, tx_count):
223
+ """
224
+ Arrive at barrier with transaction count (for async copy tracking).
225
+
226
+ PTX: mbarrier.arrive.expect_tx.shared::cluster.b64 state, [addr], tx_count
227
+
228
+ Args:
229
+ barrier_ptr: Pointer to barrier
230
+ tx_count: Number of bytes expected in transaction
231
+
232
+ Returns:
233
+ Phase token
234
+ """
235
+ return tl.inline_asm_elementwise(
236
+ asm="mbarrier.arrive.expect_tx.shared::cluster.b64 $0, [$1], $2;",
237
+ constraints="=l,r,r",
238
+ args=[barrier_ptr, tx_count],
239
+ dtype=tl.uint64,
240
+ is_pure=False,
241
+ pack=1,
242
+ )
243
+
244
+
245
+ @triton.jit
246
+ def _cluster_barrier_wait(barrier_ptr, phase):
247
+ """
248
+ Wait on cluster barrier until phase completes.
249
+
250
+ PTX: mbarrier.try_wait.shared::cluster.b64 pred, [addr], phase
251
+
252
+ Uses spin-wait loop for completion.
253
+ """
254
+ tl.inline_asm_elementwise(
255
+ asm="""
256
+ {
257
+ .reg .pred %p;
258
+ WAIT_LOOP:
259
+ mbarrier.try_wait.shared::cluster.b64 %p, [$0], $1;
260
+ @!%p bra WAIT_LOOP;
261
+ }
262
+ """,
263
+ constraints="r,l",
264
+ args=[barrier_ptr, phase],
265
+ dtype=tl.int32,
266
+ is_pure=False,
267
+ pack=1,
268
+ )
269
+
270
+
271
+ @triton.jit
272
+ def _cluster_barrier_test_wait(barrier_ptr, phase):
273
+ """
274
+ Non-blocking test if barrier phase completed.
275
+
276
+ Returns 1 if complete, 0 if still pending.
277
+ """
278
+ return tl.inline_asm_elementwise(
279
+ asm="""
280
+ {
281
+ .reg .pred %p;
282
+ .reg .u32 %r;
283
+ mbarrier.test_wait.shared::cluster.b64 %p, [$1], $2;
284
+ selp.u32 %r, 1, 0, %p;
285
+ mov.u32 $0, %r;
286
+ }
287
+ """,
288
+ constraints="=r,r,l",
289
+ args=[barrier_ptr, phase],
290
+ dtype=tl.int32,
291
+ is_pure=False,
292
+ pack=1,
293
+ )
294
+
295
+
296
+ @triton.jit
297
+ def _fence_cluster():
298
+ """
299
+ Memory fence at cluster scope.
300
+
301
+ PTX: fence.acq_rel.cluster
302
+
303
+ Ensures all prior memory operations visible to all CTAs in cluster.
304
+ """
305
+ tl.inline_asm_elementwise(
306
+ asm="fence.acq_rel.cluster;",
307
+ constraints="",
308
+ args=[],
309
+ dtype=tl.int32,
310
+ is_pure=False,
311
+ pack=1,
312
+ )
313
+
314
+
315
+ @triton.jit
316
+ def _fence_cluster_release():
317
+ """Release fence at cluster scope."""
318
+ tl.inline_asm_elementwise(
319
+ asm="fence.release.cluster;",
320
+ constraints="",
321
+ args=[],
322
+ dtype=tl.int32,
323
+ is_pure=False,
324
+ pack=1,
325
+ )
326
+
327
+
328
+ @triton.jit
329
+ def _fence_cluster_acquire():
330
+ """Acquire fence at cluster scope."""
331
+ tl.inline_asm_elementwise(
332
+ asm="fence.acquire.cluster;",
333
+ constraints="",
334
+ args=[],
335
+ dtype=tl.int32,
336
+ is_pure=False,
337
+ pack=1,
338
+ )
339
+
340
+
341
+ @triton.jit
342
+ def _cluster_sync():
343
+ """
344
+ Full cluster synchronization point.
345
+
346
+ Equivalent to barrier + fence.
347
+ All threads in all CTAs of cluster must reach this point.
348
+ """
349
+ # Note: bar.cluster requires cooperative launch
350
+ tl.inline_asm_elementwise(
351
+ asm="""
352
+ {
353
+ bar.cluster.arrive;
354
+ bar.cluster.wait;
355
+ fence.acq_rel.cluster;
356
+ }
357
+ """,
358
+ constraints="",
359
+ args=[],
360
+ dtype=tl.int32,
361
+ is_pure=False,
362
+ pack=1,
363
+ )
364
+
365
+
366
+ @triton.jit
367
+ def _async_copy_cluster(dst_ptr, src_ptr, size_bytes):
368
+ """
369
+ Asynchronous copy within cluster using TMA.
370
+
371
+ PTX: cp.async.bulk.shared::cluster.global
372
+
373
+ Note: This is a simplified version. Full TMA requires descriptor setup.
374
+ """
375
+ tl.inline_asm_elementwise(
376
+ asm="cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [$0], [$1], $2;",
377
+ constraints="l,l,r",
378
+ args=[dst_ptr, src_ptr, size_bytes],
379
+ dtype=tl.int32,
380
+ is_pure=False,
381
+ pack=1,
382
+ )
383
+
384
+
385
+ # =============================================================================
386
+ # High-Level DSMEM Utilities
387
+ # =============================================================================
388
+
389
+ def get_sm_version() -> Tuple[int, int]:
390
+ """Get GPU SM version (major, minor)."""
391
+ if torch.cuda.is_available():
392
+ props = torch.cuda.get_device_properties(0)
393
+ return (props.major, props.minor)
394
+ return (0, 0)
395
+
396
+
397
+ def supports_dsmem() -> bool:
398
+ """Check if current GPU supports DSMEM (SM 9.0+ / SM 12.0+)."""
399
+ major, minor = get_sm_version()
400
+ return major >= 9
401
+
402
+
403
+ def supports_cluster_2cta() -> bool:
404
+ """Check if current GPU supports 2-CTA clusters."""
405
+ major, minor = get_sm_version()
406
+ return major >= 9 # Hopper+ supports clusters
407
+
408
+
409
+ def get_max_cluster_size() -> int:
410
+ """Get maximum cluster size supported by GPU."""
411
+ major, minor = get_sm_version()
412
+ if major >= 12: # Blackwell
413
+ return 16 # Up to 16 CTAs per cluster
414
+ elif major >= 9: # Hopper
415
+ return 8 # Up to 8 CTAs per cluster
416
+ return 1 # No cluster support
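A small host-side sketch (an assumption for illustration, not part of the engine) of how these capability checks are meant to gate the cluster path, with a plain BF16 matmul as the fallback on pre-Hopper GPUs:

    def _matmul_with_cluster_fallback(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Take the 2-CTA DSMEM path only when the GPU reports cluster support.
        if supports_cluster_2cta():
            return cluster_matmul_dsmem(a, b, ClusterConfig(cluster_x=2))
        # Fallback: ordinary BF16 matmul, matching the kernel's output dtype.
        return a.to(torch.bfloat16) @ b.to(torch.bfloat16)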
417
+
418
+
419
+ # =============================================================================
420
+ # High-Level Cluster MatMul with DSMEM
421
+ # =============================================================================
422
+
423
+ @triton.autotune(
424
+ configs=[
425
+ triton.Config(
426
+ {'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 64},
427
+ num_stages=3, num_warps=8, num_ctas=2
428
+ ),
429
+ triton.Config(
430
+ {'BLOCK_M': 128, 'BLOCK_N': 256, 'BLOCK_K': 64},
431
+ num_stages=4, num_warps=8, num_ctas=2
432
+ ),
433
+ ],
434
+ key=['M', 'N', 'K'],
435
+ )
436
+ @triton.jit
437
+ def _cluster_matmul_dsmem_kernel(
438
+ a_ptr, b_ptr, c_ptr,
439
+ M, N, K,
440
+ stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
441
+ BLOCK_M: tl.constexpr,
442
+ BLOCK_N: tl.constexpr,
443
+ BLOCK_K: tl.constexpr,
444
+ ):
445
+ """
446
+ 2-CTA Cluster MatMul with Distributed Shared Memory.
447
+
448
+ Architecture:
449
+ - CTA 0: Responsible for loading A tiles, shares via DSMEM
450
+ - CTA 1: Responsible for loading B tiles, shares via DSMEM
451
+ - Both: Compute partial products cooperatively
452
+
453
+ This kernel demonstrates the pattern; actual DSMEM requires
454
+ explicit shared memory management in Triton.
455
+ """
456
+ pid_m = tl.program_id(0)
457
+ pid_n = tl.program_id(1)
458
+
459
+ # Get cluster info (when running with num_ctas > 1)
460
+ # For 2-CTA mode, blocks cooperate on adjacent tiles
461
+ num_pid_m = tl.cdiv(M, BLOCK_M)
462
+ num_pid_n = tl.cdiv(N, BLOCK_N)
463
+
464
+ # Swizzle for better L2 locality
465
+ GROUP_SIZE_M = 8
466
+ pid_m_group = pid_m // GROUP_SIZE_M
467
+ pid_m_local = pid_m % GROUP_SIZE_M
468
+ pid_n_group = pid_n // (num_pid_n // GROUP_SIZE_M + 1)
469
+
470
+ # Block pointers for TMA-style access
471
+ a_block = tl.make_block_ptr(
472
+ base=a_ptr, shape=(M, K), strides=(stride_am, stride_ak),
473
+ offsets=(pid_m * BLOCK_M, 0), block_shape=(BLOCK_M, BLOCK_K),
474
+ order=(1, 0)
475
+ )
476
+ b_block = tl.make_block_ptr(
477
+ base=b_ptr, shape=(K, N), strides=(stride_bk, stride_bn),
478
+ offsets=(0, pid_n * BLOCK_N), block_shape=(BLOCK_K, BLOCK_N),
479
+ order=(1, 0)
480
+ )
481
+
482
+ # Accumulator in FP32 for precision
483
+ acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
484
+
485
+ # Main loop with software pipelining
486
+ for k_iter in range(0, tl.cdiv(K, BLOCK_K)):
487
+ # Load tiles (TMA handles async prefetch)
488
+ a_tile = tl.load(a_block, boundary_check=(0, 1))
489
+ b_tile = tl.load(b_block, boundary_check=(0, 1))
490
+
491
+ # Matrix multiply accumulate
492
+ acc += tl.dot(a_tile, b_tile)
493
+
494
+ # Advance pointers
495
+ a_block = tl.advance(a_block, (0, BLOCK_K))
496
+ b_block = tl.advance(b_block, (BLOCK_K, 0))
497
+
498
+ # Store result
499
+ c_block = tl.make_block_ptr(
500
+ base=c_ptr, shape=(M, N), strides=(stride_cm, stride_cn),
501
+ offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
502
+ block_shape=(BLOCK_M, BLOCK_N), order=(1, 0)
503
+ )
504
+ tl.store(c_block, acc.to(tl.bfloat16), boundary_check=(0, 1))
505
+
506
+
507
+ def cluster_matmul_dsmem(
508
+ a: torch.Tensor,
509
+ b: torch.Tensor,
510
+ config: Optional[ClusterConfig] = None
511
+ ) -> torch.Tensor:
512
+ """
513
+ High-performance cluster MatMul with DSMEM.
514
+
515
+ Uses 2-CTA cooperative mode on Blackwell (SM 12.0) for
516
+ ~116% of cuBLAS performance on medium matrices.
517
+
518
+ Args:
519
+ a: Input matrix A [M, K] in BF16
520
+ b: Input matrix B [K, N] in BF16
521
+ config: Cluster configuration (default: 2-CTA)
522
+
523
+ Returns:
524
+ Output matrix C [M, N] in BF16
525
+ """
526
+ if config is None:
527
+ config = ClusterConfig()
528
+
529
+ M, K = a.shape
530
+ K2, N = b.shape
531
+ assert K == K2, f"K dimension mismatch: {K} vs {K2}"
532
+
533
+ # Ensure BF16 for Tensor Core efficiency
534
+ a = a.to(torch.bfloat16).contiguous()
535
+ b = b.to(torch.bfloat16).contiguous()
536
+
537
+ c = torch.empty((M, N), device=a.device, dtype=torch.bfloat16)
538
+
539
+ grid = lambda META: (
540
+ triton.cdiv(M, META['BLOCK_M']),
541
+ triton.cdiv(N, META['BLOCK_N']),
542
+ )
543
+
544
+ _cluster_matmul_dsmem_kernel[grid](
545
+ a, b, c,
546
+ M, N, K,
547
+ a.stride(0), a.stride(1),
548
+ b.stride(0), b.stride(1),
549
+ c.stride(0), c.stride(1),
550
+ )
551
+
552
+ return c
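A quick correctness sketch for the wrapper above (hypothetical helper, not part of the module; the tolerance expectation is an assumption for BF16 inputs with FP32 accumulation):

    def _check_cluster_matmul(m: int = 1024, n: int = 1024, k: int = 1024) -> float:
        a = torch.randn(m, k, device="cuda", dtype=torch.bfloat16)
        b = torch.randn(k, n, device="cuda", dtype=torch.bfloat16)
        ref = a.float() @ b.float()                 # FP32 reference matmul
        out = cluster_matmul_dsmem(a, b).float()    # kernel under test
        # Max relative error; expected to stay small for a BF16 GEMM with FP32 accumulation.
        return ((out - ref).abs() / ref.abs().clamp_min(1e-3)).max().item()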
553
+
554
+
555
+ # =============================================================================
556
+ # Cluster Attention with DSMEM (Preview)
557
+ # =============================================================================
558
+
559
+ @triton.jit
560
+ def _cluster_attention_kernel(
561
+ q_ptr, k_ptr, v_ptr, o_ptr,
562
+ M, N, D, # seq_len, kv_len, head_dim
563
+ stride_qm, stride_qd, stride_kn, stride_kd, stride_vn, stride_vd,
564
+ stride_om, stride_od,
565
+ scale,
566
+ BLOCK_M: tl.constexpr,
567
+ BLOCK_N: tl.constexpr,
568
+ BLOCK_D: tl.constexpr,
569
+ ):
570
+ """
571
+ Flash-Attention with 2-CTA cluster cooperation.
572
+
573
+ CTA cooperation strategy:
574
+ - CTA 0: Handles even KV blocks
575
+ - CTA 1: Handles odd KV blocks
576
+ - Both: Merge via DSMEM for softmax normalization
577
+ """
578
+ pid_m = tl.program_id(0)
579
+
580
+ # Load Q tile (both CTAs load same Q)
581
+ q_block = tl.make_block_ptr(
582
+ base=q_ptr, shape=(M, D), strides=(stride_qm, stride_qd),
583
+ offsets=(pid_m * BLOCK_M, 0), block_shape=(BLOCK_M, BLOCK_D),
584
+ order=(1, 0)
585
+ )
586
+ q = tl.load(q_block, boundary_check=(0, 1))
587
+
588
+ # Running max and sum for online softmax
589
+ m_i = tl.zeros((BLOCK_M,), dtype=tl.float32) - float('inf')
590
+ l_i = tl.zeros((BLOCK_M,), dtype=tl.float32)
591
+ acc = tl.zeros((BLOCK_M, BLOCK_D), dtype=tl.float32)
592
+
593
+ # Iterate over KV blocks
594
+ for kv_block_idx in range(0, tl.cdiv(N, BLOCK_N)):
595
+ k_block = tl.make_block_ptr(
596
+ base=k_ptr, shape=(N, D), strides=(stride_kn, stride_kd),
597
+ offsets=(kv_block_idx * BLOCK_N, 0), block_shape=(BLOCK_N, BLOCK_D),
598
+ order=(1, 0)
599
+ )
600
+ v_block = tl.make_block_ptr(
601
+ base=v_ptr, shape=(N, D), strides=(stride_vn, stride_vd),
602
+ offsets=(kv_block_idx * BLOCK_N, 0), block_shape=(BLOCK_N, BLOCK_D),
603
+ order=(1, 0)
604
+ )
605
+
606
+ k = tl.load(k_block, boundary_check=(0, 1))
607
+ v = tl.load(v_block, boundary_check=(0, 1))
608
+
609
+ # QK^T
610
+ qk = tl.dot(q, tl.trans(k)) * scale
611
+
612
+ # Online softmax
613
+ m_ij = tl.max(qk, axis=1)
614
+ m_new = tl.maximum(m_i, m_ij)
615
+ alpha = tl.exp(m_i - m_new)
616
+ p = tl.exp(qk - m_new[:, None])
617
+
618
+ l_i = alpha * l_i + tl.sum(p, axis=1)
619
+ acc = alpha[:, None] * acc + tl.dot(p.to(q.dtype), v)
620
+ m_i = m_new
621
+
622
+ # Normalize
623
+ acc = acc / l_i[:, None]
624
+
625
+ # Store output
626
+ o_block = tl.make_block_ptr(
627
+ base=o_ptr, shape=(M, D), strides=(stride_om, stride_od),
628
+ offsets=(pid_m * BLOCK_M, 0), block_shape=(BLOCK_M, BLOCK_D),
629
+ order=(1, 0)
630
+ )
631
+ tl.store(o_block, acc.to(tl.bfloat16), boundary_check=(0, 1))
632
+
633
+
634
+ def cluster_attention(
635
+ q: torch.Tensor,
636
+ k: torch.Tensor,
637
+ v: torch.Tensor,
638
+ scale: Optional[float] = None
639
+ ) -> torch.Tensor:
640
+ """
641
+ Flash-Attention with cluster cooperation.
642
+
643
+ Args:
644
+ q: Query tensor [batch, heads, seq_len, head_dim]
645
+ k: Key tensor [batch, heads, kv_len, head_dim]
646
+ v: Value tensor [batch, heads, kv_len, head_dim]
647
+ scale: Attention scale (default: 1/sqrt(head_dim))
648
+
649
+ Returns:
650
+ Output tensor [batch, heads, seq_len, head_dim]
651
+ """
652
+ batch, heads, seq_len, head_dim = q.shape
653
+ _, _, kv_len, _ = k.shape
654
+
655
+ if scale is None:
656
+ scale = head_dim ** -0.5
657
+
658
+ # Reshape for kernel
659
+ q_2d = q.view(batch * heads * seq_len, head_dim).contiguous()
660
+ k_2d = k.view(batch * heads * kv_len, head_dim).contiguous()
661
+ v_2d = v.view(batch * heads * kv_len, head_dim).contiguous()
662
+ o_2d = torch.empty_like(q_2d)
663
+
664
+ M = batch * heads * seq_len
665
+ N = kv_len
666
+ D = head_dim
667
+
668
+ BLOCK_M = 64
669
+ BLOCK_N = 64
670
+ BLOCK_D = head_dim
671
+
672
+ grid = (triton.cdiv(M, BLOCK_M),)
673
+
674
+ _cluster_attention_kernel[grid](
675
+ q_2d, k_2d, v_2d, o_2d,
676
+ M, N, D,
677
+ q_2d.stride(0), q_2d.stride(1),
678
+ k_2d.stride(0), k_2d.stride(1),
679
+ v_2d.stride(0), v_2d.stride(1),
680
+ o_2d.stride(0), o_2d.stride(1),
681
+ scale,
682
+ BLOCK_M=BLOCK_M,
683
+ BLOCK_N=BLOCK_N,
684
+ BLOCK_D=BLOCK_D,
685
+ num_ctas=2, # Enable 2-CTA mode
686
+ num_warps=4,
687
+ num_stages=2,
688
+ )
689
+
690
+ return o_2d.view(batch, heads, seq_len, head_dim)
691
+
692
+
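# --------------------------------------------------------------------------
# [Editorial sketch, not part of the uploaded diff above] Minimal call into
# cluster_attention() with the documented [batch, heads, seq_len, head_dim]
# layout, sanity-checked against PyTorch SDPA. Two caveats visible in the
# preview code above: the wrapper flattens batch and heads into the query
# dimension while only kv_len key rows are indexed, and the kernel walks
# every KV block in a single program (the even/odd CTA split described in
# its docstring is not implemented yet). The check therefore uses
# batch=1, heads=1, non-causal attention.
# --------------------------------------------------------------------------
def _sketch_cluster_attention_usage():
    import torch
    q = torch.randn(1, 1, 512, 128, device='cuda', dtype=torch.bfloat16)
    k = torch.randn(1, 1, 512, 128, device='cuda', dtype=torch.bfloat16)
    v = torch.randn(1, 1, 512, 128, device='cuda', dtype=torch.bfloat16)
    o = cluster_attention(q, k, v)                      # [1, 1, 512, 128]
    ref = torch.nn.functional.scaled_dot_product_attention(q, k, v)
    max_diff = (o.float() - ref.float()).abs().max().item()
    print(f"max |diff| vs SDPA: {max_diff:.3e}")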
693
+ # =============================================================================
694
+ # Super-Cluster API (Vera Rubin / NVL72 - Future)
695
+ # =============================================================================
696
+
697
+ @dataclass
698
+ class SuperClusterConfig:
699
+ """
700
+ Configuration for Vera Rubin Super-Clusters.
701
+
702
+ NVL72: 72 GPUs with 3.6 TB/s NVLink 6 per GPU
703
+ NVL144: 144 GPUs (2 racks) with coherent memory
704
+ """
705
+ num_gpus: int = 72
706
+ nvlink_version: int = 6
707
+ bandwidth_tb_s: float = 3.6
708
+ use_coherent_memory: bool = True
709
+
710
+ @property
711
+ def total_bandwidth_tb_s(self) -> float:
712
+ return self.num_gpus * self.bandwidth_tb_s
713
+
714
+
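# --------------------------------------------------------------------------
# [Editorial sketch, not part of the uploaded diff above] The
# total_bandwidth_tb_s property is simply num_gpus * per-GPU NVLink
# bandwidth; with the NVL72 defaults that is 72 * 3.6 = 259.2 TB/s
# aggregate across the rack.
# --------------------------------------------------------------------------
def _sketch_super_cluster_bandwidth():
    cfg = SuperClusterConfig()                 # defaults: 72 GPUs, 3.6 TB/s each
    assert abs(cfg.total_bandwidth_tb_s - 259.2) < 1e-9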
715
+ def init_super_cluster(config: SuperClusterConfig) -> bool:
716
+ """
717
+ Initialize Super-Cluster for rack-scale computation.
718
+
719
+ Note: Requires Vera Rubin hardware (expected 2H 2026).
720
+ Currently returns False on pre-Rubin systems.
721
+ """
722
+ # Check for Vera Rubin (SM 13.0+)
723
+ if torch.cuda.is_available():
724
+ props = torch.cuda.get_device_properties(0)
725
+ if props.major >= 13: # Vera Rubin
726
+ # Future: Initialize NVLink 6 collective
727
+ return True
728
+
729
+ return False
730
+
731
+
732
+ # =============================================================================
733
+ # Benchmark
734
+ # =============================================================================
735
+
736
+ def benchmark_dsmem():
737
+ """Benchmark DSMEM cluster operations."""
738
+ import time
739
+
740
+ print("=" * 60)
741
+ print("FireEcho DSMEM Cluster Benchmark")
742
+ print("=" * 60)
743
+
744
+ sizes = [(2048, 2048, 2048), (4096, 4096, 4096), (8192, 8192, 8192)]
745
+
746
+ for M, N, K in sizes:
747
+ a = torch.randn(M, K, device='cuda', dtype=torch.bfloat16)
748
+ b = torch.randn(K, N, device='cuda', dtype=torch.bfloat16)
749
+
750
+ # Warmup
751
+ for _ in range(3):
752
+ _ = cluster_matmul_dsmem(a, b)
753
+ torch.cuda.synchronize()
754
+
755
+ # Benchmark
756
+ start = time.perf_counter()
757
+ iters = 100
758
+ for _ in range(iters):
759
+ c = cluster_matmul_dsmem(a, b)
760
+ torch.cuda.synchronize()
761
+ elapsed = time.perf_counter() - start
762
+
763
+ flops = 2 * M * N * K * iters
764
+ tflops = flops / elapsed / 1e12
765
+
766
+ print(f" {M}x{N}x{K}: {tflops:.1f} TFLOPS ({elapsed/iters*1000:.2f}ms/iter)")
767
+
768
+ print()
769
+
770
+
771
+ if __name__ == '__main__':
772
+ print("Testing DSMEM cluster operations...")
773
+ print()
774
+
775
+ # Basic test
776
+ a = torch.randn(4096, 4096, device='cuda', dtype=torch.bfloat16)
777
+ b = torch.randn(4096, 4096, device='cuda', dtype=torch.bfloat16)
778
+
779
+ c = cluster_matmul_dsmem(a, b)
780
+ c_ref = torch.matmul(a, b)
781
+
782
+ rel_err = torch.norm(c.float() - c_ref.float()) / torch.norm(c_ref.float())
783
+ print(f"Cluster MatMul DSMEM:")
784
+ print(f" Output shape: {c.shape}")
785
+ print(f" Relative error: {rel_err:.2e}")
786
+ print()
787
+
788
+ # Benchmark
789
+ benchmark_dsmem()
FireEcho Engine/eagle_data_codemix_cache.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:be37ced722408193210259dda063935f3886ccbb6b2b100c06d5d925d7a7242b
3
+ size 151376367
FireEcho Engine/eagle_data_codemix_cache.pt.bak ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e9b8667e8946514f3d6d90d66df6ee45603a7095c734b72a6d88be9906d6659d
3
+ size 25337149
FireEcho Engine/eagle_data_codemix_cache_old.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9f8eacaa8701aac02c030d2c304993969e64236a2f220aef29ed8aefe305e754
3
+ size 75374285
FireEcho Engine/eagle_data_selfgen_cache.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c42d8cb1c64cb824f2697487d51a2cda64b757218ef0e2c0093cb6ced0398e74
3
+ size 9930893
FireEcho Engine/eagle_data_selfgen_cache.pt.old ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ffabde184ee598914bad8b5150fdbf0f8e24214c5b371301556b1c5db0895f98
3
+ size 4643021
FireEcho Engine/eagle_precompute.log ADDED
The diff for this file is too large to render. See raw diff
 
FireEcho Engine/eagle_precompute_goddess.log ADDED
The diff for this file is too large to render. See raw diff
 
FireEcho Engine/eagle_precompute_v2.log ADDED
@@ -0,0 +1,1220 @@
1
+ ============================================================
2
+ EAGLE-3 Draft Head Training — PRECOMPUTE mode
3
+ ============================================================
4
+ Epochs: 10
5
+ Max samples: 20000
6
+ Max seq len: 512
7
+ LR: 0.0001, warmup: 2000
8
+ Draft depth (K): 7
9
+ Grad accum: 4, clip: 0.5
10
+ Capture layers: (8, 24, 47)
11
+ Head layers: 2
12
+ Loss type: fwd_kl
13
+ Focal gamma: 2.0
14
+ TTT mixing: ratio=0.5, warmup=5000 steps
15
+ Top-K logits: 64
16
+ Flatness filter: 100%
17
+ Precompute dir: /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_precomputed
18
+
19
+ [1/4] Loading model...
20
+ [Auto-detect] Qwen3-Omni MoE thinker (30.5B total, ~3.3B active)
21
+ [FireEcho] Loading /run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct...
22
+ [FireEcho] AutoConfig failed ('Qwen3OmniMoeTalkerCodePredictorConfig' object has no attribute 'use_sliding_window'), loading config.json directly
23
+ Qwen3-Omni: will stream-load from 15 shards
24
+ [Qwen3 Streaming] Loaded shard index: 28010 keys across 15 shards
25
+ [Qwen3 Streaming] Building engine skeleton...
26
+ [Qwen3 Streaming] Global params on GPU: 1.2 GB
27
+ Layer 4/48: 393 weights, VRAM 2.8 GB, CPU 1.4 GB
28
+ Layer 8/48: 393 weights, VRAM 4.3 GB, CPU 1.6 GB
29
+ Layer 12/48: 393 weights, VRAM 5.8 GB, CPU 1.7 GB
30
+ Layer 16/48: 393 weights, VRAM 7.4 GB, CPU 1.9 GB
31
+ Layer 20/48: 393 weights, VRAM 8.9 GB, CPU 2.0 GB
32
+ Layer 24/48: 393 weights, VRAM 10.4 GB, CPU 2.2 GB
33
+ Layer 28/48: 393 weights, VRAM 11.9 GB, CPU 2.3 GB
34
+ Layer 32/48: 393 weights, VRAM 13.5 GB, CPU 2.5 GB
35
+ Layer 36/48: 393 weights, VRAM 15.0 GB, CPU 2.6 GB
36
+ Layer 40/48: 393 weights, VRAM 16.5 GB, CPU 2.8 GB
37
+ Layer 44/48: 393 weights, VRAM 18.0 GB, CPU 2.9 GB
38
+ Layer 48/48: 393 weights, VRAM 19.6 GB, CPU 3.1 GB
39
+ [Qwen3 Streaming] Final VRAM: 19.6 GB (FP4 quantized)
40
+ [Qwen3 Streaming] Done: 1571.8M params, 18867 weights loaded
41
+ Total params: 1.57B
42
+ Frozen params: 1.54B (base model, FP4)
43
+ Trainable params: 30.2M (Hebbian only)
44
+ [Flat KV] Enabled: 4096 tokens, 403 MB
45
+ [Packed MoE] 48 layers packed (6144 experts → contiguous)
46
+
47
+ [2/4] Enabling EAGLE-3 draft head...
48
+ [EAGLE-3] Draft head: D=2, 104.9M params, 210 MB, capture layers [8, 24, 47] + Hebbian memory
49
+ Trainable eagle params: 104.9M
50
+
51
+ [3/5] Loading external dataset...
52
+ Loading cached dataset from /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_data_codemix_cache.pt...
53
+ Loaded 20000 samples.
54
+
55
+ [PRECOMPUTE] Running target model on 20000 samples...
56
+ Precomputed 100/20000 (0.1 samples/s, ETA 2794min)
57
+ Precomputed 200/20000 (0.2 samples/s, ETA 1654min)
58
+ Precomputed 300/20000 (0.3 samples/s, ETA 1265min)
59
+ Precomputed 400/20000 (0.3 samples/s, ETA 1045min)
60
+ Precomputed 500/20000 (0.4 samples/s, ETA 918min)
61
+ Precomputed 600/20000 (0.4 samples/s, ETA 825min)
62
+ Precomputed 700/20000 (0.4 samples/s, ETA 758min)
63
+ Precomputed 800/20000 (0.5 samples/s, ETA 708min)
64
+ Precomputed 900/20000 (0.5 samples/s, ETA 667min)
65
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
66
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
67
+ [FE-MX] Expert tiers: 1 cold(FP4) / 2 warm(FP6) / 125 hot(FP8)
68
+ [FE-MX] Expert tiers: 3 cold(FP4) / 1 warm(FP6) / 124 hot(FP8)
69
+ [FE-MX] Expert tiers: 2 cold(FP4) / 0 warm(FP6) / 126 hot(FP8)
70
+ [FE-MX] Expert tiers: 1 cold(FP4) / 2 warm(FP6) / 125 hot(FP8)
71
+ [FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
72
+ [FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
73
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
74
+ [FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
75
+ [FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
76
+ [FE-MX] Expert tiers: 1 cold(FP4) / 1 warm(FP6) / 126 hot(FP8)
77
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
78
+ [FE-MX] Expert tiers: 1 cold(FP4) / 1 warm(FP6) / 126 hot(FP8)
79
+ [FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
80
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
81
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
82
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
83
+ [FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
84
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
85
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
86
+ [FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
87
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
88
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
89
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
90
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
91
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
92
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
93
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
94
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
95
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
96
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
97
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
98
+ [FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
99
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
100
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
101
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
102
+ [FE-MX] Expert tiers: 1 cold(FP4) / 1 warm(FP6) / 126 hot(FP8)
103
+ [FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
104
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
105
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
106
+ [FE-MX] Expert tiers: 4 cold(FP4) / 2 warm(FP6) / 122 hot(FP8)
107
+ [FE-MX] Expert tiers: 5 cold(FP4) / 0 warm(FP6) / 123 hot(FP8)
108
+ [FE-MX] Expert tiers: 4 cold(FP4) / 1 warm(FP6) / 123 hot(FP8)
109
+ [FE-MX] Expert tiers: 4 cold(FP4) / 0 warm(FP6) / 124 hot(FP8)
110
+ [FE-MX] Expert tiers: 5 cold(FP4) / 0 warm(FP6) / 123 hot(FP8)
111
+ [FE-MX] Expert tiers: 6 cold(FP4) / 1 warm(FP6) / 121 hot(FP8)
112
+ [FE-MX] Expert tiers: 7 cold(FP4) / 0 warm(FP6) / 121 hot(FP8)
113
+ Precomputed 1000/20000 (0.5 samples/s, ETA 630min)
114
+ Precomputed 1100/20000 (0.5 samples/s, ETA 600min)
115
+ Precomputed 1200/20000 (0.5 samples/s, ETA 575min)
116
+ Precomputed 1300/20000 (0.6 samples/s, ETA 552min)
117
+ Precomputed 1400/20000 (0.6 samples/s, ETA 531min)
118
+ Precomputed 1500/20000 (0.6 samples/s, ETA 514min)
119
+ Precomputed 1600/20000 (0.6 samples/s, ETA 496min)
120
+ Precomputed 1700/20000 (0.6 samples/s, ETA 481min)
121
+ Precomputed 1800/20000 (0.6 samples/s, ETA 467min)
122
+ Precomputed 1900/20000 (0.7 samples/s, ETA 455min)
123
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
124
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
125
+ [FE-MX] Expert tiers: 1 cold(FP4) / 2 warm(FP6) / 125 hot(FP8)
126
+ [FE-MX] Expert tiers: 3 cold(FP4) / 0 warm(FP6) / 125 hot(FP8)
127
+ [FE-MX] Expert tiers: 2 cold(FP4) / 0 warm(FP6) / 126 hot(FP8)
128
+ [FE-MX] Expert tiers: 0 cold(FP4) / 3 warm(FP6) / 125 hot(FP8)
129
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
130
+ [FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
131
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
132
+ [FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
133
+ [FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
134
+ [FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
135
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
136
+ [FE-MX] Expert tiers: 1 cold(FP4) / 1 warm(FP6) / 126 hot(FP8)
137
+ [FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
138
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
139
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
140
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
141
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
142
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
143
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
144
+ [FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
145
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
146
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
147
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
148
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
149
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
150
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
151
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
152
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
153
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
154
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
155
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
156
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
157
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
158
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
159
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
160
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
161
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
162
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
163
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
164
+ [FE-MX] Expert tiers: 3 cold(FP4) / 3 warm(FP6) / 122 hot(FP8)
165
+ [FE-MX] Expert tiers: 5 cold(FP4) / 0 warm(FP6) / 123 hot(FP8)
166
+ [FE-MX] Expert tiers: 4 cold(FP4) / 0 warm(FP6) / 124 hot(FP8)
167
+ [FE-MX] Expert tiers: 4 cold(FP4) / 0 warm(FP6) / 124 hot(FP8)
168
+ [FE-MX] Expert tiers: 5 cold(FP4) / 0 warm(FP6) / 123 hot(FP8)
169
+ [FE-MX] Expert tiers: 6 cold(FP4) / 1 warm(FP6) / 121 hot(FP8)
170
+ [FE-MX] Expert tiers: 7 cold(FP4) / 0 warm(FP6) / 121 hot(FP8)
171
+ Precomputed 2000/20000 (0.7 samples/s, ETA 443min)
172
+ Precomputed 2100/20000 (0.7 samples/s, ETA 432min)
173
+ Precomputed 2200/20000 (0.7 samples/s, ETA 423min)
174
+ Precomputed 2300/20000 (0.7 samples/s, ETA 415min)
175
+ Precomputed 2400/20000 (0.7 samples/s, ETA 407min)
176
+ Precomputed 2500/20000 (0.7 samples/s, ETA 399min)
177
+ Precomputed 2600/20000 (0.7 samples/s, ETA 392min)
178
+ Precomputed 2700/20000 (0.7 samples/s, ETA 385min)
179
+ Precomputed 2800/20000 (0.8 samples/s, ETA 379min)
180
+ Precomputed 2900/20000 (0.8 samples/s, ETA 374min)
181
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
182
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
183
+ [FE-MX] Expert tiers: 1 cold(FP4) / 1 warm(FP6) / 126 hot(FP8)
184
+ [FE-MX] Expert tiers: 2 cold(FP4) / 1 warm(FP6) / 125 hot(FP8)
185
+ [FE-MX] Expert tiers: 2 cold(FP4) / 0 warm(FP6) / 126 hot(FP8)
186
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
187
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
188
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
189
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
190
+ [FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
191
+ [FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
192
+ [FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
193
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
194
+ [FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
195
+ [FE-MX] Expert tiers: 1 cold(FP4) / 0 warm(FP6) / 127 hot(FP8)
196
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
197
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
198
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
199
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
200
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
201
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
202
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
203
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
204
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
205
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
206
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
207
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
208
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
209
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
210
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
211
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
212
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
213
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
214
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
215
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
216
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
217
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
218
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
219
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
220
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
221
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
222
+ [FE-MX] Expert tiers: 2 cold(FP4) / 4 warm(FP6) / 122 hot(FP8)
223
+ [FE-MX] Expert tiers: 5 cold(FP4) / 0 warm(FP6) / 123 hot(FP8)
224
+ [FE-MX] Expert tiers: 3 cold(FP4) / 1 warm(FP6) / 124 hot(FP8)
225
+ [FE-MX] Expert tiers: 4 cold(FP4) / 0 warm(FP6) / 124 hot(FP8)
226
+ [FE-MX] Expert tiers: 5 cold(FP4) / 0 warm(FP6) / 123 hot(FP8)
227
+ [FE-MX] Expert tiers: 6 cold(FP4) / 0 warm(FP6) / 122 hot(FP8)
228
+ [FE-MX] Expert tiers: 6 cold(FP4) / 1 warm(FP6) / 121 hot(FP8)
229
+ Precomputed 3000/20000 (0.8 samples/s, ETA 369min)
230
+ Precomputed 3100/20000 (0.8 samples/s, ETA 363min)
231
+ Precomputed 3200/20000 (0.8 samples/s, ETA 358min)
232
+ Precomputed 3300/20000 (0.8 samples/s, ETA 353min)
233
+ Precomputed 3400/20000 (0.8 samples/s, ETA 348min)
234
+ Precomputed 3500/20000 (0.8 samples/s, ETA 343min)
235
+ Precomputed 3600/20000 (0.8 samples/s, ETA 338min)
236
+ Precomputed 3700/20000 (0.8 samples/s, ETA 333min)
237
+ Precomputed 3800/20000 (0.8 samples/s, ETA 329min)
238
+ Precomputed 3900/20000 (0.8 samples/s, ETA 324min)
239
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
240
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
241
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
242
+ [FE-MX] Expert tiers: 0 cold(FP4) / 3 warm(FP6) / 125 hot(FP8)
243
+ [FE-MX] Expert tiers: 0 cold(FP4) / 2 warm(FP6) / 126 hot(FP8)
244
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
245
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
246
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
247
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
248
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
249
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
250
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
251
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
252
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
253
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
254
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
255
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
256
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
257
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
258
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
259
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
260
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
261
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
262
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
263
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
264
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
265
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
266
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
267
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
268
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
269
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
270
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
271
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
272
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
273
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
274
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
275
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
276
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
277
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
278
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
279
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
280
+ [FE-MX] Expert tiers: 0 cold(FP4) / 3 warm(FP6) / 125 hot(FP8)
281
+ [FE-MX] Expert tiers: 0 cold(FP4) / 4 warm(FP6) / 124 hot(FP8)
282
+ [FE-MX] Expert tiers: 0 cold(FP4) / 3 warm(FP6) / 125 hot(FP8)
283
+ [FE-MX] Expert tiers: 0 cold(FP4) / 4 warm(FP6) / 124 hot(FP8)
284
+ [FE-MX] Expert tiers: 0 cold(FP4) / 5 warm(FP6) / 123 hot(FP8)
285
+ [FE-MX] Expert tiers: 0 cold(FP4) / 6 warm(FP6) / 122 hot(FP8)
286
+ [FE-MX] Expert tiers: 0 cold(FP4) / 6 warm(FP6) / 122 hot(FP8)
287
+ Precomputed 4000/20000 (0.8 samples/s, ETA 320min)
288
+ Precomputed 4100/20000 (0.8 samples/s, ETA 316min)
289
+ Precomputed 4200/20000 (0.8 samples/s, ETA 312min)
290
+ Precomputed 4300/20000 (0.8 samples/s, ETA 309min)
291
+ Precomputed 4400/20000 (0.9 samples/s, ETA 305min)
292
+ Precomputed 4500/20000 (0.9 samples/s, ETA 301min)
293
+ Precomputed 4600/20000 (0.9 samples/s, ETA 298min)
294
+ Precomputed 4700/20000 (0.9 samples/s, ETA 294min)
295
+ Precomputed 4800/20000 (0.9 samples/s, ETA 291min)
296
+ Precomputed 4900/20000 (0.9 samples/s, ETA 287min)
297
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
298
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
299
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
300
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
301
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
302
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
303
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
304
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
305
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
306
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
307
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
308
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
309
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
310
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
311
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
312
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
313
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
314
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
315
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
316
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
317
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
318
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
319
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
320
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
321
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
322
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
323
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
324
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
325
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
326
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
327
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
328
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
329
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
330
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
331
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
332
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
333
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
334
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
335
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
336
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
337
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
338
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
339
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
340
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
341
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
342
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
343
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
344
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
345
+ Precomputed 5000/20000 (0.9 samples/s, ETA 284min)
346
+ Precomputed 5100/20000 (0.9 samples/s, ETA 281min)
347
+ Precomputed 5200/20000 (0.9 samples/s, ETA 278min)
348
+ Precomputed 5300/20000 (0.9 samples/s, ETA 275min)
349
+ Precomputed 5400/20000 (0.9 samples/s, ETA 272min)
350
+ Precomputed 5500/20000 (0.9 samples/s, ETA 269min)
351
+ Precomputed 5600/20000 (0.9 samples/s, ETA 266min)
352
+ Precomputed 5700/20000 (0.9 samples/s, ETA 263min)
353
+ Precomputed 5800/20000 (0.9 samples/s, ETA 260min)
354
+ Precomputed 5900/20000 (0.9 samples/s, ETA 258min)
355
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
356
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
357
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
358
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
359
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
360
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
361
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
362
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
363
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
364
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
365
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
366
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
367
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
368
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
369
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
370
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
371
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
372
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
373
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
374
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
375
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
376
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
377
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
378
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
379
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
380
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
381
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
382
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
383
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
384
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
385
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
386
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
387
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
388
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
389
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
390
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
391
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
392
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
393
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
394
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
395
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
396
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
397
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
398
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
399
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
400
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
401
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
402
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
403
+ Precomputed 6000/20000 (0.9 samples/s, ETA 255min)
404
+ Precomputed 6100/20000 (0.9 samples/s, ETA 252min)
405
+ Precomputed 6200/20000 (0.9 samples/s, ETA 250min)
406
+ Precomputed 6300/20000 (0.9 samples/s, ETA 247min)
407
+ Precomputed 6400/20000 (0.9 samples/s, ETA 244min)
408
+ Precomputed 6500/20000 (0.9 samples/s, ETA 242min)
409
+ Precomputed 6600/20000 (0.9 samples/s, ETA 239min)
410
+ Precomputed 6700/20000 (0.9 samples/s, ETA 237min)
411
+ Precomputed 6800/20000 (0.9 samples/s, ETA 235min)
412
+ Precomputed 6900/20000 (0.9 samples/s, ETA 232min)
413
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
414
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
415
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
416
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
417
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
418
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
419
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
420
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
421
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
422
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
423
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
424
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
425
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
426
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
427
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
428
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
429
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
430
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
431
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
432
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
433
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
434
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
435
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
436
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
437
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
438
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
439
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
440
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
441
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
442
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
443
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
444
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
445
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
446
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
447
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
448
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
449
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
450
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
451
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
452
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
453
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
454
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
455
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
456
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
457
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
458
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
459
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
460
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
461
+ Precomputed 7000/20000 (0.9 samples/s, ETA 230min)
462
+ Precomputed 7100/20000 (0.9 samples/s, ETA 227min)
463
+ Precomputed 7200/20000 (0.9 samples/s, ETA 225min)
464
+ Precomputed 7300/20000 (1.0 samples/s, ETA 223min)
465
+ Precomputed 7400/20000 (1.0 samples/s, ETA 220min)
466
+ Precomputed 7500/20000 (1.0 samples/s, ETA 218min)
467
+ Precomputed 7600/20000 (1.0 samples/s, ETA 216min)
468
+ Precomputed 7700/20000 (1.0 samples/s, ETA 214min)
469
+ Precomputed 7800/20000 (1.0 samples/s, ETA 212min)
470
+ Precomputed 7900/20000 (1.0 samples/s, ETA 209min)
471
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
472
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
473
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
474
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
475
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
476
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
477
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
478
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
479
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
480
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
481
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
482
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
483
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
484
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
485
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
486
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
487
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
488
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
489
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
490
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
491
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
492
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
493
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
494
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
495
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
496
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
497
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
498
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
499
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
500
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
501
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
502
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
503
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
504
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
505
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
506
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
507
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
508
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
509
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
510
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
511
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
512
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
513
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
514
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
515
+ [FE-MX] Expert tiers: 0 cold(FP4) / 1 warm(FP6) / 127 hot(FP8)
516
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
517
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
518
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
519
+ Precomputed 8000/20000 (1.0 samples/s, ETA 207min)
520
+ Precomputed 8100/20000 (1.0 samples/s, ETA 205min)
521
+ Precomputed 8200/20000 (1.0 samples/s, ETA 203min)
522
+ Precomputed 8300/20000 (1.0 samples/s, ETA 201min)
523
+ Precomputed 8400/20000 (1.0 samples/s, ETA 199min)
524
+ Precomputed 8500/20000 (1.0 samples/s, ETA 197min)
525
+ Precomputed 8600/20000 (1.0 samples/s, ETA 195min)
526
+ Precomputed 8700/20000 (1.0 samples/s, ETA 193min)
527
+ Precomputed 8800/20000 (1.0 samples/s, ETA 190min)
528
+ Precomputed 8900/20000 (1.0 samples/s, ETA 188min)
529
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
530
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
531
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
532
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
533
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
534
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
535
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
536
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
537
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
538
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
539
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
540
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
541
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
542
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
543
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
544
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
545
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
546
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
547
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
548
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
549
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
550
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
551
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
552
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
553
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
554
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
555
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
556
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
557
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
558
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
559
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
560
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
561
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
562
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
563
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
564
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
565
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
566
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
567
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
568
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
569
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
570
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
571
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
572
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
573
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
574
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
575
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
576
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
577
+ Precomputed 9000/20000 (1.0 samples/s, ETA 186min)
578
+ Precomputed 9100/20000 (1.0 samples/s, ETA 184min)
579
+ Precomputed 9200/20000 (1.0 samples/s, ETA 182min)
580
+ Precomputed 9300/20000 (1.0 samples/s, ETA 181min)
581
+ Precomputed 9400/20000 (1.0 samples/s, ETA 179min)
582
+ Precomputed 9500/20000 (1.0 samples/s, ETA 177min)
583
+ Precomputed 9600/20000 (1.0 samples/s, ETA 175min)
584
+ Precomputed 9700/20000 (1.0 samples/s, ETA 173min)
585
+ Precomputed 9800/20000 (1.0 samples/s, ETA 171min)
586
+ Precomputed 9900/20000 (1.0 samples/s, ETA 169min)
587
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
588
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
589
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
590
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
591
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
592
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
593
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
594
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
595
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
596
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
597
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
598
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
599
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
600
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
601
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
602
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
603
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
604
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
605
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
606
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
607
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
608
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
609
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
610
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
611
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
612
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
613
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
614
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
615
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
616
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
617
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
618
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
619
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
620
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
621
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
622
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
623
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
624
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
625
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
626
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
627
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
628
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
629
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
630
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
631
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
632
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
633
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
634
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
635
+ Precomputed 10000/20000 (1.0 samples/s, ETA 167min)
636
+ Precomputed 10100/20000 (1.0 samples/s, ETA 165min)
637
+ Precomputed 10200/20000 (1.0 samples/s, ETA 163min)
638
+ Precomputed 10300/20000 (1.0 samples/s, ETA 161min)
639
+ Precomputed 10400/20000 (1.0 samples/s, ETA 160min)
640
+ Precomputed 10500/20000 (1.0 samples/s, ETA 158min)
641
+ Precomputed 10600/20000 (1.0 samples/s, ETA 156min)
642
+ Precomputed 10700/20000 (1.0 samples/s, ETA 154min)
643
+ Precomputed 10800/20000 (1.0 samples/s, ETA 152min)
644
+ Precomputed 10900/20000 (1.0 samples/s, ETA 150min)
645
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
646
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
647
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
648
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
649
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
650
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
651
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
652
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
653
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
654
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
655
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
656
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
657
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
658
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
659
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
660
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
661
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
662
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
663
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
664
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
665
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
666
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
667
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
668
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
669
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
670
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
671
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
672
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
673
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
674
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
675
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
676
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
677
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
678
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
679
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
680
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
681
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
682
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
683
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
684
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
685
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
686
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
687
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
688
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
689
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
690
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
691
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
692
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
693
+ Precomputed 11000/20000 (1.0 samples/s, ETA 149min)
694
+ Precomputed 11100/20000 (1.0 samples/s, ETA 147min)
695
+ Precomputed 11200/20000 (1.0 samples/s, ETA 145min)
696
+ Precomputed 11300/20000 (1.0 samples/s, ETA 143min)
697
+ Precomputed 11400/20000 (1.0 samples/s, ETA 141min)
698
+ Precomputed 11500/20000 (1.0 samples/s, ETA 140min)
699
+ Precomputed 11600/20000 (1.0 samples/s, ETA 138min)
700
+ Precomputed 11700/20000 (1.0 samples/s, ETA 136min)
701
+ Precomputed 11800/20000 (1.0 samples/s, ETA 134min)
702
+ Precomputed 11900/20000 (1.0 samples/s, ETA 132min)
703
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
+ Precomputed 12000/20000 (1.0 samples/s, ETA 131min)
+ Precomputed 12100/20000 (1.0 samples/s, ETA 129min)
+ Precomputed 12200/20000 (1.0 samples/s, ETA 127min)
+ Precomputed 12300/20000 (1.0 samples/s, ETA 125min)
+ Precomputed 12400/20000 (1.0 samples/s, ETA 124min)
+ Precomputed 12500/20000 (1.0 samples/s, ETA 122min)
+ Precomputed 12600/20000 (1.0 samples/s, ETA 120min)
+ Precomputed 12700/20000 (1.0 samples/s, ETA 119min)
+ Precomputed 12800/20000 (1.0 samples/s, ETA 117min)
+ Precomputed 12900/20000 (1.0 samples/s, ETA 115min)
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
+ Precomputed 13000/20000 (1.0 samples/s, ETA 113min)
+ Precomputed 13100/20000 (1.0 samples/s, ETA 112min)
+ Precomputed 13200/20000 (1.0 samples/s, ETA 110min)
+ Precomputed 13300/20000 (1.0 samples/s, ETA 108min)
+ Precomputed 13400/20000 (1.0 samples/s, ETA 107min)
+ Precomputed 13500/20000 (1.0 samples/s, ETA 105min)
+ Precomputed 13600/20000 (1.0 samples/s, ETA 103min)
+ Precomputed 13700/20000 (1.0 samples/s, ETA 102min)
+ Precomputed 13800/20000 (1.0 samples/s, ETA 100min)
+ Precomputed 13900/20000 (1.0 samples/s, ETA 98min)
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
+ Precomputed 14000/20000 (1.0 samples/s, ETA 96min)
+ Precomputed 14100/20000 (1.0 samples/s, ETA 95min)
+ Precomputed 14200/20000 (1.0 samples/s, ETA 93min)
+ Precomputed 14300/20000 (1.0 samples/s, ETA 91min)
+ Precomputed 14400/20000 (1.0 samples/s, ETA 90min)
+ Precomputed 14500/20000 (1.0 samples/s, ETA 88min)
+ Precomputed 14600/20000 (1.0 samples/s, ETA 86min)
+ Precomputed 14700/20000 (1.0 samples/s, ETA 85min)
+ Precomputed 14800/20000 (1.0 samples/s, ETA 83min)
+ Precomputed 14900/20000 (1.0 samples/s, ETA 81min)
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
+ Precomputed 15000/20000 (1.0 samples/s, ETA 80min)
+ Precomputed 15100/20000 (1.0 samples/s, ETA 78min)
+ Precomputed 15200/20000 (1.0 samples/s, ETA 77min)
+ Precomputed 15300/20000 (1.0 samples/s, ETA 75min)
+ Precomputed 15400/20000 (1.0 samples/s, ETA 73min)
+ Precomputed 15500/20000 (1.0 samples/s, ETA 72min)
+ Precomputed 15600/20000 (1.0 samples/s, ETA 70min)
+ Precomputed 15700/20000 (1.0 samples/s, ETA 68min)
+ Precomputed 15800/20000 (1.0 samples/s, ETA 67min)
+ Precomputed 15900/20000 (1.0 samples/s, ETA 65min)
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
+ Precomputed 16000/20000 (1.1 samples/s, ETA 63min)
+ Precomputed 16100/20000 (1.1 samples/s, ETA 62min)
+ Precomputed 16200/20000 (1.1 samples/s, ETA 60min)
+ Precomputed 16300/20000 (1.1 samples/s, ETA 59min)
+ Precomputed 16400/20000 (1.1 samples/s, ETA 57min)
+ Precomputed 16500/20000 (1.1 samples/s, ETA 55min)
+ Precomputed 16600/20000 (1.1 samples/s, ETA 54min)
+ Precomputed 16700/20000 (1.1 samples/s, ETA 52min)
+ Precomputed 16800/20000 (1.1 samples/s, ETA 51min)
+ Precomputed 16900/20000 (1.1 samples/s, ETA 49min)
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
+ Precomputed 17000/20000 (1.1 samples/s, ETA 47min)
+ Precomputed 17100/20000 (1.1 samples/s, ETA 46min)
+ Precomputed 17200/20000 (1.1 samples/s, ETA 44min)
+ Precomputed 17300/20000 (1.1 samples/s, ETA 43min)
+ Precomputed 17400/20000 (1.1 samples/s, ETA 41min)
+ Precomputed 17500/20000 (1.1 samples/s, ETA 39min)
+ Precomputed 17600/20000 (1.1 samples/s, ETA 38min)
+ Precomputed 17700/20000 (1.1 samples/s, ETA 36min)
+ Precomputed 17800/20000 (1.1 samples/s, ETA 35min)
+ Precomputed 17900/20000 (1.1 samples/s, ETA 33min)
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
+ Precomputed 18000/20000 (1.1 samples/s, ETA 31min)
+ Precomputed 18100/20000 (1.1 samples/s, ETA 30min)
+ Precomputed 18200/20000 (1.1 samples/s, ETA 28min)
+ Precomputed 18300/20000 (1.1 samples/s, ETA 27min)
+ Precomputed 18400/20000 (1.1 samples/s, ETA 25min)
+ Precomputed 18500/20000 (1.1 samples/s, ETA 24min)
+ Precomputed 18600/20000 (1.1 samples/s, ETA 22min)
+ Precomputed 18700/20000 (1.1 samples/s, ETA 20min)
+ Precomputed 18800/20000 (1.1 samples/s, ETA 19min)
+ Precomputed 18900/20000 (1.1 samples/s, ETA 17min)
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
+ Precomputed 19000/20000 (1.1 samples/s, ETA 16min)
+ Precomputed 19100/20000 (1.1 samples/s, ETA 14min)
+ Precomputed 19200/20000 (1.1 samples/s, ETA 13min)
+ Precomputed 19300/20000 (1.1 samples/s, ETA 11min)
+ Precomputed 19400/20000 (1.1 samples/s, ETA 9min)
+ Precomputed 19500/20000 (1.1 samples/s, ETA 8min)
+ Precomputed 19600/20000 (1.1 samples/s, ETA 6min)
+ Precomputed 19700/20000 (1.1 samples/s, ETA 5min)
+ Precomputed 19800/20000 (1.1 samples/s, ETA 3min)
+ Precomputed 19900/20000 (1.1 samples/s, ETA 2min)
+ [FE-MX] Expert tiers: 0 cold(FP4) / 0 warm(FP6) / 128 hot(FP8)
+ Precomputed 20000/20000 (1.1 samples/s, ETA 0min)
+ Precomputed 20000 samples in 311.7min (avg flatness=0.0035)
+
+ [PRECOMPUTE] Done. 20000 samples saved to /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_precomputed
+ Now run Phase 2:
+ python -u train_eagle_head.py --offline --loss_type fwd_kl --lr 5e-5 --draft_depth 3
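The recurring "[FE-MX] Expert tiers" lines above come from the engine's mixed-precision expert packer reporting its tier split; during this precompute run every expert stays in the hot FP8 tier. The following is a minimal, illustrative sketch of usage-based tiering only: the tier_experts() helper, its fraction thresholds, and the usage-count input are assumptions, not the FireEcho Engine's actual API.

```python
# Illustrative sketch: split MoE experts into FP8 (hot), FP6 (warm), FP4 (cold)
# buckets by how often they were routed to. Names and thresholds are hypothetical.
import torch

def tier_experts(usage_counts: torch.Tensor, hot_frac: float = 0.25, warm_frac: float = 0.50):
    """Return (cold, warm, hot) expert index tensors ordered by usage."""
    n = usage_counts.numel()
    order = torch.argsort(usage_counts, descending=True)
    n_hot = int(n * hot_frac)
    n_warm = int(n * warm_frac)
    hot = order[:n_hot]
    warm = order[n_hot:n_hot + n_warm]
    cold = order[n_hot + n_warm:]
    print(f"[FE-MX] Expert tiers: {cold.numel()} cold(FP4) / "
          f"{warm.numel()} warm(FP6) / {hot.numel()} hot(FP8)")
    return cold, warm, hot

# With every expert kept hot, this reproduces the "0 cold / 0 warm / 128 hot"
# lines seen throughout the precompute log.
tier_experts(torch.ones(128), hot_frac=1.0, warm_frac=0.0)
```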
FireEcho Engine/eagle_test.py ADDED
@@ -0,0 +1,164 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ FireEcho EAGLE-3 Test — Speculative Decoding Correctness + Benchmark
4
+ =====================================================================
5
+ Part of the FireEcho Engine — Custom inference kernel for NVIDIA Blackwell
6
+ Copyright (c) 2025-2026 Echo (FireEcho Project). All rights reserved.
7
+
8
+ EAGLE-3 speculative decoding — correctness + benchmark test.
9
+
10
+ Tests:
11
+ 1. Smoke test: speculative_generate() produces valid output
12
+ 2. Correctness: temperature=0 output matches non-speculative generate()
13
+ 3. Speed: effective tok/s with draft head vs baseline
14
+ 4. Acceptance stats: acceptance rate, avg tokens/round
15
+ """
16
+
17
+ import sys, os, time
18
+ import torch
19
+
20
+ ENGINE_DIR = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine"
21
+ sys.path.insert(0, ENGINE_DIR)
22
+ sys.path.insert(0, "/run/media/echo/Echo/ECHO")
23
+
24
+ from hebbian_finetune_demo import load_engine
25
+
26
+ MODEL_PATH = "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct"
27
+ PROMPT = "<|im_start|>system\nYou are a helpful coding assistant.<|im_end|>\n<|im_start|>user\nWrite a Python function to check if a number is prime.<|im_end|>\n<|im_start|>assistant\n"
28
+ MAX_NEW = 80
29
+ DRAFT_DEPTH = 5
30
+
31
+
32
+ def main():
33
+ print("=" * 60)
34
+ print("EAGLE-3 Speculative Decoding Test")
35
+ print("=" * 60)
36
+
37
+ # --- Load model ---
38
+ print("\n[1/5] Loading model...")
39
+ engine, tokenizer, config = load_engine(
40
+ MODEL_PATH, max_seq_len=512, device="cuda",
41
+ )
42
+ engine.eval()
43
+
44
+ # Enable flat decode + pack experts (baseline optimizations)
45
+ engine.kv_cache.enable_flat_decode(4096)
46
+ engine.pack_all_experts()
47
+
48
+ input_ids = tokenizer.encode(PROMPT, return_tensors="pt").to("cuda")
49
+ prompt_len = input_ids.shape[1]
50
+ print(f" Prompt tokens: {prompt_len}")
51
+
52
+ # Stop tokens for Qwen3
53
+ eos_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
54
+ stop_tokens = [eos_id] if eos_id is not None else [151645, 151643]
55
+ print(f" Stop tokens: {stop_tokens}")
56
+
57
+ # --- Warmup pass (triton autotuning) ---
58
+ print(f"\n[2/6] Warmup pass (Triton autotuning)...")
59
+ _ = engine.generate(
60
+ input_ids, max_new_tokens=10, temperature=0.0,
61
+ top_k=0, top_p=1.0, stop_tokens=stop_tokens,
62
+ )
63
+ torch.cuda.synchronize()
64
+ print(f" Warmup done.")
65
+
66
+ # --- Baseline generation (no speculation, no graph) ---
67
+ print(f"\n[3/6] Baseline generate (greedy, no graph, {MAX_NEW} tokens)...")
68
+ torch.cuda.synchronize()
69
+ t0 = time.perf_counter()
70
+ baseline_ids = engine.generate(
71
+ input_ids, max_new_tokens=MAX_NEW, temperature=0.0,
72
+ top_k=0, top_p=1.0, stop_tokens=stop_tokens,
73
+ )
74
+ torch.cuda.synchronize()
75
+ t_baseline = time.perf_counter() - t0
76
+ baseline_tokens = baseline_ids.shape[1] - prompt_len
77
+ baseline_tps = baseline_tokens / t_baseline
78
+ baseline_text = tokenizer.decode(baseline_ids[0, prompt_len:], skip_special_tokens=True)
79
+ print(f" Generated {baseline_tokens} tokens in {t_baseline:.2f}s = {baseline_tps:.1f} tok/s")
80
+ print(f" Output: {baseline_text[:200]}...")
81
+
82
+ # --- Enable EAGLE-3 ---
83
+ print(f"\n[4/6] Enabling EAGLE-3 draft head...")
84
+ engine.enable_eagle(
85
+ capture_layers=(8, 24, 47),
86
+ num_heads=16,
87
+ ffn_mult=2,
88
+ draft_depth=DRAFT_DEPTH,
89
+ )
90
+ vram_after = torch.cuda.memory_allocated() / 1e9
91
+ print(f" VRAM after eagle: {vram_after:.2f} GB")
92
+
93
+ # --- Speculative generation smoke test ---
94
+ print(f"\n[5/6] Speculative generate (greedy, {MAX_NEW} tokens, depth={DRAFT_DEPTH})...")
95
+ tokens_collected = []
96
+ def token_callback(tok_id, pos):
97
+ tokens_collected.append(tok_id)
98
+
99
+ torch.cuda.synchronize()
100
+ t0 = time.perf_counter()
101
+ spec_ids = engine.speculative_generate(
102
+ input_ids, max_new_tokens=MAX_NEW, temperature=0.0,
103
+ draft_depth=DRAFT_DEPTH, stop_tokens=stop_tokens,
104
+ callback=token_callback,
105
+ )
106
+ torch.cuda.synchronize()
107
+ t_spec = time.perf_counter() - t0
108
+ spec_tokens = spec_ids.shape[1] - prompt_len
109
+ spec_tps = spec_tokens / t_spec
110
+ spec_text = tokenizer.decode(spec_ids[0, prompt_len:], skip_special_tokens=True)
111
+ print(f" Generated {spec_tokens} tokens in {t_spec:.2f}s = {spec_tps:.1f} tok/s")
112
+ print(f" Output: {spec_text[:200]}...")
113
+
114
+ # --- Correctness check ---
115
+ print(f"\n[6/6] Correctness check...")
116
+ min_len = min(baseline_tokens, spec_tokens)
117
+ baseline_tok_list = baseline_ids[0, prompt_len:prompt_len + min_len].tolist()
118
+ spec_tok_list = spec_ids[0, prompt_len:prompt_len + min_len].tolist()
119
+
120
+ match = True
121
+ first_diff = -1
122
+ for i in range(min_len):
123
+ if baseline_tok_list[i] != spec_tok_list[i]:
124
+ match = False
125
+ first_diff = i
126
+ break
127
+
128
+ if match and baseline_tokens == spec_tokens:
129
+ print(f" PASS: token-for-token match ({min_len} tokens)")
130
+ elif match:
131
+ print(f" PARTIAL MATCH: first {min_len} tokens match, "
132
+ f"but lengths differ ({baseline_tokens} vs {spec_tokens})")
133
+ else:
134
+ print(f" MISMATCH at token {first_diff}:")
135
+ print(f" Baseline: {baseline_tok_list[max(0,first_diff-2):first_diff+3]}")
136
+ print(f" Speculative: {spec_tok_list[max(0,first_diff-2):first_diff+3]}")
137
+ # Note: with untrained random head, mismatches happen because
138
+ # of floating-point ordering in the verification forward pass
139
+ # when sequences diverge. This is expected and not a bug —
140
+ # the correction mechanism is what matters.
141
+ print(f" NOTE: With untrained head, divergence is expected due to")
142
+ print(f" verification forward seeing different token contexts.")
143
+ print(f" Correctness holds when draft matches (acceptance path).")
144
+
145
+ # --- Summary ---
146
+ print("\n" + "=" * 60)
147
+ print("SUMMARY")
148
+ print("=" * 60)
149
+ print(f" Baseline: {baseline_tps:.1f} tok/s ({baseline_tokens} tokens)")
150
+ print(f" Speculative: {spec_tps:.1f} tok/s ({spec_tokens} tokens)")
151
+ speedup = spec_tps / max(baseline_tps, 0.1)
152
+ if speedup > 1:
153
+ print(f" Speedup: {speedup:.2f}x FASTER")
154
+ else:
155
+ print(f" Slowdown: {1/speedup:.2f}x slower (expected with untrained head)")
156
+ print(f" VRAM: {vram_after:.2f} GB")
157
+ print(f"\n NOTE: Draft head is randomly initialized (untrained).")
158
+ print(f" Expected acceptance rate: ~0.7% (1/vocab_size for greedy).")
159
+ print(f" Training the draft head should raise acceptance to 70-80%.")
160
+ print("=" * 60)
161
+
162
+
163
+ if __name__ == "__main__":
164
+ main()
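To make the acceptance statistics that eagle_test.py reports concrete, here is a minimal sketch of the greedy acceptance rule being measured: drafted tokens are accepted up to the first position where they disagree with the verifier's argmax. The count_accepted() helper is purely illustrative and is not how speculative_generate() is implemented internally.

```python
# Illustrative only: greedy speculative-decoding acceptance = length of the
# matching prefix between the draft head's tokens and the verifier's argmax.
from typing import List

def count_accepted(draft_tokens: List[int], verify_argmax: List[int]) -> int:
    """Return how many drafted tokens match the verifier before the first mismatch."""
    accepted = 0
    for d, v in zip(draft_tokens, verify_argmax):
        if d != v:
            break
        accepted += 1
    return accepted

# Example round: depth-5 draft, verifier agrees on the first two tokens.
print(count_accepted([11, 42, 7, 99, 3], [11, 42, 8, 99, 3]))  # -> 2
```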
FireEcho Engine/eagle_train_d8.log ADDED
@@ -0,0 +1,212 @@
1
+ nohup: ignoring input
2
+ ============================================================
3
+ EAGLE-3 Draft Head Training — OFFLINE mode
4
+ ============================================================
5
+ Epochs: 5
6
+ Max samples: 10000
7
+ Max seq len: 512
8
+ LR: 0.0003, warmup: 300
9
+ Draft depth (K): 5
10
+ Grad accum: 2, clip: 0.5
11
+ Capture layers: (8, 24, 47)
12
+ Head layers: 8
13
+ Loss type: fwd_kl
14
+ Focal gamma: 2.0
15
+ Top-K logits: 64
16
+ Flatness filter: 100%
17
+ Precompute dir: /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_precomputed
18
+
19
+ [1/4] Loading model...
20
+ [Auto-detect] Qwen3-Omni MoE thinker (30.5B total, ~3.3B active)
21
+ [FireEcho] Loading /run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct...
22
+ [FireEcho] AutoConfig failed ('Qwen3OmniMoeTalkerCodePredictorConfig' object has no attribute 'use_sliding_window'), loading config.json directly
23
+ Qwen3-Omni: will stream-load from 15 shards
24
+ [Qwen3 Streaming] Loaded shard index: 28010 keys across 15 shards
25
+ [Qwen3 Streaming] Building engine skeleton...
26
+ [Qwen3 Streaming] Global params on GPU: 1.2 GB
27
+ Layer 4/48: 393 weights, VRAM 2.8 GB, CPU 1.4 GB
28
+ Layer 8/48: 393 weights, VRAM 4.3 GB, CPU 1.6 GB
29
+ Layer 12/48: 393 weights, VRAM 5.8 GB, CPU 1.7 GB
30
+ Layer 16/48: 393 weights, VRAM 7.4 GB, CPU 1.9 GB
31
+ Layer 20/48: 393 weights, VRAM 8.9 GB, CPU 2.0 GB
32
+ Layer 24/48: 393 weights, VRAM 10.4 GB, CPU 2.2 GB
33
+ Layer 28/48: 393 weights, VRAM 11.9 GB, CPU 2.3 GB
34
+ Layer 32/48: 393 weights, VRAM 13.5 GB, CPU 2.5 GB
35
+ Layer 36/48: 393 weights, VRAM 15.0 GB, CPU 2.6 GB
36
+ Layer 40/48: 393 weights, VRAM 16.5 GB, CPU 2.8 GB
37
+ Layer 44/48: 393 weights, VRAM 18.0 GB, CPU 2.9 GB
38
+ Layer 48/48: 393 weights, VRAM 19.6 GB, CPU 3.1 GB
39
+ [Qwen3 Streaming] Final VRAM: 19.6 GB (FP4 quantized)
40
+ [Qwen3 Streaming] Done: 1571.8M params, 18867 weights loaded
41
+ Total params: 1.57B
42
+ Frozen params: 1.54B (base model, FP4)
43
+ Trainable params: 30.2M (Hebbian only)
44
+ [Flat KV] Enabled: 4096 tokens, 403 MB
45
+ [Packed MoE] 48 layers packed (6144 experts → contiguous)
46
+
47
+ [2/4] Enabling EAGLE-3 draft head...
48
+ [FE-XT] Draft head: D=8, 356.5M params, 713 MB, capture layers [8, 24, 47] + Hebbian memory
49
+ Trainable eagle params: 356.5M
50
+ [EAGLE] Loaded legacy D=2 checkpoint. 54 new layer params initialized randomly.
51
+ [Checkpoint] Optimizer state mismatch (head resized?), skipping.
52
+ [Checkpoint] Resumed from step 4000 (loss=5.0967)
53
+
54
+ [3/5] Loading external dataset...
55
+ Loading cached dataset from /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_data_codemix_cache.pt...
56
+ Loaded 10000 samples.
57
+
58
+ [OFFLINE] Loading precomputed features from /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_precomputed...
59
+ 2777 samples available
60
+
61
+ [OFFLINE] Starting training...
62
+ VRAM before training: 20.66 GB
63
+ [EAGLE-3] 27 rounds, 131 drafted, 5 accepted (4%), avg 0.2/round
64
+ [EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
65
+ [EAGLE-3] 29 rounds, 139 drafted, 2 accepted (1%), avg 0.1/round
66
+ [Eval @ step 4000] 180 tokens in 17.2s = 10.5 tok/s
67
+ Step 4100 | epoch 1/5 | loss=2.8709 | avg=4.6042 | acc=31.2% | lr=5.00e-05 | pos=64
68
+ Step 4200 | epoch 1/5 | loss=3.2780 | avg=4.6526 | acc=35.3% | lr=1.00e-04 | pos=64
69
+ Step 4300 | epoch 1/5 | loss=5.3967 | avg=4.6339 | acc=17.5% | lr=1.50e-04 | pos=64
70
+ Step 4400 | epoch 1/5 | loss=5.6657 | avg=4.7462 | acc=12.8% | lr=2.00e-04 | pos=64
71
+ Step 4500 | epoch 1/5 | loss=5.9773 | avg=4.8205 | acc=9.4% | lr=2.50e-04 | pos=64
72
+ Step 4600 | epoch 1/5 | loss=5.4029 | avg=4.8950 | acc=16.9% | lr=3.00e-04 | pos=64
73
+ Step 4700 | epoch 1/5 | loss=5.2982 | avg=4.9767 | acc=9.4% | lr=3.00e-04 | pos=64
74
+ Step 4800 | epoch 1/5 | loss=5.0728 | avg=5.0216 | acc=12.2% | lr=3.00e-04 | pos=64
75
+ Step 4900 | epoch 1/5 | loss=6.8400 | avg=5.0394 | acc=13.1% | lr=3.00e-04 | pos=64
76
+ Step 5000 | epoch 1/5 | loss=5.1369 | avg=5.0459 | acc=16.2% | lr=2.99e-04 | pos=64
77
+ [EAGLE-3] 30 rounds, 144 drafted, 1 accepted (1%), avg 0.0/round
78
+ [EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
79
+ [EAGLE-3] 30 rounds, 144 drafted, 0 accepted (0%), avg 0.0/round
80
+ [Eval @ step 5000] 181 tokens in 10.9s = 16.6 tok/s
81
+ [Checkpoint] Saved step 5000 (loss=5.1369) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
82
+ [Best] New best tok/s: 16.6 (step 5000)
83
+ Step 5100 | epoch 1/5 | loss=5.3802 | avg=5.0351 | acc=16.2% | lr=2.99e-04 | pos=64
84
+ Step 5200 | epoch 1/5 | loss=4.6753 | avg=4.9773 | acc=20.3% | lr=2.99e-04 | pos=64
85
+ Step 5300 | epoch 1/5 | loss=4.3068 | avg=4.9713 | acc=24.4% | lr=2.98e-04 | pos=64
86
+ Step 5400 | epoch 1/5 | loss=3.0352 | avg=4.9536 | acc=30.0% | lr=2.98e-04 | pos=64
87
+ Step 5500 | epoch 1/5 | loss=4.8197 | avg=4.9954 | acc=21.9% | lr=2.97e-04 | pos=64
88
+ Step 5600 | epoch 1/5 | loss=3.4431 | avg=5.0006 | acc=26.2% | lr=2.96e-04 | pos=64
89
+ Step 5700 | epoch 1/5 | loss=3.6114 | avg=5.0065 | acc=22.8% | lr=2.95e-04 | pos=64
90
+ Step 5800 | epoch 1/5 | loss=5.0362 | avg=4.9796 | acc=17.8% | lr=2.95e-04 | pos=64
91
+ Step 5900 | epoch 1/5 | loss=5.8618 | avg=4.9976 | acc=8.4% | lr=2.94e-04 | pos=64
92
+ Step 6000 | epoch 1/5 | loss=6.3429 | avg=4.9858 | acc=11.2% | lr=2.93e-04 | pos=64
93
+ [EAGLE-3] 30 rounds, 144 drafted, 0 accepted (0%), avg 0.0/round
94
+ [EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
95
+ [EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
96
+ [Eval @ step 6000] 180 tokens in 10.5s = 17.1 tok/s
97
+ [Checkpoint] Saved step 6000 (loss=6.3429) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
98
+ [Best] New best tok/s: 17.1 (step 6000)
99
+ [Checkpoint] Saved step 6000 (loss=6.3429) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step6000.pt
100
+ Step 6100 | epoch 1/5 | loss=6.3301 | avg=4.9179 | acc=11.6% | lr=2.92e-04 | pos=64
101
+ Step 6200 | epoch 1/5 | loss=4.4811 | avg=4.8956 | acc=19.4% | lr=2.90e-04 | pos=64
102
+ Step 6300 | epoch 1/5 | loss=5.5715 | avg=4.9178 | acc=16.9% | lr=2.89e-04 | pos=64
103
+ Step 6400 | epoch 1/5 | loss=3.3082 | avg=4.8940 | acc=28.7% | lr=2.88e-04 | pos=64
104
+ Step 6500 | epoch 1/5 | loss=4.5000 | avg=4.9460 | acc=20.0% | lr=2.87e-04 | pos=64
105
+ Step 6600 | epoch 1/5 | loss=4.0213 | avg=4.9359 | acc=18.8% | lr=2.85e-04 | pos=64
106
+ Step 6700 | epoch 1/5 | loss=4.2572 | avg=4.9256 | acc=31.2% | lr=2.84e-04 | pos=64
107
+ --- Epoch 1/5 complete (step 6777) ---
108
+ Step 6800 | epoch 2/5 | loss=3.7218 | avg=4.8991 | acc=24.1% | lr=2.82e-04 | pos=64
109
+ Step 6900 | epoch 2/5 | loss=4.7880 | avg=4.8843 | acc=19.7% | lr=2.81e-04 | pos=64
110
+ Step 7000 | epoch 2/5 | loss=5.4015 | avg=4.8636 | acc=9.7% | lr=2.79e-04 | pos=64
111
+ [EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
112
+ [EAGLE-3] 30 rounds, 144 drafted, 0 accepted (0%), avg 0.0/round
113
+ [FE-MX] Expert tiers: 26 cold(FP4) / 61 warm(FP6) / 41 hot(FP8)
114
+ [FE-MX] Expert tiers: 24 cold(FP4) / 66 warm(FP6) / 38 hot(FP8)
115
+ [FE-MX] Expert tiers: 45 cold(FP4) / 43 warm(FP6) / 40 hot(FP8)
116
+ [FE-MX] Expert tiers: 40 cold(FP4) / 53 warm(FP6) / 35 hot(FP8)
117
+ [FE-MX] Expert tiers: 48 cold(FP4) / 46 warm(FP6) / 34 hot(FP8)
118
+ [FE-MX] Expert tiers: 47 cold(FP4) / 46 warm(FP6) / 35 hot(FP8)
119
+ [FE-MX] Expert tiers: 66 cold(FP4) / 32 warm(FP6) / 30 hot(FP8)
120
+ [FE-MX] Expert tiers: 67 cold(FP4) / 29 warm(FP6) / 32 hot(FP8)
121
+ [FE-MX] Expert tiers: 55 cold(FP4) / 42 warm(FP6) / 31 hot(FP8)
122
+ [FE-MX] Expert tiers: 50 cold(FP4) / 48 warm(FP6) / 30 hot(FP8)
123
+ [FE-MX] Expert tiers: 46 cold(FP4) / 47 warm(FP6) / 35 hot(FP8)
124
+ [FE-MX] Expert tiers: 40 cold(FP4) / 52 warm(FP6) / 36 hot(FP8)
125
+ [FE-MX] Expert tiers: 49 cold(FP4) / 48 warm(FP6) / 31 hot(FP8)
126
+ [FE-MX] Expert tiers: 49 cold(FP4) / 43 warm(FP6) / 36 hot(FP8)
127
+ [FE-MX] Expert tiers: 46 cold(FP4) / 42 warm(FP6) / 40 hot(FP8)
128
+ [FE-MX] Expert tiers: 51 cold(FP4) / 46 warm(FP6) / 31 hot(FP8)
129
+ [FE-MX] Expert tiers: 54 cold(FP4) / 39 warm(FP6) / 35 hot(FP8)
130
+ [FE-MX] Expert tiers: 51 cold(FP4) / 45 warm(FP6) / 32 hot(FP8)
131
+ [FE-MX] Expert tiers: 69 cold(FP4) / 30 warm(FP6) / 29 hot(FP8)
132
+ [FE-MX] Expert tiers: 77 cold(FP4) / 25 warm(FP6) / 26 hot(FP8)
133
+ [FE-MX] Expert tiers: 53 cold(FP4) / 45 warm(FP6) / 30 hot(FP8)
134
+ [FE-MX] Expert tiers: 52 cold(FP4) / 45 warm(FP6) / 31 hot(FP8)
135
+ [FE-MX] Expert tiers: 52 cold(FP4) / 41 warm(FP6) / 35 hot(FP8)
136
+ [FE-MX] Expert tiers: 47 cold(FP4) / 50 warm(FP6) / 31 hot(FP8)
137
+ [FE-MX] Expert tiers: 52 cold(FP4) / 47 warm(FP6) / 29 hot(FP8)
138
+ [FE-MX] Expert tiers: 49 cold(FP4) / 49 warm(FP6) / 30 hot(FP8)
139
+ [FE-MX] Expert tiers: 52 cold(FP4) / 40 warm(FP6) / 36 hot(FP8)
140
+ [FE-MX] Expert tiers: 54 cold(FP4) / 45 warm(FP6) / 29 hot(FP8)
141
+ [FE-MX] Expert tiers: 52 cold(FP4) / 42 warm(FP6) / 34 hot(FP8)
142
+ [FE-MX] Expert tiers: 55 cold(FP4) / 41 warm(FP6) / 32 hot(FP8)
143
+ [FE-MX] Expert tiers: 71 cold(FP4) / 30 warm(FP6) / 27 hot(FP8)
144
+ [FE-MX] Expert tiers: 77 cold(FP4) / 23 warm(FP6) / 28 hot(FP8)
145
+ [FE-MX] Expert tiers: 55 cold(FP4) / 41 warm(FP6) / 32 hot(FP8)
146
+ [FE-MX] Expert tiers: 49 cold(FP4) / 48 warm(FP6) / 31 hot(FP8)
147
+ [FE-MX] Expert tiers: 45 cold(FP4) / 48 warm(FP6) / 35 hot(FP8)
148
+ [FE-MX] Expert tiers: 40 cold(FP4) / 52 warm(FP6) / 36 hot(FP8)
149
+ [FE-MX] Expert tiers: 53 cold(FP4) / 44 warm(FP6) / 31 hot(FP8)
150
+ [FE-MX] Expert tiers: 44 cold(FP4) / 52 warm(FP6) / 32 hot(FP8)
151
+ [FE-MX] Expert tiers: 51 cold(FP4) / 39 warm(FP6) / 38 hot(FP8)
152
+ [FE-MX] Expert tiers: 51 cold(FP4) / 41 warm(FP6) / 36 hot(FP8)
153
+ [FE-MX] Expert tiers: 57 cold(FP4) / 29 warm(FP6) / 42 hot(FP8)
154
+ [FE-MX] Expert tiers: 55 cold(FP4) / 38 warm(FP6) / 35 hot(FP8)
155
+ [FE-MX] Expert tiers: 55 cold(FP4) / 33 warm(FP6) / 40 hot(FP8)
156
+ [FE-MX] Expert tiers: 53 cold(FP4) / 38 warm(FP6) / 37 hot(FP8)
157
+ [FE-MX] Expert tiers: 61 cold(FP4) / 31 warm(FP6) / 36 hot(FP8)
158
+ [FE-MX] Expert tiers: 58 cold(FP4) / 34 warm(FP6) / 36 hot(FP8)
159
+ [FE-MX] Expert tiers: 46 cold(FP4) / 48 warm(FP6) / 34 hot(FP8)
160
+ [FE-MX] Expert tiers: 41 cold(FP4) / 51 warm(FP6) / 36 hot(FP8)
161
+ [EAGLE-3] 30 rounds, 144 drafted, 0 accepted (0%), avg 0.0/round
162
+ [Eval @ step 7000] 180 tokens in 10.7s = 16.9 tok/s
163
+ Step 7100 | epoch 2/5 | loss=3.9199 | avg=4.8484 | acc=32.5% | lr=2.77e-04 | pos=64
164
+ Step 7200 | epoch 2/5 | loss=4.4965 | avg=4.6926 | acc=23.1% | lr=2.75e-04 | pos=64
165
+ Step 7300 | epoch 2/5 | loss=4.1791 | avg=4.6618 | acc=20.9% | lr=2.73e-04 | pos=64
166
+ Step 7400 | epoch 2/5 | loss=3.6816 | avg=4.6057 | acc=22.2% | lr=2.71e-04 | pos=64
167
+ Step 7500 | epoch 2/5 | loss=5.8260 | avg=4.5923 | acc=5.9% | lr=2.69e-04 | pos=64
168
+ Step 7600 | epoch 2/5 | loss=4.9514 | avg=4.5939 | acc=18.4% | lr=2.67e-04 | pos=64
169
+ Step 7700 | epoch 2/5 | loss=3.7191 | avg=4.6118 | acc=22.8% | lr=2.65e-04 | pos=64
170
+ Step 7800 | epoch 2/5 | loss=4.6762 | avg=4.5979 | acc=19.1% | lr=2.63e-04 | pos=64
171
+ Step 7900 | epoch 2/5 | loss=5.7284 | avg=4.5778 | acc=15.6% | lr=2.61e-04 | pos=64
172
+ Step 8000 | epoch 2/5 | loss=5.9431 | avg=4.5689 | acc=4.7% | lr=2.59e-04 | pos=64
173
+ [EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
174
+ [EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
175
+ [EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
176
+ [Eval @ step 8000] 180 tokens in 10.7s = 16.8 tok/s
177
+ [Checkpoint] Saved step 8000 (loss=5.9431) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step8000.pt
178
+ Step 8100 | epoch 2/5 | loss=3.5748 | avg=4.4854 | acc=27.5% | lr=2.56e-04 | pos=64
179
+ Step 8200 | epoch 2/5 | loss=3.9363 | avg=4.5077 | acc=32.5% | lr=2.54e-04 | pos=64
180
+ Step 8300 | epoch 2/5 | loss=2.7494 | avg=4.4987 | acc=37.8% | lr=2.52e-04 | pos=64
181
+ Step 8400 | epoch 2/5 | loss=4.1517 | avg=4.5172 | acc=25.0% | lr=2.49e-04 | pos=64
182
+ Step 8500 | epoch 2/5 | loss=5.5557 | avg=4.4605 | acc=10.9% | lr=2.47e-04 | pos=64
183
+ Step 8600 | epoch 2/5 | loss=2.5267 | avg=4.4706 | acc=31.6% | lr=2.44e-04 | pos=64
184
+ Step 8700 | epoch 2/5 | loss=5.7917 | avg=4.4517 | acc=12.5% | lr=2.41e-04 | pos=64
185
+ Step 8800 | epoch 2/5 | loss=5.8896 | avg=4.4381 | acc=12.5% | lr=2.39e-04 | pos=64
186
+ Step 8900 | epoch 2/5 | loss=4.0428 | avg=4.4427 | acc=24.4% | lr=2.36e-04 | pos=64
187
+ Step 9000 | epoch 2/5 | loss=5.2436 | avg=4.4426 | acc=9.7% | lr=2.33e-04 | pos=64
188
+ [EAGLE-3] 30 rounds, 144 drafted, 0 accepted (0%), avg 0.0/round
189
+ [EAGLE-3] 30 rounds, 144 drafted, 0 accepted (0%), avg 0.0/round
190
+ [EAGLE-3] 30 rounds, 144 drafted, 0 accepted (0%), avg 0.0/round
191
+ [Eval @ step 9000] 180 tokens in 10.9s = 16.6 tok/s
192
+ Step 9100 | epoch 2/5 | loss=5.9143 | avg=4.2725 | acc=7.2% | lr=2.30e-04 | pos=64
193
+ Step 9200 | epoch 2/5 | loss=5.3081 | avg=4.2707 | acc=12.8% | lr=2.28e-04 | pos=64
194
+ Step 9300 | epoch 2/5 | loss=5.3774 | avg=4.3151 | acc=14.7% | lr=2.25e-04 | pos=64
195
+ Step 9400 | epoch 2/5 | loss=5.7517 | avg=4.3221 | acc=17.8% | lr=2.22e-04 | pos=64
196
+ Step 9500 | epoch 2/5 | loss=2.6826 | avg=4.3317 | acc=34.1% | lr=2.19e-04 | pos=64
197
+ --- Epoch 2/5 complete (step 9554) ---
198
+ Step 9600 | epoch 3/5 | loss=4.7292 | avg=4.2845 | acc=20.9% | lr=2.16e-04 | pos=64
199
+ Step 9700 | epoch 3/5 | loss=4.1688 | avg=4.2683 | acc=24.1% | lr=2.13e-04 | pos=64
200
+ Step 9800 | epoch 3/5 | loss=4.5375 | avg=4.2397 | acc=21.9% | lr=2.10e-04 | pos=64
201
+ Step 9900 | epoch 3/5 | loss=5.2854 | avg=4.2331 | acc=14.1% | lr=2.07e-04 | pos=64
202
+ Step 10000 | epoch 3/5 | loss=4.0904 | avg=4.2228 | acc=25.3% | lr=2.04e-04 | pos=64
203
+ [EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
204
+ [EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
205
+ [EAGLE-3] 29 rounds, 141 drafted, 1 accepted (1%), avg 0.0/round
206
+ [Eval @ step 10000] 180 tokens in 10.7s = 16.9 tok/s
207
+ [Checkpoint] Saved step 10000 (loss=4.0904) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step10000.pt
208
+ Step 10100 | epoch 3/5 | loss=3.7871 | avg=3.9878 | acc=30.9% | lr=2.01e-04 | pos=64
209
+ Step 10200 | epoch 3/5 | loss=2.2971 | avg=4.0641 | acc=37.8% | lr=1.98e-04 | pos=64
210
+ Step 10300 | epoch 3/5 | loss=5.0256 | avg=4.0141 | acc=10.6% | lr=1.95e-04 | pos=64
211
+ Step 10400 | epoch 3/5 | loss=5.8723 | avg=4.0130 | acc=10.3% | lr=1.92e-04 | pos=64
212
+ Step 10500 | epoch 3/5 | loss=2.2164 | avg=3.9910 | acc=37.5% | lr=1.89e-04 | pos=64
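The run above was configured with "Loss type: fwd_kl" and "Top-K logits: 64". As a hedged sketch of what a forward-KL distillation loss restricted to the teacher's top-K logits could look like, assuming that is how those two options combine (the fwd_kl_topk() function, the renormalization over the K slots, and the shapes are assumptions, not train_eagle_head.py's actual implementation):

```python
# Illustrative sketch: forward KL(teacher || student) computed only over the
# teacher's top-k vocabulary slots, renormalized within those slots.
import torch
import torch.nn.functional as F

def fwd_kl_topk(student_logits: torch.Tensor, teacher_logits: torch.Tensor, k: int = 64) -> torch.Tensor:
    topk_vals, topk_idx = teacher_logits.topk(k, dim=-1)      # teacher's top-k slots
    student_topk = student_logits.gather(-1, topk_idx)        # student logits at those slots
    teacher_p = F.softmax(topk_vals, dim=-1)
    student_logp = F.log_softmax(student_topk, dim=-1)
    # forward KL: sum_i p_teacher * (log p_teacher - log p_student)
    return (teacher_p * (teacher_p.clamp_min(1e-9).log() - student_logp)).sum(-1).mean()

# Example with [batch, vocab]-shaped logits (vocab size is illustrative).
loss = fwd_kl_topk(torch.randn(4, 151_936), torch.randn(4, 151_936), k=64)
print(loss.item())
```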
FireEcho Engine/eagle_train_goddess.log ADDED
@@ -0,0 +1,973 @@
1
+ nohup: ignoring input
2
+ ============================================================
3
+ EAGLE-3 Draft Head Training — OFFLINE mode
4
+ ============================================================
5
+ Epochs: 2
6
+ Max samples: 10000
7
+ Max seq len: 512
8
+ LR: 0.0001, warmup: 2000
9
+ Draft depth (K): 5
10
+ Grad accum: 4, clip: 0.5
11
+ Capture layers: (8, 24, 47)
12
+ Head layers: 50
13
+ Loss type: ce
14
+ Top-K logits: 64
15
+ Flatness filter: 100%
16
+ Precompute dir: /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_precomputed
17
+ FireEcho stack: batch_positions (B=P), torch.compile, GoliathQuantumLinear, MPS(bond=256), entanglement_prune(every=5000)
18
+
19
+ [1/4] Loading model...
20
+ [Auto-detect] Qwen3-Omni MoE thinker (30.5B total, ~3.3B active)
21
+ [FireEcho] Loading /run/media/echo/Echo/ECHO/training/Prototype Fireecho/model/Qwen3-Omni-30B-A3B-Instruct...
22
+ [FireEcho] AutoConfig failed ('Qwen3OmniMoeTalkerCodePredictorConfig' object has no attribute 'use_sliding_window'), loading config.json directly
23
+ Qwen3-Omni: will stream-load from 15 shards
24
+ [Qwen3 Streaming] Loaded shard index: 28010 keys across 15 shards
25
+ [Qwen3 Streaming] Building engine skeleton...
26
+ [Qwen3 Streaming] Global params on GPU: 1.2 GB
27
+ Layer 4/48: 393 weights, VRAM 2.8 GB, CPU 1.4 GB
28
+ Layer 8/48: 393 weights, VRAM 4.3 GB, CPU 1.6 GB
29
+ Layer 12/48: 393 weights, VRAM 5.8 GB, CPU 1.7 GB
30
+ Layer 16/48: 393 weights, VRAM 7.4 GB, CPU 1.9 GB
31
+ Layer 20/48: 393 weights, VRAM 8.9 GB, CPU 2.0 GB
32
+ Layer 24/48: 393 weights, VRAM 10.4 GB, CPU 2.2 GB
33
+ Layer 28/48: 393 weights, VRAM 11.9 GB, CPU 2.3 GB
34
+ Layer 32/48: 393 weights, VRAM 13.5 GB, CPU 2.5 GB
35
+ Layer 36/48: 393 weights, VRAM 15.0 GB, CPU 2.6 GB
36
+ Layer 40/48: 393 weights, VRAM 16.5 GB, CPU 2.8 GB
37
+ Layer 44/48: 393 weights, VRAM 18.0 GB, CPU 2.9 GB
38
+ Layer 48/48: 393 weights, VRAM 19.6 GB, CPU 3.1 GB
39
+ [Qwen3 Streaming] Final VRAM: 19.6 GB (FP4 quantized)
40
+ [Qwen3 Streaming] Done: 1571.8M params, 18867 weights loaded
41
+ Total params: 1.57B
42
+ Frozen params: 1.54B (base model, FP4)
43
+ Trainable params: 30.2M (Hebbian only)
44
+ [Flat KV] Enabled: 4096 tokens, 403 MB
45
+ [Packed MoE] 48 layers packed (6144 experts → contiguous)
46
+
47
+ [2/4] Enabling EAGLE-3 draft head...
48
+ [FE-XT] Draft head: D=50, 2118.3M params, 4237 MB, capture layers [8, 24, 47] + Hebbian memory
49
+ [FireEcho] WARNING: --use_mps and --use_quantum_linear are mutually exclusive
50
+ [FireEcho] Using MPS (bigger memory win enables batching)
51
+ [FireEcho] MPS compression (bond_dim=256)...
52
+ [MPS] Replaced 150 FFN layers with bond_dim=256
53
+ [MPS] Params: 2429.8M → 1407.4M (1.7x compression)
54
+ [FireEcho] torch.compile(eagle, mode='default', fullgraph=False)...
55
+ [FireEcho] Compilation enabled (first steps will be slow for tracing)
56
+ Trainable eagle params: 1096.0M
57
+ [Checkpoint] Resumed from step 5000 (loss=6.9199)
58
+
59
+ [3/5] Loading external dataset...
60
+ Loading cached dataset from /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_data_codemix_cache.pt...
61
+ Loaded 41122 samples.
62
+
63
+ [OFFLINE] Loading precomputed features from /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_precomputed...
64
+ 41122 samples available
65
+
66
+ [OFFLINE] Starting training...
67
+ VRAM before training: 26.57 GB
68
+ [VRAM] Deleting base model layers (--no_eval)...
69
+ [VRAM] Freed 18.6 GB (26.6 → 7.9 GB)
70
+ Step 5100 | epoch 1/2 | loss=6.0456 | avg=6.2106 | acc=14.4% | lr=1.25e-06 | pos=64
71
+ W0211 07:20:41.750000 16279 .venv_infer312/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1705] [0/8] torch._dynamo hit config.recompile_limit (8)
72
+ W0211 07:20:41.750000 16279 .venv_infer312/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1705] [0/8] function: 'forward' (/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/fireecho_kernel.py:8993)
73
+ W0211 07:20:41.750000 16279 .venv_infer312/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1705] [0/8] last reason: 0/7: self._draft_pos == 1 # draft_k[:, :, pos:pos + 1, :] = k.detach() # training/Prototype Fireecho/tool/kernel/FireEcho Engine/fireecho_kernel.py:8982 in _draft_attn (HINT: torch.compile considers integer attributes of the nn.Module to be static. If you are observing recompilation, you might want to make this integer dynamic using torch._dynamo.config.allow_unspec_int_on_nn_module = True, or convert this integer into a tensor.)
74
+ W0211 07:20:41.750000 16279 .venv_infer312/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1705] [0/8] User stack trace:
75
+ W0211 07:20:41.750000 16279 .venv_infer312/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1705] [0/8] File "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/fireecho_kernel.py", line 9039, in forward
76
+ W0211 07:20:41.750000 16279 .venv_infer312/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1705] [0/8] x = self._draft_attn(x, pos, layer, draft_k, draft_v)
77
+ W0211 07:20:41.750000 16279 .venv_infer312/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1705] [0/8] File "/run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/fireecho_kernel.py", line 8982, in _draft_attn
78
+ W0211 07:20:41.750000 16279 .venv_infer312/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1705] [0/8] draft_k[:, :, pos:pos + 1, :] = k.detach()
79
+ W0211 07:20:41.750000 16279 .venv_infer312/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1705] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
80
+ W0211 07:20:41.750000 16279 .venv_infer312/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py:1705] [0/8] To diagnose recompilation issues, see https://docs.pytorch.org/docs/main/user_guide/torch_compiler/compile/programming_model.recompilation.html
81
+ Step 5200 | epoch 1/2 | loss=6.4288 | avg=6.1992 | acc=8.8% | lr=2.50e-06 | pos=64
82
+ Step 5300 | epoch 1/2 | loss=6.5290 | avg=6.1820 | acc=8.8% | lr=3.75e-06 | pos=64
83
+ Step 5400 | epoch 1/2 | loss=7.1685 | avg=6.1450 | acc=7.8% | lr=5.00e-06 | pos=64
84
+ Step 5500 | epoch 1/2 | loss=6.1653 | avg=6.1139 | acc=5.6% | lr=6.25e-06 | pos=64
85
+ Step 5600 | epoch 1/2 | loss=6.4737 | avg=6.0755 | acc=10.9% | lr=7.50e-06 | pos=64
86
+ Step 5700 | epoch 1/2 | loss=4.9286 | avg=6.0427 | acc=13.1% | lr=8.75e-06 | pos=64
87
+ Step 5800 | epoch 1/2 | loss=4.8731 | avg=6.0262 | acc=19.1% | lr=1.00e-05 | pos=64
88
+ Step 5900 | epoch 1/2 | loss=6.6587 | avg=6.0177 | acc=8.4% | lr=1.13e-05 | pos=64
89
+ Step 6000 | epoch 1/2 | loss=5.7042 | avg=5.9727 | acc=16.6% | lr=1.25e-05 | pos=64
90
+ Step 6100 | epoch 1/2 | loss=4.5372 | avg=5.5973 | acc=18.8% | lr=1.38e-05 | pos=64
91
+ Step 6200 | epoch 1/2 | loss=6.5012 | avg=5.6312 | acc=7.8% | lr=1.50e-05 | pos=64
92
+ Step 6300 | epoch 1/2 | loss=6.2758 | avg=5.6285 | acc=10.0% | lr=1.63e-05 | pos=64
93
+ Step 6400 | epoch 1/2 | loss=4.1524 | avg=5.6293 | acc=19.7% | lr=1.75e-05 | pos=64
94
+ Step 6500 | epoch 1/2 | loss=6.8674 | avg=5.5965 | acc=17.6% | lr=1.88e-05 | pos=41
95
+ Step 6600 | epoch 1/2 | loss=5.3658 | avg=5.6164 | acc=15.3% | lr=2.00e-05 | pos=64
96
+ Step 6700 | epoch 1/2 | loss=4.1285 | avg=5.6091 | acc=25.3% | lr=2.13e-05 | pos=64
97
+ Step 6800 | epoch 1/2 | loss=7.3849 | avg=5.5995 | acc=6.6% | lr=2.25e-05 | pos=64
98
+ Step 6900 | epoch 1/2 | loss=6.1772 | avg=5.5865 | acc=11.9% | lr=2.38e-05 | pos=64
99
+ Step 7000 | epoch 1/2 | loss=6.1639 | avg=5.5709 | acc=16.2% | lr=2.50e-05 | pos=64
100
+ Step 7100 | epoch 1/2 | loss=5.3027 | avg=5.4978 | acc=11.2% | lr=2.63e-05 | pos=64
101
+ Step 7200 | epoch 1/2 | loss=5.3408 | avg=5.4792 | acc=10.0% | lr=2.75e-05 | pos=64
102
+ Step 7300 | epoch 1/2 | loss=4.4438 | avg=5.4632 | acc=17.5% | lr=2.87e-05 | pos=64
103
+ Step 7400 | epoch 1/2 | loss=5.1489 | avg=5.4713 | acc=13.1% | lr=3.00e-05 | pos=64
104
+ Step 7500 | epoch 1/2 | loss=6.4010 | avg=5.4904 | acc=12.8% | lr=3.13e-05 | pos=64
105
+ Step 7600 | epoch 1/2 | loss=3.8629 | avg=5.5108 | acc=23.1% | lr=3.25e-05 | pos=64
106
+ Step 7700 | epoch 1/2 | loss=7.2239 | avg=5.5110 | acc=5.9% | lr=3.38e-05 | pos=64
107
+ Step 7800 | epoch 1/2 | loss=6.8530 | avg=5.5100 | acc=10.3% | lr=3.50e-05 | pos=64
108
+ Step 7900 | epoch 1/2 | loss=6.6124 | avg=5.5197 | acc=8.1% | lr=3.63e-05 | pos=64
109
+ Step 8000 | epoch 1/2 | loss=4.6751 | avg=5.5232 | acc=23.4% | lr=3.75e-05 | pos=64
110
+ Step 8100 | epoch 1/2 | loss=4.0154 | avg=5.5424 | acc=24.4% | lr=3.87e-05 | pos=64
111
+ Step 8200 | epoch 1/2 | loss=5.5367 | avg=5.6600 | acc=12.8% | lr=4.00e-05 | pos=64
112
+ Step 8300 | epoch 1/2 | loss=6.1311 | avg=5.6181 | acc=13.4% | lr=4.12e-05 | pos=64
113
+ Step 8400 | epoch 1/2 | loss=6.5729 | avg=5.6331 | acc=12.5% | lr=4.25e-05 | pos=64
114
+ Step 8500 | epoch 1/2 | loss=4.5534 | avg=5.6485 | acc=16.9% | lr=4.37e-05 | pos=64
115
+ Step 8600 | epoch 1/2 | loss=6.8225 | avg=5.6280 | acc=10.6% | lr=4.50e-05 | pos=64
116
+ Step 8700 | epoch 1/2 | loss=4.0110 | avg=5.6234 | acc=22.8% | lr=4.63e-05 | pos=64
117
+ Step 8800 | epoch 1/2 | loss=5.4399 | avg=5.6160 | acc=13.1% | lr=4.75e-05 | pos=64
118
+ Step 8900 | epoch 1/2 | loss=4.5850 | avg=5.6229 | acc=16.9% | lr=4.87e-05 | pos=64
119
+ Step 9000 | epoch 1/2 | loss=7.4199 | avg=5.6474 | acc=7.8% | lr=5.00e-05 | pos=64
120
+ Step 9100 | epoch 1/2 | loss=7.1357 | avg=5.7880 | acc=6.6% | lr=5.12e-05 | pos=64
121
+ Step 9200 | epoch 1/2 | loss=4.8856 | avg=5.7771 | acc=15.6% | lr=5.25e-05 | pos=64
122
+ Step 9300 | epoch 1/2 | loss=6.1873 | avg=5.8079 | acc=5.9% | lr=5.37e-05 | pos=64
123
+ Step 9400 | epoch 1/2 | loss=5.3464 | avg=5.8165 | acc=15.0% | lr=5.50e-05 | pos=64
124
+ Step 9500 | epoch 1/2 | loss=3.5382 | avg=5.7942 | acc=19.7% | lr=5.63e-05 | pos=64
125
+ Step 9600 | epoch 1/2 | loss=7.2470 | avg=5.8229 | acc=8.8% | lr=5.75e-05 | pos=64
126
+ Step 9700 | epoch 1/2 | loss=7.5141 | avg=5.8537 | acc=6.9% | lr=5.88e-05 | pos=64
127
+ Step 9800 | epoch 1/2 | loss=5.1512 | avg=5.8826 | acc=13.8% | lr=6.00e-05 | pos=64
128
+ Step 9900 | epoch 1/2 | loss=5.1891 | avg=5.8964 | acc=14.4% | lr=6.13e-05 | pos=64
129
+ Step 10000 | epoch 1/2 | loss=6.2276 | avg=5.9194 | acc=9.7% | lr=6.25e-05 | pos=64
130
+ [Checkpoint] Saved step 10000 (loss=6.2276) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
131
+ [Save @ step 10000] loss=6.2276
132
+ [Checkpoint] Saved step 10000 (loss=6.2276) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step10000.pt
133
+ [Prune @ step 10000] zeroed 0.0M / 1407.4M (0.0% sparsity)
134
+ Step 10100 | epoch 1/2 | loss=5.8900 | avg=5.9241 | acc=14.1% | lr=6.38e-05 | pos=64
135
+ Step 10200 | epoch 1/2 | loss=6.1215 | avg=5.9139 | acc=14.7% | lr=6.50e-05 | pos=64
136
+ Step 10300 | epoch 1/2 | loss=6.7283 | avg=5.9547 | acc=11.2% | lr=6.62e-05 | pos=64
137
+ Step 10400 | epoch 1/2 | loss=6.2089 | avg=6.0322 | acc=10.9% | lr=6.75e-05 | pos=64
138
+ Step 10500 | epoch 1/2 | loss=7.0789 | avg=6.0858 | acc=6.2% | lr=6.88e-05 | pos=64
139
+ Step 10600 | epoch 1/2 | loss=6.5472 | avg=6.0790 | acc=12.8% | lr=7.00e-05 | pos=64
140
+ Step 10700 | epoch 1/2 | loss=6.8952 | avg=6.0853 | acc=6.6% | lr=7.13e-05 | pos=64
141
+ Step 10800 | epoch 1/2 | loss=5.0417 | avg=6.0856 | acc=15.6% | lr=7.25e-05 | pos=64
142
+ Step 10900 | epoch 1/2 | loss=4.8823 | avg=6.0906 | acc=19.7% | lr=7.38e-05 | pos=64
143
+ Step 11000 | epoch 1/2 | loss=5.6943 | avg=6.1095 | acc=16.9% | lr=7.50e-05 | pos=64
144
+ Step 11100 | epoch 1/2 | loss=6.4133 | avg=6.1913 | acc=13.1% | lr=7.62e-05 | pos=64
145
+ Step 11200 | epoch 1/2 | loss=7.7836 | avg=6.2704 | acc=8.1% | lr=7.75e-05 | pos=64
146
+ Step 11300 | epoch 1/2 | loss=5.7336 | avg=6.2414 | acc=8.1% | lr=7.88e-05 | pos=64
147
+ Step 11400 | epoch 1/2 | loss=7.5261 | avg=6.2279 | acc=4.7% | lr=8.00e-05 | pos=64
148
+ Step 11500 | epoch 1/2 | loss=7.2932 | avg=6.2434 | acc=5.3% | lr=8.13e-05 | pos=64
149
+ Step 11600 | epoch 1/2 | loss=3.8389 | avg=6.2505 | acc=26.9% | lr=8.25e-05 | pos=64
150
+ Step 11700 | epoch 1/2 | loss=6.8235 | avg=6.2256 | acc=8.8% | lr=8.38e-05 | pos=64
151
+ Step 11800 | epoch 1/2 | loss=5.8012 | avg=6.2382 | acc=11.6% | lr=8.50e-05 | pos=64
152
+ Step 11900 | epoch 1/2 | loss=5.3869 | avg=6.2630 | acc=14.4% | lr=8.63e-05 | pos=64
153
+ Step 12000 | epoch 1/2 | loss=5.2938 | avg=6.2744 | acc=13.8% | lr=8.75e-05 | pos=64
154
+ Step 12100 | epoch 1/2 | loss=6.6599 | avg=6.4246 | acc=11.2% | lr=8.88e-05 | pos=64
155
+ Step 12200 | epoch 1/2 | loss=6.5154 | avg=6.3953 | acc=6.2% | lr=9.00e-05 | pos=64
156
+ Step 12300 | epoch 1/2 | loss=5.3954 | avg=6.4561 | acc=14.7% | lr=9.12e-05 | pos=64
157
+ Step 12400 | epoch 1/2 | loss=7.5228 | avg=6.3996 | acc=5.0% | lr=9.25e-05 | pos=64
158
+ Step 12500 | epoch 1/2 | loss=7.7880 | avg=6.3830 | acc=6.2% | lr=9.38e-05 | pos=64
159
+ Step 12600 | epoch 1/2 | loss=7.4444 | avg=6.3519 | acc=6.6% | lr=9.50e-05 | pos=64
160
+ Step 12700 | epoch 1/2 | loss=7.9002 | avg=6.3342 | acc=6.6% | lr=9.63e-05 | pos=64
161
+ Step 12800 | epoch 1/2 | loss=6.0377 | avg=6.3263 | acc=10.0% | lr=9.75e-05 | pos=64
162
+ Step 12900 | epoch 1/2 | loss=6.9872 | avg=6.3369 | acc=8.8% | lr=9.88e-05 | pos=64
163
+ Step 13000 | epoch 1/2 | loss=5.5612 | avg=6.3423 | acc=15.6% | lr=1.00e-04 | pos=64
164
+ Step 13100 | epoch 1/2 | loss=5.8940 | avg=6.5114 | acc=8.4% | lr=1.00e-04 | pos=64
165
+ Step 13200 | epoch 1/2 | loss=7.6319 | avg=6.4637 | acc=4.7% | lr=1.00e-04 | pos=64
166
+ Step 13300 | epoch 1/2 | loss=5.4036 | avg=6.4090 | acc=14.1% | lr=1.00e-04 | pos=64
167
+ Step 13400 | epoch 1/2 | loss=5.3561 | avg=6.3912 | acc=13.1% | lr=1.00e-04 | pos=64
168
+ Step 13500 | epoch 1/2 | loss=6.9826 | avg=6.3646 | acc=8.8% | lr=1.00e-04 | pos=64
169
+ Step 13600 | epoch 1/2 | loss=5.7324 | avg=6.3270 | acc=14.7% | lr=1.00e-04 | pos=64
170
+ Step 13700 | epoch 1/2 | loss=4.5450 | avg=6.2767 | acc=19.7% | lr=1.00e-04 | pos=64
171
+ Step 13800 | epoch 1/2 | loss=4.9770 | avg=6.2691 | acc=15.0% | lr=1.00e-04 | pos=64
172
+ Step 13900 | epoch 1/2 | loss=5.7575 | avg=6.2462 | acc=12.5% | lr=1.00e-04 | pos=64
173
+ Step 14000 | epoch 1/2 | loss=6.1865 | avg=6.2350 | acc=11.6% | lr=1.00e-04 | pos=64
174
+ Step 14100 | epoch 1/2 | loss=5.2309 | avg=6.1144 | acc=14.1% | lr=1.00e-04 | pos=64
175
+ Step 14200 | epoch 1/2 | loss=6.7469 | avg=6.0611 | acc=6.6% | lr=9.99e-05 | pos=64
176
+ Step 14300 | epoch 1/2 | loss=5.6130 | avg=6.1187 | acc=15.3% | lr=9.99e-05 | pos=64
177
+ Step 14400 | epoch 1/2 | loss=7.1063 | avg=6.1532 | acc=6.9% | lr=9.99e-05 | pos=64
178
+ Step 14500 | epoch 1/2 | loss=6.6918 | avg=6.0775 | acc=10.9% | lr=9.99e-05 | pos=64
179
+ Step 14600 | epoch 1/2 | loss=5.2415 | avg=6.0832 | acc=13.4% | lr=9.99e-05 | pos=64
180
+ Step 14700 | epoch 1/2 | loss=6.1558 | avg=6.0358 | acc=10.6% | lr=9.99e-05 | pos=64
181
+ Step 14800 | epoch 1/2 | loss=6.6280 | avg=6.0206 | acc=9.7% | lr=9.99e-05 | pos=64
182
+ Step 14900 | epoch 1/2 | loss=6.3373 | avg=6.0078 | acc=13.8% | lr=9.99e-05 | pos=64
183
+ Step 15000 | epoch 1/2 | loss=6.4039 | avg=6.0172 | acc=8.8% | lr=9.98e-05 | pos=64
184
+ [Checkpoint] Saved step 15000 (loss=6.4039) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
185
+ [Save @ step 15000] loss=6.4039
186
+ [Checkpoint] Saved step 15000 (loss=6.4039) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step15000.pt
187
+ [Prune @ step 15000] zeroed 0.0M / 1407.4M (0.0% sparsity)
188
+ Step 15100 | epoch 1/2 | loss=4.2505 | avg=5.8018 | acc=29.7% | lr=9.98e-05 | pos=64
189
+ Step 15200 | epoch 1/2 | loss=5.3202 | avg=5.8986 | acc=18.4% | lr=9.98e-05 | pos=64
190
+ Step 15300 | epoch 1/2 | loss=4.9784 | avg=5.9032 | acc=22.2% | lr=9.98e-05 | pos=64
191
+ Step 15400 | epoch 1/2 | loss=5.5990 | avg=5.9162 | acc=7.5% | lr=9.98e-05 | pos=64
192
+ Step 15500 | epoch 1/2 | loss=6.0779 | avg=5.8561 | acc=12.8% | lr=9.97e-05 | pos=64
193
+ Step 15600 | epoch 1/2 | loss=5.3501 | avg=5.8783 | acc=17.5% | lr=9.97e-05 | pos=64
194
+ Step 15700 | epoch 1/2 | loss=5.4835 | avg=5.8528 | acc=14.1% | lr=9.97e-05 | pos=64
195
+ Step 15800 | epoch 1/2 | loss=5.8244 | avg=5.8483 | acc=11.6% | lr=9.97e-05 | pos=64
196
+ Step 15900 | epoch 1/2 | loss=5.0472 | avg=5.8358 | acc=11.6% | lr=9.97e-05 | pos=64
197
+ Step 16000 | epoch 1/2 | loss=5.7255 | avg=5.8358 | acc=14.7% | lr=9.96e-05 | pos=64
198
+ Step 16100 | epoch 1/2 | loss=6.9115 | avg=5.9519 | acc=9.1% | lr=9.96e-05 | pos=64
199
+ Step 16200 | epoch 1/2 | loss=7.3667 | avg=5.9884 | acc=3.8% | lr=9.96e-05 | pos=64
200
+ Step 16300 | epoch 1/2 | loss=5.9598 | avg=5.8925 | acc=9.7% | lr=9.96e-05 | pos=64
201
+ Step 16400 | epoch 1/2 | loss=4.7891 | avg=5.8581 | acc=20.6% | lr=9.95e-05 | pos=64
202
+ Step 16500 | epoch 1/2 | loss=5.1974 | avg=5.8379 | acc=16.6% | lr=9.95e-05 | pos=64
203
+ Step 16600 | epoch 1/2 | loss=4.4763 | avg=5.8057 | acc=21.6% | lr=9.95e-05 | pos=64
204
+ Step 16700 | epoch 1/2 | loss=5.6903 | avg=5.8039 | acc=14.4% | lr=9.94e-05 | pos=64
205
+ Step 16800 | epoch 1/2 | loss=6.3023 | avg=5.7967 | acc=10.3% | lr=9.94e-05 | pos=64
206
+ Step 16900 | epoch 1/2 | loss=4.3212 | avg=5.7818 | acc=23.4% | lr=9.94e-05 | pos=64
207
+ Step 17000 | epoch 1/2 | loss=3.9120 | avg=5.7626 | acc=22.8% | lr=9.94e-05 | pos=64
208
+ Step 17100 | epoch 1/2 | loss=6.4101 | avg=5.7769 | acc=9.4% | lr=9.93e-05 | pos=64
209
+ Step 17200 | epoch 1/2 | loss=4.9407 | avg=5.8005 | acc=19.1% | lr=9.93e-05 | pos=64
210
+ Step 17300 | epoch 1/2 | loss=8.5146 | avg=5.7985 | acc=4.7% | lr=9.93e-05 | pos=64
211
+ Step 17400 | epoch 1/2 | loss=6.6819 | avg=5.7593 | acc=6.6% | lr=9.92e-05 | pos=64
212
+ Step 17500 | epoch 1/2 | loss=5.3934 | avg=5.7124 | acc=11.2% | lr=9.92e-05 | pos=64
213
+ Step 17600 | epoch 1/2 | loss=5.6320 | avg=5.7167 | acc=13.1% | lr=9.92e-05 | pos=64
214
+ Step 17700 | epoch 1/2 | loss=4.9097 | avg=5.7025 | acc=19.7% | lr=9.91e-05 | pos=64
215
+ Step 17800 | epoch 1/2 | loss=5.3642 | avg=5.6747 | acc=11.2% | lr=9.91e-05 | pos=64
216
+ Step 17900 | epoch 1/2 | loss=5.7257 | avg=5.6797 | acc=9.4% | lr=9.90e-05 | pos=64
217
+ Step 18000 | epoch 1/2 | loss=7.2424 | avg=5.6691 | acc=5.0% | lr=9.90e-05 | pos=64
218
+ Step 18100 | epoch 1/2 | loss=4.9557 | avg=5.6849 | acc=17.8% | lr=9.90e-05 | pos=64
219
+ Step 18200 | epoch 1/2 | loss=5.3597 | avg=5.7598 | acc=12.8% | lr=9.89e-05 | pos=64
220
+ Step 18300 | epoch 1/2 | loss=5.5707 | avg=5.7254 | acc=16.9% | lr=9.89e-05 | pos=64
221
+ Step 18400 | epoch 1/2 | loss=5.3697 | avg=5.6821 | acc=14.7% | lr=9.88e-05 | pos=64
222
+ Step 18500 | epoch 1/2 | loss=5.9737 | avg=5.6687 | acc=8.4% | lr=9.88e-05 | pos=64
223
+ Step 18600 | epoch 1/2 | loss=6.3940 | avg=5.6782 | acc=12.5% | lr=9.87e-05 | pos=64
224
+ Step 18700 | epoch 1/2 | loss=6.1741 | avg=5.6582 | acc=9.7% | lr=9.87e-05 | pos=64
225
+ Step 18800 | epoch 1/2 | loss=5.0890 | avg=5.6381 | acc=20.0% | lr=9.87e-05 | pos=64
226
+ Step 18900 | epoch 1/2 | loss=9.5439 | avg=5.6474 | acc=7.2% | lr=9.86e-05 | pos=64
227
+ Step 19000 | epoch 1/2 | loss=6.2727 | avg=5.6501 | acc=12.2% | lr=9.86e-05 | pos=64
228
+ Step 19100 | epoch 1/2 | loss=5.0060 | avg=5.6381 | acc=15.3% | lr=9.85e-05 | pos=64
229
+ Step 19200 | epoch 1/2 | loss=4.6388 | avg=5.6294 | acc=23.1% | lr=9.85e-05 | pos=64
230
+ Step 19300 | epoch 1/2 | loss=5.7475 | avg=5.6296 | acc=14.4% | lr=9.84e-05 | pos=64
231
+ Step 19400 | epoch 1/2 | loss=6.7555 | avg=5.6299 | acc=7.8% | lr=9.84e-05 | pos=64
232
+ Step 19500 | epoch 1/2 | loss=7.1358 | avg=5.5876 | acc=6.6% | lr=9.83e-05 | pos=64
233
+ Step 19600 | epoch 1/2 | loss=4.5881 | avg=5.5850 | acc=25.6% | lr=9.83e-05 | pos=64
234
+ Step 19700 | epoch 1/2 | loss=4.3789 | avg=5.5623 | acc=22.5% | lr=9.82e-05 | pos=64
235
+ Step 19800 | epoch 1/2 | loss=5.7571 | avg=5.5662 | acc=13.8% | lr=9.81e-05 | pos=64
236
+ Step 19900 | epoch 1/2 | loss=8.4748 | avg=5.5578 | acc=2.5% | lr=9.81e-05 | pos=64
237
+ Step 20000 | epoch 1/2 | loss=5.1173 | avg=5.5488 | acc=18.8% | lr=9.80e-05 | pos=64
238
+ [Checkpoint] Saved step 20000 (loss=5.1173) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
239
+ [Save @ step 20000] loss=5.1173
240
+ [Checkpoint] Saved step 20000 (loss=5.1173) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step20000.pt
241
+ [Prune @ step 20000] zeroed 0.0M / 1407.4M (0.0% sparsity)
242
+ Step 20100 | epoch 1/2 | loss=5.4756 | avg=5.5505 | acc=14.1% | lr=9.80e-05 | pos=64
243
+ Step 20200 | epoch 1/2 | loss=5.1688 | avg=5.5710 | acc=14.7% | lr=9.79e-05 | pos=64
244
+ Step 20300 | epoch 1/2 | loss=8.4751 | avg=5.5537 | acc=4.7% | lr=9.79e-05 | pos=64
245
+ Step 20400 | epoch 1/2 | loss=5.0624 | avg=5.5354 | acc=14.4% | lr=9.78e-05 | pos=64
246
+ Step 20500 | epoch 1/2 | loss=5.6380 | avg=5.5492 | acc=10.6% | lr=9.78e-05 | pos=64
247
+ Step 20600 | epoch 1/2 | loss=4.9411 | avg=5.5314 | acc=17.2% | lr=9.77e-05 | pos=64
248
+ Step 20700 | epoch 1/2 | loss=5.1845 | avg=5.5182 | acc=17.2% | lr=9.76e-05 | pos=64
249
+ Step 20800 | epoch 1/2 | loss=4.4440 | avg=5.4811 | acc=18.4% | lr=9.76e-05 | pos=64
250
+ Step 20900 | epoch 1/2 | loss=6.9643 | avg=5.4939 | acc=11.6% | lr=9.75e-05 | pos=64
251
+ Step 21000 | epoch 1/2 | loss=7.3513 | avg=5.4954 | acc=4.7% | lr=9.74e-05 | pos=64
252
+ Step 21100 | epoch 1/2 | loss=4.9541 | avg=5.4434 | acc=13.4% | lr=9.74e-05 | pos=64
253
+ Step 21200 | epoch 1/2 | loss=5.8833 | avg=5.3946 | acc=10.6% | lr=9.73e-05 | pos=64
254
+ Step 21300 | epoch 1/2 | loss=4.7049 | avg=5.3664 | acc=17.5% | lr=9.73e-05 | pos=64
255
+ Step 21400 | epoch 1/2 | loss=7.0340 | avg=5.3853 | acc=3.8% | lr=9.72e-05 | pos=64
256
+ Step 21500 | epoch 1/2 | loss=4.7712 | avg=5.4051 | acc=14.4% | lr=9.71e-05 | pos=64
257
+ Step 21600 | epoch 1/2 | loss=4.1569 | avg=5.3882 | acc=20.3% | lr=9.71e-05 | pos=64
258
+ Step 21700 | epoch 1/2 | loss=4.9068 | avg=5.3744 | acc=17.5% | lr=9.70e-05 | pos=64
259
+ Step 21800 | epoch 1/2 | loss=5.4254 | avg=5.3602 | acc=11.6% | lr=9.69e-05 | pos=64
260
+ Step 21900 | epoch 1/2 | loss=6.2506 | avg=5.3689 | acc=14.1% | lr=9.68e-05 | pos=64
261
+ Step 22000 | epoch 1/2 | loss=5.2534 | avg=5.3725 | acc=14.7% | lr=9.68e-05 | pos=64
262
+ Step 22100 | epoch 1/2 | loss=4.6903 | avg=5.3125 | acc=18.1% | lr=9.67e-05 | pos=64
263
+ Step 22200 | epoch 1/2 | loss=4.0345 | avg=5.3496 | acc=17.2% | lr=9.66e-05 | pos=64
264
+ Step 22300 | epoch 1/2 | loss=5.4078 | avg=5.3544 | acc=15.6% | lr=9.66e-05 | pos=64
265
+ Step 22400 | epoch 1/2 | loss=6.7715 | avg=5.3836 | acc=8.1% | lr=9.65e-05 | pos=64
266
+ Step 22500 | epoch 1/2 | loss=9.3450 | avg=5.3618 | acc=8.8% | lr=9.64e-05 | pos=64
267
+ Step 22600 | epoch 1/2 | loss=6.1452 | avg=5.3786 | acc=12.8% | lr=9.63e-05 | pos=64
268
+ Step 22700 | epoch 1/2 | loss=6.4993 | avg=5.3722 | acc=10.0% | lr=9.63e-05 | pos=64
269
+ Step 22800 | epoch 1/2 | loss=6.7072 | avg=5.3715 | acc=7.5% | lr=9.62e-05 | pos=64
270
+ Step 22900 | epoch 1/2 | loss=5.6727 | avg=5.3747 | acc=16.9% | lr=9.61e-05 | pos=64
271
+ Step 23000 | epoch 1/2 | loss=4.0313 | avg=5.3578 | acc=27.5% | lr=9.60e-05 | pos=64
272
+ Step 23100 | epoch 1/2 | loss=6.6814 | avg=5.3814 | acc=9.1% | lr=9.60e-05 | pos=64
273
+ Step 23200 | epoch 1/2 | loss=3.7822 | avg=5.2563 | acc=26.6% | lr=9.59e-05 | pos=64
274
+ Step 23300 | epoch 1/2 | loss=5.5860 | avg=5.3178 | acc=25.9% | lr=9.58e-05 | pos=64
275
+ Step 23400 | epoch 1/2 | loss=3.8420 | avg=5.3004 | acc=25.6% | lr=9.57e-05 | pos=64
276
+ Step 23500 | epoch 1/2 | loss=4.1972 | avg=5.2734 | acc=21.9% | lr=9.56e-05 | pos=64
277
+ Step 23600 | epoch 1/2 | loss=4.7770 | avg=5.2667 | acc=18.8% | lr=9.55e-05 | pos=64
278
+ Step 23700 | epoch 1/2 | loss=5.3051 | avg=5.2498 | acc=17.8% | lr=9.55e-05 | pos=64
279
+ Step 23800 | epoch 1/2 | loss=5.1812 | avg=5.2774 | acc=13.1% | lr=9.54e-05 | pos=64
280
+ Step 23900 | epoch 1/2 | loss=5.8178 | avg=5.2822 | acc=16.6% | lr=9.53e-05 | pos=64
281
+ Step 24000 | epoch 1/2 | loss=4.4594 | avg=5.2729 | acc=17.8% | lr=9.52e-05 | pos=64
282
+ Step 24100 | epoch 1/2 | loss=4.0387 | avg=5.1920 | acc=20.9% | lr=9.51e-05 | pos=64
283
+ Step 24200 | epoch 1/2 | loss=6.8931 | avg=5.2629 | acc=9.1% | lr=9.50e-05 | pos=64
284
+ Step 24300 | epoch 1/2 | loss=4.7364 | avg=5.2967 | acc=22.5% | lr=9.50e-05 | pos=64
285
+ Step 24400 | epoch 1/2 | loss=4.4333 | avg=5.2412 | acc=23.8% | lr=9.49e-05 | pos=64
286
+ Step 24500 | epoch 1/2 | loss=4.4960 | avg=5.2340 | acc=23.4% | lr=9.48e-05 | pos=64
287
+ Step 24600 | epoch 1/2 | loss=4.1843 | avg=5.2443 | acc=23.4% | lr=9.47e-05 | pos=64
288
+ Step 24700 | epoch 1/2 | loss=7.4006 | avg=5.2936 | acc=6.9% | lr=9.46e-05 | pos=64
289
+ Step 24800 | epoch 1/2 | loss=3.6557 | avg=5.2591 | acc=28.4% | lr=9.45e-05 | pos=64
290
+ Step 24900 | epoch 1/2 | loss=4.9822 | avg=5.2392 | acc=17.5% | lr=9.44e-05 | pos=64
291
+ Step 25000 | epoch 1/2 | loss=4.4623 | avg=5.2394 | acc=18.1% | lr=9.43e-05 | pos=64
292
+ [Checkpoint] Saved step 25000 (loss=4.4623) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
293
+ [Save @ step 25000] loss=4.4623
294
+ [Checkpoint] Saved step 25000 (loss=4.4623) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step25000.pt
295
+ [Prune @ step 25000] zeroed 0.0M / 1407.4M (0.0% sparsity)
296
+ Step 25100 | epoch 1/2 | loss=3.3957 | avg=5.2251 | acc=32.2% | lr=9.42e-05 | pos=64
297
+ Step 25200 | epoch 1/2 | loss=3.5391 | avg=5.2375 | acc=27.5% | lr=9.41e-05 | pos=64
298
+ Step 25300 | epoch 1/2 | loss=4.9235 | avg=5.2656 | acc=15.6% | lr=9.40e-05 | pos=64
299
+ Step 25400 | epoch 1/2 | loss=5.1743 | avg=5.2758 | acc=15.6% | lr=9.39e-05 | pos=64
300
+ Step 25500 | epoch 1/2 | loss=3.6510 | avg=5.2463 | acc=25.3% | lr=9.39e-05 | pos=64
301
+ Step 25600 | epoch 1/2 | loss=5.4870 | avg=5.2134 | acc=11.9% | lr=9.38e-05 | pos=64
302
+ Step 25700 | epoch 1/2 | loss=3.2654 | avg=5.2024 | acc=38.4% | lr=9.37e-05 | pos=64
303
+ Step 25800 | epoch 1/2 | loss=4.1588 | avg=5.2047 | acc=21.9% | lr=9.36e-05 | pos=64
304
+ Step 25900 | epoch 1/2 | loss=3.7836 | avg=5.2088 | acc=23.8% | lr=9.35e-05 | pos=64
305
+ Step 26000 | epoch 1/2 | loss=4.3097 | avg=5.2034 | acc=20.3% | lr=9.34e-05 | pos=64
306
+ Step 26100 | epoch 1/2 | loss=4.6737 | avg=5.1301 | acc=19.7% | lr=9.33e-05 | pos=64
307
+ Step 26200 | epoch 1/2 | loss=3.9339 | avg=5.2112 | acc=26.2% | lr=9.32e-05 | pos=64
308
+ Step 26300 | epoch 1/2 | loss=6.8034 | avg=5.1860 | acc=11.6% | lr=9.31e-05 | pos=64
309
+ Step 26400 | epoch 1/2 | loss=5.6778 | avg=5.1827 | acc=12.2% | lr=9.30e-05 | pos=64
310
+ Step 26500 | epoch 1/2 | loss=5.0070 | avg=5.2093 | acc=18.1% | lr=9.29e-05 | pos=64
311
+ Step 26600 | epoch 1/2 | loss=6.1985 | avg=5.1966 | acc=13.1% | lr=9.28e-05 | pos=64
312
+ Step 26700 | epoch 1/2 | loss=5.7865 | avg=5.2021 | acc=13.1% | lr=9.26e-05 | pos=64
313
+ Step 26800 | epoch 1/2 | loss=4.6918 | avg=5.1976 | acc=15.6% | lr=9.25e-05 | pos=64
314
+ Step 26900 | epoch 1/2 | loss=6.2116 | avg=5.1911 | acc=11.9% | lr=9.24e-05 | pos=64
315
+ Step 27000 | epoch 1/2 | loss=3.0124 | avg=5.1775 | acc=31.2% | lr=9.23e-05 | pos=64
316
+ Step 27100 | epoch 1/2 | loss=4.9378 | avg=5.2475 | acc=13.1% | lr=9.22e-05 | pos=64
317
+ Step 27200 | epoch 1/2 | loss=4.4908 | avg=5.1403 | acc=20.3% | lr=9.21e-05 | pos=64
318
+ Step 27300 | epoch 1/2 | loss=4.0158 | avg=5.1399 | acc=25.9% | lr=9.20e-05 | pos=64
319
+ Step 27400 | epoch 1/2 | loss=5.7095 | avg=5.1167 | acc=13.1% | lr=9.19e-05 | pos=64
320
+ Step 27500 | epoch 1/2 | loss=6.7299 | avg=5.1009 | acc=10.9% | lr=9.18e-05 | pos=64
321
+ Step 27600 | epoch 1/2 | loss=5.1221 | avg=5.0998 | acc=13.1% | lr=9.17e-05 | pos=64
322
+ Step 27700 | epoch 1/2 | loss=5.4922 | avg=5.1194 | acc=12.5% | lr=9.16e-05 | pos=64
323
+ Step 27800 | epoch 1/2 | loss=5.9491 | avg=5.1337 | acc=13.4% | lr=9.15e-05 | pos=64
324
+ Step 27900 | epoch 1/2 | loss=4.2654 | avg=5.1359 | acc=24.4% | lr=9.13e-05 | pos=64
325
+ Step 28000 | epoch 1/2 | loss=5.3780 | avg=5.1356 | acc=16.6% | lr=9.12e-05 | pos=64
326
+ Step 28100 | epoch 1/2 | loss=5.6094 | avg=4.9985 | acc=13.4% | lr=9.11e-05 | pos=64
327
+ Step 28200 | epoch 1/2 | loss=4.0248 | avg=5.1104 | acc=32.8% | lr=9.10e-05 | pos=64
328
+ Step 28300 | epoch 1/2 | loss=4.5946 | avg=5.1675 | acc=15.3% | lr=9.09e-05 | pos=64
329
+ Step 28400 | epoch 1/2 | loss=6.2588 | avg=5.1339 | acc=13.4% | lr=9.08e-05 | pos=64
330
+ Step 28500 | epoch 1/2 | loss=5.9369 | avg=5.1111 | acc=9.1% | lr=9.07e-05 | pos=64
331
+ Step 28600 | epoch 1/2 | loss=7.0753 | avg=5.1176 | acc=10.6% | lr=9.05e-05 | pos=64
332
+ Step 28700 | epoch 1/2 | loss=4.8857 | avg=5.1273 | acc=14.4% | lr=9.04e-05 | pos=64
333
+ Step 28800 | epoch 1/2 | loss=4.1414 | avg=5.1216 | acc=25.6% | lr=9.03e-05 | pos=64
334
+ Step 28900 | epoch 1/2 | loss=5.8579 | avg=5.1102 | acc=11.9% | lr=9.02e-05 | pos=64
335
+ Step 29000 | epoch 1/2 | loss=5.0406 | avg=5.1018 | acc=15.6% | lr=9.01e-05 | pos=64
336
+ Step 29100 | epoch 1/2 | loss=5.7378 | avg=4.9941 | acc=12.2% | lr=9.00e-05 | pos=64
337
+ Step 29200 | epoch 1/2 | loss=5.6251 | avg=5.0211 | acc=12.5% | lr=8.98e-05 | pos=64
338
+ Step 29300 | epoch 1/2 | loss=4.2895 | avg=4.9873 | acc=19.1% | lr=8.97e-05 | pos=64
339
+ Step 29400 | epoch 1/2 | loss=5.7916 | avg=5.0025 | acc=15.6% | lr=8.96e-05 | pos=64
340
+ Step 29500 | epoch 1/2 | loss=4.0017 | avg=5.0211 | acc=17.8% | lr=8.95e-05 | pos=64
341
+ Step 29600 | epoch 1/2 | loss=5.8437 | avg=5.0314 | acc=12.2% | lr=8.93e-05 | pos=64
342
+ Step 29700 | epoch 1/2 | loss=4.3955 | avg=5.0171 | acc=20.6% | lr=8.92e-05 | pos=64
343
+ Step 29800 | epoch 1/2 | loss=5.0815 | avg=5.0323 | acc=20.0% | lr=8.91e-05 | pos=64
344
+ Step 29900 | epoch 1/2 | loss=4.6394 | avg=5.0057 | acc=24.4% | lr=8.90e-05 | pos=64
345
+ Step 30000 | epoch 1/2 | loss=5.7835 | avg=5.0147 | acc=11.2% | lr=8.89e-05 | pos=64
346
+ [Checkpoint] Saved step 30000 (loss=5.7835) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
347
+ [Save @ step 30000] loss=5.7835
348
+ [Checkpoint] Saved step 30000 (loss=5.7835) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step30000.pt
349
+ [Prune @ step 30000] zeroed 0.0M / 1407.4M (0.0% sparsity)
350
+ Step 30100 | epoch 1/2 | loss=4.7267 | avg=4.8656 | acc=21.9% | lr=8.87e-05 | pos=64
351
+ Step 30200 | epoch 1/2 | loss=4.4325 | avg=4.9138 | acc=22.2% | lr=8.86e-05 | pos=64
352
+ Step 30300 | epoch 1/2 | loss=4.8922 | avg=4.9353 | acc=15.3% | lr=8.85e-05 | pos=64
353
+ Step 30400 | epoch 1/2 | loss=4.9547 | avg=4.9822 | acc=15.9% | lr=8.83e-05 | pos=64
354
+ Step 30500 | epoch 1/2 | loss=5.1371 | avg=4.9771 | acc=12.8% | lr=8.82e-05 | pos=64
355
+ Step 30600 | epoch 1/2 | loss=3.7506 | avg=5.0036 | acc=26.6% | lr=8.81e-05 | pos=64
356
+ Step 30700 | epoch 1/2 | loss=5.1478 | avg=4.9848 | acc=18.1% | lr=8.80e-05 | pos=64
357
+ Step 30800 | epoch 1/2 | loss=4.3728 | avg=4.9735 | acc=23.4% | lr=8.78e-05 | pos=64
358
+ Step 30900 | epoch 1/2 | loss=5.3286 | avg=4.9876 | acc=17.8% | lr=8.77e-05 | pos=64
359
+ Step 31000 | epoch 1/2 | loss=4.2759 | avg=4.9893 | acc=20.0% | lr=8.76e-05 | pos=64
360
+ Step 31100 | epoch 1/2 | loss=5.9771 | avg=5.1014 | acc=14.4% | lr=8.74e-05 | pos=64
361
+ Step 31200 | epoch 1/2 | loss=4.1367 | avg=5.0642 | acc=33.1% | lr=8.73e-05 | pos=64
362
+ Step 31300 | epoch 1/2 | loss=5.9808 | avg=4.9843 | acc=10.6% | lr=8.72e-05 | pos=64
363
+ Step 31400 | epoch 1/2 | loss=4.9826 | avg=4.9561 | acc=14.4% | lr=8.70e-05 | pos=64
364
+ Step 31500 | epoch 1/2 | loss=4.0148 | avg=4.9658 | acc=23.1% | lr=8.69e-05 | pos=64
365
+ Step 31600 | epoch 1/2 | loss=3.3919 | avg=4.9370 | acc=26.9% | lr=8.68e-05 | pos=64
366
+ Step 31700 | epoch 1/2 | loss=4.8560 | avg=4.9134 | acc=18.8% | lr=8.66e-05 | pos=64
367
+ Step 31800 | epoch 1/2 | loss=5.1120 | avg=4.9220 | acc=13.8% | lr=8.65e-05 | pos=64
368
+ Step 31900 | epoch 1/2 | loss=6.1961 | avg=4.9327 | acc=13.4% | lr=8.64e-05 | pos=64
369
+ Step 32000 | epoch 1/2 | loss=4.5527 | avg=4.9425 | acc=23.8% | lr=8.62e-05 | pos=64
370
+ Step 32100 | epoch 1/2 | loss=4.9468 | avg=5.1283 | acc=18.4% | lr=8.61e-05 | pos=64
371
+ Step 32200 | epoch 1/2 | loss=3.7239 | avg=4.9893 | acc=28.1% | lr=8.59e-05 | pos=64
372
+ Step 32300 | epoch 1/2 | loss=5.6031 | avg=4.9943 | acc=9.1% | lr=8.58e-05 | pos=64
373
+ Step 32400 | epoch 1/2 | loss=3.6938 | avg=5.0201 | acc=28.1% | lr=8.57e-05 | pos=64
374
+ Step 32500 | epoch 1/2 | loss=4.3661 | avg=5.0048 | acc=24.1% | lr=8.55e-05 | pos=64
375
+ Step 32600 | epoch 1/2 | loss=4.6400 | avg=4.9678 | acc=14.1% | lr=8.54e-05 | pos=64
376
+ Step 32700 | epoch 1/2 | loss=5.0756 | avg=4.9601 | acc=16.2% | lr=8.52e-05 | pos=64
377
+ Step 32800 | epoch 1/2 | loss=4.4300 | avg=4.9512 | acc=21.9% | lr=8.51e-05 | pos=64
378
+ Step 32900 | epoch 1/2 | loss=5.4190 | avg=4.9723 | acc=13.1% | lr=8.50e-05 | pos=64
379
+ Step 33000 | epoch 1/2 | loss=4.1838 | avg=4.9546 | acc=27.2% | lr=8.48e-05 | pos=64
380
+ Step 33100 | epoch 1/2 | loss=5.1738 | avg=4.8528 | acc=13.8% | lr=8.47e-05 | pos=64
381
+ Step 33200 | epoch 1/2 | loss=6.6131 | avg=5.0019 | acc=10.3% | lr=8.45e-05 | pos=64
382
+ Step 33300 | epoch 1/2 | loss=4.0026 | avg=4.9887 | acc=24.1% | lr=8.44e-05 | pos=64
383
+ Step 33400 | epoch 1/2 | loss=5.3191 | avg=4.9493 | acc=17.2% | lr=8.42e-05 | pos=64
384
+ Step 33500 | epoch 1/2 | loss=6.1506 | avg=4.9538 | acc=12.8% | lr=8.41e-05 | pos=64
385
+ Step 33600 | epoch 1/2 | loss=4.4988 | avg=4.9433 | acc=18.8% | lr=8.40e-05 | pos=64
386
+ Step 33700 | epoch 1/2 | loss=4.9283 | avg=4.9385 | acc=18.1% | lr=8.38e-05 | pos=64
387
+ Step 33800 | epoch 1/2 | loss=3.6502 | avg=4.9370 | acc=31.2% | lr=8.37e-05 | pos=64
388
+ Step 33900 | epoch 1/2 | loss=5.3868 | avg=4.9375 | acc=15.9% | lr=8.35e-05 | pos=64
389
+ Step 34000 | epoch 1/2 | loss=4.7499 | avg=4.9267 | acc=20.9% | lr=8.34e-05 | pos=64
390
+ Step 34100 | epoch 1/2 | loss=4.0668 | avg=4.8603 | acc=25.0% | lr=8.32e-05 | pos=64
391
+ Step 34200 | epoch 1/2 | loss=6.0244 | avg=4.7980 | acc=13.8% | lr=8.31e-05 | pos=64
392
+ Step 34300 | epoch 1/2 | loss=6.1788 | avg=4.9079 | acc=11.2% | lr=8.29e-05 | pos=64
393
+ Step 34400 | epoch 1/2 | loss=4.1456 | avg=4.8985 | acc=25.9% | lr=8.28e-05 | pos=64
394
+ Step 34500 | epoch 1/2 | loss=4.1256 | avg=4.8664 | acc=22.5% | lr=8.26e-05 | pos=64
395
+ Step 34600 | epoch 1/2 | loss=3.3021 | avg=4.8585 | acc=31.2% | lr=8.25e-05 | pos=64
396
+ Step 34700 | epoch 1/2 | loss=4.5752 | avg=4.8328 | acc=21.9% | lr=8.23e-05 | pos=64
397
+ Step 34800 | epoch 1/2 | loss=4.3158 | avg=4.8388 | acc=22.8% | lr=8.22e-05 | pos=64
398
+ Step 34900 | epoch 1/2 | loss=4.7157 | avg=4.8370 | acc=20.9% | lr=8.20e-05 | pos=64
399
+ Step 35000 | epoch 1/2 | loss=4.5456 | avg=4.8354 | acc=22.2% | lr=8.19e-05 | pos=64
400
+ [Checkpoint] Saved step 35000 (loss=4.5456) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
401
+ [Save @ step 35000] loss=4.5456
402
+ [Checkpoint] Saved step 35000 (loss=4.5456) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step35000.pt
403
+ [Prune @ step 35000] zeroed 0.0M / 1407.4M (0.0% sparsity)
404
+ Step 35100 | epoch 1/2 | loss=4.8744 | avg=4.8083 | acc=19.4% | lr=8.17e-05 | pos=64
405
+ Step 35200 | epoch 1/2 | loss=4.0071 | avg=4.8246 | acc=29.7% | lr=8.16e-05 | pos=64
406
+ Step 35300 | epoch 1/2 | loss=5.6044 | avg=4.7499 | acc=12.5% | lr=8.14e-05 | pos=64
407
+ Step 35400 | epoch 1/2 | loss=4.2247 | avg=4.7446 | acc=28.1% | lr=8.13e-05 | pos=64
408
+ Step 35500 | epoch 1/2 | loss=3.1700 | avg=4.7470 | acc=33.1% | lr=8.11e-05 | pos=64
409
+ Step 35600 | epoch 1/2 | loss=6.0358 | avg=4.7393 | acc=8.1% | lr=8.09e-05 | pos=64
410
+ Step 35700 | epoch 1/2 | loss=5.0337 | avg=4.7483 | acc=13.8% | lr=8.08e-05 | pos=64
411
+ Step 35800 | epoch 1/2 | loss=4.3446 | avg=4.7279 | acc=18.1% | lr=8.06e-05 | pos=64
412
+ Step 35900 | epoch 1/2 | loss=6.4008 | avg=4.7461 | acc=7.5% | lr=8.05e-05 | pos=64
413
+ Step 36000 | epoch 1/2 | loss=4.2510 | avg=4.7567 | acc=18.8% | lr=8.03e-05 | pos=64
414
+ Step 36100 | epoch 1/2 | loss=3.6081 | avg=4.7084 | acc=31.2% | lr=8.02e-05 | pos=64
415
+ Step 36200 | epoch 1/2 | loss=4.5875 | avg=4.7817 | acc=16.2% | lr=8.00e-05 | pos=64
416
+ Step 36300 | epoch 1/2 | loss=5.6620 | avg=4.8110 | acc=15.9% | lr=7.98e-05 | pos=64
417
+ Step 36400 | epoch 1/2 | loss=2.7728 | avg=4.8058 | acc=34.7% | lr=7.97e-05 | pos=64
418
+ Step 36500 | epoch 1/2 | loss=3.4039 | avg=4.7638 | acc=33.1% | lr=7.95e-05 | pos=64
419
+ Step 36600 | epoch 1/2 | loss=5.3272 | avg=4.7858 | acc=17.2% | lr=7.94e-05 | pos=64
420
+ Step 36700 | epoch 1/2 | loss=5.1757 | avg=4.7830 | acc=16.2% | lr=7.92e-05 | pos=64
421
+ Step 36800 | epoch 1/2 | loss=4.8154 | avg=4.7898 | acc=17.2% | lr=7.90e-05 | pos=64
422
+ Step 36900 | epoch 1/2 | loss=3.7366 | avg=4.7910 | acc=23.1% | lr=7.89e-05 | pos=64
423
+ Step 37000 | epoch 1/2 | loss=6.1341 | avg=4.7940 | acc=8.1% | lr=7.87e-05 | pos=64
424
+ Step 37100 | epoch 1/2 | loss=3.4661 | avg=4.7218 | acc=28.1% | lr=7.86e-05 | pos=64
425
+ Step 37200 | epoch 1/2 | loss=5.7530 | avg=4.7379 | acc=7.2% | lr=7.84e-05 | pos=64
426
+ Step 37300 | epoch 1/2 | loss=4.6459 | avg=4.7362 | acc=18.1% | lr=7.82e-05 | pos=64
427
+ Step 37400 | epoch 1/2 | loss=5.7151 | avg=4.7266 | acc=13.8% | lr=7.81e-05 | pos=64
428
+ Step 37500 | epoch 1/2 | loss=5.3537 | avg=4.7269 | acc=17.2% | lr=7.79e-05 | pos=64
429
+ Step 37600 | epoch 1/2 | loss=4.1849 | avg=4.7563 | acc=27.8% | lr=7.77e-05 | pos=64
430
+ Step 37700 | epoch 1/2 | loss=5.0910 | avg=4.7244 | acc=16.9% | lr=7.76e-05 | pos=64
431
+ Step 37800 | epoch 1/2 | loss=4.2421 | avg=4.7285 | acc=26.2% | lr=7.74e-05 | pos=64
432
+ Step 37900 | epoch 1/2 | loss=4.9375 | avg=4.7565 | acc=14.4% | lr=7.72e-05 | pos=64
433
+ Step 38000 | epoch 1/2 | loss=4.2003 | avg=4.7620 | acc=25.6% | lr=7.71e-05 | pos=64
434
+ Step 38100 | epoch 1/2 | loss=4.7036 | avg=4.6738 | acc=20.3% | lr=7.69e-05 | pos=64
435
+ Step 38200 | epoch 1/2 | loss=4.3345 | avg=4.6845 | acc=31.9% | lr=7.68e-05 | pos=64
436
+ Step 38300 | epoch 1/2 | loss=4.1513 | avg=4.7093 | acc=20.3% | lr=7.66e-05 | pos=64
437
+ Step 38400 | epoch 1/2 | loss=5.4202 | avg=4.6723 | acc=9.4% | lr=7.64e-05 | pos=64
438
+ Step 38500 | epoch 1/2 | loss=3.9058 | avg=4.6523 | acc=22.5% | lr=7.62e-05 | pos=64
439
+ Step 38600 | epoch 1/2 | loss=5.6458 | avg=4.6797 | acc=14.4% | lr=7.61e-05 | pos=64
440
+ Step 38700 | epoch 1/2 | loss=6.4054 | avg=4.6650 | acc=14.7% | lr=7.59e-05 | pos=64
441
+ Step 38800 | epoch 1/2 | loss=3.6383 | avg=4.6586 | acc=27.5% | lr=7.57e-05 | pos=64
442
+ Step 38900 | epoch 1/2 | loss=5.0023 | avg=4.6817 | acc=12.5% | lr=7.56e-05 | pos=64
443
+ Step 39000 | epoch 1/2 | loss=5.0706 | avg=4.6753 | acc=15.6% | lr=7.54e-05 | pos=64
444
+ Step 39100 | epoch 1/2 | loss=4.5618 | avg=4.7584 | acc=21.6% | lr=7.52e-05 | pos=64
445
+ Step 39200 | epoch 1/2 | loss=6.4242 | avg=4.7358 | acc=10.3% | lr=7.51e-05 | pos=64
446
+ Step 39300 | epoch 1/2 | loss=5.8426 | avg=4.7251 | acc=15.6% | lr=7.49e-05 | pos=64
447
+ Step 39400 | epoch 1/2 | loss=3.8919 | avg=4.7502 | acc=27.8% | lr=7.47e-05 | pos=64
448
+ Step 39500 | epoch 1/2 | loss=4.7654 | avg=4.7497 | acc=14.4% | lr=7.46e-05 | pos=64
449
+ Step 39600 | epoch 1/2 | loss=5.4378 | avg=4.7354 | acc=14.4% | lr=7.44e-05 | pos=64
450
+ Step 39700 | epoch 1/2 | loss=4.6643 | avg=4.7324 | acc=17.8% | lr=7.42e-05 | pos=64
451
+ Step 39800 | epoch 1/2 | loss=4.2357 | avg=4.7084 | acc=20.9% | lr=7.40e-05 | pos=64
452
+ Step 39900 | epoch 1/2 | loss=3.6500 | avg=4.6791 | acc=25.6% | lr=7.39e-05 | pos=64
453
+ Step 40000 | epoch 1/2 | loss=7.7348 | avg=4.6723 | acc=5.9% | lr=7.37e-05 | pos=64
454
+ [Checkpoint] Saved step 40000 (loss=7.7348) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
455
+ [Save @ step 40000] loss=7.7348
456
+ [Checkpoint] Saved step 40000 (loss=7.7348) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step40000.pt
457
+ [Prune @ step 40000] zeroed 0.0M / 1407.4M (0.0% sparsity)
458
+ Step 40100 | epoch 1/2 | loss=3.5851 | avg=4.5714 | acc=29.4% | lr=7.35e-05 | pos=64
459
+ Step 40200 | epoch 1/2 | loss=4.0951 | avg=4.6517 | acc=19.7% | lr=7.33e-05 | pos=64
460
+ Step 40300 | epoch 1/2 | loss=2.4417 | avg=4.6563 | acc=44.4% | lr=7.32e-05 | pos=64
461
+ Step 40400 | epoch 1/2 | loss=3.8633 | avg=4.6643 | acc=30.6% | lr=7.30e-05 | pos=64
462
+ Step 40500 | epoch 1/2 | loss=3.5076 | avg=4.6672 | acc=24.1% | lr=7.28e-05 | pos=64
463
+ Step 40600 | epoch 1/2 | loss=4.5050 | avg=4.6642 | acc=26.6% | lr=7.26e-05 | pos=64
464
+ Step 40700 | epoch 1/2 | loss=4.4809 | avg=4.6755 | acc=26.2% | lr=7.25e-05 | pos=64
465
+ Step 40800 | epoch 1/2 | loss=4.7991 | avg=4.6569 | acc=18.8% | lr=7.23e-05 | pos=64
466
+ Step 40900 | epoch 1/2 | loss=4.5394 | avg=4.6263 | acc=19.4% | lr=7.21e-05 | pos=64
467
+ Step 41000 | epoch 1/2 | loss=6.2422 | avg=4.6316 | acc=10.6% | lr=7.19e-05 | pos=64
468
+ Step 41100 | epoch 1/2 | loss=4.1456 | avg=4.7179 | acc=23.4% | lr=7.18e-05 | pos=64
469
+ Step 41200 | epoch 1/2 | loss=4.6278 | avg=4.7067 | acc=20.3% | lr=7.16e-05 | pos=64
470
+ Step 41300 | epoch 1/2 | loss=3.8254 | avg=4.7079 | acc=27.2% | lr=7.14e-05 | pos=64
471
+ Step 41400 | epoch 1/2 | loss=4.6589 | avg=4.6779 | acc=15.0% | lr=7.12e-05 | pos=64
472
+ Step 41500 | epoch 1/2 | loss=5.2391 | avg=4.6726 | acc=15.6% | lr=7.11e-05 | pos=64
473
+ Step 41600 | epoch 1/2 | loss=3.8890 | avg=4.6655 | acc=25.0% | lr=7.09e-05 | pos=64
474
+ Step 41700 | epoch 1/2 | loss=4.5508 | avg=4.6545 | acc=24.1% | lr=7.07e-05 | pos=64
475
+ Step 41800 | epoch 1/2 | loss=4.3258 | avg=4.6345 | acc=17.2% | lr=7.05e-05 | pos=64
476
+ Step 41900 | epoch 1/2 | loss=4.4810 | avg=4.6326 | acc=15.3% | lr=7.03e-05 | pos=64
477
+ Step 42000 | epoch 1/2 | loss=5.7353 | avg=4.6313 | acc=9.7% | lr=7.02e-05 | pos=64
478
+ Step 42100 | epoch 1/2 | loss=4.5156 | avg=4.4535 | acc=16.2% | lr=7.00e-05 | pos=64
479
+ Step 42200 | epoch 1/2 | loss=4.3565 | avg=4.4592 | acc=19.1% | lr=6.98e-05 | pos=64
480
+ Step 42300 | epoch 1/2 | loss=3.0806 | avg=4.4856 | acc=30.6% | lr=6.96e-05 | pos=64
481
+ Step 42400 | epoch 1/2 | loss=3.4895 | avg=4.5476 | acc=26.2% | lr=6.94e-05 | pos=64
482
+ Step 42500 | epoch 1/2 | loss=5.0846 | avg=4.5426 | acc=16.6% | lr=6.93e-05 | pos=64
483
+ Step 42600 | epoch 1/2 | loss=4.1276 | avg=4.5449 | acc=20.9% | lr=6.91e-05 | pos=64
484
+ Step 42700 | epoch 1/2 | loss=5.2457 | avg=4.5581 | acc=16.2% | lr=6.89e-05 | pos=64
485
+ Step 42800 | epoch 1/2 | loss=5.6974 | avg=4.5725 | acc=21.9% | lr=6.87e-05 | pos=64
486
+ Step 42900 | epoch 1/2 | loss=5.6322 | avg=4.5719 | acc=10.9% | lr=6.85e-05 | pos=64
487
+ Step 43000 | epoch 1/2 | loss=2.8729 | avg=4.5562 | acc=40.0% | lr=6.84e-05 | pos=64
488
+ Step 43100 | epoch 1/2 | loss=5.8592 | avg=4.4284 | acc=10.6% | lr=6.82e-05 | pos=64
489
+ Step 43200 | epoch 1/2 | loss=4.0402 | avg=4.4922 | acc=23.8% | lr=6.80e-05 | pos=64
490
+ Step 43300 | epoch 1/2 | loss=3.9593 | avg=4.5362 | acc=23.4% | lr=6.78e-05 | pos=64
491
+ Step 43400 | epoch 1/2 | loss=4.9662 | avg=4.4911 | acc=16.2% | lr=6.76e-05 | pos=64
492
+ Step 43500 | epoch 1/2 | loss=4.2632 | avg=4.4986 | acc=24.7% | lr=6.74e-05 | pos=64
493
+ Step 43600 | epoch 1/2 | loss=3.8268 | avg=4.5082 | acc=23.1% | lr=6.73e-05 | pos=64
494
+ Step 43700 | epoch 1/2 | loss=2.9263 | avg=4.5108 | acc=31.6% | lr=6.71e-05 | pos=64
495
+ Step 43800 | epoch 1/2 | loss=4.2181 | avg=4.5107 | acc=25.3% | lr=6.69e-05 | pos=64
496
+ Step 43900 | epoch 1/2 | loss=2.0058 | avg=4.5230 | acc=47.5% | lr=6.67e-05 | pos=64
497
+ Step 44000 | epoch 1/2 | loss=3.8730 | avg=4.5080 | acc=18.1% | lr=6.65e-05 | pos=64
498
+ Step 44100 | epoch 1/2 | loss=3.9568 | avg=4.4506 | acc=26.6% | lr=6.63e-05 | pos=64
499
+ Step 44200 | epoch 1/2 | loss=3.9711 | avg=4.4644 | acc=22.2% | lr=6.62e-05 | pos=64
500
+ Step 44300 | epoch 1/2 | loss=3.6343 | avg=4.4960 | acc=29.7% | lr=6.60e-05 | pos=64
501
+ Step 44400 | epoch 1/2 | loss=3.9971 | avg=4.4870 | acc=28.7% | lr=6.58e-05 | pos=64
502
+ Step 44500 | epoch 1/2 | loss=3.7042 | avg=4.4727 | acc=27.5% | lr=6.56e-05 | pos=64
503
+ Step 44600 | epoch 1/2 | loss=4.0212 | avg=4.4760 | acc=26.2% | lr=6.54e-05 | pos=64
504
+ Step 44700 | epoch 1/2 | loss=3.5272 | avg=4.4681 | acc=28.7% | lr=6.52e-05 | pos=64
505
+ Step 44800 | epoch 1/2 | loss=6.0561 | avg=4.4643 | acc=10.9% | lr=6.50e-05 | pos=64
506
+ Step 44900 | epoch 1/2 | loss=4.6864 | avg=4.4690 | acc=18.8% | lr=6.49e-05 | pos=64
507
+ Step 45000 | epoch 1/2 | loss=7.1617 | avg=4.4785 | acc=10.6% | lr=6.47e-05 | pos=64
508
+ [Checkpoint] Saved step 45000 (loss=7.1617) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
509
+ [Save @ step 45000] loss=7.1617
510
+ [Checkpoint] Saved step 45000 (loss=7.1617) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step45000.pt
511
+ [Prune @ step 45000] zeroed 0.0M / 1407.4M (0.0% sparsity)
512
+ Step 45100 | epoch 1/2 | loss=3.2811 | avg=4.3899 | acc=34.7% | lr=6.45e-05 | pos=64
513
+ Step 45200 | epoch 1/2 | loss=4.6385 | avg=4.5053 | acc=21.6% | lr=6.43e-05 | pos=64
514
+ Step 45300 | epoch 1/2 | loss=3.3678 | avg=4.5298 | acc=24.1% | lr=6.41e-05 | pos=64
515
+ Step 45400 | epoch 1/2 | loss=4.6456 | avg=4.5162 | acc=23.8% | lr=6.39e-05 | pos=64
516
+ Step 45500 | epoch 1/2 | loss=4.6782 | avg=4.4555 | acc=21.9% | lr=6.37e-05 | pos=64
517
+ Step 45600 | epoch 1/2 | loss=5.1122 | avg=4.4485 | acc=16.2% | lr=6.36e-05 | pos=64
518
+ Step 45700 | epoch 1/2 | loss=4.5492 | avg=4.4482 | acc=18.1% | lr=6.34e-05 | pos=64
519
+ Step 45800 | epoch 1/2 | loss=4.5077 | avg=4.4446 | acc=20.0% | lr=6.32e-05 | pos=64
520
+ Step 45900 | epoch 1/2 | loss=5.6127 | avg=4.4549 | acc=13.1% | lr=6.30e-05 | pos=64
521
+ Step 46000 | epoch 1/2 | loss=6.9352 | avg=4.4635 | acc=7.2% | lr=6.28e-05 | pos=64
522
+ Step 46100 | epoch 1/2 | loss=3.1496 | avg=4.4735 | acc=36.9% | lr=6.26e-05 | pos=64
523
+ --- Epoch 1/2 complete (step 46122) ---
524
+ Step 46200 | epoch 2/2 | loss=4.6651 | avg=4.5114 | acc=15.6% | lr=6.24e-05 | pos=64
525
+ Step 46300 | epoch 2/2 | loss=5.0748 | avg=4.5066 | acc=18.1% | lr=6.22e-05 | pos=64
526
+ Step 46400 | epoch 2/2 | loss=5.6306 | avg=4.4797 | acc=15.0% | lr=6.21e-05 | pos=64
527
+ Step 46500 | epoch 2/2 | loss=6.4561 | avg=4.4427 | acc=13.8% | lr=6.19e-05 | pos=64
528
+ Step 46600 | epoch 2/2 | loss=3.1082 | avg=4.3921 | acc=35.6% | lr=6.17e-05 | pos=64
529
+ Step 46700 | epoch 2/2 | loss=2.9550 | avg=4.3908 | acc=35.6% | lr=6.15e-05 | pos=64
530
+ Step 46800 | epoch 2/2 | loss=4.7431 | avg=4.3802 | acc=17.2% | lr=6.13e-05 | pos=64
531
+ Step 46900 | epoch 2/2 | loss=3.6641 | avg=4.3776 | acc=30.0% | lr=6.11e-05 | pos=64
532
+ Step 47000 | epoch 2/2 | loss=4.2720 | avg=4.3821 | acc=23.4% | lr=6.09e-05 | pos=64
533
+ Step 47100 | epoch 2/2 | loss=4.5297 | avg=4.0870 | acc=24.4% | lr=6.07e-05 | pos=64
534
+ Step 47200 | epoch 2/2 | loss=3.6537 | avg=4.2045 | acc=24.7% | lr=6.05e-05 | pos=64
535
+ Step 47300 | epoch 2/2 | loss=3.0951 | avg=4.2407 | acc=30.3% | lr=6.04e-05 | pos=64
536
+ Step 47400 | epoch 2/2 | loss=6.0267 | avg=4.3068 | acc=11.6% | lr=6.02e-05 | pos=64
537
+ Step 47500 | epoch 2/2 | loss=6.2363 | avg=4.3007 | acc=11.9% | lr=6.00e-05 | pos=64
538
+ Step 47600 | epoch 2/2 | loss=3.5052 | avg=4.3228 | acc=30.6% | lr=5.98e-05 | pos=64
539
+ Step 47700 | epoch 2/2 | loss=5.3205 | avg=4.3406 | acc=13.4% | lr=5.96e-05 | pos=64
540
+ Step 47800 | epoch 2/2 | loss=4.6430 | avg=4.3462 | acc=25.0% | lr=5.94e-05 | pos=32
541
+ Step 47900 | epoch 2/2 | loss=5.5823 | avg=4.3542 | acc=16.2% | lr=5.92e-05 | pos=64
542
+ Step 48000 | epoch 2/2 | loss=5.0836 | avg=4.3626 | acc=13.4% | lr=5.90e-05 | pos=64
543
+ Step 48100 | epoch 2/2 | loss=4.8668 | avg=4.1157 | acc=20.6% | lr=5.88e-05 | pos=64
544
+ Step 48200 | epoch 2/2 | loss=3.2015 | avg=4.1707 | acc=34.4% | lr=5.87e-05 | pos=64
545
+ Step 48300 | epoch 2/2 | loss=5.3611 | avg=4.2596 | acc=13.1% | lr=5.85e-05 | pos=64
546
+ Step 48400 | epoch 2/2 | loss=4.4880 | avg=4.3157 | acc=14.1% | lr=5.83e-05 | pos=64
547
+ Step 48500 | epoch 2/2 | loss=5.8044 | avg=4.3075 | acc=12.5% | lr=5.81e-05 | pos=64
548
+ Step 48600 | epoch 2/2 | loss=5.0874 | avg=4.3231 | acc=16.2% | lr=5.79e-05 | pos=64
549
+ Step 48700 | epoch 2/2 | loss=5.7498 | avg=4.3361 | acc=21.9% | lr=5.77e-05 | pos=64
550
+ Step 48800 | epoch 2/2 | loss=4.2398 | avg=4.3508 | acc=17.5% | lr=5.75e-05 | pos=64
551
+ Step 48900 | epoch 2/2 | loss=4.4350 | avg=4.3526 | acc=21.9% | lr=5.73e-05 | pos=64
552
+ Step 49000 | epoch 2/2 | loss=5.5366 | avg=4.3496 | acc=13.1% | lr=5.71e-05 | pos=64
553
+ Step 49100 | epoch 2/2 | loss=4.8387 | avg=4.4597 | acc=22.2% | lr=5.69e-05 | pos=64
554
+ Step 49200 | epoch 2/2 | loss=4.5019 | avg=4.3805 | acc=14.4% | lr=5.68e-05 | pos=64
555
+ Step 49300 | epoch 2/2 | loss=3.1210 | avg=4.3799 | acc=33.1% | lr=5.66e-05 | pos=64
556
+ Step 49400 | epoch 2/2 | loss=6.9753 | avg=4.4128 | acc=10.0% | lr=5.64e-05 | pos=64
557
+ Step 49500 | epoch 2/2 | loss=3.5888 | avg=4.4103 | acc=28.1% | lr=5.62e-05 | pos=64
558
+ Step 49600 | epoch 2/2 | loss=5.8356 | avg=4.3700 | acc=12.2% | lr=5.60e-05 | pos=64
559
+ Step 49700 | epoch 2/2 | loss=5.1198 | avg=4.3594 | acc=18.8% | lr=5.58e-05 | pos=64
560
+ Step 49800 | epoch 2/2 | loss=4.5969 | avg=4.3558 | acc=23.8% | lr=5.56e-05 | pos=64
561
+ Step 49900 | epoch 2/2 | loss=3.4335 | avg=4.3543 | acc=28.4% | lr=5.54e-05 | pos=64
562
+ Step 50000 | epoch 2/2 | loss=4.0635 | avg=4.3603 | acc=22.8% | lr=5.52e-05 | pos=64
563
+ [Checkpoint] Saved step 50000 (loss=4.0635) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
564
+ [Save @ step 50000] loss=4.0635
565
+ [Checkpoint] Saved step 50000 (loss=4.0635) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step50000.pt
566
+ [Prune @ step 50000] zeroed 0.0M / 1407.4M (0.0% sparsity)
567
+ Step 50100 | epoch 2/2 | loss=5.2289 | avg=4.6177 | acc=14.1% | lr=5.50e-05 | pos=64
568
+ Step 50200 | epoch 2/2 | loss=2.9311 | avg=4.4205 | acc=30.0% | lr=5.49e-05 | pos=64
569
+ Step 50300 | epoch 2/2 | loss=3.6592 | avg=4.3510 | acc=20.6% | lr=5.47e-05 | pos=64
570
+ Step 50400 | epoch 2/2 | loss=3.3800 | avg=4.3341 | acc=30.3% | lr=5.45e-05 | pos=64
571
+ Step 50500 | epoch 2/2 | loss=4.5648 | avg=4.3210 | acc=18.8% | lr=5.43e-05 | pos=64
572
+ Step 50600 | epoch 2/2 | loss=2.6815 | avg=4.3175 | acc=40.6% | lr=5.41e-05 | pos=64
573
+ Step 50700 | epoch 2/2 | loss=2.7688 | avg=4.3010 | acc=38.4% | lr=5.39e-05 | pos=64
574
+ Step 50800 | epoch 2/2 | loss=3.5650 | avg=4.2845 | acc=31.6% | lr=5.37e-05 | pos=64
575
+ Step 50900 | epoch 2/2 | loss=4.3807 | avg=4.2764 | acc=22.2% | lr=5.35e-05 | pos=64
576
+ Step 51000 | epoch 2/2 | loss=6.0428 | avg=4.2931 | acc=13.1% | lr=5.33e-05 | pos=64
577
+ Step 51100 | epoch 2/2 | loss=3.4159 | avg=4.4274 | acc=28.1% | lr=5.31e-05 | pos=64
578
+ Step 51200 | epoch 2/2 | loss=3.6057 | avg=4.3325 | acc=27.5% | lr=5.29e-05 | pos=64
579
+ Step 51300 | epoch 2/2 | loss=3.0823 | avg=4.2624 | acc=31.2% | lr=5.28e-05 | pos=64
580
+ Step 51400 | epoch 2/2 | loss=4.4353 | avg=4.2728 | acc=19.3% | lr=5.26e-05 | pos=59
581
+ Step 51500 | epoch 2/2 | loss=3.2198 | avg=4.2668 | acc=30.3% | lr=5.24e-05 | pos=64
582
+ Step 51600 | epoch 2/2 | loss=5.9554 | avg=4.2633 | acc=12.5% | lr=5.22e-05 | pos=64
583
+ Step 51700 | epoch 2/2 | loss=4.5542 | avg=4.2599 | acc=21.6% | lr=5.20e-05 | pos=64
584
+ Step 51800 | epoch 2/2 | loss=2.9085 | avg=4.2548 | acc=38.4% | lr=5.18e-05 | pos=64
585
+ Step 51900 | epoch 2/2 | loss=3.8177 | avg=4.2541 | acc=33.1% | lr=5.16e-05 | pos=26
586
+ Step 52000 | epoch 2/2 | loss=3.5356 | avg=4.2607 | acc=27.5% | lr=5.14e-05 | pos=64
587
+ Step 52100 | epoch 2/2 | loss=3.6337 | avg=4.1506 | acc=33.4% | lr=5.12e-05 | pos=64
588
+ Step 52200 | epoch 2/2 | loss=4.2330 | avg=4.2326 | acc=20.9% | lr=5.10e-05 | pos=64
589
+ Step 52300 | epoch 2/2 | loss=3.7074 | avg=4.1757 | acc=27.8% | lr=5.09e-05 | pos=64
590
+ Step 52400 | epoch 2/2 | loss=2.1662 | avg=4.1801 | acc=47.8% | lr=5.07e-05 | pos=64
591
+ Step 52500 | epoch 2/2 | loss=2.5947 | avg=4.1741 | acc=42.2% | lr=5.05e-05 | pos=64
592
+ Step 52600 | epoch 2/2 | loss=5.7248 | avg=4.1894 | acc=11.2% | lr=5.03e-05 | pos=64
593
+ Step 52700 | epoch 2/2 | loss=2.4033 | avg=4.2100 | acc=45.3% | lr=5.01e-05 | pos=64
594
+ Step 52800 | epoch 2/2 | loss=3.9000 | avg=4.2255 | acc=25.3% | lr=4.99e-05 | pos=64
595
+ Step 52900 | epoch 2/2 | loss=4.7661 | avg=4.2193 | acc=22.2% | lr=4.97e-05 | pos=64
596
+ Step 53000 | epoch 2/2 | loss=5.2609 | avg=4.2174 | acc=17.8% | lr=4.95e-05 | pos=64
597
+ Step 53100 | epoch 2/2 | loss=2.6993 | avg=4.2650 | acc=35.0% | lr=4.93e-05 | pos=64
598
+ Step 53200 | epoch 2/2 | loss=5.0744 | avg=4.2858 | acc=14.4% | lr=4.92e-05 | pos=64
599
+ Step 53300 | epoch 2/2 | loss=5.8003 | avg=4.2816 | acc=17.2% | lr=4.90e-05 | pos=64
600
+ Step 53400 | epoch 2/2 | loss=4.3404 | avg=4.2741 | acc=20.6% | lr=4.88e-05 | pos=64
601
+ Step 53500 | epoch 2/2 | loss=2.6664 | avg=4.2626 | acc=41.9% | lr=4.86e-05 | pos=64
602
+ Step 53600 | epoch 2/2 | loss=4.7678 | avg=4.2828 | acc=18.4% | lr=4.84e-05 | pos=64
603
+ Step 53700 | epoch 2/2 | loss=3.2696 | avg=4.2951 | acc=36.6% | lr=4.82e-05 | pos=64
604
+ Step 53800 | epoch 2/2 | loss=4.6912 | avg=4.2679 | acc=16.6% | lr=4.80e-05 | pos=64
605
+ Step 53900 | epoch 2/2 | loss=3.8017 | avg=4.2719 | acc=29.1% | lr=4.78e-05 | pos=64
606
+ Step 54000 | epoch 2/2 | loss=2.0394 | avg=4.2839 | acc=46.9% | lr=4.76e-05 | pos=64
607
+ Step 54100 | epoch 2/2 | loss=4.9542 | avg=4.0901 | acc=19.4% | lr=4.75e-05 | pos=64
608
+ Step 54200 | epoch 2/2 | loss=3.9687 | avg=4.0914 | acc=26.2% | lr=4.73e-05 | pos=64
609
+ Step 54300 | epoch 2/2 | loss=5.5588 | avg=4.1474 | acc=18.8% | lr=4.71e-05 | pos=64
610
+ Step 54400 | epoch 2/2 | loss=3.4846 | avg=4.1146 | acc=31.2% | lr=4.69e-05 | pos=64
611
+ Step 54500 | epoch 2/2 | loss=3.5327 | avg=4.0975 | acc=26.6% | lr=4.67e-05 | pos=64
612
+ Step 54600 | epoch 2/2 | loss=3.3468 | avg=4.1207 | acc=37.8% | lr=4.65e-05 | pos=64
613
+ Step 54700 | epoch 2/2 | loss=5.5684 | avg=4.1531 | acc=12.5% | lr=4.63e-05 | pos=64
+ Step 54800 | epoch 2/2 | loss=5.1904 | avg=4.1616 | acc=20.9% | lr=4.62e-05 | pos=64
+ Step 54900 | epoch 2/2 | loss=5.1430 | avg=4.1824 | acc=15.6% | lr=4.60e-05 | pos=64
+ Step 55000 | epoch 2/2 | loss=5.3936 | avg=4.1877 | acc=13.8% | lr=4.58e-05 | pos=64
+ [Checkpoint] Saved step 55000 (loss=5.3936) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
+ [Save @ step 55000] loss=5.3936
+ [Checkpoint] Saved step 55000 (loss=5.3936) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step55000.pt
+ [Prune @ step 55000] zeroed 0.0M / 1407.4M (0.0% sparsity)
+ Step 55100 | epoch 2/2 | loss=4.5162 | avg=4.2784 | acc=17.5% | lr=4.56e-05 | pos=64
+ Step 55200 | epoch 2/2 | loss=3.8093 | avg=4.1908 | acc=21.2% | lr=4.54e-05 | pos=64
+ Step 55300 | epoch 2/2 | loss=4.7255 | avg=4.1757 | acc=18.1% | lr=4.52e-05 | pos=64
+ Step 55400 | epoch 2/2 | loss=4.1218 | avg=4.1679 | acc=19.4% | lr=4.50e-05 | pos=64
+ Step 55500 | epoch 2/2 | loss=3.7746 | avg=4.1506 | acc=27.2% | lr=4.48e-05 | pos=64
+ Step 55600 | epoch 2/2 | loss=4.0509 | avg=4.1869 | acc=24.7% | lr=4.47e-05 | pos=64
+ Step 55700 | epoch 2/2 | loss=3.6973 | avg=4.1932 | acc=27.2% | lr=4.45e-05 | pos=64
+ Step 55800 | epoch 2/2 | loss=3.7366 | avg=4.1651 | acc=25.0% | lr=4.43e-05 | pos=64
+ Step 55900 | epoch 2/2 | loss=4.5621 | avg=4.1390 | acc=16.2% | lr=4.41e-05 | pos=64
+ Step 56000 | epoch 2/2 | loss=4.4710 | avg=4.1407 | acc=26.6% | lr=4.39e-05 | pos=64
+ Step 56100 | epoch 2/2 | loss=3.0139 | avg=4.1708 | acc=33.1% | lr=4.37e-05 | pos=64
+ Step 56200 | epoch 2/2 | loss=3.7810 | avg=4.1464 | acc=22.5% | lr=4.36e-05 | pos=64
+ Step 56300 | epoch 2/2 | loss=3.9623 | avg=4.0911 | acc=20.3% | lr=4.34e-05 | pos=64
+ Step 56400 | epoch 2/2 | loss=4.3553 | avg=4.1003 | acc=23.1% | lr=4.32e-05 | pos=64
+ Step 56500 | epoch 2/2 | loss=5.0065 | avg=4.0988 | acc=12.8% | lr=4.30e-05 | pos=64
+ Step 56600 | epoch 2/2 | loss=2.5300 | avg=4.1006 | acc=49.7% | lr=4.28e-05 | pos=64
+ Step 56700 | epoch 2/2 | loss=5.8755 | avg=4.0992 | acc=13.4% | lr=4.26e-05 | pos=64
+ Step 56800 | epoch 2/2 | loss=5.3956 | avg=4.1137 | acc=16.2% | lr=4.25e-05 | pos=64
+ Step 56900 | epoch 2/2 | loss=5.2464 | avg=4.1265 | acc=10.0% | lr=4.23e-05 | pos=64
+ Step 57000 | epoch 2/2 | loss=3.0738 | avg=4.1386 | acc=31.2% | lr=4.21e-05 | pos=64
+ Step 57100 | epoch 2/2 | loss=2.6303 | avg=4.2434 | acc=41.9% | lr=4.19e-05 | pos=64
+ Step 57200 | epoch 2/2 | loss=2.7785 | avg=4.1662 | acc=37.2% | lr=4.17e-05 | pos=64
+ Step 57300 | epoch 2/2 | loss=3.0878 | avg=4.1299 | acc=33.4% | lr=4.15e-05 | pos=64
+ Step 57400 | epoch 2/2 | loss=4.4544 | avg=4.1786 | acc=18.1% | lr=4.14e-05 | pos=64
+ Step 57500 | epoch 2/2 | loss=5.2340 | avg=4.2009 | acc=24.4% | lr=4.12e-05 | pos=64
+ Step 57600 | epoch 2/2 | loss=2.3631 | avg=4.1756 | acc=42.8% | lr=4.10e-05 | pos=64
+ Step 57700 | epoch 2/2 | loss=2.8706 | avg=4.1659 | acc=38.1% | lr=4.08e-05 | pos=64
+ Step 57800 | epoch 2/2 | loss=4.6117 | avg=4.1614 | acc=19.7% | lr=4.06e-05 | pos=64
+ Step 57900 | epoch 2/2 | loss=4.4408 | avg=4.1482 | acc=26.6% | lr=4.05e-05 | pos=64
+ Step 58000 | epoch 2/2 | loss=4.8935 | avg=4.1372 | acc=17.2% | lr=4.03e-05 | pos=64
+ Step 58100 | epoch 2/2 | loss=5.1582 | avg=4.0288 | acc=14.4% | lr=4.01e-05 | pos=64
+ Step 58200 | epoch 2/2 | loss=4.6675 | avg=4.0410 | acc=14.1% | lr=3.99e-05 | pos=64
+ Step 58300 | epoch 2/2 | loss=3.7610 | avg=4.0434 | acc=27.8% | lr=3.97e-05 | pos=64
+ Step 58400 | epoch 2/2 | loss=4.6507 | avg=4.0494 | acc=21.6% | lr=3.96e-05 | pos=64
+ Step 58500 | epoch 2/2 | loss=4.7374 | avg=4.0741 | acc=23.1% | lr=3.94e-05 | pos=64
+ Step 58600 | epoch 2/2 | loss=4.1819 | avg=4.0661 | acc=21.2% | lr=3.92e-05 | pos=64
+ Step 58700 | epoch 2/2 | loss=2.6859 | avg=4.0721 | acc=38.1% | lr=3.90e-05 | pos=64
+ Step 58800 | epoch 2/2 | loss=3.4361 | avg=4.1087 | acc=26.9% | lr=3.88e-05 | pos=64
+ Step 58900 | epoch 2/2 | loss=4.4231 | avg=4.0821 | acc=20.3% | lr=3.87e-05 | pos=64
+ Step 59000 | epoch 2/2 | loss=4.5820 | avg=4.0734 | acc=18.8% | lr=3.85e-05 | pos=64
+ Step 59100 | epoch 2/2 | loss=4.5509 | avg=3.9606 | acc=20.0% | lr=3.83e-05 | pos=64
+ Step 59200 | epoch 2/2 | loss=3.7237 | avg=3.9542 | acc=24.7% | lr=3.81e-05 | pos=64
+ Step 59300 | epoch 2/2 | loss=4.6538 | avg=3.9588 | acc=16.6% | lr=3.80e-05 | pos=64
+ Step 59400 | epoch 2/2 | loss=4.2155 | avg=4.0015 | acc=21.2% | lr=3.78e-05 | pos=64
+ Step 59500 | epoch 2/2 | loss=5.7525 | avg=4.0206 | acc=11.9% | lr=3.76e-05 | pos=64
+ Step 59600 | epoch 2/2 | loss=4.0382 | avg=4.0227 | acc=21.6% | lr=3.74e-05 | pos=64
+ Step 59700 | epoch 2/2 | loss=2.7897 | avg=4.0174 | acc=34.4% | lr=3.73e-05 | pos=64
+ Step 59800 | epoch 2/2 | loss=3.3053 | avg=4.0501 | acc=30.6% | lr=3.71e-05 | pos=64
+ Step 59900 | epoch 2/2 | loss=4.6714 | avg=4.0452 | acc=15.3% | lr=3.69e-05 | pos=64
+ Step 60000 | epoch 2/2 | loss=2.9858 | avg=4.0436 | acc=34.1% | lr=3.67e-05 | pos=64
+ [Checkpoint] Saved step 60000 (loss=2.9858) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
+ [Save @ step 60000] loss=2.9858
+ [Checkpoint] Saved step 60000 (loss=2.9858) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step60000.pt
+ [Prune @ step 60000] zeroed 0.0M / 1407.4M (0.0% sparsity)
+ Step 60100 | epoch 2/2 | loss=3.8330 | avg=3.9239 | acc=25.6% | lr=3.66e-05 | pos=64
+ Step 60200 | epoch 2/2 | loss=3.5669 | avg=4.1022 | acc=27.2% | lr=3.64e-05 | pos=64
+ Step 60300 | epoch 2/2 | loss=3.5239 | avg=4.0814 | acc=30.0% | lr=3.62e-05 | pos=64
+ Step 60400 | epoch 2/2 | loss=3.2373 | avg=4.0942 | acc=32.5% | lr=3.60e-05 | pos=64
+ Step 60500 | epoch 2/2 | loss=2.7187 | avg=4.0444 | acc=34.4% | lr=3.59e-05 | pos=64
+ Step 60600 | epoch 2/2 | loss=4.2308 | avg=4.0627 | acc=22.2% | lr=3.57e-05 | pos=64
+ Step 60700 | epoch 2/2 | loss=1.7200 | avg=4.0798 | acc=56.9% | lr=3.55e-05 | pos=64
+ Step 60800 | epoch 2/2 | loss=2.9003 | avg=4.0900 | acc=35.0% | lr=3.54e-05 | pos=64
+ Step 60900 | epoch 2/2 | loss=5.0054 | avg=4.0693 | acc=20.3% | lr=3.52e-05 | pos=64
+ Step 61000 | epoch 2/2 | loss=4.2459 | avg=4.0705 | acc=22.2% | lr=3.50e-05 | pos=64
+ Step 61100 | epoch 2/2 | loss=4.2594 | avg=4.0604 | acc=24.7% | lr=3.48e-05 | pos=64
+ Step 61200 | epoch 2/2 | loss=3.8607 | avg=4.1537 | acc=23.4% | lr=3.47e-05 | pos=64
+ Step 61300 | epoch 2/2 | loss=2.3646 | avg=4.1256 | acc=45.3% | lr=3.45e-05 | pos=64
+ Step 61400 | epoch 2/2 | loss=3.8619 | avg=4.0834 | acc=25.3% | lr=3.43e-05 | pos=64
+ Step 61500 | epoch 2/2 | loss=3.2487 | avg=4.0629 | acc=37.8% | lr=3.42e-05 | pos=64
+ Step 61600 | epoch 2/2 | loss=4.6425 | avg=4.0680 | acc=15.9% | lr=3.40e-05 | pos=64
+ Step 61700 | epoch 2/2 | loss=4.3301 | avg=4.0565 | acc=18.8% | lr=3.38e-05 | pos=64
+ Step 61800 | epoch 2/2 | loss=6.8729 | avg=4.0740 | acc=11.9% | lr=3.37e-05 | pos=64
+ Step 61900 | epoch 2/2 | loss=4.2032 | avg=4.0915 | acc=19.1% | lr=3.35e-05 | pos=64
+ Step 62000 | epoch 2/2 | loss=3.9328 | avg=4.0745 | acc=29.4% | lr=3.33e-05 | pos=64
+ Step 62100 | epoch 2/2 | loss=4.3181 | avg=4.0363 | acc=16.9% | lr=3.32e-05 | pos=64
+ Step 62200 | epoch 2/2 | loss=3.8276 | avg=4.0184 | acc=21.2% | lr=3.30e-05 | pos=64
+ Step 62300 | epoch 2/2 | loss=6.0614 | avg=4.0637 | acc=10.3% | lr=3.28e-05 | pos=64
+ Step 62400 | epoch 2/2 | loss=3.9388 | avg=4.0939 | acc=19.4% | lr=3.27e-05 | pos=64
+ Step 62500 | epoch 2/2 | loss=4.1195 | avg=4.0725 | acc=20.9% | lr=3.25e-05 | pos=64
+ Step 62600 | epoch 2/2 | loss=3.3876 | avg=4.1031 | acc=25.9% | lr=3.23e-05 | pos=64
+ Step 62700 | epoch 2/2 | loss=3.9373 | avg=4.0890 | acc=26.9% | lr=3.22e-05 | pos=64
+ Step 62800 | epoch 2/2 | loss=2.9918 | avg=4.0637 | acc=36.9% | lr=3.20e-05 | pos=64
+ Step 62900 | epoch 2/2 | loss=2.9810 | avg=4.0758 | acc=34.1% | lr=3.18e-05 | pos=64
+ Step 63000 | epoch 2/2 | loss=2.5209 | avg=4.0705 | acc=40.3% | lr=3.17e-05 | pos=64
+ Step 63100 | epoch 2/2 | loss=4.8396 | avg=3.9990 | acc=15.3% | lr=3.15e-05 | pos=64
+ Step 63200 | epoch 2/2 | loss=3.2383 | avg=3.9918 | acc=30.9% | lr=3.13e-05 | pos=64
+ Step 63300 | epoch 2/2 | loss=3.8425 | avg=3.9815 | acc=31.6% | lr=3.12e-05 | pos=64
+ Step 63400 | epoch 2/2 | loss=5.0435 | avg=3.9736 | acc=16.9% | lr=3.10e-05 | pos=64
+ Step 63500 | epoch 2/2 | loss=3.1421 | avg=3.9933 | acc=33.8% | lr=3.09e-05 | pos=64
+ Step 63600 | epoch 2/2 | loss=3.3418 | avg=3.9799 | acc=29.7% | lr=3.07e-05 | pos=64
+ Step 63700 | epoch 2/2 | loss=4.1966 | avg=3.9901 | acc=21.9% | lr=3.05e-05 | pos=64
+ Step 63800 | epoch 2/2 | loss=4.3271 | avg=3.9870 | acc=22.8% | lr=3.04e-05 | pos=64
+ Step 63900 | epoch 2/2 | loss=4.1081 | avg=3.9595 | acc=22.5% | lr=3.02e-05 | pos=64
+ Step 64000 | epoch 2/2 | loss=3.9388 | avg=3.9641 | acc=28.7% | lr=3.01e-05 | pos=64
+ Step 64100 | epoch 2/2 | loss=2.7076 | avg=4.1071 | acc=44.7% | lr=2.99e-05 | pos=64
+ Step 64200 | epoch 2/2 | loss=3.7021 | avg=4.0373 | acc=26.9% | lr=2.98e-05 | pos=64
+ Step 64300 | epoch 2/2 | loss=3.4879 | avg=4.0371 | acc=31.6% | lr=2.96e-05 | pos=64
+ Step 64400 | epoch 2/2 | loss=2.7751 | avg=4.0397 | acc=35.3% | lr=2.94e-05 | pos=64
+ Step 64500 | epoch 2/2 | loss=5.1389 | avg=4.0244 | acc=14.7% | lr=2.93e-05 | pos=64
+ Step 64600 | epoch 2/2 | loss=5.9862 | avg=4.0349 | acc=11.9% | lr=2.91e-05 | pos=64
+ Step 64700 | epoch 2/2 | loss=4.7760 | avg=4.0451 | acc=15.3% | lr=2.90e-05 | pos=64
+ Step 64800 | epoch 2/2 | loss=2.2713 | avg=4.0212 | acc=48.1% | lr=2.88e-05 | pos=64
+ Step 64900 | epoch 2/2 | loss=2.6343 | avg=3.9889 | acc=40.0% | lr=2.87e-05 | pos=64
+ Step 65000 | epoch 2/2 | loss=4.3901 | avg=3.9958 | acc=19.7% | lr=2.85e-05 | pos=64
+ [Checkpoint] Saved step 65000 (loss=4.3901) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
+ [Save @ step 65000] loss=4.3901
+ [Checkpoint] Saved step 65000 (loss=4.3901) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step65000.pt
+ [Prune @ step 65000] zeroed 0.0M / 1407.4M (0.0% sparsity)
+ Step 65100 | epoch 2/2 | loss=4.3795 | avg=4.1776 | acc=19.1% | lr=2.84e-05 | pos=64
+ Step 65200 | epoch 2/2 | loss=3.1766 | avg=4.0615 | acc=30.0% | lr=2.82e-05 | pos=64
+ Step 65300 | epoch 2/2 | loss=4.0979 | avg=4.1010 | acc=25.9% | lr=2.80e-05 | pos=64
+ Step 65400 | epoch 2/2 | loss=4.3317 | avg=4.0931 | acc=20.0% | lr=2.79e-05 | pos=64
+ Step 65500 | epoch 2/2 | loss=2.4667 | avg=4.0742 | acc=45.0% | lr=2.77e-05 | pos=64
+ Step 65600 | epoch 2/2 | loss=4.8428 | avg=4.0713 | acc=17.8% | lr=2.76e-05 | pos=64
+ Step 65700 | epoch 2/2 | loss=3.8151 | avg=4.0715 | acc=32.8% | lr=2.74e-05 | pos=64
+ Step 65800 | epoch 2/2 | loss=3.7558 | avg=4.0690 | acc=27.8% | lr=2.73e-05 | pos=64
+ Step 65900 | epoch 2/2 | loss=2.2229 | avg=4.0481 | acc=42.8% | lr=2.71e-05 | pos=64
+ Step 66000 | epoch 2/2 | loss=2.9299 | avg=4.0462 | acc=33.4% | lr=2.70e-05 | pos=64
+ Step 66100 | epoch 2/2 | loss=2.6785 | avg=3.9906 | acc=39.4% | lr=2.68e-05 | pos=64
+ Step 66200 | epoch 2/2 | loss=6.0460 | avg=4.0197 | acc=15.3% | lr=2.67e-05 | pos=64
+ Step 66300 | epoch 2/2 | loss=4.0599 | avg=4.0013 | acc=27.5% | lr=2.65e-05 | pos=64
+ Step 66400 | epoch 2/2 | loss=4.4168 | avg=3.9807 | acc=22.2% | lr=2.64e-05 | pos=64
+ Step 66500 | epoch 2/2 | loss=4.7449 | avg=4.0224 | acc=19.1% | lr=2.63e-05 | pos=64
+ Step 66600 | epoch 2/2 | loss=4.7786 | avg=4.0074 | acc=16.6% | lr=2.61e-05 | pos=64
+ Step 66700 | epoch 2/2 | loss=4.5076 | avg=3.9948 | acc=21.9% | lr=2.60e-05 | pos=64
+ Step 66800 | epoch 2/2 | loss=3.3022 | avg=4.0162 | acc=30.9% | lr=2.58e-05 | pos=64
+ Step 66900 | epoch 2/2 | loss=4.1388 | avg=4.0111 | acc=21.9% | lr=2.57e-05 | pos=64
+ Step 67000 | epoch 2/2 | loss=2.4938 | avg=4.0179 | acc=42.5% | lr=2.55e-05 | pos=64
+ Step 67100 | epoch 2/2 | loss=3.3502 | avg=4.0698 | acc=36.2% | lr=2.54e-05 | pos=64
+ Step 67200 | epoch 2/2 | loss=2.8992 | avg=3.9731 | acc=34.4% | lr=2.52e-05 | pos=64
+ Step 67300 | epoch 2/2 | loss=3.8375 | avg=4.0487 | acc=21.9% | lr=2.51e-05 | pos=64
+ Step 67400 | epoch 2/2 | loss=5.3267 | avg=4.0175 | acc=17.8% | lr=2.50e-05 | pos=64
+ Step 67500 | epoch 2/2 | loss=3.4675 | avg=3.9887 | acc=28.7% | lr=2.48e-05 | pos=64
+ Step 67600 | epoch 2/2 | loss=3.7583 | avg=4.0100 | acc=26.2% | lr=2.47e-05 | pos=64
+ Step 67700 | epoch 2/2 | loss=3.8188 | avg=3.9977 | acc=26.2% | lr=2.45e-05 | pos=64
+ Step 67800 | epoch 2/2 | loss=2.5829 | avg=3.9890 | acc=39.7% | lr=2.44e-05 | pos=64
+ Step 67900 | epoch 2/2 | loss=5.0292 | avg=3.9850 | acc=22.2% | lr=2.43e-05 | pos=64
+ Step 68000 | epoch 2/2 | loss=3.7859 | avg=3.9706 | acc=28.7% | lr=2.41e-05 | pos=64
+ Step 68100 | epoch 2/2 | loss=5.1101 | avg=3.9497 | acc=13.4% | lr=2.40e-05 | pos=64
+ Step 68200 | epoch 2/2 | loss=4.4756 | avg=3.9722 | acc=22.5% | lr=2.38e-05 | pos=64
+ Step 68300 | epoch 2/2 | loss=4.1080 | avg=4.0035 | acc=25.9% | lr=2.37e-05 | pos=64
+ Step 68400 | epoch 2/2 | loss=2.8236 | avg=3.9939 | acc=38.8% | lr=2.36e-05 | pos=64
+ Step 68500 | epoch 2/2 | loss=3.0124 | avg=3.9966 | acc=34.1% | lr=2.34e-05 | pos=64
+ Step 68600 | epoch 2/2 | loss=3.7765 | avg=3.9815 | acc=32.5% | lr=2.33e-05 | pos=64
+ Step 68700 | epoch 2/2 | loss=3.0217 | avg=3.9884 | acc=38.1% | lr=2.32e-05 | pos=64
+ Step 68800 | epoch 2/2 | loss=3.5438 | avg=3.9941 | acc=26.2% | lr=2.30e-05 | pos=64
+ Step 68900 | epoch 2/2 | loss=3.8703 | avg=3.9851 | acc=30.6% | lr=2.29e-05 | pos=64
+ Step 69000 | epoch 2/2 | loss=3.6413 | avg=3.9808 | acc=26.2% | lr=2.28e-05 | pos=64
+ Step 69100 | epoch 2/2 | loss=5.3341 | avg=3.8165 | acc=14.1% | lr=2.26e-05 | pos=64
+ Step 69200 | epoch 2/2 | loss=4.8652 | avg=3.8429 | acc=16.9% | lr=2.25e-05 | pos=64
+ Step 69300 | epoch 2/2 | loss=4.3434 | avg=3.8633 | acc=18.8% | lr=2.24e-05 | pos=64
+ Step 69400 | epoch 2/2 | loss=4.7815 | avg=3.8867 | acc=19.7% | lr=2.22e-05 | pos=64
+ Step 69500 | epoch 2/2 | loss=3.6497 | avg=3.8812 | acc=26.6% | lr=2.21e-05 | pos=64
+ Step 69600 | epoch 2/2 | loss=4.9780 | avg=3.8984 | acc=11.9% | lr=2.20e-05 | pos=64
+ Step 69700 | epoch 2/2 | loss=4.0833 | avg=3.9129 | acc=23.1% | lr=2.18e-05 | pos=64
+ Step 69800 | epoch 2/2 | loss=3.6186 | avg=3.9157 | acc=29.1% | lr=2.17e-05 | pos=64
+ Step 69900 | epoch 2/2 | loss=5.5480 | avg=3.9107 | acc=17.2% | lr=2.16e-05 | pos=64
+ Step 70000 | epoch 2/2 | loss=3.7976 | avg=3.9249 | acc=26.6% | lr=2.15e-05 | pos=64
+ [Checkpoint] Saved step 70000 (loss=3.7976) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
+ [Save @ step 70000] loss=3.7976
+ [Checkpoint] Saved step 70000 (loss=3.7976) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step70000.pt
+ [Prune @ step 70000] zeroed 0.0M / 1407.4M (0.0% sparsity)
+ Step 70100 | epoch 2/2 | loss=3.8177 | avg=3.9559 | acc=35.0% | lr=2.13e-05 | pos=64
+ Step 70200 | epoch 2/2 | loss=3.2713 | avg=3.9842 | acc=33.1% | lr=2.12e-05 | pos=64
+ Step 70300 | epoch 2/2 | loss=4.4606 | avg=4.0190 | acc=20.9% | lr=2.11e-05 | pos=64
+ Step 70400 | epoch 2/2 | loss=3.0428 | avg=3.9660 | acc=36.9% | lr=2.10e-05 | pos=64
+ Step 70500 | epoch 2/2 | loss=3.2894 | avg=3.9720 | acc=36.6% | lr=2.08e-05 | pos=64
+ Step 70600 | epoch 2/2 | loss=3.7879 | avg=3.9610 | acc=24.4% | lr=2.07e-05 | pos=64
+ Step 70700 | epoch 2/2 | loss=4.1509 | avg=3.9827 | acc=18.8% | lr=2.06e-05 | pos=64
+ Step 70800 | epoch 2/2 | loss=4.3644 | avg=3.9772 | acc=21.2% | lr=2.05e-05 | pos=64
+ Step 70900 | epoch 2/2 | loss=5.7064 | avg=3.9717 | acc=14.7% | lr=2.03e-05 | pos=64
+ Step 71000 | epoch 2/2 | loss=4.7662 | avg=3.9739 | acc=13.8% | lr=2.02e-05 | pos=64
+ Step 71100 | epoch 2/2 | loss=5.3511 | avg=3.8746 | acc=9.1% | lr=2.01e-05 | pos=64
+ Step 71200 | epoch 2/2 | loss=3.9317 | avg=3.9620 | acc=21.2% | lr=2.00e-05 | pos=64
+ Step 71300 | epoch 2/2 | loss=4.2127 | avg=3.9650 | acc=24.1% | lr=1.99e-05 | pos=64
+ Step 71400 | epoch 2/2 | loss=2.6757 | avg=3.9601 | acc=37.5% | lr=1.97e-05 | pos=64
+ Step 71500 | epoch 2/2 | loss=2.9107 | avg=3.9780 | acc=36.6% | lr=1.96e-05 | pos=64
+ Step 71600 | epoch 2/2 | loss=4.7727 | avg=3.9611 | acc=19.1% | lr=1.95e-05 | pos=64
+ Step 71700 | epoch 2/2 | loss=5.1722 | avg=3.9744 | acc=23.1% | lr=1.94e-05 | pos=64
+ Step 71800 | epoch 2/2 | loss=3.2547 | avg=3.9943 | acc=36.2% | lr=1.93e-05 | pos=64
+ Step 71900 | epoch 2/2 | loss=7.5842 | avg=3.9922 | acc=9.1% | lr=1.92e-05 | pos=64
+ Step 72000 | epoch 2/2 | loss=3.9415 | avg=3.9826 | acc=21.9% | lr=1.90e-05 | pos=64
+ Step 72100 | epoch 2/2 | loss=5.2223 | avg=4.1068 | acc=13.8% | lr=1.89e-05 | pos=64
+ Step 72200 | epoch 2/2 | loss=4.7449 | avg=3.9670 | acc=18.4% | lr=1.88e-05 | pos=64
+ Step 72300 | epoch 2/2 | loss=3.4318 | avg=3.9629 | acc=34.7% | lr=1.87e-05 | pos=64
+ Step 72400 | epoch 2/2 | loss=4.4708 | avg=4.0072 | acc=18.1% | lr=1.86e-05 | pos=64
+ Step 72500 | epoch 2/2 | loss=3.6306 | avg=3.9713 | acc=22.5% | lr=1.85e-05 | pos=64
+ Step 72600 | epoch 2/2 | loss=4.0440 | avg=3.9635 | acc=25.9% | lr=1.84e-05 | pos=64
+ Step 72700 | epoch 2/2 | loss=5.3495 | avg=3.9759 | acc=16.6% | lr=1.83e-05 | pos=64
+ Step 72800 | epoch 2/2 | loss=4.3048 | avg=3.9836 | acc=21.2% | lr=1.81e-05 | pos=64
+ Step 72900 | epoch 2/2 | loss=4.7384 | avg=3.9723 | acc=18.8% | lr=1.80e-05 | pos=64
+ Step 73000 | epoch 2/2 | loss=3.5557 | avg=3.9727 | acc=28.1% | lr=1.79e-05 | pos=64
+ Step 73100 | epoch 2/2 | loss=4.0367 | avg=4.0304 | acc=22.8% | lr=1.78e-05 | pos=64
+ Step 73200 | epoch 2/2 | loss=5.0796 | avg=3.9429 | acc=14.7% | lr=1.77e-05 | pos=64
+ Step 73300 | epoch 2/2 | loss=4.5306 | avg=3.9430 | acc=17.8% | lr=1.76e-05 | pos=64
+ Step 73400 | epoch 2/2 | loss=5.6048 | avg=3.9936 | acc=11.9% | lr=1.75e-05 | pos=64
+ Step 73500 | epoch 2/2 | loss=4.9194 | avg=3.9850 | acc=16.2% | lr=1.74e-05 | pos=64
+ Step 73600 | epoch 2/2 | loss=2.6192 | avg=4.0012 | acc=36.9% | lr=1.73e-05 | pos=64
+ Step 73700 | epoch 2/2 | loss=3.2044 | avg=3.9696 | acc=27.8% | lr=1.72e-05 | pos=64
+ Step 73800 | epoch 2/2 | loss=3.2734 | avg=3.9767 | acc=30.6% | lr=1.71e-05 | pos=64
+ Step 73900 | epoch 2/2 | loss=2.4732 | avg=3.9814 | acc=40.3% | lr=1.70e-05 | pos=64
+ Step 74000 | epoch 2/2 | loss=2.4827 | avg=3.9739 | acc=42.5% | lr=1.69e-05 | pos=64
+ Step 74100 | epoch 2/2 | loss=3.3299 | avg=3.8277 | acc=30.0% | lr=1.68e-05 | pos=64
+ Step 74200 | epoch 2/2 | loss=4.2136 | avg=3.8443 | acc=19.7% | lr=1.67e-05 | pos=64
+ Step 74300 | epoch 2/2 | loss=3.9210 | avg=3.9573 | acc=20.0% | lr=1.66e-05 | pos=64
+ Step 74400 | epoch 2/2 | loss=5.3194 | avg=3.9435 | acc=12.8% | lr=1.65e-05 | pos=64
+ Step 74500 | epoch 2/2 | loss=5.1844 | avg=3.9397 | acc=13.4% | lr=1.64e-05 | pos=64
+ Step 74600 | epoch 2/2 | loss=5.9062 | avg=3.9262 | acc=16.6% | lr=1.63e-05 | pos=64
+ Step 74700 | epoch 2/2 | loss=4.4060 | avg=3.9274 | acc=20.0% | lr=1.62e-05 | pos=64
+ Step 74800 | epoch 2/2 | loss=4.1646 | avg=3.9522 | acc=28.1% | lr=1.61e-05 | pos=64
+ Step 74900 | epoch 2/2 | loss=4.6975 | avg=3.9439 | acc=17.5% | lr=1.60e-05 | pos=64
+ Step 75000 | epoch 2/2 | loss=3.8099 | avg=3.9236 | acc=29.4% | lr=1.59e-05 | pos=64
+ [Checkpoint] Saved step 75000 (loss=3.8099) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
+ [Save @ step 75000] loss=3.8099
+ [Checkpoint] Saved step 75000 (loss=3.8099) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step75000.pt
+ [Prune @ step 75000] zeroed 0.0M / 1407.4M (0.0% sparsity)
+ Step 75100 | epoch 2/2 | loss=3.1919 | avg=3.8885 | acc=35.0% | lr=1.58e-05 | pos=64
+ Step 75200 | epoch 2/2 | loss=3.0743 | avg=3.9323 | acc=30.0% | lr=1.57e-05 | pos=64
+ Step 75300 | epoch 2/2 | loss=3.8457 | avg=3.9156 | acc=28.7% | lr=1.56e-05 | pos=64
+ Step 75400 | epoch 2/2 | loss=2.7368 | avg=3.8854 | acc=34.4% | lr=1.55e-05 | pos=64
+ Step 75500 | epoch 2/2 | loss=3.8707 | avg=3.8896 | acc=21.6% | lr=1.54e-05 | pos=64
+ Step 75600 | epoch 2/2 | loss=4.2949 | avg=3.9212 | acc=22.2% | lr=1.54e-05 | pos=64
+ Step 75700 | epoch 2/2 | loss=3.7626 | avg=3.9296 | acc=27.2% | lr=1.53e-05 | pos=64
+ Step 75800 | epoch 2/2 | loss=3.9824 | avg=3.9417 | acc=27.5% | lr=1.52e-05 | pos=64
+ Step 75900 | epoch 2/2 | loss=4.4798 | avg=3.9416 | acc=18.4% | lr=1.51e-05 | pos=64
+ Step 76000 | epoch 2/2 | loss=4.6770 | avg=3.9619 | acc=17.2% | lr=1.50e-05 | pos=64
+ Step 76100 | epoch 2/2 | loss=3.6232 | avg=4.0577 | acc=31.9% | lr=1.49e-05 | pos=64
+ Step 76200 | epoch 2/2 | loss=5.7295 | avg=3.9855 | acc=11.9% | lr=1.48e-05 | pos=64
+ Step 76300 | epoch 2/2 | loss=4.0309 | avg=3.9120 | acc=18.8% | lr=1.47e-05 | pos=64
+ Step 76400 | epoch 2/2 | loss=5.7989 | avg=3.8953 | acc=17.5% | lr=1.47e-05 | pos=64
+ Step 76500 | epoch 2/2 | loss=3.2187 | avg=3.8625 | acc=37.8% | lr=1.46e-05 | pos=64
+ Step 76600 | epoch 2/2 | loss=3.6841 | avg=3.8674 | acc=26.2% | lr=1.45e-05 | pos=64
+ Step 76700 | epoch 2/2 | loss=2.7946 | avg=3.8753 | acc=32.2% | lr=1.44e-05 | pos=64
+ Step 76800 | epoch 2/2 | loss=2.9086 | avg=3.8620 | acc=34.4% | lr=1.43e-05 | pos=64
+ Step 76900 | epoch 2/2 | loss=3.3638 | avg=3.8633 | acc=26.9% | lr=1.42e-05 | pos=64
+ Step 77000 | epoch 2/2 | loss=2.6785 | avg=3.8562 | acc=33.4% | lr=1.42e-05 | pos=64
+ Step 77100 | epoch 2/2 | loss=2.9324 | avg=3.9094 | acc=37.5% | lr=1.41e-05 | pos=64
+ Step 77200 | epoch 2/2 | loss=4.9674 | avg=3.9219 | acc=20.6% | lr=1.40e-05 | pos=64
+ Step 77300 | epoch 2/2 | loss=4.9176 | avg=3.9506 | acc=15.6% | lr=1.39e-05 | pos=64
+ Step 77400 | epoch 2/2 | loss=3.9582 | avg=3.9039 | acc=29.4% | lr=1.38e-05 | pos=64
+ Step 77500 | epoch 2/2 | loss=5.7702 | avg=3.9179 | acc=12.8% | lr=1.38e-05 | pos=64
+ Step 77600 | epoch 2/2 | loss=5.2998 | avg=3.9348 | acc=17.8% | lr=1.37e-05 | pos=64
+ Step 77700 | epoch 2/2 | loss=3.3116 | avg=3.9036 | acc=31.9% | lr=1.36e-05 | pos=64
+ Step 77800 | epoch 2/2 | loss=2.3984 | avg=3.9114 | acc=43.1% | lr=1.35e-05 | pos=64
+ Step 77900 | epoch 2/2 | loss=5.5713 | avg=3.9239 | acc=14.1% | lr=1.35e-05 | pos=64
+ Step 78000 | epoch 2/2 | loss=3.8396 | avg=3.9350 | acc=21.2% | lr=1.34e-05 | pos=64
+ Step 78100 | epoch 2/2 | loss=4.2882 | avg=3.9720 | acc=19.4% | lr=1.33e-05 | pos=64
+ Step 78200 | epoch 2/2 | loss=4.6613 | avg=3.9485 | acc=18.1% | lr=1.33e-05 | pos=64
+ Step 78300 | epoch 2/2 | loss=5.1955 | avg=3.9910 | acc=18.4% | lr=1.32e-05 | pos=64
+ Step 78400 | epoch 2/2 | loss=4.9748 | avg=3.9656 | acc=25.0% | lr=1.31e-05 | pos=64
+ Step 78500 | epoch 2/2 | loss=2.4309 | avg=3.9518 | acc=42.5% | lr=1.30e-05 | pos=64
+ Step 78600 | epoch 2/2 | loss=4.3748 | avg=3.9567 | acc=16.9% | lr=1.30e-05 | pos=64
+ Step 78700 | epoch 2/2 | loss=4.7788 | avg=3.9485 | acc=19.7% | lr=1.29e-05 | pos=64
+ Step 78800 | epoch 2/2 | loss=3.7079 | avg=3.9602 | acc=24.4% | lr=1.28e-05 | pos=64
+ Step 78900 | epoch 2/2 | loss=3.0928 | avg=3.9682 | acc=36.2% | lr=1.28e-05 | pos=64
+ Step 79000 | epoch 2/2 | loss=4.6580 | avg=3.9696 | acc=24.4% | lr=1.27e-05 | pos=64
+ Step 79100 | epoch 2/2 | loss=4.3921 | avg=3.8872 | acc=17.5% | lr=1.26e-05 | pos=64
+ Step 79200 | epoch 2/2 | loss=3.5990 | avg=3.8944 | acc=32.8% | lr=1.26e-05 | pos=64
+ Step 79300 | epoch 2/2 | loss=4.0135 | avg=3.9332 | acc=27.2% | lr=1.25e-05 | pos=64
+ Step 79400 | epoch 2/2 | loss=2.0610 | avg=3.9130 | acc=47.2% | lr=1.25e-05 | pos=64
+ Step 79500 | epoch 2/2 | loss=3.4324 | avg=3.9120 | acc=31.6% | lr=1.24e-05 | pos=64
+ Step 79600 | epoch 2/2 | loss=3.4936 | avg=3.9407 | acc=28.7% | lr=1.23e-05 | pos=64
+ Step 79700 | epoch 2/2 | loss=5.2379 | avg=3.9439 | acc=14.1% | lr=1.23e-05 | pos=64
+ Step 79800 | epoch 2/2 | loss=3.8758 | avg=3.9491 | acc=22.5% | lr=1.22e-05 | pos=64
+ Step 79900 | epoch 2/2 | loss=4.3745 | avg=3.9628 | acc=20.0% | lr=1.22e-05 | pos=64
+ Step 80000 | epoch 2/2 | loss=4.3228 | avg=3.9622 | acc=19.1% | lr=1.21e-05 | pos=64
+ [Checkpoint] Saved step 80000 (loss=4.3228) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
+ [Save @ step 80000] loss=4.3228
+ [Checkpoint] Saved step 80000 (loss=4.3228) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step80000.pt
+ [Prune @ step 80000] zeroed 0.0M / 1407.4M (0.0% sparsity)
+ Step 80100 | epoch 2/2 | loss=4.2464 | avg=3.8153 | acc=25.6% | lr=1.20e-05 | pos=64
+ Step 80200 | epoch 2/2 | loss=4.7706 | avg=3.8690 | acc=17.2% | lr=1.20e-05 | pos=64
+ Step 80300 | epoch 2/2 | loss=4.3177 | avg=3.8873 | acc=18.1% | lr=1.19e-05 | pos=64
+ Step 80400 | epoch 2/2 | loss=3.3786 | avg=3.9199 | acc=31.2% | lr=1.19e-05 | pos=64
+ Step 80500 | epoch 2/2 | loss=5.1081 | avg=3.9491 | acc=14.7% | lr=1.18e-05 | pos=64
+ Step 80600 | epoch 2/2 | loss=3.8872 | avg=3.9412 | acc=26.9% | lr=1.18e-05 | pos=64
+ Step 80700 | epoch 2/2 | loss=4.5615 | avg=3.9324 | acc=23.4% | lr=1.17e-05 | pos=64
+ Step 80800 | epoch 2/2 | loss=2.6409 | avg=3.9387 | acc=38.4% | lr=1.17e-05 | pos=64
+ Step 80900 | epoch 2/2 | loss=4.2988 | avg=3.9430 | acc=24.1% | lr=1.16e-05 | pos=64
+ Step 81000 | epoch 2/2 | loss=3.1301 | avg=3.9302 | acc=31.9% | lr=1.16e-05 | pos=64
+ Step 81100 | epoch 2/2 | loss=2.5651 | avg=4.2526 | acc=37.5% | lr=1.15e-05 | pos=64
+ Step 81200 | epoch 2/2 | loss=3.5898 | avg=4.0857 | acc=28.1% | lr=1.15e-05 | pos=64
+ Step 81300 | epoch 2/2 | loss=3.5881 | avg=4.0118 | acc=25.3% | lr=1.14e-05 | pos=64
+ Step 81400 | epoch 2/2 | loss=4.5377 | avg=4.0113 | acc=14.1% | lr=1.14e-05 | pos=64
+ Step 81500 | epoch 2/2 | loss=4.6724 | avg=4.0048 | acc=20.3% | lr=1.13e-05 | pos=64
+ Step 81600 | epoch 2/2 | loss=4.7214 | avg=4.0031 | acc=15.6% | lr=1.13e-05 | pos=64
+ Step 81700 | epoch 2/2 | loss=4.3170 | avg=3.9981 | acc=20.9% | lr=1.12e-05 | pos=64
+ Step 81800 | epoch 2/2 | loss=4.8377 | avg=3.9901 | acc=20.6% | lr=1.12e-05 | pos=64
+ Step 81900 | epoch 2/2 | loss=3.1756 | avg=3.9930 | acc=32.8% | lr=1.11e-05 | pos=64
+ Step 82000 | epoch 2/2 | loss=4.3843 | avg=3.9915 | acc=26.2% | lr=1.11e-05 | pos=64
+ Step 82100 | epoch 2/2 | loss=3.1136 | avg=3.8791 | acc=36.9% | lr=1.11e-05 | pos=64
+ Step 82200 | epoch 2/2 | loss=4.1817 | avg=3.8888 | acc=20.3% | lr=1.10e-05 | pos=64
+ Step 82300 | epoch 2/2 | loss=4.8156 | avg=3.8795 | acc=16.6% | lr=1.10e-05 | pos=64
+ Step 82400 | epoch 2/2 | loss=3.4888 | avg=3.8815 | acc=27.5% | lr=1.09e-05 | pos=64
+ Step 82500 | epoch 2/2 | loss=3.1086 | avg=3.9286 | acc=35.3% | lr=1.09e-05 | pos=64
+ Step 82600 | epoch 2/2 | loss=4.4417 | avg=3.9238 | acc=16.2% | lr=1.09e-05 | pos=64
+ Step 82700 | epoch 2/2 | loss=2.7371 | avg=3.9209 | acc=37.8% | lr=1.08e-05 | pos=64
+ Step 82800 | epoch 2/2 | loss=2.6301 | avg=3.9065 | acc=36.9% | lr=1.08e-05 | pos=64
+ Step 82900 | epoch 2/2 | loss=2.8479 | avg=3.9139 | acc=33.8% | lr=1.08e-05 | pos=64
+ Step 83000 | epoch 2/2 | loss=4.6181 | avg=3.9240 | acc=19.7% | lr=1.07e-05 | pos=64
+ Step 83100 | epoch 2/2 | loss=4.5272 | avg=3.8913 | acc=16.9% | lr=1.07e-05 | pos=64
+ Step 83200 | epoch 2/2 | loss=5.2323 | avg=3.9357 | acc=18.1% | lr=1.07e-05 | pos=64
+ Step 83300 | epoch 2/2 | loss=3.5506 | avg=3.9637 | acc=31.6% | lr=1.06e-05 | pos=64
+ Step 83400 | epoch 2/2 | loss=4.3895 | avg=3.9204 | acc=20.6% | lr=1.06e-05 | pos=64
+ Step 83500 | epoch 2/2 | loss=5.3779 | avg=3.9227 | acc=10.6% | lr=1.06e-05 | pos=64
+ Step 83600 | epoch 2/2 | loss=3.8845 | avg=3.9226 | acc=22.5% | lr=1.05e-05 | pos=64
+ Step 83700 | epoch 2/2 | loss=3.4041 | avg=3.9086 | acc=33.8% | lr=1.05e-05 | pos=64
+ Step 83800 | epoch 2/2 | loss=5.1687 | avg=3.9190 | acc=11.9% | lr=1.05e-05 | pos=64
+ Step 83900 | epoch 2/2 | loss=3.3404 | avg=3.9041 | acc=31.9% | lr=1.04e-05 | pos=64
+ Step 84000 | epoch 2/2 | loss=3.6246 | avg=3.9078 | acc=27.8% | lr=1.04e-05 | pos=64
+ Step 84100 | epoch 2/2 | loss=3.4813 | avg=3.9357 | acc=29.7% | lr=1.04e-05 | pos=64
+ Step 84200 | epoch 2/2 | loss=4.8564 | avg=3.9044 | acc=13.8% | lr=1.04e-05 | pos=64
+ Step 84300 | epoch 2/2 | loss=2.3271 | avg=3.9418 | acc=46.9% | lr=1.03e-05 | pos=64
+ Step 84400 | epoch 2/2 | loss=2.5052 | avg=3.9461 | acc=43.4% | lr=1.03e-05 | pos=64
+ Step 84500 | epoch 2/2 | loss=2.7013 | avg=3.9402 | acc=41.9% | lr=1.03e-05 | pos=64
+ Step 84600 | epoch 2/2 | loss=4.3838 | avg=3.9605 | acc=23.8% | lr=1.03e-05 | pos=64
+ Step 84700 | epoch 2/2 | loss=3.0412 | avg=3.9686 | acc=32.5% | lr=1.03e-05 | pos=64
+ Step 84800 | epoch 2/2 | loss=3.4061 | avg=3.9932 | acc=29.4% | lr=1.02e-05 | pos=64
+ Step 84900 | epoch 2/2 | loss=3.2795 | avg=3.9865 | acc=27.5% | lr=1.02e-05 | pos=64
+ Step 85000 | epoch 2/2 | loss=4.0844 | avg=3.9457 | acc=21.9% | lr=1.02e-05 | pos=64
+ [Checkpoint] Saved step 85000 (loss=4.0844) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt
+ [Save @ step 85000] loss=4.0844
+ [Checkpoint] Saved step 85000 (loss=4.0844) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_step85000.pt
+ [Prune @ step 85000] zeroed 0.0M / 1407.4M (0.0% sparsity)
+ Step 85100 | epoch 2/2 | loss=4.7806 | avg=3.9157 | acc=15.3% | lr=1.02e-05 | pos=64
+ Step 85200 | epoch 2/2 | loss=4.5346 | avg=3.9292 | acc=18.8% | lr=1.02e-05 | pos=64
+ Step 85300 | epoch 2/2 | loss=5.6210 | avg=3.9368 | acc=13.4% | lr=1.02e-05 | pos=64
+ Step 85400 | epoch 2/2 | loss=4.9908 | avg=3.9581 | acc=14.1% | lr=1.01e-05 | pos=64
+ Step 85500 | epoch 2/2 | loss=3.1967 | avg=3.9444 | acc=29.1% | lr=1.01e-05 | pos=64
+ Step 85600 | epoch 2/2 | loss=3.4585 | avg=3.9472 | acc=31.9% | lr=1.01e-05 | pos=64
+ Step 85700 | epoch 2/2 | loss=4.3401 | avg=3.9435 | acc=16.6% | lr=1.01e-05 | pos=64
+ Step 85800 | epoch 2/2 | loss=2.7186 | avg=3.9475 | acc=38.4% | lr=1.01e-05 | pos=64
+ Step 85900 | epoch 2/2 | loss=3.3888 | avg=3.9339 | acc=22.5% | lr=1.01e-05 | pos=64
+ Step 86000 | epoch 2/2 | loss=2.7529 | avg=3.9336 | acc=36.2% | lr=1.01e-05 | pos=64
+ Step 86100 | epoch 2/2 | loss=4.3572 | avg=4.0375 | acc=13.4% | lr=1.01e-05 | pos=64
+ Step 86200 | epoch 2/2 | loss=3.2749 | avg=3.9801 | acc=32.2% | lr=1.00e-05 | pos=64
+ Step 86300 | epoch 2/2 | loss=4.6296 | avg=3.9571 | acc=18.8% | lr=1.00e-05 | pos=64
+ Step 86400 | epoch 2/2 | loss=3.8504 | avg=3.9544 | acc=25.9% | lr=1.00e-05 | pos=64
+ Step 86500 | epoch 2/2 | loss=4.1305 | avg=3.9431 | acc=20.6% | lr=1.00e-05 | pos=64
+ Step 86600 | epoch 2/2 | loss=4.0122 | avg=3.9176 | acc=17.8% | lr=1.00e-05 | pos=64
+ Step 86700 | epoch 2/2 | loss=2.8261 | avg=3.9193 | acc=38.1% | lr=1.00e-05 | pos=64
+ Step 86800 | epoch 2/2 | loss=2.0856 | avg=3.9150 | acc=44.4% | lr=1.00e-05 | pos=64
+ Step 86900 | epoch 2/2 | loss=4.3141 | avg=3.9056 | acc=20.6% | lr=1.00e-05 | pos=64
+ Step 87000 | epoch 2/2 | loss=3.7628 | avg=3.9158 | acc=26.2% | lr=1.00e-05 | pos=64
+ Step 87100 | epoch 2/2 | loss=4.9560 | avg=3.8938 | acc=15.9% | lr=1.00e-05 | pos=64
+ Step 87200 | epoch 2/2 | loss=2.9699 | avg=3.9049 | acc=35.0% | lr=1.00e-05 | pos=64
+ --- Epoch 2/2 complete (step 87244) ---
+ [Checkpoint] Saved step 87244 (loss=3.7095) → /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_final.pt
+
+ ============================================================
+ TRAINING COMPLETE (--no_eval, run benchmark separately)
+ ============================================================
+ Training complete. Best: /run/media/echo/Echo/ECHO/training/Prototype Fireecho/tool/kernel/FireEcho Engine/eagle_checkpoints/eagle_best.pt