Optimization Log
Goal
Achieve < 1000 cycles on the VLIW SIMD Kernel. Starting Baseline: ~147,734 cycles (Scalar). Reference Best: < 1363 cycles (Claude Opus 4.5 Improved).
Optimization Methods (Comprehensive List)
- Vectorization (SIMD): Utilizing `valu`, `vload`, `vstore` to process 8 items per instruction.
- Instruction Level Parallelism (ILP): Filling all VLIW slots (`alu` x12, `valu` x6, `load` x2) per cycle.
- Strength Reduction / Algebraic Simplification: Replacing expensive op sequences (e.g., `add+shift+add`) with cheaper ones (e.g., `multiply_add`).
- Common Subexpression Elimination (CSE): Loading shared data (e.g., tree nodes) once per batch instead of per item.
- Loop Unrolling: Reducing loop overhead and exposing more ILP.
- Software Pipelining: Interleaving stages of different items to hide latency and fill slots.
- Register Caching: Keeping frequently used data (indices, values, top interaction tree nodes) in scratchpad to avoid memory access.
- Data Layout Optimization: (Limited capability) Sorting/Grouping data to maximize locality or cache hits (deduplication).
- Dead Code Elimination: Removing debug or unused instructions.
- Constant Folding: Pre-calculating constants.
- Active Set Processing: Tailoring the loop to handle only active/unique items (e.g., specific tree nodes) to minimize work.
- Bit Twiddling: Optimizing boolean logic and flag updates.
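To make the strength-reduction entry above concrete: a stage of the form `x + (x << k) + c` costs a shift and two adds, but since `x + (x << k) = x * (1 + 2**k)` it collapses to a single multiply-add. A minimal Python sketch (the stage shape and constants are illustrative, not the kernel's actual hash):

```python
def hash_stage_naive(x: int, k: int, c: int) -> int:
    # Three ops: shift, add, add.
    return x + (x << k) + c

def hash_stage_madd(x: int, k: int, c: int) -> int:
    # One op: foldable into a single multiply_add(x, const, c)
    # on the VLIW target, since x + (x << k) == x * (1 + 2**k).
    return x * ((1 << k) + 1) + c

# The two forms agree for all inputs.
assert all(hash_stage_naive(x, 3, 7) == hash_stage_madd(x, 3, 7)
           for x in range(1024))
```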
Applied Strategy Combinations
Attempt 1: The "Vectorized Algebraic" Approach
Combination: Vectorization + Strength Reduction + Register Caching.
- Vectorization: Process batch of 256 as 32 vectors of 8.
- Strength Reduction: Simplify Hash Stages 0, 2, 4 using `multiply_add` (collapsing 3 ops to 1). Simplify the other stages where possible.
- Register Caching: Keep all `indices` and `values` in scratchpad; do NOT load/store them every round, only the final store.
- Expected Result: Significant speedup.
- Bottleneck: Memory bandwidth for `node_val` (random access).
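A rough Python model of Attempt 1's loop structure: the batch of 256 is processed as 32 vectors of 8, values stay resident between rounds, and each stage issues as one `multiply_add`. The constants and masking are hypothetical stand-ins for the real hash:

```python
BATCH, VLEN = 256, 8

def multiply_add(vec, m, a):
    # Models the VLIW multiply_add: elementwise vec*m + a on an 8-wide
    # vector, with 32-bit wraparound (assumed machine word size).
    return [(x * m + a) & 0xFFFFFFFF for x in vec]

def round_vectorized(values):
    # `values` stays resident in "scratchpad" (a Python list here):
    # no per-round load/store of items, only the caller's final store.
    out = []
    for i in range(0, BATCH, VLEN):
        vec = values[i:i + VLEN]            # one vload per 8 items
        vec = multiply_add(vec, 9, 0x9E37)  # a 3-op stage collapsed to 1
        out.extend(vec)                     # stays in registers on-chip
    return out

vals = round_vectorized(list(range(BATCH)))
```

In the real kernel the 32 vectors would be spread across registers rather than re-sliced from a list each round; the sketch only shows the batching and the per-stage op count.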
Attempt 2: The "Active Node" Deduplication
Combination: Active Set Processing + ILP.
- Concept: In early rounds (0-7), the number of unique nodes accessed (< 256) is smaller than the batch size (256).
- Method:
- Round 0: Load Node 0 (scalar). Broadcast. Compute all.
- Round 1: Load Node 1, 2. Compute items with idx 1, items with idx 2.
- ...
- Round K: "Gather" items by index (conceptually) or iterate over active nodes.
- Win: Reduces `node_val` loads from 256/round to uniques/round.
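The dedup idea above can be sketched by grouping items on their node index, so each unique node value is loaded once and broadcast to all items that reference it (`node_mem` and the index pattern are made up for illustration):

```python
from collections import defaultdict

def round_dedup(indices, node_mem):
    # Group items by the node they access: one load per unique node
    # instead of one load per item.
    groups = defaultdict(list)
    for item, idx in enumerate(indices):
        groups[idx].append(item)

    loads = 0
    values = [0] * len(indices)
    for idx, items in groups.items():
        node_val = node_mem[idx]   # single load...
        loads += 1
        for item in items:
            values[item] = node_val  # ...then broadcast to its items
    return values, loads

# Round 1: every item hits node 1 or node 2 -> 2 loads for 256 items.
indices = [1 if i % 2 == 0 else 2 for i in range(256)]
values, loads = round_dedup(indices, {1: 111, 2: 222})
```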
Attempt 3: Full Pipelined Saturation
Combination: Loop Unrolling + Software Pipelining + All Previous.
- Concept: Completely fill `valu` and `alu` slots by processing multiple rounds or multiple vectors simultaneously.
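A schematic of the pipelined structure, assuming a two-stage split (load vs. compute): in steady state the load of vector i+1 issues alongside the compute of vector i, with a prologue and epilogue around the loop. This models only the slot-filling pattern, not real instruction latencies:

```python
def pipelined(vectors, compute):
    # Software pipelining: prologue loads vector 0, the steady-state
    # loop overlaps load(i+1) with compute(i), and the epilogue drains
    # the final compute.
    results = []
    current = vectors[0]                  # prologue: first load
    for i in range(1, len(vectors)):
        nxt = vectors[i]                  # load slot: fetch next vector
        results.append(compute(current))  # valu slots: compute current
        current = nxt
    results.append(compute(current))      # epilogue: last compute
    return results

out = pipelined([[1] * 8, [2] * 8, [3] * 8],
                lambda v: [x * 9 for x in v])
```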
Execution Log
- (Upcoming) Implementation of Attempt 1.