# Optimization Log

## Goal

Achieve < 1000 cycles on the VLIW SIMD kernel.

- Starting baseline: ~147,734 cycles (scalar).
- Reference best: < 1363 cycles (Claude Opus 4.5 Improved).
## Optimization Methods (Comprehensive List)

1. **Vectorization (SIMD)**: Utilizing `valu`, `vload`, and `vstore` to process 8 items per instruction.
2. **Instruction-Level Parallelism (ILP)**: Filling all VLIW slots (`alu` x12, `valu` x6, `load` x2) every cycle.
3. **Strength Reduction / Algebraic Simplification**: Replacing expensive operation sequences (e.g., `add` + `shift` + `add`) with cheaper ones (e.g., a single `multiply_add`).
4. **Common Subexpression Elimination (CSE)**: Loading shared data (e.g., tree nodes) once per batch instead of once per item.
5. **Loop Unrolling**: Reducing loop overhead and exposing more ILP.
6. **Software Pipelining**: Interleaving stages of different items to hide latency and fill slots.
7. **Register Caching**: Keeping frequently used data (indices, values, top interaction-tree nodes) in scratchpad to avoid memory accesses.
8. **Data Layout Optimization**: (Limited capability) Sorting/grouping data to maximize locality or cache hits (deduplication).
9. **Dead Code Elimination**: Removing debug or otherwise unused instructions.
10. **Constant Folding**: Pre-calculating constants.
11. **Active Set Processing**: Tailoring the loop to handle only active/unique items (e.g., specific tree nodes) to minimize work.
12. **Bit Twiddling**: Optimizing boolean logic and flag updates.
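As a concrete illustration of method 3, here is a minimal Python sketch of strength reduction. The hash stage shown (`x + (x << 3) + c`) is a hypothetical example, not the kernel's actual hash; the point is that the shift-add sequence is algebraically a `multiply_add`.

```python
# Hypothetical hash stage: x + (x << 3) + c costs a shift and two adds,
# but x + (x << 3) == x * 9, so the whole stage collapses to one
# multiply_add (x * 9 + c) on hardware that has it.

def hash_stage_naive(x: int, c: int) -> int:
    # Three ops: shift, add, add.
    return (x + (x << 3) + c) & 0xFFFFFFFF

def hash_stage_reduced(x: int, c: int) -> int:
    # One op: multiply_add with constant multiplier 9.
    return (x * 9 + c) & 0xFFFFFFFF

# The two forms agree for all inputs.
assert all(hash_stage_naive(x, 7) == hash_stage_reduced(x, 7)
           for x in range(1000))
```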
## Applied Strategy Combinations

### Attempt 1: The "Vectorized Algebraic" Approach

**Combination**: Vectorization + Strength Reduction + Register Caching.

- **Vectorization**: Process the batch of 256 items as 32 vectors of 8.
- **Strength Reduction**: Simplify Hash Stages 0, 2, and 4 using `multiply_add` (collapsing 3 ops into 1); simplify the other stages where possible.
- **Register Caching**: Keep all `indices` and `values` in scratchpad; do NOT load/store them every round. Only the final results are stored back to memory.
- **Expected Result**: Significant speedup over the scalar baseline.
- **Bottleneck**: Memory bandwidth for the `node_val` loads (random access).
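The structure of Attempt 1 can be sketched in plain Python. The vector width, round count, and per-lane update below are illustrative placeholders; the real kernel would express the inner loop as `vload`/`valu`/`vstore` operations on the VLIW machine.

```python
# Sketch of Attempt 1: batch of 256 processed as 32 vectors of 8, with the
# whole batch held in a scratchpad so no loads/stores happen inside the
# round loop. The per-lane update (x * 9 + 7) is a placeholder, not the
# kernel's real hash.
VLEN = 8
BATCH = 256
NVEC = BATCH // VLEN  # 32 vectors of 8 lanes

def run_batch(values, rounds=16):
    # Register caching: copy the batch into a local scratchpad once.
    scratch = [values[v * VLEN:(v + 1) * VLEN] for v in range(NVEC)]
    for _ in range(rounds):
        for vec in scratch:
            for lane in range(VLEN):
                # One valu multiply_add per vector on the real machine.
                vec[lane] = (vec[lane] * 9 + 7) & 0xFFFFFFFF
    # Single final store back out.
    out = []
    for vec in scratch:
        out.extend(vec)
    return out
```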
### Attempt 2: The "Active Node" Deduplication

**Combination**: Active Set Processing + ILP.

- **Concept**: In early rounds (0-7), the number of unique nodes accessed (< 256) is smaller than the batch size (256).
- **Method**:
  - Round 0: Load node 0 (scalar), broadcast it, compute all items.
  - Round 1: Load nodes 1 and 2; compute the items with idx 1, then the items with idx 2.
  - ...
  - Round K: "Gather" items by index (conceptually) or iterate over the active nodes.
- **Win**: Reduces `node_val` loads from 256 per round to `Uniques` per round.
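A minimal sketch of the deduplication win, assuming a `node_val` array and a per-item `idx` list (hypothetical names standing in for the kernel's data):

```python
# Naive: one node_val load per item, i.e. 256 loads for a batch of 256.
def round_loads_naive(idx, node_val):
    return [node_val[i] for i in idx]

# Deduplicated: one load per *unique* node, then broadcast to the items
# that reference it -- Uniques loads instead of 256.
def round_loads_dedup(idx, node_val):
    cache = {i: node_val[i] for i in set(idx)}  # loads happen here
    return [cache[i] for i in idx]              # pure broadcast, no loads
```

Both functions produce identical results; only the number of memory accesses differs, which is the whole point of the active-set approach.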
### Attempt 3: Full Pipelined Saturation

**Combination**: Loop Unrolling + Software Pipelining + All Previous.

- **Concept**: Completely fill the `valu` and `alu` slots by processing multiple rounds or multiple vectors simultaneously.
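A toy model of the software-pipelining idea, with two placeholder stages standing in for a load-like step and a compute-like step (not the machine's real latencies or instruction semantics): by staggering vectors, stage B of vector k can issue alongside stage A of vector k+1, keeping both kinds of slots busy.

```python
# Placeholder stages: A models a load/address step, B a valu compute step.
def stage_a(x):
    return x + 1

def stage_b(x):
    return x * 2

def pipelined(vectors):
    out = []
    in_flight = None  # stage-A result waiting for stage B
    for v in vectors:
        a = stage_a(v)                      # issue A for current vector...
        if in_flight is not None:
            out.append(stage_b(in_flight))  # ...while B finishes previous
        in_flight = a
    if in_flight is not None:
        out.append(stage_b(in_flight))      # drain the pipeline
    return out

# Same results as the unpipelined order, but A and B of adjacent
# vectors overlap in each loop iteration.
assert pipelined([1, 2, 3]) == [stage_b(stage_a(v)) for v in [1, 2, 3]]
```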
## Execution Log

- *(Upcoming)* Implementation of Attempt 1.