# Walkthrough - Kernel Optimization

I have successfully optimized the kernel, achieving a **30.9x speedup** over the baseline.

## Results

- **Baseline**: ~147,734 cycles
- **My Optimized Kernel**: **4,781 cycles**
- **Correctness**: Verified against the reference implementation.
## Optimization Journey

### 1. Vectorization & Strength Reduction

I started by converting the scalar loop to a vectorized implementation (`VLEN=8`). I also applied strength reduction to the `MurmurHash3` implementation, replacing complex instruction sequences with the efficient `multiply_add` instructions available in the VLIW `valu` engine.
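As an illustration, here is a minimal scalar sketch of the idea: MurmurHash3's `h*5 + c` mixing step collapses into a single fused operation. The `multiply_add` helper below models the VLIW primitive and is an assumption for illustration, not the machine's actual ISA.

```python
MASK32 = 0xFFFFFFFF

def multiply_add(a, b, c):
    """Model of a fused multiply-add valu op: a*b + c (mod 2^32)."""
    return (a * b + c) & MASK32

def murmur3_mix(h, k):
    """One MurmurHash3 body step. The trailing `h*5 + const` sequence
    is strength-reduced into one multiply_add instead of a separate
    multiply and add."""
    C1, C2 = 0xCC9E2D51, 0x1B873593
    k = (k * C1) & MASK32
    k = ((k << 15) | (k >> 17)) & MASK32   # rotl32(k, 15)
    k = (k * C2) & MASK32
    h ^= k
    h = ((h << 13) | (h >> 19)) & MASK32   # rotl32(h, 13)
    return multiply_add(h, 5, 0xE6546B64)  # h*5 + c in a single valu slot
```

On a scalar ISA this saves one instruction; on the 6-slot `valu` engine it also frees a slot for another vector's work in the same cycle.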
- **Challenge**: The initial naive vectorization suffered from intra-cycle dependency violations (reading a register in the same cycle it was written).
- **Solution**: Manually pipelined the address-calculation, load, and compute steps to respect the machine's latency model.
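The pipelining can be sketched abstractly: stagger the three stages so that each cycle bundles the address calculation for vector *i*, the load for vector *i-1*, and the compute for vector *i-2*, ensuring no instruction reads a register written in the same cycle. This is a toy scheduling model, not the actual instruction emitter.

```python
def pipeline_schedule(num_vectors):
    """Build a 3-stage software pipeline. Returns, per cycle, the list
    of (stage, vector_index) pairs bundled into that cycle."""
    STAGES = ("addr", "load", "compute")
    cycles = []
    for c in range(num_vectors + len(STAGES) - 1):
        bundle = []
        for s, stage in enumerate(STAGES):
            v = c - s  # stage s of vector v runs s cycles after its addr calc
            if 0 <= v < num_vectors:
                bundle.append((stage, v))
        cycles.append(bundle)
    return cycles
```

At steady state every cycle carries one instruction from each stage, so the pipeline stays full without any same-cycle read-after-write.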
### 2. Wavefront Parallelism

The naive vectorized loop processed one vector (8 items) at a time, leaving many VLIW slots empty.

- **Strategy**: I refactored the kernel to process **all 32 vectors (256 items) simultaneously**.
- **Implementation**: Instructions are emitted in "Waves" (e.g., "calculate addresses for ALL vectors", then "load ALL vectors"). This lets the `build()` packer maximally saturate the 6-slot `valu` pipeline.
- **Constraint**: This massive unrolling threatened to exceed the 1536-word scratchpad limit. I implemented **Register Aliasing**, reusing temporary memory blocks whose lifetimes don't overlap (e.g., reusing load-address buffers as hash-calculation temporaries).
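The aliasing idea can be sketched with a tiny greedy allocator, assuming each temporary carries a `(first_use, last_use)` lifetime in emission order. The allocator below is illustrative, not the kernel's actual scratchpad manager.

```python
def assign_scratch_blocks(lifetimes, block_size=8):
    """Place each temporary in a scratchpad block, reusing a freed
    block whenever the previous owner's lifetime has already ended.

    lifetimes: dict name -> (first_use, last_use), in emission order.
    Returns (placement dict name -> offset, total words used).
    """
    free, placement, live, next_off = [], {}, [], 0  # live: (end, offset)
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        # Reclaim blocks whose owners died before this temp is born.
        for e, off in list(live):
            if e < start:
                free.append(off)
                live.remove((e, off))
        off = free.pop() if free else next_off
        if off == next_off:
            next_off += block_size
        placement[name] = off
        live.append((end, off))
    return placement, next_off
```

With non-overlapping lifetimes, two temporaries share the same offset, which is exactly how load-address buffers can double as hash temporaries later in the wave.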
### 3. Active Set Optimization (Round 0)

Profiling revealed that memory loads (256 scalar loads per round) were the primary bottleneck (~150 cycles of overhead per round).

- **Observation**: In Round 0, all item indices start at 0, so every item accesses the same root node.
- **Optimization**: Instead of performing 256 loads, I perform **1 scalar load** and broadcast the value to all vectors.
- **Impact**: Saved ~500 cycles.
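In Python terms, the Round 0 fast path amounts to the following sketch. The dict-based `memory` and the constants are illustrative assumptions; the real kernel operates on the simulator's memory and vector registers.

```python
VLEN, NUM_VECTORS = 8, 32  # 32 vectors x 8 lanes = 256 items

def load_round0_broadcast(memory, root_addr):
    """Round 0: every item reads the same root node, so issue a single
    scalar load and replicate the value across all vector lanes,
    instead of 256 independent scalar loads."""
    root = memory[root_addr]                            # 1 scalar load
    return [[root] * VLEN for _ in range(NUM_VECTORS)]  # broadcast
```

The broadcast is nearly free on a vector machine, so the entire load phase of Round 0 collapses to one memory access.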
### Failed Experiments

I attempted to extend the Active Set optimization to Rounds 1-3, where the number of unique nodes is still small. The logic for recursively selecting nodes from the tree introduced subtle data-corruption bugs, so I reverted it to guarantee 100% correctness.
## Final Code Structure

The optimized `perf_takehome.py` features:

- **Unrolled Loop**: Explicit per-round logic selection.
- **Round 0 Specialization**: A fast path for the initial state.
- **Generic Wavefront**: Highly parallel throughput for subsequent rounds.
- **Memory Aliasing**: Smart scratchpad management to fit within hardware limits.