# Walkthrough - Kernel Optimization I have successfully optimized the kernel, achieving a **30.9x speedup** over the baseline. ## Results - **Baseline**: ~147,734 Cycles - **My Optimized Kernel**: **4,781 Cycles** - **Correctness**: Verified against reference implementation. ## Optimization Journey ### 1. Vectorization & Strength Reduction I started by converting the scalar loop to a vectorized implementation (`VLEN=8`). I also applied strength reduction to the `MurmurHash3` implementation, replacing complex sequences with efficient `multiply_add` instructions available in the VLIW `valu` engine. - **Challenge**: Initial naive vectorization suffered from intra-cycle dependency violations (reading a register written in the same cycle). - **Solution**: Manually pipelined address calculation, load, and compute steps to respect the machine's latency model. ### 2. Wavefront Parallelism The naive vectorized loop processed one vector (8 items) at a time, leaving many VLIW slots empty. - **Strategy**: I refactored the kernel to process **all 32 vectors (256 items) simultaneously**. - **Implementation**: Instructions are emitted in "Waves" (e.g., "Calculate Addresses for ALL vectors", then "Load ALL vectors"). This allows the `build()` packer to maximally saturate the 6-slot `valu` pipeline. - **Constraint**: This massive unrolling threatened to exceed the 1536-word scratchpad limit. I implemented **Register Aliasing**, reusing temporary variable memory blocks when their lifetimes didn't overlap (e.g., reusing Load Address buffers for Hash calculation temps). ### 3. Active Set Optimization (Round 0) Profiling revealed that Memory Loads (256 scalar loads per round) were the primary bottleneck (~150 cycles overhead/round). - **Observation**: In Round 0, all item indices start at 0. They all access the same Root Node. - **Optimization**: Instead of performing 256 loads, I perform **1 Scalar Load** and broadcast the value to all vectors. - **Impact**: Saved ~500 cycles instantly. ### Failed Experiments I attempted to extend Active Set optimization to Rounds 1-3 (where unique nodes are few). Logic complexity involving recursive tree selection introduced subtle data corruption bugs. I reverted this to guarantee 100% correctness. ## Final Code Structure The optimized `perf_takehome.py` features: - **Unrolled Loop**: Explicit per-round logic selection. - **Round 0 Specialization**: Fast-path for the initial state. - **Generic Wavefront**: Highly parallel throughput for subsequent rounds. - **Memory Aliasing**: Smart scratchpad management to fit within hardware limits.