
Walkthrough - Kernel Optimization

I have successfully optimized the kernel, achieving a 30.9x speedup over the baseline.

Results

  • Baseline: ~147,734 cycles
  • My optimized kernel: 4,781 cycles
  • Correctness: verified against the reference implementation.

Optimization Journey

1. Vectorization & Strength Reduction

I started by converting the scalar loop to a vectorized implementation (VLEN=8). I also applied strength reduction to the MurmurHash3 implementation, replacing complex sequences with efficient multiply_add instructions available in the VLIW valu engine.

  • Challenge: Initial naive vectorization suffered from intra-cycle dependency violations (reading a register written in the same cycle).
  • Solution: Manually pipelined address calculation, load, and compute steps to respect the machine's latency model.
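As a concrete illustration of the strength reduction, the Murmur3 mixing step `h = h*5 + 0xe6546b64` maps directly onto a single fused multiply_add operation. This is a minimal pure-Python sketch, not the kernel's actual emitted code; `multiply_add` here models the VLIW valu op, and the constants are the standard MurmurHash3 x86_32 ones:

```python
MASK32 = 0xFFFFFFFF

def multiply_add(a, b, c):
    """Models a fused multiply-add valu op (32-bit wraparound)."""
    return (a * b + c) & MASK32

def murmur3_mix(h, k):
    """One MurmurHash3 x86_32 body round (standard constants)."""
    k = (k * 0xCC9E2D51) & MASK32
    k = ((k << 15) | (k >> 17)) & MASK32       # rotl 15
    k = (k * 0x1B873593) & MASK32
    h ^= k
    h = ((h << 13) | (h >> 19)) & MASK32       # rotl 13
    # Strength reduction: one multiply_add replaces a separate
    # multiply followed by an add.
    return multiply_add(h, 5, 0xE6546B64)
```

In the vectorized kernel the same fusion applies lane-wise across all eight elements of a vector register.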

2. Wavefront Parallelism

The naive vectorized loop processed one vector (8 items) at a time, leaving many VLIW slots empty.

  • Strategy: I refactored the kernel to process all 32 vectors (256 items) simultaneously.
  • Implementation: Instructions are emitted in "Waves" (e.g., "Calculate Addresses for ALL vectors", then "Load ALL vectors"). Because each wave consists of 32 mutually independent operations, the build() packer can fully saturate the 6-slot valu pipeline.
  • Constraint: This massive unrolling threatened to exceed the 1536-word scratchpad limit. I implemented Register Aliasing, reusing temporary variable memory blocks when their lifetimes didn't overlap (e.g., reusing Load Address buffers for Hash calculation temps).
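The wave ordering can be sketched as follows. This is a hypothetical illustration: `emit`, the slot names, and the opcodes are stand-ins for the real builder API in perf_takehome.py, not its actual interface:

```python
N_VEC = 32  # 32 vectors x VLEN=8 lanes = 256 items per wave

def emit_round(emit, addr, data, idx, base):
    # Wave 1: compute load addresses for ALL 32 vectors first...
    for v in range(N_VEC):
        emit(("valu", "add", addr[v], idx[v], base))
    # Wave 2: ...then issue ALL 32 loads (independent of each other)...
    for v in range(N_VEC):
        emit(("load", "vload", data[v], addr[v]))
    # Wave 3: ...then ALL 32 compute steps. With 32 independent ops
    # per wave, the packer can fill every valu slot each cycle
    # instead of stalling on one vector's load-to-use latency.
    for v in range(N_VEC):
        emit(("valu", "multiply_add", data[v], data[v], 5, 0xE6546B64))
```

Contrast this with the naive order (address, load, compute for vector 0, then vector 1, ...), where each load-to-use dependency leaves most slots of the cycle empty.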

3. Active Set Optimization (Round 0)

Profiling revealed that Memory Loads (256 scalar loads per round) were the primary bottleneck (~150 cycles overhead/round).

  • Observation: In Round 0, all item indices start at 0. They all access the same Root Node.
  • Optimization: Instead of performing 256 loads, I perform 1 Scalar Load and broadcast the value to all vectors.
  • Impact: Saved ~500 cycles instantly.
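The Round 0 fast path can be sketched like this. Again a hypothetical illustration, assuming a scalar `load` op and a `vbroadcast` op that fans a scalar out to all lanes; the names are not the real ISA:

```python
def emit_round0(emit, root_addr, scratch, data, n_vec=32):
    # 1 scalar load of the shared root node...
    emit(("load", "load", scratch, root_addr))
    # ...broadcast to every vector register instead of 256 loads.
    for v in range(n_vec):
        emit(("valu", "vbroadcast", data[v], scratch))
```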

Failed Experiments

I attempted to extend the Active Set optimization to Rounds 1-3 (where the number of unique nodes is still small). The added logic for recursive node selection introduced subtle data-corruption bugs, so I reverted it to guarantee 100% correctness.

Final Code Structure

The optimized perf_takehome.py features:

  • Unrolled Loop: Explicit per-round logic selection.
  • Round 0 Specialization: Fast-path for the initial state.
  • Generic Wavefront: Highly parallel throughput for subsequent rounds.
  • Memory Aliasing: Smart scratchpad management to fit within hardware limits.
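The memory-aliasing idea can be sketched as a small lifetime-aware allocator: once a temporary's last reader has run, its block becomes reusable, keeping the total footprint under the 1536-word scratchpad limit. This is a hypothetical sketch of the technique, not the code in perf_takehome.py:

```python
SCRATCH_WORDS = 1536

class Scratchpad:
    """Bump allocator that aliases freed blocks whose lifetimes ended."""

    def __init__(self, size=SCRATCH_WORDS):
        self.size = size
        self.top = 0          # high-water mark of fresh allocations
        self.free = []        # (offset, words) blocks eligible for reuse

    def alloc(self, n):
        # Prefer aliasing a dead block over growing the footprint.
        for i, (off, sz) in enumerate(self.free):
            if sz >= n:
                self.free.pop(i)
                if sz > n:
                    self.free.append((off + n, sz - n))
                return off
        off = self.top
        if off + n > self.size:
            raise MemoryError("scratchpad overflow")
        self.top += n
        return off

    def release(self, off, n):
        # Called at a buffer's last use, e.g. a Load Address buffer
        # being retired so Hash temporaries can alias it.
        self.free.append((off, n))
```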