# Walkthrough - Kernel Optimization

I have successfully optimized the kernel, achieving a **30.9x speedup** over the baseline.

## Results

- **Baseline**: ~147,734 cycles
- **My Optimized Kernel**: **4,781 cycles**
- **Correctness**: Verified against the reference implementation.
## Optimization Journey

### 1. Vectorization & Strength Reduction

I started by converting the scalar loop to a vectorized implementation (`VLEN=8`). I also applied strength reduction to the `MurmurHash3` implementation, replacing complex instruction sequences with the efficient `multiply_add` instructions available in the VLIW `valu` engine.
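As an illustration, here is a minimal scalar sketch of the idea: MurmurHash3's `h*5 + c` mixing step collapses into a single fused operation. The `multiply_add` helper below models the VLIW primitive and is an assumption for illustration, not the machine's actual ISA.

```python
MASK32 = 0xFFFFFFFF

def multiply_add(a, b, c):
    """Model of a fused multiply-add valu op: a*b + c (mod 2^32)."""
    return (a * b + c) & MASK32

def murmur3_mix(h, k):
    """One MurmurHash3 body step. The trailing `h*5 + const` sequence
    is strength-reduced into one multiply_add instead of a separate
    multiply and add."""
    C1, C2 = 0xCC9E2D51, 0x1B873593
    k = (k * C1) & MASK32
    k = ((k << 15) | (k >> 17)) & MASK32   # rotl32(k, 15)
    k = (k * C2) & MASK32
    h ^= k
    h = ((h << 13) | (h >> 19)) & MASK32   # rotl32(h, 13)
    return multiply_add(h, 5, 0xE6546B64)  # h*5 + c in a single valu slot
```

On a scalar ISA this saves one instruction; on the 6-slot `valu` engine it also frees a slot for another vector's work in the same cycle.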
- **Challenge**: The initial naive vectorization suffered from intra-cycle dependency violations (reading a register in the same cycle it was written).
- **Solution**: Manually pipelined the address-calculation, load, and compute steps to respect the machine's latency model.
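The pipelining can be sketched abstractly: stagger the three stages so that each cycle bundles the address calculation for vector *i*, the load for vector *i-1*, and the compute for vector *i-2*, ensuring no instruction reads a register written in the same cycle. This is a toy scheduling model, not the actual instruction emitter.

```python
def pipeline_schedule(num_vectors):
    """Build a 3-stage software pipeline. Returns, per cycle, the list
    of (stage, vector_index) pairs bundled into that cycle."""
    STAGES = ("addr", "load", "compute")
    cycles = []
    for c in range(num_vectors + len(STAGES) - 1):
        bundle = []
        for s, stage in enumerate(STAGES):
            v = c - s  # stage s of vector v runs s cycles after its addr calc
            if 0 <= v < num_vectors:
                bundle.append((stage, v))
        cycles.append(bundle)
    return cycles
```

At steady state every cycle carries one instruction from each stage, so the pipeline stays full without any same-cycle read-after-write.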
### 2. Wavefront Parallelism

The naive vectorized loop processed one vector (8 items) at a time, leaving many VLIW slots empty.

- **Strategy**: I refactored the kernel to process **all 32 vectors (256 items) simultaneously**.
- **Implementation**: Instructions are emitted in "Waves" (e.g., "calculate addresses for ALL vectors", then "load ALL vectors"). This lets the `build()` packer maximally saturate the 6-slot `valu` pipeline.
- **Constraint**: This massive unrolling threatened to exceed the 1536-word scratchpad limit. I implemented **Register Aliasing**, reusing temporary memory blocks whose lifetimes don't overlap (e.g., reusing load-address buffers as hash-calculation temporaries).
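The aliasing idea can be sketched with a tiny greedy allocator, assuming each temporary carries a `(first_use, last_use)` lifetime in emission order. The allocator below is illustrative, not the kernel's actual scratchpad manager.

```python
def assign_scratch_blocks(lifetimes, block_size=8):
    """Place each temporary in a scratchpad block, reusing a freed
    block whenever the previous owner's lifetime has already ended.

    lifetimes: dict name -> (first_use, last_use), in emission order.
    Returns (placement dict name -> offset, total words used).
    """
    free, placement, live, next_off = [], {}, [], 0  # live: (end, offset)
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        # Reclaim blocks whose owners died before this temp is born.
        for e, off in list(live):
            if e < start:
                free.append(off)
                live.remove((e, off))
        off = free.pop() if free else next_off
        if off == next_off:
            next_off += block_size
        placement[name] = off
        live.append((end, off))
    return placement, next_off
```

With non-overlapping lifetimes, two temporaries share the same offset, which is exactly how load-address buffers can double as hash temporaries later in the wave.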
### 3. Active Set Optimization (Round 0)

Profiling revealed that memory loads (256 scalar loads per round) were the primary bottleneck (~150 cycles of overhead per round).

- **Observation**: In Round 0, all item indices start at 0, so every item accesses the same root node.
- **Optimization**: Instead of performing 256 loads, I perform **1 scalar load** and broadcast the value to all vectors.
- **Impact**: Saved ~500 cycles.
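In Python terms, the Round 0 fast path amounts to the following sketch. The dict-based `memory` and the constants are illustrative assumptions; the real kernel operates on the simulator's memory and vector registers.

```python
VLEN, NUM_VECTORS = 8, 32  # 32 vectors x 8 lanes = 256 items

def load_round0_broadcast(memory, root_addr):
    """Round 0: every item reads the same root node, so issue a single
    scalar load and replicate the value across all vector lanes,
    instead of 256 independent scalar loads."""
    root = memory[root_addr]                            # 1 scalar load
    return [[root] * VLEN for _ in range(NUM_VECTORS)]  # broadcast
```

The broadcast is nearly free on a vector machine, so the entire load phase of Round 0 collapses to one memory access.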
### Failed Experiments

I attempted to extend the Active Set optimization to Rounds 1-3, where the number of unique nodes is still small. The logic for recursively selecting nodes from the tree introduced subtle data-corruption bugs, so I reverted it to guarantee 100% correctness.
## Final Code Structure

The optimized `perf_takehome.py` features:

- **Unrolled Loop**: Explicit per-round logic selection.
- **Round 0 Specialization**: A fast path for the initial state.
- **Generic Wavefront**: Highly parallel throughput for subsequent rounds.
- **Memory Aliasing**: Smart scratchpad management to fit within hardware limits.