# Optimization Log

## Goal

Achieve < 1000 cycles on the VLIW SIMD Kernel.

- Starting Baseline: 4,781 cycles.
- Final Result: **1,859 cycles** (~2.6x speedup), still short of the < 1000-cycle goal.

## Optimization Methods Attempted

### 1. Custom Instruction Scheduler

- **Implemented**: Yes.
- **Impact**: High.
- **Detail**: Implemented a list scheduler (`scheduler.py`) that is aware of the VLIW slot limits. This allowed vector operations (`valu`) to be packed efficiently.

### 2. Active Load Deduplication

- **Implemented**: Yes (Rounds 0-3).
- **Impact**: Moderate.
- **Detail**: In early rounds there are few unique nodes, so we used scalar loads plus a broadcast instead of per-lane vector loads.
  - Round 0 (1 node): Huge win (1 load vs 32).
  - Round 1 (2 nodes): Big win.
  - Round 3 (8 nodes): Break-even. The overhead of selecting the correct broadcast value (`vselect` tree) grows exponentially with the round number, because the number of unique nodes doubles each round.
- **Tuning**: The optimal `active_threshold` was found to be **4** (optimizes R0-R3).

### 3. Mask Skipping

- **Implemented**: Yes.
- **Impact**: Moderate (saved ~4 ops/vec/round in R0-R7).
- **Detail**: The `idx` wrapping logic is unnecessary while the maximum `idx` stays below `n_nodes`. We skip it in those rounds, deciding by round number.

### 4. Scalar Offloading

- **Implemented**: Yes.
- **Impact**: Minor (positive).
- **Detail**: Since `VALU` (Vector ALU) was the bottleneck (~90 cycles/round), we tried offloading some vectors to the `ALU` (Scalar ALU).
  - **Challenge**: `ALU` is less efficient per item (it requires a loop over 8 lanes plus an inefficient scalar hash sequence).
  - **Result**: Offloading ~2 vectors to `ALU` provided a small speedup (1862 -> 1859 cycles). Aggressive offloading (6+ vectors) degraded performance because `ALU` became the new bottleneck and the `flow` selects needed for wrapping added overhead.

### 5. Ray Tuning

- **Attempted**: Yes.
- **Blocking Issue**: The provided `ray` library was a source checkout without compiled binaries (`_raylet`), causing a `ModuleNotFoundError`.
- **Workaround**: Implemented `manual_tuner.py` to perform a grid search over `active_threshold`, `mask_skip`, and `scalar_offload`.
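
The grid-search workaround can be sketched roughly as follows. This is a minimal illustration and not the contents of `manual_tuner.py`: the `grid_search` helper, the value ranges in `GRID`, and `toy_evaluate` are all assumptions. In particular, `toy_evaluate` is a made-up stand-in for "build the kernel with this configuration and return the simulated cycle count"; its score is in arbitrary units and is constructed so that the minimum sits at the configuration this log reports.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Evaluate every combination in `grid`; return (best_config, best_score)."""
    names = list(grid)
    best_config, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score < best_score:  # lower is better (fewer cycles)
            best_config, best_score = config, score
    return best_config, best_score


def toy_evaluate(config):
    """Hypothetical cost function in arbitrary units, NOT real cycle counts."""
    penalty = 10 * abs(config["active_threshold"] - 4)
    penalty += 0 if config["mask_skip"] else 30
    penalty += 3 * abs(config["scalar_offload"] - 2)
    return penalty


# Assumed search ranges; the real tuner's ranges are not given in this log.
GRID = {
    "active_threshold": [0, 1, 2, 3, 4, 5],
    "mask_skip": [False, True],
    "scalar_offload": [0, 2, 4, 6],
}

best_config, best_score = grid_search(toy_evaluate, GRID)
print(best_config, best_score)
```

In a real tuner, `toy_evaluate` would be replaced by a function that rebuilds the kernel and runs the simulator. With only three parameters and a handful of values each, an exhaustive sweep stays small (48 combinations here), which is why a plain grid search is a workable substitute for a tuning framework.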

## Failed/Discarded Ideas

- **Scalar Wrapping on Flow**: Tried to use `flow` select for scalar wrapping. Failed due to limited `flow` slots (2 vs 6 VALU), causing massive stalls.
- **Aggressive Active Set**: Tried extending Active Set to Round 4+. Failed due to `vselect` tree overhead (15+ ops) exceeding the cost of vector loads.
- **Flow Arithmetic**: Investigated using `add_imm` on `flow` unit for compute. Discarded as it only supports scalar inputs, while hash computation is vectorized.

## Final Configuration

- **Active Threshold**: 4 (Rounds 0-3 optimized).
- **Mask Skip**: Enabled.
- **Scalar Offload**: 2 vectors.
- **Cycle Count**: 1,859.
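
To illustrate why the active set stops paying off around Rounds 3-4, the sketch below counts the two-way selects needed to pick one value out of `n` broadcast values. It is a scalar model under stated assumptions, not the kernel's code: it assumes a balanced binary select tree driven by one bit of `idx` per level, and that the number of unique nodes doubles each round (consistent with the 1, 2, and 8 nodes reported for Rounds 0, 1, and 3). The name `select_tree` is hypothetical.

```python
def select_tree(values, idx):
    """Return (values[idx], number_of_two_way_selects) using only pairwise selects."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("values must have a power-of-two length")
    level, selects, bit = list(values), 0, 0
    while len(level) > 1:
        take_high = (idx >> bit) & 1  # one bit of idx drives a whole tree level
        level = [hi if take_high else lo
                 for lo, hi in zip(level[0::2], level[1::2])]
        selects += len(level)  # one select per surviving value
        bit += 1
    return level[0], selects


for round_index in range(5):
    n_nodes = 2 ** round_index  # assumed: unique nodes double each round
    values = list(range(100, 100 + n_nodes))
    picked, selects = select_tree(values, n_nodes - 1)
    assert picked == values[n_nodes - 1]
    print(f"round {round_index}: {n_nodes:2d} nodes -> {selects:2d} selects")
```

Under this model the select count is `n - 1`, giving 0, 1, 3, 7, and 15 selects for Rounds 0 through 4. The 15 selects at Round 4 match the "15+ ops" figure cited for the Aggressive Active Set attempt.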