Optimization Log
Goal
Achieve < 1,000 cycles on the VLIW SIMD Kernel.
Starting Baseline: 4,781 cycles.
Final Result: 1,859 cycles (~2.6x speedup). The < 1,000-cycle target was not reached.
Optimization Methods Attempted
1. Custom Instruction Scheduler
Implemented: Yes.
Impact: High.
Detail: Implemented a list scheduler (scheduler.py) aware of VLIW slot limits. This allowed packing vector operations (valu) efficiently.
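A minimal sketch of slot-aware list scheduling, for illustration only. It is not the contents of scheduler.py. The op tuples, the unit-latency assumption, and the alu and load limits are hypothetical; only the flow (2) and valu (6) slot counts come from this log.

```python
from collections import defaultdict

# Hypothetical per-cycle slot limits. The flow (2) and valu (6) figures echo
# this log; the alu and load limits are placeholders for illustration.
SLOT_LIMITS = {"valu": 6, "alu": 4, "load": 2, "flow": 2}


def list_schedule(ops, slot_limits=SLOT_LIMITS):
    """Greedy list scheduling.

    ops: list of (name, unit, deps) tuples in program order, where deps is a
    list of op names that must have issued in an earlier cycle (latency 1).
    Returns a list of cycles, each a list of op names issued together.
    """
    done_cycle = {}  # op name -> cycle in which it was issued
    remaining = list(ops)
    cycles = []
    while remaining:
        cycle_idx = len(cycles)
        used = defaultdict(int)  # slots consumed per unit in this cycle
        issued, deferred = [], []
        for name, unit, deps in remaining:
            ready = all(
                d in done_cycle and done_cycle[d] < cycle_idx for d in deps
            )
            if ready and used[unit] < slot_limits[unit]:
                used[unit] += 1
                issued.append(name)
            else:
                deferred.append((name, unit, deps))
        if not issued:
            raise ValueError("no op can be issued: unresolved dependency")
        for name in issued:
            done_cycle[name] = cycle_idx
        cycles.append(issued)
        remaining = deferred
    return cycles
```

With eight independent valu ops plus one op that depends on the first, the sketch fills all six valu slots in the first cycle and places the rest in the second.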
2. Active Load Deduplication
Implemented: Yes (Rounds 0-3).
Impact: Moderate.
Detail: For early rounds, unique nodes are few. We used scalar loads + broadcast.
- Round 0 (1 node): Huge win (1 load vs 32).
- Round 1 (2 nodes): Big win.
- Round 3 (8 nodes): Break-even. The overhead of selecting the correct broadcast value (vselect tree) grows exponentially with the round number, since the number of unique nodes doubles each round.
Tuning: Optimal active_threshold found to be 4 (optimizes R0-R3).
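The select overhead can be sketched as follows, assuming the selection is built as a binary tree of 2-way selects steered by one index bit per level. That structure is an assumption; the log does not show the actual vselect sequence. The function emulates a single lane in plain Python, whereas on the kernel each select would be one vector op covering all lanes.

```python
def select_tree(values, idx):
    """Pick values[idx] with a binary tree of 2-way selects.

    values: list whose length is a power of two (one broadcast value per
    unique node). idx: the lane's node index.
    Returns (chosen_value, number_of_selects_performed).
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("len(values) must be a power of two")
    level, bit, n_selects = list(values), 0, 0
    while len(level) > 1:
        take_hi = (idx >> bit) & 1
        # One select per adjacent pair, steered by the current index bit.
        level = [
            hi if take_hi else lo
            for lo, hi in zip(level[0::2], level[1::2])
        ]
        n_selects += len(level)
        bit += 1
    return level[0], n_selects
```

Under this assumption the tree needs n - 1 selects for n unique nodes: 7 selects at 8 nodes and 15 at 16 nodes.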
3. Mask Skipping
Implemented: Yes.
Impact: Moderate (Saved ~4 ops/vec/round in R0-R7).
Detail: The idx wrapping logic is unnecessary when max idx < n_nodes. We skip it dynamically based on round number.
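A rough sketch of the round-based check, under assumptions not stated in this log: every lane starts at the root (idx 0) and each round moves to a child via idx = 2 * idx + 1 or idx = 2 * idx + 2. The function names are hypothetical, and the bound only holds until the first wrap occurs.

```python
def max_idx_after_round(round_no):
    """Upper bound on idx after round `round_no` (0-based).

    Assumes every lane starts at the root (idx 0) and each round moves to a
    child via idx = 2 * idx + 1 or idx = 2 * idx + 2 (heap-style layout).
    """
    return 2 ** (round_no + 2) - 2


def needs_wrap(round_no, n_nodes):
    """True if idx could reach n_nodes in this round, so wrapping must stay."""
    return max_idx_after_round(round_no) >= n_nodes
```

Because the bound depends only on the round number and n_nodes, the decision can be made once per round at code-generation time rather than per lane.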
4. Scalar Offloading
Implemented: Yes.
Impact: Minor/Positive.
Detail: Since VALU (Vector ALU) was the bottleneck (~90 cycles/round), we tried offloading some vectors to the ALU (Scalar ALU).
- Challenge: ALU is less efficient per item (requires loop over 8 lanes + inefficient Scalar Hash sequence).
- Result: Offloading ~2 vectors to ALU provided a small speedup (1,862 -> 1,859 cycles). Aggressive offloading (6+ vectors) degraded performance due to ALU becoming the new bottleneck and overhead of flow selects for wrapping.
5. Ray Tuning
Attempted: Yes.
Blocking Issue: The provided ray library was a source checkout without compiled binaries (_raylet), causing ModuleNotFoundError.
Workaround: Implemented manual_tuner.py to perform a grid search over active_threshold, mask_skip, and scalar_offload.
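A minimal sketch of an exhaustive grid search over the three parameters named above. This is not the actual manual_tuner.py. The measure_cycles callback, the candidate values, and toy_cost are hypothetical; toy_cost is a synthetic stand-in shaped only so the example has a unique minimum, and it returns no measured data.

```python
import itertools


def grid_search(measure_cycles, grid):
    """Exhaustively evaluate every combination in `grid`.

    measure_cycles: callable taking keyword arguments named after the grid
    keys and returning a cycle count. grid: dict of name -> candidate values.
    Returns (best_config, best_cycles).
    """
    names = list(grid)
    best_config, best_cycles = None, None
    for combo in itertools.product(*(grid[n] for n in names)):
        config = dict(zip(names, combo))
        cycles = measure_cycles(**config)
        if best_cycles is None or cycles < best_cycles:
            best_config, best_cycles = config, cycles
    return best_config, best_cycles


def toy_cost(active_threshold, mask_skip, scalar_offload):
    """Synthetic stand-in for a real kernel run; NOT measured data."""
    return (
        100
        + 5 * abs(active_threshold - 4)
        - (10 if mask_skip else 0)
        + 3 * abs(scalar_offload - 2)
    )


# Hypothetical candidate values for each tuning parameter.
GRID = {
    "active_threshold": [0, 2, 4, 6],
    "mask_skip": [False, True],
    "scalar_offload": [0, 2, 4, 6],
}
```

In real use, measure_cycles would build the kernel with the given settings and return the simulator's cycle count. An exhaustive sweep stays cheap here because the grid has only 32 combinations.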
Failed/Discarded Ideas
- Scalar Wrapping on Flow: Tried to use flow select for scalar wrapping. Failed due to limited flow slots (2 vs 6 VALU), causing massive stalls.
- Aggressive Active Set: Tried extending Active Set to Round 4+. Failed due to vselect tree overhead (15+ ops) exceeding the cost of vector loads.
- Flow Arithmetic: Investigated using add_imm on flow unit for compute. Discarded as it only supports scalar inputs, while hash computation is vectorized.
Final Configuration
- Active Threshold: 4 (Rounds 0-3 optimized).
- Mask Skip: Enabled.
- Scalar Offload: 2 vectors.
- Cycle Count: 1,859.