| # Optimization Log | |
| ## Goal | |
| Achieve < 1000 cycles on the VLIW SIMD Kernel. | |
| Starting Baseline: 4,781 cycles. | |
| Final Result: **1,859 cycles** (~2.6x speedup over baseline; the < 1000-cycle goal was not reached). | |
| ## Optimization Methods Attempted | |
| ### 1. Custom Instruction Scheduler | |
| **Implemented**: Yes. | |
| **Impact**: High. | |
| **Detail**: Implemented a list scheduler (`scheduler.py`) aware of VLIW slot limits. This allowed packing vector operations (`valu`) efficiently. | |
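The packing idea can be illustrated with a simplified sketch. This is not the code in `scheduler.py`: it places ops greedily in program order, assumes single-cycle latency, and uses only the `valu` and `flow` slot counts mentioned elsewhere in this log (6 and 2). The op representation and the `list_schedule` name are hypothetical.

```python
from collections import defaultdict

# Slot counts taken from this log's notes (6 valu, 2 flow); other engines omitted.
SLOT_LIMITS = {"valu": 6, "flow": 2}

def list_schedule(ops, slot_limits=SLOT_LIMITS):
    """Greedy in-order scheduling sketch.

    ops: list of (engine, deps), where deps are indices of earlier ops
    whose results this op reads. Returns the cycle assigned to each op.
    Assumes every op has a latency of one cycle.
    """
    cycle_of = []
    used = defaultdict(lambda: defaultdict(int))  # cycle -> engine -> slots used
    for engine, deps in ops:
        # Earliest cycle at which all inputs are available.
        cycle = max((cycle_of[d] + 1 for d in deps), default=0)
        # Slide forward until this engine has a free slot in the bundle.
        while used[cycle][engine] >= slot_limits[engine]:
            cycle += 1
        used[cycle][engine] += 1
        cycle_of.append(cycle)
    return cycle_of

# Eight independent vector ops, then one op that reads the first two results.
ops = [("valu", [])] * 8 + [("valu", [0, 1])]
schedule = list_schedule(ops)
total_cycles = max(schedule) + 1
```

Here the first six ops fill cycle 0, and the remaining two plus the dependent op share cycle 1, so nine ops fit in two bundles.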
| ### 2. Active Load Deduplication | |
| **Implemented**: Yes (Rounds 0-3). | |
| **Impact**: Moderate. | |
| **Detail**: In early rounds only a few unique nodes are touched, so we load each one once with a scalar load and broadcast it. | |
| - Round 0 (1 node): Huge win (1 load vs 32). | |
| - Round 1 (2 nodes): Big win. | |
| - Round 3 (8 nodes): Break-even. The overhead of selecting the correct broadcasted value (`vselect` tree) grows exponentially with the round number, because the number of unique nodes doubles each round. | |
| **Tuning**: Optimal `active_threshold` found to be **4** (optimizes R0-R3). | |
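The break-even behavior follows from simple counting. The sketch below assumes that round `r` touches `2**r` unique nodes (consistent with the 1, 2, and 8 nodes listed above) and that choosing one of `n` broadcast values with a binary select tree takes `n - 1` selects per vector. Both function names are hypothetical.

```python
def unique_nodes(round_idx):
    """Unique tree nodes touched in a round, assuming all lanes start at the root."""
    return 2 ** round_idx

def select_ops(round_idx):
    """Selects per vector needed to pick one of n broadcast values with a binary tree."""
    return unique_nodes(round_idx) - 1

for r in range(5):
    print(f"round {r}: {unique_nodes(r):2d} scalar loads, {select_ops(r):2d} selects per vector")
```

Under this model round 3 needs 7 selects per vector and round 4 needs 15, which is in line with the "15+ ops" figure recorded for the Aggressive Active Set experiment.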
| ### 3. Mask Skipping | |
| **Implemented**: Yes. | |
| **Impact**: Moderate (Saved ~4 ops/vec/round in R0-R7). | |
| **Detail**: The `idx` wrapping logic is unnecessary when max `idx < n_nodes`. We skip it dynamically based on round number. | |
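A rough way to decide which rounds can skip the wrap is to bound the largest reachable `idx`. The sketch below assumes that `idx` starts at 0 and that each round moves to a child at `2*idx + 1` or `2*idx + 2`; neither detail is stated in this log. The value `N_NODES = 511` is purely illustrative, chosen because it makes the skippable range come out as R0-R7.

```python
def max_idx_after_round(round_idx):
    """Largest idx reachable after a round, assuming idx starts at 0 and
    each round steps to a child at 2*idx + 1 or 2*idx + 2."""
    idx = 0
    for _ in range(round_idx + 1):
        idx = 2 * idx + 2  # always take the larger child to get the maximum
    return idx

def needs_wrap(round_idx, n_nodes):
    """True if idx can reach n_nodes in this round, so the wrap must be emitted."""
    return max_idx_after_round(round_idx) >= n_nodes

def first_wrap_round(n_nodes):
    """First round in which the wrap logic cannot be skipped."""
    r = 0
    while not needs_wrap(r, n_nodes):
        r += 1
    return r

N_NODES = 511  # hypothetical tree size, for illustration only
skippable_rounds = list(range(first_wrap_round(N_NODES)))
```

The bound only covers rounds up to the first wrap; once indices reset, the analysis would have to restart from the new starting position.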
| ### 4. Scalar Offloading | |
| **Implemented**: Yes. | |
| **Impact**: Minor/Positive. | |
| **Detail**: Since `VALU` (Vector ALU) was the bottleneck (~90 cycles/round), we tried offloading some vectors to the `ALU` (Scalar ALU). | |
| - **Challenge**: `ALU` is less efficient per item (requires loop over 8 lanes + inefficient Scalar Hash sequence). | |
| - **Result**: Offloading ~2 vectors to `ALU` provided a small speedup (1,862 to 1,859 cycles). Aggressive offloading (6+ vectors) degraded performance because `ALU` became the new bottleneck and the `flow` selects used for wrapping added overhead. | |
| ### 5. Ray Tuning | |
| **Attempted**: Yes. | |
| **Blocking Issue**: The provided `ray` library was a source checkout without compiled binaries (`_raylet`), causing `ModuleNotFoundError`. | |
| **Workaround**: Implemented `manual_tuner.py` to perform a grid search over `active_threshold`, `mask_skip`, and `scalar_offload`. | |
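A grid search of this kind needs nothing beyond the standard library. The following is a minimal sketch, not the contents of `manual_tuner.py`: the search ranges are hypothetical, and `fake_measure` is a toy stand-in for building the kernel and counting cycles.

```python
from itertools import product

def grid_search(measure_cycles, grid):
    """Try every combination in `grid` and return (best_cycles, best_params)."""
    names = list(grid)
    best = None
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        cycles = measure_cycles(**params)
        if best is None or cycles < best[0]:
            best = (cycles, params)
    return best

# Hypothetical search ranges; the ranges actually searched are not recorded in this log.
GRID = {
    "active_threshold": [0, 2, 4, 8],
    "mask_skip": [False, True],
    "scalar_offload": [0, 2, 4, 6],
}

def fake_measure(active_threshold, mask_skip, scalar_offload):
    """Toy cost model for demonstration only; these are not real kernel timings."""
    return (2000
            + abs(active_threshold - 4) * 10
            - (50 if mask_skip else 0)
            + abs(scalar_offload - 2) * 3)

best_cycles, best_params = grid_search(fake_measure, GRID)
```

With three small parameter lists the full product is only 32 runs, so an exhaustive sweep is a reasonable substitute for a distributed tuner.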
| ## Failed/Discarded Ideas | |
| - **Scalar Wrapping on Flow**: Tried to use `flow` select for scalar wrapping. Failed due to limited `flow` slots (2 vs 6 VALU), causing massive stalls. | |
| - **Aggressive Active Set**: Tried extending Active Set to Round 4+. Failed due to `vselect` tree overhead (15+ ops) exceeding the cost of vector loads. | |
| - **Flow Arithmetic**: Investigated using `add_imm` on `flow` unit for compute. Discarded as it only supports scalar inputs, while hash computation is vectorized. | |
| ## Final Configuration | |
| - **Active Threshold**: 4 (Rounds 0-3 optimized). | |
| - **Mask Skip**: Enabled. | |
| - **Scalar Offload**: 2 vectors. | |
| - **Cycle Count**: 1,859. | |