# Optimization Log

## Goal
Achieve < 1000 cycles on the VLIW SIMD Kernel.
Starting Baseline: 4,781 cycles.
Final Result: **1,859 cycles** (~2.6x speedup; the < 1000-cycle goal was not reached).

## Optimization Methods Attempted

### 1. Custom Instruction Scheduler
**Implemented**: Yes.
**Impact**: High.
**Detail**: Implemented a list scheduler (`scheduler.py`) aware of VLIW slot limits. This allowed packing vector operations (`valu`) efficiently.
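As a rough illustration of slot-aware list scheduling, here is a minimal sketch. It is not the contents of `scheduler.py`: the function name `list_schedule`, the `(name, engine)` op format, the unit-latency assumption, and the program-order priority are all assumptions made for this example.

```python
from collections import defaultdict


def list_schedule(ops, deps, slot_limits):
    """Greedy list scheduler (illustrative sketch, unit latency assumed).

    ops:         list of (name, engine) pairs in program order
    deps:        dict mapping an op name to the names it depends on
    slot_limits: dict mapping an engine to its per-cycle slot count
    Returns a list of bundles, one list of op names per cycle.
    """
    remaining = list(ops)
    done_cycle = {}  # op name -> cycle in which it was issued
    bundles = []
    while remaining:
        cycle = len(bundles)
        used = defaultdict(int)
        bundle, deferred = [], []
        for name, engine in remaining:
            # An op is ready only if every dependency issued in an earlier cycle.
            ready = all(
                d in done_cycle and done_cycle[d] < cycle
                for d in deps.get(name, ())
            )
            if ready and used[engine] < slot_limits[engine]:
                used[engine] += 1
                done_cycle[name] = cycle
                bundle.append(name)
            else:
                deferred.append((name, engine))
        if not bundle:
            raise ValueError("no op can be issued: dependency cycle or zero slots")
        bundles.append(bundle)
        remaining = deferred
    return bundles


# Eight independent vector ops plus one op that needs v0 and v1,
# packed into 6 valu slots per cycle.
example_ops = [(f"v{i}", "valu") for i in range(8)] + [("sum", "valu")]
example_deps = {"sum": {"v0", "v1"}}
example_bundles = list_schedule(example_ops, example_deps, {"valu": 6})
```

With 6 `valu` slots, the first cycle takes `v0` through `v5`, and the second takes `v6`, `v7`, and the dependent `sum`. A production scheduler would typically prioritize by critical-path length instead of program order.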

### 2. Active Load Deduplication
**Implemented**: Yes (Rounds 0-3).
**Impact**: Moderate.
**Detail**: In early rounds only a few unique nodes are touched, so we load each one once with a scalar load and broadcast it instead of issuing vector loads.
-   Round 0 (1 node): Huge win (1 load vs 32).
-   Round 1 (2 nodes): Big win.
-   Round 3 (8 nodes): Break-even. The overhead of selecting the correct broadcasted value (`vselect` tree) grows exponentially with the round number, because the number of unique nodes doubles each round.
**Tuning**: Optimal `active_threshold` found to be **4** (optimizes R0-R3).
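The growth in selection cost can be sketched with a small model. This assumes the `vselect` tree is a binary tree of two-way selects over a power-of-two number of broadcast values; `tree_select` is a hypothetical helper written for this note, not part of the kernel.

```python
def tree_select(values, idx):
    """Pick values[idx] using only two-way selects.

    Assumes len(values) is a power of two. Returns (value, selects_used).
    Each tree level consumes one bit of idx, lowest bit first.
    """
    level = list(values)
    selects = 0
    bit = 0
    while len(level) > 1:
        take_hi = (idx >> bit) & 1
        nxt = []
        for lo, hi in zip(level[0::2], level[1::2]):
            nxt.append(hi if take_hi else lo)  # one two-way select
            selects += 1
        level = nxt
        bit += 1
    return level[0], selects


# Select count per round, assuming 2**r unique nodes in round r.
selects_per_round = {r: tree_select(list(range(2 ** r)), 0)[1] for r in range(5)}
```

Under these assumptions a tree over `n` values costs `n - 1` selects per vector: 0, 1, 3, 7, and 15 for rounds 0 through 4.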

### 3. Mask Skipping
**Implemented**: Yes.
**Impact**: Moderate (Saved ~4 ops/vec/round in R0-R7).
**Detail**: The `idx` wrapping logic is unnecessary when max `idx < n_nodes`. We skip it dynamically based on round number.
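A minimal sketch of the skip test, under assumptions this log does not state: every lane starts at `idx = 0`, each round moves to child `2*idx + 1` or `2*idx + 2`, and wrapping resets `idx` once it reaches `n_nodes`. The helper names and the `n_nodes = 63` example value are hypothetical, and the bound only holds until the first wrap occurs.

```python
def max_idx_after_round(round_idx):
    """Largest idx any lane can hold after round `round_idx` (0-based),
    assuming idx starts at 0 and the worst case is always 2*idx + 2."""
    return (1 << (round_idx + 2)) - 2


def can_skip_wrap(round_idx, n_nodes):
    """True when no lane can reach n_nodes in this round,
    so the wrap check is dead code for the round."""
    return max_idx_after_round(round_idx) < n_nodes


# Hypothetical tree size, chosen only to show where the cutoff falls.
skippable = [r for r in range(8) if can_skip_wrap(r, 63)]
```

Because the decision depends only on the round number and `n_nodes`, it can be made when the instruction stream is generated, with no per-lane cost at run time.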

### 4. Scalar Offloading
**Implemented**: Yes.
**Impact**: Minor/Positive.
**Detail**: Since `VALU` (Vector ALU) was the bottleneck (~90 cycles/round), we tried offloading some vectors to the `ALU` (Scalar ALU).
-   **Challenge**: `ALU` is less efficient per item (requires loop over 8 lanes + inefficient Scalar Hash sequence).
-   **Result**: Offloading ~2 vectors to `ALU` gave a small speedup (1,862 -> 1,859 cycles). Aggressive offloading (6+ vectors) degraded performance because `ALU` became the new bottleneck and the `flow` selects needed for wrapping added overhead.

### 5. Ray Tuning
**Attempted**: Yes.
**Blocking Issue**: The provided `ray` library was a source checkout without compiled binaries (`_raylet`), causing `ModuleNotFoundError`.
**Workaround**: Implemented `manual_tuner.py` to perform a grid search over `active_threshold`, `mask_skip`, and `scalar_offload`.
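An exhaustive grid search of this kind can be sketched in a few lines. This is not the contents of `manual_tuner.py`: `grid_search` and the `measure_cycles` callback are hypothetical names, and `toy_cost` is a made-up stand-in for building and simulating the kernel, not measured data.

```python
from itertools import product


def grid_search(measure_cycles, grid):
    """Try every combination in `grid` and keep the lowest cycle count.

    measure_cycles: callable taking the parameters as keyword arguments
    grid:           dict mapping a parameter name to its candidate values
    Returns (best_config, best_cycles).
    """
    names = list(grid)
    best_cfg, best_cycles = None, None
    for values in product(*(grid[n] for n in names)):
        cfg = dict(zip(names, values))
        cycles = measure_cycles(**cfg)
        if best_cycles is None or cycles < best_cycles:
            best_cfg, best_cycles = cfg, cycles
    return best_cfg, best_cycles


def toy_cost(active_threshold, mask_skip, scalar_offload):
    # Made-up cost surface standing in for a real simulator run.
    return (100 + 10 * abs(active_threshold - 4)
            + (0 if mask_skip else 5) + abs(scalar_offload - 2))


best_cfg, best_cycles = grid_search(
    toy_cost,
    {
        "active_threshold": [0, 2, 4, 8],
        "mask_skip": [False, True],
        "scalar_offload": [0, 2, 4, 6],
    },
)
```

With three small parameter ranges the full grid is only a few dozen runs, so an exhaustive sweep is practical without a distributed tuner.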

## Failed/Discarded Ideas
-   **Scalar Wrapping on Flow**: Tried to use `flow` select for scalar wrapping. Failed due to limited `flow` slots (2 vs 6 VALU), causing massive stalls.
-   **Aggressive Active Set**: Tried extending Active Set to Round 4+. Failed due to `vselect` tree overhead (15+ ops) exceeding the cost of vector loads.
-   **Flow Arithmetic**: Investigated using `add_imm` on `flow` unit for compute. Discarded as it only supports scalar inputs, while hash computation is vectorized.

## Final Configuration
-   **Active Threshold**: 4 (Rounds 0-3 optimized).
-   **Mask Skip**: Enabled.
-   **Scalar Offload**: 2 vectors.
-   **Cycle Count**: 1,859.