# Optimization Log
## Goal
Achieve < 1000 cycles on the VLIW SIMD Kernel.
Starting Baseline: 4,781 cycles.
Final Result: **1,859 cycles** (~2.6x speedup over the baseline; the < 1000-cycle goal was not reached).
## Optimization Methods Attempted
### 1. Custom Instruction Scheduler
**Implemented**: Yes.
**Impact**: High.
**Detail**: Implemented a list scheduler (`scheduler.py`) that respects the per-cycle VLIW slot limits, so independent vector operations (`valu`) are packed into the same instruction bundle.
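
The idea behind a slot-aware list scheduler can be sketched as follows. This is a minimal illustration, not the contents of `scheduler.py`: the op tuple format, the single-cycle latency, and the slot limits passed in are all assumptions made for the example.

```python
def list_schedule(ops, slot_limits):
    """Greedy list scheduling under per-cycle slot limits.

    ops: list of (name, engine, deps) in priority (program) order.
    slot_limits: dict engine -> max ops per cycle (hypothetical values).
    Assumes a result is only visible in the cycle after its op issues.
    Returns a list of bundles, one list of op names per cycle.
    """
    deps = {name: set(d) for name, _, d in ops}
    engine = {name: e for name, e, _ in ops}
    remaining = [name for name, _, _ in ops]
    done = set()
    bundles = []
    while remaining:
        free = dict(slot_limits)  # fresh slot budget for this cycle
        bundle = []
        for name in remaining:
            # Ready = all deps issued in an earlier cycle, and a slot is free.
            if deps[name] <= done and free[engine[name]] > 0:
                free[engine[name]] -= 1
                bundle.append(name)
        if not bundle:
            raise ValueError("dependency cycle or unsatisfiable op")
        done.update(bundle)
        remaining = [n for n in remaining if n not in done]
        bundles.append(bundle)
    return bundles
```

With a limit of 6 `valu` slots, eight independent vector ops occupy two cycles, and any op that depends on the last of them issues in the third.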
### 2. Active Load Deduplication
**Implemented**: Yes (Rounds 0-3).
**Impact**: Moderate.
**Detail**: In early rounds only a few unique nodes are reachable, so each one is fetched once with a scalar load and broadcast, instead of being loaded separately for every vector.
- Round 0 (1 node): Huge win (1 load vs 32).
- Round 1 (2 nodes): Big win.
- Round 3 (8 nodes): Break-even. The cost of selecting the correct broadcast value (`vselect` tree) roughly doubles each round, in step with the number of unique nodes.
**Tuning**: Optimal `active_threshold` found to be **4** (optimizes R0-R3).
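
The growth of the select overhead can be illustrated with a small model. The sketch below assumes that round `r` has `2**r` candidate nodes (consistent with the 1, 2, and 8 nodes listed above), that the candidate index has been rebased to `0..N-1`, and that only two-way selects are available. The function names are hypothetical and plain Python values stand in for vector registers.

```python
def select_tree(values, idx_bits):
    """Pick values[idx] using only two-way selects.

    values: list whose length is 2 ** len(idx_bits).
    idx_bits: bits of idx, least significant first.
    Returns (selected value, number of selects used).
    """
    level = list(values)
    n_selects = 0
    for bit in idx_bits:
        nxt = []
        # Each pair of neighbours collapses to one value per tree level.
        for lo, hi in zip(level[0::2], level[1::2]):
            nxt.append(hi if bit else lo)
            n_selects += 1
        level = nxt
    return level[0], n_selects


def selects_per_round(round_no):
    """Selects needed per vector, assuming 2 ** round_no candidate nodes."""
    return 2 ** round_no - 1
```

Under these assumptions the per-vector select count is 0, 1, 3, 7, 15 for rounds 0 through 4, which is why the benefit disappears around round 3.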
### 3. Mask Skipping
**Implemented**: Yes.
**Impact**: Moderate (Saved ~4 ops/vec/round in R0-R7).
**Detail**: The `idx` wrapping logic is unnecessary when max `idx < n_nodes`. We skip it dynamically based on round number.
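
One way to decide per round whether the wrap can be skipped is to bound the largest reachable index. The sketch below assumes every lane starts at index 0 and moves to child `2*idx + 1` or `2*idx + 2` each round. That update rule is not stated in this log and is an assumption, as are the function names. The bound only holds until the first wrap occurs.

```python
def max_idx_after_round(round_no):
    """Largest index reachable after `round_no` (0-based), starting from 0.

    Assumes the hypothetical child rule idx -> 2*idx + 1 or 2*idx + 2.
    """
    idx = 0
    for _ in range(round_no + 1):
        idx = 2 * idx + 2  # always take the larger child
    return idx


def can_skip_wrap(round_no, n_nodes):
    """True if no lane can reach n_nodes in this round, so no wrap is needed."""
    return max_idx_after_round(round_no) < n_nodes
```

Which rounds qualify in practice depends on the real update rule and on `n_nodes`.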
### 4. Scalar Offloading
**Implemented**: Yes.
**Impact**: Minor/Positive.
**Detail**: Since `VALU` (Vector ALU) was the bottleneck (~90 cycles/round), we tried offloading some vectors to the `ALU` (Scalar ALU).
- **Challenge**: `ALU` is less efficient per item (requires loop over 8 lanes + inefficient Scalar Hash sequence).
- **Result**: Offloading ~2 vectors to `ALU` provided a small speedup (1862 -> 1859 cycles). Aggressive offloading (6+ vectors) degraded performance due to `ALU` becoming the new bottleneck and overhead of `flow` selects for wrapping.
### 5. Ray Tuning
**Attempted**: Yes.
**Blocking Issue**: The provided `ray` library was a source checkout without compiled binaries (`_raylet`), causing `ModuleNotFoundError`.
**Workaround**: Implemented `manual_tuner.py` to perform a grid search over `active_threshold`, `mask_skip`, and `scalar_offload`.
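
A grid search of this kind can be sketched in a few lines. This is not the contents of `manual_tuner.py`: `measure_cycles` is a hypothetical callback that would build the kernel with the given parameters and return its cycle count, and the grid values in the usage note are placeholders.

```python
import itertools


def grid_search(measure_cycles, grid):
    """Evaluate every parameter combination and keep the fastest one.

    measure_cycles: callable taking the parameters as keyword arguments
        and returning a cycle count (hypothetical; supplied by the caller).
    grid: dict mapping parameter name -> list of candidate values.
    Returns (best_config, best_cycles).
    """
    names = list(grid)
    best_config, best_cycles = None, float("inf")
    for values in itertools.product(*(grid[n] for n in names)):
        config = dict(zip(names, values))
        cycles = measure_cycles(**config)
        if cycles < best_cycles:
            best_config, best_cycles = config, cycles
    return best_config, best_cycles
```

A call might look like `grid_search(run_kernel, {"active_threshold": [2, 4, 8], "mask_skip": [False, True], "scalar_offload": [0, 2, 6]})`, where `run_kernel` is a stand-in for whatever function simulates the kernel.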
## Failed/Discarded Ideas
- **Scalar Wrapping on Flow**: Tried to use `flow` select for scalar wrapping. Failed due to limited `flow` slots (2 vs 6 VALU), causing massive stalls.
- **Aggressive Active Set**: Tried extending Active Set to Round 4+. Failed due to `vselect` tree overhead (15+ ops) exceeding the cost of vector loads.
- **Flow Arithmetic**: Investigated using `add_imm` on `flow` unit for compute. Discarded as it only supports scalar inputs, while hash computation is vectorized.
## Final Configuration
- **Active Threshold**: 4 (Rounds 0-3 optimized).
- **Mask Skip**: Enabled.
- **Scalar Offload**: 2 vectors.
- **Cycle Count**: 1,859.