
Optimization Log

Goal

Target: fewer than 1,000 cycles on the VLIW SIMD kernel. Starting baseline: 4,781 cycles. Final result: 1,859 cycles (~2.6x speedup), so the sub-1,000 target was not reached.

Optimization Methods Attempted

1. Custom Instruction Scheduler

Implemented: Yes. Impact: High. Detail: Implemented a list scheduler (scheduler.py) that is aware of the per-cycle VLIW slot limits, so independent vector operations (valu) are packed into the same cycle instead of being issued one at a time.
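
The idea can be sketched as a greedy, in-order list scheduler. This is a minimal illustration, not the contents of scheduler.py: the `list_schedule` function, the `(engine, deps)` op format, the one-cycle latency, and the `load` limit are assumptions made for the example (only the 6 VALU and 2 flow slots appear in this log), and register hazards are ignored.

```python
from collections import defaultdict

# Illustrative per-cycle slot limits. "valu" and "flow" follow the numbers
# quoted in this log; "load" is an assumed value for the example.
SLOT_LIMITS = {"valu": 6, "flow": 2, "load": 2}

def list_schedule(ops, slot_limits=SLOT_LIMITS):
    """Place each op in the earliest cycle that respects deps and slot limits.

    ops: list of (engine, deps); deps are indices of earlier ops whose results
    are assumed to become visible one cycle after they issue.
    Returns a list of bundles, each a list of op indices issued in that cycle.
    """
    issue_cycle = {}   # op index -> cycle in which it was issued
    bundles = []       # bundles[c] = op indices issued in cycle c
    used = []          # used[c][engine] = slots already taken in cycle c
    for i, (engine, deps) in enumerate(ops):
        # An op cannot issue before all of its producers have completed.
        c = max((issue_cycle[d] + 1 for d in deps), default=0)
        while True:
            while len(bundles) <= c:
                bundles.append([])
                used.append(defaultdict(int))
            if used[c][engine] < slot_limits[engine]:
                break
            c += 1  # engine is full this cycle, try the next one
        bundles[c].append(i)
        used[c][engine] += 1
        issue_cycle[i] = c
    return bundles

# Eight independent valu ops, then one op that depends on the last of them.
ops = [("valu", [])] * 8 + [("valu", [7])]
print(list_schedule(ops))  # [[0, 1, 2, 3, 4, 5], [6, 7], [8]]
```

The first six ops fill the six VALU slots of cycle 0, the remaining two spill into cycle 1, and the dependent op has to wait for cycle 2.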

2. Active Load Deduplication

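Implemented: Yes (Rounds 0-3). Impact: Moderate. Detail: In the early rounds only a few unique nodes are touched, so each unique node is fetched once with a scalar load and broadcast across the vector lanes instead of being loaded again for every vector.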

  • Round 0 (1 node): Huge win (1 load vs 32).
  • Round 1 (2 nodes): Big win.
  • Round 3 (8 nodes): Break-even. The overhead of selecting the correct broadcast value (vselect tree) grows exponentially with the round number.

Tuning: The best active_threshold was found to be 4, which applies the optimization to R0-R3.
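
To make the select-tree cost concrete, here is a small model that uses plain Python lists as 8-lane vectors. It is a sketch under assumptions, not the kernel's code: `broadcast`, `vselect`, and `gather_via_select_tree` are hypothetical helpers, and the per-lane index is assumed to be already relative to the first node of the round.

```python
VLEN = 8  # lanes per vector, matching the "8 lanes" mentioned in this log

def broadcast(x):
    """Replicate one scalar across all lanes."""
    return [x] * VLEN

def vselect(mask, a, b):
    """Per-lane select: take a where mask is set, otherwise b."""
    return [ai if m else bi for m, ai, bi in zip(mask, a, b)]

def gather_via_select_tree(node_values, local_idx):
    """Return (vector, n_selects) where vector[lane] = node_values[local_idx[lane]].

    node_values: unique node values of one round (length must be a power of two).
    local_idx:   per-lane index into node_values.
    Only broadcasts and selects are used, no per-lane loads.
    """
    level = [broadcast(v) for v in node_values]
    n_selects = 0
    bit = 0
    while len(level) > 1:
        # Each tree level resolves one bit of the index, halving the candidates.
        mask = [(i >> bit) & 1 for i in local_idx]
        level = [vselect(mask, level[j + 1], level[j])
                 for j in range(0, len(level), 2)]
        n_selects += len(level)
        bit += 1
    return level[0], n_selects

vec, n = gather_via_select_tree(list(range(10, 18)), [3, 3, 0, 7, 5, 1, 2, 6])
print(vec, n)  # [13, 13, 10, 17, 15, 11, 12, 16] 7
```

A tree over n unique nodes needs n - 1 selects per vector, so the cost doubles with each additional round: 7 selects for 8 nodes, 15 for 16 nodes.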

3. Mask Skipping

Implemented: Yes. Impact: Moderate (Saved ~4 ops/vec/round in R0-R7). Detail: The idx wrapping logic is unnecessary when max idx < n_nodes. We skip it dynamically based on round number.
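
A compile-time check of this kind might look like the sketch below. The traversal rule is an assumption, since this log does not state it: idx is taken to start at 0 and to move to child 2*idx + 1 or 2*idx + 2 each round. The helper names and the 15-node tree in the example are hypothetical.

```python
def max_idx_after_round(round_idx):
    """Largest index reachable after the update of round `round_idx` (0-based).

    Assumes idx starts at 0 and each round maps idx to 2*idx + 1 or 2*idx + 2,
    so the worst case always follows the 2*idx + 2 branch.
    """
    idx = 0
    for _ in range(round_idx + 1):
        idx = 2 * idx + 2
    return idx

def wrap_needed(round_idx, n_nodes):
    """True if some lane could leave the tree in this round and must be wrapped."""
    return max_idx_after_round(round_idx) >= n_nodes

# Example with a hypothetical 15-node tree: emit the wrap ops only where needed.
for r in range(4):
    print(r, max_idx_after_round(r), wrap_needed(r, n_nodes=15))
# 0 2 False
# 1 6 False
# 2 14 False
# 3 30 True
```

Because the decision depends only on the round number and the tree size, it can be made while generating the kernel, and the wrap instructions are simply never emitted for the early rounds.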

4. Scalar Offloading

Implemented: Yes. Impact: Minor/Positive. Detail: Since VALU (Vector ALU) was the bottleneck (~90 cycles/round), we tried offloading some vectors to the ALU (Scalar ALU).

  • Challenge: ALU is less efficient per item (requires loop over 8 lanes + inefficient Scalar Hash sequence).
  • Result: Offloading ~2 vectors to the ALU gave a small speedup (1,862 -> 1,859 cycles). Aggressive offloading (6+ vectors) degraded performance because the ALU became the new bottleneck and the flow selects needed for wrapping added overhead.

5. Ray Tuning

Attempted: Yes. Blocking Issue: The provided ray library was a source checkout without compiled binaries (_raylet), causing ModuleNotFoundError. Workaround: Implemented manual_tuner.py to perform a grid search over active_threshold, mask_skip, and scalar_offload.
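
A grid search of this kind needs nothing beyond the standard library. The sketch below is not manual_tuner.py itself: `grid_search`, the candidate values in `GRID`, and `fake_evaluate` are assumptions for illustration. In particular, `fake_evaluate` is a synthetic stand-in with no relation to measured cycle counts; a real tuner would build the kernel with each configuration and return the simulator's cycle count.

```python
import itertools

def grid_search(evaluate, grid):
    """Try every combination in `grid`; return (best_config, best_score).

    evaluate: callable taking a config dict and returning a cost (lower is better).
    grid:     dict mapping parameter name -> list of candidate values.
    """
    names = list(grid)
    best_cfg, best_score = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        cfg = dict(zip(names, values))
        score = evaluate(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Synthetic placeholder cost so the sketch runs on its own (NOT real data).
def fake_evaluate(cfg):
    return (abs(cfg["active_threshold"] - 4) * 10
            + (0 if cfg["mask_skip"] else 5)
            + abs(cfg["scalar_offload"] - 2))

# Hypothetical candidate values for the three parameters named in this log.
GRID = {
    "active_threshold": [0, 1, 2, 3, 4, 5],
    "mask_skip": [False, True],
    "scalar_offload": [0, 2, 4, 6],
}

best_cfg, best_score = grid_search(fake_evaluate, GRID)
print(best_cfg, best_score)
```

With three small parameter ranges the full product is only a few dozen kernel builds, which is why an exhaustive search is a workable substitute when Ray is unavailable.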

Failed/Discarded Ideas

  • Scalar Wrapping on Flow: Tried to use flow select for scalar wrapping. Failed due to limited flow slots (2 vs 6 VALU), causing massive stalls.
  • Aggressive Active Set: Tried extending Active Set to Round 4+. Failed due to vselect tree overhead (15+ ops) exceeding the cost of vector loads.
  • Flow Arithmetic: Investigated using add_imm on flow unit for compute. Discarded as it only supports scalar inputs, while hash computation is vectorized.

Final Configuration

  • Active Threshold: 4 (Rounds 0-3 optimized).
  • Mask Skip: Enabled.
  • Scalar Offload: 2 vectors.
  • Cycle Count: 1,859.