
Optimization Log

Goal

Achieve < 1000 cycles on the VLIW SIMD Kernel. Starting Baseline: ~147,734 cycles (Scalar). Reference Best: < 1363 cycles (Claude Opus 4.5 Improved).

Optimization Methods (Comprehensive List)

  1. Vectorization (SIMD): Utilizing valu, vload, vstore to process 8 items per instruction.
  2. Instruction Level Parallelism (ILP): Filling all VLIW slots (alu x12, valu x6, load x2) per cycle.
  3. Strength Reduction / Algebraic Simplification: Replacing expensive op sequences (e.g., add + shift + add) with cheaper equivalents (e.g., a single multiply_add).
  4. Common Subexpression Elimination (CSE): Loading shared data (e.g., tree nodes) once per batch instead of per item.
  5. Loop Unrolling: Reducing loop overhead and exposing more ILP.
  6. Software Pipelining: Interleaving stages of different items to hide latency and fill slots.
  7. Register Caching: Keeping frequently used data (indices, values, top interaction tree nodes) in scratchpad to avoid memory access.
  8. Data Layout Optimization: (Limited capability) Sorting/Grouping data to maximize locality or cache hits (deduplication).
  9. Dead Code Elimination: Removing debug or unused instructions.
  10. Constant Folding: Pre-calculating constants.
  11. Active Set Processing: Tailoring the loop to handle only active/unique items (e.g., specific tree nodes) to minimize work.
  12. Bit Twiddling: Optimizing boolean logic and flag updates.
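
The strength-reduction idea in item 3 can be sketched in plain Python (the function names and constants here are illustrative, not the kernel's real API). An add + shift + add chain computes x + c + (x << k), which algebraically equals x * (1 + 2^k) + c and therefore fits a single fused multiply_add:

```python
def hash_stage_three_ops(x, k, c):
    # Original sequence: three separate slots (add, shift, add)
    t = x + c           # add
    s = x << k          # shift
    return t + s        # add

def hash_stage_fused(x, k, c):
    # Equivalent single multiply_add: x * (1 + 2**k) + c
    return x * (1 + (1 << k)) + c

# The two forms are algebraically identical.
assert hash_stage_three_ops(12345, 3, 7) == hash_stage_fused(12345, 3, 7)
```

On a VLIW machine this matters doubly: it is not just one op instead of three, it also frees two issue slots per stage for other work.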

Applied Strategy Combinations

Attempt 1: The "Vectorized Algebraic" Approach

Combination: Vectorization + Strength Reduction + Register Caching.

  • Vectorization: Process batch of 256 as 32 vectors of 8.
  • Strength Reduction: Simplify Hash Stages 0, 2, 4 using multiply_add (collapsing 3 ops to 1); simplify the other stages where possible.
  • Register Caching: Keep all indices and values in scratchpad; do NOT load/store them every round. Only the final result is stored to memory.
  • Expected Result: Significant speedup.
  • Bottleneck: Memory Bandwidth for node_val (random access).
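
The structure of Attempt 1 can be sketched as pure Python (VLEN, hash_stage, and the constants below are hypothetical stand-ins for the kernel's real ops, not its actual API):

```python
VLEN = 8  # vector width: one valu/vload/vstore handles 8 lanes

def hash_stage(v, mult, add):
    # One fused stage per lane: a multiply_add replaces add + shift + add
    return [(x * mult + add) & 0xFFFFFFFF for x in v]

def process_batch(values):
    assert len(values) == 256
    # "Register caching": the working set stays in this local scratchpad
    # across all rounds; memory is touched only for the final store.
    vectors = [values[i:i + VLEN] for i in range(0, 256, VLEN)]  # 32 vectors of 8
    for _ in range(8):  # illustrative round count
        vectors = [hash_stage(v, 9, 0x9E3779B9) for v in vectors]
    return [x for v in vectors for x in v]  # single final store
```

The key structural point is that the per-round loop body contains no loads or stores of the indices/values themselves, matching the register-caching bullet above.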

Attempt 2: The "Active Node" Deduplication

Combination: Active Set Processing + ILP.

  • Concept: In early rounds (0-7), the number of unique nodes accessed (< 256) is smaller than the batch size (256).
  • Method:
    • Round 0: Load Node 0 (scalar). Broadcast. Compute all.
    • Round 1: Load Node 1, 2. Compute items with idx 1, items with idx 2.
    • ...
    • Round K: "Gather" items by index (conceptually) or iterate over active nodes.
  • Win: Reduces node_val loads from 256 per round to (number of unique nodes) per round.
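
A minimal sketch of the deduplication win, assuming a dict-backed node_val stand-in (the real kernel would gather via indices, but the load-count arithmetic is the same):

```python
def round_loads_dedup(indices, node_val):
    # Load each unique node once, then "broadcast" the cached value to
    # every item that indexes it; no further memory traffic per item.
    loads = 0
    cache = {}
    for idx in indices:
        if idx not in cache:
            cache[idx] = node_val[idx]  # one load per unique node
            loads += 1
    values = [cache[idx] for idx in indices]
    return loads, values

# Round 0: every item points at node 0 -> 1 load instead of 256.
indices = [0] * 256
loads, values = round_loads_dedup(indices, {0: 42})
assert loads == 1 and values == [42] * 256
```

In round K the load count grows with the number of active nodes (at most 2^K in a binary tree walk), so early rounds see the largest savings.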

Attempt 3: Full Pipelined Saturation

Combination: Loop Unrolling + Software Pipelining + All Previous.

  • Concept: Completely fill valu and alu slots by processing multiple rounds or multiple vectors simultaneously.
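
The pipelining pattern can be sketched as follows (load and compute here are abstract stage callbacks, not the kernel's real instructions): the load of vector i+1 is issued alongside the compute of vector i, so load latency hides behind valu work and both slot types stay busy.

```python
def pipelined(vectors, load, compute):
    results = []
    inflight = load(vectors[0])             # prologue: prime the pipeline
    for i in range(1, len(vectors)):
        nxt = load(vectors[i])              # slot A: load stage for the NEXT vector
        results.append(compute(inflight))   # slot B: compute stage for the CURRENT one
        inflight = nxt
    results.append(compute(inflight))       # epilogue: drain the last vector
    return results
```

On the real machine the two statements inside the loop would be packed into the same VLIW bundle; here they merely model that the loop body always has both a load and a compute in flight.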

Execution Log

  • (Upcoming) Implementation of Attempt 1.