Kernel Optimization Contest Analysis

Overview

The goal is to optimize a kernel function (KernelBuilder.build_kernel) to run as fast as possible on a simulated custom VLIW (Very Large Instruction Word) SIMD machine. The performance is measured in clock cycles.

Repository Structure & Key Files

perf_takehome.py: The main development file. Contains the KernelBuilder class where you implement the optimization logic. It also includes local tests (Tests class) and a reference scalar implementation of the system.
problem.py: Defines the simulated machine (Machine class), instruction set (alu, valu, load, store, flow), and the environment (Tree, Input).
tests/submission_tests.py: The authoritative validation script. It imports Machine from frozen_problem.py to ensure the simulator logic hasn't been tampered with. It runs your KernelBuilder from perf_takehome.py and checks correctness and speed.
tests/frozen_problem.py: A copy of problem.py used strictly for validation to prevent "cheating" by modifying the simulator.
watch_trace.py / watch_trace.html: Tools for visualizing the execution trace in Perfetto (Chrome), useful for debugging and profiling component utilization.

System Flow & Architecture

Input Generation: A random binary tree (Forest) and a batch of inputs (indices, values) are generated.
Kernel Building: KernelBuilder.build_kernel is called to generate a sequence of instructions (kb.instrs).
Simulation:
- A Machine is instantiated with the memory image and the generated instructions.
- The machine runs cycle-by-cycle.
- On each cycle, multiple "engines" (alu, valu, load, store, flow) execute instructions in parallel, limited by SLOT_LIMITS.
Verification: The machine's final memory state is compared against a reference Python implementation (reference_kernel2).

The Machine (VLIW SIMD)

VLEN: 8 (Vector Length).
Slot Limits per cycle:
- alu: 12 (Scalar arithmetic)
- valu: 6 (Vector arithmetic)
- load: 2 (Memory reads)
- store: 2 (Memory writes)
- flow: 1 (Control flow)
Memory: Flat 32-bit integer memory array.
Scratchpad: SCRATCH_SIZE (1536 ints). Serves as registers/cache.

Contest Mechanics

Optimization Target: Minimize machine.cycle.
Baseline: The starter code is a purely scalar implementation (~147,734 cycles).
Targets:
- < 2164 cycles: Claude Opus 4 baseline.
- < 1487 cycles: Claude Opus 4.5 (11.5 hours compute).
- < 1300 cycles: Invalid/Cheated solutions reference.
Anti-Cheat: The tests/ directory and frozen_problem.py must not be modified. Validation uses frozen_problem.py.

Current Implementation (Baseline)

The current build_kernel in perf_takehome.py implements the logic using only scalar alu and load/store operations, processing one item at a time. This fails to utilize the valu (vector) slots and the parallelism available in the alu slots (12 available, using ~1 per instruction bundle).

Next Steps

To achieve the target performance, the kernel needs to be vectorized (valu, vload, vstore) and likely pipelined (software pipelining) to maximize the utilization of all available slots per cycle, processing multiple inputs and hashing stages in parallel.