Kernel Optimization Contest Analysis
Overview
The goal is to optimize a kernel function (KernelBuilder.build_kernel) to run as fast as possible on a simulated custom VLIW (Very Large Instruction Word) SIMD machine. The performance is measured in clock cycles.
Repository Structure & Key Files
perf_takehome.py: The main development file. Contains theKernelBuilderclass where you implement the optimization logic. It also includes local tests (Testsclass) and a reference scalar implementation of the system.problem.py: Defines the simulated machine (Machineclass), instruction set (alu,valu,load,store,flow), and the environment (Tree,Input).tests/submission_tests.py: The authoritative validation script. It importsMachinefromfrozen_problem.pyto ensure the simulator logic hasn't been tampered with. It runs yourKernelBuilderfromperf_takehome.pyand checks correctness and speed.tests/frozen_problem.py: A copy ofproblem.pyused strictly for validation to prevent "cheating" by modifying the simulator.watch_trace.py/watch_trace.html: Tools for visualizing the execution trace in Perfetto (Chrome), useful for debugging and profiling component utilization.
System Flow & Architecture
- Input Generation: A random binary tree (
Forest) and a batch of inputs (indices,values) are generated. - Kernel Building:
KernelBuilder.build_kernelis called to generate a sequence of instructions (kb.instrs). - Simulation:
- A
Machineis instantiated with the memory image and the generated instructions. - The machine runs cycle-by-cycle.
- On each cycle, multiple "engines" (
alu,valu,load,store,flow) execute instructions in parallel, limited bySLOT_LIMITS.
- A
- Verification: The machine's final memory state is compared against a reference Python implementation (
reference_kernel2).
The Machine (VLIW SIMD)
- VLEN: 8 (Vector Length).
- Slot Limits per cycle:
alu: 12 (Scalar arithmetic)valu: 6 (Vector arithmetic)load: 2 (Memory reads)store: 2 (Memory writes)flow: 1 (Control flow)
- Memory: Flat 32-bit integer memory array.
- Scratchpad:
SCRATCH_SIZE(1536 ints). Serves as registers/cache.
Contest Mechanics
- Optimization Target: Minimize
machine.cycle. - Baseline: The starter code is a purely scalar implementation (~147,734 cycles).
- Targets:
- < 2164 cycles: Claude Opus 4 baseline.
- < 1487 cycles: Claude Opus 4.5 (11.5 hours compute).
- < 1300 cycles: Invalid/Cheated solutions reference.
- Anti-Cheat: The
tests/directory andfrozen_problem.pymust not be modified. Validation usesfrozen_problem.py.
Current Implementation (Baseline)
The current build_kernel in perf_takehome.py implements the logic using only scalar alu and load/store operations, processing one item at a time. This fails to utilize the valu (vector) slots and the parallelism available in the alu slots (12 available, using ~1 per instruction bundle).
Next Steps
To achieve the target performance, the kernel needs to be vectorized (valu, vload, vstore) and likely pipelined (software pipelining) to maximize the utilization of all available slots per cycle, processing multiple inputs and hashing stages in parallel.