# Kernel Optimization Contest Analysis ## Overview The goal is to optimize a kernel function (`KernelBuilder.build_kernel`) to run as fast as possible on a simulated custom VLIW (Very Large Instruction Word) SIMD machine. The performance is measured in clock cycles. ## Repository Structure & Key Files - **`perf_takehome.py`**: The main development file. Contains the `KernelBuilder` class where you implement the optimization logic. It also includes local tests (`Tests` class) and a reference scalar implementation of the system. - **`problem.py`**: Defines the simulated machine (`Machine` class), instruction set (`alu`, `valu`, `load`, `store`, `flow`), and the environment (`Tree`, `Input`). - **`tests/submission_tests.py`**: The authoritative validation script. It imports `Machine` from `frozen_problem.py` to ensure the simulator logic hasn't been tampered with. It runs your `KernelBuilder` from `perf_takehome.py` and checks correctness and speed. - **`tests/frozen_problem.py`**: A copy of `problem.py` used strictly for validation to prevent "cheating" by modifying the simulator. - **`watch_trace.py` / `watch_trace.html`**: Tools for visualizing the execution trace in Perfetto (Chrome), useful for debugging and profiling component utilization. ## System Flow & Architecture 1. **Input Generation**: A random binary tree (`Forest`) and a batch of inputs (`indices`, `values`) are generated. 2. **Kernel Building**: `KernelBuilder.build_kernel` is called to generate a sequence of instructions (`kb.instrs`). 3. **Simulation**: - A `Machine` is instantiated with the memory image and the generated instructions. - The machine runs cycle-by-cycle. - On each cycle, multiple "engines" (`alu`, `valu`, `load`, `store`, `flow`) execute instructions in parallel, limited by `SLOT_LIMITS`. 4. **Verification**: The machine's final memory state is compared against a reference Python implementation (`reference_kernel2`). ### The Machine (VLIW SIMD) - **VLEN**: 8 (Vector Length). - **Slot Limits** per cycle: - `alu`: 12 (Scalar arithmetic) - `valu`: 6 (Vector arithmetic) - `load`: 2 (Memory reads) - `store`: 2 (Memory writes) - `flow`: 1 (Control flow) - **Memory**: Flat 32-bit integer memory array. - **Scratchpad**: `SCRATCH_SIZE` (1536 ints). Serves as registers/cache. ## Contest Mechanics - **Optimization Target**: Minimize `machine.cycle`. - **Baseline**: The starter code is a purely scalar implementation (~147,734 cycles). - **Targets**: - < 2164 cycles: Claude Opus 4 baseline. - < 1487 cycles: Claude Opus 4.5 (11.5 hours compute). - < 1300 cycles: Invalid/Cheated solutions reference. - **Anti-Cheat**: The `tests/` directory and `frozen_problem.py` must not be modified. Validation uses `frozen_problem.py`. ## Current Implementation (Baseline) The current `build_kernel` in `perf_takehome.py` implements the logic using only scalar `alu` and `load`/`store` operations, processing one item at a time. This fails to utilize the `valu` (vector) slots and the parallelism available in the `alu` slots (12 available, using ~1 per instruction bundle). ## Next Steps To achieve the target performance, the kernel needs to be vectorized (`valu`, `vload`, `vstore`) and likely pipelined (software pipelining) to maximize the utilization of all available slots per cycle, processing multiple inputs and hashing stages in parallel.