| # Kernel Optimization Contest Analysis | |
| ## Overview | |
| The goal is to optimize a kernel function (`KernelBuilder.build_kernel`) to run as fast as possible on a simulated custom VLIW (Very Large Instruction Word) SIMD machine. The performance is measured in clock cycles. | |
| ## Repository Structure & Key Files | |
| - **`perf_takehome.py`**: The main development file. Contains the `KernelBuilder` class where you implement the optimization logic. It also includes local tests (`Tests` class) and a reference scalar implementation of the system. | |
| - **`problem.py`**: Defines the simulated machine (`Machine` class), instruction set (`alu`, `valu`, `load`, `store`, `flow`), and the environment (`Tree`, `Input`). | |
| - **`tests/submission_tests.py`**: The authoritative validation script. It imports `Machine` from `frozen_problem.py` to ensure the simulator logic hasn't been tampered with. It runs your `KernelBuilder` from `perf_takehome.py` and checks correctness and speed. | |
| - **`tests/frozen_problem.py`**: A copy of `problem.py` used strictly for validation to prevent "cheating" by modifying the simulator. | |
| - **`watch_trace.py` / `watch_trace.html`**: Tools for visualizing the execution trace in Perfetto (Chrome), useful for debugging and profiling component utilization. | |
| ## System Flow & Architecture | |
| 1. **Input Generation**: A random binary tree (`Forest`) and a batch of inputs (`indices`, `values`) are generated. | |
| 2. **Kernel Building**: `KernelBuilder.build_kernel` is called to generate a sequence of instructions (`kb.instrs`). | |
| 3. **Simulation**: | |
| - A `Machine` is instantiated with the memory image and the generated instructions. | |
| - The machine runs cycle-by-cycle. | |
| - On each cycle, multiple "engines" (`alu`, `valu`, `load`, `store`, `flow`) execute instructions in parallel, limited by `SLOT_LIMITS`. | |
| 4. **Verification**: The machine's final memory state is compared against a reference Python implementation (`reference_kernel2`). | |
| ### The Machine (VLIW SIMD) | |
| - **VLEN**: 8 (Vector Length). | |
| - **Slot Limits** per cycle: | |
| - `alu`: 12 (Scalar arithmetic) | |
| - `valu`: 6 (Vector arithmetic) | |
| - `load`: 2 (Memory reads) | |
| - `store`: 2 (Memory writes) | |
| - `flow`: 1 (Control flow) | |
| - **Memory**: Flat 32-bit integer memory array. | |
| - **Scratchpad**: `SCRATCH_SIZE` (1536 ints). Serves as registers/cache. | |
| ## Contest Mechanics | |
| - **Optimization Target**: Minimize `machine.cycle`. | |
| - **Baseline**: The starter code is a purely scalar implementation (~147,734 cycles). | |
| - **Targets**: | |
| - < 2164 cycles: Claude Opus 4 baseline. | |
| - < 1487 cycles: Claude Opus 4.5 (11.5 hours compute). | |
| - < 1300 cycles: Invalid/Cheated solutions reference. | |
| - **Anti-Cheat**: The `tests/` directory and `frozen_problem.py` must not be modified. Validation uses `frozen_problem.py`. | |
| ## Current Implementation (Baseline) | |
| The current `build_kernel` in `perf_takehome.py` implements the logic using only scalar `alu` and `load`/`store` operations, processing one item at a time. This fails to utilize the `valu` (vector) slots and the parallelism available in the `alu` slots (12 available, using ~1 per instruction bundle). | |
| ## Next Steps | |
| To achieve the target performance, the kernel needs to be vectorized (`valu`, `vload`, `vstore`) and likely pipelined (software pipelining) to maximize the utilization of all available slots per cycle, processing multiple inputs and hashing stages in parallel. | |