# Kernel Optimization Contest Analysis

## Overview
The goal is to optimize a kernel function (`KernelBuilder.build_kernel`) to run as fast as possible on a simulated custom VLIW (Very Large Instruction Word) SIMD machine. The performance is measured in clock cycles.

## Repository Structure & Key Files
- **`perf_takehome.py`**: The main development file. Contains the `KernelBuilder` class where you implement the optimization logic. It also includes local tests (`Tests` class) and a reference scalar implementation of the system.
- **`problem.py`**: Defines the simulated machine (`Machine` class), instruction set (`alu`, `valu`, `load`, `store`, `flow`), and the environment (`Tree`, `Input`).
- **`tests/submission_tests.py`**: The authoritative validation script. It imports `Machine` from `frozen_problem.py` to ensure the simulator logic hasn't been tampered with. It runs your `KernelBuilder` from `perf_takehome.py` and checks correctness and speed.
- **`tests/frozen_problem.py`**: A copy of `problem.py` used strictly for validation to prevent "cheating" by modifying the simulator.
- **`watch_trace.py` / `watch_trace.html`**: Tools for visualizing the execution trace in Perfetto (Chrome), useful for debugging and profiling component utilization.

## System Flow & Architecture
1.  **Input Generation**: A random binary tree (`Forest`) and a batch of inputs (`indices`, `values`) are generated.
2.  **Kernel Building**: `KernelBuilder.build_kernel` is called to generate a sequence of instructions (`kb.instrs`).
3.  **Simulation**:
    - A `Machine` is instantiated with the memory image and the generated instructions.
    - The machine runs cycle-by-cycle.
    - On each cycle, multiple "engines" (`alu`, `valu`, `load`, `store`, `flow`) execute instructions in parallel, limited by `SLOT_LIMITS`.
4.  **Verification**: The machine's final memory state is compared against a reference Python implementation (`reference_kernel2`).

### The Machine (VLIW SIMD)
-   **VLEN**: 8 (Vector Length).
-   **Slot Limits** per cycle:
    -   `alu`: 12 (Scalar arithmetic)
    -   `valu`: 6 (Vector arithmetic)
    -   `load`: 2 (Memory reads)
    -   `store`: 2 (Memory writes)
    -   `flow`: 1 (Control flow)
-   **Memory**: Flat 32-bit integer memory array.
-   **Scratchpad**: `SCRATCH_SIZE` (1536 ints). Serves as registers/cache.

## Contest Mechanics
-   **Optimization Target**: Minimize `machine.cycle`.
-   **Baseline**: The starter code is a purely scalar implementation (~147,734 cycles).
-   **Targets**:
    -   < 2164 cycles: Claude Opus 4 baseline.
    -   < 1487 cycles: Claude Opus 4.5 (11.5 hours compute).
    -   < 1300 cycles: Invalid/Cheated solutions reference.
-   **Anti-Cheat**: The `tests/` directory and `frozen_problem.py` must not be modified. Validation uses `frozen_problem.py`.

## Current Implementation (Baseline)
The current `build_kernel` in `perf_takehome.py` implements the logic using only scalar `alu` and `load`/`store` operations, processing one item at a time. This fails to utilize the `valu` (vector) slots and the parallelism available in the `alu` slots (12 available, using ~1 per instruction bundle).

## Next Steps
To achieve the target performance, the kernel needs to be vectorized (`valu`, `vload`, `vstore`) and likely pipelined (software pipelining) to maximize the utilization of all available slots per cycle, processing multiple inputs and hashing stages in parallel.