File size: 3,484 Bytes
f3ce0b0 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 | # Kernel Optimization Contest Analysis
## Overview
The goal is to optimize a kernel function (`KernelBuilder.build_kernel`) to run as fast as possible on a simulated custom VLIW (Very Large Instruction Word) SIMD machine. The performance is measured in clock cycles.
## Repository Structure & Key Files
- **`perf_takehome.py`**: The main development file. Contains the `KernelBuilder` class where you implement the optimization logic. It also includes local tests (`Tests` class) and a reference scalar implementation of the system.
- **`problem.py`**: Defines the simulated machine (`Machine` class), instruction set (`alu`, `valu`, `load`, `store`, `flow`), and the environment (`Tree`, `Input`).
- **`tests/submission_tests.py`**: The authoritative validation script. It imports `Machine` from `frozen_problem.py` to ensure the simulator logic hasn't been tampered with. It runs your `KernelBuilder` from `perf_takehome.py` and checks correctness and speed.
- **`tests/frozen_problem.py`**: A copy of `problem.py` used strictly for validation to prevent "cheating" by modifying the simulator.
- **`watch_trace.py` / `watch_trace.html`**: Tools for visualizing the execution trace in Perfetto (Chrome), useful for debugging and profiling component utilization.
## System Flow & Architecture
1. **Input Generation**: A random binary tree (`Forest`) and a batch of inputs (`indices`, `values`) are generated.
2. **Kernel Building**: `KernelBuilder.build_kernel` is called to generate a sequence of instructions (`kb.instrs`).
3. **Simulation**:
- A `Machine` is instantiated with the memory image and the generated instructions.
- The machine runs cycle-by-cycle.
- On each cycle, multiple "engines" (`alu`, `valu`, `load`, `store`, `flow`) execute instructions in parallel, limited by `SLOT_LIMITS`.
4. **Verification**: The machine's final memory state is compared against a reference Python implementation (`reference_kernel2`).
### The Machine (VLIW SIMD)
- **VLEN**: 8 (Vector Length).
- **Slot Limits** per cycle:
- `alu`: 12 (Scalar arithmetic)
- `valu`: 6 (Vector arithmetic)
- `load`: 2 (Memory reads)
- `store`: 2 (Memory writes)
- `flow`: 1 (Control flow)
- **Memory**: Flat 32-bit integer memory array.
- **Scratchpad**: `SCRATCH_SIZE` (1536 ints). Serves as registers/cache.
## Contest Mechanics
- **Optimization Target**: Minimize `machine.cycle`.
- **Baseline**: The starter code is a purely scalar implementation (~147,734 cycles).
- **Targets**:
- < 2164 cycles: Claude Opus 4 baseline.
- < 1487 cycles: Claude Opus 4.5 (11.5 hours compute).
- < 1300 cycles: Invalid/Cheated solutions reference.
- **Anti-Cheat**: The `tests/` directory and `frozen_problem.py` must not be modified. Validation uses `frozen_problem.py`.
## Current Implementation (Baseline)
The current `build_kernel` in `perf_takehome.py` implements the logic using only scalar `alu` and `load`/`store` operations, processing one item at a time. This fails to utilize the `valu` (vector) slots and the parallelism available in the `alu` slots (12 available, using ~1 per instruction bundle).
## Next Steps
To achieve the target performance, the kernel needs to be vectorized (`valu`, `vload`, `vstore`) and likely pipelined (software pipelining) to maximize the utilization of all available slots per cycle, processing multiple inputs and hashing stages in parallel.
|