anthropic-kernel / atempt_2 /rem /original_system_analysis.md

Upload 39 files

f3ce0b0 verified 5 days ago

3.48 kB

	# Kernel Optimization Contest Analysis

	## Overview
	The goal is to optimize a kernel function (`KernelBuilder.build_kernel`) to run as fast as possible on a simulated custom VLIW (Very Large Instruction Word) SIMD machine. The performance is measured in clock cycles.

	## Repository Structure & Key Files
	- `perf_takehome.py`: The main development file. Contains the `KernelBuilder` class where you implement the optimization logic. It also includes local tests (`Tests` class) and a reference scalar implementation of the system.
	- `problem.py`: Defines the simulated machine (`Machine` class), instruction set (`alu`, `valu`, `load`, `store`, `flow`), and the environment (`Tree`, `Input`).
	- `tests/submission_tests.py`: The authoritative validation script. It imports `Machine` from `frozen_problem.py` to ensure the simulator logic hasn't been tampered with. It runs your `KernelBuilder` from `perf_takehome.py` and checks correctness and speed.
	- `tests/frozen_problem.py`: A copy of `problem.py` used strictly for validation to prevent "cheating" by modifying the simulator.
	- `watch_trace.py` / `watch_trace.html`: Tools for visualizing the execution trace in Perfetto (Chrome), useful for debugging and profiling component utilization.

	## System Flow & Architecture
	1. Input Generation: A random binary tree (`Forest`) and a batch of inputs (`indices`, `values`) are generated.
	2. Kernel Building: `KernelBuilder.build_kernel` is called to generate a sequence of instructions (`kb.instrs`).
	3. Simulation:
	- A `Machine` is instantiated with the memory image and the generated instructions.
	- The machine runs cycle-by-cycle.
	- On each cycle, multiple "engines" (`alu`, `valu`, `load`, `store`, `flow`) execute instructions in parallel, limited by `SLOT_LIMITS`.
	4. Verification: The machine's final memory state is compared against a reference Python implementation (`reference_kernel2`).

	### The Machine (VLIW SIMD)
	- VLEN: 8 (Vector Length).
	- Slot Limits per cycle:
	- `alu`: 12 (Scalar arithmetic)
	- `valu`: 6 (Vector arithmetic)
	- `load`: 2 (Memory reads)
	- `store`: 2 (Memory writes)
	- `flow`: 1 (Control flow)
	- Memory: Flat 32-bit integer memory array.
	- Scratchpad: `SCRATCH_SIZE` (1536 ints). Serves as registers/cache.

	## Contest Mechanics
	- Optimization Target: Minimize `machine.cycle`.
	- Baseline: The starter code is a purely scalar implementation (~147,734 cycles).
	- Targets:
	- < 2164 cycles: Claude Opus 4 baseline.
	- < 1487 cycles: Claude Opus 4.5 (11.5 hours compute).
	- < 1300 cycles: Invalid/Cheated solutions reference.
	- Anti-Cheat: The `tests/` directory and `frozen_problem.py` must not be modified. Validation uses `frozen_problem.py`.

	## Current Implementation (Baseline)
	The current `build_kernel` in `perf_takehome.py` implements the logic using only scalar `alu` and `load`/`store` operations, processing one item at a time. This fails to utilize the `valu` (vector) slots and the parallelism available in the `alu` slots (12 available, using ~1 per instruction bundle).

	## Next Steps
	To achieve the target performance, the kernel needs to be vectorized (`valu`, `vload`, `vstore`) and likely pipelined (software pipelining) to maximize the utilization of all available slots per cycle, processing multiple inputs and hashing stages in parallel.