# GPU Mode: Multi-Head Latent Attention (MLA) Decode

Evolve a Triton kernel for the MLA decode operator using SkyDiscover.

MLA is the core attention mechanism of DeepSeek-V2/V3. It enables efficient inference by compressing the KV cache into a low-rank latent (LoRA-style down/up projections) combined with a decoupled RoPE component.
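To make the operator concrete, here is a minimal PyTorch sketch of the decode-step math. All sizes (`B`, `H`, `S`, `d_c`, `d_h`, `d_r`) and the weight layout are illustrative assumptions; the benchmark's actual shapes and reference implementation live in `initial_program.py` and `evaluator.py`.

```python
import math
import torch

# Hypothetical sizes for illustration (not the benchmark's exact config).
B, H = 2, 16                  # batch, heads
S = 128                       # cached sequence length
d_c, d_h, d_r = 512, 128, 64  # latent dim, per-head content dim, RoPE dim

# Compressed KV cache: one shared latent + one shared RoPE'd key per position.
c_kv = torch.randn(B, S, d_c, dtype=torch.bfloat16)    # latent cache
k_rope = torch.randn(B, S, d_r, dtype=torch.bfloat16)  # RoPE key part, shared across heads

# Decode-step queries (assumed already projected and RoPE'd upstream).
q_nope = torch.randn(B, H, d_h, dtype=torch.bfloat16)
q_rope = torch.randn(B, H, d_r, dtype=torch.bfloat16)

# Low-rank up-projections that decompress the latent into per-head K/V.
W_uk = torch.randn(H, d_c, d_h, dtype=torch.bfloat16) / math.sqrt(d_c)
W_uv = torch.randn(H, d_c, d_h, dtype=torch.bfloat16) / math.sqrt(d_c)

def mla_decode_reference(q_nope, q_rope, c_kv, k_rope, W_uk, W_uv):
    # Decompress keys/values: (B, S, d_c) x (H, d_c, d_h) -> (B, H, S, d_h)
    k_nope = torch.einsum("bsc,hcd->bhsd", c_kv, W_uk)
    v = torch.einsum("bsc,hcd->bhsd", c_kv, W_uv)
    # Attention scores combine the content part and the shared RoPE part.
    scores = torch.einsum("bhd,bhsd->bhs", q_nope, k_nope)
    scores += torch.einsum("bhr,bsr->bhs", q_rope, k_rope)
    scores = scores.float() / math.sqrt(d_h + d_r)
    p = torch.softmax(scores, dim=-1).to(v.dtype)
    return torch.einsum("bhs,bhsd->bhd", p, v)  # (B, H, d_h)

out = mla_decode_reference(q_nope, q_rope, c_kv, k_rope, W_uk, W_uv)
print(out.shape)  # torch.Size([2, 16, 128])
```

The point for a fast kernel is that `c_kv` and `k_rope` are shared across heads, so a kernel can avoid materializing per-head K/V (for example via the matrix-absorption trick from the DeepSeek-V2 paper) rather than doing the naive decompression above.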
## Quick Start

From the repo root:

```bash
uv run skydiscover-run \
  benchmarks/gpu_mode/mla_decode/initial_program.py \
  benchmarks/gpu_mode/mla_decode/evaluator.py \
  -c benchmarks/gpu_mode/mla_decode/config.yaml \
  -s [your_algorithm] -i 50
```
## Scoring

- **Correctness:** must match the reference MLA output in bfloat16 (`rtol=0.06`, `atol=0.06`)
- **Score:** `SCORE_SCALE / geom_mean_us`, where `SCORE_SCALE = 3000.0` and `geom_mean_us` is the geometric mean runtime in microseconds
- Higher is better: a faster kernel yields a higher score (see the sketch below)
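A sketch of how these two pieces fit together, assuming the evaluator collects one latency per benchmark shape in microseconds; the actual measurement loop lives in `evaluator.py`.

```python
import math
import torch

SCORE_SCALE = 3000.0

def check_correctness(out, ref, rtol=0.06, atol=0.06):
    # Correctness gate: compare against the reference MLA output in bfloat16.
    return torch.allclose(out.float(), ref.float(), rtol=rtol, atol=atol)

def score(latencies_us):
    # Geometric mean of the per-shape runtimes in microseconds.
    geom_mean_us = math.exp(sum(math.log(t) for t in latencies_us) / len(latencies_us))
    return SCORE_SCALE / geom_mean_us

print(score([100.0, 200.0, 400.0]))  # geo-mean = 200 us -> score 15.0
```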
## Modal Cloud GPU Support

**Note:** This benchmark requires an H200 GPU (141 GB VRAM); an H100 (80 GB) does not have enough memory.

```bash
GPUMODE_USE_MODAL=true GPUMODE_MODAL_GPU=H200 \
  uv run skydiscover-run \
  benchmarks/gpu_mode/mla_decode/initial_program.py \
  benchmarks/gpu_mode/mla_decode/evaluator.py \
  -c benchmarks/gpu_mode/mla_decode/config.yaml \
  -s [your_algorithm] -i 50
```