# GPU Mode: Multi-Head Latent Attention (MLA) Decode
Evolve a Triton kernel for the MLA decode operator using SkyDiscover.

MLA is the core attention mechanism from DeepSeek-V2/V3. It enables efficient inference by caching a compressed latent KV representation (produced by low-rank down/up projections) together with a decoupled RoPE component, rather than the full per-head keys and values.
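To make the compressed-cache idea concrete, here is a minimal NumPy sketch of a single MLA decode step. All dimensions and weight matrices are toy stand-ins (the decoupled RoPE path is omitted for brevity); this illustrates the structure, not the real DeepSeek parameterization or the Triton kernel you will write.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not the real model's sizes)
d, d_c, n_heads, d_h, T = 64, 16, 4, 8, 10

# Random stand-ins for learned projection weights
W_dkv = rng.standard_normal((d_c, d)) * 0.1           # down-projection: hidden -> latent
W_uk = rng.standard_normal((n_heads, d_h, d_c)) * 0.1  # up-projection: latent -> keys
W_uv = rng.standard_normal((n_heads, d_h, d_c)) * 0.1  # up-projection: latent -> values
W_q = rng.standard_normal((n_heads, d_h, d)) * 0.1     # query projection

# Compressed KV cache: one d_c-dim latent per past token, instead of
# n_heads * d_h * 2 values for full K and V
hidden = rng.standard_normal((T, d))
c_kv = hidden @ W_dkv.T  # (T, d_c) -- the only tensor cached per token

def mla_decode(h_t):
    """One decode step: attend the new token's query over the latent cache."""
    q = np.einsum('hed,d->he', W_q, h_t)       # (n_heads, d_h)
    k = np.einsum('hec,tc->hte', W_uk, c_kv)   # (n_heads, T, d_h)
    v = np.einsum('hec,tc->hte', W_uv, c_kv)   # (n_heads, T, d_h)
    scores = np.einsum('he,hte->ht', q, k) / np.sqrt(d_h)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)          # softmax over past tokens
    return np.einsum('ht,hte->he', w, v)       # (n_heads, d_h) per-head output

out = mla_decode(rng.standard_normal(d))
print(out.shape)  # (4, 8)
```

In a production kernel the key/value up-projections can be absorbed into the query/output projections so attention runs directly in the latent space, which is where most of the decode-time savings come from.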
## Quick Start
From the repo root:
```bash
uv run skydiscover-run \
  benchmarks/gpu_mode/mla_decode/initial_program.py \
  benchmarks/gpu_mode/mla_decode/evaluator.py \
  -c benchmarks/gpu_mode/mla_decode/config.yaml \
  -s [your_algorithm] -i 50
```
## Scoring
- **Correctness**: Must match the reference MLA output (`rtol=0.06`, `atol=0.06` in bfloat16)
- **Score**: `SCORE_SCALE / geom_mean_us`, where `SCORE_SCALE = 3000.0`
- Higher is better (faster runtime = higher score)
## Modal Cloud GPU Support
**Note**: This benchmark requires an H200 GPU (141 GB VRAM). An H100 (80 GB) does not have enough memory.
```bash
GPUMODE_USE_MODAL=true GPUMODE_MODAL_GPU=H200 \
uv run skydiscover-run \
  benchmarks/gpu_mode/mla_decode/initial_program.py \
  benchmarks/gpu_mode/mla_decode/evaluator.py \
  -c benchmarks/gpu_mode/mla_decode/config.yaml \
  -s [your_algorithm] -i 50
```