LLM Inference Benchmarking

Published June 25, 2026

Build a mental model of what inference actually is, then benchmark it. Start vanilla, measure, tweak one knob, measure again.

What Is Inference? — The Toy Model

User input: "The cat sat"

                     Tokenize
    "The cat sat"  →  [576, 3797, 7236]

                     PREFILL (happens once)
    Run ALL 3 tokens through the model in one shot.
    Each token attends to itself and every earlier token (causal attention).
    Produces the first output token AND saves KV cache.

                     DECODE (loops until done)
    Step 1: Append new token, run through model.
    Step 2: Only the NEW token needs fresh computation.
            Previous tokens reuse cached KV entries.
    Step 3: Model outputs next token.
    Step 4: Repeat.

                     Detokenize
    [389, 278, 3098, ...]  →  "on the mat."
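
The flow above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any real engine is written: `ToyModel`, its scripted continuation, and the `EOS = 0` token id are all hypothetical stand-ins, and the "cache" is just a list of token ids. The only point is the shape of the control flow: one prefill call, then one decode call per additional token.

```python
from typing import List


class ToyModel:
    """Stand-in for a real LLM: replays a scripted continuation (hypothetical)."""

    def __init__(self, script: List[int]):
        self.script = script
        self.pos = 0
        self.cache: List[int] = []  # stands in for the KV cache, one entry per token
        self.calls = {"prefill": 0, "decode": 0}

    def _next(self) -> int:
        token = self.script[self.pos]
        self.pos += 1
        return token

    def prefill(self, prompt_ids: List[int]) -> int:
        """Process the whole prompt in one pass and return the first output token."""
        self.calls["prefill"] += 1
        self.cache.extend(prompt_ids)
        return self._next()

    def decode_step(self, token_id: int) -> int:
        """Process only the newest token, reusing everything already cached."""
        self.calls["decode"] += 1
        self.cache.append(token_id)
        return self._next()


def generate(model: ToyModel, prompt_ids: List[int],
             eos_id: int, max_new_tokens: int) -> List[int]:
    output = [model.prefill(prompt_ids)]           # prefill: happens once
    while output[-1] != eos_id and len(output) < max_new_tokens:
        output.append(model.decode_step(output[-1]))  # decode: loops until done
    return output


EOS = 0  # hypothetical end-of-sequence id, chosen only for this sketch
model = ToyModel(script=[389, 278, 3098, EOS])
tokens = generate(model, [576, 3797, 7236], eos_id=EOS, max_new_tokens=16)
print(tokens, model.calls)
```

Four output tokens cost one prefill call and three decode calls, which is why the two phases are measured separately in the rest of this article.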

How the KV Cache Actually Works

During prefill, the model saves a "summary" of each token: two vectors called Key (K) and Value (V), stored for every attention layer. Think of these as the model's scratch notes on each token.

Prefill builds the cache:
  "The"  →  [K₁ V₁]     <-- saved
  "cat"  →  [K₂ V₂]     <-- saved
  "sat"  →  [K₃ V₃]     <-- saved

Decode reuses it:
  New token "on":
    → compute [K₄ V₄] from "on" itself     <-- cheap (1 token's worth)
    → read   [K₁..K₃, V₁..V₃] from cache   <-- fast (just a memory read)
    → no need to recompute "The cat sat"

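Without the KV cache, every decode step would re-process the entire growing sequence. With it, the attention work for one step at sequence length n is O(n) instead of O(n²). Each step still has to read the model weights and the whole cache from GPU memory to produce a single token, and the cache grows with every token, so the GPU spends more time moving bytes than computing. That's why decode is memory-bound.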
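
To make the cache concrete, here is a minimal single-head attention sketch in pure Python. It is an illustration under heavy assumptions: `project` is a hypothetical fixed map standing in for learned Q/K/V projections, vectors are two-dimensional, and there is one layer. It checks that attending over cached K/V gives the same result as recomputing K/V for the whole sequence at every step.

```python
import math


def attend(q, keys, values):
    """Softmax attention of one query over lists of key and value vectors."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def project(x):
    """Hypothetical stand-in for the learned Q, K, V projections."""
    q = [x[0] + x[1], x[0] - x[1]]
    k = [2 * x[0], x[1]]
    v = [x[1], x[0]]
    return q, k, v


def last_token_no_cache(xs):
    """Recompute K and V for every token, then attend from the last token."""
    projected = [project(x) for x in xs]  # n projections per step
    keys = [k for _, k, _ in projected]
    values = [v for _, _, v in projected]
    return attend(projected[-1][0], keys, values)


class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        """Project only the new token, append its K/V, attend over the cache."""
        q, k, v = project(x)  # 1 projection per step
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)


embeddings = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.2, 0.8]]  # made-up inputs
cache = KVCache()
max_diff = 0.0
for i, x in enumerate(embeddings):
    cached = cache.step(x)
    full = last_token_no_cache(embeddings[: i + 1])
    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(cached, full)))
print("largest difference between cached and recomputed outputs:", max_diff)
```

The cached path does one projection per step and the uncached path does n, yet both still read all n keys and values. That remaining read is the part that scales with sequence length.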

Prefill vs Decode — The Key Insight

                 Prefill                        Decode
Runs             Entire prompt at once          One new token at a time
How many times   1                              N (once per output token)
Bottleneck       Compute (matrix multiplies)    Memory bandwidth (weight and KV cache reads)
GPU behaviour    High utilization               Low compute utilization, bandwidth-bound

Two latency numbers matter:

TTFT (Time To First Token) — dominated by prefill
TPOT (Time Per Output Token) — dominated by decode / memory bandwidth
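
As a rough sketch of how these two numbers fall out of a streamed response, assume you have the time the request was sent and the arrival time of each output token. The function name and the sample timestamps below are made up for illustration. TPOT here is averaged over the tokens after the first one, which is a common convention but not the only one.

```python
from typing import List, Tuple


def ttft_and_tpot(request_sent: float, token_times: List[float]) -> Tuple[float, float]:
    """Return (TTFT, TPOT) in seconds from one request's token arrival times."""
    if not token_times:
        raise ValueError("need at least one output token")
    ttft = token_times[0] - request_sent
    if len(token_times) == 1:
        return ttft, 0.0  # no decode steps to average over
    # Average gap between tokens, excluding the wait for the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Made-up timestamps (seconds): first token at 0.25 s, then one every 20 ms.
ttft, tpot = ttft_and_tpot(0.0, [0.25, 0.27, 0.29, 0.31, 0.33])
print(f"TTFT = {ttft * 1e3:.0f} ms, TPOT = {tpot * 1e3:.0f} ms")
```

With speculative decoding, tokens can arrive in bursts, so individual gaps vary and TPOT is best read as an average.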

The Model

Qwen3.6-35B-A3B, a Mixture-of-Experts model:

35B total parameters, only 3B active per token
Supports MTP (Multi-Token Prediction) for faster decode
128k context window
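
The active-parameter count matters because decode is bandwidth-bound. A back-of-envelope floor on single-stream decode time is the bytes read per step divided by memory bandwidth. The sketch below uses the 3B active parameters from the list above; everything else is an assumption chosen for round numbers: 16-bit weights, a hypothetical GPU with 1 TB/s of bandwidth, and a 1 GB KV cache.

```python
def decode_floor_ms(active_params: float, bytes_per_param: float,
                    kv_cache_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step: time to stream weights plus KV cache once."""
    bytes_read = active_params * bytes_per_param + kv_cache_bytes
    return bytes_read / bandwidth_bytes_per_s * 1e3


ACTIVE_PARAMS = 3e9    # from the model description: about 3B active per token
BYTES_PER_PARAM = 2    # assumption: 16-bit weights
BANDWIDTH = 1e12       # assumption: 1 TB/s, a hypothetical GPU
KV_CACHE_BYTES = 1e9   # assumption: 1 GB of cached K/V for a long context

weights_only_ms = decode_floor_ms(ACTIVE_PARAMS, BYTES_PER_PARAM, 0.0, BANDWIDTH)
with_cache_ms = decode_floor_ms(ACTIVE_PARAMS, BYTES_PER_PARAM, KV_CACHE_BYTES, BANDWIDTH)
print(f"weights only: {weights_only_ms:.1f} ms, with cache: {with_cache_ms:.1f} ms")
```

Under these assumptions the floor is about 6 ms per step for weights alone and about 7 ms with the cache. Real numbers will differ: batching several requests tends to touch more experts per step, and measured TPOT includes overheads this estimate ignores. Treat it as a sanity check for benchmark results, not a prediction.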

Baseline Benchmark

Start with conservative settings: GPU_MEM_UTIL=0.80, scheduler limits left at their implicit defaults, and MTP2 speculation.

# Bring it up
cd ~/serve
./scripts/up.sh qwen3.6-35b-a3b-baseline

# vLLM's built-in bench (runs inside container — no network overhead)
docker exec vllm vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3.6-35B-A3B \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 256 \
  --num-prompts 50 \
  --request-rate inf

Richer Benchmarks with llama-benchy

llama-benchy gives us multi-concurrency sweeps, proper warmup, and statistical averaging.

export BASE_URL=http://<hostname>.local:8000/v1
export MODEL=Qwen/Qwen3.6-35B-A3B
export COMMON_ARGS="--base-url $BASE_URL --api-key EMPTY --model $MODEL --tokenizer $MODEL --latency-mode generation --skip-coherence"

# Short-context sweep
llama-benchy \
  $COMMON_ARGS \
  --pp 512 2048 \
  --tg 256 \
  --depth 0 \
  --concurrency 1 2 4 \
  --runs 3 \
  --save-total-throughput-timeseries

Key metrics to watch:

tg t/s — decode throughput per user
tg t/s (total) — aggregate across all concurrent users
e2e_ttft — end-to-end time to first token
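
To build intuition for the difference between the per-user and total numbers, here is a small sketch. It is not llama-benchy's actual implementation, whose internals are not covered here. It assumes each request is described by a hypothetical tuple of decode start time, decode end time, and output token count.

```python
from statistics import mean
from typing import List, Tuple

Request = Tuple[float, float, int]  # (decode_start_s, decode_end_s, output_tokens)


def throughput(requests: List[Request]) -> Tuple[float, float]:
    """Return (mean per-user tokens/s, aggregate tokens/s) for a set of requests."""
    per_user = [n / (end - start) for start, end, n in requests]
    wall = max(end for _, end, _ in requests) - min(start for start, _, _ in requests)
    total = sum(n for _, _, n in requests) / wall
    return mean(per_user), total


# Made-up run: 4 concurrent users, each generating 256 tokens over the same 8 s.
run = [(0.0, 8.0, 256)] * 4
per_user_tps, total_tps = throughput(run)
print(f"per user: {per_user_tps:.0f} t/s, total: {total_tps:.0f} t/s")
```

In this example each user sees 32 t/s while the server delivers 128 t/s overall. As concurrency rises, the per-user number usually drops while the total climbs, and a sweep shows where that trade-off stops paying off.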

Interactive Simulator

Open the Inference Pipeline Simulator to visualize prefill, decode, and KV cache behaviour step by step.

Summary

Build a mental model of inference: prefill vs decode, TTFT vs TPOT
Serve a model with baseline config
Run built-in benchmarks for quick local checks
Run llama-benchy for proper end-to-end sweeps

The point is to get comfortable with the tools and build intuition for what the numbers mean. Next: performance tuning — concurrency, batching, and flag sweeps.
