Performance Tuning for LLMs

Published June 25, 2026

Concurrency, batching, and flag sweeps — systematically tuning vLLM for better throughput and latency.

Concurrency and Batching — The Mental Model

Last episode established baselines at single-user concurrency. Real servers handle multiple users simultaneously. Here's how vLLM does it.

Continuous Batching

When two users hit the server at once:

User A: [prefill-------] [decode][decode][decode]...
User B:       [prefill-------] [decode][decode]...

GPU:    [A-prefill] [A-d + B-prefill] [A-d + B-d] ...

vLLM dynamically mixes prefill and decode work from different users in the same GPU step. A new request's prefill does not have to wait for earlier requests to finish decoding, so the GPU stays busy even while individual requests are in a slow phase.
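The timeline above can be reproduced with a toy scheduler. This is a minimal sketch for building intuition, not vLLM's scheduler: the `Request` class, the `simulate` function, and the assumption that a prefill costs exactly one GPU step are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival_step: int   # GPU step at which the request reaches the server
    prefill_steps: int  # toy assumption: prefill costs this many steps
    decode_tokens: int  # tokens to generate, one per step


def simulate(requests):
    """Return one list per GPU step naming the work batched into that step."""
    prefill_left = {r.name: r.prefill_steps for r in requests}
    decode_left = {r.name: r.decode_tokens for r in requests}
    timeline = []
    step = 0
    while any(decode_left.values()):
        batch = []
        for r in requests:
            if r.arrival_step > step:
                continue  # request has not arrived yet
            if prefill_left[r.name] > 0:
                prefill_left[r.name] -= 1
                batch.append(f"{r.name}-prefill")
            elif decode_left[r.name] > 0:
                decode_left[r.name] -= 1
                batch.append(f"{r.name}-d")
        timeline.append(batch)
        step += 1
    return timeline


requests = [
    Request("A", arrival_step=0, prefill_steps=1, decode_tokens=3),
    Request("B", arrival_step=1, prefill_steps=1, decode_tokens=2),
]
for i, batch in enumerate(simulate(requests)):
    print(f"step {i}: [{' + '.join(batch)}]")
```

The first three steps print `[A-prefill]`, `[A-d + B-prefill]`, and `[A-d + B-d]`, matching the GPU row of the diagram: B's prefill shares a step with A's decode instead of queuing behind it.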

The Two Scheduler Knobs

Flag	What it limits	Why it matters
`--max-num-seqs`	Max concurrent sequences	Too low = GPU starved. Too high = KV cache exhaustion
`--max-num-batched-tokens`	Max tokens per GPU step	Larger = faster prefill, more VRAM

Tradeoff:

Low latency, interactive: smaller limits → lower TTFT, less aggregate throughput
High throughput, batch: larger limits → higher aggregate tok/s, individual TTFT goes up
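To see how the two limits interact, here is a rough sketch of a single scheduling step. It is a toy model under stated assumptions, not vLLM's actual admission logic: `plan_step` is a hypothetical helper, each running sequence is assumed to decode exactly one token per step, and chunked prefill and speculative decoding are ignored.

```python
def plan_step(waiting_prompt_lens, running, max_num_seqs, max_num_batched_tokens):
    """Choose which waiting prompts to prefill in one GPU step (toy model).

    waiting_prompt_lens: prompt lengths (tokens) of queued requests, in order.
    running: number of sequences already decoding; each costs 1 token here.
    """
    budget = max_num_batched_tokens - running
    admitted = []
    for prompt_len in waiting_prompt_lens:
        if running + len(admitted) >= max_num_seqs:
            break  # sequence limit reached
        if prompt_len > budget:
            break  # token budget for this step is used up
        admitted.append(prompt_len)
        budget -= prompt_len
    return admitted


queue = [2048] * 8  # eight queued requests with 2048-token prompts
for seqs, tokens in [(16, 8192), (16, 16384), (6, 16384)]:
    n = len(plan_step(queue, running=4, max_num_seqs=seqs,
                      max_num_batched_tokens=tokens))
    print(f"max_num_seqs={seqs}, max_num_batched_tokens={tokens}: {n} prefills admitted")
```

In this toy model, doubling the token budget lets more prompts prefill in one step (better aggregate throughput, but a longer step for everyone in it), while a low sequence limit caps admission regardless of how large the token budget is.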

Baseline Numbers

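Before tuning, a reference sweep against the baseline config (GPU_MEM_UTIL=0.80, scheduler limits left at their defaults). In the tables below, pp is the prompt length in tokens, c1/c2/c4 are 1, 2, and 4 concurrent requests, and tg/s is aggregate generated tokens per second: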

pp	c1 tg/s	c2 tg/s	c4 tg/s	c1 TTFT	c2 TTFT	c4 TTFT
512	44.59	66.13	78.50	500 ms	1114 ms	1476 ms
2048	44.34	69.17	76.94	1017 ms	1313 ms	2342 ms

Observations:

Decode speed per user (c1): ~44.5 tok/s — single-user baseline
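Concurrency helps: going from 1 to 2 users raises aggregate throughput about 1.5x (44.59 → 66.13 tok/s at pp=512), and 4 users reach about 1.75x
Longer prompts hurt TTFT: at c1, pp=2048 takes ~2x longer than pp=512 (1017 vs 500 ms); at c2 and c4 the penalty is smaller (about 1.2x and 1.6x)
Server logs warned of a low batched token count for speculative decoding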
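The ratios behind these observations can be checked directly from the baseline table. The sketch below assumes the tg/s columns at c2 and c4 are aggregate across users, so per-user speed is the aggregate divided by the concurrency; the dictionary layout and function names are illustrative.

```python
# Baseline table: prompt length -> concurrency -> value
baseline = {
    512:  {"tg": {1: 44.59, 2: 66.13, 4: 78.50},
           "ttft_ms": {1: 500, 2: 1114, 4: 1476}},
    2048: {"tg": {1: 44.34, 2: 69.17, 4: 76.94},
           "ttft_ms": {1: 1017, 2: 1313, 4: 2342}},
}


def throughput_scaling(pp, concurrency):
    """Aggregate tok/s at this concurrency relative to a single user."""
    row = baseline[pp]["tg"]
    return row[concurrency] / row[1]


def per_user_speed(pp, concurrency):
    """Decode speed each user sees, assuming tg/s is an aggregate."""
    return baseline[pp]["tg"][concurrency] / concurrency


def ttft_penalty(concurrency, short_pp=512, long_pp=2048):
    """How much longer the long prompt waits for its first token."""
    return baseline[long_pp]["ttft_ms"][concurrency] / baseline[short_pp]["ttft_ms"][concurrency]


for c in (2, 4):
    print(f"c{c}: {throughput_scaling(512, c):.2f}x aggregate, "
          f"{per_user_speed(512, c):.1f} tok/s per user (pp=512)")
for c in (1, 2, 4):
    print(f"c{c}: pp=2048 TTFT is {ttft_penalty(c):.2f}x pp=512")
```

Under that assumption, aggregate throughput grows with concurrency while each individual user decodes more slowly, which is the tradeoff the scheduler knobs control.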

Tuning Knob 1: GPU Memory Utilization

Move from 0.80 → 0.85, add explicit scheduler limits:

--gpu-memory-utilization 0.85
--max-num-seqs 16
--max-num-batched-tokens 8192
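For repeated sweeps it helps to generate the launch command from a small helper rather than editing flags by hand. The sketch below only builds the argument list; it assumes the server is started with the `vllm serve` entry point, and the model name is a placeholder, not the model used in this series. `build_vllm_serve_command` is a hypothetical helper.

```python
import shlex


def build_vllm_serve_command(model, gpu_memory_utilization,
                             max_num_seqs, max_num_batched_tokens):
    """Return the argv list for a vLLM server with the three tuned flags."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    if max_num_seqs < 1 or max_num_batched_tokens < 1:
        raise ValueError("scheduler limits must be positive")
    return [
        "vllm", "serve", model,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-num-seqs", str(max_num_seqs),
        "--max-num-batched-tokens", str(max_num_batched_tokens),
    ]


# "your-org/your-model" is a placeholder; substitute the model under test.
cmd = build_vllm_serve_command("your-org/your-model", 0.85, 16, 8192)
print(shlex.join(cmd))
```

The list form can be passed straight to `subprocess.run`, which avoids shell quoting issues when a sweep script varies the values.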

Results:

pp	c1 tg/s	c2 tg/s	c4 tg/s	c1 TTFT	c2 TTFT	c4 TTFT
512	44.04	67.82	82.55	494 ms	1069 ms	1581 ms
2048	43.95	67.69	88.20	725 ms	1228 ms	1957 ms

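Gains: c4 throughput is up at both prompt lengths (78.50 → 82.55 and 76.94 → 88.20 tok/s), and pp=2048 TTFT is significantly improved (725 vs 1017 ms at c1, 1957 vs 2342 ms at c4). The one regression is c4 TTFT at pp=512 (1581 vs 1476 ms).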
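One way to reason about why a 0.05 bump matters: if the weights and activation memory stay the same, the extra fraction of VRAM goes to KV cache. The back-of-envelope sketch below estimates how many more cached tokens that buys. Every number in it is a hypothetical placeholder (a 24 GiB GPU and a standard-attention model with 32 layers, 8 KV heads, head dimension 128, 16-bit cache); none of them come from this benchmark, and models with hybrid or compressed attention need a different per-token formula.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """KV cache bytes for one token under standard attention.

    The factor 2 accounts for one key and one value vector per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def extra_kv_tokens(total_vram_gib, old_util, new_util, bytes_per_token):
    """Extra tokens of KV cache gained by raising the memory fraction,
    assuming all of the added memory goes to KV cache."""
    extra_bytes = (new_util - old_util) * total_vram_gib * GIB
    return int(extra_bytes // bytes_per_token)


# Hypothetical hardware and model shape, for illustration only.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8,
                               head_dim=128, dtype_bytes=2)
gain = extra_kv_tokens(total_vram_gib=24, old_util=0.80, new_util=0.85,
                       bytes_per_token=per_token)
print(f"{per_token // 1024} KiB per token, about {gain} extra cached tokens")
```

With these placeholder values the gain is a few thousand tokens, which is enough to hold several additional long prompts in cache at once.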

Tuning Knob 2: Push Scheduler Limits

Push further for throughput-heavy workloads:

--gpu-memory-utilization 0.88
--max-num-seqs 32
--max-num-batched-tokens 16384

Results:

pp	c1 tg/s	c2 tg/s	c4 tg/s	c1 TTFT	c2 TTFT	c4 TTFT
512	45.13	69.51	83.32	601 ms	1143 ms	1631 ms
2048	43.86	66.04	88.54	738 ms	1228 ms	1988 ms

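Marginal c4 gains over the previous config (under 1% at both prompt lengths), with TTFT equal or worse in every cell, most visibly c1 at pp=512 (601 vs 494 ms). The larger max-num-batched-tokens also makes cold start heavier.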
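A quick percentage comparison makes the diminishing returns concrete. The sketch below copies the c4 throughput and c1 TTFT cells from the three result tables; the config labels and the `pct_change` helper are illustrative names.

```python
# pp -> (c4 aggregate tok/s, c1 TTFT in ms), copied from the result tables
results = {
    "baseline (0.80)":   {512: (78.50, 500), 2048: (76.94, 1017)},
    "optimized (0.85)":  {512: (82.55, 494), 2048: (88.20, 725)},
    "throughput (0.88)": {512: (83.32, 601), 2048: (88.54, 738)},
}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


steps = [("baseline (0.80)", "optimized (0.85)"),
         ("optimized (0.85)", "throughput (0.88)")]
for before, after in steps:
    for pp in (512, 2048):
        tg_old, ttft_old = results[before][pp]
        tg_new, ttft_new = results[after][pp]
        print(f"{before} -> {after}, pp={pp}: "
              f"c4 tok/s {pct_change(tg_old, tg_new):+.1f}%, "
              f"c1 TTFT {pct_change(ttft_old, ttft_new):+.1f}%")
```

The first step buys roughly 5% to 15% more c4 throughput and a large TTFT cut on long prompts; the second step adds less than 1% throughput while c1 TTFT at pp=512 rises by about a fifth.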

Winner: Balanced Config

The optimized default wins for everyday use:

GPU_MEM_UTIL=0.85
max-num-seqs=16
max-num-batched-tokens=8192
MTP2 speculation

Recommendations by Workload

Workload	Best config	Why
Interactive coding / chat	Optimized (0.85)	Best balance: fast single-user decode, good TTFT
Batch / agent throughput	Throughput (0.88)	Slightly higher aggregate c4 decode at cost of TTFT

Interactive Simulator

Open the Inference Simulator v2 to visualize concurrency, batching, and memory constraints step by step.

Summary

Built a mental model of continuous batching
Ran baseline sweep, identified bottlenecks
Tuned GPU memory from 0.80 → 0.85 and added explicit scheduler limits in the same step: more KV cache, better c4 throughput, significantly improved long-prompt TTFT
Pushed further for throughput: marginal gains, worse cold start

The balanced config is now the canonical default for all future experiments.
