FastDMS: Full DMS implementation running faster than vLLM BF16/FP8

by leonardlin - opened 25 days ago

DMS authors, contributors, and other interested parties.

I was poking around with DMS (great work!) and I couldn't find a proper implementation, so I created one:

- HF reference implementation: slow but correct
- Trainer: created https://huggingface.co/shisa-ai/Llama-3.2-1B-DMS-8x, seems to work
- FastDMS: a tuned version that does full reclamation of evicted slots, and is faster than vLLM BF16 & FP8 for decode (and basically as fast for prefill) on my PRO 6000 test bench.

MIT-licensed, with full testing artifacts, tables: https://github.com/shisa-ai/FastDMS

Small subset below:

Compact DMS saves real allocator/device memory, not just theoretical KV bytes. The table below uses ctx_len=8192, gen_len=128. All vLLM baselines use exact-sized token pools matching the workload. KV/stage memory is the cache or cache-plus-staging footprint. vLLM BF16 means dtype=bfloat16 with kv_cache_dtype=auto; vLLM FP8 means kv_cache_dtype=fp8.

| Model / compact-DMS row | c | vLLM BF16 KV → FastDMS KV | BF16 KV saved | vLLM FP8 KV → FastDMS KV | FP8 KV saved | vLLM TQ4 KV → FastDMS KV | TQ4 KV saved |
|---|---|---|---|---|---|---|---|
| Llama-3.2-1B FastDMS default | 1 | `0.312 → 0.056 GiB` | `5.6x` | `0.156 → 0.056 GiB` | `2.8x` | `0.142 → 0.056 GiB` | `2.5x` |
| Llama-3.2-1B FastDMS default | 8 | `2.062 → 0.431 GiB` | `4.8x` | `1.031 → 0.431 GiB` | `2.4x` | `0.939 → 0.431 GiB` | `2.2x` |
| Qwen3-8B FastDMS compact DMS | 1 | `1.406 → 0.184 GiB` | `7.6x` | `0.703 → 0.184 GiB` | `3.8x` | - | - |
| Qwen3-8B FastDMS compact DMS | 8 | `9.281 → 1.462 GiB` | `6.3x` | `4.641 → 1.462 GiB` | `3.2x` | - | - |
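
For anyone who wants to sanity-check the "saved" columns: each one is the vLLM KV footprint divided by the FastDMS KV footprint. A minimal sketch that recomputes the BF16 and FP8 ratios from the GiB figures in the table (the function and variable names are illustrative only, not part of FastDMS):

```python
# Recompute the "KV saved" ratios from the GiB figures in the table above.
# Names are illustrative only; nothing here is FastDMS API.

def saved_ratio(baseline_gib: float, fastdms_gib: float) -> float:
    """How many times smaller the FastDMS KV footprint is than the baseline."""
    return baseline_gib / fastdms_gib

# (model, c) -> (vLLM BF16 KV, vLLM FP8 KV, FastDMS KV), all in GiB
rows = {
    ("Llama-3.2-1B", 1): (0.312, 0.156, 0.056),
    ("Llama-3.2-1B", 8): (2.062, 1.031, 0.431),
    ("Qwen3-8B", 1): (1.406, 0.703, 0.184),
    ("Qwen3-8B", 8): (9.281, 4.641, 1.462),
}

# Round to one decimal place, matching the precision used in the table.
ratios = {
    key: (round(saved_ratio(bf16, dms), 1), round(saved_ratio(fp8, dms), 1))
    for key, (bf16, fp8, dms) in rows.items()
}

for (model, c), (vs_bf16, vs_fp8) in ratios.items():
    print(f"{model} c={c}: {vs_bf16}x vs BF16, {vs_fp8}x vs FP8")
```

The ratios are computed from the rounded GiB values shown here, so they should agree with the table only to the one decimal place it reports.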

Speed and memory usage (the KV figures here match the Llama-3.2-1B rows above):

| Path | c | Prefill tok/s | Prefill vs BF16 | Decode tok/s | Decode vs BF16 | KV / stage memory | Status |
|---|---|---|---|---|---|---|---|
| vLLM BF16 | 1 | `123098.0` | `1.00x` | `459.4` | `1.00x` | `0.312 GiB` BF16 KV | dense BF16-KV baseline |
| vLLM FP8 | 1 | `119991.3` | `0.97x` | `489.4` | `1.07x` | `0.156 GiB` FP8 KV | dense FP8-KV baseline |
| vLLM TurboQuant `4bit_nc` | 1 | `126429.0` | `1.03x` | `333.4` | `0.73x` | `0.142 GiB` TQ4 KV | 4-bit KV baseline |
| FastDMS FP8 compact-DMS default | 1 | `123194.6` | `1.00x` | `698.9` | `1.52x` | `0.056 GiB` | promoted zero-BF16 row |
| FastDMS B46 int4 speed profile | 1 | `121489.9` | `0.99x` | `1060.0` | `2.31x` | `0.056 GiB` + `0.719 GiB` int4 shadow | default-off storage-for-speed |
| vLLM BF16 | 8 | `103668.5` | `1.00x` | `2357.5` | `1.00x` | `2.062 GiB` BF16 KV | dense BF16-KV baseline |
| vLLM FP8 | 8 | `102959.5` | `0.99x` | `2888.7` | `1.23x` | `1.031 GiB` FP8 KV | dense FP8-KV baseline |
| vLLM TurboQuant `4bit_nc` | 8 | `104409.9` | `1.01x` | `1696.0` | `0.72x` | `0.939 GiB` TQ4 KV | 4-bit KV baseline |
| FastDMS FP8 compact-DMS default | 8 | `105531.7` | `1.02x` | `3606.9` | `1.53x` | `0.431 GiB` | promoted zero-BF16 row |
| FastDMS B25 narrow int4 speed profile | 8 | `104753.7` | `1.01x` | `3640.7` | `1.54x` | `0.431 GiB` + `0.078 GiB` int4 shadow | default-off storage-for-speed |
| FastDMS BF16-attention speed control | 8 | `108070.5` | `1.04x` | `3745.3` | `1.59x` | `0.429 GiB` + `0.312 GiB` BF16 backing | explicit speed control |
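
To make the storage-for-speed trade-off concrete, here is a rough sketch that recomputes the c=1 decode speedups and adds the compact KV and int4 shadow figures together. It assumes the "+" in the memory column means the shadow is held in addition to the compact KV; the names are illustrative, not FastDMS API.

```python
# Recompute decode speedups and total KV footprint for some c=1 rows above.
# Figures are copied from the table; names are illustrative only.
# Assumption: the "+ int4 shadow" figure is additive to the compact KV figure.

BASELINE_DECODE = 459.4   # vLLM BF16 decode tok/s at c=1
BASELINE_KV_GIB = 0.312   # vLLM BF16 KV footprint at c=1

# path -> (decode tok/s, KV GiB, extra shadow GiB)
paths = {
    "vLLM FP8": (489.4, 0.156, 0.0),
    "FastDMS FP8 compact-DMS default": (698.9, 0.056, 0.0),
    "FastDMS B46 int4 speed profile": (1060.0, 0.056, 0.719),
}

summary = {}
for name, (decode, kv, extra) in paths.items():
    speedup = round(decode / BASELINE_DECODE, 2)
    total_gib = round(kv + extra, 3)
    summary[name] = (speedup, total_gib)
    print(f"{name}: {speedup}x decode, {total_gib} GiB total "
          f"({total_gib / BASELINE_KV_GIB:.2f}x the BF16 KV footprint)")
```

Under that assumption, the B46 int4 profile is the fastest decode path at c=1 but holds 0.775 GiB in total, more than the dense BF16 baseline, while the default compact-DMS row keeps the 0.056 GiB footprint.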
