FastDMS: Full DMS implementation running faster than vLLM BF16/FP8

#2
by leonardlin - opened

DMS authors, contributors, and other interested parties.

I was poking around with DMS (great work!) and I couldn't find a proper implementation, so I created one:

  • HF reference implementation, slow but correct
  • Trainer - created https://huggingface.co/shisa-ai/Llama-3.2-1B-DMS-8x - seems to work
  • FastDMS - a tuned version that does full reclamation of evicted slots, and is faster than vLLM BF16 & FP8 for decode (and basically as fast for prefill) on my PRO 6000 test bench.

MIT-licensed, with full testing artifacts, tables: https://github.com/shisa-ai/FastDMS

Small subset below:

Compact DMS saves real allocator/device memory, not just theoretical KV bytes. The table below uses ctx_len=8192, gen_len=128. All vLLM baselines use exact-sized token pools matching the workload. KV/stage memory is the cache or cache-plus-staging footprint. vLLM BF16 means dtype=bfloat16 with kv_cache_dtype=auto; vLLM FP8 means kv_cache_dtype=fp8.

Model / compact-DMS row c vLLM BF16 KV β†’ FastDMS KV BF16 KV saved vLLM FP8 KV β†’ FastDMS KV FP8 KV saved vLLM TQ4 KV β†’ FastDMS KV TQ4 KV saved
Llama-3.2-1B FastDMS default 1 0.312 β†’ 0.056 GiB 5.6x 0.156 β†’ 0.056 GiB 2.8x 0.142 β†’ 0.056 GiB 2.5x
Llama-3.2-1B FastDMS default 8 2.062 β†’ 0.431 GiB 4.8x 1.031 β†’ 0.431 GiB 2.4x 0.939 β†’ 0.431 GiB 2.2x
Qwen3-8B FastDMS compact DMS 1 1.406 β†’ 0.184 GiB 7.6x 0.703 β†’ 0.184 GiB 3.8x β€” β€”
Qwen3-8B FastDMS compact DMS 8 9.281 β†’ 1.462 GiB 6.3x 4.641 β†’ 1.462 GiB 3.2x β€” β€”

Speed and memory usage:

Path c Prefill tok/s Prefill vs BF16 Decode tok/s Decode vs BF16 KV / stage memory Status
vLLM BF16 1 123098.0 1.00x 459.4 1.00x 0.312 GiB BF16 KV dense BF16-KV baseline
vLLM FP8 1 119991.3 0.97x 489.4 1.07x 0.156 GiB FP8 KV dense FP8-KV baseline
vLLM TurboQuant 4bit_nc 1 126429.0 1.03x 333.4 0.73x 0.142 GiB TQ4 KV 4-bit KV baseline
FastDMS FP8 compact-DMS default 1 123194.6 1.00x 698.9 1.52x 0.056 GiB promoted zero-BF16 row
FastDMS B46 int4 speed profile 1 121489.9 0.99x 1060.0 2.31x 0.056 GiB + 0.719 GiB int4 shadow default-off storage-for-speed
vLLM BF16 8 103668.5 1.00x 2357.5 1.00x 2.062 GiB BF16 KV dense BF16-KV baseline
vLLM FP8 8 102959.5 0.99x 2888.7 1.23x 1.031 GiB FP8 KV dense FP8-KV baseline
vLLM TurboQuant 4bit_nc 8 104409.9 1.01x 1696.0 0.72x 0.939 GiB TQ4 KV 4-bit KV baseline
FastDMS FP8 compact-DMS default 8 105531.7 1.02x 3606.9 1.53x 0.431 GiB promoted zero-BF16 row
FastDMS B25 narrow int4 speed profile 8 104753.7 1.01x 3640.7 1.54x 0.431 GiB + 0.078 GiB int4 shadow default-off storage-for-speed
FastDMS BF16-attention speed control 8 108070.5 1.04x 3745.3 1.59x 0.429 GiB + 0.312 GiB BF16 backing explicit speed control

Sign up or log in to comment