arxiv:2512.17077

Taming the Memory Footprint Crisis: System Design for Production Diffusion LLM Serving

Published on Dec 18, 2025

Abstract

Diffusion large language models face unique memory challenges during inference. dLLM-Serve addresses them by co-optimizing memory footprint, computational scheduling, and generation quality through novel techniques such as logit-aware activation budgeting and phase-multiplexed scheduling.

AI-generated summary

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to Autoregressive Models (ARMs), utilizing parallel decoding to overcome sequential bottlenecks. However, existing research focuses primarily on kernel-level optimizations, lacking a holistic serving framework that addresses the unique memory dynamics of diffusion processes in production. We identify a critical "memory footprint crisis" specific to dLLMs, driven by monolithic logit tensors and the severe resource oscillation between compute-bound "Refresh" phases and bandwidth-bound "Reuse" phases. To bridge this gap, we present dLLM-Serve, an efficient dLLM serving system that co-optimizes memory footprint, computational scheduling, and generation quality. dLLM-Serve introduces Logit-Aware Activation Budgeting to decompose transient tensor peaks, a Phase-Multiplexed Scheduler to interleave heterogeneous request phases, and Head-Centric Sparse Attention to decouple logical sparsity from physical storage. We evaluate dLLM-Serve on diverse workloads (LiveBench, Burst, OSC) and GPUs (RTX 4090, L40S). Relative to the state-of-the-art baseline, dLLM-Serve improves throughput by 1.61×–1.81× on the consumer-grade RTX 4090 and 1.60×–1.74× on the server-grade NVIDIA L40S, while reducing tail latency by nearly 4× under heavy contention. dLLM-Serve establishes the first blueprint for scalable dLLM inference, converting theoretical algorithmic sparsity into tangible wall-clock acceleration across heterogeneous hardware.
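
The abstract does not give implementation details, but the monolithic-logit problem it describes arises because parallel diffusion decoding materializes logits for many token positions at once, a transient [num_tokens, vocab_size] tensor. A minimal PyTorch sketch of the underlying chunking idea behind Logit-Aware Activation Budgeting follows; all names here (`chunked_decode_logits`, `chunk_tokens`) are illustrative assumptions, not dLLM-Serve's actual API.

```python
import torch

def chunked_decode_logits(hidden, lm_head_weight, chunk_tokens=1024):
    """Compute per-position argmax and confidence over the vocabulary
    without materializing the full [num_tokens, vocab_size] logit tensor.

    hidden:         [num_tokens, d_model] final hidden states
    lm_head_weight: [vocab_size, d_model] output projection
    """
    num_tokens = hidden.shape[0]
    best_ids = torch.empty(num_tokens, dtype=torch.long, device=hidden.device)
    best_conf = torch.empty(num_tokens, dtype=hidden.dtype, device=hidden.device)
    for start in range(0, num_tokens, chunk_tokens):
        end = min(start + chunk_tokens, num_tokens)
        # Peak activation here is only [chunk_tokens, vocab_size],
        # a fraction of the monolithic tensor's footprint.
        logits = hidden[start:end] @ lm_head_weight.T
        probs = logits.softmax(dim=-1)
        best_conf[start:end], best_ids[start:end] = probs.max(dim=-1)
    return best_ids, best_conf
```

The peak logit activation drops from O(num_tokens × vocab_size) to O(chunk_tokens × vocab_size), at the cost of a short loop over chunks.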
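The Phase-Multiplexed Scheduler targets the resource oscillation between compute-bound "Refresh" and bandwidth-bound "Reuse" phases. The paper's actual policy is surely more involved, but the interleaving idea can be sketched as follows; the queue structure and all names are hypothetical, assuming requests are tagged by their current phase.

```python
from collections import deque

def build_step_batch(refresh_q: deque, reuse_q: deque,
                     max_refresh: int, max_reuse: int) -> list:
    """Assemble a mixed batch for one serving step so that each step
    exercises both compute (Refresh) and memory bandwidth (Reuse),
    instead of oscillating between all-Refresh and all-Reuse steps."""
    batch = []
    for _ in range(min(max_refresh, len(refresh_q))):
        batch.append(refresh_q.popleft())  # compute-bound requests
    for _ in range(min(max_reuse, len(reuse_q))):
        batch.append(reuse_q.popleft())    # bandwidth-bound requests
    return batch
```

Co-scheduling the two phase types keeps both the GPU's arithmetic units and its memory bandwidth busy within a single step, which is what converts the algorithmic savings into the throughput and tail-latency gains the paper reports.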
