arxiv:2512.01302

DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy

Published on Dec 1, 2025

Abstract

AI-generated summary

DCText improves text-to-image generation by dividing a prompt into text segments, assigning each segment to an image region, and applying scheduled attention masks together with localized noise initialization to improve text accuracy and image coherence.

Despite recent text-to-image models achieving high-fidelity text rendering, they still struggle with long or multiple texts due to diluted global attention. We propose DCText, a training-free visual text generation method that adopts a divide-and-conquer strategy, leveraging the reliable short-text generation of Multi-Modal Diffusion Transformers. Our method first decomposes a prompt by extracting and dividing the target text, then assigns each segment to a designated region. To render each segment accurately within its region while preserving overall image coherence, we introduce two attention masks, Text-Focus and Context-Expansion, applied sequentially during denoising. Additionally, Localized Noise Initialization further improves text accuracy and region alignment without increasing computational cost. Extensive experiments on single- and multi-sentence benchmarks show that DCText achieves the best text accuracy without compromising image quality while also delivering the lowest generation latency.
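The abstract outlines a concrete pipeline: split the target text into segments, assign each segment an image region, denoise early steps under a Text-Focus mask and later steps under a Context-Expansion mask, and re-draw the initial noise inside each region. Below is a minimal PyTorch sketch of that schedule, based only on the abstract; the mask construction, the half-way switch point, the region layout, and the per-region re-seeding are illustrative assumptions, not the authors' implementation.

import torch

def region_token_ids(region, grid=32):
    # Flattened image-token indices covered by a (top, left, height, width)
    # box on a grid x grid latent patch grid.
    top, left, h, w = region
    rows = torch.arange(top, top + h)
    cols = torch.arange(left, left + w)
    return (rows[:, None] * grid + cols[None, :]).flatten()

def text_focus_mask(num_img, num_txt, assignments, grid=32):
    # Boolean joint-attention mask (True = attention allowed). Image tokens
    # inside a region may attend only to their assigned text segment's tokens,
    # and vice versa; image-image and text-text attention stay enabled.
    n = num_img + num_txt
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:num_img, :num_img] = True
    mask[num_img:, num_img:] = True
    for region, (t0, t1) in assignments:  # (box, text-token span) pairs
        img_ids = region_token_ids(region, grid)
        txt_ids = torch.arange(num_img + t0, num_img + t1)
        mask[img_ids[:, None], txt_ids[None, :]] = True  # region -> its segment
        mask[txt_ids[:, None], img_ids[None, :]] = True  # segment -> its region
    return mask

def context_expansion_mask(num_img, num_txt):
    # Later-stage mask: full attention restored so global context can blend.
    return torch.ones(num_img + num_txt, num_img + num_txt, dtype=torch.bool)

def localized_noise_init(channels, grid, assignments, seed=0):
    # Assumed reading of Localized Noise Initialization: re-seed the initial
    # noise inside each text region so every segment starts from its own
    # noise draw, at no extra forward-pass cost.
    g = torch.Generator().manual_seed(seed)
    noise = torch.randn((channels, grid, grid), generator=g)
    for i, (region, _) in enumerate(assignments):
        top, left, h, w = region
        g_i = torch.Generator().manual_seed(seed + 1 + i)
        noise[:, top:top + h, left:left + w] = torch.randn((channels, h, w), generator=g_i)
    return noise

# Toy setup: two text segments on a 32x32 latent grid with 77 text tokens;
# Text-Focus for the first half of 28 denoising steps, Context-Expansion
# after (the actual switch point is not given in the abstract).
num_img, num_txt, steps = 32 * 32, 77, 28
assignments = [((4, 4, 6, 24), (0, 20)),    # top banner   <- text tokens 0..19
               ((22, 4, 6, 24), (20, 45))]  # bottom band  <- text tokens 20..44
noise = localized_noise_init(channels=16, grid=32, assignments=assignments)
focus = text_focus_mask(num_img, num_txt, assignments)
expand = context_expansion_mask(num_img, num_txt)
for step in range(steps):
    mask = focus if step < steps // 2 else expand
    # out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)  # per MM-DiT block

In a real Multi-Modal Diffusion Transformer the selected mask would be passed to each joint-attention call at that step; the call is left as a comment because the surrounding model is not shown.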

