When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
Abstract
NUMINA enhances text-to-video diffusion models' numerical accuracy through a training-free framework that identifies layout inconsistencies and guides regeneration via attention modulation.
Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle to generate the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On our newly introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on the 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while temporal consistency is maintained. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.
Community
NUMINA is a training-free framework that tackles numerical misalignment in text-to-video diffusion models — the persistent failure of T2V models to generate the count of objects specified in a prompt (e.g., producing 2 or 4 cats when "three cats" is requested). Unlike seed search or prompt enhancement approaches, which treat the generation pipeline as a black box and rely on brute-force resampling or LLM-based prompt rewriting, NUMINA directly identifies where and why counting errors occur inside the model by analyzing cross-attention and self-attention maps at selected DiT layers. It constructs a countable spatial layout via a two-stage clustering pipeline, then performs layout-guided attention modulation during regeneration to enforce the correct object count, all without retraining or fine-tuning. This attention-level intervention provides principled, interpretable control over numerical semantics that seed search and prompt enhancement fundamentally cannot achieve, improving counting accuracy on our introduced CountBench by up to 7.4% on Wan2.1-1.3B. Furthermore, because NUMINA operates partly orthogonally to inference acceleration techniques, it is compatible with training-free caching methods such as EasyCache, which accelerates diffusion inference via runtime-adaptive transformer output reuse.
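To make the layout-guided attention modulation idea concrete, here is a minimal sketch of one plausible modulation rule: amplify the object noun token's cross-attention inside the refined countable layout and damp it outside, then renormalize. The function name, the boost/suppress factors, and the exact rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def modulate_cross_attention(attn, layout_mask, token_idx,
                             boost=2.0, suppress=0.5):
    """Illustrative layout-guided modulation (hypothetical rule).

    attn:        (H*W, T) cross-attention from spatial positions to text tokens,
                 each row a distribution over the T prompt tokens
    layout_mask: (H*W,) binary mask marking the refined countable layout
    token_idx:   index of the object noun token in the prompt
    """
    out = attn.copy()
    fg = layout_mask.astype(bool)
    # amplify the noun token's attention inside the target layout...
    out[fg, token_idx] *= boost
    # ...and damp it outside, concentrating mass on the layout regions
    out[~fg, token_idx] *= suppress
    # renormalize each spatial position's distribution over tokens
    out /= out.sum(axis=1, keepdims=True)
    return out

# toy usage: 16 spatial positions, 4 prompt tokens, first 8 positions in layout
rng = np.random.default_rng(0)
attn = rng.random((16, 4))
attn /= attn.sum(axis=1, keepdims=True)
mask = np.zeros(16)
mask[:8] = 1
mod = modulate_cross_attention(attn, mask, token_idx=2)
```

Because each row is renormalized, the intervention only redistributes attention mass rather than injecting it, which keeps the edit conservative in the spirit the abstract describes.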
the most interesting nugget for me is how NUMINA picks a per-token instance-separable self-attention head and a text-aligned cross-attention head to build a countable latent layout. that per-token head choice, then fusing maps into a foreground layout and refining by adding or removing instances with a decaying reg score, feels like a smart way to inject a hard count constraint without retraining. i’d be curious how robust that per-token head selection is when prompts use ambiguous nouns or heavy occlusion, or when objects are densely packed and attention becomes diffuse. btw the arxivlens breakdown helped me parse the method details, and it’s nice to see a concise walkthrough that lines up with what they describe here (https://arxivlens.com/PaperView/Details/when-numbers-speak-aligning-textual-numerals-and-visual-instances-in-text-to-video-diffusion-models-6669-5788ebc6).
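For readers wondering what "picking a text-aligned cross-attention head" could look like in code, here is a toy sketch. It scores each head by how sharply its noun-token attention map concentrates (low entropy) and keeps the sharpest one. The scoring criterion and function name are my own assumptions for illustration; the paper's actual selection rule may differ.

```python
import numpy as np

def select_cross_attn_head(cross_maps, token_idx):
    """Hypothetical head-selection criterion: prefer the head whose
    attention map for the object noun token is most concentrated.

    cross_maps: list of (H*W, T) cross-attention maps, one per head
    token_idx:  index of the object noun token in the prompt
    """
    scores = []
    for head_map in cross_maps:
        # normalize the noun token's spatial map into a distribution
        p = head_map[:, token_idx]
        p = p / p.sum()
        # low entropy => sharp, instance-like map => higher score
        entropy = -(p * np.log(p + 1e-12)).sum()
        scores.append(-entropy)
    return int(np.argmax(scores))

# toy usage: a diffuse head vs. a head focused on one spatial position
diffuse = np.ones((16, 3)) / 3
focused = np.full((16, 3), 1e-3)
focused[0, 1] = 1.0
best = select_cross_attn_head([diffuse, focused], token_idx=1)
```

A sharpness criterion like this would indeed degrade under the failure modes raised above (diffuse attention from dense packing or heavy occlusion), since all heads' maps flatten and the argmax becomes noisy.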