arxiv:2512.12623

Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

Published on Dec 14 · Submitted by Chengzhi Liu on Dec 19

Abstract

A dynamic multimodal latent reasoning framework improves cross-modal reasoning and perception performance by interleaving reasoning and perception using confidence-guided optimization and dynamic visual injection.

AI-generated summary

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding and reasoning by incorporating Chain-of-Thought (CoT) reasoning in the semantic space. Building on this, recent studies extend the CoT mechanism to the visual modality, enabling models to integrate visual information during reasoning through external tools or explicit image generation. However, these methods remain dependent on explicit step-by-step reasoning and suffer from unstable perception-reasoning interaction and notable computational overhead. Inspired by human cognition, we posit that thinking unfolds not linearly but through the dynamic interleaving of reasoning and perception within the mind. Motivated by this perspective, we propose DMLR, a test-time Dynamic Multimodal Latent Reasoning framework that employs confidence-guided latent policy gradient optimization to refine latent think tokens for in-depth reasoning. Furthermore, we introduce a Dynamic Visual Injection Strategy, which retrieves the most relevant visual features at each latent think token and updates the set of best visual patches. The updated patches are then injected into the latent think tokens to achieve dynamic visual-textual interleaving. Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance while maintaining high inference efficiency.
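
The summary describes two mechanisms: confidence-guided refinement of latent think tokens at test time, and dynamic visual injection that retrieves the most relevant visual patches per think token. The PyTorch sketch below illustrates how such steps might look; the function names, tensor shapes, top-k retrieval, residual injection, and gradient-ascent confidence objective are illustrative assumptions based only on the abstract, not the paper's actual implementation.

import torch
import torch.nn.functional as F

def dynamic_visual_injection(think_token, visual_patches, top_k=4):
    """Retrieve the visual patches most similar to a latent think token
    and inject them back into the token (hypothetical formulation)."""
    # Cosine similarity between the think token and every visual patch.
    sims = F.cosine_similarity(think_token.unsqueeze(0), visual_patches, dim=-1)
    top_sims, top_idx = sims.topk(top_k)
    # Similarity-weighted sum of the best-matching patches.
    weights = top_sims.softmax(dim=0)
    injected = (weights.unsqueeze(-1) * visual_patches[top_idx]).sum(dim=0)
    # Residual injection preserves the token's own content (assumed design).
    return think_token + injected, top_idx

def confidence_guided_refinement(think_tokens, confidence_fn, steps=3, lr=1e-2):
    """Test-time refinement of latent think tokens by gradient ascent on a
    model-confidence score -- a simplified stand-in for the paper's
    confidence-guided latent policy gradient optimization."""
    tokens = think_tokens.clone().requires_grad_(True)
    optimizer = torch.optim.SGD([tokens], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = -confidence_fn(tokens)  # minimize negative confidence
        loss.backward()
        optimizer.step()
    return tokens.detach()

# Toy usage with random features standing in for MLLM activations.
d = 64
think = torch.randn(d)
patches = torch.randn(196, d)          # e.g. 14x14 grid of ViT patch features
think, chosen = dynamic_visual_injection(think, patches)
conf = lambda t: t.tanh().mean()       # placeholder confidence score
refined = confidence_guided_refinement(think.unsqueeze(0), conf)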
