Forest Before Trees: Latent Superposition for Efficient Visual Reasoning
Abstract
Laser introduces a new visual reasoning paradigm that uses Dynamic Windowed Alignment Learning to preserve global features during latent deduction, achieving superior performance with reduced computational overhead.
While Chain-of-Thought empowers Large Vision-Language Models with multi-step reasoning, explicit textual rationales suffer from an information bandwidth bottleneck, where continuous visual details are discarded during discrete tokenization. Recent latent reasoning methods attempt to address this challenge, but often fall prey to premature semantic collapse due to rigid autoregressive objectives. In this paper, we propose Laser, a novel paradigm that reformulates visual deduction via Dynamic Windowed Alignment Learning (DWAL). Instead of forcing a point-wise prediction, Laser aligns the latent state with a dynamic validity window of future semantics. This mechanism enforces a "Forest-before-Trees" cognitive hierarchy, enabling the model to maintain a probabilistic superposition of global features before narrowing down to local details. Crucially, Laser maintains interpretability via decodable trajectories while stabilizing unconstrained learning via Self-Refined Superposition. Extensive experiments on 6 benchmarks demonstrate that Laser achieves state-of-the-art performance among latent reasoning methods, surpassing the strong baseline Monet by 5.03% on average. Notably, it achieves these gains with extreme efficiency, reducing inference tokens by more than 97%, while demonstrating robust generalization to out-of-distribution domains.
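The abstract contrasts point-wise next-step prediction with aligning the latent state against a dynamic validity window of future semantics. The paper's exact objective is not given here, so the following is only an illustrative sketch of windowed alignment under assumed notation: a soft-min over distances to the next few target embeddings lets the latent state stay in superposition over several plausible futures rather than collapsing onto a single one.

```python
import numpy as np

def dwal_loss(latent, future_embs, window=3, tau=0.5):
    """Illustrative windowed alignment (assumption, not the paper's exact loss).

    Instead of matching the latent state to a single next-step target
    (point-wise prediction), align it softly with any of the next
    `window` targets, so the model can keep a superposition over
    plausible future semantics before committing to one.

    latent:      (d,) current latent reasoning state
    future_embs: (T, d) embeddings of future reasoning steps
    """
    w = future_embs[:window]                    # dynamic validity window
    dists = np.sum((w - latent) ** 2, axis=1)   # distance to each target
    weights = np.exp(-dists / tau)
    weights /= weights.sum()                    # soft assignment over the window
    return float(np.sum(weights * dists))       # soft-min alignment loss
```

With `window=1` this reduces to ordinary point-wise matching; widening the window can only help, since the soft assignment concentrates on whichever in-window target the latent state is already closest to.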
Community
We hope this work encourages a paradigm shift from explicit next-token prediction to latent visual reasoning.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs (2025)
- Interleaved Latent Visual Reasoning with Selective Perceptual Modeling (2025)
- Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space (2025)
- From Illusion to Intention: Visual Rationale Learning for Vision-Language Reasoning (2025)
- Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens (2025)
- Learning When to Look: A Disentangled Curriculum for Strategic Perception in Multimodal Reasoning (2025)
- Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts (2026)