Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning
Abstract
Render-of-Thought framework converts textual reasoning steps into images using vision-language models to improve reasoning traceability and efficiency while maintaining competitive performance.
Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision on the intermediate reasoning process. These deficiencies obscure the analyzability of the latent reasoning chain. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures plug-and-play implementation without incurring additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT. Furthermore, it maintains competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Forest Before Trees: Latent Superposition for Efficient Visual Reasoning (2026)
- Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring (2026)
- Interleaved Latent Visual Reasoning with Selective Perceptual Modeling (2025)
- LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning (2026)
- Global Context Compression with Interleaved Vision-Text Transformation (2026)
- Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs (2025)
- FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper