VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning
Abstract
VTC-R1 enables efficient long-context reasoning by compressing textual traces into compact images and iteratively feeding them back into vision-language models as optical memory, achieving significant speedup without sacrificing performance.
Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks, yet it introduces severe efficiency bottlenecks because computational cost grows rapidly with context length. Existing efficient approaches often rely on complex additional training or external models for compression, which limits scalability and discards critical fine-grained information. In this paper, we propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision-language models (VLMs) as "optical memory." We construct a training dataset based on OpenR1-Math-220K that achieves 3.4x token compression, and we fine-tune representative VLMs, Glyph and Qwen3-VL, on it. Extensive experiments on benchmarks such as MATH500, AIME25, AMC23 and GPQA-D demonstrate that VTC-R1 consistently outperforms standard long-context reasoning. Furthermore, our approach significantly improves inference efficiency, achieving a 2.7x speedup in end-to-end latency, highlighting its potential as a scalable solution for reasoning-intensive applications. Our code is available at https://github.com/w-yibo/VTC-R1.
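The iterative "optical memory" loop described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `RenderedPage`, `render_segment`, `iterative_reasoning`, and `toy_model` are hypothetical names, the "image" is a stand-in that only stores the wrapped lines a real renderer would draw, and the model is a stub that stops after two rounds.

```python
import textwrap
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class RenderedPage:
    """Stand-in for an image (assumption): keeps the lines a renderer would draw."""
    lines: List[str]


def render_segment(text: str, chars_per_line: int = 80,
                   lines_per_page: int = 40) -> List[RenderedPage]:
    """Wrap a reasoning segment into fixed-size pages (hypothetical layout)."""
    wrapped = textwrap.wrap(text, width=chars_per_line) or [""]
    return [RenderedPage(wrapped[i:i + lines_per_page])
            for i in range(0, len(wrapped), lines_per_page)]


# A model call takes the question plus the pages so far and returns
# (text, done). The signature is an assumption made for this sketch.
ModelFn = Callable[[str, List[RenderedPage]], Tuple[str, bool]]


def iterative_reasoning(question: str, generate_segment: ModelFn,
                        max_rounds: int = 8
                        ) -> Tuple[Optional[str], List[RenderedPage]]:
    """Generate segments, replacing earlier text with rendered pages each round."""
    memory: List[RenderedPage] = []
    for _ in range(max_rounds):
        segment, done = generate_segment(question, memory)
        if done:
            return segment, memory
        # Earlier reasoning is carried forward as pages, not as raw text.
        memory.extend(render_segment(segment))
    return None, memory


def toy_model(question: str, memory: List[RenderedPage]) -> Tuple[str, bool]:
    """Stub VLM (assumption): emits two reasoning segments, then an answer."""
    if len(memory) >= 2:
        return "final answer", True
    return f"step {len(memory) + 1}: " + "x" * 100, False
```

Calling `iterative_reasoning("question", toy_model)` runs two reasoning rounds before the stub returns its answer. In a real pipeline, `render_segment` would produce actual images and `generate_segment` would call a fine-tuned VLM, so the context passed at each round is a set of images rather than the full textual trace.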
Community
We propose VTC-R1, an efficient long-context reasoning paradigm that integrates vision-text compression into iterative reasoning. By rendering previous reasoning segments into compact visual representations, VTC-R1 replaces long textual contexts with significantly fewer vision tokens in a lightweight and model-free manner. Extensive experiments show that VTC-R1 consistently improves reasoning accuracy across multiple benchmarks while achieving up to 3.4x token compression and a 2.7x end-to-end inference speedup. The results demonstrate that VTC-R1 provides an effective alternative representation for scalable long-context reasoning. We hope our work will inspire further exploration of efficient reasoning beyond purely text-based paradigms.