arxiv:2601.22069

VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning

Published on Jan 29

· Submitted by

Yibo Wang on Jan 30

Nanyang Technological University

Upvote

Authors:

Abstract

VTC-R1 enables efficient long-context reasoning by compressing textual traces into compact images and iteratively feeding them back into vision-language models as optical memory, achieving significant speedup without sacrificing performance.

AI-generated summary

Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks, yet it introduces severe efficiency bottlenecks due to the computational complexity. Existing efficient approaches often rely on complex additional training or external models for compression, which limits scalability and discards critical fine-grained information. In this paper, we propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision-language models as "optical memory." We construct a training dataset based on OpenR1-Math-220K achieving 3.4x token compression and fine-tune representative VLMs-Glyph and Qwen3-VL. Extensive experiments on benchmarks such as MATH500, AIME25, AMC23 and GPQA-D demonstrate that VTC-R1 consistently outperforms standard long-context reasoning. Furthermore, our approach significantly improves inference efficiency, achieving 2.7x speedup in end-to-end latency, highlighting its potential as a scalable solution for reasoning-intensive applications. Our code is available at https://github.com/w-yibo/VTC-R1.

View arXiv page View PDF GitHub 10 Add to collection

Community

yiboowang

Paper submitter about 16 hours ago

We propose VTC-R1, an efficient long-context reasoning paradigm that integrates vision-text compression into iterative reasoning. By rendering previous reasoning segments into compact visual representations, VTC-R1 replaces long textual contexts with significantly fewer vision tokens in a lightweight and model-free manner. Extensive experiments show that VTC-R1 consistently improves reasoning accuracy across multiple benchmarks while achieving up to 3.4x token compression and 2.7x end-to-end inference speedup. The results demonstrate that VTC-R1 provides an effective alternative representation for scalable long-context reasoning. We hope our work would inspire further exploration of efficient reasoning beyond pure text-based paradigms.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2601.22069 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2601.22069 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.22069 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.