arxiv:2512.15649

VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?

Published on Dec 17 · Submitted by Hongbo Zhao on Dec 18
Abstract

A benchmark evaluates the performance of vision-language models on understanding long-context information compressed into dense visual representations, revealing significant limitations in capturing long-term dependencies.

AI-generated summary

The computational and memory overheads of expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks such as DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model's ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering over long-term dialogue memory. Furthermore, we establish VTCBench-Wild to simulate diverse input scenarios. We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite decoding textual information (e.g., OCR) well, most VLMs exhibit surprisingly poor long-context understanding with VTC-compressed information, failing to capture long-range associations or dependencies in the context. This study provides a deeper understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.
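To make the 3x-20x compression-ratio claim concrete, here is a minimal back-of-envelope sketch (not the paper's or DeepSeek-OCR's actual pipeline) of how rendering text as an image trades text tokens for vision tokens. All numbers are assumptions for illustration: ~4 characters per BPE text token, a fixed glyph size when rendering a page, and an "effective patch size" that folds in the visual encoder's downsampling.

```python
# Illustrative estimate of the vision-text compression (VTC) ratio:
# render N characters onto fixed-size pages, then count how many
# ViT-style patch tokens those pages cost versus plain text tokens.
# All parameters below are assumptions, not values from the paper.

def vtc_compression_ratio(
    n_chars: int,
    chars_per_text_token: float = 4.0,    # rough BPE average (assumption)
    page_px: tuple[int, int] = (1024, 1024),
    char_px: tuple[int, int] = (8, 16),   # rendered glyph W x H (assumption)
    effective_patch_px: int = 64,         # patch size after encoder downsampling (assumption)
) -> float:
    # How many characters fit on one rendered page.
    chars_per_page = (page_px[0] // char_px[0]) * (page_px[1] // char_px[1])
    pages = -(-n_chars // chars_per_page)  # ceiling division
    # Vision tokens: one token per effective patch, per page.
    tokens_per_page = (page_px[0] // effective_patch_px) * (page_px[1] // effective_patch_px)
    vision_tokens = pages * tokens_per_page
    text_tokens = n_chars / chars_per_text_token
    return text_tokens / vision_tokens

if __name__ == "__main__":
    # A 200k-character document under these assumptions.
    print(f"{vtc_compression_ratio(200_000):.1f}x")
```

With these particular assumptions the ratio lands inside the paper's reported 3x-20x band; the point of the sketch is simply that the ratio is driven by glyph density versus the encoder's effective patch size, which is exactly the information-density knob the benchmark stresses.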

Community

Paper author and submitter:

A comprehensive benchmark for studying VLMs' vision-text compression ability.

Code: https://github.com/Moenupa/VTCBench
Hugging Face: https://huggingface.co/datasets/MLLM-CL/VTCBench

Paper author and submitter:

Also supported in VLMEvalKit.

arXiv lens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/vtcbench-can-vision-language-models-understand-long-context-with-vision-text-compression-2446-08a13274

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications

