Abstract
AgentOCR reduces token consumption in agentic systems by representing interaction history as visual tokens and employing visual caching and self-compression techniques.
Recent advances in large language models (LLMs) enable agentic systems trained with reinforcement learning (RL) over multi-turn interaction trajectories, but practical deployment is bottlenecked by rapidly growing textual histories that inflate token budgets and memory usage. We introduce AgentOCR, a framework that exploits the superior information density of visual tokens by representing the accumulated observation-action history as a compact rendered image. To make multi-turn rollouts scalable, AgentOCR proposes segment optical caching: by decomposing the history into hashable segments and maintaining a visual cache, this mechanism eliminates redundant re-rendering. Beyond fixed rendering, AgentOCR introduces agentic self-compression, where the agent actively emits a compression rate and is trained with a compression-aware reward to adaptively balance task success and token efficiency. We conduct extensive experiments on challenging agentic benchmarks, ALFWorld and search-based QA. Results demonstrate that AgentOCR preserves over 95% of text-based agent performance while substantially reducing token consumption (>50%), yielding consistent token and memory efficiency. Further analysis validates a 20x rendering speedup from segment optical caching and the effective strategic balancing of self-compression.
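The caching idea described above can be illustrated with a minimal sketch: split the history into segments, hash each one, and render a segment to an image only on a cache miss. The `SegmentOpticalCache` class and its `render_fn` parameter are hypothetical names for illustration, not the paper's actual API; the real renderer would produce an image rather than a string.

```python
import hashlib


class SegmentOpticalCache:
    """Minimal sketch of segment optical caching: the interaction history
    is decomposed into segments, each hashed and rendered at most once.
    `render_fn` stands in for the actual text-to-image renderer."""

    def __init__(self, render_fn):
        self.render_fn = render_fn
        self.cache = {}  # segment hash -> rendered visual
        self.hits = 0
        self.misses = 0

    def render_history(self, segments):
        images = []
        for seg in segments:
            key = hashlib.sha256(seg.encode()).hexdigest()
            if key in self.cache:
                self.hits += 1      # reuse cached rendering
            else:
                self.misses += 1    # render only the new segment
                self.cache[key] = self.render_fn(seg)
            images.append(self.cache[key])
        return images


# As the trajectory grows turn by turn, only the newest segment is rendered;
# earlier segments are served from the cache.
cache = SegmentOpticalCache(render_fn=lambda s: f"<img:{s}>")
cache.render_history(["obs1 act1"])
cache.render_history(["obs1 act1", "obs2 act2"])
print(cache.misses, cache.hits)  # → 2 1
```

This is where the reported rendering speedup would come from: re-rendering cost stays proportional to the new segment rather than the full trajectory.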
Community
We’re introducing AgentOCR, a new way to scale LLM agents by reimagining long interaction histories as compact rendered images, leveraging the higher information density of visual tokens to curb exploding context costs. To make long-horizon rollouts practical, we add segment optical caching, splitting history into hashable segments and caching the visuals so agents avoid redundant re-rendering as trajectories grow. We go beyond fixed compression with agentic self-compression: the agent actively emits a compression rate and is trained with a compression-aware reward to balance task success against token efficiency. Across ALFWorld and search-based QA, AgentOCR keeps >95% of text-agent performance while cutting token use by >50% on average and ~80% at peak, and our analysis shows up to a 20× rendering speedup thanks to segment optical caching.
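The trade-off that self-compression is trained to balance can be sketched as a reward with a token-cost penalty. The exact formulation is not given in the abstract; the function below is a hypothetical shape, with `lam` an assumed penalty weight, meant only to show how higher compression (fewer tokens) raises the reward when task success is held fixed.

```python
def compression_aware_reward(task_success: float,
                             tokens_used: int,
                             token_budget: int,
                             lam: float = 0.1) -> float:
    """Hypothetical compression-aware reward: task reward minus a penalty
    proportional to the fraction of the token budget consumed. The agent's
    emitted compression rate determines tokens_used."""
    token_penalty = lam * min(tokens_used / token_budget, 1.0)
    return task_success - token_penalty


# Solving the task with a higher compression rate (fewer tokens) yields
# a strictly higher reward, so RL can learn to compress adaptively.
r_compressed = compression_aware_reward(1.0, 2000, 10000)
r_verbose = compression_aware_reward(1.0, 9000, 10000)
print(r_compressed, r_verbose)  # → 0.98 0.91
```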