arxiv:2601.14724

HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

Published on Jan 21 · Submitted by Haowei Zhang on Jan 23
#2 Paper of the day · OpenMOSS-Team OpenMOSS
Abstract

HERMES is a training-free architecture that enables real-time video stream understanding by utilizing a hierarchical memory framework based on KV cache reuse, achieving faster response times and maintained accuracy even with reduced video token input.

AI-generated summary

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvements in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize the KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions and achieving 10× faster TTFT compared to the prior SOTA. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.

Community


🚀 Introducing HERMES: The Future of Real-Time Streaming Video Understanding!

While today's Multimodal Large Language Models (MLLMs) perform impressively at offline video comprehension, they face a painful trade-off when it comes to real-time streaming video: balancing real-time responses, low memory usage, and high accuracy. To solve this, we introduce the following innovations:

💡 The HERMES Breakthrough:
1️⃣ Novel memory architecture: By deeply analyzing attention mechanisms, we've introduced a "Hierarchical Memory" approach. The KV cache is reimagined as a multi-level memory framework:

Shallow layers act as Sensory Memory (events that just happened).

Deep layers focus on Long-term Memory (frame-level semantic anchors).

Middle layers bridge the gap with Working Memory.
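The three-tier idea above can be sketched in a few lines of Python. This is a hypothetical simplification of ours, not the paper's exact method; the function name, window sizes, and depth thresholds are all illustrative assumptions.

```python
# Hypothetical sketch of the hierarchical KV-cache idea (our names, not the paper's):
# shallow layers keep a dense window of recent frames (sensory memory),
# middle layers keep a moderately subsampled window (working memory),
# deep layers keep sparse frame-level anchors (long-term memory).

def retained_frames(frame_ids, layer, num_layers, window=8, anchor_stride=16):
    """Return which frame ids a given layer's KV cache would retain."""
    depth = layer / max(num_layers - 1, 1)   # 0.0 = shallowest, 1.0 = deepest
    if depth < 1 / 3:                        # sensory: dense, recent frames only
        return frame_ids[-window:]
    if depth < 2 / 3:                        # working: subsample, then keep a window
        return frame_ids[::4][-window:]
    return frame_ids[::anchor_stride]        # long-term: sparse semantic anchors

frames = list(range(64))                     # 64 frames seen so far
print(retained_frames(frames, layer=0, num_layers=24))   # [56, 57, ..., 63]
print(retained_frames(frames, layer=23, num_layers=24))  # [0, 16, 32, 48]
```

Each layer thus stores far fewer entries than a full cache, while different depths preserve different time horizons.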

2๏ธโƒฃ Plug-and-play architecture: HERMES achieves highly efficient KV Cache reuse and optimization strategies including cross-layer memory smoothing and position re-indexing , delivering instant responses without the need for additional training, or auxiliary computations when user queries arrive.

3๏ธโƒฃ Incredible efficiency and performance:

⚡ Blazing speed: HERMES is 10× faster than the previous SOTA in terms of response latency (TTFT)!

🚀 Compact efficiency: Even with up to 68% fewer video tokens, the model remains rock-solid, achieving up to 11.4% improvement on streaming comprehension tasks!

💾 Memory-friendly: No matter the video length, memory usage stays constant, leaving OOM errors in the past.

🔥 Join us in exploring this breakthrough: If you're passionate about streaming video understanding and efficient inference, we'd love to discuss and collaborate!

๐Ÿ”Explore the Details:

🔗 Paper: https://arxiv.org/abs/2601.14724
💻 Code: https://github.com/haowei-freesky/HERMES
🌐 Project: https://hermes-streaming.github.io/
