arxiv:2601.14724

HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

Published on Jan 21 · Submitted by Haowei Zhang on Jan 23
#2 Paper of the day · OpenMOSS-Team OpenMOSS
Abstract

HERMES is a training-free architecture that enables real-time video stream understanding by utilizing a hierarchical memory framework based on KV cache reuse, achieving faster response times and maintained accuracy even with reduced video token input.

AI-generated summary

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvements in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize the KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions and achieving 10× faster TTFT compared to the prior SOTA. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.

Community


🚀 Introducing HERMES: The Future of Real-Time Streaming Video Understanding!

While today's Multimodal Large Language Models (MLLMs) perform impressively at offline video comprehension, they face a painful trade-off when it comes to real-time streaming video: balancing real-time responses, low memory usage, and high accuracy. To solve this, we introduce the following innovations:

💡 The HERMES Breakthrough:
1️⃣ Novel memory architecture: By deeply analyzing attention mechanisms, we've introduced a "Hierarchical Memory" approach. The KV cache is reimagined as a multi-level memory framework:

Shallow layers act as Sensory Memory (events that just happened).

Deep layers focus on Long-term Memory (frame-level semantic anchors).

Middle layers bridge the gap with Working Memory.
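The three-tier idea above can be sketched in a few lines of Python. This is a hypothetical simplification of ours, not the paper's exact method; the function name, window sizes, and depth thresholds are all illustrative assumptions.

```python
# Hypothetical sketch of the hierarchical KV-cache idea (our names, not the paper's):
# shallow layers keep a dense window of recent frames (sensory memory),
# middle layers keep a moderately subsampled window (working memory),
# deep layers keep sparse frame-level anchors (long-term memory).

def retained_frames(frame_ids, layer, num_layers, window=8, anchor_stride=16):
    """Return which frame ids a given layer's KV cache would retain."""
    depth = layer / max(num_layers - 1, 1)   # 0.0 = shallowest, 1.0 = deepest
    if depth < 1 / 3:                        # sensory: dense, recent frames only
        return frame_ids[-window:]
    if depth < 2 / 3:                        # working: subsample, then keep a window
        return frame_ids[::4][-window:]
    return frame_ids[::anchor_stride]        # long-term: sparse semantic anchors

frames = list(range(64))                     # 64 frames seen so far
print(retained_frames(frames, layer=0, num_layers=24))   # [56, 57, ..., 63]
print(retained_frames(frames, layer=23, num_layers=24))  # [0, 16, 32, 48]
```

Each layer thus stores far fewer entries than a full cache, while different depths preserve different time horizons.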

2๏ธโƒฃ Plug-and-play architecture: HERMES achieves highly efficient KV Cache reuse and optimization strategies including cross-layer memory smoothing and position re-indexing , delivering instant responses without the need for additional training, or auxiliary computations when user queries arrive.

3๏ธโƒฃ Incredible efficiency and performance:

⚡ Blazing speed: HERMES is 10× faster than the previous SOTA in terms of response latency (TTFT)!

🚀 Compact efficiency: Even with up to 68% fewer video tokens, the model remains rock-solid, achieving up to 11.4% improvement on streaming comprehension tasks!

💾 Memory-friendly: No matter the video length, memory usage stays constant, leaving OOM errors in the past.

🔥 Join us in exploring this breakthrough: If you're passionate about streaming video understanding and efficient inference, we'd love to discuss and collaborate!

๐Ÿ”Explore the Details:

🔗 Paper: https://arxiv.org/abs/2601.14724
💻 Code: https://github.com/haowei-freesky/HERMES
🌐 Project: https://hermes-streaming.github.io/
