Abstract
Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods rely on keyframe sampling, which can miss both macro-level events and micro-level details due to sparse temporal coverage. Furthermore, processing a full image and its tokens for every frame incurs substantial computational overhead. To address these limitations, we propose leveraging video codec primitives (specifically, motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces time-to-first-token (TTFT) by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec-primitive densities, we maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.
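The token savings come from fully encoding only sparse keyframes and summarizing the remaining frames with a handful of codec-primitive tokens. A back-of-envelope sketch of this accounting (all counts below — frame total, tokens per frame, keyframe stride — are illustrative assumptions, not the paper's actual configuration):

```python
# Back-of-envelope token accounting for codec-primitive VideoLMs.
# All numbers here are illustrative assumptions, not figures from the paper.
import math

def token_usage(num_frames, keyframe_stride, tokens_per_keyframe, tokens_per_codec_frame):
    """Total visual tokens when every `keyframe_stride`-th frame is fully
    encoded and the remaining frames are summarized via codec primitives."""
    keyframes = math.ceil(num_frames / keyframe_stride)
    codec_frames = num_frames - keyframes
    return keyframes * tokens_per_keyframe + codec_frames * tokens_per_codec_frame

# Assumed setup: a 256-frame clip, 196 tokens per fully encoded frame,
# 8 tokens per codec-primitive frame, one keyframe every 16 frames.
baseline_tokens = token_usage(256, 1, 196, 8)   # every frame fully encoded
codec_tokens = token_usage(256, 16, 196, 8)     # sparse keyframes + primitives
reduction = 1 - codec_tokens / baseline_tokens

print(f"baseline={baseline_tokens}, codec={codec_tokens}, saved={reduction:.0%}")
```

Under these assumed densities the sketch already yields roughly a 90% token reduction, in the same regime as the savings reported above; the achievable reduction depends on the keyframe stride and on how many tokens each codec-primitive frame is allotted.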
Community
TL;DR: Replace dense per-frame image embeddings with video codec primitives, reducing TTFT by up to 86% and token usage by up to 93% while maintaining video understanding performance.
🚀 Project Page: https://sayands.github.io/cope/
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- VTok: A Unified Video Tokenizer with Decoupled Spatial-Temporal Latents (2026)
- VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs (2025)
- Causality-Aware Temporal Projection for Video Understanding in Video-LLMs (2026)
- KTV: Keyframes and Key Tokens Selection for Efficient Training-Free Video LLMs (2026)
- TiFRe: Text-guided Video Frame Reduction for Efficient Video Multi-modal Large Language Models (2026)
- OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models (2026)
- FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging (2026)