Video Understanding
• Vript: A Video Is Worth Thousands of Words (arXiv:2406.06040)
• ShareGPT4Video: Improving Video Understanding and Generation with Better Captions (arXiv:2406.04325)
• MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark (arXiv:2406.01574)
• Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (arXiv:2405.21075)
• MotionLLM: Understanding Human Behaviors from Human Motions and Videos (arXiv:2405.20340)
• An Introduction to Vision-Language Modeling (arXiv:2405.17247)
• PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning (arXiv:2404.16994)
• Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models (arXiv:2404.13013)
• AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability (arXiv:2405.14129)
• Video ReCap: Recursive Captioning of Hour-Long Videos (arXiv:2402.13250)
• A Simple LLM Framework for Long-Range Video Question-Answering (arXiv:2312.17235)
• Retrieval-Augmented Egocentric Video Captioning (arXiv:2401.00789)
• Distilling Vision-Language Models on Millions of Videos (arXiv:2401.06129)
• World Model on Million-Length Video and Language with RingAttention (arXiv:2402.08268)
• VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models (arXiv:2406.16338)
• Long Context Transfer from Language to Vision (arXiv:2406.16852)
• Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (arXiv:2406.16860)
• SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models (arXiv:2407.15841)
• LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding (arXiv:2407.15754)
• PiTe: Pixel-Temporal Alignment for Large Video-Language Model (arXiv:2409.07239)
• Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models (arXiv:2410.02740)
• InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions (arXiv:2412.09596)