A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models Paper • 2508.01548 • Published Aug 3 • 13
Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness Paper • 2501.07978 • Published Jan 14 • 1
Gaussian Splatting with Discretized SDF for Relightable Assets Paper • 2507.15629 • Published Jul 21 • 23
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning Paper • 2507.01006 • Published Jul 1 • 246
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding Paper • 2501.05067 • Published Jan 9 • 1
HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding Paper • 2501.15111 • Published Jan 25 • 1
HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context Paper • 2506.21277 • Published Jun 26 • 14
Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs Paper • 2506.21656 • Published Jun 26 • 15
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning Paper • 2504.07960 • Published Apr 10 • 50
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs Paper • 2506.21862 • Published Jun 27 • 36
CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation Paper • 2306.04300 • Published Jun 7, 2023 • 2
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcing Learning Paper • 2503.05379 • Published Mar 7 • 38