Beyond Language Modeling: An Exploration of Multimodal Pretraining • Paper • 2603.03276 • Published 9 days ago • 88
Solaris: Building a Multiplayer Video World Model in Minecraft • Paper • 2602.22208 • Published 15 days ago • 28
Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders • Paper • 2601.16208 • Published Jan 22 • 54
Transition Matching Distillation for Fast Video Generation • Paper • 2601.09881 • Published Jan 14 • 33
Aggregated Residual Transformations for Deep Neural Networks • Paper • 1611.05431 • Published Nov 16, 2016 • 2
Sample-Efficient Neural Architecture Search by Learning Action Space • Paper • 1906.06832 • Published Jun 17, 2019
Momentum Contrast for Unsupervised Visual Representation Learning • Paper • 1911.05722 • Published Nov 13, 2019 • 2
Image Sculpting: Precise Object Editing with 3D Geometry Control • Paper • 2401.01702 • Published Jan 2, 2024 • 20
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs • Paper • 2401.06209 • Published Jan 11, 2024
Masked Feature Prediction for Self-Supervised Visual Pre-Training • Paper • 2112.09133 • Published Dec 16, 2021
SLIP: Self-supervision meets Language-Image Pre-training • Paper • 2112.12750 • Published Dec 23, 2021 • 1